Python Web Scraping


The BeautifulSoup Library

BeautifulSoup is a library for parsing, traversing, and maintaining a "tag tree".
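It is not part of the standard library; assuming a normal pip setup, it can be installed with (the PyPI package is named beautifulsoup4):

pip install beautifulsoup4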

Preparing the ingredients: the HTML

>>> import requests
>>> r=requests.get("http://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")    # "html.parser" is one of the available parsers
>>> print(soup.prettify())    # print the HTML in a readable, indented form
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

print(soup.prettify()) is like laying the ingredients out neatly on the cutting board, so you can see exactly what you have to work with.

The BeautifulSoup Class

How the BeautifulSoup class works

In other words, a BeautifulSoup object corresponds to the entire content of an HTML/XML document.

$$
\text{HTML document} = \text{tag tree} = \text{BeautifulSoup object}
$$

The three are equivalent.

from bs4 import BeautifulSoup
soup=BeautifulSoup("<html>data</html>","html.parser")        # parse an HTML string
soup2=BeautifulSoup(open("D:/demo.html"),"html.parser")      # parse a local HTML file

BeautifulSoup's parser argument

Deciding how to cook it

Four parsers are available in total:

| Parser | Usage | Requirement |
| --- | --- | --- |
| bs4's HTML parser | BeautifulSoup(mk,"html.parser") | install the bs4 library |
| lxml's HTML parser | BeautifulSoup(mk,"lxml") | pip install lxml |
| lxml's XML parser | BeautifulSoup(mk,"xml") | pip install lxml |
| html5lib's HTML parser | BeautifulSoup(mk,"html5lib") | pip install html5lib |
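As a minimal sketch of switching between these parsers (assuming lxml and html5lib have already been installed with pip; only "html.parser" works out of the box):

from bs4 import BeautifulSoup

markup = "<html><body><p class='title'>demo</p></body></html>"

# "html.parser" uses Python's built-in parser; no extra install needed
soup_builtin = BeautifulSoup(markup, "html.parser")

# these two require `pip install lxml` / `pip install html5lib`
soup_lxml = BeautifulSoup(markup, "lxml")
soup_html5 = BeautifulSoup(markup, "html5lib")

print(soup_builtin.p.string)    # demo

For a small, well-formed page like demo.html all four parsers build essentially the same tag tree; they mainly differ in speed and in how forgiving they are of broken HTML.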

Basic elements of the BeautifulSoup library

| Basic element | Description | Access |
| --- | --- | --- |
| Tag | a tag, the basic information unit of the tag tree | soup.<tag> (e.g. soup.a) |
| Name | the tag's name | <tag>.name |
| Attributes | the tag's attributes | <tag>.attrs |
| NavigableString | the non-attribute string inside a tag | <tag>.string |
| Comment | a comment string inside a tag | <tag>.string |
>>> import requests
>>> r=requests.get("http://python123.io/ws/demo.html")
>>> r.text
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.title    # the <title> tag
<title>This is a python demo page</title>
>>> tag=soup.a    # the first <a> tag
>>> tag           # only the first <a> tag is returned
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> 

This pot of soup needs both ingredients and firewood: demo is the ingredient, and "html.parser" is the firewood and the recipe that turn it into soup.

Start cooking the soup:

soup=BeautifulSoup(demo,"html.parser")

Now let's taste the soup:

>>> soup.a.name
'a'
>>> soup.a.parent.name
'p'
>>> soup.a.parent.parent.name
'body'
>>> 

Parsing tag attributes

>>> tag=soup.a    # the first <a> tag
>>> tag           # only the first <a> tag is returned
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> tag.attrs    # returned as a dict, so the information can be extracted with dict operations
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
>>> 

Extracting information from the attributes

>>> tag.attrs['class']
['py1']
>>> tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
# check the type of the tag's attributes
>>> type(tag.attrs)
<class 'dict'>    # a dictionary
# Note: .attrs always returns a dict, whether or not the tag has attributes; if it has none, the dict is empty.

# the type of the tag itself
>>> type(tag)
<class 'bs4.element.Tag'>
>>> 
>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>

Note: the <p> tag also contains a <b> tag, yet .string still returns the text; this shows that NavigableString can span nested tags.

>>> newsoup=BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")
>>> newsoup.b.string    # the comment markers are stripped; only the text is returned
'This is a comment'
>>> type(newsoup.b.string)
<class 'bs4.element.Comment'>
>>> newsoup.p.string
'This is not a comment'
>>> type(newsoup.p.string)
<class 'bs4.element.NavigableString'>

# so you need to check the type to tell a Comment from an ordinary string
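For instance, here is a minimal sketch (reusing the newsoup example above) of checking the type before using a string, so that comments are skipped:

from bs4 import BeautifulSoup
from bs4.element import Comment

newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")

for tag in (newsoup.b, newsoup.p):
    s = tag.string
    if isinstance(s, Comment):      # comment text: usually discarded
        print(tag.name, "-> comment skipped")
    elif s is not None:             # an ordinary NavigableString
        print(tag.name, "->", s)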

Traversing the HTML

Downward traversal

.contents returns a list, while the other two (.children and .descendants) return iterators and can only be used in loops.

PS: .descendants iterates over all of a node's descendants, whereas the other two only cover its direct children.

Downward traversal example

>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents    # returned as a list, so it can be processed with list operations
[<title>This is a python demo page</title>]
>>> soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
>>> len(soup.body.contents)    # <body> has five child nodes
5
>>> soup.body.contents[1]    # index the second child node
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.body.contents[0]
'\n'
# '\n' is a newline character
# Note: a tag's children include not only tag nodes but also string nodes; the '\n' strings count as children too.

Iterating over the child nodes

First, take a look at the structure:

>>> print(soup.body.prettify())
<body>
 <p class="title">
  <b>
   The demo python introduces several python courses.
  </b>
 </p>
 <p class="course">
  Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
  <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
   Basic Python
  </a>
  and
  <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
   Advanced Python
  </a>
  .
 </p>
</body>
>>> for child in soup.body.children:
    print(child)
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

Upward traversal

Upward traversal example

>>> soup.title.parent

<head><title>This is a python demo page</title></head>
>>> soup.html.parent    # the parent of <html> is the document (the BeautifulSoup object) itself, so the whole document is printed

<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.parent    # soup (the document) has no parent, so nothing is displayed

>>> 

Printing the names of all ancestors

>>> soup.parent
>>> for parent in soup.a.parents:   # print the names of all ancestors of the <a> tag
    if parent is None:    # the document's own parent does not exist, so check for None
        print(parent)
    else:
        print(parent.name)

p
body
html
[document]

Sibling (parallel) traversal

Sibling traversal takes place among nodes that share the same parent.

>>> soup.a.next_sibling   
' and '     

NavigableString objects are nodes too, so the siblings before and after a node are not necessarily tags; sometimes you need to check the type.

>>> soup.a.next_sibling.next_sibling  # the sibling after the <a> tag's next sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
>>> soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
>>> soup.a.previous_sibling.previous_sibling
>>>
>>> soup.a.parent
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
>>>

Iterating over all following and preceding siblings

>>> for sibling in soup.a.next_siblings:
    print(sibling)    # iterate over the following siblings

 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.
>>> for sibling in soup.a.previous_siblings:
    print(sibling)    # iterate over the preceding siblings

Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

>>> 
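Because string nodes appear among the siblings, a common pattern is to keep only the sibling tags. A minimal sketch (again assuming the demo.html soup from above):

import bs4

# collect only the sibling *tags* of the first <a>, ignoring the strings in between
tag_siblings = [s for s in soup.a.next_siblings if isinstance(s, bs4.element.Tag)]
for s in tag_siblings:
    print(s.name, s.get("href"))    # a http://www.icourse163.org/course/BIT-1001870001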

HTML-formatted output with the bs4 library

prettify()

  • prettify() adds a newline character ('\n') after each HTML tag, which makes the HTML more "friendly" to read.
>>> soup.prettify()    # newlines have been added after every tag

'<html>\n <head>\n  <title>\n   This is a python demo page\n  </title>\n </head>\n <body>\n  <p class="title">\n   <b>\n    The demo python introduces several python courses.\n   </b>\n  </p>\n  <p class="course">\n   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\n   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">\n    Basic Python\n   </a>\n   and\n   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">\n    Advanced Python\n   </a>\n   .\n  </p>\n </body>\n</html>'
>>> print(soup.prettify())   # print it to get readable output

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
  • A single tag can also be "prettified" on its own ⭐
>>> soup.a   # this is what it looks like before

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> print(soup.a.prettify())   # and this is what it looks like after: much prettier!

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>

Encoding

bs4 converts HTML to UTF-8, which handles Chinese text. Python 3 supports this natively, while Python 2 requires explicit encoding conversion.

>>> soup0=BeautifulSoup("<p>中文</p>","html.parser")
>>> soup0.p.string
'中文'
>>> print(soup0.p.prettify())
<p>
 中文
</p>

Summary

  1. Basic elements of the bs4 library

Tag, Name, Attributes, NavigableString, Comment

  2. Traversal features of the bs4 library

Downward traversal

.contents

.children

.descendants

Upward traversal

.parent

.parents

Sibling traversal

.next_sibling

.previous_sibling

.next_siblings

.previous_siblings

Information Marking

Internationally recognized general forms of information marking

XML, JSON, YAML

  • XML

XML is a general-purpose markup format in the same family as HTML; information is expressed through tags and their content.
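For example, the kind of record shown in the JSON and YAML sections below could be written in XML roughly like this (the field names are made up for illustration):

<person>
    <name>李世昱</name>
    <age>23</age>
    <websitename>lishiyu.vip</websitename>
</person>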

  • JSON

Typed key-value pairs (key:value). A major advantage is that JSON can be used directly in JavaScript.

"key":"value"

"key":["value1","value2"]

"key":{"subkey":"subvalue"}

"name":"李世昱"
“age”:23     //这是有类型的
"lishiyu"["李世昱"23]
“names”:{
    “websitename”:“lishiyu.vip”,
    "realname":"李世昱"
}
  • YAML

Untyped key-value pairs (key:value).

  • No double quotes

  • Indentation expresses hierarchy (which key a value belongs to)

  • A leading "-" expresses parallel (list) items, as in:

    name:
        - 李世昱
        - 张**
  • "|" introduces a block of literal text, and "#" starts a comment

  • Key-value pairs can be nested

Comparing the three

XML is the earliest general-purpose information markup language; it is highly extensible but verbose. It is typically used for information exchange over the Internet.

JSON carries typed information and is well suited to processing by programs; it is more concise than XML. It is typically used in program interfaces (APIs).

YAML carries untyped information and has the highest proportion of plain text, so it is very readable. It is mostly used for system configuration files.
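As a minimal sketch of how a program consumes these formats, the snippet below parses the same record expressed as XML and as JSON using only Python's standard library (YAML is omitted because it needs a third-party package such as PyYAML; the field names are made up for illustration):

import json
import xml.etree.ElementTree as ET

xml_text = "<person><name>lishiyu</name><age>23</age></person>"
json_text = '{"name": "lishiyu", "age": 23}'

# XML: everything comes back as text, so types must be converted by hand
root = ET.fromstring(xml_text)
print(root.find("name").text, int(root.find("age").text))

# JSON: values keep their types (str, int, ...), which is why it suits program processing
record = json.loads(json_text)
print(record["name"], record["age"])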

Information extraction methods

Method 1: fully parse the marked-up form of the information, then extract the key information.
Works on XML, JSON, YAML.
Requires a markup parser, e.g. bs4's tag-tree traversal.
Advantage: the information is parsed accurately.
Disadvantage: the extraction process is tedious and slow.

Method 2: ignore the markup entirely and search directly for the key information.
Search.
Only needs text-search functions over the information.
Advantage: the extraction process is simple and fast.
Disadvantage: the accuracy of the results depends on the content of the information.

Combined method: combine formal parsing with searching, using both a markup parser and text-search functions to extract the key information. The example below takes this approach.

Example

Extract all the URL links from an HTML page

>>> import requests
>>> r=requests.get("http://python123.io/ws/demo.html")
>>> demo=r.text
>>> from bs4 import BeautifulSoup
>>> soup=BeautifulSoup(demo,"html.parser")
>>> for link in soup.find_all('a'):
    print(link.get('href'))

http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001
>>> 

The find_all() function

<>.find_all(name, attrs, recursive, string, **kwargs)
It returns a list that stores the search results.
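The parameters can also be combined; a minimal sketch on the demo soup from above (each parameter is explained one by one below):

>>> import re
>>> soup.find_all('a', id=re.compile('link'), string=re.compile('Python'))    # <a> tags whose id contains "link" and whose text contains "Python"
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]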

First parameter: name

A search string matched against tag names.

# find all <a> tags
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

# find <a> and <b> tags at the same time
>>> soup.find_all(['a','b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> 

find_all(True)

Passing True returns every tag in the document:

>>> for Tag in soup.find_all(True):
    print(Tag.name)

html
head
title
body
p
b
p
a
a
>>> 

Searching for all tags whose names contain the letter 'b':

>>> import re    # the regular-expression library; not covered in detail here

>>> for tag in soup.find_all(re.compile('b')):
    print(tag.name)

body
b

Second parameter: attrs

Whether a tag's attributes contain a given string.

attrs: a search string matched against tag attribute values; a specific attribute can also be named for the search.

>>> soup.find_all('p','course')   # <p> tags whose attributes contain "course" (here, class="course")
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> 

Constraining a specific attribute

>>> soup.find_all(id='link1')   # tags whose id attribute equals "link1"
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> soup.find_all(id="link")
[]



>>> soup.find_all(id=re.compile('link'))   # use a regular expression to match tags whose id contains "link"
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> 

Third parameter: recursive

Whether to search all of the node's descendants; a boolean, True by default.

In other words, if it is set to False, only the node's direct children are searched.

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a',recursive=False)
[]
>>> 

Fourth parameter: string

string: a search string matched against the string content between <>…</>.

>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string='Basic Python')
['Basic Python']
>>> 


# use a regular expression to find every string region containing 'Python'
>>> import re
>>> soup.find_all(string=re.compile('Python'))
['Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n', 'Basic Python', 'Advanced Python']
>>> 

Shorthand

<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)
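A quick check of the shorthand on the demo soup (the two calls return the same list):

>>> soup('a') == soup.find_all('a')
True
>>> soup.p('b') == soup.p.find_all('b')
True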

Seven related methods

| Method | Description |
| --- | --- |
| <>.find() | searches and returns only one result; same parameters as .find_all() |
| <>.find_parents() | searches among the ancestor nodes; returns a list; same parameters as .find_all() |
| <>.find_parent() | returns one result from the ancestor nodes; same parameters as .find() |
| <>.find_next_siblings() | searches among the following siblings; returns a list; same parameters as .find_all() |
| <>.find_next_sibling() | returns one result from the following siblings; same parameters as .find() |
| <>.find_previous_siblings() | searches among the preceding siblings; returns a list; same parameters as .find_all() |
| <>.find_previous_sibling() | returns one result from the preceding siblings; same parameters as .find() |
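A minimal sketch exercising a few of these methods on the demo.html soup from earlier:

>>> first_a = soup.find('a')    # a single Tag (or None), not a list
>>> first_a['id']
'link1'
>>> first_a.find_parent('p')['class']
['course']
>>> first_a.find_next_sibling('a')['id']
'link2'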


Example: Scraping the Chinese University Rankings

The code

import requests
from bs4 import BeautifulSoup
import bs4

#getHTMLText fetches the URL and returns the page's HTML content
def getHTMLText(url):
    try:
        r=requests.get(url,timeout=30)
        r.raise_for_status()
        r.encoding=r.apparent_encoding
        return r.text
    except:
        return "url获取网页失败"

#fillUnivlist parses the useful content out of the HTML and stores it in a list
def fillUnivlist(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[2].string])

#printUnivList prints the list of results
def printUnivList(ulist, num):
    #\t inserts a tab character to separate the columns
    print("{:^10}\t{:^6}\t{:^10}".format("Rank", "School", "Score"))
    for i in range(num):
        u = ulist[i]
        print("{:^10}\t{:^6}\t{:^10}".format(u[0], u[1], u[2]))

#main ties the functions together
def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html'
    html = getHTMLText(url)
    fillUnivlist(uinfo, html)
    printUnivList(uinfo, 20)

#run the main function
main()

http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html

View the page's source code:

……


Author: 李世昱
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit 李世昱 when reposting.