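Introduction to the BeautifulSoup API, Part 1

BeautifulSoup is a Python library for parsing HTML and XML documents. It extracts data from web pages, which makes it a good fit for web scraping and data-mining tasks.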

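Main features

Parse HTML/XML: turn a complex HTML document into a tree of Python objects
Navigate the parse tree: reach elements through tag names, attributes, and similar handles
Search the parse tree: look up specific content with methods such as find() and find_all()
Modify the parse tree: change tags and attributes, or remove elements
Formatted output: re-serialize the parsed document as consistently indented HTML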
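
The navigate, modify, and output features listed above can be sketched on a small inline document. The HTML string and variable names here are made up for illustration; this is a minimal sketch, not the only way to edit a tree.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used only for illustration.
html = '<div id="box"><p class="intro">Hello</p><span>ad</span></div>'
soup = BeautifulSoup(html, "html.parser")

# Navigate: attribute-style access returns the first tag with that name.
paragraph = soup.div.p

# Modify: change an attribute, replace the text, and remove an element.
paragraph["class"] = "lead"
paragraph.string = "Hello, world"
soup.span.decompose()

compact_html = str(soup)       # serialization without added whitespace
pretty_html = soup.prettify()  # indented, one node per line

print(compact_html)
print(pretty_html)
```

decompose() removes the tag from the tree and destroys it, so the span no longer appears in either output.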

Use cases

Data extraction (pulling specific information out of HTML)
Web page content analysis
Automated testing
Rapid prototyping

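Basic usage examples

Installation

```bash
# requests is only needed for the examples below that fetch pages over HTTP
pip install beautifulsoup4 requests
```

Basic parsing example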

```python
from bs4 import BeautifulSoup
import requests

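# Fetch the page content
url = "https://www.sina.com.cn"
response = requests.get(url, timeout=10)  # a timeout keeps the call from hanging indefinitely
response.raise_for_status()               # stop early on an HTTP error response
html_content = response.text

# Create the BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')

# Get the title
print(soup.title)          # the <title> tag itself
print(soup.title.string)   # the text inside the <title> tag

# Get all links
for link in soup.find_all('a'):
    print(link.get('href'))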
```
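
The same calls can be tried without a network connection. The sketch below swaps the downloaded page for a hypothetical inline string (standing in for response.text) to show the difference between a Tag and its text, and what link.get('href') returns when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, standing in for response.text.
html = """
<html><head><title>Example page</title></head>
<body>
  <a href="/news">News</a>
  <a href="https://example.com/sports">Sports</a>
  <a>No link target</a>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title           # the whole Tag object
title_text = soup.title.string   # just the text inside the tag

# link.get("href") returns None for the <a> that has no href attribute.
hrefs = [link.get("href") for link in soup.find_all("a")]

print(title_tag)
print(title_text)
print(hrefs)
```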

Common methods and examples

find() returns the first matching HTML tag.

Signature: find(name, attrs, recursive, string, **kwargs)

When find() cannot locate a match, it returns None.

```python
# Find by id
header = soup.find(id="header")
```
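
Because find() returns None on a miss, calling a method on the result without a check raises AttributeError. A minimal sketch of guarding against that, using a made-up document and a hypothetical helper named text_of:

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used only for illustration.
html = '<div id="header"><h1>Site name</h1></div><p>Body text</p>'
soup = BeautifulSoup(html, "html.parser")


def text_of(tag):
    """Return the tag's text, or a placeholder when find() found nothing."""
    return tag.get_text(strip=True) if tag is not None else "(not found)"


header = soup.find(id="header")   # matches the <div>
footer = soup.find(id="footer")   # no such element, so this is None

header_text = text_of(header)
footer_text = text_of(footer)
print(header_text)
print(footer_text)
```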

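find_all() searches all descendants of the current tag and checks each one against the filter conditions.

It returns every matching tag:

find_all(name, attrs, recursive, string, limit, **kwargs)

name: a tag name. Passing a value for name finds every tag with that name.
class_: finds tags that have the given CSS class name. The trailing underscore is needed because class is a reserved word in Python.
limit: caps the number of results returned.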
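
A short sketch of how these parameters behave, run against a hypothetical menu snippet. It also touches attrs and recursive from the signature above; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used only for illustration.
html = """
<ul id="menu">
  <li class="item important">Home</li>
  <li class="item">News</li>
  <li class="item important">Contact</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all("li")                      # name: every <li>
important = soup.find_all("li", class_="important")  # class_: matches any one of a tag's classes
first_two = soup.find_all("li", limit=2)             # limit: stop after two results
by_attrs = soup.find_all(attrs={"id": "menu"})       # attrs: filter with a dict of attributes
top_level = soup.find_all("li", recursive=False)     # only direct children of soup, so no <li>

important_names = [li.get_text() for li in important]
print(len(all_items), important_names, len(first_two))
print(by_attrs[0].name, top_level)
```

With recursive=False the search stops at the direct children of the object it is called on, which is why the last call comes back empty here.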

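```python
# Find by tag name
all_paragraphs = soup.find_all('p')  # all <p> tags

# Find by class name
important_items = soup.find_all(class_="important")

# Find by attribute
images = soup.find_all('img', src=True)  # every <img> tag that has a src attribute
for img in images:
    print(img['src'])                     # read the src attribute
    print(img.get('alt', 'No alt text'))  # safe lookup with a fallback value
```

select() finds elements with a CSS selector and returns a list of all matching Tag objects.

```python
# CSS selector
results = soup.select('div.content > p.intro')  # uses CSS selector syntax
```

select_one() finds elements with a CSS selector and returns the first matching Tag, or None. Use it to look up a single element.

```python
special_h1 = soup.select_one('h1.special')
```

Practical example: extracting news headlines and links

```python
from bs4 import BeautifulSoup
import requests

url = "https://www.sina.com.cn/"
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# Extract news headlines and links
for item in soup.select('.top_newslist'):
    title = item.select_one('li > a')
    if title is None:  # select_one() returns None when the list has no link
        continue
    print(f"Title: {title.text}")
    print(f"Link: {title.get('href')}\n")
```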
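
The extraction logic can also be exercised offline. The sketch below assumes markup shaped like the .top_newslist blocks targeted above; the HTML, the URLs, and the helper name extract_headlines are hypothetical, and the real page structure may differ. Unlike the loop above, which takes only the first link in each list, it collects every link.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page structure may differ.
html = """
<ul class="top_newslist">
  <li><a href="https://example.com/a">First headline</a></li>
  <li><a href="https://example.com/b">Second headline</a></li>
</ul>
<ul class="top_newslist">
  <li>No link in this list</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")


def extract_headlines(soup):
    """Return (title, href) pairs for every link inside a .top_newslist block."""
    headlines = []
    for link in soup.select(".top_newslist li > a"):
        href = link.get("href")
        if href:  # skip anchors without a target
            headlines.append((link.get_text(strip=True), href))
    return headlines


headlines = extract_headlines(soup)
for title, href in headlines:
    print(f"Title: {title}")
    print(f"Link: {href}")
```

Returning a list of tuples rather than printing inside the loop keeps the parsing step separate from the output step, which makes the function easier to test.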
