Python 3 from Beginner to Expert (12): The BeautifulSoup Module

Original article · last modified 2026-06-10 20:18:01

This content is licensed under CC 4.0 BY-SA.

First published 2026-01-15 23:24:40

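1. What BeautifulSoup Is

BeautifulSoup (package name: bs4) is a third-party Python library for parsing HTML and XML documents. It turns raw page source into a tree of Python objects, so content can be located, extracted, and modified by navigating tree nodes instead of processing HTML with raw string operations.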

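2. Installing BeautifulSoup and a Parser

# Install the core library:
python -m pip install beautifulsoup4

BeautifulSoup relies on a parser to do the actual parsing. Common parsers compared:

Parser	Dependency (extra install needed?)	Speed	Tolerance of malformed HTML	Document types	Key characteristics
html.parser	Built into Python (no install)	Medium	Fair	HTML	No extra dependency; fine for simple HTML, weaker on malformed markup such as unclosed tags
lxml	Requires install (pip install lxml)	Fastest	Good	HTML/XML	Fastest option with good error tolerance; recommended for production
lxml-xml	Requires lxml	Fastest	Good	XML	Dedicated XML parsing that follows the XML specification strictly
html5lib	Requires install (pip install html5lib)	Slowest	Best	HTML	Parses the way a browser does and copes with badly malformed HTML (e.g. a missing closing tag)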
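
To make the differences in the table concrete, the sketch below feeds the same broken markup to each parser and prints how each one repairs it. The `BROKEN_HTML` constant and the `parse_with` helper are illustrative names, not part of bs4; parsers that are not installed are skipped rather than treated as errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately broken markup: <a> is never closed and </p> was never opened.
BROKEN_HTML = "<a></p>"


def parse_with(parser_name):
    """Return the repaired markup, or None if that parser is not installed."""
    try:
        return str(BeautifulSoup(BROKEN_HTML, parser_name))
    except FeatureNotFound:
        return None


results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, repaired in results.items():
    print(f"{name:12} -> {repaired if repaired is not None else 'not installed'}")
```

Because each parser repairs the markup differently, it is safest to name the parser explicitly in every `BeautifulSoup(...)` call so results do not change from one machine to another.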

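3. Core Objects After Parsing

Parsing HTML/XML with bs4 produces four core kinds of object.

3.1 The document object: BeautifulSoup

Represents the whole HTML/XML document and is the entry point for parsing; every tag and string is nested beneath it. It inherits all the behavior of Tag (think of it as a "top-level Tag") and adds a few attributes and methods of its own.

Methods of the BeautifulSoup object fall into two groups:
1. General methods inherited from Tag, used exactly as on an ordinary Tag, for searching and manipulating tags across the whole document
2. Document-level methods for tasks such as creating tags, pretty-printing, and handling encodings

# Syntax:
* BeautifulSoup(arg, 'lxml')
  - arg: the markup to parse (a str, bytes, or an open file object)
  - 'lxml': the parser to use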

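# Core attributes of the BeautifulSoup object:
* name: always "[document]", which marks the object as the document root (as opposed to an ordinary Tag)
* attrs: an empty dict (the root has no HTML attributes)
* contents: list of the root's "direct children"
  - usually the <html> tag, plus possibly a doctype, comments, or whitespace
* children: iterator over the root's "direct children" (saves memory; use it instead of contents)
  - usually the <html> tag, plus possibly a doctype, comments, or whitespace
* descendants: iterator over "all descendants" in the document (walks the whole tree recursively)
* parent: the root has no parent, so this is always None
* parents: an empty iterator (no ancestors)
* string: None whenever the root has more than one child; if the document is a single chain of only children ending in one string, that string is returned
* text: all text inside the document, concatenated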
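
The attribute list above can be checked directly. This is a minimal sketch using the built-in html.parser and two tiny made-up documents; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p><p>two</p>", "html.parser")

print(soup.name)            # [document]
print(soup.attrs)           # {}
print(soup.parent)          # None
print(list(soup.parents))   # []
print(len(soup.contents))   # 2
print(soup.string)          # None, because the root has two children
print(soup.text)            # onetwo

# A document that is a single chain of only children does expose .string
single = BeautifulSoup("<p>only</p>", "html.parser")
print(single.string)        # only
```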

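# Document-level methods and attributes of the BeautifulSoup object:
# (only new_tag and original_encoding are exclusive to the root; the others also exist on Tag)
* prettify(encoding=None): return the document re-indented for reading
  - encoding: if given (e.g. 'utf-8'), the result is bytes instead of str
* new_tag(): create a new Tag, which can then be attached to the document with "append"
  - name: tag name
  - attrs: attribute dict; pass the class attribute as attrs={'class': ...}, because a class_= keyword is stored here as a literal "class_" attribute
* encode(): serialize the document to bytes
  - encoding: target encoding, e.g. 'utf-8'
* decode(): serialize the document to a str (equivalent to str(soup)); it does not take bytes and needs no prior "encode"
* replace_with(): inherited from Tag but not usable on the root, which has no parent, so the call raises ValueError
  - to replace a whole document, parse the new markup into a new BeautifulSoup object instead
* original_encoding: (attribute) the encoding detected for the input
  - set only when the input was bytes; for str input it is None
  - soup.original_encoding -> 'utf-8'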
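
A short sketch of the document-level calls listed above, assuming the built-in html.parser and a made-up one-line document. It shows `new_tag` with an `attrs` dict, and `encode` / `decode` as two ways of serializing the same tree.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div id='main'><p>hello</p></div>", "html.parser")

# Pass "class" through attrs; new_tag() does not translate a class_ keyword.
footer = soup.new_tag("div", attrs={"class": "footer"})
footer.string = "footer text"
soup.div.append(footer)

as_bytes = soup.encode("utf-8")   # bytes, ready to write to a file opened in binary mode
as_text = soup.decode()           # str, same as str(soup)

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text)
print(soup.original_encoding)     # None: the input was a str, so nothing was detected
```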

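# General-purpose methods of the BeautifulSoup object (inherited from Tag):
* find(): search the whole document for the "first" matching tag; returns None when nothing matches
  - name: tag name
  - class_: class name
  - string: match on the tag's text (older code passes the deprecated text argument)
  - attrs: attribute dict
  - kwargs: attributes passed directly (e.g. id='main')
  - recursive: whether to search all descendants
    - True by default; False restricts the search to direct children
* find_all(): search the whole document for "all" matching tags; returns a list (a ResultSet)
  - name: tag name
  - class_: class name
  - string: match on the tag's text (older code passes the deprecated text argument)
  - attrs: attribute dict
  - kwargs: attributes passed directly (e.g. id='main')
  - limit: maximum number of results, e.g. limit=2 returns only the first two
  - recursive: whether to search all descendants
    - True by default; False restricts the search to direct children
* select(): search the whole document with a "CSS selector" for "all matches"; returns a list
  - selector: the CSS selector
    - tag selector: 'div', 'a'
    - class selector: .content (matches class="content")
    - ID selector: #main (matches id="main")
    - descendant selector: div p (every p anywhere under a div)
    - child selector: div > a (a tags that are direct children of a div)
    - attribute selector: a[href^="https"] (a tags whose href starts with https)
* select_one(): search the whole document with a "CSS selector" for the "first match"
  - selector: the CSS selector, as for select()
* get_text(): extract all text in the document; nothing is de-duplicated, and whitespace is kept unless strip=True
  - separator='\t': string placed between text fragments (default '')
  - strip=True: strip each fragment and drop empty ones (default False)
* find_parent(): the root has no parent, returns None (rarely used)
* find_parents(): returns an empty list (rarely used)
* find_next_sibling(): the root has no siblings, returns None (rarely used)
* has_attr(): the root has no attributes, always False (not useful)

# When to use:
To set up parsing; it is the starting point for every search and extraction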
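
The search parameters above are easier to compare side by side. The following sketch uses a small made-up fragment and the built-in html.parser; all variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

html = """
<div id="main">
  <a class="nav" href="https://example.com/a">Home</a>
  <a class="nav" href="/b">Products</a>
  <section><a class="nav" href="/c">About</a></section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
main = soup.find(id="main")

first = soup.find("a", class_="nav")                       # first match only
limited = soup.find_all("a", limit=2)                      # at most two matches
direct = main.find_all("a", recursive=False)               # direct children of #main only
by_string = soup.find_all("a", string=re.compile("^Pro"))  # match on the tag's text
by_css = soup.select('#main > a[href^="https"]')           # CSS: direct child + attribute prefix
missing = soup.select_one("#nope")                         # None when nothing matches

print(first["href"], len(limited), len(direct))
print([a.get_text() for a in by_string], [a["href"] for a in by_css], missing)
```

Note that `find()` and `select_one()` return None on a miss, so results should be checked before attributes or text are read from them.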

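3.2 The tag object: Tag

Corresponds to a single HTML/XML tag such as <div>, <a>, or <p>; it is the object used most often in bs4.
Typical uses: reading a tag's attributes (links, image src) and text, and modifying the tag structure.

# How to get one:
* Via soup.<tag name>, find(), find_all(), and similar calls; the tag name, attributes, and text can all be modified

# Core attributes:
"Shared by Tag and BeautifulSoup objects"
* name: get or change the tag name
* text: all text inside the tag
* attrs: all attributes of the tag as a dict; individual attributes can be read and written by key
* string: the tag's single string child; None when the tag has more than one child. A lone child tag is followed recursively, and a lone comment is returned as a Comment object
* strings: iterator over every string inside the tag, nested tags included, with whitespace and newlines preserved
* stripped_strings: iterator over the same strings with surrounding whitespace stripped and whitespace-only strings skipped (usually the better choice)
* parent: the tag's direct parent
  - the parent of the root <html> tag is the BeautifulSoup object
* parents: iterator over all ancestors
  - from the direct parent up to the root, BeautifulSoup object included
  - for p in tag.parents: print(p.name)
* next_sibling: the next node at the same level; this is often a whitespace string rather than a tag
  - tag.next_sibling.strip() only works when the sibling is a string; to skip strings and get the next tag, use tag.find_next_sibling()
* previous_sibling: the previous node at the same level; this is often a whitespace string rather than a tag
  - tag.previous_sibling.strip() only works when the sibling is a string; to skip strings and get the previous tag, use tag.find_previous_sibling()
* children: iterator over the tag's direct children (tags and strings, no grandchildren)
  - <div><p>text</p></div> → the child is the <p> tag
* descendants: iterator over all descendants of the tag, walking every nested tag and string recursively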
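# Core methods:
# 1. Search methods:
* find(name=None, attrs={}, recursive=True, string=None, **kwargs):
  - Purpose: return the "first matching" descendant tag, or None
  - name: tag name; a list matches any of the listed names
  - class_: class name (a keyword argument)
  - string: match on the tag's text (older code passes the deprecated text argument)
  - attrs: attribute dict
  - kwargs: attributes passed directly (e.g. id='main')
  - recursive: whether to search all descendants
    - True by default; False restricts the search to direct children

* find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs):
  - Purpose: return all matching descendant tags as a list
  - name: tag name
  - class_: class name (a keyword argument)
  - string: match on the tag's text (older code passes the deprecated text argument)
  - attrs: attribute dict
  - kwargs: attributes passed directly (e.g. id='main')
  - limit: maximum number of results, e.g. limit=2 returns only the first two
  - recursive: whether to search all descendants
    - True by default; False restricts the search to direct children

* select(selector):
  - Purpose: use a "CSS selector" to find "all matching descendant tags"; returns a list
  - selector: the CSS selector
    - tag selector: 'div', 'a'
    - class selector: .content (matches class="content")
    - ID selector: #main (matches id="main")
    - descendant selector: div p (every p anywhere under a div)
    - child selector: div > a (a tags that are direct children of a div)
    - attribute selector: a[href^="https"] (a tags whose href starts with https)

* select_one(selector):
  - Purpose: use a "CSS selector" to find the "first matching descendant tag"
  - selector: the CSS selector, as for select()

* find_parent(name=None, attrs={}, **kwargs):
  - Purpose: return the "first matching ancestor" of the current tag
  - name and attribute filters as for find(); there is no recursive argument
* find_parents(name=None, attrs={}, limit=None, **kwargs):
  - Purpose: return "all matching ancestors" of the current tag, "nearest first"
  - name and attribute filters as for find(); there is no recursive argument
* find_next_sibling(name=None, attrs={}, string=None, **kwargs):
  - Purpose: return the "next matching sibling" of the current tag
  - filters as for find(); there is no recursive argument
* find_previous_sibling(name=None, attrs={}, string=None, **kwargs):
  - Purpose: return the "previous matching sibling" of the current tag
  - filters as for find(); there is no recursive argument
* find_next(name=None, attrs={}, string=None, **kwargs):
  - Purpose: return the first matching tag after the current one, "at any depth"
  - filters as for find(); there is no recursive argument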

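# 2. Modification methods:
* get(key, default=None): read an attribute safely; returns the default when the attribute is missing
* tag[key] = value: set or change an attribute (Tag has no set() method); del tag[key] removes it
* append(child): add a child node at the end of the tag
* insert(position, child): insert a child node at a given position inside the tag
  - position: index
* replace_with(new_obj): replace the current tag or string with new content (a tag or a string)
* decompose(): destroy the current tag and all its children, removing them from the document
* extract(): remove the current tag from the document and return it, so it can be reused
* clear(): remove everything inside the tag but keep the tag itself
* unwrap(): remove the tag but keep its contents in place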
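
The modification methods above can be chained on one small tree. This is a sketch on made-up markup with the built-in html.parser; the final `print` shows the resulting tree, whose attribute order may depend on the bs4 version.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="box"><p class="a">one</p><p>two <b>bold</b></p><span>gone</span></div>',
    "html.parser",
)
box = soup.div

box["data-role"] = "demo"           # set an attribute (there is no tag.set())
print(box.get("missing", "n/a"))    # safe read with a default

soup.span.decompose()               # delete <span> and everything inside it
soup.b.unwrap()                     # drop the <b> wrapper, keep its text
first_p = soup.find("p", class_="a").extract()   # detach and keep for reuse

new_p = soup.new_tag("p")
new_p.string = "zero"
box.insert(0, new_p)                # insert as the first child
box.append(first_p)                 # re-attach the extracted tag at the end

print(box)
```

The difference between `decompose()` and `extract()` is whether the removed tag is still needed: `extract()` hands it back, `decompose()` discards it.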

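# 3. Output/conversion (extracting text, exporting tags)
* text / get_text(separator='', strip=False): all text inside the tag; text is a property, not a method
  - separator='\t': string placed between text fragments
  - strip=True: strip surrounding whitespace from each fragment
* prettify(): return the tag re-indented for reading
* encode(): serialize the tag to bytes
* decode(): serialize the tag to a str (no prior encode needed)

# 4. Checks
* has_attr(key): whether the tag has the given attribute
* is_empty_element: (property) True for a void tag with no contents, such as <br> or <img>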

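# text vs get_text() vs string:
* string: only useful when the tag holds a single piece of text (no nesting); returns None when the tag has several children
* text: same as get_text() with no arguments; extracts all text and concatenates it
* get_text(strip=True): strips whitespace around each text fragment
* get_text(separator='\n'): joins the text fragments with the given separator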
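
A side-by-side run makes the distinction concrete. The markup below is made up for illustration and parsed with the built-in html.parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div> <p>plain</p> <p>mixed <b>bold</b></p> </div>", "html.parser")
plain, mixed = soup.find_all("p")

print(plain.string)     # plain
print(mixed.string)     # None: the tag has two children (a string and <b>)
print(repr(soup.div.text))                                   # ' plain mixed bold '
print(repr(soup.div.get_text(strip=True)))                   # 'plainmixedbold'
print(repr(soup.div.get_text(separator="|", strip=True)))    # 'plain|mixed|bold'
print(list(soup.div.stripped_strings))                       # ['plain', 'mixed', 'bold']
```

With `strip=True` and no separator the fragments run together, so a separator is usually wanted whenever text from several tags is combined.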

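3.3 The navigable string: NavigableString

Represents plain text inside a Tag (not a tag, not a comment). It is a subclass of str that also supports tree navigation.

# Usage:
* Obtained via tag.string; convert it to an ordinary Python string with str()

# When to use:
* Extracting the plain text inside a tag, and telling "tags" apart from "text content"

3.4 The comment object: Comment

A subclass of NavigableString that represents an HTML comment (<!-- ... -->), so that comments are not mistaken for ordinary text

# Usage:
* Check for it with isinstance(); the object's value is the comment body

# When to use:
* Filtering out or extracting HTML comments so they do not interfere with text extraction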
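
One way to apply this, sketched on made-up markup with the built-in html.parser: collect every Comment with a `string=` filter, then remove them from the tree before extracting text.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<p>visible<!-- hidden note --> text</p>", "html.parser")

# A function passed as string= is called with each string node in the tree.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
bodies = [str(c).strip() for c in comments]
print(bodies)

# When a comment is a tag's only child, .string returns the Comment itself.
only = BeautifulSoup("<p><!-- hidden note --></p>", "html.parser")
print(type(only.p.string).__name__)

for c in comments:
    c.extract()          # remove the comment from the tree
print(soup.p)
```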

4. Core Attributes and Methods

4.1 Combined example: common attributes of every object type

from bs4 import BeautifulSoup
from bs4 import Comment

# 示例HTML文档
html_doc = """
<html>
    <head>
        <title>测试页面</title>
    </head>
    <body>
        <div id="main" class="container">
            <p class="content">这是第一段文本</p>
            <h1 class="content">这是第二段文本<!-- 这是注释 --></h1>
            <a href="https://example.com" class="link">示例链接1</a>
            <a href="/internal/page" class="link">示例链接2</a>
        </div>
    </body>
</html>
"""
print("###########  获取 BeautifulSoup 对象  ###########")
soup = BeautifulSoup(html_doc, "lxml")

# print("按缩进格式化输出: ", soup.prettify())
print("文档根对象名: ", soup.name)
print("根对象属性：", soup.attrs)

print("遍历直接子节点：")
for child in soup.children:
    print(f"  -->节点类型：{type(child)}，节点名：{child.name if hasattr(child, 'name') else '文本'}")

# 3. 遍历所有子孙节点（整个文档的所有标签/文本）
print("前3个子孙节点：")
for i, desc in enumerate(soup.descendants):
    if i >= 3:
        break
    print(desc)

print("获取根对象的父节点: ", soup.parent)
print("获取根对象的所有祖父节点: ", list(soup.parents))
print("获取根对象的直接文本: ", soup.string)
print("获取根对象内所有标签里的文本: ", soup.get_text(strip=True))

print("##############  获取 Tag 对象   ##############")
# 1、通过 soup.标签名 直接获取第一个匹配的标签
title_tag = soup.title

# 2、通过 find 获取指定标签
div_tag = soup.find('div', class_='container')

# name
print("获取title标签名称: ", title_tag.name)
title_tag.name = 'h1'  # 修改标签名称，将 title 修改为 h1
print("获取修改后的title标签名称: ", title_tag.name)

# text
print("text获取div标签内所有文本，不去除首尾空格: ", div_tag.text)

# attrs
print("获取div标签的所有属性: ", div_tag.attrs)  # {'id': 'main', 'class': ['container']}
print("获取div标签的class属性: ", div_tag['class'])  # ['container']
div_tag['class'] = ['new_container']  # 修改属性
print("获取修改后div标签的class属性: ", div_tag['class'])

# string
print("string获取title标签内直接的文本: ", title_tag.string)

# strings
# 迭代获取标签内所有文本，含没有文本的标签和换行
content_iter = soup.div.strings
for item in content_iter:
    print("strings迭代获取div标签中的文本: ", item)

# stripped_strings
# 获取div标签内所有文本, 返回一个迭代器
div_all_text = div_tag.stripped_strings
for content in div_all_text:
    print("stripped_strings迭代获取div标签内所有文本: ", content)

print("获取div标签内所有文本，去除首尾空格: ", div_tag.get_text(strip=True))
print("获取div标签内所有文本，指定分隔符，并且去除首尾空格: ", div_tag.get_text(separator="\t", strip=True))

# parent
parent_tag = div_tag.parent
print("parent获取div标签的直接父标签: \n", parent_tag)

# parents 迭代获取所有父标签
for every_parent_tag in div_tag.parents:
    print("parents获取div标签的所有父标签: ", every_parent_tag)

# next_sibling 获取下一个同级标签
next_tag = div_tag.next_sibling.strip() if div_tag.next_sibling else None
print("next_sibling获取div标签的下一级标签: ", next_tag)

# previous_sibling
pre_tag = div_tag.previous_sibling.strip() if div_tag.previous_sibling else None
print("previous_sibling获取div标签的上一级标签: ", pre_tag)

print("##############  获取 NavigableString 对象   ##############")
# 获取 p 标签的 NavigableString 对象
p_tag = soup.find('p', class_='content')
# 注意：仅当标签内无嵌套子标签/注释时，string才有效
p_text = p_tag.string
print(f"NavigableString对象: {p_text}, 类型是: {type(p_text)}")
p_text = str(p_text)
print(f"类型修改后的NavigableString对象: {p_text}, 类型是: {type(p_text)}")

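# Replace the text content; read it back through .string, which is a NavigableString (.text would be a plain str)
p_tag.string = "这是修改后的第一段文本"
print(f"文本修改后的NavigableString对象: {p_tag.string}, 类型是: {type(p_tag.string)}")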

print("##############  获取 Comment 对象   ##############")
# 找到包含注释的 h1 标签，提取其所有子节点
h1_tag = soup.find('h1', class_='content')
for child in h1_tag.children:
    # print(child)
    # 判断子节点是否为注释对象的实例
    if isinstance(child, Comment):
        print("注释内容: ", child)
        print("注释类型: ", type(child))  # <class 'bs4.element.Comment'>
    else:
        print("普通文本: ", child)

4.2 Core and document-level methods of the BeautifulSoup object

from bs4 import BeautifulSoup
from bs4 import Comment
# 示例HTML文档
html_doc = """
<div class="container">
    <a href="link1.html" class="nav">首页</a>
    <a href="link2.html" class="nav">产品</a>
    <a href="link3.html" class="nav">关于我们</a>
    <p id="info">联系电话：123456</p>
</div>
"""
##############  获取 BeautifulSoup 对象
soup = BeautifulSoup(html_doc, "lxml")

# --------------------------
# 专属方法（文档级操作）
# --------------------------
# 1、按缩进格式化输出
print(soup.prettify())

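# 2. Create a new tag (pass "class" via attrs; a class_= keyword would become a literal "class_" attribute)
new_div = soup.new_tag("div", attrs={"class": "footer"})
new_div.string = "版权所有 © 2026"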
# 找到body（若没有则找html），添加新标签
if soup.body:
    soup.body.append(new_div)
else:
    soup.html.append(new_div)  # 将新创建的标签添加到文档中
# print(soup.prettify())

# 3、将文档转为字节串，用于保存到文件
html_bytes = soup.encode("utf-8")
# print(html_bytes)
print("文档字节串长度：", len(html_bytes))

# 4、将字节串转回字符串
html_str = soup.encode("utf-8").decode("utf-8")
# print(html_str)

# --------------------------
# 通用查找方法（最常用）
# --------------------------
# 1、查找第一个 class属性为nav的a标签对象
first_nav = soup.find('a', class_="nav")
print("第一个class属性为nav的标签文本: ", first_nav.string)

# 2、全局查找所有class为nav的a标签
find_all_nva = soup.find_all('a', class_="nav")
print("所有导航链接：")
for link in find_all_nva:
    print(f"文本：{link.get_text()}，链接：{link.get('href')}")

# 3、用CSS选择器查找id为info的p标签
info_tag = soup.select_one('#info')
print("\n信息文本：", info_tag.get_text(strip=True))

# 4、提取整个文档的所有文本（去空白）
all_text = soup.get_text(strip=True, separator=' | ')
print("\n文档所有文本：", all_text)

4.3 Core methods of the Tag object

from bs4 import BeautifulSoup
import re

# 示例HTML文档
html_doc = """
<html>
    <head>
        <title>测试页面</title>
    </head>
    <body>
        <div class="container">
            <a href="link1.html" class="nav">首页</a>
            <a href="link2.html" class="nav">产品</a>
            <a href="link3.html" class="nav">关于我们</a>
            <p id="info">联系电话：123456</p>
        </div>
        <div id="main" class="container1">
            <p class="content">这是第一段文本<!-- 这是注释 --></p>
            <p class="content">这是第二段文本<span class="highlight">（高亮部分）</span></p>
            <a href="https://example.com" class="link external">示例链接1</a>
            <a href="/internal/page" class="link internal">示例链接2</a>
        </div>
        <div class="sidebar">
            <p class="tip">侧边栏提示</p>
        </div>
    </body>
</html>
"""
##############  获取 BeautifulSoup 对象
soup = BeautifulSoup(html_doc, "lxml")

# --------------------------
# 1. 查找类方法
# --------------------------
print("=" * 60)
# 查找第一个 class属性为nav的a标签对象
first_nav = soup.find(name='a', class_="nav")
print(f"第一个class属性为nav的标签: '{first_nav}'，标签中文本: {first_nav.string}")

p3 = soup.find(string=re.compile('第二段'))
print(p3.parent.text)  # 输出：这是第二段文本（高亮部分）

# 全局查找所有class为nav的a标签
find_all_nva = soup.find_all('a', class_="nav")
print("所有class属性为nav的标签: ")
for link in find_all_nva:
    print(f"  标签中文本: {link.get_text(strip=True)}，链接: {link.get('href', '')}")

external_a = soup.find_all('a', href=re.compile('^https'))
print("按属性匹配href以https开头: ", external_a[0].text)

# 查找多个标签（p和a）
p_a_tags = soup.find_all(['p', 'a'])
print("查找多个p和a标签: ", len(p_a_tags))

# 限制返回数量
first_1_a = soup.find_all('a', limit=1)
print("限制返回数量: ", len(first_1_a))

# 用CSS选择器查找id为info的p标签
info_tag = soup.select_one('#info')
print("select_one查找第一个匹配到的子标签: ", info_tag.get_text(strip=True))

# 用CSS选择器查找所有class为nav的标签
find_all_nav2 = soup.select('.nav')
print("select查找所有匹配到的子标签: ", find_all_nav2)

# ID selector (#main)
main_div = soup.select('#main')[0]  # select() returns a list; take the first item
print("select ID选择器: ", main_div.attrs)  # {'id': 'main', 'class': ['container1']}

# 类选择器（.content）+ 后代选择器（#main .content）
content_ps = soup.select('#main .content')
print("select 类选择器: ", content_ps)

# 属性选择器（a[href^="https"]）
external_a = soup.select('a[href^="https"]')[0]
print("select 属性选择器: ", external_a['href'])

# Child selector (div > a)
direct_a = soup.select('div > a')
print("select 子选择器: ", direct_a)  # a list of 5 tags: 3 nav links plus the 2 links in #main

# 找到第一个a标签的父div（container）
a_tag = soup.find('a')
print(a_tag)
parent_div = a_tag.find_parent('div', class_='container')
print("父div：", parent_div)  # 输出：container

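# All ancestors, nearest first; the BeautifulSoup root is included as '[document]'
all_parents = a_tag.find_parents()
print("所有祖先标签名：", [p.name for p in all_parents if p.name])  # ['div', 'body', 'html', '[document]']

# Next sibling <p> of the first <p> inside #main
# (the very first <p> in the document, #info, has no following <p> sibling)
first_p = soup.select_one('#main p')
next_p = first_p.find_next_sibling('p')
print("下一个同级p：", next_p.get_text())  # 这是第二段文本（高亮部分）

# Previous sibling <p> of that second <p>, i.e. the first paragraph again
prev_p = next_p.find_previous_sibling('p')
print("上一个同级p：", prev_p.get_text())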

# --------------------------
# 2. 修改类方法
# --------------------------
print("=" * 60)
# 安全获取属性
param = first_nav.get("href", '无链接')
print("get方法安全获取href属性: ", param)

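# Create a new tag (pass "class" via attrs; a class_= keyword would become a literal "class_" attribute)
new_div = soup.new_tag("div", attrs={"class": "footer"})
new_div.string = "版权所有©2026"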
# 找到body（若没有则找html），添加新标签
if soup.body:
    soup.body.append(new_div)
else:
    soup.html.append(new_div)  # 将新创建的标签添加到文档中
# print("append添加新的标签: \n", soup.prettify())

# 在指定位置添加新的标签
if soup.body:
    soup.body.insert(0, new_div)
else:
    soup.html.insert(0, new_div)
# print("insert在指定位置添加新的标签: \n", soup.prettify())

# 替换当前标签/文本为新内容
p_tag = soup.find('p')
p_tag.replace_with("联系手机号码：123456")
# print(soup.prettify())

# --------------------------
# 3. 输出/转换类方法
# --------------------------
print("=" * 60)
# 按缩进格式化输出
# print(soup.prettify())

# 提取整个文档的所有文本（去空白）
all_text = soup.get_text(strip=True, separator=' | ')
print("\n文档所有文本：", all_text)

# 将文档转为字节串，用于保存到文件
html_bytes = soup.encode("utf-8")
# print(html_bytes)
print("文档字节串长度：", len(html_bytes))

# 将字节串转回字符串
html_str = soup.encode("utf-8").decode("utf-8")
# print(html_str)
# --------------------------
# 4. 判断类方法
# --------------------------
print("=" * 60)
# 判断标签内是否包含指定属性
if p_tag.has_attr('id'):
    print("p标签是否有id属性: ", p_tag.get('id'))

# 判断是否为空标签（先创建img标签）
img_tag = soup.new_tag('img', src='test.jpg')
print("img是否是空标签: ", img_tag.is_empty_element)

7. Practical Examples

7.1 Extracting a product list

from bs4 import BeautifulSoup

# 模拟商品列表HTML
goods_html = """
<div class="goods-list">
    <div class="goods-item">
        <h3 class="name">Python编程入门</h3>
        <span class="price">¥59.9</span>
    </div>
    <div class="goods-item">
        <h3 class="name">数据结构与算法</h3>
        <span class="price">¥79.0</span>
    </div>
</div>
"""
# 解析并提取数据
soup = BeautifulSoup(goods_html, "lxml")
goods_item = soup.find_all('div', class_='goods-item')
print("商品列表：")
for item in goods_item:
    name = item.find('h3', class_='name').get_text(strip=True)
    price = item.find('span', class_='price').get_text(strip=True)
    print(f"商品：{name}，价格：{price}")

7.2 Extracting table data

html = """
<table>
  <tr>
    <th>姓名</th>
    <th>年龄</th>
    <th>城市</th>
  </tr>
  <tr>
    <td>张三</td>
    <td>25</td>
    <td>北京</td>
  </tr>
  <tr>
    <td>李四</td>
    <td>30</td>
    <td>上海</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, 'lxml')
table = soup.find('table')

# 提取表头
headers = [th.get_text() for th in table.find_all('th')]

# 提取数据行
data = []
for row in table.find_all('tr')[1:]:  # 跳过表头
    row_data = [td.get_text() for td in row.find_all('td')]
    data.append(row_data)

print(headers)
print(data)

7.3 Handling nested structures

html = """
<div class="article">
  <h1>文章标题</h1>
  <div class="content">
    <p>第一段内容</p>
    <p>第二段内容包含<a href="/link">链接</a></p>
  </div>
  <div class="comments">
    <div class="comment">
      <span class="author">用户1</span>
      <p>评论内容1</p>
    </div>
    <div class="comment">
      <span class="author">用户2</span>
      <p>评论内容2</p>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'lxml')

# 提取文章标题
title = soup.select_one('.article h1').get_text()

# 提取所有段落内容
paragraphs = [p.get_text() for p in soup.select('.content p')]

# 提取所有评论
comments = []
for comment in soup.select('.comment'):
    author = comment.select_one('.author').get_text()
    content = comment.select_one('p').get_text()
    comments.append({'author': author, 'content': content})

print(f"标题: {title}")
print(f"段落: {paragraphs}")
print(f"评论: {comments}")

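7.4 Handling encodings

from bs4 import BeautifulSoup

# from_encoding only applies to bytes input; for a str it is ignored (bs4 emits a warning)
html_bytes = "<p>编码示例</p>".encode("utf-8")
soup = BeautifulSoup(html_bytes, 'lxml', from_encoding='utf-8')

# With encoding set, prettify() returns bytes; decode them before printing
print(soup.prettify(encoding='utf-8').decode('utf-8'))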

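7.5 Handling dynamically loaded content

# BeautifulSoup only parses markup; it does not execute JavaScript.
# Render the page first with Selenium, requests-html, or a similar tool.
from selenium import webdriver
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder: replace with the page to render

driver = webdriver.Chrome()
try:
    driver.get(url)
    html = driver.page_source
finally:
    driver.quit()  # always release the browser process
soup = BeautifulSoup(html, 'lxml')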

8. Crawler Case Study: Douban

8.1 Synchronous crawling with the requests library

import os
import re
import time
import requests
import random
from bs4 import BeautifulSoup

"""
同步爬取豆瓣250部电影，并生成csv文件
url="https://movie.douban.com/top250?start="
"""
CSV_TITLE = ["排名, 电影名称, 英文名称, 其他名称, 评分, 评价人数, 导演, 演员, 年份, 地区, 类型, 经典台词\n"]

def douban_250_html(film_url: str, start: int) -> str:
    """
    获取250部电影的html网页
    :param: film_url
    :return: html
    """
    time.sleep(random.uniform(0.1, 0.3))
    timeout = random.uniform(1, 5)
    headers = {
        # Adjacent string literals are concatenated, so no stray indentation ends up in the header
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36 Edg/139.0.0.0"
        )
    }
    try:
        response = requests.get(f"{film_url}?start={start}", headers=headers, timeout=timeout)
        if response.status_code != 200:
            raise RuntimeError(f"Requests Failed  with status code {response.status_code}")
        return response.text
    except requests.exceptions.Timeout as e:
        raise requests.exceptions.Timeout(f"Request Timeout, please check！out time is {timeout}") from e
    except requests.exceptions.RequestException as e:
        raise requests.exceptions.RequestException(f"Request Failed with {str(e)}") from e


# 2、解析html信息
movies_list = []
def parse_html(content):
    """
    解析html信息，并且将电影信息放入列表里
    :param content:
    :return: movies_list
    """

    # 创建soup对象
    soup = BeautifulSoup(content, 'html.parser')
    # 获取ol标签
    ol_tag = soup.find('ol', class_='grid_view')
    # 获取所有的li标签
    li_tag_list = ol_tag.find_all('li')

    for li_tag in li_tag_list:
        movies_info = {} # 存放每部电影信息的字典
        # 获取排名
        em_tag = li_tag.find('em')
        movies_info["排名"] = em_tag.get_text(strip=True) if em_tag else ""
        # 提取电影名称
        chinese_span_tag = li_tag.find('span', class_='title')
        movies_info["电影名称"] = chinese_span_tag.get_text(strip=True) if chinese_span_tag else ""
        # English title: the next sibling <span class="title"> after the Chinese title, if there is one
        english_span_tag = chinese_span_tag.find_next_sibling('span', class_='title') if chinese_span_tag else None
        movies_info["英文名称"] = english_span_tag.get_text(strip=True) if english_span_tag else "《无英文名称》"
        # 获取电影别名
        other_span_tag = li_tag.find('span', class_='other')
        movies_info["其他名称"] = other_span_tag.get_text(strip=True) if other_span_tag else ""
        # 获取评分
        rate_span_tag = li_tag.find('span', class_='rating_num')
        movies_info["评分"] = rate_span_tag.get_text(strip=True) if rate_span_tag else ""
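        # Number of ratings; fall back to "0" when the pattern is not found (findall(...)[0] would raise IndexError)
        evaluate_pattern = r"<span>(\d+)人评价</span>"
        evaluate_match = re.search(evaluate_pattern, str(li_tag))
        movies_info["评价人数"] = evaluate_match.group(1) if evaluate_match else "0"
        # The dict is appended here and filled in further below; the list holds a reference to it
        movies_list.append(movies_info)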
        # find找到第一个匹配的p标签
        p_tag = li_tag.find('p')
        p_content = p_tag.get_text(strip=True).replace("/", "")
        # print(p_content)
        # 获取导演
        movie_directors = re.search(r'导演: (.*?)\xa0', p_content)
        movies_info["导演"] = movie_directors.group(1) if movie_directors else ""
        # 获取演员
        movie_actors = re.search(r'主演: (.*?)(.*?)(...\d{4})', p_content)
        movies_info['主演'] = movie_actors.group(2) if movie_actors else ""
        # 获取年份
        movie_year = re.search(r'(\d{4})', p_content)
        movies_info["年份"] = movie_year.group(1) if movie_year else ""
        # 获取地区
        movie_country = re.search(r'\d{4}\s\s(.*?)\xa0', p_content)
        movies_info["地区"] = movie_country.group(1) if movie_country else ""
        # 获取类型
        movies_type = re.search(r'\d{4}\s\s(.*?)\s\s(.*)', p_content)
        movies_info["类型"] = movies_type.group(2) if movies_type else ""
        # 获取经典台词
        quota_p_tag = li_tag.find('p', class_='quote')
        if quota_p_tag:
            quote_span_tag = quota_p_tag.find('span')
            movies_info["经典台词"] = quote_span_tag.get_text(strip=True) if quote_span_tag else ""
        else:
            movies_info["经典台词"] = "暂无经典台词"


# 3、分析数据
def analysis_data(movie_info_list):
    print("Start Analysing Movies Data")
    every_movie_list = []
    if not movie_info_list:
        raise RuntimeError(f"No Movies Data, please check!")
    for index, every_movie_info in enumerate(movie_info_list):
        print("正在处理第{index}部电影，电影名称是: “{movie_info}”.".format(index=index+1, movie_info=every_movie_info['电影名称']))
        # print(every_movie_info)
        # 1. 新增：每部电影单独用一个临时列表存字段值
        temp_movie_list = []
        for key, value in every_movie_info.items():
            if isinstance(value ,str):
                if "英文名称" in key:
                    value = value.replace("\xa0", "").replace('/','')
                if "其他名称" in key:
                    value = value.replace("\xa0", "").replace('/','')
                    value = re.sub(r'\s+', '', value)
                value = value.strip()
            else:
                value = str(value)

            # 处理值中包含逗号的情况（CSV中逗号会分隔字段，需用双引号包裹）
            if "," in value:
                value = f'"{value}"'

            temp_movie_list.append(value)

        # 2. 新增：单部电影的字段值拼接成一行，末尾加换行符
        movie_line = ",".join(temp_movie_list) + "\n"
        every_movie_list.append(movie_line)    # 把带换行的行加入总列表

    CSV_TITLE.extend(every_movie_list)


# 4、写入文件
def write_file(file_path):
    with open(file_path, 'w', encoding='utf-8') as f:
        f.writelines(CSV_TITLE)


def main(film_url, file_path):
    for num in range(0, 250, 25):
        movies_html = douban_250_html(film_url, num)
        parse_html(movies_html)

    # Clean the data and write it to a CSV file
    analysis_data(movies_list)
    write_file(file_path)


if __name__ == '__main__':
    print("Start Downloading Douban Movie... ")
    url = "https://movie.douban.com/top250"
    current_path = os.path.dirname(os.path.abspath(__file__))
    csv_path = os.path.join(current_path, "movies.csv")
    start_time = time.time()
    main(url, csv_path)
    end_time = time.time()
    print(f"End Downloading Douban Movie, 耗时: {end_time - start_time:.2f} seconds.")

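# Alternative version of the parsing step (a fragment taken from inside a function; `html` holds one page of markup)
# 2. Parse the HTML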
    soup = BeautifulSoup(html, "html.parser")
    # 定位所有电影项（每个li对应一部电影）
    movie_items = soup.find_all("li", class_="")  # 每个电影的li标签
    
    # 3. 提取每部电影的信息
    movie_list = []
    for item in movie_items:
        # 初始化单部电影的信息字典
        movie_info = {}
        
        # 提取排名（<em>标签里的数字）
        rank = item.find("em").text.strip()
        movie_info["排名"] = int(rank)
        
        # 提取标题（主标题+副标题）
        title_tag = item.find("span", class_="title")
        main_title = title_tag.text.strip() if title_tag else ""
        # 提取外文标题（第二个title标签）
        other_title_tag = item.find_all("span", class_="title")[1] if len(item.find_all("span", class_="title"))>1 else None
        other_title = other_title_tag.text.strip().replace("/", "").strip() if other_title_tag else ""
        movie_info["主标题"] = main_title
        movie_info["外文标题"] = other_title
        
        # 提取评分（rating_num类的span）
        rating = item.find("span", class_="rating_num").text.strip()
        movie_info["评分"] = float(rating)
        
        # Number of ratings (regex match on "XXX人评价"); string= replaces the deprecated text= argument
        rating_people_tag = item.find("span", string=re.compile(r"\d+人评价"))
        rating_people = re.findall(r"(\d+)人评价", rating_people_tag.text)[0] if rating_people_tag else "0"
        movie_info["评价人数"] = int(rating_people)
        
        # 提取导演、主演、年份、地区、类型（p标签里的文本）
        info_p = item.find("p", class_="").text.strip()
        # 拆分导演主演行和年份地区类型行
        info_lines = [line.strip() for line in info_p.split("\n") if line.strip()]
        if len(info_lines) >= 1:
            # 处理导演主演
            director_actor_line = info_lines[0]
            director_match = re.findall(r"导演: (.*?)\s+主演:", director_actor_line)
            director = director_match[0].strip() if director_match else director_actor_line.replace("导演: ", "").strip()
            actor_match = re.findall(r"主演: (.*)", director_actor_line)
            actor = actor_match[0].strip() if actor_match else ""
            movie_info["导演"] = director
            movie_info["主演"] = actor
            
            # 处理年份、地区、类型
            if len(info_lines) >= 2:
                year_area_type_line = info_lines[1]
                # 正则提取年份
                year_match = re.findall(r"(\d{4})", year_area_type_line)
                year = year_match[0] if year_match else ""
                # 提取地区和类型（按"/"拆分）
                rest_info = year_area_type_line.replace(year, "").strip().split("/")
                area = rest_info[0].strip() if len(rest_info)>0 else ""
                movie_type = rest_info[1].strip() if len(rest_info)>1 else ""
                movie_info["年份"] = year
                movie_info["地区"] = area
                movie_info["类型"] = movie_type
        
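        # Famous quote (span with class "inq")
        quote_tag = item.find("span", class_="inq")
        quote = quote_tag.text.strip() if quote_tag else ""
        movie_info["经典台词"] = quote

        # Store the finished record; without this the list stays empty
        movie_list.append(movie_info)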
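
The synchronous script above builds CSV lines by hand and wraps values containing commas in quotes. As an alternative sketch, not the author's method, the standard-library `csv` module can do the quoting, including escaping of embedded double quotes. The `FIELDS` list, the `movies_to_csv` helper, and the sample record are hypothetical and only cover a subset of the columns.

```python
import csv
import io

FIELDS = ["排名", "电影名称", "评分", "经典台词"]


def movies_to_csv(movies):
    """Serialize a list of dicts to CSV text; csv handles commas, quotes and newlines."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(movies)
    return buffer.getvalue()


sample = [  # hypothetical record, not real scraped data
    {"排名": "1", "电影名称": "Example, The Movie", "评分": "9.0", "经典台词": 'He said "hi"'},
]
csv_text = movies_to_csv(sample)
print(csv_text)
```

To write straight to disk, the same writer can be attached to a file opened with `open(path, "w", newline="", encoding="utf-8-sig")`; the `utf-8-sig` encoding helps spreadsheet programs detect UTF-8.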

8.2 Asynchronous concurrent crawling with aiohttp and asyncio

import asyncio
import aiohttp
import os
import re
import time
import random
import aiofiles
from bs4 import BeautifulSoup

CSV_TITLE = ["排名, 电影名称, 英文名称, 其他名称, 评分, 评价人数, 导演, 演员, 年份, 地区, 类型, 经典台词\n"]

async def async_douban_250_html(session, film_url, start):
    await asyncio.sleep(random.uniform(0.1, 0.3))
    timeout = random.uniform(1, 5)
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36 Edg/139.0.0.0"
        )
    }
    try:
        # aiohttp expects a ClientTimeout object; total= caps the whole request
        client_timeout = aiohttp.ClientTimeout(total=timeout)
        async with session.get(f"{film_url}?start={start}", headers=headers, timeout=client_timeout) as response:
            # aiohttp exposes the status code as response.status (there is no status_code)
            if response.status != 200:
                raise RuntimeError(f"Requests Failed  with status code {response.status}")
            return await response.text()
    # aiohttp raises asyncio.TimeoutError / aiohttp.ClientError, not requests exceptions
    except asyncio.TimeoutError as e:
        raise RuntimeError(f"Request Timeout, please check！out time is {timeout}") from e
    except aiohttp.ClientError as e:
        raise RuntimeError(f"Request Failed with {str(e)}") from e


# 2、解析html信息
movies_list = []
async def parse_html(content):
    """
    异步解析html信息，并且将电影信息放入列表里
    :param content:
    :return:
    """

    # 创建soup对象
    soup = BeautifulSoup(content, 'html.parser')
    # 获取ol标签
    ol_tag = soup.find('ol', class_='grid_view')
    # 获取所有的li标签
    li_tag_list = ol_tag.find_all('li')

    for li_tag in li_tag_list:
        movies_info = {} # 存放每部电影信息的字典
        # 获取排名
        em_tag = li_tag.find('em')
        movies_info["排名"] = em_tag.get_text(strip=True) if em_tag else ""
        # 提取电影名称
        chinese_span_tag = li_tag.find('span', class_='title')
        movies_info["电影名称"] = chinese_span_tag.get_text(strip=True) if chinese_span_tag else ""
        # English title: the next sibling <span class="title"> after the Chinese title, if there is one
        english_span_tag = chinese_span_tag.find_next_sibling('span', class_='title') if chinese_span_tag else None
        movies_info["英文名称"] = english_span_tag.get_text(strip=True) if english_span_tag else "《无英文名称》"
        # 获取电影别名
        other_span_tag = li_tag.find('span', class_='other')
        movies_info["其他名称"] = other_span_tag.get_text(strip=True) if other_span_tag else ""
        # 获取评分
        rate_span_tag = li_tag.find('span', class_='rating_num')
        movies_info["评分"] = rate_span_tag.get_text(strip=True) if rate_span_tag else ""
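        # Number of ratings; fall back to "0" when the pattern is not found (findall(...)[0] would raise IndexError)
        evaluate_pattern = r"<span>(\d+)人评价</span>"
        evaluate_match = re.search(evaluate_pattern, str(li_tag))
        movies_info["评价人数"] = evaluate_match.group(1) if evaluate_match else "0"
        # The dict is appended here and filled in further below; the list holds a reference to it
        movies_list.append(movies_info)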
        # find找到第一个匹配的p标签
        p_tag = li_tag.find('p')
        p_content = p_tag.get_text(strip=True).replace("/", "")
        # print(p_content)
        # 获取导演
        movie_directors = re.search(r'导演: (.*?)\xa0', p_content)
        movies_info["导演"] = movie_directors.group(1) if movie_directors else ""
        # 获取演员
        movie_actors = re.search(r'主演: (.*?)(.*?)(...\d{4})', p_content)
        movies_info['主演'] = movie_actors.group(2) if movie_actors else ""
        # 获取年份
        movie_year = re.search(r'(\d{4})', p_content)
        movies_info["年份"] = movie_year.group(1) if movie_year else ""
        # 获取地区
        movie_country = re.search(r'\d{4}\s\s(.*?)\xa0', p_content)
        movies_info["地区"] = movie_country.group(1) if movie_country else ""
        # 获取类型
        movies_type = re.search(r'\d{4}\s\s(.*?)\s\s(.*)', p_content)
        movies_info["类型"] = movies_type.group(2) if movies_type else ""
        # 获取经典台词
        quota_p_tag = li_tag.find('p', class_='quote')
        if quota_p_tag:
            quote_span_tag = quota_p_tag.find('span')
            movies_info["经典台词"] = quote_span_tag.get_text(strip=True) if quote_span_tag else ""
        else:
            movies_info["经典台词"] = "暂无经典台词"

# 3、分析数据
async def analysis_data(movie_info_list):
    print("Start Async Analysing Movies Data")
    every_movie_list = []
    if not movie_info_list:
        raise RuntimeError(f"No Movies Data, please check!")
    for index, every_movie_info in enumerate(movie_info_list):
        print("正在处理第{index}部电影，电影名称是: “{movie_info}”.".format(index=index+1, movie_info=every_movie_info['电影名称']))
        # print(every_movie_info)
        # 1. 新增：每部电影单独用一个临时列表存字段值
        temp_movie_list = []
        for key, value in every_movie_info.items():
            if isinstance(value ,str):
                if "英文名称" in key:
                    value = value.replace("\xa0", "").replace('/','')
                if "其他名称" in key:
                    value = value.replace("\xa0", "").replace('/','')
                    value = re.sub(r'\s+', '', value)
                value = value.strip()
            else:
                value = str(value)

            # 处理值中包含逗号的情况（CSV中逗号会分隔字段，需用双引号包裹）
            if "," in value:
                value = f'"{value}"'

            temp_movie_list.append(value)

        # 2. 新增：单部电影的字段值拼接成一行，末尾加换行符
        movie_line = ",".join(temp_movie_list) + "\n"
        every_movie_list.append(movie_line)    # 把带换行的行加入总列表

    CSV_TITLE.extend(every_movie_list)


async def write_file(csv_path):
    """异步写入文件"""
    async with aiofiles.open(csv_path, 'w', encoding='utf-8-sig') as file:
        await file.writelines(CSV_TITLE)


async def main(film_url, csv_path):
    # 1. 创建会话（复用连接池，提升性能）
    async with aiohttp.ClientSession() as session:
        # 2. 创建所有请求任务（并发执行）
        tasks = []
        for start in range(0, 250, 25):
            task = asyncio.create_task(async_douban_250_html(session, film_url, start))
            tasks.append(task)
        # 3. 并发执行所有请求，等待全部完成
        print("开始并发请求所有页面...")
        html_list = await asyncio.gather(*tasks)   # 核心：并发执行
        for html in html_list:
            await parse_html(html)

        # Clean the data and write it to a CSV file
        await analysis_data(movies_list)
        await write_file(csv_path)


if __name__ == '__main__':
    print("Start Async Downloading Douban Movie... ")
    current_path = os.path.dirname(os.path.abspath(__file__))
    file_path = os.path.join(current_path, "movie_async.csv")
    url = "https://movie.douban.com/top250"
    start_time = time.time()
    asyncio.run(main(url, file_path))
    end_time = time.time()
    print(f"异步抓取电影信息总耗时： {end_time - start_time:.2f}秒")
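
The async version launches all ten page requests at once. If the target site needs gentler treatment, one possible refinement is to cap the number of requests in flight with `asyncio.Semaphore`. The sketch below is an assumption-laden illustration, not part of the original script: `fetch_page` only sleeps in place of a real `session.get(...)`, and `MAX_CONCURRENCY` is an arbitrary value.

```python
import asyncio

MAX_CONCURRENCY = 3  # assumed limit; tune to what the target site tolerates


async def fetch_page(start, semaphore, running):
    """Stand-in for a real HTTP request: sleeps briefly instead of hitting the network."""
    async with semaphore:                  # at most MAX_CONCURRENCY bodies run at once
        running["now"] += 1
        running["peak"] = max(running["peak"], running["now"])
        await asyncio.sleep(0.01)          # placeholder for session.get(...)
        running["now"] -= 1
        return f"page starting at {start}"


async def crawl_all():
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    running = {"now": 0, "peak": 0}
    tasks = [fetch_page(start, semaphore, running) for start in range(0, 250, 25)]
    pages = await asyncio.gather(*tasks)   # results come back in task order
    return pages, running["peak"]


pages, peak = asyncio.run(crawl_all())
print(len(pages), "pages fetched, peak concurrency:", peak)
```

Since `asyncio.gather` preserves task order, the pages can still be parsed in ranking order even though they finish at different times.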