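The urlparse and urljoin functions in urllib.parse

urlparse and urljoin are two commonly used URL-handling functions in the urllib.parse module of the Python standard library.

The urlparse function

urlparse splits a URL string into its components.

Basic usage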

from urllib.parse import urlparse

result = urlparse('https://www.cctv.com:8090/path/to/hello.html?q=python#book1')

print(result.scheme)     
print(result.netloc)     
print(result.path)    
print(result.query)     
print(result.fragment)

Output:

https
www.cctv.com:8090
/path/to/hello.html
q=python
book1

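urlparse returns a ParseResult object (a named tuple) with the following attributes, among others:

scheme: the protocol (e.g. 'http', 'https')
netloc: the network location (host and, if present, port)
path: the path
query: the query string (the part after ?, without the ? itself)
fragment: the fragment identifier (the part after #)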

Example

result = urlparse('https://www.example.com/path/to/page?query=python#fragment')

print(result.scheme)    # 'https'
print(result.netloc)    # 'www.example.com'
print(result.path)      # '/path/to/page'
print(result.query)     # 'query=python'
print(result.fragment)  # 'fragment'
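
Beyond the attributes listed above, the result object also exposes hostname and port, and the query string can be decoded with parse_qs from the same module. The following is a minimal sketch that reuses the sample URL from the basic usage section; the variable names are chosen here for illustration only.

```python
from urllib.parse import urlparse, parse_qs

url = 'https://www.cctv.com:8090/path/to/hello.html?q=python#book1'
result = urlparse(url)

# hostname and port split netloc into its two parts
print(result.hostname)  # www.cctv.com
print(result.port)      # 8090 (an int, or None when the URL has no port)

# The result is a named tuple, so it can be unpacked in field order
scheme, netloc, path, params, query, fragment = result

# parse_qs turns the query string into a dict mapping each key to a list of values
print(parse_qs(query))  # {'q': ['python']}

# geturl() reassembles the components into a URL string
print(result.geturl())  # https://www.cctv.com:8090/path/to/hello.html?q=python#book1
```

Reading result.hostname and result.port is usually more convenient than splitting netloc by hand.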

The urljoin function

urljoin combines a base URL with another URL (usually a relative one) to produce an absolute URL.

Basic usage

from urllib.parse import urljoin

base_url = 'https://www.example.com/path/to/'
relative_url = 'subpage.html'
absolute_url = urljoin(base_url, relative_url)
print(absolute_url)  # https://www.example.com/path/to/subpage.html

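Behavior

If the second argument is an absolute URL, the result is the second argument and the base is ignored
If the second argument is a relative URL, it is resolved against the first argument
. and .. segments in the path are resolved correctly

Examples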

from urllib.parse import urljoin
print(urljoin('https://cctv.com/a/b', 'c/d'))  
print(urljoin('https://cctv.com/a/b', '/c/d')) 
print(urljoin('https://cctv.com', 'https://zaobao.com')) 
print(urljoin('https://cctv.com/a/b', '../c'))

Output:

https://cctv.com/a/c/d
https://cctv.com/c/d
https://zaobao.com
https://cctv.com/c
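
One detail that is easy to miss in the examples above is the effect of a trailing slash on the base URL. The sketch below compares the two cases and also shows a scheme-relative reference (one starting with //). The URLs are made up for illustration.

```python
from urllib.parse import urljoin

# No trailing slash: the last segment 'b' is treated like a file name and replaced
no_slash = urljoin('https://cctv.com/a/b', 'c')
print(no_slash)    # https://cctv.com/a/c

# Trailing slash: 'b/' is treated like a directory and kept
with_slash = urljoin('https://cctv.com/a/b/', 'c')
print(with_slash)  # https://cctv.com/a/b/c

# A reference starting with // keeps only the scheme of the base URL
scheme_rel = urljoin('https://cctv.com/a/b', '//zaobao.com/x')
print(scheme_rel)  # https://zaobao.com/x
```

If the base URL is meant to be a directory, it helps to make sure it ends with / before joining.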

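Practical use cases

Web crawlers: parsing and joining URLs
Web development: handling user-supplied URLs
API development: building complete request URLs

Used together, the two functions cover most URL parsing and construction needs and are basic tools for working with URLs in Python.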
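
As a rough illustration of the crawler use case, the sketch below resolves the links found on a page with urljoin and then uses urlparse to keep only http(s) links on the same host. The function name resolve_internal_links and the sample hrefs are hypothetical, and a real crawler would need more than this (deduplication, fragment stripping, and so on).

```python
from urllib.parse import urljoin, urlparse

def resolve_internal_links(page_url, hrefs):
    """Hypothetical helper: resolve hrefs against page_url, keep same-host http(s) links."""
    base_host = urlparse(page_url).netloc
    links = []
    for href in hrefs:
        absolute = urljoin(page_url, href)   # turn relative links into absolute ones
        parts = urlparse(absolute)           # inspect the resolved URL
        if parts.scheme in ('http', 'https') and parts.netloc == base_host:
            links.append(absolute)
    return links

# Sample data, made up for this example
page_url = 'https://cctv.com/news/index.html'
hrefs = [
    'story1.html',
    '/sports/',
    'https://zaobao.com/world',
    'mailto:someone@example.com',
    '../about.html#team',
]

internal = resolve_internal_links(page_url, hrefs)
for link in internal:
    print(link)
# https://cctv.com/news/story1.html
# https://cctv.com/sports/
# https://cctv.com/about.html#team
```

Here urljoin does the construction and urlparse does the inspection, which is the usual division of labor between the two.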