使用 Python3 urllib 模拟 HTTP 请求

2019-05-19 技术博客 Python 0 评论字数统计: 2k(字) 阅读时长: 9(分)

使用 Python3 Built-in 的 urllib 可以发送 HTTP 请求，虽然不如 requests 方便，但是用来做一些 parse 工作还是相当不错的。

urllib

urllib有四个模块：

request，处理请求；
error，处理错误；
parse，对请求内容解码和编码；
robotparser，检查robot.txt。

发送请求

request用于发送请求。

`urlopen()`

示例：

import urllib.request

res = urllib.request.urlopen("https://www.python.org/")
print(res.read().decode('utf8')) # get html
print(type(res)) # <class 'http.client.HTTPResponse'>
print(res.getheader("Server")) # 'nginx'
print(res.getheaders()) # Response Headers
print(res.status) # 200

这个方法也支持表单提交。

import urllib.request
import urllib.parse

data = bytes(urllib.parse.urlencode({"word": "hi"}), encoding='utf8')
res = urllib.request.urlopen("https://httpbin.org/post", data=data)
print(res.read().decode('utf8'))

`Request`

Request类可以生成完整的请求实例，添加 Headers，指定 Method。

url = "http://httpbin.org/post"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6)",
    "Host": 'httpbin.org'
}
dict = {'name': 'Amber'}
data = bytes(urllib.parse.urlencode(dict), encoding='utf8')
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')
res = urllib.request.urlopen(req)
print(res.read().decode('utf-8'))

高级用法

生成一个请求，不仅仅会需要 URI、参数、方法和请求头，还需要处理 Cookie、代理等等。生成这样的高级请求就要用到Handler。urllib.request.BaseHandler提供了最基本的方法，可以集成它来自己创建Handler，有一些内置的：

HTTPDefaultErrorHandler
HTTPRedirectHandler
HTTPCookieProcessor
ProxyHandle
HTTPPasswordMgr
HTTPBasicAuthHandler

还有一些其他的，可以查看文档。

另外一个重要的类是OpenerDirector，简称Opener。Opener可以使用open()方法，返回response。我们需要利用Handler来构建Opener。

下面是一个利用HTTPCookieProcessor读写 Cookie 的例子：

def build_cookie(cookie_file, login_req=None, override=False):
    """
    将 Cookies 写入 cookie_file，或从其中读出 Cookies
    :param cookie_file: Cookie 文件路径
    :param login_req: 建立 Cookie 的请求
    :param override: 是否覆盖原先的 Cookie 文件
    :return: cookie_opener
    """
    import os
    from http.cookiejar import LWPCookieJar
    from urllib.request import HTTPCookieProcessor, build_opener
    if (not os.path.exists(cookie_file)) or override:
        if login_req is None:
            raise Exception("请给出登陆请求！")
        cookie = LWPCookieJar(cookie_file)
        cookie_opener = build_opener(HTTPCookieProcessor(cookie))
        cookie_opener.open(login_req)
        cookie.save(ignore_expires=True, ignore_discard=True)
    else:  # 直接读取
        cookie = LWPCookieJar()
        cookie.load(cookie_file, ignore_discard=True, ignore_expires=True)
        cookie_opener = build_opener(HTTPCookieProcessor(cookie))
    return cookie_opener

代理

下面是一个利用ProxyHandler实现代理访问的方法：

def build_proxy(proxy):
    """
    根据 proxy 参数获取代理访问
    :param proxy: dict，key 为协议，value 为代理地址
    :return: proxy_opener
    """
    from urllib.request import ProxyHandler, build_opener
    return build_opener(ProxyHandler(proxy))

处理异常

urllib提供了URLError和HTTPError，所有request返回的异常都在URLError中。

from urllib import request, error

try:
    response = request.urlopen('https://xxx.com/')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

解析链接

URL 中有一些很重要的信息，包括协议、域名、资源地址、参数、查询以及分段。使用parse模块可以对 URL 进行解析。

urlparse可以将 URL 分为协议、域名、资源地址、参数、查询以及分段六块，urlunparse可以将一个六元素列表组织成一个 URL。
urlsplit无视参数项，分解为协议、域名、资源地址、查询以及分段五块，urlunsplit将一个五元素列表组织成一个 URL。
urlencode将一个字典编码为一个 GET 查询字符串。parse_qs将一个 GET 查询字符串解码为字典。
quote将字符串进行 URL 序列化，unquote反之。

分析`Robots`协议

robotparse模块解析robots.txt，判断爬虫是否有权限对某个页面进行爬取。

def can_fetch(ua, url, robots_url):
    """
    查看 UA 是否可以爬取该 URL
    :param ua: 爬虫 User-Agent
    :param url: 需要爬取的页面地址
    :param robots_url: 主站的 robots.txt
    :return: Bool
    """
    from urllib import parse, robotparser
    robots = robotparser.RobotFileParser(robots_url)
    robots.read()
    return robots.can_fetch(ua, url)

requests

基本用法

requests可以直接发起对应的 HTTP Method 请求：

import requests
url = "https://www.baidu.com/"
requests.get(url)
requests.post(url)
requests.put(url)
requests.delete(url)
requests.options(url)

GET

对于 GET 参数，可以直接将字典附加到requests.get()的params参数。对于 API 返回的 JSON，可以直接调用response.json()。可以直接附加 Headers。response的text属性是 Unicode 解码，content属性是二进制。封装的response中还可以获取 Cookies，History，URL，Status_Code

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
res = requests.get('https://www.zhihu.com/', headers=headers)

POST

import requests

data = {'name': 'germey', 'age': '22'}
r = requests.post("http://httpbin.org/post", data=data)
print(r.json())

高级用法

文件上传

import requests

files = {'file': open('favicon.ico', 'rb')}
r = requests.post("http://httpbin.org/post", files=files)
print(r.text)

import requests

res = requests.get("https://www.baidu.com/")
print(res.cookies)
for k, v in r.cookies.items():
    print(k, '=', v)

在requests中使用 Cookie，可以像往常一样直接放在 Headers 中，不过稍微有些难看，下面是完整的使用 Cookie 的方式：

import requests
from http.cookiejar import LWPCookieJar

jar = requests.cookies.RequestsCookieJar()
cookie = LWPCookieJar()
cookie.load("cookie.txt")
for i in cookie:
    jar.set(i.name, i.value)

Session

上面的维持 Cookie 发起访问的方式相当麻烦，所以requests提供了更好的方法：Session。对于浏览器来说，会话（Session）是无比平常的东西。如果留心的话，在解析网络时经常能发现 Request Headers 总会有keep-alive。事实上，会话是依赖于 Cookie 的，由于 HTTP 是无状态协议，所以使用 Cookie 来标记长连接，也就是会话。

import requests

s = requests.Session()
s.get("https://www.baidu.com")

会话可以看作是urllib的一个 Opener，在一次会话中，会保留 Cookie、代理，因而非常方便。

SSL 证书

requests提供了 SSL 证书验证的功能，如果遇到无法验证的 SSL 证书，可以添加verify=False跳过验证。

代理

import requests

proxies = {
    'http': 'http://127.0.0.1:1087',
    'https': 'https://127.0.0.1:1087'
}

proxies_socks = {
    'http': 'socks5://127.0.0.1:1086',
    'https': 'socks5://127.0.0.1:1086'
}

requests.get("https://www.google.com", proxies=proxies)

Prepared Request

有可能我们需要一个结构化的请求来多次复用，这时候我们可以使用Request类来生成一个底层的请求对象。一个请求对象可以自己与处理，但是当使用Session时，就需要使用Session去预处理。

from requests import Request, Session

url = 'http://httpbin.reg/post'
data = {
    'name': 'germey'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}
s = Session()
req = Request('POST', url, data=data, headers=headers)
# pre_req = req.prepare()
pre_req = s.prepare_request(req)
res = s.send(pre_req)
print(res.text)

正则

完整讲述正则表达式过于复杂，不如自己理解。常用的正则表达式工具列在下方：

这里我们来说一些 Python3 re 模块的一些使用。

常用函数

修饰符

re.S：使.匹配换行符在内的所有字符；
re.I：大小写不敏感；
re.M：多行匹配，影响^和$；
re.L：本地化识别匹配；
re.U：根据 Unicode 解析字符；
re.X：可以更灵活地写正则表达式。

一般常用re.S和re.I。

`match()`

从字符串起始处开始匹配，返回一个match对象，可以使用.group()获取匹配到的内容。

import re

url = re.match(".*?<title>(.*?)</title>", res.text)
print(url.group(1)) # Google

`search()`

和match()很像，区别是不是固定从开头开始查找。

import re

res = re.search('<title>(.*?)</title>', res.text)
print(url.group(1))

`findall()`

获取所有正则能匹配到的内容。

import re

res = re.findall('<a.*?href="(.*?)"', res.text)
print(res)

`sub()`

用正则表达式统一替换。

import re

content = '123abc456'
content = re.sub('\d+', '', content)
print(content) # abc

`compile()`

获取一个组织完成的正则表达式对象，便于复用。

1
2
3

import re

pattern = re.compile('ppp.*?ppp', re.S)

本文链接： https://sumsc-caa.github.io/2019/05/18/python3-urllib-http/

版权声明： 本博客所有文章除特别声明外，均采用 CC BY 4.0 CN协议许可协议。转载请注明出处！

苏州大学计算机爱好者协会（微软学生俱乐部）微软开源学习社群联络社团

计算机爱好者协会是苏州大学计算机科学与技术学院的学术性社团，为2015年与微软亚洲研究院联合成立的苏州大学微软学生俱乐部，现为微软开源学习社群的联络社团。欢迎加入俱乐部QQ群：497516494 | 关注微信公众号：SUMSTC

使用 Python3 urllib 模拟 HTTP 请求

urllib

发送请求

urlopen()

Request

高级用法

Cookie

代理

处理异常

解析链接

分析Robots协议

requests

基本用法

GET

POST

高级用法

文件上传

Cookie

Session

SSL 证书

代理

Prepared Request

正则

常用函数

修饰符

match()

search()

findall()

sub()

compile()

苏州大学计算机爱好者协会（微软学生俱乐部）微软开源学习社群联络社团

`urlopen()`

`Request`

分析`Robots`协议

`match()`

`search()`

`findall()`

`sub()`

`compile()`