In the previous article, we covered the basic usage of Python's built-in HTTP library, urllib, which is fairly cumbersome to work with. In this article, we look at how the third-party HTTP library Requests simplifies these operations.
Requests
Requests is an HTTP library built on top of urllib and released under the Apache2 Licensed open-source license. It is far more convenient than urllib and saves us a great deal of work.
Installation

Requests can be installed with pip: pip install requests
Making Requests
Basic GET Request
import requests

response = requests.get('http://httpbin.org/get')
print(response.status_code)
print(response.text)
print(type(response.text))
GET Request with Parameters
import requests

response = requests.get('http://httpbin.org/get?foo=bar')
print(response.text)

params = {'foo': 'bar'}
response = requests.get('http://httpbin.org/get', params=params)
print(response.text)
JSON Parsing
import requests

params = {'foo': 'bar'}
response = requests.get('http://httpbin.org/get', params=params)
print(response.json())
print(type(response.json()))
Binary Data
import requests

response = requests.get('http://github.com/favicon.ico')
print(response.content)
print(type(response.content))

with open('favicon.ico', 'wb') as f:
    f.write(response.content)
Adding Headers

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
    'Cookie': 'thw=cn; v=0; t=f49d7a120cef747be295966efd96e846; cookie2=53b3047ac2325fbeb90d604ab16d4e58;...'
}
response = requests.get('http://zhihu.com/explore', headers=headers)
print(response.text)
POST Request with Parameters
import requests

# Form-encoded body
response = requests.post('http://httpbin.org/post', data={'foo': 'bar'})
print(response.text)

# Raw string body
response = requests.post('http://httpbin.org/post', data='foo')
print(response.text)

# JSON-encoded body
response = requests.post('http://httpbin.org/post', json={'foo': 'bar'})
print(response.text)
Responses
import requests

response = requests.get('http://www.baidu.com')
print(type(response))
print(response.status_code)
print(response.text)
print(response.content)
print(response.cookies)
print(response.headers)
print(response.encoding)
print(response.apparent_encoding)
Status Codes
import requests

response = requests.get('http://www.baidu.com')
if response.status_code == requests.codes.ok:
    print('Successful')

try:
    response = requests.get('http://httpbin.org/post', timeout=10)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
    response.encoding = response.apparent_encoding
    print(response.text)
except requests.RequestException:
    print('An exception occurred')
Advanced Usage
File Upload
import requests

url = 'http://httpbin.org/post'
files = {'icon_file': open('github.ico', 'rb')}
response = requests.post(url, files=files)
print(response.text)
Session Persistence
import requests

# Two independent requests do not share cookies, so the cookie set here is lost:
requests.get('http://httpbin.org/cookies/set/foo/bar')
response = requests.get('http://httpbin.org/cookies')
print(response.text)

# A Session keeps cookies across requests:
session = requests.Session()
session.get('http://httpbin.org/cookies/set/foo/bar')
response = session.get('http://httpbin.org/cookies')
print(response.text)
Certificate Verification
import requests

# verify=False skips SSL certificate verification
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)
Proxy Settings
import requests

proxies = {
    'http': 'http://112.85.173.34:9999'
}
response = requests.get('http://httpbin.org/get', proxies=proxies)
print(response.text)
Timeout Settings
import requests
from requests.exceptions import ReadTimeout

try:
    response = requests.get('http://httpbin.org/get', timeout=0.1)
    print(response.status_code)
except ReadTimeout:
    print('TIME OUT')
Authentication
import requests
from requests.auth import HTTPBasicAuth

response = requests.get('http://auth_demo.com', auth=HTTPBasicAuth('user', '123456'))
Exception Handling
The requests library defines six main exception types:
- requests.ConnectionError: network connection error, e.g. DNS lookup failure or connection refused
- requests.HTTPError: HTTP error response
- requests.URLRequired: a valid URL is missing
- requests.TooManyRedirects: the maximum number of redirects was exceeded
- requests.ConnectTimeout: timed out while connecting to the remote server
- requests.Timeout: the request to the URL timed out
import requests
from requests.exceptions import ReadTimeout, HTTPError, RequestException

try:
    response = requests.get('http://httpbin.org/get', timeout=0.1)
    print(response.status_code)
except ReadTimeout:
    print('TIME OUT')
except HTTPError:
    print('HTTP Error')
except RequestException:
    print('Error')
The Robots Protocol
The Robots protocol tells every crawler what a site's crawling policy is and asks crawlers to comply with it.
The rules are published in a robots.txt file in the site's root directory, e.g. www.zhihu.com/robots.txt:
User-agent: Googlebot
Disallow: /login
Disallow: /logout
Disallow: /resetpassword
Disallow: /terms
Disallow: /search
Disallow: /notifications
Disallow: /settings
Disallow: /inbox
Disallow: /admin_inbox
Disallow: /*?guide*
...
User-Agent: *
Disallow: /
Note: ignoring the Robots protocol may carry legal risk.
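A crawler can check these rules programmatically before requesting a page. As a minimal sketch using the standard-library urllib.robotparser (the rules and the 'MyCrawler' user agent below are just illustrative, not taken from any real site):

```python
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules directly from a list of lines (no network needed);
# read_url()/read() could fetch a real robots.txt instead.
rp = RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /login',
    'Disallow: /settings',
])

# can_fetch() checks whether a given user agent may crawl a URL.
print(rp.can_fetch('MyCrawler', 'https://example.com/login'))    # False
print(rp.can_fetch('MyCrawler', 'https://example.com/explore'))  # True
```

Calling can_fetch() before each requests.get() lets a well-behaved crawler skip disallowed paths automatically.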