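Web scraping day 02: the requests library for sending requests

淘小欣 · posted 2021-5-25 00:14

Last edited by 淘小欣 on 2021-5-29 22:16

02. The requests library for sending requests

I. Introduction to the requests module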

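1. Introduction:

requests lets you send HTTP requests the way a browser does.
It is not only for scrapers; calls between services often use it too.
The HTTP request headers, request body, and request URL can all be set through this module.
requests is built on top of urllib3. The standard library's urllib (urllib2 in Python 2) can do the same job but is more cumbersome to use.

Note: requests only downloads the page content; it does not execute JavaScript. If the data you need is loaded by JS, you have to analyze the target site yourself and send the additional requests that the page would make.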

2. Installation

pip3 install requests

3. Request methods

All the HTTP methods are available; the most commonly used are requests.get() and requests.post().

>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data = {'key':'value'})
>>> r = requests.put('http://httpbin.org/put', data = {'key':'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')

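II. Sending GET requests with requests

1. Basic GET request

import requests
# response is the response object; it wraps the HTTP response, including the body
response = requests.get('https://weread.qq.com/')
# response.text is the response body decoded to a string
print(response.text)

# Example: send a request to Baidu

res = requests.get('https://www.baidu.com/')
# print(res.text)
# Write the fetched data to a file
with open('baidu.html', 'wb') as f:
    f.write(res.content)  # response body as raw bytes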

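2. GET requests with parameters -> params

Put the data in the URL (the query string).
Send user-agent (the client type) and referer in the request headers.

2.1 Passing data in the URL, option 1: write it into the URL yourself (Chinese text is not URL-encoded for you, so encode it by hand or you will run into encoding problems)

import requests

header = {
    # Pretend to be a browser
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36',
    # Helps with anti-hotlinking checks; leave it out if the site does not need it
    'referer': ''
}
res = requests.get('https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3', headers=header)
print(res.text)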

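Note:

If a request returns no data, or the wrong data, the usual cause is that it does not look enough like a browser request. Send the same headers, and any other data, that the browser sends.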
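One concrete reason a request "does not look like a browser" is the User-Agent header: unless you override it, requests announces itself by name. The following is a minimal sketch, assuming the requests package is installed; it only prepares a request and sends nothing over the network. The browser string is the one used in this post.

```python
import requests

# Headers requests sends when you pass none of your own
defaults = requests.utils.default_headers()
print(defaults['User-Agent'])  # starts with "python-requests/"

# Build (but do not send) a request with a browser-like User-Agent
browser_ua = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36')
session = requests.Session()
prepared = session.prepare_request(
    requests.Request('GET', 'https://www.baidu.com/s',
                     headers={'user-agent': browser_ua})
)
print(prepared.headers['User-Agent'])  # the browser string replaces the default
```

Header names are matched case-insensitively, so 'user-agent' overrides the default 'User-Agent'.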

2.2 Passing GET parameters with params (Chinese is URL-encoded automatically)

import requests

header = {
    # Pretend to be a browser
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
}
res = requests.get('https://www.baidu.com/s', params={'wd': '爸爸打我'}, headers=header)
with open('baidu.html', 'wb') as f:
    f.write(res.content)  # response body as raw bytes

2.3 Encoding and decoding Chinese text with urllib

from urllib.parse import urlencode, unquote

# Percent-encode the Chinese text
res = urlencode({'wd': '我真帅'}, encoding='utf-8')
print(res)  # wd=%E6%88%91%E7%9C%9F%E5%B8%85

# Decode the percent-encoded form back to Chinese
res = unquote('%E6%88%91%E7%9C%9F%E5%B8%85', encoding='utf-8')
print(res)  # 我真帅
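To tie 2.1 through 2.3 together, here is a small standard-library sketch that builds a full search URL by hand and then parses it back. The second parameter 'pn' is a made-up example, not something this post uses.

```python
from urllib.parse import urlencode, urlsplit, parse_qs, quote

base = 'https://www.baidu.com/s'
# urlencode percent-encodes non-ASCII text as UTF-8 and joins pairs with '&'
query = urlencode({'wd': '我真帅', 'pn': 10})  # 'pn' is a hypothetical extra parameter
url = base + '?' + query
print(url)

# Decoding: split the URL, then parse the query string into a dict of lists
decoded = parse_qs(urlsplit(url).query)
print(decoded)

# quote() encodes a single value rather than a whole mapping
encoded_word = quote('我真帅')
print(encoded_word)
```

This is roughly what the params argument does for you, which is why passing params is the less error-prone choice.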

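2.4 Request headers for GET requests

We usually need to send request headers along with a request; they are the key to passing as a browser. The commonly useful ones are:

user-agent: identifies the client
referer: large sites often use it to check where a request came from
Cookie: carries cookies, whether issued before or after login

3. Sending cookies with a request

Cookies are used so often that requests gives them their own cookies parameter.

3.1 Option 1: put the cookie string in headers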

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
    'cookie': 'Hm_lvt_b72418f3b1d81bbcf8f99e6eb5d4e0c3=1596202661; UM_distinctid=173a517b3192a1-07d51af695c35e-3a65420e-1fa400-173a517b31a70b;'
}

# Note: do not drop the trailing slash and write 'http://127.0.0.1:8050/index'
response = requests.get('http://127.0.0.1:8050/index/', headers=headers)
print(response.text)
# What the server receives:
# Note: when the cookie is sent in headers, each pair in the string is split at the equals sign: the left side becomes the cookie key and the right side the value. A pair without an equals sign gets an empty-string key, with the text as its value.
# The cookies parameter accepts a dict or a CookieJar; a CookieJar object is what you get from the response after a successful login
"""
{
    'Hm_lvt_b72418f3b1d81bbcf8f99e6eb5d4e0c3': '1596202661',
    'UM_distinctid': '173a517b3192a1-07d51af695c35e-3a65420e-1fa400-173a517b31a70b'
}
"""

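The splitting rule described in the comments above can be tried without a server. This sketch uses the standard library's http.cookies parser as a stand-in; it is not the parser the server in this post uses, and the cookie names and values are made up.

```python
from http.cookies import SimpleCookie

def parse_cookie_header(header_value):
    """Split a Cookie header string into a {name: value} dict."""
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}

# Hypothetical cookie names and values, for illustration only
parsed = parse_cookie_header('sessionid=abc123; theme=dark')
print(parsed)
```

Each 'name=value' pair, separated by semicolons, becomes one entry in the dict.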
3.2 Option 2: pass them through the cookies parameter

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
}
# Each dict key is a cookie name and each dict value is that cookie's value
cookies = {
    'Hm_lvt_b72418f3b1d81bbcf8f99e6eb5d4e0c3': '1596202661',
    'UM_distinctid': '173a517b3192a1-07d51af695c35e-3a65420e-1fa400-173a517b31a70b',
}

# Note: do not drop the trailing slash and write 'http://127.0.0.1:8050/index'
response = requests.get('http://127.0.0.1:8050/index/', headers=headers, cookies=cookies)
print(response.text)

# What the server receives:
# Note: the dict keys become the cookie keys the server reads, and the dict values become the cookie values.
# Do not put a whole 'k1=v1; k2=v2' string under a single key; it would be sent as the value of one cookie.
'''
{
    'Hm_lvt_b72418f3b1d81bbcf8f99e6eb5d4e0c3': '1596202661',
    'UM_distinctid': '173a517b3192a1-07d51af695c35e-3a65420e-1fa400-173a517b31a70b'
}
'''
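To see what the cookies parameter puts on the wire, you can prepare a request without sending it and inspect the Cookie header that requests builds. A minimal sketch, assuming requests is installed; the cookie names and values are hypothetical.

```python
import requests

# Hypothetical cookie names and values
cookies = {'sessionid': 'abc123', 'theme': 'dark'}

# Prepare the request without sending it, then look at the generated header
prepared = requests.Request(
    'GET', 'http://127.0.0.1:8050/index/', cookies=cookies
).prepare()
cookie_header = prepared.headers['Cookie']
print(cookie_header)

# A dict can also be converted to a CookieJar and back
jar = requests.utils.cookiejar_from_dict(cookies)
round_trip = requests.utils.dict_from_cookiejar(jar)
print(round_trip)
```

Both options end up as a single Cookie header; the cookies parameter just builds it for you from name/value pairs.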

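4. Summary

# headers parameter:
    1. Pretend to be a browser: user-agent
    2. Get past anti-hotlinking checks: referer

# Solving URL-encoding problems
    1. The params parameter URL-encodes values automatically
    2. The urllib module

# Sending cookies
    Option 1: put them in headers
        Client sends: {'cookie': 'key=value;key1=value1'}
        Server receives: {key: value, key1: value1}

    Option 2: use the cookies parameter (tip: it accepts a dict or a CookieJar object)
        Client sends: {key: value, key1: value1}
        Server receives: {key: value, key1: value1}

response.text  text
response.content  raw bytes
response.iter_content()  iterator over the body in chunks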

III. POST requests

1. Sending data with a POST request

1.1 Sending data: urlencoded

import requests

response = requests.post('http://127.0.0.1:8050/index/', data={'name': 'shawn'})
print(response.text)

# What the server receives
'''
request.body: b'name=shawn'
request.POST: <QueryDict: {'name': ['shawn']}>
'''

1.2 Sending data: json

import requests

response = requests.post('http://127.0.0.1:8050/index/', json={'name': 'shawn'})
print(response.text)

# What the server receives
'''
request.body: b'{"name": "shawn"}'
request.POST: <QueryDict: {}>
'''
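The two request bodies shown above can be reproduced with the standard library alone, which makes it clearer why a form parser finds nothing in a JSON body. This is a sketch of the two encodings, not of what requests does internally.

```python
import json
from urllib.parse import urlencode, parse_qs

payload = {'name': 'shawn'}

form_body = urlencode(payload)   # the shape data= puts on the wire
json_body = json.dumps(payload)  # the shape json= puts on the wire
print(form_body)
print(json_body)

# A form parser understands the first body but finds no fields in the second
form_fields = parse_qs(form_body)
no_fields = parse_qs(json_body)
print(form_fields, no_fields)

# A JSON parser understands the second body
print(json.loads(json_body))
```

This matches the server-side output above: the form body fills request.POST, while the JSON body has to be read from request.body and decoded by hand.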

2. Sending cookies automatically

import requests

session = requests.session()      # Note: the function is session(), not sessions()
session.post('http://127.0.0.1:8050/login/', json={'username': 'yang', 'password': '123'})  # assume this request logs in
response = session.get('http://127.0.0.1:8050/index/')  # no need to attach the cookie by hand; the session handles it
print(response)
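What the session does with cookies can be observed offline. In this sketch a cookie is placed in the session by hand, standing in for the Set-Cookie a real login response would deliver, and a follow-up request is prepared but not sent. It assumes requests is installed; the cookie name and value are hypothetical.

```python
import requests

session = requests.Session()
# Stand-in for the cookie a login response would set (hypothetical value)
session.cookies.set('sessionid', 'abc123')

# Prepare (but do not send) a follow-up request through the same session
request = requests.Request('GET', 'http://127.0.0.1:8050/index/')
via_session = session.prepare_request(request)
print(via_session.headers.get('Cookie'))

# The same request prepared outside the session carries no cookie
standalone = request.prepare()
print(standalone.headers.get('Cookie'))
```

The session merges its stored cookies into every request it prepares, which is why the second call in the login example needs no cookies argument.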

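3. Custom request headers

requests.post(url='xxxxxxxx',
              data={'xxx': 'yyy'})  # no Content-Type given; with data= the default is application/x-www-form-urlencoded

# If we set Content-Type to application/json ourselves but still pass the values through data=,
# the body is still form-encoded, so a server that parses it as JSON cannot read the values
requests.post(url='',
              data={'': 1, },
              headers={
                  'content-type': 'application/json'
              })

requests.post(url='',
              json={'': 1, },
              )  # with json= the default Content-Type is application/json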

4. Simulating a login to a website

import requests

data = {
    'username': 'your username',
    'password': 'your password',
    'captcha': '9eee',
    'ref': 'http://www.aa7a.cn/',
    'act': 'act_login',
}
res = requests.post('http://www.aa7a.cn/user.php', data=data)
print(res.text)
# {"error":0,"ref":"http://www.aa7a.cn/"} means the login succeeded
# Grab the cookies --> these are the logged-in cookies

# CookieJar object
print(res.cookies.get_dict())

res1 = requests.get('http://www.aa7a.cn/', cookies=res.cookies.get_dict())

# True if the page shows your username, i.e. you are logged in
print('your username' in res1.text)

This shows how to send data and how to send cookies.

cookies: a CookieJar or a dict

requests.session() sends cookies automatically

import requests
# Get a session object; you send requests with it just as you would with requests, except that it handles cookies for you
session = requests.session()
data = {
    'username': 'your username',
    'password': 'your password',
    'captcha': '9eee',
    'ref': 'http://www.aa7a.cn/',
    'act': 'act_login',
}
res = session.post('http://www.aa7a.cn/user.php', data=data)
print(res.text)
# {"error":0,"ref":"http://www.aa7a.cn/"} means the login succeeded
# Grab the cookies --> these are the logged-in cookies

# CookieJar object
print(res.cookies.get_dict())

res1 = session.get('http://www.aa7a.cn/')

print('your username' in res1.text)

5. Summary

# Sending data:
    JSON data: json={}
    urlencoded data: data={}

# Sending cookies automatically:
    session = requests.session()
    res = session.post(login url)
    res1 = session.get(page url)

# Custom request headers:
    Default with data=: application/x-www-form-urlencoded
    headers={'content-type': 'application/json'}

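IV. Response: the requests response object

# 1 The response object
import requests
response = requests.get('http://www.jianshu.com')
# Response attributes
print(response.text)     # body decoded to a string
print(response.content)  # body as raw bytes

print(response.status_code)  # status code
print(response.headers)      # response headers
print(response.cookies)      # cookies set by the response; after a login these are the logged-in cookies
print(response.cookies.get_dict())  # CookieJar object ---> dict
print(response.cookies.items())     # (name, value) pairs, like dict.items()

print(response.url)      # final URL of the response, after any redirects
print(response.history)  # list of the redirect responses that led here; empty if there was no redirect

print(response.encoding)  # encoding used to decode response.text (often utf-8)

# For images or videos, save them to disk
# response.iter_content(): loop over it instead of response.content, so the data is written piece by piece
res = requests.get('xxx', stream=True)  # stream=True defers downloading the body until it is iterated
with open('xxx', 'wb') as f:            # 'xxx' is a placeholder output path
    for chunk in res.iter_content(chunk_size=8192):
        f.write(chunk)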

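# 2  Encoding problems (usually there are none; if the text comes out garbled, fix it like this)
import requests
response = requests.get('http://www.autohome.com/news')
# The Autohome page is encoded as gb2312, but requests falls back to ISO-8859-1 for text responses whose headers declare no charset, so Chinese comes out garbled unless the encoding is set
response.encoding = 'gbk'  # set this to the site's encoding (gbk is a superset of gb2312)
print(response.text)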
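The garbling described above is easy to reproduce offline: the bytes are fine, they are just decoded with the wrong charset. A small standard-library sketch, using a sample string rather than a real page:

```python
raw = '汽车之家'.encode('gbk')    # bytes as a gb2312/gbk site would send them

wrong = raw.decode('iso-8859-1')  # the wrong charset never fails, it just garbles
right = raw.decode('gbk')         # the site's real charset gives the text back

print(repr(wrong))
print(right)

# Nothing was lost: re-encoding the garbled text recovers the original bytes
recovered = wrong.encode('iso-8859-1').decode('gbk')
print(recovered)
```

Setting response.encoding before reading response.text has the same effect as choosing the right charset in the decode call here.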

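# 3 Getting binary content
import requests

response = requests.get('https://wx4.sinaimg.cn/mw690/005Po8PKgy1gqmatpdmhij309j070dgj.jpg')

with open('a.jpg', 'wb') as f:
    # f.write(response.content)
    # Recommended: write it chunk by chunk (the default chunk size is a single byte, so pass a larger one)
    for chunk in response.iter_content(chunk_size=1024):
        f.write(chunk)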
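The chunk-by-chunk pattern can be wrapped in a small helper and tried without any network access. In this sketch save_chunks is a hypothetical helper name, and a list of bytes stands in for response.iter_content(chunk_size=...).

```python
import os
import tempfile

def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path; return the bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written

# Stand-in for response.iter_content(chunk_size=8192)
fake_chunks = [b'\xff\xd8\xff', b'', b'image-bytes']
target = os.path.join(tempfile.mkdtemp(), 'a.jpg')
size = save_chunks(fake_chunks, target)
print(size)
```

With a real download you would call save_chunks(response.iter_content(chunk_size=8192), 'a.jpg'), ideally on a response requested with stream=True so the body is not held in memory all at once.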

# 4 Decoding JSON
import requests
import json

res = requests.get('https://api.luffycity.com/api/v1/course/actual/?category_id=1')
# print(json.loads(res.text))
print(res.json()['code'])

Summary

# Response object attributes and methods:
    Response text                            response.text
    Response body as bytes                   response.content
    Status code                              response.status_code
    Response headers                         response.headers
    CookieJar object                         response.cookies
    Cookies as a dict                        response.cookies.get_dict()
    Cookies as a list of tuples              response.cookies.items()
    Responses from before the redirect       response.history
    Response URL                             response.url
    Response encoding                        response.encoding
    Iterator over the body                   response.iter_content()

# Fixing the response encoding:
    By hand: response.encoding = 'the encoding you know the resource uses'
    Automatically: response.encoding = response.apparent_encoding

# Parsing JSON:
    1. With the json module
        json.loads(response.text)
    2. With the json() method that requests provides
        response.json()

V. Examples

Example 1: scraping a Haokan video

Work out the video URL

https://vd2.bdstatic.com/mda-mcbkh5a50wx55wpi/1080p/h264_cae/1620464746505493348/mda-mcbkh5a50wx55wpi.mp4

Sample code

import requests

res = requests.get(
    'https://vd2.bdstatic.com/mda-mcbkh5a50wx55wpi/1080p/h264_cae/1620464746505493348/mda-mcbkh5a50wx55wpi.mp4')
with open('study.mp4', 'wb') as f:
    for line in res:
        f.write(line)

Example 2: scraping Pear Video

Work out the URL that lists the videos

https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=9&start=0

Sample code

import os
import re
import requests

res = requests.get('https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=9&start=0')
# print(res.text)

# Each match is the href of one video's detail page
video_ids = re.findall('<a href="(.*?)" class="vervideo-lilink actplay">', res.text)
# print(video_ids)
os.makedirs('video', exist_ok=True)  # the output directory must exist before writing
for video_id in video_ids:
    video_url = 'https://www.pearvideo.com/' + video_id
    # print(video_url)
    real_video_id = video_id.split('_')[-1]
    # print(real_video_id)
    # res_detail=video_detail=requests.get(video_url)
    # print(res_detail.text)
    # break
    # Send the ajax request directly and get JSON back --> the JSON contains the mp4 address
    header = {
        # The server checks Referer (an anti-hotlinking check); without it the request is rejected
        'Referer': video_url
    }
    res_json = requests.get('https://www.pearvideo.com/videoStatus.jsp?contId=%s' % real_video_id, headers=header)
    # print(res_json.json())
    mp4_url = res_json.json()['videoInfo']['videos']['srcUrl']
    # The first segment of the file name has to be replaced with 'cont-<id>' (see the two URLs below)
    mp4_url = mp4_url.replace(mp4_url.split('/')[-1].split('-')[0], 'cont-%s' % real_video_id)

    print(mp4_url)
    video_res = requests.get(mp4_url)
    name = mp4_url.split('/')[-1]
    with open('video/%s' % name, 'wb') as f:
        for chunk in video_res.iter_content(chunk_size=1024):
            f.write(chunk)

# https://video.pearvideo.com/mp4/third/20210509/  cont-1728918  -15454898-094108-hd.mp4  plays
# https://video.pearvideo.com/mp4/third/20210509/  1621312234758 -15454898-094108-hd.mp4  srcUrl as returned
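The URL rewrite in this example is the one step that is easy to get wrong, so here it is as a standalone function tested against the two URLs in the comments above. real_mp4_url is a hypothetical helper name, and the href form 'video_1728918' is an assumption based on how the example splits on '_'.

```python
def real_mp4_url(src_url, cont_id):
    """Replace the first dash-separated segment of the file name with 'cont-<id>'."""
    prefix, file_name = src_url.rsplit('/', 1)
    _, rest = file_name.split('-', 1)
    return '%s/cont-%s-%s' % (prefix, cont_id, rest)

# Assumed href form; the example takes the part after the last underscore
cont_id = 'video_1728918'.split('_')[-1]

src = 'https://video.pearvideo.com/mp4/third/20210509/1621312234758-15454898-094108-hd.mp4'
fixed = real_mp4_url(src, cont_id)
print(fixed)
```

Unlike str.replace in the example, this only touches the file name, so it cannot accidentally rewrite a matching substring elsewhere in the URL.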

Example 3: logging in to a website automatically

import requests
# Get a session object; you send requests with it just as you would with requests, except that it handles cookies for you
session = requests.session()
data = {
    'username': 'your username',
    'password': 'your password',
    'captcha': '9eee',
    'ref': 'http://www.aa7a.cn/',
    'act': 'act_login',
}
res = session.post('http://www.aa7a.cn/user.php', data=data)
# print(res.text)
# {"error":0,"ref":"http://www.aa7a.cn/"} means the login succeeded
# Grab the cookies --> these are the logged-in cookies

# CookieJar object
# print(res.cookies.get_dict())

res1 = session.get('http://www.aa7a.cn/')

print('your username' in res1.text)

VI. Advanced usage

1. SSL Cert Verification (sending a client certificate is rarely needed)

import requests

response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)  # certificate not verified; a warning is printed and 200 is returned

# To use a client certificate, pass it explicitly
import requests

response = requests.get('https://www.12306.cn',
                        cert=('/path/server.crt',
                              '/path/key'))
print(response.status_code)

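2. Timeouts

import requests

# Two forms: a float or a tuple
# timeout = 0.001  # a single float is used for both the connect timeout and the read timeout
timeout = (0.0001, 0.002)  # 0.0001 is the connect timeout, 0.002 is the read timeout, both in seconds
response = requests.get('https://www.baidu.com',
                        timeout=timeout)
print(response.text)
print(response.status_code)
# Note: an exception is raised once a timeout expires.
'''
# Example message, from a run where the read timeout was 0.0001
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.baidu.com', port=443): Read timed out. (read timeout=0.0001)
'''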

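3. Authentication

Official docs: http://docs.python-requests.org/en/master/user/authentication/

'''
Authentication: when you open the site, a dialog pops up asking for a username and password (much like alert); until you authenticate you cannot get the HTML
Under the hood, the credentials are simply encoded into a request header:
        r.headers['Authorization'] = _basic_auth_str(self.username, self.password)
Most sites do not use this default scheme; they implement their own
In that case we have to follow the site's scheme and write our own function along the lines of _basic_auth_str,
then add the resulting string to the request headers:
        r.headers['Authorization'] =func('.....')
'''

# Here is the default scheme; sites usually do not use it
import requests
from requests.auth import HTTPBasicAuth

r = requests.get('xxx', auth=HTTPBasicAuth('user', 'password'))
print(r.status_code)

# HTTPBasicAuth can be shortened to a tuple
import requests

r = requests.get('xxx', auth=('user', 'password'))
print(r.status_code)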
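The Authorization header that Basic auth produces can be built by hand with the standard library. This sketch follows the HTTP Basic scheme; basic_auth_header is a hypothetical helper name, not the private function mentioned above.

```python
import base64

def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP Basic auth."""
    raw = f'{username}:{password}'.encode('latin1')
    return 'Basic ' + base64.b64encode(raw).decode('ascii')

value = basic_auth_header('user', 'password')
print(value)

# Basic auth is an encoding, not encryption: anyone can reverse it
decoded = base64.b64decode(value.split(' ', 1)[1]).decode('latin1')
print(decoded)
```

A site with its own scheme would need a different function in place of this one, with the result set in headers['Authorization'] in the same way.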

4. Exception handling

import requests
from requests.exceptions import *  # look in requests.exceptions for the available exception types

try:
    response = requests.get('https://www.baidu.com', timeout=(0.001, 0.002))
    print(response.status_code)
except ReadTimeout:
    print('Read timed out!')
except ConnectionError:  # network unreachable
    print('Connection failed!')
except Timeout:
    print('Timed out')
except RequestException:
    print('Request error')
except Exception as e:
    print(e)
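The order of the except clauses above matters because the exception classes are related by inheritance. The relationships can be checked directly; a short sketch, assuming requests is installed:

```python
from requests.exceptions import (
    ConnectionError, ConnectTimeout, ReadTimeout, RequestException, Timeout,
)

# Every one of these derives from RequestException
all_derive = all(
    issubclass(exc, RequestException)
    for exc in (ConnectionError, ConnectTimeout, ReadTimeout, Timeout)
)
print(all_derive)

# ReadTimeout is a Timeout, so an earlier "except Timeout" would also catch it
print(issubclass(ReadTimeout, Timeout))

# ConnectTimeout is both a ConnectionError and a Timeout
print(issubclass(ConnectTimeout, ConnectionError), issubclass(ConnectTimeout, Timeout))
```

Because of this, specific classes go first and RequestException last; in the block above a connect timeout is reported by the ConnectionError branch, not the Timeout branch.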

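5. Using a proxy

HTTP proxy: see the Baidu Baike entry

'''
Proxies: free ones found online (unstable, fine for experimenting) and paid ones (stable; what companies buy)
High-anonymity proxy: hides the visitor's IP
Transparent proxy: does not hide the visitor's IP; it is passed in the X-Forwarded-For request header ---> in Django, read it from request.META (key HTTP_X_FORWARDED_FOR)

Use a random proxy for each request:
collect a lot of free proxies into a list and pick one at random each time

Or use a third-party open-source proxy pool, written in Python + Flask, and host your own free pool: https://github.com/jhao104/proxy_pool

'''
# Server side
from django.shortcuts import HttpResponse
from django.urls import path

def test_ip(request):
    ip = request.META.get('REMOTE_ADDR')
    return HttpResponse(f'Your IP is {ip}')

def upload_file(request):
    file = request.FILES.get('myfile')

    # Save the uploaded file under its original name
    with open(file.name, 'wb') as f:
        for line in file:
            f.write(line)
    return HttpResponse()

urlpatterns = [
    path('test_ip/', test_ip),
    path('upload_file/', upload_file),
]

# Client side
import requests

# Ask the proxy pool for one proxy address
ip = requests.get('http://118.24.52.95:5010/get/').json()['proxy']
print(ip)
proxies = {
    'http': ip
}
response = requests.get('http://101.133.225.166:8088/test_ip/', proxies=proxies)
print(response.text)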
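The "keep a list and pick one at random" approach mentioned in the notes above could look like the sketch below. The addresses are made up (they come from an IP range reserved for documentation) and pick_proxies is a hypothetical helper name.

```python
import random

# Hypothetical proxy addresses, not real servers
PROXY_POOL = ['203.0.113.10:8080', '203.0.113.11:3128', '203.0.113.12:8888']

def pick_proxies(pool):
    """Return a proxies dict for requests, using one randomly chosen proxy."""
    address = random.choice(pool)
    proxy_url = 'http://' + address
    return {'http': proxy_url, 'https': proxy_url}

proxies = pick_proxies(PROXY_POOL)
print(proxies)
# Usage (not executed here): requests.get(url, proxies=proxies, timeout=5)
```

Free proxies fail often, so in practice it helps to combine this with a timeout and to retry with a different proxy when a request raises an exception.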

6. Uploading files

import requests
response = requests.post('http://101.133.225.166:8088/upload_file/', files={'myfile': open('1 requests高级用法.py', 'rb')})
print(response.text)
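To see what files= sends, you can prepare an upload without sending it and inspect the result. A minimal sketch, assuming requests is installed; the URL, file name, and content are hypothetical, and the file is given as a (filename, content) tuple so nothing has to exist on disk.

```python
import requests

# Prepare (but do not send) an upload; the tuple is (filename, content)
prepared = requests.Request(
    'POST',
    'http://127.0.0.1:8000/upload_file/',  # hypothetical local address
    files={'myfile': ('notes.txt', b'hello upload')},
).prepare()

content_type = prepared.headers['Content-Type']
print(content_type)  # multipart/form-data with a generated boundary
print(b'name="myfile"' in prepared.body)
print(b'filename="notes.txt"' in prepared.body)
```

The dict key ('myfile' here) is the field name the server reads, which is why the Django view above calls request.FILES.get('myfile').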

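VII. Summary

# SSL verification
    verify=False  skip certificate verification
    verify=True   verify the certificate (the default).  cert=(certificate file, key file) sends a client certificate

# Proxies
    # HTTP proxy
    proxies={'http': 'IP:PORT'}

    # SOCKS proxy: install requests[socks]
    proxies={'http': 'socks5://IP:PORT'}

# Timeouts
    timeout=(connect timeout, read timeout)
    Raises: ReadTimeout (or ConnectTimeout if the connection itself times out)

# Authentication
    from requests.auth import HTTPBasicAuth
    auth=HTTPBasicAuth('user', 'password')

# Exception handling
    from requests.exceptions import *
    ReadTimeout       timed out while reading data
    Timeout           any timeout (connect or read)
    ConnectionError   connection error
    RequestException  base class of the requests exceptions

# Uploading files
    files={key: file, key1: file1}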

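mosou · posted 2021-5-25 07:49

Please post some tutorials on processing Excel data with Python.

hagas520 · posted 2021-5-25 08:48

Looking forward to an article on processing Excel; it really is in demand these days.

caballarri · posted 2021-5-25 09:01

Learned something, thanks for sharing.

RootPotence · posted 2021-5-25 09:35

Definitely following you for this.

好学 · posted 2021-5-25 09:35

Very nice, thanks to the Python expert.

Gordon_c · posted 2021-5-25 09:52

I have been learning web scraping recently too, and this article is really well written.
A question: what should I do when I cannot understand some of a Python module's methods?

szwangbin001 · posted 2021-5-25 09:55

Thanks for sharing.

璐璐诺 · posted 2021-5-25 10:20

Here to learn from the expert's article.

不苦小和尚 · posted 2021-5-25 10:31

Very detailed, thanks for sharing.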
