[python] Analyzing Ajax requests and scraping car-show model photos from Toutiao (今日头条) with multiprocessing

Nicobuss · Posted 2018-7-1 23:20

This post was last edited by wushaominkk on 2018-7-2 09:26

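Last time I shared a simple script that scraped the Maoyan TOP100 with the requests library and regular expressions (previous post: )

It was well received, so today we'll scrape something a bit harder and more eye-catching.
Namely, photos of models. Without further ado, here's a preview:

Feeling motivated yet?~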

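The IDE is still PyCharm this time.
The code has been uploaded to GitHub:

Note that the page scraped this time is loaded via Ajax: the listing is not in the initial HTML, so running a regular expression or BeautifulSoup over the page source finds nothing, and the data has to be requested from the JSON endpoint the page itself calls. That makes this a good deal harder than last time.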
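
To make the Ajax point concrete, here is a minimal offline sketch: the listing comes from a JSON endpoint, so the job is to build the query string and read the JSON, not to parse HTML. The sample response is invented for illustration and only mirrors the two fields the crawler relies on (`data` and `article_url`); the helper names `build_index_url` and `extract_article_urls` are hypothetical.

```python
import json
import urllib.parse

# Endpoint taken from the crawler in this post; it may have changed since 2018.
SEARCH_URL = 'https://www.toutiao.com/search_content/'


def build_index_url(offset, keyword):
    """Build the Ajax request URL for one page of search results."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'cur_tab': 1,
    }
    return SEARCH_URL + '?' + urllib.parse.urlencode(params)


def extract_article_urls(json_text):
    """Return the article URLs found in one index response."""
    payload = json.loads(json_text)
    items = payload.get('data') or []
    return [item['article_url'] for item in items if item.get('article_url')]


# Invented sample response: real responses carry many more fields.
sample = json.dumps({
    'data': [
        {'article_url': 'https://example.com/a1'},
        {'title': 'an entry without an article_url'},
        {'article_url': 'https://example.com/a2'},
    ]
})

print(build_index_url(20, 'model'))
print(extract_article_urls(sample))
```

Entries without an `article_url` are skipped here, which plays the same role as the `if item:` check in the crawler's `main`.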

Libraries used:

import requests
import urllib.parse
from requests.exceptions import RequestException
import json
import re
from bs4 import BeautifulSoup
from json.decoder import JSONDecodeError
import os
from hashlib import md5
from multiprocessing import Pool

And now the code:

# This crawler is suited to pages whose content is loaded via Ajax
import requests
import urllib.parse
from requests.exceptions import RequestException
import json
import re
from bs4 import BeautifulSoup
from json.decoder import JSONDecodeError
import os
from hashlib import md5
from multiprocessing import Pool
 
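# pathos would let Pool() map over functions that take several arguments; not needed here
#from pathos.multiprocessing import ProcessingPool as Pool

# Request headers sent with every page request
header = {
    'user-agent': "Baiduspider+"
}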
 
# Fetch one page of the search index (the Ajax endpoint returns JSON)
def get_one_index(offset, keyword, code='utf-8'):
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'cur_tab': 1
    }
    # urlencode turns the dict into a URL-encoded query string
    url = 'https://www.toutiao.com/search_content/' + '?' + urllib.parse.urlencode(data)

    try:
        r = requests.get(url, headers=header, timeout=10)
        r.raise_for_status()
        r.encoding = code
        # The body is JSON text
        return r.text
    except RequestException:
        print('get_one_index failed')
        return None
 
 
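# Parse the index JSON and yield each article URL (this function is a generator)
def parse_one_html(html):
    # get_one_index() returns None when the request fails
    if not html:
        return
    try:
        # Decode the JSON text into a Python object
        result = json.loads(html)
    except JSONDecodeError:
        return
    # Only iterate when the response is a dict whose "data" key holds a list
    if isinstance(result, dict):
        for item in result.get('data') or []:
            yield item.get("article_url")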
 
# Fetch the HTML of an article URL yielded by parse_one_html()
def get_page_html(url):
    try:
        r = requests.get(url, headers=header, timeout=10)
        if r.status_code == 200:
            return r.text
        return None
    except RequestException:
        print('get_page_html failed', url)
        return None
 
# Extract the image URLs from the HTML returned by get_page_html()
def parse_html_by_1(html):
    # get_page_html() returns None when the request fails
    if not html:
        return
    soup = BeautifulSoup(html, 'lxml')
    if soup.title:
        print(soup.title.get_text())
    # Regex for the gallery JSON embedded in the page script (the match is still backslash-escaped)
    image_pattern = re.compile(r'gallery: JSON\.parse\("(.*?)"\)', re.S)
    image_content = re.search(image_pattern, html)

    # Gallery pages (images switched by swiping): the regex finds the embedded JSON
    if image_content:
        data = json.loads(image_content.group(1).replace('\\', ''))
        if data and 'sub_images' in data:
            for item in data.get('sub_images'):
                get_image_html(item.get('url'))
    # Pages where the images are viewed by scrolling need a different regex: parse_html_by_2
    else:
        parse_html_by_2(html)
 
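# For pages where the images are viewed by scrolling, use this method instead
def parse_html_by_2(html):
    # Matches src="..." in its HTML-escaped form (&#x3D; is '=', &quot; is '"')
    pattern = re.compile('src&#x3D;&quot;(.*?)&quot;', re.S)
    images = re.findall(pattern, html)
    for image in images:
        print(image)
        get_image_html(image)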
 
# Download one image and save it into the local "F://photos" folder
def get_image_html(url):
    print('\r Downloading: %s' % url, end='')
    try:
        r = requests.get(url, timeout=10)
    except RequestException:
        return
    if r.status_code == 200:
        write_image(r.content)
 
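# Write the image bytes to disk, named by their MD5 so identical images are stored once
def write_image(content):
    folder = 'f:/photos'
    os.makedirs(folder, exist_ok=True)
    file_path = os.path.join(folder, md5(content).hexdigest() + '.jpg')
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)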
 
 
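def main(i):
    # Each index page holds 20 results, so page i starts at offset i*20; '模特' means "model"
    index_html = get_one_index(i*20, '模特')
    for item in parse_one_html(index_html):
        if item:
            page_html = get_page_html(item)
            if page_html:
                parse_html_by_1(page_html)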
 
 
if __name__ == "__main__":
    page = int(input('Number of pages: '))
    # One task per index page: 0 .. page-1
    offset = list(range(page))

    # Crawl with a pool of worker processes (by default one per CPU core)
    with Pool() as pool:
        pool.map(main, offset)
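
The gallery branch of `parse_html_by_1` can be tried offline. The sketch below runs the same regex against a made-up page fragment and decodes the captured text in two steps (first as a JavaScript string literal, then as JSON) instead of deleting every backslash. That two-step decode is a swapped-in alternative, not what the code above does, and it assumes the literal only contains escapes that JSON also understands. The fragment and the name `extract_gallery_urls` are hypothetical.

```python
import json
import re

GALLERY_PATTERN = re.compile(r'gallery: JSON\.parse\("(.*?)"\)', re.S)


def extract_gallery_urls(html):
    """Return image URLs from the gallery JSON embedded in a page, or []."""
    match = GALLERY_PATTERN.search(html)
    if not match:
        return []
    # Step 1: the capture is the inside of a JS string literal; wrapping it in
    # quotes lets json.loads undo the escaping of quotes and backslashes.
    json_text = json.loads('"' + match.group(1) + '"')
    # Step 2: the unescaped text is itself a JSON document.
    data = json.loads(json_text)
    return [item['url'] for item in data.get('sub_images', []) if 'url' in item]


# Made-up fragment shaped like the script the regex targets.
sample_html = (
    r'gallery: JSON.parse("{\"sub_images\":['
    r'{\"url\":\"http:\\/\\/img.example.com\\/a.jpg\"},'
    r'{\"url\":\"http:\\/\\/img.example.com\\/b.jpg\"}]}"),'
)

print(extract_gallery_urls(sample_html))
```

Stripping all backslashes gives the same result on this sample, but it would also mangle escapes such as `\u4e2d` if they appeared in the embedded JSON.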

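The saving step names each file by the MD5 of its bytes, so an image that turns up in several articles is written only once. A small sketch of that idea that runs in a temporary directory; `save_image` is a hypothetical variant of `write_image` that takes the folder as a parameter and reports whether the file was new.

```python
import os
import tempfile
from hashlib import md5


def save_image(content, folder):
    """Write image bytes to folder/<md5>.jpg; return the path and whether it was new."""
    os.makedirs(folder, exist_ok=True)
    file_path = os.path.join(folder, md5(content).hexdigest() + '.jpg')
    if os.path.exists(file_path):
        return file_path, False
    with open(file_path, 'wb') as f:
        f.write(content)
    return file_path, True


# Demo in a throwaway directory; the byte strings stand in for downloaded images.
with tempfile.TemporaryDirectory() as tmp:
    first = save_image(b'fake image bytes', tmp)
    again = save_image(b'fake image bytes', tmp)
    other = save_image(b'different bytes', tmp)
    print(first[1], again[1], other[1])
    print(len(os.listdir(tmp)))
```

The second call with the same bytes finds the existing file and skips the write, so the directory ends up with two files for three calls.
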
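If you have suggestions, please leave a comment.
And if you have questions, I'll do my best to answer; after all, I'm still learning too~ :lol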

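Nicobuss · Posted 2018-7-10 00:01

RuiBox posted on 2018-7-9 22:21
Thanks for sharing, OP. I have a very basic question
url = 'https://www.toutiao.com/search_con ...

Press F12 in the browser (Chrome recommended), refresh the page, then pick the tab shown in the screenshot, and the request will be listed there.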
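
Once the request URL has been copied out of the developer tools, it can be split back into its parameters to see which ones to vary (the crawler above changes only `offset` and `keyword`). A small sketch with the standard library; the URL is simply the one the crawler builds for offset 0, and `describe_request` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit


def describe_request(url):
    """Split a captured request URL into its endpoint and a dict of parameters."""
    parts = urlsplit(url)
    endpoint = parts.scheme + '://' + parts.netloc + parts.path
    # parse_qs returns a list per key; each key appears once here, so keep the first value.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return endpoint, params


captured = ('https://www.toutiao.com/search_content/'
            '?offset=0&format=json&keyword=%E6%A8%A1%E7%89%B9'
            '&autoload=true&count=20&cur_tab=1')

endpoint, params = describe_request(captured)
print(endpoint)
for key, value in params.items():
    print(key, '=', value)
```

`parse_qs` also undoes the percent-encoding, so `keyword` comes back as the original search term.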

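Nicobuss · Posted 2018-7-2 15:29

萌萌哒的小白 posted on 2018-7-1 23:29
OP, a question: can logging in to a page be done directly in code, without going through the UI?

Yes. The approach you find online is to use cookies, so when a site requires login, this is what I do:
Import this: from http.cookiejar import CookieJar
Then create a session with requests' Session()
Next, define your own headers (with a user-agent) and a data dict (holding your account name and password)
Finally, send the request with the post method, and that's it.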
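
The steps above, as a rough sketch. The login URL and the form field names (`username`, `password`) are placeholders, since every site names them differently, and `login` is a hypothetical helper. It takes the session as an argument: in real use you would pass a `requests.Session()`, which keeps the cookies the server sets and sends them on later requests. The two stand-in classes exist only so the sketch runs without touching the network.

```python
class FakeResponse:
    """Offline stand-in for a response object."""
    status_code = 200

    def raise_for_status(self):
        pass


class FakeSession:
    """Offline stand-in exposing the two members of requests.Session used below."""

    def __init__(self):
        self.headers = {}
        self.calls = []

    def post(self, url, data=None, **kwargs):
        self.calls.append((url, data))
        return FakeResponse()


def login(session, login_url, username, password):
    """Post the credentials through the session and return the session.

    With a real requests.Session the cookies set by the server stay on the
    session, so later session.get(...) calls are made as the logged-in user.
    """
    session.headers.update({'user-agent': 'Mozilla/5.0'})
    # Field names are assumptions; read the real ones from the site's login form.
    payload = {'username': username, 'password': password}
    response = session.post(login_url, data=payload, timeout=10)
    response.raise_for_status()
    return session


# Real use (needs the requests package and a real login URL):
#     import requests
#     session = login(requests.Session(), 'https://example.com/login', 'me', 'secret')
#     page = session.get('https://example.com/private')
demo = login(FakeSession(), 'https://example.com/login', 'me', 'secret')
print(demo.calls)
```

A `requests.Session` already stores cookies in `session.cookies`, so an explicit `CookieJar` is only needed if you want to manage the cookies yourself.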

I have a small project I did a while back that you can use as a reference: https://github.com/Newbie-97/get-resumes-of-university
It's fairly rough, so please bear with it.

萌萌哒的小白 · Posted 2018-7-1 23:29

OP, a question: can logging in to a page be done directly in code, without going through the UI?

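chinaweikedong · Posted 2018-7-1 23:33

I happen to be learning Python scraping too, so thank you very much. I hadn't run into Ajax-loaded pages before; saving this for later. Let's keep in touch.

chinaweikedong · Posted 2018-7-1 23:37

I used PyCharm before as well, but the encoding problems and the lag nearly drove me crazy. Now I use VS Code and it runs like the wind.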

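hbkccccc · Posted 2018-7-2 00:12

Thanks, let's keep in touch.

清华高材生 · Posted 2018-7-2 00:45

Thanks to the OP for such a detailed tutorial.

835296693 · Posted 2018-7-2 08:41

Thanks for sharing, OP.

wushaominkk · Posted 2018-7-2 09:26

Looking forward to even better work from you!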

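Nicobuss · Posted 2018-7-2 15:30

chinaweikedong posted on 2018-7-1 23:33
I happen to be learning Python scraping too, so thank you very much. I hadn't run into Ajax-loaded pages before; saving this for later. Let's keep in touch.

Sure, let's keep at it together~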
