52pojie (吾爱破解) - LCG - LSG | Android cracking | Virus analysis | www.52pojie.cn

[Python, repost] Crawling all photo sets of a model from a gallery site (with multithreading)

Original poster
zrq648022547 posted on 2022-8-16 23:40
Last edited by zrq648022547 on 2022-8-16 23:43

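====================================
== I'm new to Python, so the code is rough; for now there are two approaches ==
== Given a model, crawl all of her albums ==
== Given an album's first-page URL, crawl the images of that single album ==
== It is not well written; advice from experts is very welcome. Thanks in advance. ==
====================================
Code one: crawl all albums of a given model
Drawbacks:
1. Multithreading is in place, but in testing it still seems fairly slow.
2. The downloaded images cannot be named by sequence number.
3. Anti-crawling measures are not handled.
[Python]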
# Multithreaded crawler (uses threads, not processes)
import random
import requests
from bs4 import BeautifulSoup
import os
import time
import threading
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import concurrent.futures

USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
    "UCWEB7.0.2.37/28/999",
    "NOKIA5700/ UCWEB7.0.2.37/28/999",
    "Openwave/ UCWEB7.0.2.37/28/999",
    "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
    # iPhone 6:
    "Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
]

# Note: random.choice runs once at import time, so one User-Agent is reused for every request.
headers = {
    'User-Agent': random.choice(USER_AGENTS),
    "referer": "https://www.xiurenji.vip/"
}


def get_item(item, main_title):
    """Crawl one album, starting from its link on the model's index page."""
    title = item['title']
    item_link = start_url + item['href']
    # Starting a single thread and joining it right away behaves like a direct
    # call, so call get_images directly. Albums only run in parallel if the
    # caller dispatches get_item concurrently.
    get_images(title, item_link, main_title)


def get_images(title, item_link, main_title):
    """Download every image on one album page, then follow the next-page link."""
    item_res = requests.get(url=item_link, headers=headers, timeout=30)
    # 'gzip' is a compression format, not a character set; decode the HTML as UTF-8.
    item_res.encoding = 'utf-8'
    item_soup = BeautifulSoup(item_res.text, 'lxml')
    img_list = item_soup.select('.content_left p img')
    folder = os.path.join(main_folder, main_title, title)
    os.makedirs(folder, exist_ok=True)
    for img in img_list:
        img_link = start_url + img['src']
        img_name = img_link.split('/')[-1]
        starttime = time.time()
        # Fetch first, so a failed request does not leave an empty file behind.
        image = requests.get(url=img_link, headers=headers, timeout=30).content
        with open(os.path.join(folder, img_name), 'wb') as f:
            f.write(image)
        time.sleep(1)  # short pause between images
        endtime = time.time()
        print(f'Saving >>> {title} >>> {img_name} >>> took {endtime - starttime:.3f} seconds')
    # As in the original check, a last pager link with class="current" is
    # treated as the final page. A missing pager also ends the crawl.
    next_link = item_soup.select_one('.page a:last-of-type')
    if next_link is not None and 'class="current"' not in str(next_link):
        # Pages are visited one after another, so a plain recursive call is enough.
        get_images(title, start_url + next_link['href'], main_title)


if __name__ == '__main__':
    start = time.time()
    start_url = 'https://www.xrmn5.cc'  # site root URL
    main_title = '秀人网'
    main_folder = "单项目2\\"  # root output folder (replace with your own path)
    main_url = 'https://www.xrmn5.cc/younisi.html'
    itemlist_res = requests.get(url=main_url, headers=headers, timeout=30)
    # 'gzip' is a compression format, not a character set; decode the HTML as UTF-8.
    itemlist_res.encoding = 'utf-8'
    itemlist_soup = BeautifulSoup(itemlist_res.text, 'lxml')
    itemlist = itemlist_soup.select('.list_n2 a')
    # print(itemlist)
    for item in itemlist:
        get_item(item, main_title)
    end = time.time()
    print('Total time:', end - start, 'seconds', end='')
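
On drawback 1 of code one: the albums are processed one after another, so the threads give little speed-up. A minimal sketch of dispatching the albums onto a bounded thread pool follows. The helper name `crawl_all` and the stand-in worker are hypothetical and not part of the original post; a real worker would call `get_item(item, main_title)`.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_all(items, worker, max_workers=4):
    """Run worker(item) for every item on a bounded pool of threads.

    Returns the results in the same order as items. An exception raised
    inside a worker is re-raised here instead of being silently lost.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the input order and surfaces worker exceptions
        # when its results are consumed.
        return list(pool.map(worker, items))


if __name__ == '__main__':
    # Stand-in worker for illustration only (hypothetical).
    def fake_worker(item):
        return item.upper()

    print(crawl_all(['album-a', 'album-b', 'album-c'], fake_worker))
```

Keeping `max_workers` small caps the number of simultaneous connections, which is probably also kinder to the server than one unbounded thread per album.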

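Code two: crawl a single album
Drawbacks:
1. The folder name has to be set by hand.
2. The URL of the album's first page has to be set by hand.
3. The number of downloaded images does not match the number shown on the pages; some images are missed and I don't know why.
[Python]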
import requests
import parsel
import random
import os
import datetime
import time


starttime = datetime.datetime.now()
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
    "UCWEB7.0.2.37/28/999",
    "NOKIA5700/ UCWEB7.0.2.37/28/999",
    "Openwave/ UCWEB7.0.2.37/28/999",
    "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
    # iPhone 6:
    "Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25",
]

# Note: random.choice runs once at import time, so one User-Agent is reused for every request.
headers = {
    'User-Agent': random.choice(USER_AGENTS),
    # 'Connection': 'close'
    "referer": "https://www.xrmn5.cc"
}

# 0. Create the output folder
directory = '[YouMi尤蜜荟]Vol.809_女神尤妮丝Egg红色轻透上衣配红短裙半脱露红色内衣诱惑写真60P\\'
os.makedirs(directory, exist_ok=True)
# 1. Build the URL of each page of the album

for page in range(0, 30):
    if page >= 1:
        base_url = 'https://www.xrmn5.com/YouMi/2022/202211014_{}.html'.format(page)
    else:
        base_url = 'https://www.xrmn5.com/YouMi/2022/202211014.html'
    # 2. Request the page once and decode it as UTF-8
    response = requests.get(url=base_url, headers=headers, timeout=30)
    response.encoding = 'UTF-8'  # set the charset explicitly (not auto-detected)
    html_1 = parsel.Selector(response.text)
    # 3. Collect every image on the page instead of assuming exactly three
    img_srcs = html_1.xpath('//*[@class="content_left"]/p/img/@src').extract()
    if not img_srcs:
        continue  # no images on this page (for example, past the end of the album)
    for i, img_src in enumerate(img_srcs, start=1):
        img_list = img_src.replace('uploadfile', 'Uploadfile')
        img_url = 'https://p.xrmn5.com/' + img_list
        # Request the image itself
        img_data = requests.get(img_url, headers=headers, timeout=30).content
        img_name_1 = str(page + 1)  # page number used in the file name
        # 4. Save the image as <page>-<index>.jpg
        with open(directory + img_name_1 + '-' + str(i) + '.jpg', 'wb') as f:
            f.write(img_data)
        time.sleep(10)  # long pause between images
        print('##################### Downloaded page', page + 1, ', image', i, '#####################')
endtime = datetime.datetime.now()
print('#################### Download finished, total time', (endtime - starttime).seconds, 'seconds ###################')
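
On drawbacks 1 and 2 of code two: the page URLs follow a fixed pattern in the script above (`<stem>.html` for the first page, `<stem>_<n>.html` afterwards), so they could be derived from the first-page URL alone. A minimal sketch, with hypothetical helper names `album_page_urls` and `safe_folder_name`; where the album title comes from (for example the page's `<title>` element) is an assumption and left to the caller.

```python
import re


def album_page_urls(first_page_url, page_count):
    """Build the URLs of an album's pages from the URL of its first page.

    The first page is <stem>.<ext>; page n (n >= 1) is <stem>_<n>.<ext>.
    """
    stem, ext = first_page_url.rsplit('.', 1)
    urls = [first_page_url]
    for n in range(1, page_count):
        urls.append(f'{stem}_{n}.{ext}')
    return urls


def safe_folder_name(title):
    """Replace characters that Windows forbids in file names, then trim spaces."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()


if __name__ == '__main__':
    for url in album_page_urls('https://www.xrmn5.com/YouMi/2022/202211014.html', 3):
        print(url)
    print(safe_folder_name(' Vol.809: sample/title? '))
```

With these two helpers only the first-page URL needs to be typed in; the folder name can then be produced from whatever title string the page provides.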


Recommended
水上凌波 posted on 2022-8-17 00:03
Can https://www.3gbizhi.com/ be crawled as well?
Recommended
rangersxiaoyan posted on 2022-8-24 22:26
zrq648022547 posted on 2022-8-24 00:56
Take a look at coroutines; I'm studying them too.

[Python]
# coding=utf-8
import aiohttp
import asyncio

# Image URLs
urls = [
    "https://p.xiurenb.cc/uploadfile/202207/25/1A93926296.jpg",
    "https://p.xiurenb.cc/uploadfile/202207/25/1193926263.jpg",
    "https://p.xiurenb.cc/uploadfile/202207/25/BA93926177.jpg",
    "https://p.xiurenb.cc/uploadfile/202207/25/EE93926237.jpg",
    "https://p.xiurenb.cc/uploadfile/202207/25/9C93926238.jpg",
    "https://p.xiurenb.cc/uploadfile/202207/25/8A93926261.jpg",

]


async def download(session, url, name):
    async with session.get(url) as resp:  # counterpart of resp = requests.get(url)
        # Reading the body is I/O too, so it must be awaited;
        # use resp.text() instead to get page source.
        data = await resp.read()
    with open(name, 'wb') as f:
        f.write(data)
    print(name, 'ok')


async def main():
    # One shared session (the counterpart of requests.Session) reuses connections
    async with aiohttp.ClientSession() as session:
        # One coroutine per image URL, named by its position in the list
        tasks = [download(session, url, f'{i}.jpg') for i, url in enumerate(urls)]
        # Run them concurrently; gather re-raises the first failure
        await asyncio.gather(*tasks)


if __name__ == '__main__':
    # Create the event loop, run main(), and close the loop
    asyncio.run(main())

This is my coroutine variant, for reference. Let's keep working at it together.
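
A coroutine version starts every download at once, which may trigger the site's rate limits when the URL list is long. One common way to cap concurrency is `asyncio.Semaphore`. The sketch below uses a hypothetical helper `gather_limited` and a simulated download, so it runs without `aiohttp` or network access.

```python
import asyncio


async def gather_limited(jobs, limit=3):
    """Await zero-argument coroutine functions with at most `limit` running at once.

    Results are returned in the same order as jobs.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        # Only `limit` jobs can hold the semaphore at the same time.
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


if __name__ == '__main__':
    # Simulated download for illustration only (hypothetical).
    async def fake_download():
        await asyncio.sleep(0.1)
        return 'ok'

    print(asyncio.run(gather_limited([fake_download] * 6, limit=2)))
```

In a real crawler each job would wrap one download call, for example via `functools.partial`.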
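3#
隐身三娃 posted on 2022-8-16 23:57
Learning together here. The missing images and the out-of-order file names most likely come from the multithreading. I see two problems:
First: the number of threads is not capped, which makes it more likely that the server blocks you.
Second: when processing the crawled data, wrap it in try blocks to handle errors and bad data.
Just my opinion; let's keep learning together.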
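
The second suggestion above can be sketched as a small retry wrapper. The name `with_retries` and its parameters are hypothetical; this is one possible shape, not the method used in the thread.

```python
import time


def with_retries(action, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call action() until it succeeds or the attempts run out.

    Waits `delay` seconds after a failure and doubles the wait each time.
    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)
            delay *= 2


if __name__ == '__main__':
    outcomes = iter([ValueError('temporary failure'), 'image bytes'])

    def flaky():
        # Fails once, then succeeds (simulated for illustration).
        value = next(outcomes)
        if isinstance(value, Exception):
            raise value
        return value

    print(with_retries(flaky, attempts=3, delay=0.1))
```

With `requests`, the call could look like `with_retries(lambda: requests.get(url, headers=headers, timeout=30), exceptions=(requests.RequestException,))`, so that only network errors are retried.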
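4#
icodeme posted on 2022-8-17 00:18
Nice, thanks for sharing. I hope you keep improving it.
5#
decdeva posted on 2022-8-17 01:41
Dude, that's seriously impressive.
6#
ysjd22 posted on 2022-8-17 08:05
Is there any software for crawling PPT and video assets?
7#
wallacebai posted on 2022-8-17 08:12
Thanks for sharing, I learned something.
8#
烟花非易冷 posted on 2022-8-17 08:19
The crawled content is pretty good, haha.
9#
guohuanxian posted on 2022-8-17 08:27
Thanks for sharing.
10#
excess1989 posted on 2022-8-17 08:32
Awesome, a true veteran.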