Views: 7879 | Replies: 44

[Python Repost] Python crawler - beauty pictures

Qnly_genius posted on 2018-12-10 23:25
Last edited by Qnly_genius on 2018-12-11 12:03


I've also packaged the program with pyinstaller:


Link: https://pan.baidu.com/s/1uQTtpIwOpBMoMH5nO3PJrA  Extraction code: unuq


Python 3.6.2
Required modules: requests, lxml
Install them with:
pip install requests lxml

Create a "美图" folder in the same directory as the script, then run it. Go easy on me, experts!
Code:
#-*- coding:utf-8 -*-

import os
import time
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    'Referer': 'http://www.mmjpg.com/tag/xinggan'
}

# tag/xinggan: front page of the "sexy" category
url = "http://www.mmjpg.com/tag/xinggan"

html = requests.get(url).text
soup1 = etree.HTML(html)
# number of pages in the category, read from the last pagination link
all_page = int(soup1.xpath('/html/body/div[3]/div[1]/div[2]/a[8]/@href')[0].split('/')[-1])

for page in range(all_page):
    url = "http://www.mmjpg.com/tag/xinggan/%d" % (page + 1)
    html = requests.get(url).text
    soup1 = etree.HTML(html)

    # each category page lists 15 galleries
    for i in range(15):
        path = "/html/body/div[3]/div[1]/ul/li[%d]/a/@href" % (i + 1)
        # URL of this gallery's front page
        tep_url = soup1.xpath(path)
        # gallery id (the Referer header needs it)
        id = int(tep_url[0].split('/')[-1].replace('.jpg', ''))
        dir_name = '美图/' + str(id)
        os.mkdir(dir_name)
        # gallery title
        title = soup1.xpath('/html/body/div[3]/div[1]/ul/li[%d]/span[1]/a/text()' % (i + 1))[0]
        # fetch the gallery's front page
        pic_page = requests.get(tep_url[0]).text
        # and parse it
        soup2 = etree.HTML(pic_page)
        # number of images in this gallery
        page_num = int(soup2.xpath('//*[@id="page"]/a[7]/text()')[0])
        # URL of the first image, e.g. http://fm.shiyunjj.com/2018/1502/1ie6.jpg
        pic_url = soup2.xpath('//*[@id="content"]/a/img/@src')[0]
        # filename part, e.g. 1ie6.jpg
        detail_url_end = pic_url.split('/')[-1]
        # base part, e.g. http://fm.shiyunjj.com/2018/1502/
        detail_url_top = pic_url.replace(detail_url_end, '')
        # download every image (j, not i, so the gallery counter is not shadowed)
        for j in range(page_num):
            # the image's own page doubles as its Referer
            detail_url = "http://www.mmjpg.com/mm/%d/%d" % (id, j + 1)
            headers['Referer'] = detail_url

            # extract the real image URL from the page
            html_detail = requests.get(detail_url).text
            soup3 = etree.HTML(html_detail)
            pic = soup3.xpath('//*[@id="content"]/a/img/@src')[0]

            # save the image
            with open(dir_name + '/' + str(j + 1) + '.jpg', 'wb') as f:
                print('Downloading:', dir_name + '/' + str(j + 1) + '.jpg')
                f.write(requests.get(pic, headers=headers).content)

            # time.sleep(0.5)
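
A note on the commented-out time.sleep(0.5) at the end of the loop: enabling it throttles the image requests, which should make timeouts like the one reported below less likely. A minimal sketch, reusing the 0.5 s value already hinted at in the script:

import time

# Pause between image downloads so the server is not hammered;
# increase the delay if connections still drop.
time.sleep(0.5)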


Dirichlets posted on 2018-12-27 10:29
[Error below] The first part ran fine, then it broke with a timeout partway through; not sure whether it's my network. (WinError 10060 is Windows' "connection attempt failed because the connected party did not properly respond" error.)
...
正在下载: 美图/831/31jpg
正在下载: 美图/831/32jpg
Traceback (most recent call last):
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connection.py", line 171, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\util\connection.py", line 79, in create_connection
    raise err
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\util\connection.py", line 69, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 1224, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 1016, in _send_output
    self.send(msg)
  File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 956, in send
    self.connect()
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connection.py", line 196, in connect
    conn = self._new_conn()
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connection.py", line 180, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x000001CC08DD2908>: Failed to establish a new connection: [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\adapters.py", line 445, in send
    timeout=timeout
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\util\retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='www.mmjpg.com', port=80): Max retries exceeded with url: /mm/830 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001CC08DD2908>: Failed to establish a new connection: [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "G:/PyCharm_Projects/图像/meitu.py", line 38, in <module>
    pic_page = requests.get(tep_url[0]).text
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\sessions.py", line 512, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\sessions.py", line 622, in send
    r = adapter.send(request, **kwargs)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\adapters.py", line 513, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.mmjpg.com', port=80): Max retries exceeded with url: /mm/830 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001CC08DD2908>: Failed to establish a new connection: [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。'))

Process finished with exit code 1
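
The traceback itself suggests a fix: every requests.get() in the script is issued without a timeout or retry policy, so one stalled connection kills the whole run. A sketch of a more forgiving setup using a shared Session (the names session and retry are illustrative, not from the original script):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# One session for the whole run: reuses connections and
# retries transient failures with exponential backoff.
session = requests.Session()
retry = Retry(total=5, backoff_factor=1,
              status_forcelist=[500, 502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retry))

# Replace bare requests.get(url) calls with, e.g.:
# html = session.get(url, headers=headers, timeout=10).text

With a 10-second timeout, a dead connection raises quickly instead of hanging, and the Retry policy re-attempts it a few times before giving up.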
caozb posted on 2018-12-12 00:36
Now it creates the folder automatically:
#-*- coding:utf-8 -*-

import os
import time
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    'Referer': 'http://www.mmjpg.com/tag/xinggan'
}

# tag/xinggan: front page of the "sexy" category
url = "http://www.mmjpg.com/tag/xinggan"

html = requests.get(url).text
soup1 = etree.HTML(html)
# number of pages in the category
all_page = int(soup1.xpath('/html/body/div[3]/div[1]/div[2]/a[8]/@href')[0].split('/')[-1])

for page in range(all_page):
    url = "http://www.mmjpg.com/tag/xinggan/%d" % (page + 1)
    html = requests.get(url).text
    soup1 = etree.HTML(html)

    # each category page lists 15 galleries
    for i in range(15):
        path = "/html/body/div[3]/div[1]/ul/li[%d]/a/@href" % (i + 1)
        # URL of this gallery's front page
        tep_url = soup1.xpath(path)
        # gallery id (the Referer header needs it)
        id = int(tep_url[0].split('/')[-1].replace('.jpg', ''))

        # create the parent folder automatically if it is missing
        path1 = os.getcwd()
        isExists = os.path.exists(path1 + '\\美女')
        if not isExists:
            os.makedirs(path1 + '\\美女')

        dir_name = '美女/' + str(id)
        os.mkdir(dir_name)
        # gallery title
        title = soup1.xpath('/html/body/div[3]/div[1]/ul/li[%d]/span[1]/a/text()' % (i + 1))[0]
        # fetch the gallery's front page
        pic_page = requests.get(tep_url[0]).text
        # and parse it
        soup2 = etree.HTML(pic_page)
        # number of images in this gallery
        page_num = int(soup2.xpath('//*[@id="page"]/a[7]/text()')[0])
        # URL of the first image
        pic_url = soup2.xpath('//*[@id="content"]/a/img/@src')[0]
        # filename part, e.g. 1ie6.jpg
        detail_url_end = pic_url.split('/')[-1]
        # base part, e.g. http://fm.shiyunjj.com/2018/1502/
        detail_url_top = pic_url.replace(detail_url_end, '')
        # download every image (j, not i, so the gallery counter is not shadowed)
        for j in range(page_num):
            # the image's own page doubles as its Referer
            detail_url = "http://www.mmjpg.com/mm/%d/%d" % (id, j + 1)
            headers['Referer'] = detail_url

            # extract the real image URL from the page
            html_detail = requests.get(detail_url).text
            soup3 = etree.HTML(html_detail)
            pic = soup3.xpath('//*[@id="content"]/a/img/@src')[0]

            # save the image
            with open(dir_name + '/' + str(j + 1) + '.jpg', 'wb') as f:
                f.write(requests.get(pic, headers=headers).content)
                print('Downloading:', dir_name + '/' + str(j + 1) + '.jpg')

            # time.sleep(0.5)
许小诺always posted on 2018-12-10 23:36
Feels like I'm looking at porn.
为海尔而战 posted on 2018-12-10 23:41

The site URL is right there in the code~
YAO21 posted on 2018-12-10 23:41
Thanks for sharing.
beisimm posted on 2018-12-10 23:48
You could use the os library to check whether the folder exists, create it if it doesn't, and then save the files into it.
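
A minimal sketch of that suggestion; the folder naming follows the original script, and model_id is a hypothetical stand-in for the id scraped from the page:

import os

model_id = 1502  # hypothetical example; the script parses this from the gallery URL
dir_name = os.path.join('美图', str(model_id))
# exist_ok=True makes this a no-op when the folder already exists,
# so the script can be re-run without crashing on os.mkdir.
os.makedirs(dir_name, exist_ok=True)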
sexyeyes posted on 2018-12-11 00:03
About to start learning Python.
olivier posted on 2018-12-11 00:25
It doesn't seem to work; OP, please help debug:
Traceback (most recent call last):
  File "C:/Users/Python/Desktop/Python/美女图/mmjp.py", line 21, in <module>
    for page in all_page:
TypeError: 'int' object is not iterable

Process finished with exit code 1
沉默挺好的 posted on 2018-12-11 00:30
Does it actually work?
caozb posted on 2018-12-11 01:04
Change it to: for page in range(all_page)
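
For anyone puzzled by that error: all_page is an int (a page count), and an int is not iterable, hence the TypeError above. range(all_page) turns the count into the sequence 0, 1, ..., all_page - 1:

all_page = 10  # example value; the script scrapes it from the pagination links
for page in range(all_page):
    url = "http://www.mmjpg.com/tag/xinggan/%d" % (page + 1)  # pages are 1-based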