Views: 7879 | Replies: 44

[Python Repost] Python crawler - beauty pictures

Qnly_genius posted on 2018-12-10 23:25
Last edited by Qnly_genius on 2018-12-11 12:03


I've also packaged the program with pyinstaller:


Link: https://pan.baidu.com/s/1uQTtpIwOpBMoMH5nO3PJrA  Extraction code: unuq


Python 3.6.2
Required modules: requests, lxml
Install them with:
pip install requests lxml

Create a "美图" folder in the same directory as the script, then run it. Go easy on me, experts!
Code:
#-*- coding:utf-8 -*-

import os
import time
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    'Referer': 'http://www.mmjpg.com/tag/xinggan'
}

# tag/xinggan: front page of the "sexy" category
url = "http://www.mmjpg.com/tag/xinggan"

html = requests.get(url).text
soup1 = etree.HTML(html)
# number of pages in the category, read from the last pagination link
all_page = int(soup1.xpath('/html/body/div[3]/div[1]/div[2]/a[8]/@href')[0].split('/')[-1])

for page in range(all_page):
    url = "http://www.mmjpg.com/tag/xinggan/%d" % (page + 1)
    html = requests.get(url).text
    soup1 = etree.HTML(html)

    # each category page lists 15 galleries
    for i in range(15):
        path = "/html/body/div[3]/div[1]/ul/li[%d]/a/@href" % (i + 1)
        # URL of this gallery's front page
        tep_url = soup1.xpath(path)
        # gallery id (the Referer header needs it)
        id = int(tep_url[0].split('/')[-1].replace('.jpg', ''))
        dir_name = '美图/' + str(id)
        os.mkdir(dir_name)
        # gallery title
        title = soup1.xpath('/html/body/div[3]/div[1]/ul/li[%d]/span[1]/a/text()' % (i + 1))[0]
        # fetch the gallery's front page
        pic_page = requests.get(tep_url[0]).text
        # and parse it
        soup2 = etree.HTML(pic_page)
        # number of images in this gallery
        page_num = int(soup2.xpath('//*[@id="page"]/a[7]/text()')[0])
        # URL of the first image, e.g. http://fm.shiyunjj.com/2018/1502/1ie6.jpg
        pic_url = soup2.xpath('//*[@id="content"]/a/img/@src')[0]
        # filename part, e.g. 1ie6.jpg
        detail_url_end = pic_url.split('/')[-1]
        # base part, e.g. http://fm.shiyunjj.com/2018/1502/
        detail_url_top = pic_url.replace(detail_url_end, '')
        # download every image (j, not i, so the gallery counter is not shadowed)
        for j in range(page_num):
            # the image's own page doubles as its Referer
            detail_url = "http://www.mmjpg.com/mm/%d/%d" % (id, j + 1)
            headers['Referer'] = detail_url

            # extract the real image URL from the page
            html_detail = requests.get(detail_url).text
            soup3 = etree.HTML(html_detail)
            pic = soup3.xpath('//*[@id="content"]/a/img/@src')[0]

            # save the image
            with open(dir_name + '/' + str(j + 1) + '.jpg', 'wb') as f:
                print('Downloading:', dir_name + '/' + str(j + 1) + '.jpg')
                f.write(requests.get(pic, headers=headers).content)

            # time.sleep(0.5)
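
A note on the commented-out time.sleep(0.5) at the end of the loop: enabling it throttles the image requests, which should make timeouts like the one reported below less likely. A minimal sketch, reusing the 0.5 s value already hinted at in the script:

import time

# Pause between image downloads so the server is not hammered;
# increase the delay if connections still drop.
time.sleep(0.5)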


Dirichlets posted on 2018-12-27 10:29
[Error below] The first part ran fine, then it broke with a timeout partway through; not sure whether it's my network. (WinError 10060 is Windows' "connection attempt failed because the connected party did not properly respond" error.)
...
正在下载: 美图/831/31jpg
正在下载: 美图/831/32jpg
Traceback (most recent call last):
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connection.py", line 171, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\util\connection.py", line 79, in create_connection
    raise err
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\util\connection.py", line 69, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connectionpool.py", line 354, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 1229, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 1275, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 1224, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 1016, in _send_output
    self.send(msg)
  File "D:\Program Files\Python\3.7.0\lib\http\client.py", line 956, in send
    self.connect()
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connection.py", line 196, in connect
    conn = self._new_conn()
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connection.py", line 180, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x000001CC08DD2908>: Failed to establish a new connection: [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\adapters.py", line 445, in send
    timeout=timeout
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "D:\Program Files\Python\3.7.0\lib\site-packages\urllib3\util\retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='www.mmjpg.com', port=80): Max retries exceeded with url: /mm/830 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001CC08DD2908>: Failed to establish a new connection: [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "G:/PyCharm_Projects/图像/meitu.py", line 38, in <module>
    pic_page = requests.get(tep_url[0]).text
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\sessions.py", line 512, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\sessions.py", line 622, in send
    r = adapter.send(request, **kwargs)
  File "D:\Program Files\Python\3.7.0\lib\site-packages\requests\adapters.py", line 513, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.mmjpg.com', port=80): Max retries exceeded with url: /mm/830 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000001CC08DD2908>: Failed to establish a new connection: [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。'))

Process finished with exit code 1
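
The traceback itself suggests a fix: every requests.get() in the script is issued without a timeout or retry policy, so one stalled connection kills the whole run. A sketch of a more forgiving setup using a shared Session (the names session and retry are illustrative, not from the original script):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# One session for the whole run: reuses connections and
# retries transient failures with exponential backoff.
session = requests.Session()
retry = Retry(total=5, backoff_factor=1,
              status_forcelist=[500, 502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retry))

# Replace bare requests.get(url) calls with, e.g.:
# html = session.get(url, headers=headers, timeout=10).text

With a 10-second timeout, a dead connection raises quickly instead of hanging, and the Retry policy re-attempts it a few times before giving up.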
caozb posted on 2018-12-12 00:36
Now it creates the folder automatically:
#-*- coding:utf-8 -*-

import os
import time
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    'Referer': 'http://www.mmjpg.com/tag/xinggan'
}

# tag/xinggan: front page of the "sexy" category
url = "http://www.mmjpg.com/tag/xinggan"

html = requests.get(url).text
soup1 = etree.HTML(html)
# number of pages in the category
all_page = int(soup1.xpath('/html/body/div[3]/div[1]/div[2]/a[8]/@href')[0].split('/')[-1])

for page in range(all_page):
    url = "http://www.mmjpg.com/tag/xinggan/%d" % (page + 1)
    html = requests.get(url).text
    soup1 = etree.HTML(html)

    # each category page lists 15 galleries
    for i in range(15):
        path = "/html/body/div[3]/div[1]/ul/li[%d]/a/@href" % (i + 1)
        # URL of this gallery's front page
        tep_url = soup1.xpath(path)
        # gallery id (the Referer header needs it)
        id = int(tep_url[0].split('/')[-1].replace('.jpg', ''))

        # create the parent folder automatically if it is missing
        path1 = os.getcwd()
        isExists = os.path.exists(path1 + '\\美女')
        if not isExists:
            os.makedirs(path1 + '\\美女')

        dir_name = '美女/' + str(id)
        os.mkdir(dir_name)
        # gallery title
        title = soup1.xpath('/html/body/div[3]/div[1]/ul/li[%d]/span[1]/a/text()' % (i + 1))[0]
        # fetch the gallery's front page
        pic_page = requests.get(tep_url[0]).text
        # and parse it
        soup2 = etree.HTML(pic_page)
        # number of images in this gallery
        page_num = int(soup2.xpath('//*[@id="page"]/a[7]/text()')[0])
        # URL of the first image
        pic_url = soup2.xpath('//*[@id="content"]/a/img/@src')[0]
        # filename part, e.g. 1ie6.jpg
        detail_url_end = pic_url.split('/')[-1]
        # base part, e.g. http://fm.shiyunjj.com/2018/1502/
        detail_url_top = pic_url.replace(detail_url_end, '')
        # download every image (j, not i, so the gallery counter is not shadowed)
        for j in range(page_num):
            # the image's own page doubles as its Referer
            detail_url = "http://www.mmjpg.com/mm/%d/%d" % (id, j + 1)
            headers['Referer'] = detail_url

            # extract the real image URL from the page
            html_detail = requests.get(detail_url).text
            soup3 = etree.HTML(html_detail)
            pic = soup3.xpath('//*[@id="content"]/a/img/@src')[0]

            # save the image
            with open(dir_name + '/' + str(j + 1) + '.jpg', 'wb') as f:
                f.write(requests.get(pic, headers=headers).content)
                print('Downloading:', dir_name + '/' + str(j + 1) + '.jpg')

            # time.sleep(0.5)
许小诺always posted on 2018-12-10 23:36
Feels like I'm looking at porn.
为海尔而战 posted on 2018-12-10 23:41

The site URL is right there in the code~
YAO21 posted on 2018-12-10 23:41
Thanks for sharing.
beisimm posted on 2018-12-10 23:48
You could use the os library to check whether the folder exists, create it if it doesn't, and then save the files into it.
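
A minimal sketch of that suggestion; the folder naming follows the original script, and model_id is a hypothetical stand-in for the id scraped from the page:

import os

model_id = 1502  # hypothetical example; the script parses this from the gallery URL
dir_name = os.path.join('美图', str(model_id))
# exist_ok=True makes this a no-op when the folder already exists,
# so the script can be re-run without crashing on os.mkdir.
os.makedirs(dir_name, exist_ok=True)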
sexyeyes posted on 2018-12-11 00:03
About to start learning Python.
olivier posted on 2018-12-11 00:25
It doesn't seem to work; OP, please help debug:
Traceback (most recent call last):
  File "C:/Users/Python/Desktop/Python/美女图/mmjp.py", line 21, in <module>
    for page in all_page:
TypeError: 'int' object is not iterable

Process finished with exit code 1
沉默挺好的 posted on 2018-12-11 00:30
Does it actually work?
caozb posted on 2018-12-11 01:04
Change it to: for page in range(all_page)
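
For anyone puzzled by that error: all_page is an int (a page count), and an int is not iterable, hence the TypeError above. range(all_page) turns the count into the sequence 0, 1, ..., all_page - 1:

all_page = 10  # example value; the script scrapes it from the pagination links
for page in range(all_page):
    url = "http://www.mmjpg.com/tag/xinggan/%d" % (page + 1)  # pages are 1-based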