吾爱破解 - 52pojie.cn


Views: 2239 | Replies: 28

[Python Original] Scraping images from 彼岸网 (pic.netbian.com) with Python's Scrapy

到底爱不爱我不 posted on 2024-5-9 15:44
  • Prerequisites

[Shell]
pip install scrapy

# Create the project
scrapy startproject bian
cd bian

# Generate a CrawlSpider inside the project (genspider expects a domain)
scrapy genspider -t crawl bian_pic pic.netbian.com
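
Once the spider, items, and pipeline below are filled in, the crawl would be started from the project root. A quick sketch of the invocation (the project directory name `bian` comes from the startproject step above):

```shell
# Run from inside the generated project directory
scrapy crawl bian_pic
```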

  • Writing the spider code

[Python]
# settings.py
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"

ROBOTSTXT_OBEY = False

LOG_LEVEL = "ERROR"

CONCURRENT_REQUESTS = 32

ITEM_PIPELINES = {
    "bian.pipelines.BianPipeline": 300,
}
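
As a side note, crawl-rate settings could also go here to stay polite to the site. These are standard Scrapy settings, though the specific values below are illustrative assumptions, not from the original post:

```python
# settings.py additions (sketch; values are assumptions)
DOWNLOAD_DELAY = 0.5          # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True   # let Scrapy adapt the request rate to server latency
RETRY_TIMES = 2               # retry transient failures a couple of times
```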

[Python]
# items.py
import scrapy


class BianItem(scrapy.Item):
    href = scrapy.Field()
    title = scrapy.Field()
    src = scrapy.Field()

[Python]
# bian_pic.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from bian.items import BianItem


class BianPicSpider(CrawlSpider):
    name = "bian_pic"
    # allowed_domains = ["pic.netbian.com"]
    base_url = "https://pic.netbian.com"
    start_urls = [
        "https://pic.netbian.com/4kdongman",
        "https://pic.netbian.com/4kyouxi",
        "https://pic.netbian.com/4kmeinv",
        "https://pic.netbian.com/4kfengjing",
        "https://pic.netbian.com/4kyingshi",
        "https://pic.netbian.com/4kqiche",
        "https://pic.netbian.com/4krenwu",
        "https://pic.netbian.com/4kdongwu",
        "https://pic.netbian.com/4kzongjiao",
        "https://pic.netbian.com/4kbeijing",
        "https://pic.netbian.com/pingban",
        "https://pic.netbian.com/shoujibizhi",
    ]

    link = LinkExtractor(restrict_xpaths='//*[@class="page"]/a')
    rules = (Rule(link, callback="parse_item", follow=True),)

    def parse_item(self, response):
        a_list = response.xpath('//*[@class="slist"]/ul/li/a')
        for a in a_list:
            if a.xpath('./@target').extract_first():
                href = a.xpath('./@href').extract_first()
                # The detail page yields the item; nothing needs to be built here
                yield scrapy.Request(url=self.base_url + href, callback=self.parse_detail)

    def parse_detail(self, response):
        src = response.xpath('//*[@id="img"]/img/@src').extract_first()
        title = response.xpath('//*[@id="img"]/img/@title').extract_first()
        item = BianItem()
        item["src"] = self.base_url + src
        item["title"] = title
        yield item
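
A note on the `self.base_url + href` concatenation above: it works because the site's hrefs are root-relative, but `urllib.parse.urljoin` (which Scrapy wraps as `response.urljoin`) is more robust against missing slashes and absolute hrefs. A small stdlib check (the example paths are made up):

```python
from urllib.parse import urljoin

base = "https://pic.netbian.com"

# href values on the list page are root-relative, e.g. "/tupian/12345.html"
print(urljoin(base, "/tupian/12345.html"))  # https://pic.netbian.com/tupian/12345.html

# urljoin also copes with hrefs that are already absolute
print(urljoin(base, "https://pic.netbian.com/uploads/a.jpg"))
```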

[Python]
# pipelines.py

class BianPipeline:
    fp = None

    def open_spider(self, spider):
        print("Opening output file for scraped items")
        self.fp = open("pic.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # title can be None if the detail page lacks one
        self.fp.write((item["title"] or "") + " | " + item["src"] + "\n")
        return item

    def close_spider(self, spider):
        print("Finished writing scraped items")
        self.fp.close()
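
The line format the pipeline produces can be exercised without Scrapy. A minimal stand-in (the `write_item` helper and the `"untitled"` fallback are mine, not from the post):

```python
import io

def write_item(fp, item):
    # Mirrors BianPipeline.process_item's "title | src" line format;
    # guards against a missing title, which would otherwise raise TypeError
    fp.write((item.get("title") or "untitled") + " | " + item["src"] + "\n")
    return item

buf = io.StringIO()
write_item(buf, {"title": "4K wallpaper", "src": "https://pic.netbian.com/uploads/a.jpg"})
print(buf.getvalue())  # 4K wallpaper | https://pic.netbian.com/uploads/a.jpg
```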

  • Closing remarks

I wrote this out of boredom at the office, so the scraped data just goes straight into a text file; I didn't dare download the images for fear of triggering abnormal-traffic alarms. If you're interested, you can add a download method to the pipeline.
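
For anyone who does want that download step, here is a rough sketch using Scrapy's built-in `ImagesPipeline` (the class name `BianImagePipeline` and the settings values are my assumptions; it also requires the Pillow package, and I have not run it against the site):

```python
# pipelines.py sketch: downloads item["src"] via Scrapy's media pipeline
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class BianImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request the absolute image URL built in parse_detail
        yield scrapy.Request(item["src"])

    def file_path(self, request, response=None, info=None, *, item=None):
        # Name each file after its page title (assumes titles are filesystem-safe)
        return f'{item["title"]}.jpg'

# settings.py additions:
# ITEM_PIPELINES = {"bian.pipelines.BianImagePipeline": 300}
# IMAGES_STORE = "./images"   # download directory
```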

Free rating (1 participant, CB +7, heat +1)

苏紫方璇: +7 CB, +1 heat. Reason: analysis and discussion are welcome; the 吾爱破解 forum is better with you!


甜萝 posted on 2024-5-9 15:47
Feels like the OP could move this post to the programming-languages section
qwe5333515 posted on 2024-5-9 15:48
pastorcd posted on 2024-5-9 15:51
 OP| 到底爱不爱我不 posted on 2024-5-9 15:56
paypojie posted on 2024-5-9 15:47
Feels like the OP could move this post to the programming-languages section

Do I need to post it again in that section, or can this post be edited?
modlive posted on 2024-5-9 16:09
Uh, I thought I'd wandered into the wrong section; hands behind my back, head held high, pretending I understand as I pass by...
stone102 posted on 2024-5-9 16:10
Professional filler content?
niluelf posted on 2024-5-9 16:11
Actually you could @ a moderator to help move it~ this post is clearly not filler~
八月初三 posted on 2024-5-9 16:16
Leaving it here is fine too; it lets the filler posters see some real content
xn2113 posted on 2024-5-9 16:25
paypojie posted on 2024-5-9 15:47
Feels like the OP could move this post to the programming-languages section

Isn't this already the programming-languages section?