吾爱破解 (52pojie) - LCG - LSG | Android Cracking | Malware Analysis | www.52pojie.cn


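[Help] Scraping a static novel site with Python: re already extracts the chapter titles and content URLs, so how do I download the chapters and save them locally as TXT?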

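HadesGiant posted on 2022-11-13 23:23
Background:
Picking up from my previous post: I'm a complete beginner. I saw a lot of tutorials online about scraping web novels with Python and decided to try it myself.
First I downloaded and installed Python, then ran into problems importing third-party libraries, among other things. I finally got the code to run, but now I'm stuck again.
The tutorial I followed uses four tools to fetch and parse the novel content: requests, re, xpath, and css. On my machine, though, the code starts throwing errors at the xpath and css parsing steps, possibly because of a problem with the libraries installed on my computer.
Typing in exactly the same code just produces errors.
[Screenshots: 04.png, 05.png]

I digress. Back to the point.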
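=======Divider==========
1. Requirement
Starting from my existing code, how do I continue so that it fetches the novel content from the URLs extracted with re and saves it locally as TXT files?
2. What I have so far
Python version: python-3.9.0a1-amd64
Novel chapter list URL: https://www.yushubo.net/list_other_80726.html
The code only goes as far as the re parsing step. I don't know how to continue because the tutorial stopped there.
============Divider===============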
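[Python]
import re

import requests

url = 'https://www.yushubo.net/list_other_80726.html'  # chapter list page of the novel

# Send the request
data = requests.get(url=url).text
print(data)

# Parse with re: each match is a (relative link, chapter title) tuple
list_url = re.findall('<span><a href="(.*?)">(.*?)</a></span>', data)
for book_url in list_url:
    # The href already starts with '/', so the base URL has no trailing slash
    urls = 'https://www.yushubo.net' + book_url[0]  # chapter URL
    name = book_url[1]  # chapter title
    print(urls, name)

# Example of a matched line:
# <span><a href="/read_105142_1.html">第一章:刚醒来就要当巫?(求收藏)</a></span>
# [] ---- list
# () ---- tuple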


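===============Divider===========================
The code above produces the following output:
[Screenshot: 03.png]
Could someone continue this for a couple of lines? Logically, I now have the content URL and the title of every chapter, so all that's left is to extract the text and save it.
I'm a beginner, though. I can copy a working example, but explanations that get too deep go over my head. You could write the rest for me, or tell me what to do next; I'll do it and report back on my progress.
Think of yourself as a remote supervisor. Wishing you a long, peaceful, and prosperous life!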





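天真Aro posted on 2022-11-14 00:48
import os
import re

import requests
from bs4 import BeautifulSoup  # installed with: pip install beautifulsoup4

url = 'https://www.yushubo.net/list_other_80726.html'  # chapter list page of the novel
save_dir = 'E:/little_story'
os.makedirs(save_dir, exist_ok=True)  # create the output folder if it does not exist

# Send the request
data = requests.get(url=url).text

# Parse with re: each match is a (relative link, chapter title) tuple
list_url = re.findall('<span><a href="(.*?)">(.*?)</a></span>', data)
for book_url in list_url:
    urls = 'https://www.yushubo.net' + book_url[0]  # chapter URL
    name = book_url[1]  # chapter title
    # The 'lxml' parser requires the lxml package to be installed
    soup = BeautifulSoup(requests.get(urls).text, 'lxml')
    content = soup.find('div', class_='article-con')
    if content is None:  # the page layout differs or the request was blocked
        print('Skipping ' + name + ': content not found')
        continue
    print('Downloading ' + name)
    # 'w' overwrites the file, so re-running the script does not duplicate text
    with open(f'{save_dir}/{name}.txt', mode='w', encoding='utf-8') as f:
        f.write(content.get_text())
    print('Finished ' + name)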
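调味包 posted on 2022-11-14 08:35
OP | HadesGiant posted on 2022-11-14 08:50
In reply to 天真Aro (posted on 2022-11-14 00:48):

Awesome, thank you! I read another article that mentioned using i in place of some part of the parsing, but I couldn't tell which part it replaced. Seeing your code, I think I understand now, haha. I'll try it in a bit. Much appreciated!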
cyy571 posted on 2022-11-14 08:53
Nice. Keep in mind that some sites require authentication.
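What authentication involves depends on the site. One common case, assumed here purely for illustration, is that the server expects a browser-like `User-Agent` or a login cookie copied from the browser. The sketch below shows how such headers could be passed to `requests.get`; `build_headers` and `fetch` are hypothetical helper names, and the cookie value is a placeholder.

```python
def build_headers(cookie=None, user_agent="Mozilla/5.0"):
    """Build request headers; the cookie is only sent when one is given."""
    headers = {"User-Agent": user_agent}
    if cookie:
        headers["Cookie"] = cookie
    return headers


def fetch(url, cookie=None, timeout=10):
    """Download a page with the headers above and return its text."""
    import requests  # third-party; imported here so build_headers works without it

    response = requests.get(url, headers=build_headers(cookie), timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


# Placeholder cookie value, for illustration only.
print(build_headers(cookie="sessionid=PLACEHOLDER"))
```

Sites that use a real login form or token-based authentication need more than this, so treat it as a starting point only.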
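lansemeiying posted on 2022-11-14 09:03
I need this too, haha.
OP | HadesGiant posted on 2022-11-14 09:11
In reply to kytion (posted on 2022-11-14 09:00):
"This is completely over my head! Impressive! I'll learn from you."

Take it slowly. I'm also just copying examples and starting from zero.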
OP | HadesGiant posted on 2022-11-14 09:27
In reply to 天真Aro (posted on 2022-11-14 00:48):

I can't install this library on my machine either. The output is below:
Microsoft Windows [Version 10.0.19044.2251]
(c) Microsoft Corporation. All rights reserved.

C:\Users\Le'novo>C:\Users\Le'novo\PycharmProjects\venv\Scripts\activate.bat

(venv) C:\Users\Le'novo>pip install BeautifulSoup
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting BeautifulSoup
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/40/f2/6c9f2f3e696ee6a1fb0e4d7850617e224ed2b0b1e872110abffeca2a09d4/BeautifulSoup-3.2.2.tar.gz (32 kB)
  Preparing metadata (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: 'C:\Users\Le'"'"'novo\PycharmProjects\venv\Scripts\python.exe' -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = "C:\\Users\\Le'"'"'novo\\AppData\\Local\\Temp\\pip-install-pvseyqd8\\beautifulsoup_d58efd557d52466aaef34141dd7fdf49\\setup.py"; __file__="C:\\Users\\Le'"'"'novo\\AppData\\Local\\Temp\\pip-install-pvseyqd8\\beautifulsoup_d58efd557d52466aaef34141dd7fdf49\\setup.py";f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\Le'"'"'novo\AppData\Local\Temp\pip-pip-egg-info-yyivd7j_'
       cwd: C:\Users\Le'novo\AppData\Local\Temp\pip-install-pvseyqd8\beautifulsoup_d58efd557d52466aaef34141dd7fdf49\
  Complete output (6 lines):
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "C:\Users\Le'novo\AppData\Local\Temp\pip-install-pvseyqd8\beautifulsoup_d58efd557d52466aaef34141dd7fdf49\setup.py", line 3
      "You're trying to run a very old release of Beautiful Soup under Python 3. This will not work."<>"Please use Beautiful Soup 4, available through the pip package 'beautifulsoup4'."
                                                                                                      ^
  SyntaxError: invalid syntax
  ----------------------------------------
WARNING: Discarding https://pypi.tuna.tsinghua.edu.cn/packages/40/f2/6c9f2f3e696ee6a1fb0e4d7850617e224ed2b0b1e872110abffeca2a09d4/BeautifulSoup-3.2.2.tar.gz#sha256=a04169602bff6e3138b1259dbbf491f5a27f9499dea9a8fbafd48843f9d89970 (from https://pypi.tuna.tsinghua.edu.cn/simple/beautifulsoup/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/1e/ee/295988deca1a5a7accd783d0dfe14524867e31abb05b6c0eeceee49c759d/BeautifulSoup-3.2.1.tar.gz (31 kB)
  Preparing metadata (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: 'C:\Users\Le'"'"'novo\PycharmProjects\venv\Scripts\python.exe' -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = "C:\\Users\\Le'"'"'novo\\AppData\\Local\\Temp\\pip-install-pvseyqd8\\beautifulsoup_6bd99853e19641d98101a8f96dde0420\\setup.py"; __file__="C:\\Users\\Le'"'"'novo\\AppData\\Local\\Temp\\pip-install-pvseyqd8\\beautifulsoup_6bd99853e19641d98101a8f96dde0420\\setup.py";f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\Le'"'"'novo\AppData\Local\Temp\pip-pip-egg-info-lge_pehj'
       cwd: C:\Users\Le'novo\AppData\Local\Temp\pip-install-pvseyqd8\beautifulsoup_6bd99853e19641d98101a8f96dde0420\
  Complete output (6 lines):
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "C:\Users\Le'novo\AppData\Local\Temp\pip-install-pvseyqd8\beautifulsoup_6bd99853e19641d98101a8f96dde0420\setup.py", line 22
      print "Unit tests have failed!"
            ^
  SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Unit tests have failed!")?
  ----------------------------------------
WARNING: Discarding https://pypi.tuna.tsinghua.edu.cn/packages/1e/ee/295988deca1a5a7accd783d0dfe14524867e31abb05b6c0eeceee49c759d/BeautifulSoup-3.2.1.tar.gz#sha256=6a8cb4401111e011b579c8c52a51cdab970041cc543814bbd9577a4529fe1cdb (from https://pypi.tuna.tsinghua.edu.cn/simple/beautifulsoup/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/33/fe/15326560884f20d792d3ffc7fe8f639aab88647c9d46509a240d9bfbb6b1/BeautifulSoup-3.2.0.tar.gz (31 kB)
  Preparing metadata (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: 'C:\Users\Le'"'"'novo\PycharmProjects\venv\Scripts\python.exe' -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = "C:\\Users\\Le'"'"'novo\\AppData\\Local\\Temp\\pip-install-pvseyqd8\\beautifulsoup_c24f484783094be982f165c3233bf1c9\\setup.py"; __file__="C:\\Users\\Le'"'"'novo\\AppData\\Local\\Temp\\pip-install-pvseyqd8\\beautifulsoup_c24f484783094be982f165c3233bf1c9\\setup.py";f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\Le'"'"'novo\AppData\Local\Temp\pip-pip-egg-info-wjy50yf8'
       cwd: C:\Users\Le'novo\AppData\Local\Temp\pip-install-pvseyqd8\beautifulsoup_c24f484783094be982f165c3233bf1c9\
  Complete output (6 lines):
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "C:\Users\Le'novo\AppData\Local\Temp\pip-install-pvseyqd8\beautifulsoup_c24f484783094be982f165c3233bf1c9\setup.py", line 22
      print "Unit tests have failed!"
            ^
  SyntaxError: Missing parentheses in call to 'print'. Did you mean print("Unit tests have failed!")?
  ----------------------------------------
WARNING: Discarding https://pypi.tuna.tsinghua.edu.cn/packages/33/fe/15326560884f20d792d3ffc7fe8f639aab88647c9d46509a240d9bfbb6b1/BeautifulSoup-3.2.0.tar.gz#sha256=0dc52d07516c1665c9dd9f0a390a7a054bfb7b147a50b2866fb116b8909dfd37 (from https://pypi.tuna.tsinghua.edu.cn/simple/beautifulsoup/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement BeautifulSoup (from versions: 3.2.0, 3.2.1, 3.2.2)
ERROR: No matching distribution found for BeautifulSoup
WARNING: You are using pip version 21.3.1; however, version 22.3.1 is available.
You should consider upgrading via the 'C:\Users\Le'novo\PycharmProjects\venv\Scripts\python.exe -m pip install --upgrade pip' command.

(venv) C:\Users\Le'novo>
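The pip output above already names the fix: the PyPI package called `BeautifulSoup` is the old 3.x release written for Python 2, while Beautiful Soup 4 is published as `beautifulsoup4` and imported as `bs4`. As a small sketch of that difference between import name and package name, the snippet below checks which modules can be found and prints the matching install command. `REQUIRED` and `missing_packages` are hypothetical names chosen for this example.

```python
import importlib.util
import sys

# Import name -> name to pass to "pip install".
REQUIRED = {"bs4": "beautifulsoup4", "requests": "requests"}


def missing_packages(modules):
    """Return the pip package names whose import module cannot be found."""
    return [
        package
        for module, package in modules.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    # Print the command for this interpreter (the venv one if the venv is active).
    print(sys.executable, "-m", "pip", "install", *missing)
else:
    print("All required packages can be found.")
```

Note that `find_spec` only confirms the module can be located; it does not prove that importing it will succeed.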
OP | HadesGiant posted on 2022-11-14 09:30
In reply to 天真Aro (posted on 2022-11-14 00:48):

After installing the bs4 package, this is the latest error:
C:\Users\Le'novo\PycharmProjects\venv\Scripts\python.exe "C:\Users\Le'novo\PycharmProjects\pythonProject\小说下载\小说\天真Aro.py" 
Traceback (most recent call last):
  File "C:\Users\Le'novo\PycharmProjects\pythonProject\小说下载\小说\天真Aro.py", line 3, in <module>
    from bs4 import BeautifulSoup
  File "C:\Users\Le'novo\PycharmProjects\venv\lib\site-packages\bs4\__init__.py", line 37, in <module>
    from .builder import (
  File "C:\Users\Le'novo\PycharmProjects\venv\lib\site-packages\bs4\builder\__init__.py", line 627, in <module>
    from . import _lxml
  File "C:\Users\Le'novo\PycharmProjects\venv\lib\site-packages\bs4\builder\_lxml.py", line 41, in <module>
    class LXMLTreeBuilderForXML(TreeBuilder):
  File "C:\Users\Le'novo\PycharmProjects\venv\lib\site-packages\bs4\builder\_lxml.py", line 42, in LXMLTreeBuilderForXML
    DEFAULT_PARSER_CLASS = etree.XMLParser
AttributeError: 'function' object has no attribute 'XMLParser'

Process finished with exit code 1
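The traceback shows `from bs4 import BeautifulSoup` failing while bs4 loads its lxml builder, which suggests the lxml installed in this environment is broken or shadowed; the log alone does not confirm the cause. One way to keep going without bs4 or lxml is to extract the chapter text with the standard library. The sketch below assumes, as in the earlier reply, that the text sits in a `<div class="article-con">`; `DivTextExtractor` and `extract_text` are hypothetical names, and this swaps in `html.parser` in place of the BeautifulSoup call.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> that has the target class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # greater than 0 while inside the target <div>
        self.done = False   # True once the target <div> has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br" and self.depth > 0:
            self.parts.append("\n")  # keep line breaks inside the chapter text
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1  # a nested <div> inside the target
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_text(html, target_class="article-con"):
    """Return the text of the first <div class="..."> matching target_class."""
    parser = DivTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


sample = '<div class="nav">menu</div><div class="article-con">First line<br>Second line</div>'
print(extract_text(sample))
```

In the download loop, `extract_text(requests.get(urls).text)` would take the place of the `BeautifulSoup(...)` and `soup.find(...)` lines.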
我今天是大佬 posted on 2022-11-14 09:49
Use Anaconda; it saves you a lot of environment headaches.