[Python Source] A Simple Novel Scraper in Python

pzx521521 · Posted 2018-12-1 13:39

Last edited by pzx521521 on 2018-12-1 13:40

Description

A very simple script for scraping a novel chapter by chapter. Beginners may find it a useful reference.

Required packages

python3
import requests
from bs4 import BeautifulSoup
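
For beginners: the import name bs4 is installed from PyPI under the name beautifulsoup4, which is easy to trip over. Below is a minimal sketch for checking that both packages can be imported before running the script. The names REQUIRED and missing_packages are made up for this illustration and are not part of the original post.

```python
import importlib.util

# Import name -> PyPI package name (the two differ for bs4).
REQUIRED = {"requests": "requests", "bs4": "beautifulsoup4"}


def missing_packages(required=REQUIRED):
    """Return the PyPI names of packages whose import name cannot be found."""
    return [
        pip_name
        for import_name, pip_name in required.items()
        if importlib.util.find_spec(import_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All required packages are available")
```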

PS

The original thread required too many points, so I wrote my own version.
Reference

Code


import requests
import os
from bs4 import BeautifulSoup
# Basic imports above

# Global variables
url_pre = "https://www.xiaoshuo2016.com"
start_url = "https://www.xiaoshuo2016.com/171445/44104877.html"  # URL of the novel's first chapter, e.g. http://www.quanshuwang.com/book/44/44683/15379609.html

file_name = "wwyxhqc.txt"  # Name of the output file; digits or ASCII letters are recommended, e.g. 重生于非凡岁月 could be saved as csffsy.txt

# To use the script, only the variables above need to be changed

header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
count = 0  # Counter for the number of chapters downloaded

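# function: fetch each chapter's text and write it to a file, following the "next chapter" links
def getContent(content_url):

    global count
    # Loop instead of recursing so that long novels do not hit Python's recursion limit
    while content_url:
        res = requests.get(content_url, headers=header, timeout=10)
        res.encoding = 'gbk'

        soup = BeautifulSoup(res.text, 'html.parser')
        title_box = soup.select('.articleTitle')
        content_box = soup.select('.articleCon')
        if not title_box or not content_box:
            # The page does not have the expected layout; stop instead of raising IndexError
            print("Could not parse " + content_url + ", stopping")
            return False
        count = count + 1  # Increment the chapter counter
        title = title_box[0].find('h2').text  # Chapter title
        content = content_box[0].text  # Chapter body
        both = title + content
        with open(os.path.join("result", title + ".txt"), "w", encoding='utf-8') as f:
            f.write(both)
        print("Downloaded chapter " + str(count))  # Progress message on screen
        # URL of the next chapter
        next_url = soup.select('.page')[0].find('a').next_sibling.next_sibling.get('href')

        if not next_url:
            return False
        content_url = url_pre + next_url
    return False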

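# MAIN
if __name__ == '__main__':
    File_Path = os.path.join(os.getcwd(), 'result')  # Create the result folder in the current working directory if it does not exist
    if not os.path.exists(File_Path):
        os.makedirs(File_Path)
    # file_name is opened (and created if missing) but nothing is written to it here;
    # getContent saves each chapter as its own file in the result folder
    with open(file_name, 'a+', encoding='utf-8') as f:
        getContent(start_url)
    print('Novel download complete!')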
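
One thing to watch: the script uses the chapter title directly as the file name. Windows does not allow the characters \ / : * ? " < > | in file names, so a title containing one of them would make open() fail. Below is a minimal sketch of one way to build a safer path. The helper name chapter_path and the numbered-prefix naming scheme are assumptions for illustration, not part of the original script.

```python
import os
import re

# Characters Windows rejects in file names, plus ASCII control characters.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def chapter_path(title, index, out_dir="result"):
    """Build a path such as result/0001_Title.txt for a chapter (hypothetical helper)."""
    # Trim surrounding whitespace first, then replace every illegal character.
    safe = _ILLEGAL.sub("_", title.strip()).strip(". ")
    if not safe:
        safe = "untitled"
    # The zero-padded index keeps the files in reading order when sorted by name.
    return os.path.join(out_dir, "{:04d}_{}.txt".format(index, safe))
```

In the script, this could stand in for the "result\\" + title + ".txt" expression, for example open(chapter_path(title, count), "w", encoding='utf-8').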

Ozyzero · Posted 2020-2-14 19:49

OP, when I run this, the program throws an error and stops at some chapter between the twenties and the forties, which is strange:

Downloaded chapter 29
/read/735/298564.html
Traceback (most recent call last):
  File "simple.py", line 61, in <module>
    getContent(content_url=start_url,fi=f)
  File "simple.py", line 52, in getContent
    return getContent(url_pre+next_url,fi)
  File "simple.py", line 52, in getContent
    return getContent(url_pre+next_url,fi)
  File "simple.py", line 52, in getContent
    return getContent(url_pre+next_url,fi)
  [Previous line repeated 26 more times]
  File "simple.py", line 30, in getContent
    tit = soup.select('.title')[0].find('h1').text#get chapter title
IndexError: list index out of range

君莫笑WXH · Posted 2019-4-6 11:35

I get an error here:
Traceback (most recent call last):
  File "D:/PycharmProjects/untitled5/11.py", line 46, in <module>
    getContent(start_url)
  File "D:/PycharmProjects/untitled5/11.py", line 27, in getContent
    title = soup.select('.articleTitle')[0].find('h2').text #get chapter title
IndexError: list index out of range
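
Note on the two tracebacks above: both end in the same IndexError, which means soup.select(...) returned an empty list, i.e. the fetched page did not contain the element the script expects. The thread does not say why. One guess (an assumption, not confirmed here) is that the site sometimes returns an error or throttling page after many quick requests. Below is a minimal sketch of retrying with a pause under that assumption. fetch_with_retry and the fetch callable are hypothetical names, and fetch is assumed to return None when the page cannot be parsed.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-None result or attempts run out.

    fetch is any callable that returns the parsed chapter, or None when the
    page does not have the expected layout (a hypothetical contract).
    """
    for attempt in range(1, attempts + 1):
        result = fetch(url)
        if result is not None:
            return result
        if attempt < attempts:
            # Wait a little longer after each failed attempt.
            sleep(delay * attempt)
    return None
```

If every attempt fails, the function returns None, so the caller can stop cleanly and report the URL instead of crashing partway through the book.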

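止语 · Posted 2018-12-1 13:49

What a coincidence, I wrote one like this not long ago. I posted it to the forum too, though without as many comments.

董大大。 · Posted 2018-12-1 13:59

Looks good.

forevergo · Posted 2018-12-1 14:22

Thanks for sharing.

xw101099038 · Posted 2018-12-1 15:04

Can it download any novel?

yudi1235 · Posted 2018-12-1 15:42

I'm learning web scraping right now, thanks for sharing.

lijt16 · Posted 2018-12-1 18:03

Thanks a lot for sharing, learned something~

Harry不白 · Posted 2018-12-1 19:03

Bookmarking this...

林家小哥aa · Posted 2018-12-1 19:34

Thanks for sharing.

su253 · Posted 2018-12-1 19:36

Looks good.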
