How to rip an e-book from Yunzhan (云展网, yunzhan365.com)

钱途迷茫 · posted on 2022-12-31 18:13

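I'm asking for a method, not for someone to rip it for me.
The links I used to capture were named in sequence, which was easy to handle. Now the links follow no pattern at all, and copying and saving them page by page is far too much work. I'm looking for an efficient solution.
https://book.yunzhan365.com/zaidx/hxmv/mobile/index.html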

5368 · posted on 2022-12-31 18:13

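Well~ you did say you wanted a method.
A Python script packaged as an exe may be flagged as a virus by some antivirus software,
so I'm leaving the source code here: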

[Python]

from io import BytesIO
import os
import re

import requests
from PIL import Image

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}

ss = requests.session()
z = {'title': '', 'page': 0, 'p': []}


def getimg(n, s):
    # Download one page image and open it with Pillow
    u = f"https://book.yunzhan365.com/zaidx/hxmv/files/large/{s}"
    _s = ss.get(u, headers=headers)
    _s.raise_for_status()
    # Normalize to RGB so every page has a mode the PDF writer accepts
    img = Image.open(BytesIO(_s.content)).convert('RGB')
    print(n, end=',', flush=True)
    return img


def x0(u):
    res0 = ss.get(u, headers=headers)
    # index.html references config.js with a token after the "?"
    _c = re.findall(r'src="javascript/config\.js\?(.+?)"></script>', res0.text, re.S)
    if _c:
        u0 = f'https://book.yunzhan365.com/zaidx/hxmv/mobile/javascript/config.js?{_c[0]}'
        res1 = ss.get(u0, headers=headers)
        title = re.findall(r'"title":"(.+?)"', res1.text)[0].encode().decode("unicode_escape")
        # Every page entry carries its image file name as "n":["<name>"]
        names = re.findall(r'"n":\["(.+?)"\]', res1.text)
        print(f"{title} / {len(names)} pages in total")
        for n, i in enumerate(names):
            z['p'].append(getimg(n, i))
        print('')
        print("Building and merging into a PDF...")
        # The first image is the base; the rest are appended as further pages
        z['p'][0].save(f"./{title}.pdf", "PDF", resolution=100.0, save_all=True, append_images=z['p'][1:])
        print(os.path.join(os.getcwd(), f"{title}.pdf"))
    else:
        print(u, 'not recognized')


if __name__ == '__main__':
    _id = input("Enter the book URL [e.g. https://book.yunzhan365.com/zaidx/hxmv/mobile/index.html]\n")
    x0(_id)

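Dependencies: pip install pillow requests
Then just run: python yunzhan.py

Dependencies and steps for packaging it as an exe:
To package with pyinstaller, install it first: pip install pyinstaller
Then run: pyinstaller -F yunzhan.py
The exe you need will be in the dist directory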

5368 · posted on 2022-12-31 18:24

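Haha, here I am.
It's quite simple~
1. Open the developer tools with F12.
2. Refresh the page, flip forward a few pages, and look at what the most recent network requests are fetching.
3. If the pages are served as images, there is usually a list of the images for all pages somewhere.
4. Look back through the XHR requests, and even through JS files with unusual paths.
5. In https://book.yunzhan365.com/zaidx/hxmv/mobile/javascript/config.js?VAWSLzFTvgCWKoouv0cMeg== find the image list, build the URLs, download the images, then convert them to a PDF.
6. Alternatively, copy the file name of one image and search for it to see which request it appeared in.
7. If you need it, I wrote some Python code a few days ago that I can post for you to look at, though the code is rather ugly~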
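
Steps 3 to 5 above can be sketched roughly as follows. This is only an illustration: `page_urls` is a hypothetical helper name, `sample_config` is a made-up fragment shaped the way the regex used in this thread expects, and the file names in it are invented. The `files/large/` path is the one that appears in the code posted in this thread.

```python
import re


def page_urls(config_text, book_base):
    """Return the full image URL for every page named in the config.js text."""
    # Assumption: each page entry looks like "n":["<file name>"]
    names = re.findall(r'"n":\["(.+?)"\]', config_text)
    return [f"{book_base}/files/large/{name}" for name in names]


# Made-up sample; a real config.js is much longer.
sample_config = '{"fliphtml5_pages":[{"n":["a1b2.jpg"]},{"n":["c3d4.jpg"]}]}'
urls = page_urls(sample_config, "https://book.yunzhan365.com/zaidx/hxmv")
for url in urls:
    print(url)
```

Keeping the extraction in a pure function like this makes it possible to check the regex against a saved copy of config.js before downloading anything.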

5368 · posted on 2022-12-31 18:38

Here's the code~

[Python]

from io import BytesIO
import re

import requests
from PIL import Image

u = 'https://book.yunzhan365.com/zaidx/hxmv/mobile/javascript/config.js?VAWSLzFTvgCWKoouv0cMeg=='
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}
# Path where the PDF is saved
PATH = '/home/jet/jupyter/云展/人文社科.pdf'
ss = requests.session()
res1 = ss.get(u, headers=headers)
z = {'title': '', 'page': 0, 'p': []}


def getimg(s):
    # Build the image URL from the file name, download it and open it
    url = f"https://book.yunzhan365.com/zaidx/hxmv/files/large/{s}"
    _s = ss.get(url, headers=headers)
    # Normalize to RGB so every page has a mode the PDF writer accepts
    img = Image.open(BytesIO(_s.content)).convert('RGB')
    return img


# Every page entry in config.js carries its image file name as "n":["<name>"]
for i in re.findall(r'"n":\["(.+?)"\]', res1.text):
    z['p'].append(getimg(i))
# The first image is the base; the rest are appended as further pages
z['p'][0].save(PATH, "PDF", resolution=100.0, save_all=True, append_images=z['p'][1:])

钱途迷茫 · posted on 2022-12-31 19:46

5368 wrote on 2022-12-31 18:38:
Here's the code~

Is there a version packaged as an exe? I don't have a Python environment, so I can't run it.

baishushe1234 · posted on 2022-12-31 21:40

Just passing by to learn. Thanks a lot.

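钱途迷茫 · posted on 2023-1-1 18:53

5368 wrote on 2022-12-31 18:13:
Well~ you did say you wanted a method.
A Python script packaged as an exe may be flagged as a virus by some antivirus software,
so I'm leaving the source code here:

I installed both dependencies, but running it gives an error. What is going on?
C:\py\Scripts\python.exe C:\Users\11333\PycharmProjects\pythonProject\yunzhan.py
Traceback (most recent call last):
  File "C:\Users\11333\PycharmProjects\pythonProject\yunzhan.py", line 1, in <module>
    from PIL import Image
  File "C:\Users\11333\anaconda3\lib\site-packages\PIL\Image.py", line 100, in <module>
    from . import _imaging as core
ImportError: DLL load failed while importing _imaging: 找不到指定的模块。
(The Windows message at the end means "The specified module could not be found.")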

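钱途迷茫 · posted on 2023-1-1 20:25

5368 wrote on 2022-12-31 18:13:
Well~ you did say you wanted a method.
A Python script packaged as an exe may be flagged as a virus by some antivirus software,
so I'm leaving the source code here:

I finally got it running, but it can't download anything once I switch to a different link. I also changed the key part of the links in the source (/zaidx/hxmv/), and it still doesn't work.
For example, this one: https://book.yunzhan365.com/ssmty/ndlx/mobile/index.html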
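
One way to avoid editing the hard-coded `/zaidx/hxmv/` by hand is to read the two path segments from whatever URL is entered. A minimal sketch, assuming every book URL has the form `https://book.yunzhan365.com/<a>/<b>/mobile/index.html`, as both links in this thread do; `book_base` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def book_base(url):
    """Return scheme, host and the first two path segments of a book URL."""
    parts = urlparse(url)
    # Drop the empty strings produced by leading or doubled slashes
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        raise ValueError(f"unexpected book URL: {url}")
    return f"{parts.scheme}://{parts.netloc}/{segments[0]}/{segments[1]}"


print(book_base("https://book.yunzhan365.com/ssmty/ndlx/mobile/index.html"))
```

Note that this only removes the need to edit the path; it does not help if the book's config.js has a different layout from the one the script expects.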

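5368 · posted on 2023-1-1 21:28

钱途迷茫 wrote on 2023-1-1 20:25:
I finally got it running, but it can't download anything once I switch to a different link. I also changed the key part of the links in the source (/zaidx/hxmv/), and it still doesn't work ...

This book is the PDF kind. I'll see whether I can find the link to the PDF; if not, I'll build the PDF from the page images again.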

5368 · posted on 2023-1-1 22:57

Last edited by 5368 on 2023-1-1 22:58

The code below handles both versions of books~

[Python]

from io import BytesIO
import json
import os
import re
from urllib.parse import urlparse

import requests
from PIL import Image

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}

ss = requests.session()
z = {'title': '', 'page': 0, 'p': []}


def getimg(n, s):
    # Download one page image from the full URL s and open it with Pillow
    _s = ss.get(s, headers=headers)
    _s.raise_for_status()
    # Normalize to RGB so every page has a mode the PDF writer accepts
    img = Image.open(BytesIO(_s.content)).convert('RGB')
    print(n, end=',', flush=True)
    return img


def x0(u):
    res0 = ss.get(u, headers=headers)
    # index.html references config.js with a token after the "?"
    _c = re.findall(r'src="javascript/config\.js\?(.+?)"></script>', res0.text, re.S)
    # Path segments of the book URL, e.g. ['', 'ssmty', 'ndlx', 'mobile', 'index.html']
    _u = urlparse(u).path.split('/')
    if _c:
        # e.g. https://book.yunzhan365.com/ssmty/ndlx/mobile/javascript/config.js?8aOwkJDwk7PdlwLE/fL1ig==
        u0 = f'https://book.yunzhan365.com/{_u[1]}/{_u[2]}/mobile/javascript/config.js?{_c[0]}'
        res1 = ss.get(u0, headers=headers)
        # config.js contains one JSON object followed by ";"; parse the {...} part
        res = json.loads(re.findall(r'(\{.*\});', res1.text)[0])
        title = res['meta']['title']
        pagelist = []
        if res.get('fliphtml5_pages', 0):
            # Version 1: explicit page list with irregular file names
            print("Found html5 page list, fetching images...")
            for i in res['fliphtml5_pages']:
                pagelist.append(i['n'][0])
        else:
            # Version 2: pages are named 1.jpg, 2.jpg, ... up to totalPageCount
            for i in range(1, res['bookConfig']['totalPageCount'] + 1):
                pagelist.append(f"{i}.jpg")
        print(f"{title} / {len(pagelist)} pages in total")
        # largePath may be a list or a plain string
        _path = res['bookConfig']['largePath']
        if isinstance(_path, list):
            _path = _path[0]
        # Strip the leading ".." so the path can be appended to the book URL
        _path = _path.replace('..', '')
        for n, i in enumerate(pagelist, 1):
            _i = f"https://book.yunzhan365.com/{_u[1]}/{_u[2]}{_path}{i}"
            z['p'].append(getimg(n, _i))
        print('')
        print("Building and merging into a PDF...")
        z['p'][0].save(f"./{title}.pdf", "PDF", resolution=100.0, save_all=True, append_images=z['p'][1:])
        print(os.path.join(os.getcwd(), f"{title}.pdf"))
    else:
        print(u, 'not recognized')


if __name__ == '__main__':
    _id = input("Enter the book URL [e.g. https://book.yunzhan365.com/zaidx/hxmv/mobile/index.html]\n")
    x0(_id)

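Some variables in the code I posted earlier got mangled when I merged and copied it out, so treat this version as the correct one. It hasn't been tested on many pages either; it only handles the two versions seen so far. If you come across another version, let me know and I'll add a rule for it~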
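
The two rules that the script above applies can be pulled out into one small function, which may make it easier to add a rule when another book version turns up. A sketch only: `build_page_list` is a hypothetical name, and the two sample dictionaries are made up to mirror the keys the script reads (`fliphtml5_pages` and `bookConfig.totalPageCount`); real config files contain many more fields.

```python
def build_page_list(config):
    """Return the page image file names for either known book version."""
    pages = config.get("fliphtml5_pages")
    if pages:
        # Version 1: explicit list, the file name is the first item of "n"
        return [page["n"][0] for page in pages]
    # Version 2: pages are numbered 1.jpg .. N.jpg
    total = config["bookConfig"]["totalPageCount"]
    return [f"{i}.jpg" for i in range(1, total + 1)]


# Made-up samples, one per version
html5_book = {"fliphtml5_pages": [{"n": ["a1b2.jpg"]}, {"n": ["c3d4.jpg"]}]}
numbered_book = {"bookConfig": {"totalPageCount": 3}}
print(build_page_list(html5_book))
print(build_page_list(numbered_book))
```

A third version would then only need another branch here, leaving the download and PDF steps untouched.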
