吾爱破解 - 52pojie.cn


[Download/Transfer] Asking for a way to download PDF files from a government website

RobLcci posted on 2022-5-18 11:23
Reward: 50 吾爱币
For work reasons I need to download a large number of project PDF files from the China PPP Center (财政部政府和社会资本合作中心) website, but the site only offers the PDFs for preview, and the files that come out of printing are illegible. Does anyone have a good way to do this? Many thanks.
https://www.cpppc.org:8082/inforpublic/homepage.html#/preview/003520200118114206088tgb00009m1axjy
https://www.cpppc.org:8082/inforpublic/homepage.html#/projectDetail/59b8bb2d044e478d8dbbf998118fe6ea
The site URLs are above. Thanks again to anyone who can help.

Best answer: see 0xSui's reply below.


0xSui posted on 2022-5-18 11:23
I took a look: the organization library (机构库) has one extra layer of organizations, and the projects under it use the same data structure as the management-library projects (管理库项目) and the reserve list (储备清单).
I rebuilt the exe; download it from https://wwu.lanzouy.com/iHoDB053ivoh and it runs directly, with all files saved automatically into the download folder.
Updated program code:
[Python] code:
import multiprocessing
import os
import random
import sys
 
import time
import requests
import json
from bs4 import BeautifulSoup
import xmltodict
import urllib3
urllib3.disable_warnings()
from faker import Faker
fake = Faker(locale='zh_CN')
 
 
def get_headers():
    headers = {
        'Content-Type': 'application/json',
        'Referer': 'https://www.cpppc.org:8082/inforpublic/homepage.html',
        'User-Agent': fake.user_agent()
    }
    return headers
 
 
def get_payload(page_num, payload_type):
    payload = ''
    if payload_type == 'org':
        payload = json.dumps({
            "name": "",
            "industry": "",
            "pageNumber": page_num,
            "size": 5,
            "service_types": "",
            "level": "",
            "dist_province": "",
            "dist_city": "",
            "dist_code": "",
            "nlpVO": {},
            "org_name_pinyin_order": "asc"
        })
        return payload
    if payload_type == 'proj':
        payload = json.dumps({
            "name": "",
            "industry": "",
            "min": 0,
            "max": 10000000000000000,
            "pageNumber": page_num,
            "size": 5,
            "level": "",
            "start": "",
            "end": "",
            "dist_province": "",
            "dist_city": "",
            "dist_code": "",
            "nlpVO": {},
            "created_date_order": "desc"
        })
    return payload
 
 
def request_method(request_type, request_url, headers, payload, is_stream):
    with requests.Session() as s:
        status = 0
        count = 0
        while status != 200:
            if count != 0:
                time.sleep(random.randint(1, 3))
            count = count + 1
            try:
                resp = s.request(request_type, request_url, headers=headers, data=payload, timeout=5, stream=is_stream, verify=False)
                status = resp.status_code
            except Exception as e:
                print(f'Network error: {e}')
                time.sleep(random.randint(1, 3))
        if is_stream:
            return resp
        else:
            return resp.json()
 
 
def get_proj(proj_base_url, page_num, msg_queue):
    json_result = request_method("POST", proj_base_url, headers=get_headers(), payload=get_payload(page_num, 'proj'), is_stream=False)
    result_list = json_result.get('data').get('hits')
    if len(result_list) > 0:
        for item in result_list:
            proj_name = item.get('proj_name')
            proj_rid = item.get('proj_rid')
            result = [proj_rid, proj_name]
            msg_queue.put(result)
        page_num = page_num + 1
        return True, page_num
    else:
        msg_queue.put(['finish', 0])
        return False, page_num
 
 
def download_org(msg_queue, org_base_url):
    have_org_more = True
    page_num = 1
    while have_org_more:
        json_result = request_method("POST", org_base_url, headers=get_headers(), payload=get_payload(page_num, 'org'), is_stream=False)
        org_list = json_result.get('data').get('hits')
        if len(org_list) > 0:
            for org in org_list:
                org_no = org.get('org_no')
                my_count = 0
                # The projects under each organization are paginated separately,
                # so track them with their own counter instead of reusing the outer page_num
                child_page_num = 1
                has_child_more = True
                while has_child_more:
                    child_url = f"https://www.cpppc.org:8082/api/pub/organization/consulting/project/list?orgNo={org_no}&pageNumber={child_page_num}&pageSize=10"
                    child_json = request_method(request_type='GET', request_url=child_url, headers='', payload='', is_stream=False)
                    proj_list = child_json.get('data').get('currentPageResult')
                    total_count = child_json.get('data').get('totalCount')
                    my_count = my_count + len(proj_list)
                    if len(proj_list) > 0 and my_count <= total_count:
                        for proj in proj_list:
                            proj_name = proj.get('projectName')
                            proj_rid = proj.get('projectId')
                            result = [proj_rid, proj_name]
                            msg_queue.put(result)
                        child_page_num = child_page_num + 1
                    else:
                        has_child_more = False
            page_num = page_num + 1
        else:
            have_org_more = False
    print(f'Finished fetching: {org_base_url}')
 
 
def download_data(msg_queue, project_url):
    have_more = True
    page_num = 1
    while have_more:
        time.sleep(random.randint(1, 3))
        have_more, page_num = get_proj(proj_base_url=project_url, page_num=page_num, msg_queue=msg_queue)
    print(f'Finished fetching: {project_url}')
 
 
def analysis_and_download(msg_queue):
    while True:
        if msg_queue.empty():
            print('analysis_and_download: queue is empty, waiting')
            time.sleep(random.randint(1, 3))
        else:
            # Take one [proj_rid, proj_name] task from the queue
            data = msg_queue.get()
            if data[0] == 'finish':
                # Sentinel pushed by a producer when it runs out of pages; nothing to download
                continue
            page_json = request_method("GET", f"https://www.cpppc.org:8082/api/pub/project/prepare-detail/{data[0]}", headers=get_headers(), payload='', is_stream=False)
            root = {"root": page_json}
            xml = xmltodict.unparse(root, pretty=False)
            bs = BeautifulSoup(xml, 'lxml')
            all_attachs = bs.findAll("attachs")
            for attachs in all_attachs:
                fileid = attachs.find('fileid').text
                filename = attachs.find('filename').text
                print(f'File ID: {fileid}')
                print(f'File name: {filename}')
                # Directory of the running script / exe
                program_path = os.path.dirname(os.path.realpath(sys.argv[0]))
                replace_project_name = data[1].replace(" ", "_")
                # Files are saved under a "下载" (download) folder next to the program
                project_dir_path = os.path.join(program_path, '下载', replace_project_name)
                if not os.path.exists(project_dir_path):
                    os.makedirs(project_dir_path)
                if not os.path.isfile(os.path.join(project_dir_path, filename)):
                    download_url = f'https://www.cpppc.org:8082/api/pdfs/front/download/{fileid}?token=null&appId=public'
                    print(f'Download link: {download_url}')
                    download_file(download_url, project_dir_path, filename)
                else:
                    print('File already downloaded, skipping')
 
 
def download_file(download_url, dir_path, file_name):
    r = request_method('GET', download_url, headers='', payload='', is_stream=True)
    # Read the streamed response body
    content = r.content
    # Write it to the target file
    file_path = os.path.join(dir_path, file_name)
    with open(file_path, 'wb') as f:
        f.write(content)
 
 
def start():
    """
    Other public endpoints of the same site (not crawled here):
    # Expert library
    https://www.cpppc.org:8082/api/pub/experts/search
    # Project reports
    https://www.cpppc.org:8082/api/pub/project-report/search
    """
    # Management-library projects
    search_url = "https://www.cpppc.org:8082/api/pub/project/search"
    # Reserve list
    search_store_url = "https://www.cpppc.org:8082/api/pub/project/search-store"
    # Organization library
    org_url = "https://www.cpppc.org:8082/api/pub/organization/search"
    # Shared task queue
    q1 = multiprocessing.Queue()
    # Producers
    p1 = multiprocessing.Process(target=download_data, args=(q1, search_url,))
    p2 = multiprocessing.Process(target=download_data, args=(q1, search_store_url,))
    p3 = multiprocessing.Process(target=download_org, args=(q1, org_url,))
    # Consumer
    customer = multiprocessing.Process(target=analysis_and_download, args=(q1,))
    p1.start()
    p2.start()
    p3.start()
    customer.start()
 
 
if __name__ == '__main__':
    if sys.platform.startswith('win'):
        multiprocessing.freeze_support()
    start()

Dependencies:
altgraph==0.17.2
beautifulsoup4==4.11.1
certifi==2021.10.8
charset-normalizer==2.0.12
Faker==13.11.1
future==0.18.2
idna==3.3
importlib-metadata==4.11.3
lxml==4.8.0
pefile==2021.9.3
pyinstaller==5.1
pyinstaller-hooks-contrib==2022.5
python-dateutil==2.8.2
pywin32-ctypes==0.2.0
requests==2.27.1
six==1.16.0
soupsieve==2.3.2.post1
typing_extensions==4.2.0
urllib3==1.26.9
xmltodict==0.13.0
zipp==3.8.0
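To set this up locally, something along these lines should work (assuming the list above is saved as requirements.txt and the script as cpppc_download.py; both file names are placeholders of mine, and the PyInstaller entry in the list suggests the shared exe was built roughly with the second command):

pip install -r requirements.txt
pyinstaller --onefile cpppc_download.py
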
涛之雨 posted on 2022-5-18 12:12
Open the page, press F12, switch to the "Network" tab, and reload the page.

Among the requests, look for a link like

https://www.cpppc.org/api/pdfs/front/download/003520200118114206088tgb00009m1axjy?token=null&appId=public

Right-click it, save it locally, and rename the file to .pdf.

The server seems to check something; opening the link directly gives an error (probably the Referer).

I don't have a computer at hand to demonstrate, but if you follow the steps above exactly you should get the download.
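If right-clicking to save doesn't work in your browser, the same idea can be scripted: request the download link with a Referer header pointing back at the site. A minimal sketch under that assumption (the Referer guess above), reusing the file ID from the thread:

[Python] code:
import requests
import urllib3

urllib3.disable_warnings()  # the full script above also skips certificate verification

# Download link as seen in the browser's Network tab (file ID taken from the thread)
url = ("https://www.cpppc.org:8082/api/pdfs/front/download/"
       "003520200118114206088tgb00009m1axjy?token=null&appId=public")

headers = {
    # The server appears to reject requests without a matching Referer
    "Referer": "https://www.cpppc.org:8082/inforpublic/homepage.html",
    "User-Agent": "Mozilla/5.0",
}

resp = requests.get(url, headers=headers, timeout=30, verify=False)
resp.raise_for_status()

with open("document.pdf", "wb") as f:
    f.write(resp.content)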


0xSui posted on 2022-5-18 13:11
First, when the page's HTML loads it fires a request that returns JSON data containing the names and IDs of every PDF behind the preview buttons on the page:
https://www.cpppc.org/api/pub/project/prepare-detail/59b8bb2d044e478d8dbbf998118fe6ea
Then, for each PDF, you take the ID from that JSON and append it to the download URL to get the original PDF file:
https://www.cpppc.org/api/pdfs/front/download/003520200118114206088tgb00009m1axjy
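Put differently: one GET to the prepare-detail endpoint lists every attachment's name and file ID, and each file ID plugs straight into the download endpoint. A minimal sketch of that flow; the key names ("attachs", "fileId", "fileName") are inferred from how the full script above parses the response and may need adjusting:

[Python] code:
import requests
import urllib3

urllib3.disable_warnings()

BASE = "https://www.cpppc.org:8082"
PROJECT_ID = "59b8bb2d044e478d8dbbf998118fe6ea"  # from the projectDetail URL in the opening post

headers = {
    "Referer": f"{BASE}/inforpublic/homepage.html",
    "User-Agent": "Mozilla/5.0",
}

# 1. Fetch the project detail JSON, which describes the attachments
detail = requests.get(f"{BASE}/api/pub/project/prepare-detail/{PROJECT_ID}",
                      headers=headers, timeout=30, verify=False).json()


# 2. Walk the JSON looking for attachment entries
def iter_attachments(node):
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "attachs" and isinstance(value, list):
                for att in value:
                    if isinstance(att, dict):
                        yield att.get("fileId"), att.get("fileName")
            else:
                yield from iter_attachments(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_attachments(item)


# 3. Each file ID maps to a direct download URL
for file_id, file_name in iter_attachments(detail):
    if file_id:
        print(file_name, f"{BASE}/api/pdfs/front/download/{file_id}?token=null&appId=public")
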
at1636 posted on 2022-5-18 14:38
I've downloaded it for you: "shared file" https://www.aliyundrive.com/s/8noP9Dp2Gzx
Click the link to save it, or copy this message and open the 阿里云盘 (Aliyun Drive) app to view it online without downloading.
RobLcci (OP) posted on 2022-5-18 15:17
Quoting at1636 (2022-5-18 14:38):
I've downloaded it for you: "shared file" https://www.aliyundrive.com/s/8noP9Dp2Gzx
Click the link to save it, or copy this message ...

Could you show me how you downloaded it? I need to download a large number of files. Thanks.
RobLcci (OP) posted on 2022-5-18 15:18
Quoting 0xSui (2022-5-18 13:11):
First, when the page's HTML loads it fires a request that returns JSON with the names and IDs of every PDF behind the preview buttons ...

I'm a beginner and didn't quite follow. Could you post step-by-step screenshots? Thanks.
RobLcci (OP) posted on 2022-5-18 15:26
Quoting 涛之雨 (2022-5-18 12:12):
Open the page, press F12, switch to the "Network" tab, and reload the page.

Among the requests, look for a link like ...

I'm using the QQ browser and there's no "save" option in the right-click menu. What should I do?
一之太刀 posted on 2022-5-18 15:48
The URLs the OP posted download straight away when opened in Edge. Maybe it's because my browser is set not to preview PDFs inline: each one shows "PDF.js v2.0.943 (build: dc98bf76) Message: Failed to fetch", and then IDM pops up to download it. The only downside is that the file names come out garbled.

https://www.cpppc.org:8082/api/pdfs/front/download/003520200118114206088tgb00009m1axjy?token=null&appId=public   This is the letter of authorization; its preview link is
https://www.cpppc.org:8082/inforpublic/homepage.html#/preview/003520200118114206088tgb00009m1axjy

https://www.cpppc.org:8082/api/pdfs/front/download/003520200118104020819tgb0000hcu0urq?token=null&appId=public     This is the implementation plan approval; its preview link is
https://www.cpppc.org:8082/inforpublic/homepage.html#/preview/003520200118104020819tgb0000hcu0urq

https://www.cpppc.org:8082/api/pdfs/front/download/003520200118122716456tgb0000fcidbkb?token=null&appId=public      This is the fiscal affordability assessment report; its preview link is
https://www.cpppc.org:8082/inforpublic/homepage.html#/preview/003520200118122716456tgb0000fcidbkb
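As the pairs above show, the preview link and the download link share the same file ID, so a preview URL can be turned into a direct download URL by swapping the path around the ID. A small sketch of that mapping (the helper name is my own):

[Python] code:
def preview_to_download(preview_url: str) -> str:
    # .../homepage.html#/preview/<file_id>  ->  .../api/pdfs/front/download/<file_id>?token=null&appId=public
    file_id = preview_url.rsplit("/preview/", 1)[1]
    return f"https://www.cpppc.org:8082/api/pdfs/front/download/{file_id}?token=null&appId=public"


print(preview_to_download(
    "https://www.cpppc.org:8082/inforpublic/homepage.html#/preview/003520200118114206088tgb00009m1axjy"))
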
lei0730 posted on 2022-5-18 16:10
My browser is 小智 (Xiaozhi) with IDM. When I open your link and click to preview the PDF, IDM immediately pops up a download prompt.
