52pojie.cn (吾爱破解) - LCG - LSG | Android Cracking | Virus Analysis | www.52pojie.cn
[Download & Save] Looking for a way to download PDF files from a government website
RobLcci posted on 2022-5-18 11:23
Bounty: 50 CB (吾爱币)
For work I need to download a large number of project PDF files from the China PPP Center (财政部政府和社会资本合作中心) website, but the PDFs are preview-only, and printed copies come out illegible. Does anyone have a good way to do this? Thanks in advance.
https://www.cpppc.org:8082/inforpublic/homepage.html#/preview/003520200118114206088tgb00009m1axjy
https://www.cpppc.org:8082/inforpublic/homepage.html#/projectDetail/59b8bb2d044e478d8dbbf998118fe6ea
The site URLs are above. Thanks again to anyone who can help!

Best answer: see 0xSui's reply below.

0xSui posted on 2022-5-18 11:23
Took a look: the organization library has an extra organization layer, and the projects under it use the same data structure as the management-library projects and the reserve list.
I rebuilt the exe — download https://wwu.lanzouy.com/iHoDB053ivoh and run it; all files are saved automatically into the download folder.
Updated program code:
[Python]
import json
import multiprocessing
import os
import random
import sys
import time

import requests
import urllib3
import xmltodict
from bs4 import BeautifulSoup
from faker import Faker

# Suppress the InsecureRequestWarning triggered by requests with verify=False
urllib3.disable_warnings()

fake = Faker(locale='zh_CN')


def get_headers():
    # The server appears to check the Referer header, so always send one
    headers = {
        'Content-Type': 'application/json',
        'Referer': 'https://www.cpppc.org:8082/inforpublic/homepage.html',
        'User-Agent': fake.user_agent()
    }
    return headers


def get_payload(page_num, payload_type):
    payload = ''
    if payload_type == 'org':
        payload = json.dumps({
            "name": "",
            "industry": "",
            "pageNumber": page_num,
            "size": 5,
            "service_types": "",
            "level": "",
            "dist_province": "",
            "dist_city": "",
            "dist_code": "",
            "nlpVO": {},
            "org_name_pinyin_order": "asc"
        })
        return payload
    if payload_type == 'proj':
        payload = json.dumps({
            "name": "",
            "industry": "",
            "min": 0,
            "max": 10000000000000000,
            "pageNumber": page_num,
            "size": 5,
            "level": "",
            "start": "",
            "end": "",
            "dist_province": "",
            "dist_city": "",
            "dist_code": "",
            "nlpVO": {},
            "created_date_order": "desc"
        })
    return payload


def request_method(request_type, request_url, headers, payload, is_stream):
    # Retry until the server returns HTTP 200, sleeping 1-3 s between tries
    with requests.Session() as s:
        status = 0
        count = 0
        while status != 200:
            if count != 0:
                time.sleep(random.randint(1, 3))
            count = count + 1
            try:
                resp = s.request(request_type, request_url, headers=headers, data=payload, timeout=5, stream=is_stream, verify=False)
                status = resp.status_code
            except Exception as e:
                print(f'Network error: {e}')
                time.sleep(random.randint(1, 3))
        if is_stream:
            return resp
        else:
            return resp.json()


def get_proj(proj_base_url, page_num, msg_queue):
    json_result = request_method("POST", proj_base_url, headers=get_headers(), payload=get_payload(page_num, 'proj'), is_stream=False)
    result_list = json_result.get('data').get('hits')
    if len(result_list) > 0:
        for item in result_list:
            proj_name = item.get('proj_name')
            proj_rid = item.get('proj_rid')
            result = [proj_rid, proj_name]
            msg_queue.put(result)
        page_num = page_num + 1
        return True, page_num
    else:
        msg_queue.put(['finish', 0])
        return False, page_num


def download_org(msg_queue, org_base_url):
    # Page through the organization library; each organization has its own
    # paged project list, fetched with a separate counter so the outer
    # org paging is not clobbered by the inner loop.
    have_org_more = True
    org_page_num = 1
    while have_org_more:
        json_result = request_method("POST", org_base_url, headers=get_headers(), payload=get_payload(org_page_num, 'org'), is_stream=False)
        org_list = json_result.get('data').get('hits')
        if len(org_list) > 0:
            for org in org_list:
                org_no = org.get('org_no')
                my_count = 0
                child_page_num = 1
                has_child_more = True
                while has_child_more:
                    child_url = f"https://www.cpppc.org:8082/api/pub/organization/consulting/project/list?orgNo={org_no}&pageNumber={child_page_num}&pageSize=10"
                    child_json = request_method(request_type='GET', request_url=child_url, headers='', payload='', is_stream=False)
                    proj_list = child_json.get('data').get('currentPageResult')
                    total_count = child_json.get('data').get('totalCount')
                    my_count = my_count + len(proj_list)
                    for proj in proj_list:
                        proj_name = proj.get('projectName')
                        proj_rid = proj.get('projectId')
                        msg_queue.put([proj_rid, proj_name])
                    # Stop once every project has been seen (or the page is empty),
                    # otherwise the last request would repeat forever
                    if len(proj_list) == 0 or my_count >= total_count:
                        has_child_more = False
                    else:
                        child_page_num = child_page_num + 1
            org_page_num = org_page_num + 1
        else:
            have_org_more = False
    print(f'Finished: {org_base_url}')


def download_data(msg_queue, project_url):
    have_more = True
    page_num = 1
    while have_more:
        time.sleep(random.randint(1, 3))
        have_more, page_num = get_proj(proj_base_url=project_url, page_num=page_num, msg_queue=msg_queue)
    print(f'Finished: {project_url}')


def analysis_and_download(msg_queue):
    # Consumer: pull [proj_rid, proj_name] items off the queue, fetch the
    # project detail JSON, and download every attachment it lists
    while True:
        if msg_queue.empty():
            print('analysis_and_download: queue is empty')
            time.sleep(random.randint(1, 3))
        else:
            # Pull one item off the queue
            data = msg_queue.get()
            page_json = request_method("GET", f"https://www.cpppc.org:8082/api/pub/project/prepare-detail/{data[0]}", headers=get_headers(), payload='', is_stream=False)
            # Convert the JSON to XML so BeautifulSoup can pick out every
            # <attachs> node regardless of where it sits in the tree
            root = {"root": page_json}
            xml = xmltodict.unparse(root, pretty=False)
            bs = BeautifulSoup(xml, 'lxml')
            all_attachs = bs.findAll("attachs")
            for attachs in all_attachs:
                fileid = attachs.find('fileid').text
                filename = attachs.find('filename').text
                print(f'File ID: {fileid}')
                print(f'File name: {filename}')
                # Save under <program dir>/下载/<project name>/
                program_path = os.path.dirname(os.path.realpath(sys.argv[0]))
                replace_project_name = data[1].replace(" ", "_")
                project_dir_path = os.path.join(program_path, '下载', replace_project_name)
                if not os.path.exists(project_dir_path):
                    os.makedirs(project_dir_path)
                if not os.path.isfile(os.path.join(project_dir_path, filename)):
                    download_url = f'https://www.cpppc.org:8082/api/pdfs/front/download/{fileid}?token=null&appId=public'
                    print(f'Download URL: {download_url}')
                    download_file(download_url, project_dir_path, filename)
                else:
                    print('File already downloaded, skipping')


def download_file(download_url, dir_path, file_name):
    r = request_method('GET', download_url, headers='', payload='', is_stream=True)
    # Read the response body and write it to disk
    content = r.content
    file_path = os.path.join(dir_path, file_name)
    with open(file_path, 'wb') as f:
        f.write(content)


def start():
    """
    Other searchable endpoints, for reference:
    # Expert library
    https://www.cpppc.org:8082/api/pub/experts/search
    # Project reports
    https://www.cpppc.org:8082/api/pub/project-report/search
    """
    # Management-library projects
    search_url = "https://www.cpppc.org:8082/api/pub/project/search"
    # Reserve list
    search_store_url = "https://www.cpppc.org:8082/api/pub/project/search-store"
    # Organization library
    org_url = "https://www.cpppc.org:8082/api/pub/organization/search"
    # Queue shared by the three producers and the single consumer
    q1 = multiprocessing.Queue()
    # Producers
    p1 = multiprocessing.Process(target=download_data, args=(q1, search_url,))
    p2 = multiprocessing.Process(target=download_data, args=(q1, search_store_url,))
    p3 = multiprocessing.Process(target=download_org, args=(q1, org_url,))
    # Consumer
    customer = multiprocessing.Process(target=analysis_and_download, args=(q1,))
    p1.start()
    p2.start()
    p3.start()
    customer.start()


if __name__ == '__main__':
    if sys.platform.startswith('win'):
        multiprocessing.freeze_support()
    start()

Dependencies:
altgraph==0.17.2
beautifulsoup4==4.11.1
certifi==2021.10.8
charset-normalizer==2.0.12
Faker==13.11.1
future==0.18.2
idna==3.3
importlib-metadata==4.11.3
lxml==4.8.0
pefile==2021.9.3
pyinstaller==5.1
pyinstaller-hooks-contrib==2022.5
python-dateutil==2.8.2
pywin32-ctypes==0.2.0
requests==2.27.1
six==1.16.0
soupsieve==2.3.2.post1
typing_extensions==4.2.0
urllib3==1.26.9
xmltodict==0.13.0
zipp==3.8.0
涛之雨 posted on 2022-5-18 12:12
Open the page, press F12, switch to the Network tab, and refresh.

Among the requests, look for a link like

https://www.cpppc.org/api/pdfs/front/download/003520200118114206088tgb00009m1axjy?token=null&appId=public

Right-click it, save it locally, and rename it with a .pdf extension.

Opening the link directly fails — the server seems to check something (probably the Referer).

I don't have a computer at hand, so I can't demo it, but following the steps above exactly should get you the download.
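The manual steps above can also be scripted. The sketch below is a minimal illustration, assuming the URL pattern seen in the DevTools network log (the `token=null&appId=public` query string) and that the server rejects requests missing a Referer; `build_download_url` and `fetch_pdf` are hypothetical helper names, not part of the site's API.

```python
import requests


def build_download_url(file_id: str) -> str:
    # URL pattern copied from the network log; token/appId values assumed fixed
    return (
        "https://www.cpppc.org:8082/api/pdfs/front/download/"
        f"{file_id}?token=null&appId=public"
    )


def fetch_pdf(file_id: str, out_path: str) -> None:
    # The server appears to validate the Referer, so send one that points
    # back at the public homepage
    headers = {
        "Referer": "https://www.cpppc.org:8082/inforpublic/homepage.html",
    }
    resp = requests.get(build_download_url(file_id), headers=headers,
                        timeout=10, verify=False)
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)


# Example (commented out to avoid an accidental network call):
# fetch_pdf("003520200118114206088tgb00009m1axjy", "authorization.pdf")
```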

0xSui posted on 2022-5-18 13:11
First, when the homepage HTML loads it fires a request that returns JSON containing the name and file ID of every PDF behind the page's preview buttons:
https://www.cpppc.org/api/pub/project/prepare-detail/59b8bb2d044e478d8dbbf998118fe6ea
Then, for each PDF, append that ID to the download URL to fetch the original file:
https://www.cpppc.org/api/pdfs/front/download/003520200118114206088tgb00009m1axjy
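The two-step flow can be sketched without the XML round-trip used in the script: walk the detail JSON directly for attachment records. This is only an illustration — the key names `attachs`, `fileId`, and `fileName` are assumptions inferred from the script earlier in the thread, not a documented schema.

```python
# Endpoint patterns quoted from the thread; {rid}/{fid} are placeholders
DETAIL_URL = "https://www.cpppc.org:8082/api/pub/project/prepare-detail/{rid}"
DOWNLOAD_URL = ("https://www.cpppc.org:8082/api/pdfs/front/download/"
                "{fid}?token=null&appId=public")


def extract_attachments(node):
    """Recursively collect every attachment dict stored under an 'attachs' key,
    wherever it sits in the nested JSON structure."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "attachs":
                # The value may be a single object or a list of objects
                items = value if isinstance(value, list) else [value]
                found.extend(i for i in items if isinstance(i, dict))
            else:
                found.extend(extract_attachments(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(extract_attachments(item))
    return found
```

Given the parsed detail JSON, `DOWNLOAD_URL.format(fid=a["fileId"])` would then yield one download link per attachment.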
at1636 posted on 2022-5-18 14:38
Downloaded them for you — shared file: https://www.aliyundrive.com/s/8noP9Dp2Gzx (save via the link or the 阿里云盘 app).
 OP | RobLcci posted on 2022-5-18 15:17
at1636 posted on 2022-5-18 14:38:
Downloaded them for you — shared file: https://www.aliyundrive.com/s/8noP9Dp2Gzx ...

Could you show me how you downloaded them? I need to download a large batch. Thanks!
 OP | RobLcci posted on 2022-5-18 15:18
0xSui posted on 2022-5-18 13:11:
First, when the homepage HTML loads it fires a request that returns JSON containing the names and IDs of all the preview-button PDFs ...

I'm a beginner and didn't quite follow — could you post step-by-step screenshots? Thanks!
 OP | RobLcci posted on 2022-5-18 15:26
涛之雨 posted on 2022-5-18 12:12:
Open the page, press F12, switch to the Network tab, and refresh.

Among the requests, look for a link like ...

I'm using QQ Browser and right-click has no save option — what should I do?
一之太刀 posted on 2022-5-18 15:48
The URLs the OP posted start downloading as soon as I open them in Edge — probably because my browser is set not to preview PDFs inline. Each one shows "PDF.js v2.0.943 (build: dc98bf76) Message: Failed to fetch", and then IDM pops up and downloads it; the file names come out garbled, though.

https://www.cpppc.org:8082/api/pdfs/front/download/003520200118114206088tgb00009m1axjy?token=null&appId=public — this is the authorization letter; its preview link is
https://www.cpppc.org:8082/inforpublic/homepage.html#/preview/003520200118114206088tgb00009m1axjy

https://www.cpppc.org:8082/api/pdfs/front/download/003520200118104020819tgb0000hcu0urq?token=null&appId=public — this is the implementation-plan approval; its preview link is
https://www.cpppc.org:8082/inforpublic/homepage.html#/preview/003520200118104020819tgb0000hcu0urq

https://www.cpppc.org:8082/api/pdfs/front/download/003520200118122716456tgb0000fcidbkb?token=null&appId=public — this is the fiscal affordability assessment report; its preview link is
https://www.cpppc.org:8082/inforpublic/homepage.html#/preview/003520200118122716456tgb0000fcidbkb
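The pairs above suggest a simple rule: the ID after `#/preview/` is the same ID used in the download endpoint. A small converter, assuming that pattern holds (inferred from these examples, not an official API):

```python
def preview_to_download(preview_url: str) -> str:
    # Take everything after the last "/preview/" segment as the file ID
    file_id = preview_url.rsplit("/preview/", 1)[1]
    return ("https://www.cpppc.org:8082/api/pdfs/front/download/"
            f"{file_id}?token=null&appId=public")
```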
lei0730 posted on 2022-5-18 16:10
My browser is Xiaozhi (小智) with IDM; when I open your URL and click to preview the PDF, IDM pops up the download directly.