Batch M3U8 Grabber for Course Videos on the National Cloud Platform (国家云平台)

sky9131986 · posted 2022-5-7 10:19

Last edited by sky9131986 on 2022-5-11 18:46

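Because of the pandemic, classes have moved online again and I needed to download course videos. I had written a download tool before, but it stopped working after the site was redesigned.
This time the National Cloud Platform has added a lot of videos compared with before. The compression is also good: a video of ten-odd minutes is only ten-odd MB and still looks fairly sharp.

The tool only scrapes the M3U8 links of the videos. The downloading itself is handed over to 逍遥一仙's M3U8 batch downloader (thanks to its author).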

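I did not add much error handling, so problems may occur.

The tool is very simple: it has no GUI and runs in the console.

It needs two inputs: 1. the save path, 2. the textbook ID.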

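Getting the textbook ID: first open the National Cloud Platform (https://www.zxx.edu.cn/syncClassroom) and find the textbook's cover image. The ID is in the link of the cover image.

Right-click the image and choose either [Open image in new tab] or [Copy image address]
to get the image address: https://s-file-2.ykt.cbern.com.cn/x_course_s_g/c/101449322/cover/ba84232d-99ff-4b6e-b3a1-355d93cb35dc.png
The part highlighted in red in the original post is the textbook ID. After you enter it, the tool creates a TXT file (course names and course M3U8 addresses) under the save path you entered.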

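The first line of the TXT file is the video save path, which can be changed. The downloader's list format works as follows:

Config file format
Name (or parameter name),link (or parameter value); one entry per line, separated by an English (half-width) comma.
Note: a parameter line counts as a separate task of its own; line breaks must be \r\n.
Example: to set the output directory to drive D, download two files, then switch to drive E, the config should read:
#OUT,D:\
first file name,first link
second file name,second link
#OUT,E:\
third file name,third link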
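
As a rough sketch of the format above (not part of the downloader itself), the following Python builds the same example list. The helper name `build_task_list` and the placeholder names and links are made up for illustration; lines are joined with \r\n as the note requires.

```python
def build_task_list(entries):
    """Render (name, value) pairs as 'name,value' lines joined by CRLF.

    A name starting with '#' (such as '#OUT') is written unchanged as a
    parameter line. Hypothetical helper, for illustration only.
    """
    lines = []
    for name, value in entries:
        if not name.startswith('#'):
            # An English comma separates the fields, so keep it out of file names.
            name = name.replace(',', ' ')
        lines.append(name + ',' + value)
    return '\r\n'.join(lines) + '\r\n'


example = build_task_list([
    ('#OUT', 'D:\\'),
    ('first file name', 'first link'),
    ('second file name', 'second link'),
    ('#OUT', 'E:\\'),
    ('third file name', 'third link'),
])
print(example)
```

When saving such text from Python, open the file with newline='' so the \r\n endings are written exactly as built and not translated a second time on Windows.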

2. M3U8 download
Open 逍遥一仙's M3U8 downloader, drag the TXT file straight into it, and click Start.

3. Written in Python and packaged as a single-file exe. Tested OK on my machine (Windows 10 Enterprise LTSC 21H2); not tested anywhere else.

Single-file build:
https://yanxiu.lanzouh.com/iMcLJ04f2esj
Password: meizu

Directory build:
https://yanxiu.lanzouh.com/iBYws04hbisj
Password: meizu

4. If it really won't run, run the source code instead.

[Python]

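import os

import requests
import urllib3


# Browser-like headers sent with every JSON request.
header = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36',
    'Accept':'application/json, text/plain, */*'
}
# Output directory of the current textbook; set by creat_text().
save_path = ''


def get_json(url_pre, tag_id, url_fix):
    """Fetch url_pre + tag_id + url_fix and return the decoded JSON."""
    url = url_pre + tag_id + url_fix
    # Certificate verification is off (verify=False), so silence the resulting warning.
    urllib3.disable_warnings()
    response = requests.get(url=url, headers=header, verify=False, timeout=30)
    # Fail loudly on an HTTP error instead of trying to parse an error page.
    response.raise_for_status()
    return response.json()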


def creat_text(book_name, path):
    """Create the output folder and start a fresh task list inside it."""
    global save_path
    save_path = path + "/" + book_name
    os.makedirs(save_path, exist_ok=True)
    text_path = save_path + '/' + book_name + '.txt'
    # Mode 'w' empties any list left over from an earlier run.
    with open(text_path, 'w', encoding="utf-8") as f:
        # The first line tells the downloader where to save the videos.
        f.write('#OUT,' + save_path + '\n')


def write_text(course_id, book_name, course_name):
    """Look up one lesson's M3U8 link and append a 'name,link' line to the task list."""
    course_url_pre = 'https://s-file-1.ykt.cbern.com.cn/zxx/s_course/v1/x_class_hour_activity/'
    course_url_fix = '/resources.json'
    course_data = get_json(course_url_pre, course_id, course_url_fix)
    # First resource, last entry of its URL list, first address in that entry.
    course_m3u8 = course_data[0]['video_extend']['urls'][-1]['urls'][0]
    down_msg = course_name + ',' + course_m3u8 + '\n'
    text_path = save_path + '/' + book_name + '.txt'
    with open(text_path, 'a', encoding="utf-8") as f:
        f.write(down_msg)


def title_prefix(node_name):
    """Return the first word of node_name plus a space, or '' if the name contains no space."""
    if ' ' not in node_name:
        return ''
    words = node_name.split()
    return words[0] + ' ' if words else ''


def get_m3u8(tag_id, path):
    """Walk the textbook's chapter tree and write every lesson to the task list."""
    tag_url_pre = 'https://s-file-1.ykt.cbern.com.cn/zxx/s_course/v2/business_courses/'
    tag_url_fix = '/course_relative_infos/zh-CN.json'
    tag_data = get_json(tag_url_pre, tag_id, tag_url_fix)
    activity_set_id = tag_data['course_detail']['activity_set_id']
    activity_url_pre = 'https://s-file-1.ykt.cbern.com.cn/zxx/s_course/v2/activity_sets/'
    activity_url_fix = '/fulls.json'
    activity_data = get_json(activity_url_pre, activity_set_id, activity_url_fix)
    book_name = activity_data['activity_set_name']
    nodes = activity_data['nodes']
    creat_text(book_name, path)
    for child_nodes in nodes:
        # Top level: prefix lesson names with the first word of this node's title.
        first_title = title_prefix(child_nodes['node_name'])
        for course_nodes in child_nodes['child_nodes']:
            if course_nodes['child_nodes']:
                # Three levels deep: also prefix with the first word of the second-level title.
                sec_title = title_prefix(course_nodes['node_name'])
                for course_node in course_nodes['child_nodes']:
                    course_id = course_node['node_id']
                    course_name = first_title + sec_title + course_node['node_name']
                    # The task list is comma-separated, so names must not contain commas.
                    course_name = course_name.replace(',', ' ')
                    print(course_name)
                    write_text(course_id, book_name, course_name)

            else:
                # Two levels deep: this node is itself a lesson.
                course_id = course_nodes['node_id']
                course_name = first_title + course_nodes['node_name']
                course_name = course_name.replace(',', ' ')
                print(course_name)
                write_text(course_id, book_name, course_name)



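if __name__ == '__main__':
    print("------ Semi-automatic grabber for M3U8 course videos on the National Cloud Platform ------\n")
    # strip() drops stray spaces that tend to come along when pasting.
    path = input('Enter the save path for the course videos: ').strip()
    tag_id = input('Enter the textbook ID to download: ').strip()
    get_m3u8(tag_id, path)
    input('Done. Press Enter to close the window.')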
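
To make the naming logic in the script easier to follow, here is a small offline sketch that applies the same prefix rules to a made-up chapter tree, without any network access. The sample data and the helper names `prefix` and `lesson_names` are hypothetical; only the keys node_name, node_id and child_nodes are taken from the script.

```python
def prefix(node_name):
    """First word of the name plus a space, or '' when the name has no space."""
    if ' ' not in node_name:
        return ''
    words = node_name.split()
    return words[0] + ' ' if words else ''


def lesson_names(nodes):
    """Yield (node_id, name) for every lesson in a two- or three-level tree."""
    for chapter in nodes:
        first = prefix(chapter['node_name'])
        for node in chapter['child_nodes']:
            if node['child_nodes']:
                second = prefix(node['node_name'])
                for lesson in node['child_nodes']:
                    name = first + second + lesson['node_name']
                    yield lesson['node_id'], name.replace(',', ' ')
            else:
                name = first + node['node_name']
                yield node['node_id'], name.replace(',', ' ')


# Made-up sample tree, not real platform data.
sample = [
    {'node_name': 'Unit1 Numbers', 'child_nodes': [
        {'node_name': 'Lesson1 Counting', 'node_id': 'a1', 'child_nodes': [
            {'node_name': 'Part 1, intro', 'node_id': 'b1', 'child_nodes': []},
        ]},
        {'node_name': 'Review', 'node_id': 'a2', 'child_nodes': []},
    ]},
]

for node_id, name in lesson_names(sample):
    print(node_id, name)
```

Commas in a lesson name are replaced with spaces because the task list uses the comma as its field separator.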

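Other notes:
Some courses on the National Cloud Platform are plain MP4 videos. The tool can grab those links, but the M3U8 downloader cannot download them. They can be imported into IDM instead, though the video files then have to be renamed by hand. I did consider calling IDM from the tool, but that failed to run after packaging, so I did not pursue it.
Videos for the People's Education Press edition (人教版, i.e. the ministry-compiled 部编版) are mostly in M3U8 form.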
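
Since the M3U8 downloader cannot handle the direct MP4 links, one option is to separate them out of the generated list before importing it. The sketch below assumes the link type can be told from the extension of the URL path (.mp4 versus anything else), which the post does not confirm; `split_task_list` is a hypothetical helper and the sample links are placeholders.

```python
from urllib.parse import urlparse


def split_task_list(text):
    """Split task-list text into (m3u8_lines, mp4_lines).

    Parameter lines such as '#OUT,...' stay in the M3U8 list. Assumes a
    direct video link has a URL path ending in '.mp4'.
    """
    m3u8_lines, mp4_lines = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, link = line.partition(',')
        is_mp4 = urlparse(link).path.lower().endswith('.mp4')
        if is_mp4 and not name.startswith('#'):
            mp4_lines.append(line)
        else:
            m3u8_lines.append(line)
    return m3u8_lines, mp4_lines


sample = (
    '#OUT,D:/videos\n'
    'Lesson 1,https://example.com/a/index.m3u8\n'
    'Lesson 2,https://example.com/b/video.mp4?token=1\n'
)
m3u8_lines, mp4_lines = split_task_list(sample)
print(m3u8_lines)
print(mp4_lines)
```

The MP4 lines still carry the course name in front of the comma, so they can serve as a reference when renaming the files after downloading them with another tool.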

sky9131986 · posted 2022-5-11 18:44

aee posted 2022-5-11 14:11
I'd like a copy of the source code to try

[Python] Same script as the source code in the first post, plus one extra debug line, print(course_id), before each course name in the three-level branch.

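tian321 · posted 2022-5-8 11:12

Thanks to the original poster for the detailed explanation.

mushaoai2016 · posted 2022-5-8 11:21

Looks pretty good

xjjlxcb123 · posted 2022-5-8 11:26

Exactly the kind of software I needed, thanks for sharing!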

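hnbove · posted 2022-5-8 12:05

The program window flashes up and disappears as soon as I open it

sandy99800000 · posted 2022-5-8 12:11

On Win7 it says api-ms-win-core-path-l1-1-0.dll is missing. How can this be fixed? The usual repair methods don't work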

tryit · posted 2022-5-8 12:17

Please make a GUI and add an option to prefix sequence numbers, so the playback order is easy to see when watching the videos
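
A rough sketch of how the numbering idea could look, assuming it is applied to the generated task list afterwards and not built into the tool: each 'name,link' line gets a zero-padded sequence number in front of the name, while parameter lines such as #OUT are left alone. `number_task_list` is a hypothetical helper and the links are placeholders.

```python
def number_task_list(lines, width=3):
    """Prefix each 'name,link' line with a zero-padded sequence number."""
    numbered = []
    count = 0
    for line in lines:
        if line.startswith('#') or ',' not in line:
            # Parameter line or malformed line: keep it unchanged.
            numbered.append(line)
            continue
        count += 1
        numbered.append(str(count).zfill(width) + ' ' + line)
    return numbered


tasks = [
    '#OUT,D:/videos',
    'Intro,https://example.com/1.m3u8',
    'Review,https://example.com/2.m3u8',
]
print(number_task_list(tasks))
```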

99丶 · posted 2022-5-8 12:19

Thanks for sharing!

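sky9131986 · posted 2022-5-8 12:22

sandy99800000 posted 2022-5-8 12:11
On Win7 it says api-ms-win-core-path-l1-1-0.dll is missing. How can this be fixed? The usual repair methods don't work

I've built a directory version; not sure whether it will work
https://yanxiu.lanzouh.com/iBYws04hbisj
Password: meizu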

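sky9131986 · posted 2022-5-8 12:23

hnbove posted 2022-5-8 12:05
The program window flashes up and disappears as soon as I open it

I've built a directory version; not sure whether it will work
https://yanxiu.lanzouh.com/iBYws04hbisj
Password: meizu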
