Math formulas not displaying when downloading and saving HTML files with Python

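bin_chb · posted 2023-9-21 20:16

This post was last edited by bin_chb on 2023-9-21 23:14

The code below downloads the articles of a Zhihu column, but the math formulas from Zhihu do not display correctly. How can I get them to display properly?
Reference: https://blog.csdn.net/weixin_44283855/article/details/124123375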

import re
import time
from pathlib import Path

import pdfkit
import requests
from bs4 import BeautifulSoup
 
def get_list():
    """Collect the id and title of every article in the column into zhihu_ids.txt."""
    url = 'https://www.zhihu.com/api/v4/columns/%s/articles?include=data.topics&limit=10' % author
    article_dict = {}
    while True:
        print('fetching', url)
        try:
            resp = requests.get(url, headers=headers)
            j = resp.json()
            data = j['data']
        except Exception as e:
            # Stop paging on failure; otherwise `data` and `j` would be undefined below
            print('get list failed:', e)
            break

        for article in data:
            aid = article['id']
            if aid not in article_dict:
                article_dict[aid] = article['title']

        if j['paging']['is_end']:
            break
        url = j['paging']['next']
        time.sleep(2)

    with open(filedir.joinpath('zhihu_ids.txt'), 'w') as f:
        for item in sorted(article_dict.items()):
            f.write('%s %s\n' % item)

def get_html(aid, title, index, encoding='UTF-8'):
    # Replace characters that are not allowed in file names
    title = re.sub(r'[\\/:*?"<>|]', '-', title)
    print(title)
    file_name = '%03d. %s.html' % (index, title)
    file_name = file_name.replace(" ", "").strip()

    print('saving', title)
    try:
        url = 'https://zhuanlan.zhihu.com/p/' + aid
        res = requests.get(url, headers=headers)
        encoding = res.encoding
        html = res.text
        # soup = BeautifulSoup(html, 'lxml')
        soup = BeautifulSoup(html, 'html.parser')
        content = soup.find(class_='Post-RichText').prettify()
        content = content.replace('data-actual', '')
        content = content.replace('h1>', 'h2>')
        # prettify() puts tags on separate lines, so '.' must also match newlines
        content = re.sub(r'<noscript>.*?</noscript>', '', content, flags=re.DOTALL)
        content = re.sub(r'src="data:image.*?"', '', content)
        # MathJax script that renders the math formulas
        strmath = '<script type="text/javascript" async src="https://cdn.staticfile.org/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>'
        # Plain string (not an f-string), so the CSS braces need no escaping
        style = """
    <style>
      body {
        margin: 0 50px;
      }
      p {
        text-indent: 2em;  /* indent the first line of each paragraph by 2em */
      }
      img {
        width: 100%;  /* keep images within the page width */
      }
      pre {
        white-space: pre-wrap;
        word-wrap: break-word;  /* wrap long lines in code blocks */
      }
      .ztext-math {
        display: inline-block;
      }
    </style>
    """
        # The file is written as UTF-8 below, so declare that charset; the style belongs in <head>, not after </html>
        content2 = f'<!DOCTYPE html><html><head><meta charset="utf-8">{style}</head><body><h1>{title}</h1>{content}{strmath}</body></html>'
         
        with open(filedir.joinpath(file_name), 'w', encoding='utf-8') as f:
            f.write(content2)
    except Exception as e:
        print('get %s failed: %s' % (title, e))
    time.sleep(2)
 
def get_details():
    with open(filedir.joinpath('zhihu_ids.txt')) as f:
        i = 1
        for line in f:
            lst = line.strip().split(' ')
            aid = lst[0]
            title = '_'.join(lst[1:])
            get_html(aid, title, i)
            i += 1
 
def to_pdf():
    '''
    To export a PDF, besides installing pdfkit with pip you also need to install wkhtmltopdf manually. See:
    https://github.com/JazzCore/python-pdfkit/wiki/Installing-wkhtmltopdf
    https://wkhtmltopdf.org/downloads.html
    '''
    print('exporting PDF...')
    htmls = [name.name for name in filedir.iterdir() if name.suffix == ".html"]
    # htmls.remove('index.html')
    path_wk = r"E:\Software\wkhtmltopdf\bin\wkhtmltopdf.exe"
    config = pdfkit.configuration(wkhtmltopdf=path_wk)
    options = {
        'encoding': "utf-8",
        # 'page-size': 'A4',
        'orientation': 'Portrait',  # 'Portrait' or 'Landscape'
        # 'margin-top': '2mm',
        # 'margin-right': '2mm',
        # 'margin-bottom': '2mm',
        # 'margin-left': '2mm',
        # 'no-outline': None,
        # 'background': background_color,
        # 'quiet': '',  # by default pdfkit shows all wkhtmltopdf output; pass quiet to suppress it
        # Flags that take no argument get None as their value
        'enable-local-file-access': None,
        'enable-internal-links': None,
        'enable-javascript': None,
        'javascript-delay': '10000',  # milliseconds to wait for scripts such as MathJax
        'no-stop-slow-scripts': None,
        'debug-javascript': None,
        'enable-forms': None,
        'disable-smart-shrinking': None,
    }
    # Pass plain string paths to pdfkit
    htmls_dir = [str(filedir.joinpath(name)) for name in sorted(htmls)]
    output_path = str(Path(dir_path).joinpath(author + '.pdf'))
    pdfkit.from_file(htmls_dir, output_path, options=options, configuration=config)
    print('Done')

if __name__ == '__main__':
    dir_path = r'E:\Brandon\Desktop\zhihu'
    author = 'c_1322265113534304256'

    filedir = Path(dir_path).joinpath(author)
    filedir.mkdir(parents=True, exist_ok=True)  # create the output folder

    headers = {
        'origin': 'https://zhuanlan.zhihu.com',
        'referer': 'https://zhuanlan.zhihu.com/%s' % author,
        'User-Agent': 'Mozilla/5.0',
    }
    get_list()
    get_details()
    # to_pdf()

LoveCode · posted 2023-9-21 20:16

Insert the following code after those re.sub calls.

# Start replacing here. The spans look roughly like this; the key is the T part:
# <span class="ztext-math" data-eeimg="1" data-tex="T">T</span>
content = re.sub(r'data-tex="(.*?)">\s+\1\s+<', r">\(\1\)<",  content)

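I tested it: the downloaded HTML renders the math expressions correctly, so the generated PDF should be fine too (I did not test that part).

I can't upload images right now; the site reports "unsupported type". If it works tomorrow, I'll explain in more detail.

If you are puzzled, have a look at the MathJax website, which says that LaTeX expressions should be wrapped in $$...$$ or \(...\).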

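LoveCode · posted 2023-9-22 00:20

Ah, it turns out I could not upload images because only .txt files can be uploaded.
So I'll describe my reasoning in text instead.

Guess

I first read the CSDN blog post you linked. To display math expressions, its author inserts an external JS file that renders math into the downloaded HTML.

Since you mentioned a rendering problem, my guess was that this JS file failed to load. Also, as long as the downloaded HTML displays correctly, the generated PDF will most likely be correct too.

Probing

I ran your code and successfully downloaded several HTML files (no PDF had been generated at this point). The math expressions were not rendered.

(Screenshot of the failed rendering omitted here.)

At this point I had two guesses:

1. The mathjax.js being used is wrong.
2. The math expressions themselves are wrong. I only know TeX by name, though, and nothing more about it.

To check the first guess, I took an example from the MathJax website and tested it with the mathjax.js from the Python source. It displayed correctly, so the JS file itself is fine.

(Image omitted here.)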
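A check of this kind can be sketched as follows. This is only a minimal illustration, not the exact page used in the test above: the helper name `build_test_page` and the sample formula are made up for the example, while the script URL is the one from the original code.

```python
# Script URL taken from the code in the first post
MATHJAX_SRC = ("https://cdn.staticfile.org/mathjax/2.7.1/MathJax.js"
               "?config=TeX-AMS-MML_HTMLorMML")


def build_test_page(formula: str) -> str:
    """Return a tiny HTML page containing one inline formula.

    build_test_page is a hypothetical helper for this sketch. The formula
    is wrapped in backslash-parenthesis delimiters, which MathJax
    recognizes as inline math by default.
    """
    return (
        '<!DOCTYPE html><html><head><meta charset="utf-8">'
        f'<script type="text/javascript" async src="{MATHJAX_SRC}"></script>'
        '</head><body>'
        f'<p>Test: \\({formula}\\)</p>'
        '</body></html>'
    )


page = build_test_page(r'a^2 + b^2 = c^2')
print(page)
```

Saving `page` to a file (for example a hypothetical `mathjax_test.html`) and opening it in a browser shows whether the script loads and renders a formula that is known to be well formed.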

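To check the second guess, I pasted math expressions from the scraped page into an online TeX tester, and they rendered correctly!

So the mathjax.js file was correct and the math expressions were correct, yet nothing was rendered. My final guess was that the problem lay in how mathjax.js was being used, since it is complex enough. Only after reading the official documentation carefully did I pin down the problem.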

The default math delimiters are $$...$$ and \[...\] for displayed mathematics, and \(...\) for in-line mathematics. Note in particular that the $...$ in-line delimiters are not used by default.

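So the fix is to find every math expression in the downloaded HTML source and add the matching prefix and suffix to it.
As long as the scraped HTML displays correctly, the generated PDF will be correct too.

Modifying the Python source

The remaining task is to add the prefix and suffix to every TeX expression. After some analysis, I found the pattern they follow:

1. They all sit inside a span whose class has a specific value.
2. The value of the data-tex attribute is the same as the text of the span tag; both are the TeX expression to display.

So a regular expression can do the replacement (see the code in my first reply).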

<span class="ztext-math" data-eeimg="1" 
            data-tex="\sum_{i=1}^{n}X_{i}\sim N(n\mu,n\sigma^{2}) \\">
            \sum_{i=1}^{n}X_{i}\sim N(n\mu,n\sigma^{2}) \\
</span>
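To make the effect of that substitution concrete, here is a small self-contained check that needs only the standard library. The sample string is an assumption about what a span looks like after `prettify()` (TeX once in `data-tex`, once as the span text on its own line), and `wrap_inline_math` is a name chosen for this sketch.

```python
import re

# Assumed shape of one formula span after BeautifulSoup's prettify()
sample = '\n'.join([
    r'<span class="ztext-math" data-eeimg="1" data-tex="\alpha+\beta">',
    r' \alpha+\beta',
    '</span>',
])


def wrap_inline_math(html: str) -> str:
    """Apply the substitution from the first reply to an HTML string."""
    # The backreference \1 in the pattern requires the span text to repeat
    # the data-tex value. The replacement drops the attribute and keeps the
    # formula once, wrapped in MathJax's default inline delimiters.
    return re.sub(r'data-tex="(.*?)">\s+\1\s+<', r'>\(\1\)<', html)


print(wrap_inline_math(sample))
```

Text that contains no formula span passes through unchanged, so the call is safe to run on the whole article body.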

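In the end I decided to add \(...\) so that every formula is inline, rather than $$...$$ on a line of its own, because I cannot tell which expressions should be inline and which should be displayed on their own line.

LoveCode · posted 2023-9-22 00:26

Ah, the text I copied from the MathJax website got processed as markdown. For the details, see the documentation here: https://docs.mathjax.org/en/latest/basic/mathematics.html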
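If telling the two kinds apart ever becomes necessary, one possible heuristic is sketched below. It rests on an unverified assumption that nobody in this thread tested: that a formula whose TeX source ends with `\\`, like the sample span above, was meant to stand on its own line. The names `SPAN` and `wrap_math` are made up for the sketch.

```python
import re

# Same pattern as in the first reply, compiled once
SPAN = re.compile(r'data-tex="(.*?)">\s+\1\s+<')


def wrap_math(html: str) -> str:
    """Wrap formulas in display or inline delimiters (heuristic sketch)."""
    def repl(m):
        tex = m.group(1).strip()
        if tex.endswith('\\\\'):
            # Assumption: a trailing TeX line break marks a display formula.
            # Drop the line break and use the display delimiters.
            return '>\\[' + tex[:-2].strip() + '\\]<'
        # Everything else stays inline
        return '>\\(' + tex + '\\)<'
    return SPAN.sub(repl, html)


inline = '\n'.join([
    r'<span class="ztext-math" data-tex="x^{2}">',
    r' x^{2}',
    '</span>',
])
display = '\n'.join([
    r'<span class="ztext-math" data-tex="x^{2} \\">',
    r' x^{2} \\',
    '</span>',
])
print(wrap_math(inline))
print(wrap_math(display))
```

Because the replacement is a function, its return value is inserted literally, so the backslashes in the delimiters need no extra escaping for `re`.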

bin_chb · posted 2023-9-22 09:09

LoveCode posted 2023-9-21 20:16
Insert the following code after those `re.sub` calls.

Thanks for such a detailed reply.
