Splitting a Word .docx document into multiple new files by heading with Python

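大柚子 · posted on 2023-7-5 15:47

I write reports often. At first I kept them all in one summary document, but later I had to split that document heading by heading, so I wrote this script to do the splitting.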

[Python]
import os
import re

from docx import Document
from docx.shared import Inches

IMAGE_FOLDER = 'images'  # extracted images are stored here
FILE_FOLDER = '零散'      # the split documents are stored here


# Save every image in the document to the images folder, naming each file
# after the image's embed attribute value (i.e. its rId).
def extract_and_save_images_from_docx(doc_path):
    doc = Document(doc_path)
    os.makedirs(IMAGE_FOLDER, exist_ok=True)

    for rel in doc.part.rels.values():
        # Keep only embedded images; a linked (external) image has no target part
        if 'image' not in rel.reltype or rel.is_external:
            continue
        # Every file gets a .png name, whatever format the bytes really are
        image_path = os.path.join(IMAGE_FOLDER, f'{rel.rId}.png')
        with open(image_path, 'wb') as f:
            f.write(rel.target_part.blob)

    print("Images extracted and saved successfully.")


# Save one split document as index + heading text, e.g. "1Introduction.docx".
# Note: the operating system limits file-name length; if saving or reading
# fails because of that, shorten the heading.
def save_split_doc(new_doc, index, title):
    # Replace characters that are not allowed in file names
    safe_title = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', title).strip()
    new_doc.save(os.path.join(FILE_FOLDER, f'{index}{safe_title}.docx'))


# Split the document by heading. The index increases by 1 each time
# a heading is matched.
def split_docx_by_heading(doc_path):
    doc = Document(doc_path)
    os.makedirs(FILE_FOLDER, exist_ok=True)

    current_title = None
    current_index = 1
    new_doc = None  # stays None until the first heading, so earlier text is skipped

    for paragraph in doc.paragraphs:
        if paragraph.style.name.startswith('Heading'):  # is this a heading?
            if new_doc is not None:
                save_split_doc(new_doc, current_index, current_title)
                current_index += 1
            current_title = paragraph.text
            new_doc = Document()
            new_doc.add_paragraph(paragraph.text, style=paragraph.style.name)
        elif new_doc is not None:
            new_paragraph = new_doc.add_paragraph(paragraph.text, style='Normal')

            # Walk the runs of the paragraph and look for images
            for run in paragraph.runs:
                run_xml = run._r.xml  # XML of the run
                if '<w:drawing>' not in run_xml:
                    continue
                # The r:embed attribute holds the rId the image was saved under
                for image_id in re.findall(r'r:embed="([^"]+)"', run_xml):
                    image_path = os.path.join(IMAGE_FOLDER, f'{image_id}.png')
                    print("image_path=" + image_path)
                    # Append the saved image to the new paragraph
                    new_paragraph.add_run().add_picture(image_path, width=Inches(5))

    # Save the last document
    if new_doc is not None:
        save_split_doc(new_doc, current_index, current_title)


# Usage example
if __name__ == '__main__':
    extract_and_save_images_from_docx('pp2.docx')  # replace with the document to split
    split_docx_by_heading('pp2.docx')              # replace with the document to split
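
The script's comments warn that a long heading can produce a file name the operating system rejects. One possible way to guard against that is sketched below. The helper name fit_file_name and the 255-byte default are assumptions made for this illustration (255 bytes is a common per-name limit on Linux file systems, but check your own system); neither comes from the original script.

```python
def fit_file_name(stem, suffix=".docx", max_bytes=255):
    """Shorten stem so that stem + suffix fits in max_bytes of UTF-8.

    Hypothetical helper: max_bytes=255 is an assumed limit, not a universal one.
    """
    budget = max_bytes - len(suffix.encode("utf-8"))
    if budget <= 0:
        raise ValueError("max_bytes is too small for the suffix")
    # Cut on a byte boundary first; Chinese characters take 3 bytes each in UTF-8
    encoded = stem.encode("utf-8")[:budget]
    # errors="ignore" drops a multi-byte character that the slice cut in half
    return encoded.decode("utf-8", errors="ignore") + suffix


if __name__ == "__main__":
    print(fit_file_name("1Introduction"))
    long_name = fit_file_name("标题" * 100)
    print(len(long_name.encode("utf-8")))
```

Truncating by encoded bytes rather than by characters matters for Chinese headings, because a heading that looks short can still be three times as long once encoded.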

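Approach:
At first I planned to detect whether a paragraph contains an image directly with run._element.tag.endswith('drawing'), but it never matched: the run's own element is w:r, and the drawing is nested inside it as a child element. Then I realized that each image's embed value in the document XML is unique, so the script first exports every image from the original document, using the embed value as the file name. It then checks run._r.xml for <w:drawing> to detect whether a run contains an image, extracts the embed value from it, and adds the corresponding image to the new document.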
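
To illustrate why the tag check on the run never matches while searching inside the run does, here is a minimal sketch using only the standard library. The XML fragment is hand-written and heavily simplified for this example (real runs nest more elements between w:drawing and a:blip), and the helper name embed_ids is hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Simplified, hand-written run for illustration only
SAMPLE_RUN_XML = f"""
<w:r xmlns:w="{W}" xmlns:a="{A}" xmlns:r="{R}">
  <w:drawing>
    <a:graphic>
      <a:blip r:embed="rId7"/>
    </a:graphic>
  </w:drawing>
</w:r>
"""


def embed_ids(run_element):
    """Return the r:embed value of every a:blip nested anywhere inside the run."""
    ids = (blip.get(f"{{{R}}}embed") for blip in run_element.iter(f"{{{A}}}blip"))
    return [i for i in ids if i is not None]  # a linked image has no r:embed


run = ET.fromstring(SAMPLE_RUN_XML)
# The run element itself is w:r, so checking its own tag finds nothing
print(run.tag.endswith("drawing"))
# Searching the descendants does find the drawing
print(any(el.tag == f"{{{W}}}drawing" for el in run.iter()))
print(embed_ids(run))
```

Walking the parsed elements is an alternative to searching the serialized XML string; it also picks up several images in one run.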

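Shortcomings:
All body text is reset to the Normal style, so effects such as bold are lost (mainly because some of the font formatting in my documents is not standardized: some is WPS formatting, some is custom). Does anyone know a better way?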
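
One possible way to keep part of the formatting is to rebuild each paragraph run by run instead of copying only its text. The sketch below is not the original author's method. It assumes python-docx run objects (attributes bold, italic, underline, font.size, font.name) and copies only formatting applied directly to a run, so formatting that comes from WPS or custom styles would still be lost. The function names are hypothetical.

```python
def copy_run_format(src_run, dst_run):
    """Copy direct character formatting from one run to another."""
    # None means "inherit from the style", so copying None is harmless
    for attr in ("bold", "italic", "underline"):
        setattr(dst_run, attr, getattr(src_run, attr))
    dst_run.font.size = src_run.font.size
    dst_run.font.name = src_run.font.name


def copy_paragraph_runs(src_paragraph, dst_paragraph):
    """Rebuild the source paragraph's text in dst_paragraph, one run at a time."""
    for run in src_paragraph.runs:
        copy_run_format(run, dst_paragraph.add_run(run.text))
```

In the splitting loop this would mean creating the new paragraph empty, for example with new_doc.add_paragraph(style='Normal'), and then calling copy_paragraph_runs(paragraph, new_paragraph) instead of passing paragraph.text.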

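羚羋 · posted on 2023-7-6 18:30

Word can do this split directly. If the headings are formatted consistently, go to View > Outline, choose the heading level under Show Level, then Show Document > Create > Save. That also keeps the rest of the formatting.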

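coolkid · posted on 2023-7-5 17:42

Thanks for sharing

Yangxiao112 · posted on 2023-7-5 17:46

Thanks for sharing

UmiSonada · posted on 2023-7-5 17:48

Thanks for sharing👍🏻

catCatBlue · posted on 2023-7-5 17:51

A very practical script, thanks for sharing

rongrong666 · posted on 2023-7-5 17:57

Learned something, thanks for sharing

Snowfly011 · posted on 2023-7-5 18:07

Thanks for sharing, much appreciated

myandroid · posted on 2023-7-5 18:19

Thanks for sharing 111

qingyu710 · posted on 2023-7-5 18:29

Thanks for sharing

ning0053 · posted on 2023-7-5 18:35

Thanks for sharing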
