chroming / pdfdir

A tool for adding navigation bookmarks (outline / table of contents) to PDF files

License: GNU General Public License v3.0


pdfdir's Introduction

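pdfdir: a tool for adding navigation bookmarks to PDFs

Automatically generates navigation bookmarks for your PDF from an existing table-of-contents (TOC) text.

The implementation logic of this project is heavily influenced by https://github.com/ifnoelse/pdf-bookmark.

Features

Automatically inserts navigation bookmarks (an outline) into a PDF file, based on TOC text found online or in the PDF itself.

Suitable for the following cases:

Scanned e-books that have no navigation bookmarks;
Text-based electronic documents that have no navigation bookmarks but do contain a table of contents in the PDF.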

Download

Windows/macOS/Ubuntu:

Download link

Other platforms:

Run from source or package the program yourself.

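Usage

Basic usage

Select the file: enter the PDF file path (e.g. D:/统计思维.pdf) in the "PDF file path" text box, or click the "Open" button to choose the PDF file through the file manager.
TOC text: paste the TOC text into the "TOC text" box. See "Getting the TOC text" below.
Edit the contents to be written (optional): the contents that will actually be written are generated automatically from the TOC text. Double-click any title or page number to edit it. Dragging entries to change their order or their parent/child relationship is also supported.
Edit the page offset (optional):
Write: click the "Write" button at the bottom right and wait a moment. When the status bar shows "******* Finished!", the write has succeeded, and a file named <original file name>_new.pdf containing the bookmarks can be found in the PDF's directory.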

Getting the TOC text

The TOC text is text of the following form:

中译版序言
致**读者
作者来信
前言
第1章社会心理学导论2  
第一编社会思维
第2章社会中的自我32
.....................
结语605
参考文献606

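That is, each line has the form title + page number. The text usually comes from an online bookstore (such as Amazon) or a book information site (such as Douban Books). A book's description generally includes its table of contents; on Amazon, for example, it is under Product description -- Table of contents. Note: the generated bookmarks depend entirely on the TOC text, so any problem in the text will show up in the generated bookmarks.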

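English support

Copy language/en.qm from the source code to language/en.qm under the program's directory, then choose "语言 -- English" in the program's menu bar to switch to the English interface.

Known issues

The front matter of most books (preface, table of contents, and so on) is either unnumbered or uses a separate page-numbering scheme. The program links these entries to the first page by default; edit the links by hand if they need correcting.
Some entries within the main body have no page number; the program links such an entry to the page of the previous entry that has one.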

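Other

Running from source

Requirements for running from source:

Python 2 or 3 (Python 3 recommended)
PyQt5
PyPDF2
six

Note: Python 2 and Python 3 are not compatible with each other. Some systems (such as macOS) ship with Python 2, invoked with the python command; if you install Python 3 yourself, you may need to invoke it as python3, and likewise pip3 instead of pip. This document does not distinguish between python/python3 or pip/pip3; use whichever command matches the version installed on your system.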

Get the code

git clone https://github.com/chroming/pdfdir

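Set up the runtime environment

Install Python:

https://www.python.org/downloads/

Install the dependencies:

From the project directory, run:

pip install -r requirements.txt

pip install pyqt5

If you see "No matching distribution found for pyqt5", install PyQt by following the official PyQt documentation.

Once the environment is set up, go to the source directory and run:

python run_gui.py

If you do not need the GUI:

python run.py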

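Running the command-line interface from source

The program's run_cli.py can run in an environment without Qt.
The CLI supports up to 6 TOC levels, and because the TOC text is supplied as a file it is easier to edit.

python run_cli.py --help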
usage: run_cli.py [-h] [--offset OFFSET] [--l0 L0] [--l1 L1] [--l2 L2] [--l3 L3] [--l4 L4] [--l5 L5] pdfPath tocPath

Add content to PDF.

positional arguments:
  pdfPath          path of PDF
  tocPath          path of contents file

options:
  -h, --help       show this help message and exit
  --offset OFFSET  Page offset of contents
  --l0 L0          Regular expression of level 0 of content
  --l1 L1          Regular expression of level 1 of content
  --l2 L2          Regular expression of level 2 of content
  --l3 L3          Regular expression of level 3 of content
  --l4 L4          Regular expression of level 4 of content
  --l5 L5          Regular expression of level 5 of content

Packaging from source

To package the program on your own machine:

Install PyInstaller

pip install pyinstaller

Package the program

pyinstaller -F run_gui.py -n "PDFdir.exe" --noconsole

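TOC text format

The TOC text is currently processed in the following format:

title + page number + newline

Everything on one line is treated as a single TOC entry. The page number is matched with the regular expression (\d*$), which matches all digits at the end of the text. If nothing matches, the entry defaults to the first page or to the page of the previous entry.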
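The rule above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not pdfdir's actual code: the function name parse_toc and the first_page parameter are made up for the example, and only the pattern (\d*)$ comes from the text.

```python
import re

# The pattern quoted above: every digit at the very end of the line.
PAGE_RE = re.compile(r"(\d*)$")


def parse_toc(text, first_page=1):
    """Split TOC text into (title, page) pairs, one per non-empty line.

    Hypothetical helper: a line without a trailing number reuses the page
    of the previous entry, or first_page if there is no previous entry.
    """
    entries = []
    last_page = first_page
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        digits = PAGE_RE.search(line).group(1)
        if digits:
            title = line[: -len(digits)].rstrip()
            last_page = int(digits)
        else:
            title = line
        entries.append((title, last_page))
    return entries


if __name__ == "__main__":
    sample = "前言\n第1章社会心理学导论2\n第一编社会思维\n第2章社会中的自我32"
    for title, page in parse_toc(sample):
        print(page, title)
```

Because the page number is simply "whatever digits end the line", a title that itself ends in digits will lose them to the page number unless a space and the real page number follow.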

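A brief introduction to regular expressions

Regular expressions are a common programming tool. If you have never used them, think of them as something like the wildcards in Office. The expressions this tool is likely to need:

\d matches a single digit. For example, "第\d章" matches "第1章", "第2章", and so on, but not "第10章", because 10 is two digits;
\w matches a single word character (a letter, a digit, or an underscore; in Python 3 this includes Chinese characters). For example, "第\w章" matches "第1章", "第一章", and so on;
. matches any single character except a newline, including everything \w can match as well as spaces and other special characters;
* means the preceding element may match zero or more times. For example, "第\d*章" matches "第章", "第1章", and "第100章";
+ is like *, but the preceding element must match at least once;
{m,n} matches the preceding element m to n times (write it without a space after the comma).

Notes:

* and + match as much as possible. If you use "第\w*章", for example, the text "第一节如何阅读此章" is matched as well. It is better to fix the length of what you want to match and write "第\w{1,2}章".
To match an ordinary character, one with no special meaning in regular expressions, just write it, e.g. "第", "1", or even a space. To match one of the special characters literally, put a "\" in front of it; to match the format "1.1", for example, write "\d\.\d". The same applies to the "\" character itself.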
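The points above can be checked directly with Python's re module. This is a small demonstration under the assumption that the patterns are evaluated with Python 3 semantics; the sample strings are the ones used in the text plus two made-up titles.

```python
import re

# \d is exactly one digit, so a two-digit chapter number is not matched.
one_digit = re.search(r"第\d章", "第1章 导论")
two_digits = re.search(r"第\d章", "第10章 总结")

# \d+ accepts one or more digits.
any_digits = re.search(r"第\d+章", "第10章 总结")

# Greedy \w* runs as far as it can and swallows the whole phrase.
greedy = re.search(r"第\w*章", "第一节如何阅读此章")

# A bounded repeat does not match that phrase at all.
bounded = re.search(r"第\w{1,2}章", "第一节如何阅读此章")

# An unescaped dot matches any character; an escaped dot only a literal ".".
loose_dot = re.fullmatch(r"\d.\d", "1a1")
literal_dot = re.fullmatch(r"\d\.\d", "1a1")

print(bool(one_digit), bool(two_digits), any_digits.group())
print(greedy.group(), bounded)
print(bool(loose_dot), bool(literal_dot))
```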


pdfdir's Issues

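beta17 and beta18 fail to open on macOS 12.7.3

beta17 and beta18 fail to open on macOS 12.7.3: the icon bounces once in the Dock and then disappears, and the app never opens.

[bug] The command line reads only one line of the TOC string

As the title says: the command-line tool currently reads its input with input(), so only one line of the TOC string is read. Consider switching to readlines?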
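A minimal sketch of the suggested change, assuming the TOC arrives on standard input: input() stops at the first newline, whereas readlines() consumes everything up to end of file. The function name read_toc_lines is hypothetical and is not taken from pdfdir's code.

```python
import io
import sys


def read_toc_lines(stream=None):
    """Read every non-empty line from a stream (standard input by default)."""
    stream = sys.stdin if stream is None else stream
    lines = [line.rstrip("\r\n") for line in stream.readlines()]
    return [line for line in lines if line.strip()]


# Demonstration with an in-memory stream standing in for standard input.
demo = io.StringIO("前言\n第1章社会心理学导论2\n\n结语605\n")
print(read_toc_lines(demo))
```

When typing interactively, the input is ended with Ctrl-D on Unix-like systems or Ctrl-Z followed by Enter on Windows.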

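[bug?] GUI is missing features on Linux (running from source)

As shown in the screenshot, following the run-from-source instructions in the README produced this GUI, which lacks features such as changing the regular expressions for the TOC levels. Log: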

kf.kio.core: Malformed JSON protocol file for protocol: "trash" , number of the ExtraNames fields should match the number of ExtraTypes fields
kf.kio.core: Malformed JSON protocol file for protocol: "trash" , number of the ExtraNames fields should match the number of ExtraTypes fields
Qt: Session management error: networkIdsList argument is NULL
Icon theme "gnome" not found.
Icon theme "ubuntu-mono-dark" not found.
Icon theme "Mint-X" not found.
Icon theme "elementary" not found.
Icon theme "gnome" not found.
qt.accessibility.atspi: WARNING Qt AtSpiAdaptor: Accessible invalid:  QAccessibleInterface(0x5654bc77c2b0 invalid) "/org/a11y/atspi/accessible/2147483786"
qt.accessibility.atspi: WARNING Qt AtSpiAdaptor: Accessible invalid:  QAccessibleInterface(0x5654bc77c2b0 invalid) "/org/a11y/atspi/accessible/2147483786"

python 3.10.2; pyqt5 5.15.6-7.1; qt5 5.15.2
The README seems to say that running from source may not work properly, so maybe this is not a bug? (If it really is not, is there a workaround?)

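[Ready-made pack] Regular expressions for chapters, sections, and subsections

Level 1: "^第-?[1-9]\d章"
Level 2: "^[1-9]\d.\d*"
Level 3: "^[1-9]\d*.[1-9]\d*.[1-9]\d*"

Usage tips:

First put the TOC text in an editor and remove any extra spaces after "第" and before "章";
Check whether a TOC title ends in a digit. If it does, separate that digit from the page number with a space (e.g. 1.4.1 计算机历史1900~202281)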

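After generating a bookmarked PDF from a PDF that has a table of contents but no bookmarks, the original table of contents is no longer clickable

Thank you very much for such an excellent tool.
I did run into one problem, though: my PDF already has a table of contents whose entries can be clicked to jump to the corresponding page.
After using this tool to generate the bookmark panel, clicking a bookmark jumps to the right page, but clicking the original table of contents no longer does anything and no longer jumps to the page.

The software works well, but it is not adapted to the Mac Retina display

The software works well, but it is not adapted to the Mac Retina display: text and UI look blurry on a Retina screen. I hope this can be improved.

Writing always changes the default application for PDF files

When the default application for PDF files is SumatraPDF (v3.5.2 64-bit), writing the TOC always changes the file association: the default application for all PDF files becomes TeXworks (0.6.8).
When the default application is Adobe Acrobat Pro (2023.008.20412), this does not happen.

v0.3.0-beta4 crashes when clicking Write on M1 macOS 12.6

Praise for the author from a user of several years

Adding tables of contents to PDFs has been one of my sources of inner peace over the past few years; it satisfies my urge to keep digital files tidy. Thank you!
I know there has been no update for a long time, but I still want to ask: can TOC entries have negative page numbers? For example, my offset is 17 and the main text starts at page 1, but I want to include the preface, contents, and so on in the TOC, and their page numbers should be negative. I tried it and it did not seem to work. Did I do something wrong, or is this feature simply missing?
Thanks!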
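The arithmetic behind this request can be sketched as follows. Everything here is an assumption for illustration: the pattern, the function names, and the clamping rule are not pdfdir's implementation. The sketch accepts a minus sign only when it is preceded by whitespace, so a hyphen inside a title is not mistaken for a sign.

```python
import re

# Hypothetical pattern: "-12" only when the minus follows whitespace,
# otherwise plain trailing digits.
SIGNED_PAGE_RE = re.compile(r"((?<!\S)-\d+|\d+)\s*$")


def split_signed_page(line):
    """Return (title, printed_page), or (title, None) if no page is present."""
    match = SIGNED_PAGE_RE.search(line)
    if match is None:
        return line.strip(), None
    return line[: match.start()].strip(), int(match.group(1))


def pdf_page(printed_page, offset, page_count):
    """Map a printed page number to a 1-based PDF page, clamped to the file."""
    return max(1, min(page_count, printed_page + offset))


# With an offset of 17, printed page 1 is PDF page 18,
# and a preface labelled -5 lands on PDF page 12.
title, page = split_signed_page("Preface -5")
print(title, pdf_page(page, offset=17, page_count=300))
```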

pypdf.errors.DependencyError: cryptography>=3.1 is required for AES algorithm

The latest Windows build in the releases fails when writing:
pypdf.errors.DependencyError: cryptography>=3.1 is required for AES algorithm
2024-02-07 00:06:16,494 - CRITICAL - DependencyError: cryptography>=3.1 is required for AES algorithm:
Traceback (most recent call last):
File "src\gui\main.py", line 281, in write_tree_to_pdf
File "src\gui\main.py", line 288, in dict_to_pdf
File "src\pdf\bookmark.py", line 35, in add_bookmark
File "src\pdf\api.py", line 46, in init
File "pypdf_writer.py", line 2946, in append
File "pypdf_utils.py", line 486, in wrapper
File "pypdf_writer.py", line 3027, in merge
File "pypdf_writer.py", line 419, in add_page
File "pypdf_writer.py", line 332, in _add_page
File "pypdf\generic_data_structures.py", line 199, in clone
File "pypdf\generic_data_structures.py", line 310, in _clone
File "pypdf\generic_data_structures.py", line 116, in clone
File "pypdf\generic_base.py", line 292, in clone
File "pypdf\generic_base.py", line 312, in get_object
File "pypdf_reader.py", line 1417, in get_object
File "pypdf_encryption.py", line 850, in decrypt_object
File "pypdf_encryption.py", line 99, in decrypt_object
File "pypdf_crypt_providers_fallback.py", line 69, in decrypt
pypdf.errors.DependencyError: cryptography>=3.1 is required for AES algorithm

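Running from source works fine.

Crashes on some PDF files on Mac

I tested several PDF files, and the outline could be written to only one of them. For the others, the program crashes without any message as soon as I click Write. How can this be fixed?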

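Automatically tidy the formatting of the TOC text (spaces)

TOC data found online is often not well formed, mainly in these ways:

There may be no space before the page number
There is no space after the number of a second- or third-level heading
Each of these has to be fixed by hand every time. Could it be handled automatically?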
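One way the two clean-ups could be automated is sketched below. This is a heuristic offered as an illustration, not a feature of pdfdir; normalize_toc_line is a made-up name. It cannot tell a title that ends in digits from a page number, and it treats any leading number (including a year) as a section number, so the result still needs a visual check.

```python
import re


def normalize_toc_line(line):
    """Insert a space after a leading section number and before a trailing page."""
    line = line.strip()
    # "1.2标题" -> "1.2 标题": leading number directly followed by text.
    line = re.sub(r"^(\d+(?:\.\d+)*)(?=[^\d.\s])", r"\1 ", line)
    # "标题12" -> "标题 12": trailing digits directly preceded by text.
    line = re.sub(r"(?<=[^\d\s])(\d+)$", r" \1", line)
    return line


for raw in ["1.2标题12", "第1章社会心理学导论2", "1.1 SUPPORTED FEATURE LIST 5"]:
    print(normalize_toc_line(raw))
```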

The level-1 regular expression in v0.3.0-beta is inaccurate

TOC text:

Contents
1 INTRODUCTION AND SCOPE ................................................................................................................ 4
1.1 SUPPORTED FEATURE LIST ..................................................................................................................................5
1.2 STANDARDS COMPLIANCE...................................................................................................................................6

Level 1: ^[1-9]\d*
Level 2: ^[1-9]\d*.\d*
Level 3: ^[1-9]\d*.[1-9]\d*.[1-9]\d*
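One possible reading of this report is that the quoted patterns overlap: the level-1 pattern also matches the start of "1.1 ...", and the unescaped dot in the level-2 pattern matches a space. The sketch below shows that overlap and a stricter set of patterns. The STRICT patterns and the helper matching_levels are assumptions made for illustration, not pdfdir's defaults, and the sample lines are shortened versions of the ones in the report.

```python
import re

# Patterns exactly as quoted in the report (dots unescaped).
QUOTED = [
    r"^[1-9]\d*",
    r"^[1-9]\d*.\d*",
    r"^[1-9]\d*.[1-9]\d*.[1-9]\d*",
]

# Hypothetical stricter variants: literal dots, plus a lookahead that rejects
# a further digit or ".digit", so each pattern matches exactly one depth.
STRICT = [
    r"^[1-9]\d*(?!\.?\d)",
    r"^[1-9]\d*\.\d+(?!\.?\d)",
    r"^[1-9]\d*\.\d+\.\d+(?!\.?\d)",
]


def matching_levels(line, patterns):
    """Return the indexes of every pattern that matches the start of the line."""
    return [i for i, p in enumerate(patterns) if re.match(p, line)]


samples = [
    "1 INTRODUCTION AND SCOPE 4",
    "1.1 SUPPORTED FEATURE LIST 5",
    "1.2 STANDARDS COMPLIANCE 6",
]
for s in samples:
    print(s, matching_levels(s, QUOTED), matching_levels(s, STRICT))
```

With the quoted patterns every sample line matches both level 0 and level 1, so the result depends on the order in which the program tries them; with the stricter set each line matches exactly one level.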

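Could you publish a release of the new version?

From #5 it looks like the new version supports negative page numbers?
I tried packaging it following the instructions and got this error: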
Fatal error: PyInstaller does not include a pre-compiled bootloader for your
platform. For more details and instructions how to build the bootloader see
https://pyinstaller.readthedocs.io/en/stable/bootloader-building.html
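I went to the bootloader page to try again and saw "This requires a recent version of Xcode (12.2 or later)". I am on a 2018 Mac mini running 10.15.7, which cannot be upgraded any further.
Could the author publish an update?

Feels like there are a lot of bugs

1. The program exits unexpectedly when three TOC levels are used
2. With the following TOC, the page numbers and the entries do not match up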

考点1 题目1
考点1 解析12
考点2 题目3
考点2 解析14

but what gets written is

考点1 题目1
考点1 解析1
考点2 题目3
考点2 解析3

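App icon does not appear in Launchpad after installing on macOS

I had not upgraded in a long time and was still on v0.2. After upgrading to the latest version the app icon is gone and Launchpad does not show it, which is a real headache.

macos 12.6.6 Monterey

Migrate to PyPDF2 2.x

As the title says: PyPDF2 has had a major release, 2.x, which is no longer compatible with the 1.x syntax this project uses. Alternatively, consider pinning a specific version in requirements.txt.

This installation guide is incomprehensible to a programming beginner

It is far too terse; I am completely lost.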

Object 11 0 not defined.

Running with python3: the error occurs whether the TOC box is left empty or filled in.
/usr/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty pattern match.
return _compile(pattern, flags).split(string, maxsplit)
PdfReadWarning: Object 11 0 not defined. [pdf.py:1629]
Traceback (most recent call last):
File "/home/deepindh/Code/python/pdfdir/src/gui/main.py", line 157, in export_pdf
new_path = add_directory(*self._get_args())
File "/home/deepindh/Code/python/pdfdir/src/pdfdirectory.py", line 9, in add_directory
return add_bookmark(pdf_path, index_dict)
File "/home/deepindh/Code/python/pdfdir/src/pdf/bookmark.py", line 36, in add_bookmark
return pdf.save_pdf()
File "/home/deepindh/Code/python/pdfdir/src/pdf/api.py", line 70, in save_pdf
self.writer.write(out)
File "/home/deepindh/.local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 482, in write
self._sweepIndirectReferences(externalReferenceMap, self._root)
File "/home/deepindh/.local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 571, in _sweepIndirectReferences
self._sweepIndirectReferences(externMap, realdata)
File "/home/deepindh/.local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 547, in _sweepIndirectReferences
value = self._sweepIndirectReferences(externMap, value)
File "/home/deepindh/.local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 571, in _sweepIndirectReferences
self._sweepIndirectReferences(externMap, realdata)
File "/home/deepindh/.local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 547, in _sweepIndirectReferences
value = self._sweepIndirectReferences(externMap, value)
File "/home/deepindh/.local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 556, in _sweepIndirectReferences
value = self._sweepIndirectReferences(externMap, data[i])
File "/home/deepindh/.local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 571, in _sweepIndirectReferences
self._sweepIndirectReferences(externMap, realdata)
File "/home/deepindh/.local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 547, in _sweepIndirectReferences
value = self._sweepIndirectReferences(externMap, value)
File "/home/deepindh/.local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 577, in _sweepIndirectReferences
newobj = data.pdf.getObject(data)
File "/home/deepindh/.local/lib/python3.6/site-packages/PyPDF2/pdf.py", line 1631, in getObject
raise utils.PdfReadError("Could not find object.")
PyPDF2.utils.PdfReadError: Could not find object.
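Aborted

Steps to reproduce:

Enter at least two lines in the TOC text
Double-click any page number in the preview to enter edit mode
Then delete any line from the TOC text
Click any entry in the preview
An error is raised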

File "./pdfdir/src/gui/base.py", line 92, in close_editor
    self.closePersistentEditor(self.last_item, self.last_column)
RuntimeError: wrapped C/C++ object of type QTreeWidgetItem has been deleted

This is probably caused by trying to close the editor of an item that has already been deleted. Would adding a try/except near pdfdir/src/gui/base.py, line 92 be enough?
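The guard proposed here could look like the sketch below. It is written as a free function with a hypothetical name, close_editor_safely, so it can be shown without PyQt installed; tree is assumed to behave like a QTreeWidget, whose closePersistentEditor(item, column) is the call in the traceback. PyQt raises RuntimeError when the wrapped C++ object has already been deleted, which is what the except clause catches.

```python
def close_editor_safely(tree, item, column):
    """Close a persistent editor, tolerating an item that was already deleted.

    Returns True if the editor was closed, False if there was nothing to close.
    """
    if item is None:
        return False
    try:
        tree.closePersistentEditor(item, column)
    except RuntimeError:
        # The QTreeWidgetItem was deleted when the TOC text changed.
        return False
    return True
```

Catching the error hides the symptom only. Resetting the remembered item whenever the tree is rebuilt would address the cause, but whether that fits the existing code is not something this report establishes.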

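What to do when the window is not fully displayed after opening

Compatibility mode does not help either.

Heading levels are marked by space indentation, but the regular expressions have no effect, even though they match correctly in VS Code

I usually copy the TOC from Douban Books, Chaoxing Duxiu, and similar sites, then tidy it up in VS Code.

Selecting lines in bulk and pressing TAB to indent by 4 spaces is a fairly intuitive way to mark levels.

But no matter how I write the regular expressions, they never match in the program, even though they match correctly in VS Code.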

For example:

Level-1 heading
    Level-2 heading
        Level-3 heading

In VS Code, these regular expressions match:

.*
\s{4}.*
\s{8}.*

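But they have no effect in the program.

I think this could be improved: for adding a TOC to a PDF, rather than matching levels with regular expressions inside the program, the user would arrange the levels according to a fixed rule, and the program would extract the page numbers and apply the offset automatically.

Every book's TOC is different. With regular expressions, each book needs its own patterns; ones like "第x章" and "第x节" cannot fit every book.

If the user simply indents with spaces and the program treats 0 spaces as level 1, 4 spaces as level 2, and 8 spaces as level 3, that is actually simpler for the program.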
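The indentation rule proposed here is easy to sketch. The code below illustrates the suggestion only, not anything pdfdir currently does; parse_indented_toc and its indent parameter are made-up names. Tabs are expanded to spaces first, so a TAB-indented line counts the same as a line indented by 4 spaces.

```python
def parse_indented_toc(text, indent=4):
    """Return (level, title) pairs, where 0 spaces is level 1, 4 is level 2, ..."""
    entries = []
    for raw in text.splitlines():
        if not raw.strip():
            continue
        expanded = raw.expandtabs(indent)
        spaces = len(expanded) - len(expanded.lstrip(" "))
        entries.append((spaces // indent + 1, expanded.strip()))
    return entries


sample = "Level-1 heading\n    Level-2 heading\n        Level-3 heading"
for level, title in parse_indented_toc(sample):
    print(level, title)
```

A line indented by a number of spaces that is not a multiple of indent is rounded down to the shallower level.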