
pdftool's Introduction

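This project is a small tool built on the free version of the third-party library Spire.

Purpose

I recently needed to convert a very large PDF to Word with as little loss of fidelity as possible. The free online converters I tried all imposed either a page limit or a file-size limit, and their advanced features were paid.

So I decided to implement it myself. My first thought was Python, which is simple to implement, but the converted Word document lost fidelity and the page layout did not meet my requirements. I therefore switched to a free third-party tool and wrote my own conversion code around it.

For now there is no web version; that is planned as a later improvement.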

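The conversion approach is as follows:

1. The free version has a conversion page limit of 11 pages

2. When a PDF is given as input, it is converted directly if it has fewer than 11 pages; if it has more than 11 pages, it is first split into sub-PDFs

3. Each small PDF is converted, and the results are merged at the end.

Overall, the problem comes down to splitting a large PDF, converting the pieces, and merging them.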
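The splitting step can be sketched independently of Spire. The class and method names below (`PageChunker`, `chunkRanges`) are hypothetical and not taken from this repository, and the chunk size of 10 pages in `main` is an assumed conservative value rather than a confirmed limit. The sketch only computes which page ranges each sub-PDF would cover; the actual splitting, conversion, and merging would still be done with the library.

```java
import java.util.ArrayList;
import java.util.List;

public class PageChunker {

    /**
     * Returns inclusive, zero-based [start, end] page ranges that cover
     * pageCount pages using at most maxPagesPerChunk pages per range.
     */
    public static List<int[]> chunkRanges(int pageCount, int maxPagesPerChunk) {
        if (pageCount < 0 || maxPagesPerChunk <= 0) {
            throw new IllegalArgumentException("pageCount must be >= 0 and maxPagesPerChunk > 0");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int start = 0; start < pageCount; start += maxPagesPerChunk) {
            // The last chunk may be shorter than maxPagesPerChunk.
            int end = Math.min(start + maxPagesPerChunk, pageCount) - 1;
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Assumed chunk size of 10 pages; adjust to the limit of the library version in use.
        for (int[] range : chunkRanges(25, 10)) {
            System.out.println(range[0] + "-" + range[1]);
        }
    }
}
```

For a 25-page input with a chunk size of 10, this yields the ranges 0-9, 10-19, and 20-24. A document that fits in a single range needs no splitting at all.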

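Other approaches:

Skip the third-party library and scan directly with OCR. I am still considering this technique; if you are interested, you can add me on WeChat to discuss: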

fdd15735171890

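Features of this tool:

1. Images are not converted

2. Text is converted normally

3. Mathematical formulas are converted normally

4. The layout is not distorted

(Completely lossless conversion is not achievable; there will be minor differences, but the result is essentially the same as the PDF.)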

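Usage

1. git clone git@github.com:fengdongdongwsn/PdfTool.git

2. If you use Eclipse or MyEclipse, import the project and run it directly; the entry class is in Main.java

3. If you use IDEA or another Maven environment, add the following dependencies to pom.xml: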

    <repositories>
        <repository>
            <id>com.e-iceblue</id>
            <url>http://repo.e-iceblue.cn/repository/maven-public/</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>e-iceblue</groupId>
            <artifactId>spire.pdf.free</artifactId>
            <version>2.6.3</version>
        </dependency>

        <dependency>
            <groupId>e-iceblue</groupId>
            <artifactId>spire.doc.free</artifactId>
            <version>2.7.3</version>
        </dependency>

    </dependencies>

Then run the Main class directly.

Personal note

If you like it, please show some support, everyone:

image

pdftool's People

Contributors

fengdongdongwsn


pdftool's Issues

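Using several tools together via Maven raises an error (pdf and doc)

Problem: when using PdfTool with the libraries pulled in through Maven, splitting the file and then merging the files raises an error:
Exception in thread "main" java.lang.NoSuchMethodError: com.spire.license.LicenseProvider.validateVarAndGetInfo(Ljava/lang/String;)Lcom/spire/license/Assembly;
Solution: after you finish using pdf and before using doc to merge the documents, close the earlier pdf first, i.e. call pdf.close(). After that everything works normally.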
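The ordering described in this fix can be illustrated without the Spire classes. In the sketch below, `FakePdf` is a hypothetical stand-in for the PDF document object and is not a Spire class; the log strings are placeholders for the real split and merge steps. It only shows the pattern of closing the PDF side in a `finally` block before the Word merge begins.

```java
import java.util.ArrayList;
import java.util.List;

public class CloseBeforeMerge {

    /** Hypothetical stand-in for a PDF handle; not a Spire class. */
    static class FakePdf {
        private final List<String> log;

        FakePdf(List<String> log) {
            this.log = log;
            log.add("pdf opened");
        }

        void split() {
            log.add("pdf split");
        }

        void close() {
            log.add("pdf closed");
        }
    }

    /** Runs the split step, closes the PDF handle, and only then runs the merge step. */
    public static List<String> run() {
        List<String> log = new ArrayList<>();
        FakePdf pdf = new FakePdf(log);
        try {
            pdf.split();
        } finally {
            // Close the PDF side even if splitting fails, before any doc work starts.
            pdf.close();
        }
        log.add("doc merged"); // placeholder for the Word merge step
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Placing the close call in `finally` keeps the order fixed regardless of how the split step exits.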

Exception at runtime

class com.spire.pdf.packages.sprDuB: Culture Name: en-CN is not a supported culture
Does this mean PDFs in Chinese cannot be converted?
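The message names the culture en-CN, which looks like a JVM default locale that combines the English language with the China region, rather than something related to the Chinese content of the PDF. That reading is an assumption and is not confirmed in this thread. A minimal sketch for inspecting and overriding the default locale before any library call follows; the class and method names are hypothetical, and whether this removes the error has not been verified here.

```java
import java.util.Locale;

public class LocaleCheck {

    /** Sets the JVM default locale and returns its language tag, e.g. "zh-CN". */
    public static String overrideDefaultLocale(Locale replacement) {
        Locale.setDefault(replacement);
        return Locale.getDefault().toLanguageTag();
    }

    public static void main(String[] args) {
        // Shows what the JVM picked up from the operating system settings.
        System.out.println("before: " + Locale.getDefault().toLanguageTag());
        // Assumption: a standard language/region pair is accepted by the library.
        System.out.println("after:  " + overrideDefaultLocale(Locale.SIMPLIFIED_CHINESE));
    }
}
```

The same override can be tried without code changes by passing `-Duser.language=zh -Duser.region=CN` to the `java` command.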

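The generated docx document has a trial-version watermark, and an exception occurs when running from the command line

Environment

  • macOS Catalina, Version 10.15.7
  • The PDF file is a mixed Chinese and English PDF generated from Markdown
These are the steps I ran on the command line: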
> java -version
java version "1.8.0_211"
Java(TM) SE Runtime Environment (build 1.8.0_211-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode)

> git clone https://github.com/fengdongdongwsn/PdfTool.git
> cd PdfTool
> cat > src/Main.java << 'EOF'
import java.io.File;

public class Main {
	public static void main(String[] args) {
		if (args.length == 1) {
			String path = args[0];

			// Check that the file exists and is not a directory
			File f = new File(path);
			if (f.exists() && !f.isDirectory()) {
				String res = new PdfToWord().pdftoword(path);
				System.out.println("\n" + path + "\n -> " + res);
			} else {
				System.out.println("Error: file '" + path + "' does not exist");
			}

		} else {
			System.out.println("Please pass the path of one PDF file!");
		}
	}
}
EOF

> javac -classpath lib/Spire.Doc.jar:lib/Spire.Pdf.jar src/*.java -d bin/

> java -cp bin:lib/Spire.Doc.jar:lib/Spire.Pdf.jar Main ./python_note.pdf

./doc/
java.lang.IllegalStateException: Cannot find table 'OS/2' in the font file.
	at com.spire.doc.packages.sprmUA.spr  (Unknown Source)
	at com.spire.doc.packages.spraWA.spr  (Unknown Source)
	at com.spire.doc.packages.spraWA.spr  (Unknown Source)
	at com.spire.doc.packages.spraWA.spr  (Unknown Source)
	at com.spire.doc.packages.sprtUA.spr  (Unknown Source)
	at com.spire.doc.packages.sprtUA.spr (Unknown Source)
	at com.spire.doc.packages.sprtUA.spr (Unknown Source)
	at com.spire.doc.packages.sprtUA.spr  (Unknown Source)
	at com.spire.doc.packages.sprtUA.spr  (Unknown Source)
	at com.spire.doc.packages.sprtUA.spr  (Unknown Source)
	at com.spire.doc.packages.sprtUA.spr  (Unknown Source)
	at com.spire.doc.packages.sprtUA.spr  (Unknown Source)
	at com.spire.doc.packages.sprZOc.<init>(Unknown Source)
	at com.spire.doc.packages.sprUSc.spr (Unknown Source)
	at com.spire.doc.packages.sprUSc.spr  (Unknown Source)
	at com.spire.doc.packages.sprUSc.spr  (Unknown Source)
	at com.spire.doc.packages.spryRc.spr (Unknown Source)
	at com.spire.doc.packages.sprRRc.spr (Unknown Source)
	at com.spire.doc.packages.sprRRc.spr  (Unknown Source)
	at com.spire.doc.packages.sprRRc.spr  (Unknown Source)
	at com.spire.doc.packages.sprRRc.spr  (Unknown Source)
	at com.spire.doc.Document.spr   (Unknown Source)
	at com.spire.doc.Document.saveToFile(Unknown Source)
	at com.spire.doc.Document.saveToFile(Unknown Source)
	at MergeWordDocument.merge(MergeWordDocument.java:24)
	at PdfToWord.pdftoword(PdfToWord.java:64)
	at Main.main(Main.java:12)
[the same exception and stack trace are printed two more times]
true

./python.pdf
 -> 转换成功

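Problems

1. While running the command java -cp bin:lib/Spire.Doc.jar:lib/Spire.Pdf.jar Main ./python_note.pdf, some exceptions occurred:

java.lang.IllegalStateException: Cannot find table 'OS/2' in the font file.

See the full error output above.

2. The generated docx document contains a watermark
The generated docx document looks like this:
demo

The watermark text is Evaluation Warning : The document was created with Spire.PDF for Java.

Can these watermarks be suppressed, or removed automatically?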

Are there any tools or scripts for recognizing the text of a PDF?

I have many PDF books.
But the PDF scans are not clear enough. Are there algorithms or tools that can improve recognition?
So far I have found a domestic tool, pdf2HD, but it only works while reading and only optimizes the page for reading.

There is a problem with the jar license in the code

Exception in thread "main" java.lang.NoSuchMethodError: com.spire.license.LicenseProvider.validateVarAndGetInfo(Ljava/lang/String;)Lcom/spire/license/Assembly;
at com.spire.doc.Document.spr (Unknown Source)
at com.spire.doc.Document.<init>(Unknown Source)
at com.spire.doc.Document.<init>(Unknown Source)

Runtime exception

Exception in thread "main" java.lang.NoClassDefFoundError: javax/xml/bind/JAXBException
at com.spire.pdf.PdfDocument.<init>(Unknown Source)
at PdfToWord.pdftoword(PdfToWord.java:34)
at Main.main(Main.java:7)
It looks like the code from a jar cannot be found, but I have already tried importing it... why does it still not work?
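The `javax.xml.bind` package (JAXB) shipped with Java 8 but was removed from the JDK in Java 11, so this error commonly appears when code that expects it runs on a newer JDK. The report does not state the Java version, so treating that as the cause here is an assumption. The sketch below, with hypothetical class and method names, prints the running Java version and checks whether the JAXB class is visible on the classpath.

```java
public class JaxbCheck {

    /** Returns true if the named class can be located by this class's loader. */
    public static boolean isClassPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers.
            Class.forName(className, false, JaxbCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("JAXB present = " + isClassPresent("javax.xml.bind.JAXBException"));
    }
}
```

If the check reports that JAXB is absent, the usual options are running on Java 8 or adding a JAXB API artifact to the classpath alongside the Spire jars.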

java.lang.NullPointerException: null

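Hello author, my PDF has more than 10 pages, and the error is shown in the screenshots below. Strangely, it does not fail on Windows but does fail on Linux. I have not found the cause yet; the error screenshots are as follows: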
image
image
