thoqbk / traprange

(Java) A Method to Extract Tabular Content from PDF Files

License: MIT License

Java 15.64% Shell 0.16% HTML 84.20%

java pdf pdfbox parser pdf-parsing pdf-manipulation pdf-files

traprange's Introduction

TrapRange: a Method to Extract Table Content in PDF Files

Source: http://www.dzone.com/articles/traprange-method-extract-table

Update

Utilize OpenAI API to extract information from PDF files
To extract information from PDF invoice
To run from the command line, type java -jar traprange.latest.jar -h for help, or see the examples in the file test-command-line.sh

Introduction

The table is one of the most important data structures in documents. This is especially true when exporting data from enterprise systems, where data is usually in table format.

Several file formats are commonly used to store tabular content, such as CSV, text, and PDF. For the first two formats, extraction is quite straightforward: open the file, loop through its lines, and split cells on the proper separator. There are plenty of libraries that do this.

With PDF files, the story is completely different, because the format has no dedicated data definition for tabular content, nothing like the table, tr, and td tags in HTML. PDF is a complicated format in which text data, fonts, styling, and also images, audio, and video can all be mixed together. Below is my proposed solution for extracting data from high-density tabular content.

How to detect a table

After some investigation, I realized that:

Column: text content in cells of the same column lies in a rectangular space that does not overlap with the rectangular space of any other column. For example, in the following image, the red rectangle and the blue rectangle are separate spaces.
Row: words in the same horizontal alignment are in the same row. But this is only a sufficient condition, not a necessary one, because a cell in a row may span multiple lines. For example, the fourth cell in the yellow rectangle has two lines: the phrases "FK to this customer's record in" and "Ledgers table" are not in the same horizontal alignment, yet they belong to the same row. In my solution, I simply assume that the content of a cell is always a single line. Different lines in a cell are treated as belonging to different rows. Therefore the content in the yellow rectangle yields two rows: 1. {"Ledger_ID", "|", "Sales Ledger Account", "FK to this customer's record in"} 2. {NULL, NULL, NULL, "Ledgers table"}

PDFBox API

The library behind traprange is PDFBox, which is the best PDF library I know of so far. To extract text from a PDF file, the PDFBox API provides the following classes:

PDDocument: contains information about the entire PDF file. To load a PDF file, we use the method PDDocument.load(stream: InputStream)
PDPage: represents a page in the PDF document. We can retrieve a specific page by passing its index to this method: document.getDocumentCatalog().getAllPages().get(pageIdx: int)
TextPosition: represents an individual word or character in the document. We can fetch all TextPosition objects of a PDPage by overriding the method processTextPosition(text: TextPosition) in the class PDTextStripper. A TextPosition object has the methods getX(), getY(), getWidth(), and getHeight(), which return its bounds in the page, and the method getCharacter(), which returns its content.

In my work, I process text chunks directly by using TextPosition objects. Each text chunk in the PDF file yields a text element with the following attributes:

x: horizontal distance from the left border of the page
y: vertical distance from the top border of the page
maxX: equals x + width of the text chunk
maxY: equals y + height of the text chunk

Trap ranges

The most important step is identifying the bounds of each row and column: once we know the bounds of a row or column, we can retrieve all texts in it, and from there extract all content inside the table and put it in a structured model. We call these bounds trap-ranges. A TrapRange has two attributes:

lowerBound: the lower endpoint of this range
upperBound: the upper endpoint of this range

To calculate the trap-ranges, we loop through all texts of the page, project the range of each text onto the horizontal and vertical axes, and join the results together. After looping through all texts of the page, we have the trap-ranges and can use them to identify the cell data of the table.

Algorithm 1: calculating trap-ranges for each pdf page:

columnTrapRanges <-- []
rowTrapRanges <-- []
for each text in page
begin
     columnTrapRanges <-- join(columnTrapRanges, {text.x, text.x + text.width} )
     rowTrapRanges <-- join(rowTrapRanges, {text.y, text.y + text.height} )
end
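The join step in Algorithm 1 can be sketched in plain Java as a merge of overlapping intervals. This is a minimal illustration under stated assumptions, not the code in TrapRangeBuilder: the class name TrapRangeSketch, the use of int[2] arrays for ranges, and the choice to merge ranges that merely touch are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of traprange: merges projected text ranges.
final class TrapRangeSketch {

    /** Merges overlapping [lower, upper] ranges; each range is an int[2]. */
    static List<int[]> join(List<int[]> ranges) {
        // Sort a copy by lower endpoint so overlapping ranges become neighbours.
        List<int[]> sorted = new ArrayList<int[]>(ranges);
        Collections.sort(sorted, new Comparator<int[]>() {
            @Override
            public int compare(int[] a, int[] b) {
                return Integer.compare(a[0], b[0]);
            }
        });

        List<int[]> merged = new ArrayList<int[]>();
        for (int[] r : sorted) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && r[0] <= last[1]) {
                // Overlaps (or touches) the previous range: extend its upper endpoint.
                last[1] = Math.max(last[1], r[1]);
            } else {
                // Starts a new trap-range; copy so the input arrays stay untouched.
                merged.add(new int[] {r[0], r[1]});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Horizontal projections {x, x + width} of four text chunks in two columns.
        List<int[]> xRanges = Arrays.asList(
                new int[] {10, 40}, new int[] {100, 130},
                new int[] {35, 60}, new int[] {105, 120});
        for (int[] r : join(xRanges)) {
            System.out.println(r[0] + ".." + r[1]);
        }
    }
}
```

Sorting first keeps the merge to a single pass. The same method applies unchanged to the vertical projections {y, y + height} to obtain row trap-ranges.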

After calculating trap-ranges for the table, we loop through all texts again and classify them into correct cells of the table.

Algorithm 2: classifying text chunks into correct cells:

table <-- new Table()
for each text in page
begin
     rowIdx <-- in rowTrapRanges, get index of the range that contains this text
     columnIdx <-- in columnTrapRanges, get index of the range that contains this text
     table.addText(text, rowIdx, columnIdx)
end
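The lookup used in Algorithm 2 can be sketched as a linear search for the trap-range that encloses a text chunk's projection. The class CellLocatorSketch, the method indexOfEnclosing, and the int[2] range representation are hypothetical names chosen for this example. The library's own Table.addText is not reproduced here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of traprange: maps a text chunk to a cell index.
final class CellLocatorSketch {

    /** Returns the index of the range enclosing [lower, upper], or -1 if none does. */
    static int indexOfEnclosing(List<int[]> trapRanges, int lower, int upper) {
        for (int i = 0; i < trapRanges.size(); i++) {
            int[] r = trapRanges.get(i);
            if (r[0] <= lower && upper <= r[1]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumed trap-ranges for a page with two columns and two rows.
        List<int[]> columnTrapRanges = Arrays.asList(new int[] {10, 60}, new int[] {100, 130});
        List<int[]> rowTrapRanges = Arrays.asList(new int[] {0, 12}, new int[] {20, 32});

        // A text chunk with x = 105, width = 15, y = 21, height = 9.
        int columnIdx = indexOfEnclosing(columnTrapRanges, 105, 105 + 15);
        int rowIdx = indexOfEnclosing(rowTrapRanges, 21, 21 + 9);
        System.out.println("row " + rowIdx + ", column " + columnIdx);
    }
}
```

Because the trap-ranges were built from the same text chunks, every chunk on the page should fall inside exactly one row range and one column range. The -1 result only matters for chunks that were excluded when the ranges were built.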

Design and implementation

The main classes are:

TrapRangeBuilder: build() calculates and returns the ranges
Table, TableRow and TableCell: the table data structure
PDFTableExtractor is the most important class. It contains methods to initialize the extractor and extract table data from a PDF file. The builder pattern is used here. Some highlighted methods in this class:
setSource: sets the source of the PDF file. There are 3 overloads: setSource(InputStream), setSource(File) and setSource(String)
addPage: determines which pages will be processed. The default is all pages
exceptPage: skips a page
exceptLine: skips noisy data. All texts in these lines will be ignored.
extract: processes the file and returns the result

Example

PDFTableExtractor extractor = new PDFTableExtractor();
List<Table> tables = extractor.setSource("table.pdf")
	.addPage(0)
	.addPage(1)
	.exceptLine(0)  // the first line in each page
	.exceptLine(1)  // the second line in each page
	.exceptLine(-1) // the last line in each page
	.extract();
String html = tables.get(0).toHtml();  // table in HTML format
String csv = tables.get(0).toString(); // table in CSV format using semicolon as a delimiter

Following are some sample results (check out and run the test file TestExtractor.java):

Sample 1: Source: sample-1.pdf, result: sample-1.html
Sample 2: Source: sample-2.pdf, result: sample-2.html
Sample 3: Source: sample-3.pdf, result: sample-3.html
Sample 4: Source: sample-4.pdf, result: sample-4.html
Sample 5: Source: sample-5.pdf, result: sample-5.html

Evaluation

In my experiments, I used PDF files with a high density of table content. The results show that my implementation detects tabular content better than other open-source tools: pdftotext, pdftohtml, pdf2table. With documents that contain multiple tables or too much noisy data, my method does not work well. If a row has overlapping cells, the columns of these cells will be merged.

Conclusion

The TrapRange method works best with PDF files that have a high density of table data. With documents that contain multiple tables or too much noisy data, TrapRange is not a good choice. The method can also be implemented in other programming languages by replacing PDFBox with a corresponding PDF library, or by using the command-line tool pdftohtml to extract text chunks and using them as input for Algorithms 1 and 2.

System requirements

Java 8+
Maven 3+

License

The MIT License (MIT)


traprange's Issues

Can we get all PDF data into the String variable, instead of getting data page by page?

Hi a. Tho,

Currently, I'm using the "get" method to get PDF data from a specific page. I wonder whether we can get all PDF data at once instead of getting it page by page like that?
My code:

public static int rowNumberOfPDFFile(String pdfLink, int pagePDFNumber) throws IOException {
    PDFTableExtractor extractor = new PDFTableExtractor();
    List<Table> tables = extractor.setSource(pdfLink).extract();
    // get data from the given page into String html. Page number starts from 0
    String html = tables.get(pagePDFNumber).toHtml();

    html = html.substring(html.indexOf("border='1'>") + 11);
    int rowNumber = org.apache.commons.lang3.StringUtils.countMatches(html, "/tr");
    return rowNumber;
}

I would like to get all PDF data into "html" field. Could you please help?

Thanks,
Phan Nguyen

The method sort undefined for the type List<Range<Integer>>

Hi, thoqbk! I have downloaded the repository and I get an error when using Maven to package the jar. The error messages are as follows:

[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /E:/Repo/traprange-master/src/test/java/com/giaybac/traprange/test/TESTPDFBox.java:[57,15] cannot find symbol
  symbol:   method sort(<anonymous java.util.Comparator<com.google.common.collect.Range>>)
  location: variable ranges of type java.util.List<com.google.common.collect.Range<java.lang.Integer>>
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.129 s
[INFO] Finished at: 2015-11-18T12:03:09+08:00
[INFO] Final Memory: 14M/310M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:testCompile (default-testCompile) on project traprange: Compilation failure
[ERROR] /E:/Repo/traprange-master/src/test/java/com/giaybac/traprange/test/TESTPDFBox.java:[57,15] cannot find symbol
[ERROR] symbol:   method sort(<anonymous java.util.Comparator<com.google.common.collect.Range>>)
[ERROR] location: variable ranges of type java.util.List<com.google.common.collect.Range<java.lang.Integer>>

I then moved the project into Eclipse. After I fixed the library settings according to the pom.xml and the import suggestions, the following code in PDFTableExtractor.java was flagged as an error.

this.textPositions.sort(new Comparator<TextPosition>() {

This appears in several other Java files as well, e.g. TrapRangeBuilder and TESTPDFBox. Here is the Eclipse message:

The method sort(new Comparator<Range>(){}) is undefined for the type List<Range<Integer>>

Could you clarify how you implement the sort method?
Thank you very much.

Win 7 x64
Apache Maven 3.3.3
java version "1.7.0_40"
Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode)
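
A note on the error above: List.sort(Comparator) was added to java.util.List in Java 8, so the call cannot compile on the Java 1.7.0_40 listed in this report. The sketch below shows the Java 7-compatible equivalent, Collections.sort(list, comparator). It uses a plain list of integers as a stand-in for the Range list so that it needs no Guava dependency, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical example class; shows the pre-Java-8 way to sort with a Comparator.
final class Java7SortSketch {

    /** Returns a sorted copy using Collections.sort, which exists in Java 7 and earlier. */
    static List<Integer> sortedCopy(List<Integer> values) {
        List<Integer> copy = new ArrayList<Integer>(values);
        // On Java 8+ this could be written copy.sort(comparator) instead.
        Collections.sort(copy, new Comparator<Integer>() {
            @Override
            public int compare(Integer a, Integer b) {
                return a.compareTo(b);
            }
        });
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sortedCopy(Arrays.asList(30, 10, 20)));
    }
}
```

Building with a Java 8 or newer JDK, as the system requirements section asks, avoids the need for this change.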

Error NotSuchMethod

While implementing the extractor on a main java method, I get the following error:

java.lang.NoSuchMethodError: com.google.common.collect.Range.closed(Ljava/lang/Comparable;Ljava/lang/Comparable;)Lcom/google/common/collect/Range;

The program is as follows:

PDFTableExtractor extractor = new PDFTableExtractor();
extractor = extractor.setSource("C:/test.pdf");
extractor.addPage(0);
extractor.exceptLine(0, new int[]{0, 1});

List<Table> tables = extractor.extract();

try (Writer writer = new OutputStreamWriter(new FileOutputStream("C:/Users/rys_s/Documents/MiArchivo.html"), "UTF-8")) {
    for (Table table : tables) {
        writer.write("Page: " + (table.getPageIdx() + 1) + "\n");
        writer.write(table.toHtml());
    }
}

And the error comes apparently at the extractor.extract() line.

Adding an argument for specifying table position

Most tables do not occupy the whole pdf page. Can an argument be added so that users can specify the table position in the pdf? Only the region specified by the users is processed, instead of the whole pdf page.

error testpdfbox.java

The method sort(new Comparator(){}) is undefined for the type List<Range> TESTPDFBox.java line 54

running java version 8 update 91

availability in Maven repo

It would be really nice to have this library available in public Maven repo.

Thank you!

Need to get output in CSV format without HTML tags embedded in it.

How can I get a CSV file as output? Just by giving the output file a .csv extension?
I tried that, and the CSV file still contains HTML tags like tr and td.
How can I avoid getting those in the CSV file?

Table column not getting extracted.

I tried the code with my PDFs and the result was amazing. Rows are extracted correctly, but the problem is that columns are not extracted.

Missing input file

Even though I have given the path of the file, it reports the error "Missing input file".
Can you please look into this?

Trying to understand the process of creating a cell

I am working on something to automate reading PDF tables. The problem is that I only get one big column with multiple rows. While digging into the process, I saw that after extracting the TextPositions in PDFTableExtractor.java, I get only one character per TextPosition, where I assume a full cell is expected. Is this where my problem comes from?

Here is an example table from my tests. There are also tons of empty columns and rows that I delete myself first.

java.lang.NoSuchMethodError: 'org.apache.pdfbox.pdmodel.PDDocument org.apache.pdfbox.pdmodel.PDDocument.load(java.io.InputStream)'

In the method PDFTableExtractor.extract() there is the following line:
this.document = PDDocument.load(this.inputStream);
This no longer works. There is a new way to load the file:
Loader.loadPDF(File file);

Install on Ubuntu 15.10?

Is it possible to run this app on Ubuntu 15.10 (from the command line)?

thanks

Add a JAR without dependencies to allow other logging implementations

The only JAR available packages the log4j implementation, which forces projects that use it to take log4j as the active logger.

I've been using this JAR with Spring Boot v2.2.5.RELEASE successfully. I wanted to move to the current version 3.0.6, which benefits from a much more recent Java version and API, but I keep getting an exception saying I have to remove the log4j logger factories from the classpath if I want to use Logback.
I finally rolled back to v2.2.5.RELEASE to keep working.

This would not be a problem if the library were published on a Maven repository (#19), I guess.

Would it be OK to put a JAR without dependencies in your repo?

Included guava version 18 conflicts with selenium-java 3.12.0

When including the traprange jar in a project with selenium-java 3.12.0 in maven, the Guava version 18 conflicts. Attempting to instantiate a ChromeDriver object results in the following error:

java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkState(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)V
at org.openqa.selenium.remote.service.DriverService.findExecutable(DriverService.java:124)
at org.openqa.selenium.chrome.ChromeDriverService.access$000(ChromeDriverService.java:33)
at org.openqa.selenium.chrome.ChromeDriverService$Builder.findDefaultExecutable(ChromeDriverService.java:139)
at org.openqa.selenium.remote.service.DriverService$Builder.build(DriverService.java:335)
at org.openqa.selenium.chrome.ChromeDriverService.createDefaultService(ChromeDriverService.java:89)
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:123)

Solution is to use a more recent version of Guava.

Error - Overlapping letters of words

In the attached PDF, the first cell of the table header is (Dell-EMC Part), and after extraction it becomes (DPaerllt-EMC).

Can you please check it?

Dell_PartsList_ARNOLD INDUSTRIES INC_2016-12-23.pdf

Export to CSV

Hi. This is a really great package. Thanks so much.

I'm having one issue with exporting to CSV:

String csv = tables.get(0).toString();

I have a PDF table where the first column is sometimes empty. TrapRange discovers the column just fine, and export to HTML shows the column with the empty cell. But when I export to CSV, the empty cells are lost with no starting semicolon to denote them.

Do you experience the same?

Mike
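
For illustration only, and not the library's actual toString() implementation: the sketch below joins one row's cells with semicolons so that empty or missing cells still leave their delimiter behind. The class CsvRowSketch and the method toCsvRow are hypothetical names, and representing a missing cell as null is an assumption of the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of traprange: serialises one table row as CSV.
final class CsvRowSketch {

    /** Joins cells with ';', writing an empty field for null or empty cells. */
    static String toCsvRow(List<String> cells) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < cells.size(); i++) {
            if (i > 0) {
                sb.append(';'); // delimiter is written even when a neighbour is empty
            }
            String cell = cells.get(i);
            sb.append(cell == null ? "" : cell);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First column is empty, as in the table described above.
        System.out.println(toCsvRow(Arrays.asList(null, "Widget", "42")));
    }
}
```

Writing the delimiter by position rather than by non-empty content is what keeps the column count stable across rows.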

Pages are numbered from 0 (in command-line options)

Hi, I noticed that page numbers are read incorrectly from command-line options like -p -ep and -el @.
I made a sort-of workaround in commit dc99b6c, but I actually think it would be better to fix either the options parsing or the page number checks in PDFTableExtractor#extract

Upgrade to pdfbox 2.0?

Hi,

Nice work! Are you planning to migrate to pdfbox 2.0?
The new API is vastly improved, and it would be a big plus for users wishing to migrate to pdfbox 2.0, since many functions have been moved or deleted along with some classes.

Would love to hear back!
Cheers