traprange's Introduction

TrapRange: a Method to Extract Table Content in PDF Files

Source: http://www.dzone.com/articles/traprange-method-extract-table

Introduction

The table is one of the most important data structures in documents. Data exported from enterprise systems, in particular, is usually in table format.

Several file formats are commonly used to store tabular content, such as CSV, plain text, and PDF. The first two are straightforward to handle: open the file, loop through its lines, and split each line into cells on the proper separator. Plenty of libraries do this.

With PDF files, the story is completely different because the format has no dedicated data definition for tabular content, nothing like the table, tr, and td tags in HTML. PDF is a complicated format in which text, fonts, styling, images, audio, and video can all be mixed together. Below is my proposed solution for extracting data from high-density tabular content.

How to detect a table

After some investigation, I realized that:

  • Column: the text content of the cells in one column lies in a rectangular space that does not overlap the rectangular space of any other column. For example, in the following image, the red rectangle and the blue rectangle are separate spaces.
  • Row: words on the same horizontal alignment are in the same row. This is only a sufficient condition, though, because a cell in a row may span multiple lines. For example, the fourth cell in the yellow rectangle has two lines: the phrases "FK to this customer's record in" and "Ledgers table" are not on the same horizontal alignment, yet they still belong to the same row. In my solution, I simply assume that each cell holds single-line content, so different lines in a cell are treated as belonging to different rows. The content in the yellow rectangle therefore yields two rows: 1. {"Ledger_ID", "|", "Sales Ledger Account", "FK to this customer's record in"} 2. {NULL, NULL, NULL, "Ledgers table"}

recognize a table

PDFBox API

The library behind traprange is PDFBox, which is the best PDF library I know so far. To extract text from a PDF file, the PDFBox API provides the following classes:

  • PDDocument: contains the information of the entire PDF file. To load a PDF file, we use the method PDDocument.load(stream: InputStream)
  • PDPage: represents a page in the PDF document. We can retrieve the content of a specific page by passing the page index to this method: document.getDocumentCatalog().getAllPages().get(pageIdx: int)
  • TextPosition: represents an individual word or character in the document. We can fetch all TextPosition objects of a PDPage by overriding the method processTextPosition(text: TextPosition) in the class PDFTextStripper. A TextPosition object has the methods getX(), getY(), getWidth(), and getHeight(), which return its bounds on the page, and the method getCharacter(), which returns its content.

In my work, I process text chunks directly through TextPosition objects. Each text chunk in the PDF file yields a text element with the following attributes:

  • x: horizontal distance from the left of the page
  • y: vertical distance from the top border of the page
  • maxX: equals x + width of the text chunk
  • maxY: equals y + height of the text chunk
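The four attributes above can be sketched as a small value class. This is only an illustration: the class name TextChunkSketch and its fields are hypothetical and are not taken from the traprange source or from PDFBox.

```java
/** Hypothetical holder for one text chunk; not a class from traprange or PDFBox. */
final class TextChunkSketch {
    final String text;
    final double x;      // horizontal distance from the left of the page
    final double y;      // vertical distance from the top border of the page
    final double width;
    final double height;

    TextChunkSketch(String text, double x, double y, double width, double height) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    /** Right edge of the chunk: x + width. */
    double maxX() {
        return x + width;
    }

    /** Bottom edge of the chunk: y + height. */
    double maxY() {
        return y + height;
    }

    public static void main(String[] args) {
        // Made-up coordinates, for illustration only.
        TextChunkSketch chunk = new TextChunkSketch("Ledger_ID", 10, 100, 50, 10);
        System.out.println(chunk.text + " spans x " + chunk.x + ".." + chunk.maxX()
                + " and y " + chunk.y + ".." + chunk.maxY());
    }
}
```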

text position rectangle

Trap ranges

The most important step is identifying the bounds of each row and column. Once we know the bounds of a row or column, we can retrieve all the texts inside it, and from there we can easily extract the whole content of the table and put it into a structured model. We call these bounds trap-ranges. A TrapRange has two attributes:

  • lowerBound: contains the lower endpoint of this range
  • upperBound: contains the upper endpoint of this range

To calculate the trap-ranges, we loop through all texts of the page, project the range of each text onto the horizontal and vertical axes, and join the results together. After looping through all texts of the page, we have the trap-ranges and can use them to identify the cell data of the table.

join sample

Algorithm 1: calculating trap-ranges for each pdf page:

columnTrapRanges <-- []
rowTrapRanges <-- []
for each text in page
begin
     columnTrapRanges <-- join(columnTrapRanges, {text.x, text.x + text.width} )
     rowTrapRanges <-- join(rowTrapRanges, {text.y, text.y + text.height} )
end
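The join step in Algorithm 1 can be sketched in plain Java as follows. This is a minimal illustration, not the code of TrapRangeBuilder: the names TrapRangeJoinSketch, Span, and join are hypothetical, and the sketch sorts all projections first and merges them in one pass instead of joining one text at a time as the pseudocode does. Both approaches should produce the same disjoint ranges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class TrapRangeJoinSketch {

    /** Closed interval [lower, upper] on one axis (hypothetical helper type). */
    static final class Span {
        final double lower;
        final double upper;

        Span(double lower, double upper) {
            this.lower = lower;
            this.upper = upper;
        }

        @Override
        public String toString() {
            return "[" + lower + ", " + upper + "]";
        }
    }

    /** Merges overlapping projections into disjoint ranges ordered by lower bound. */
    static List<Span> join(List<Span> projections) {
        List<Span> sorted = new ArrayList<>(projections);
        sorted.sort(Comparator.comparingDouble(s -> s.lower));

        List<Span> merged = new ArrayList<>();
        for (Span s : sorted) {
            int last = merged.size() - 1;
            if (last >= 0 && s.lower <= merged.get(last).upper) {
                // Overlaps the previous range: extend it instead of starting a new one.
                Span previous = merged.remove(last);
                merged.add(new Span(previous.lower, Math.max(previous.upper, s.upper)));
            } else {
                merged.add(s);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up x-projections {x, x + width} of four text chunks in two columns.
        List<Span> columnTrapRanges = join(Arrays.asList(
                new Span(10, 40), new Span(12, 35),
                new Span(100, 160), new Span(95, 120)));
        System.out.println(columnTrapRanges);
    }
}
```

With these sample projections the four chunks collapse into two column ranges. The same merging explains the limitation noted in the evaluation: when cells of neighbouring columns overlap, their projections join into a single range.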

After calculating trap-ranges for the table, we loop through all texts again and classify them into correct cells of the table.

Algorithm 2: classifying text chunks into correct cells:

table <-- new Table()
for each text in page
begin
     rowIdx <-- in rowTrapRanges, get index of the range that contains this text
     columnIdx <-- in columnTrapRanges, get index of the range that contains this text
     table.addText(text, rowIdx, columnIdx)
end
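Algorithm 2 can be sketched in the same spirit. The names below (CellClassifierSketch, Span, Chunk, classify) are hypothetical and do not come from the traprange source. The sketch assumes the trap-ranges were already computed and uses a plain String[][] in place of the Table class. The sample coordinates are made up to mirror the two-line cell discussed earlier.

```java
import java.util.Arrays;
import java.util.List;

final class CellClassifierSketch {

    /** Closed interval [lower, upper] on one axis (hypothetical helper type). */
    static final class Span {
        final double lower;
        final double upper;

        Span(double lower, double upper) {
            this.lower = lower;
            this.upper = upper;
        }

        boolean contains(double lo, double hi) {
            return lower <= lo && hi <= upper;
        }
    }

    /** A text chunk with its position and size on the page (hypothetical). */
    static final class Chunk {
        final String text;
        final double x, y, width, height;

        Chunk(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Index of the first range containing [lo, hi], or -1 if none does. */
    static int indexOf(List<Span> ranges, double lo, double hi) {
        for (int i = 0; i < ranges.size(); i++) {
            if (ranges.get(i).contains(lo, hi)) {
                return i;
            }
        }
        return -1;
    }

    /** Places every chunk into the cell given by its row and column range. */
    static String[][] classify(List<Chunk> chunks, List<Span> rows, List<Span> columns) {
        String[][] table = new String[rows.size()][columns.size()];
        for (Chunk c : chunks) {
            int rowIdx = indexOf(rows, c.y, c.y + c.height);
            int columnIdx = indexOf(columns, c.x, c.x + c.width);
            if (rowIdx < 0 || columnIdx < 0) {
                continue; // chunk lies outside every known range
            }
            String existing = table[rowIdx][columnIdx];
            table[rowIdx][columnIdx] = existing == null ? c.text : existing + " " + c.text;
        }
        return table;
    }

    public static void main(String[] args) {
        List<Span> rows = Arrays.asList(new Span(100, 110), new Span(112, 122));
        List<Span> columns = Arrays.asList(new Span(10, 60), new Span(80, 200), new Span(220, 400));
        List<Chunk> chunks = Arrays.asList(
                new Chunk("Ledger_ID", 10, 100, 50, 10),
                new Chunk("Sales Ledger Account", 80, 100, 120, 10),
                new Chunk("FK to this customer's record in", 220, 100, 180, 10),
                new Chunk("Ledgers table", 220, 112, 80, 10));
        for (String[] row : classify(chunks, rows, columns)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Because each line of a multi-line cell falls into its own row range, the second row here holds only "Ledgers table" in the last column and nulls elsewhere, matching the single-line assumption described above.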

Design and implementation

traprange class diagram

The class diagram above shows the main classes:

  • TrapRangeBuilder: build() calculates and returns the ranges
  • Table, TableRow and TableCell: the table data structure
  • PDFTableExtractor is the most important class. It contains the methods that initialize the extraction and extract table data from a PDF file. It follows the builder pattern. Some notable methods of this class:
      • setSource: sets the source of the PDF file. There are 3 overloads: setSource(InputStream), setSource(File) and setSource(String)
      • addPage: determines which pages will be processed. The default is all pages
      • exceptPage: skips a page
      • exceptLine: skips noisy data. All texts in these lines are ignored
      • extract: processes the file and returns the result

Example

PDFTableExtractor extractor = new PDFTableExtractor();
List<Table> tables = extractor.setSource("table.pdf")
	.addPage(0)
	.addPage(1)
	.exceptLine(0)  // the first line in each page
	.exceptLine(1)  // the second line in each page
	.exceptLine(-1) // the last line in each page
	.extract();
String html = tables.get(0).toHtml();  // table in HTML format
String csv = tables.get(0).toString(); // table in CSV format using semicolon as a delimiter

For sample results, check out and run the test file TestExtractor.java.

Evaluation

In my experiments, I used PDF files with a high density of table content. The results show that my implementation detects tabular content better than other open-source tools: pdftotext, pdftohtml, and pdf2table. With documents that contain multiple tables or too much noisy data, my method does not work well. If a row has overlapping cells, the columns of these cells will be merged.

Conclusion

The TrapRange method works best with PDF files that have a high density of table data. For documents that contain multiple tables or too much noisy data, TrapRange is not a good choice. The method can also be implemented in other programming languages, either by replacing PDFBox with a corresponding PDF library or by using the command-line tool pdftohtml to extract text chunks and feeding them as input to algorithms 1 and 2.

System requirements

  1. Java 8+
  2. Maven 3+

References

  1. http://en.wikipedia.org/wiki/Portable_Document_Format
  2. http://pdfbox.apache.org
  3. http://ieg.ifs.tuwien.ac.at/pub/yildiz_iicai_2005.pdf
  4. http://www.foolabs.com/xpdf/
  5. http://ieg.ifs.tuwien.ac.at/projects/pdf2table/

License

The MIT License (MIT)

traprange's People

Contributors

hitenpratap, tho-resdiary, thoqbk

traprange's Issues

Can we get all PDF data into the String variable, instead of getting data page by page?

Hi a. Tho,

Currently, I'm using the "get" method to get PDF data from a specific page. Can we get all the PDF data at once instead of getting it page by page like that?
My code:

public static int rowNumberOfPDFFile(String pdfLink, int pagePDFNumber) throws IOException {
    PDFTableExtractor extractor = new PDFTableExtractor();
    List<Table> tables = extractor.setSource(pdfLink).extract();
    // get data from page 1 to String html. Page number starts from 0
    String html = tables.get(pagePDFNumber).toHtml();

    html = html.substring(html.indexOf("border='1'>") + 11);
    int rowNumber = org.apache.commons.lang3.StringUtils.countMatches(html, "/tr");
    return rowNumber;
}

I would like to get all the PDF data into the "html" field. Could you please help?

Thanks,
Phan Nguyen

The method sort undefined for the type List<Range<Integer>>

Hi, thoqbk! I have downloaded the repository and I get an error when using Maven to package the jar. The error messages are as follows:

[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /E:/Repo/traprange-master/src/test/java/com/giaybac/traprange/test/TESTPDFBox.java:[57,15] cannot find symbol
  symbol:   method sort(<anonymous java.util.Comparator<com.google.common.collect.Range>>)
  location: variable ranges of type java.util.List<com.google.common.collect.Range<java.lang.Integer>>
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.129 s
[INFO] Finished at: 2015-11-18T12:03:09+08:00
[INFO] Final Memory: 14M/310M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:testCompile (default-testCompile) on project traprange: Compilation failure
[ERROR] /E:/Repo/traprange-master/src/test/java/com/giaybac/traprange/test/TESTPDFBox.java:[57,15] cannot find symbol
[ERROR] symbol:   method sort(<anonymous java.util.Comparator<com.google.common.collect.Range>>)
[ERROR] location: variable ranges of type java.util.List<com.google.common.collect.Range<java.lang.Integer>>

I then moved the project into Eclipse. After I fixed the library settings according to the pom.xml and the import suggestions, the following code in PDFTableExtractor.java was flagged as an error.

this.textPositions.sort(new Comparator<TextPosition>() {

The same error appears in many other Java files, e.g. TrapRangeBuilder and TESTPDFBox. Here is the Eclipse message:

The method sort(new Comparator<Range>(){}) is undefined for the type List<Range<Integer>>

Could you clarify how you implemented the sort method?
Thank you very much.

Win 7 x64
Apache Maven 3.3.3
java version "1.7.0_40"
Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode)
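One likely cause, given the environment listed above: List.sort(Comparator) was added in Java 8, while the reported runtime is Java 1.7.0_40, and the system requirements list Java 8+. If upgrading is not an option, Collections.sort(list, comparator) is available on Java 7. The sketch below only illustrates that replacement call. It uses int[] pairs as a stand-in for Guava's Range<Integer> to stay dependency-free, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class Java7SortSketch {

    /** Returns a copy of the ranges ordered by lower bound, using only Java 7 APIs. */
    static List<int[]> sortByLowerBound(List<int[]> ranges) {
        List<int[]> sorted = new ArrayList<int[]>(ranges);
        // Collections.sort(List, Comparator) predates Java 8, unlike List.sort(Comparator).
        Collections.sort(sorted, new Comparator<int[]>() {
            @Override
            public int compare(int[] a, int[] b) {
                return Integer.compare(a[0], b[0]);
            }
        });
        return sorted;
    }

    public static void main(String[] args) {
        // Each int[] is a {lowerBound, upperBound} pair standing in for a Range<Integer>.
        List<int[]> ranges = Arrays.asList(new int[]{50, 80}, new int[]{10, 20}, new int[]{30, 40});
        for (int[] range : sortByLowerBound(ranges)) {
            System.out.println(Arrays.toString(range));
        }
    }
}
```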

Error NotSuchMethod

While implementing the extractor on a main java method, I get the following error:

java.lang.NoSuchMethodError: com.google.common.collect.Range.closed(Ljava/lang/Comparable;Ljava/lang/Comparable;)Lcom/google/common/collect/Range;

The program is as follows:

PDFTableExtractor extractor = new PDFTableExtractor();
extractor = extractor.setSource("C:/test.pdf");
extractor.addPage(0);
extractor.exceptLine(0, new int[]{0, 1});

List<Table> tables = extractor.extract();

try (Writer writer = new OutputStreamWriter(new FileOutputStream("C:/Users/rys_s/Documents/MiArchivo.html"), "UTF-8")) {
    for (Table table : tables) {
        writer.write("Page: " + (table.getPageIdx() + 1) + "\n");
        writer.write(table.toHtml());
    }
}

The error apparently occurs at the extractor.extract() line.

Adding an argument for specifying table position

Most tables do not occupy the whole pdf page. Can an argument be added so that users can specify the table position in the pdf? Only the region specified by the users is processed, instead of the whole pdf page.

error testpdfbox.java

The method sort(new Comparator(){}) is undefined for the type List<Range> TESTPDFBox.java line 54

running java version 8 update 91

Table column not getting extracted.

I tried the code with my PDFs and the result was amazing. However, I found that rows are extracted but columns are not.

Missing input file

Although I have given the path of the file, I get the error "Missing input file".
Can you please look into this?

Trying to understand the process of creating a cell

I am working on automating the reading of PDF tables. At the moment I get only one big column with multiple rows. While digging into the process, I saw that after the TextPositions are extracted in the PDFTableExtractor.java class, each TextPosition holds only one character, where I suppose we expect a full cell. Is this where my problem comes from?
exempleTab
Here is an example table from my tests. There are also tons of empty columns and rows that I delete myself beforehand.

Add a JAR without dependencies to allow other logging implementations

The only available JAR packages a log4j implementation, which forces projects that use it to adopt log4j as the active logger.

I've been using this JAR successfully with Spring Boot v2.2.5.RELEASE. I wanted to use the current version 3.0.6, which benefits from a much more recent Java version and API, but I keep getting an exception saying I have to remove the log4j logger factories from the classpath if I want to use Logback.
I finally rolled back to v2.2.5.RELEASE to keep working.

I guess this would not be a problem if it were published on a Maven repository (#19).

Would it be OK to put a JAR without dependencies in your repo?

Included guava version 18 conflicts with selenium-java 3.12.0

When including the traprange jar in a project with selenium-java 3.12.0 in maven, the Guava version 18 conflicts. Attempting to instantiate a ChromeDriver object results in the following error:

java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkState(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)V
at org.openqa.selenium.remote.service.DriverService.findExecutable(DriverService.java:124)
at org.openqa.selenium.chrome.ChromeDriverService.access$000(ChromeDriverService.java:33)
at org.openqa.selenium.chrome.ChromeDriverService$Builder.findDefaultExecutable(ChromeDriverService.java:139)
at org.openqa.selenium.remote.service.DriverService$Builder.build(DriverService.java:335)
at org.openqa.selenium.chrome.ChromeDriverService.createDefaultService(ChromeDriverService.java:89)
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:123)

The solution is to use a more recent version of Guava.

Export to CSV

Hi. This is a really great package. Thanks so much.

I'm having one issue with exporting to CSV:

String csv = tables.get(0).toString();

I have a PDF table where the first column is sometimes empty. TrapRange discovers the column just fine, and export to HTML shows the column with the empty cell. But when I export to CSV, the empty cells are lost with no starting semicolon to denote them.

Do you experience the same?

Mike

Pages are numbered from 0 (in command-line options)

Hi, I noticed that page numbers are read incorrectly from command-line options like -p -ep and -el @.
I made a sort-of workaround in commit dc99b6c, but I actually think it's better to fix either the options parsing or the page number checks in PDFTableExtractor#extract

Upgrade to pdfbox 2.0?

Hi,

Nice work! Are you planning to migrate to pdfbox 2.0?
The new API is vastly improved, and a migration would be a big plus for users wishing to move to pdfbox 2.0, since many functions have been moved or deleted along with some classes.

Would love to hear back!
Cheers
