bobld / tabula-sharp

Extract tables from PDF files (port of tabula-java)

License: MIT License

C# 100.00%
extracting-tables pdfs extraction-engine tabula pdfpig pdfparser csharp netstandard table dotnet

tabula-sharp's Introduction

tabula-sharp

tabula-sharp is a library for extracting tables from PDF files. It is a port of tabula-java.

Windows Linux Mac OS

  • Supports .NET 6, .NET Core 3.1, .NET Standard 2.0, .NET Framework 4.5.2, 4.6, 4.6.1, 4.6.2, 4.7
  • No Java bindings

NuGet packages are available on the releases page and on www.nuget.org.

Differences with tabula-java

  • Uses PdfPig, and not PdfBox.
  • Coordinate system starts from the bottom left point (going up) of the page, and not from the top left point (going down).
  • The NurminenDetectionAlgorithm is replaced by SimpleNurminenDetectionAlgorithm, because the original algorithm requires an image management library.
  • Table results might differ because of the way PdfPig builds the bounding boxes of Letters.
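The coordinate difference matters when reusing areas that were measured for tabula-java. A minimal sketch of the conversion, assuming an unrotated page whose origin sits at (0, 0) and C# 7 or later; `AreaConverter` and `FromTopLeft` are hypothetical names for illustration and are not part of tabula-sharp:

```csharp
using System;

// Hypothetical helper, not part of tabula-sharp: converts an area measured from the
// top-left corner (y grows downward) to one measured from the bottom-left corner
// (y grows upward).
public static class AreaConverter
{
    // Returns (left, bottom, right, top) measured from the bottom-left corner of the page.
    public static (double Left, double Bottom, double Right, double Top) FromTopLeft(
        double top, double left, double bottom, double right, double pageHeight)
    {
        // x is unchanged; each y coordinate is mirrored around the page height.
        return (left, pageHeight - bottom, right, pageHeight - top);
    }
}

public static class Program
{
    public static void Main()
    {
        // A US Letter page is 612 x 792 points.
        var area = AreaConverter.FromTopLeft(top: 100, left: 50, bottom: 300, right: 400, pageHeight: 792);
        Console.WriteLine($"left={area.Left}, bottom={area.Bottom}, right={area.Right}, top={area.Top}");
        // prints "left=50, bottom=492, right=400, top=692"
    }
}
```

Pages with a rotation or a shifted crop box would need extra handling that this sketch does not attempt.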

Usage

Stream mode - BasicExtractionAlgorithm

using System.Collections.Generic;
using Tabula;
using Tabula.Detectors;
using Tabula.Extractors;
using UglyToad.PdfPig;

using (PdfDocument document = PdfDocument.Open("doc.pdf", new ParsingOptions() { ClipPaths = true }))
{
	ObjectExtractor oe = new ObjectExtractor(document);
	PageArea page = oe.Extract(1); // page numbers start at 1
	
	// detect candidate table zones
	SimpleNurminenDetectionAlgorithm detector = new SimpleNurminenDetectionAlgorithm();
	var regions = detector.Detect(page);
	
	// extract the table from the first candidate area only
	IExtractionAlgorithm ea = new BasicExtractionAlgorithm();
	List<Table> tables = ea.Extract(page.GetArea(regions[0].BoundingBox)); // take first candidate area
	var table = tables[0];
	var rows = table.Rows;
}

Lattice mode - SpreadsheetExtractionAlgorithm

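using System.Collections.Generic;
using Tabula;
using Tabula.Extractors;
using UglyToad.PdfPig;

using (PdfDocument document = PdfDocument.Open("doc.pdf", new ParsingOptions() { ClipPaths = true }))
{
	ObjectExtractor oe = new ObjectExtractor(document);
	PageArea page = oe.Extract(1); // page numbers start at 1

	// lattice mode works on the whole page and relies on ruling lines
	IExtractionAlgorithm ea = new SpreadsheetExtractionAlgorithm();
	List<Table> tables = ea.Extract(page);
	var table = tables[0];
	var rows = table.Rows;
}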

Results

Stream mode - BasicExtractionAlgorithm

example

Lattice mode - SpreadsheetExtractionAlgorithm

example

tabula-sharp's People

Contributors

bobld

tabula-sharp's Issues

Extract table issue - don't get an underscore

Hi @BobLd,

Many thanks also from me for the helpful library!

I used the SpreadsheetExtractionAlgorithm to try to extract tables from a PDF file (sample attached).

The names from the table are intended to represent variable names for a computer program.

Unfortunately, the result is shown without the underscore:

I get "Content" instead of "C_ontent"

Example

After c.SetTextElements(TextElement.MergeWords(page.GetText(c.BoundingBox))); was called in SpreadsheetExtractionAlgorithm.cs, this.textElements = textElements; is executed in public void SetTextElements(List<T> textElements){} in RectangularTextContainer.cs.

I found out that the order is exchanged when assigning values in this.textElements = textElements;:

List of values before assignment:

assigning_1

List of values after assignment:

assigning_2

Later the back part will be cut off, I think.

Unfortunately I haven't found a solution. I would be very grateful for any help.

Example PDF.pdf

Continuous table columns ignored if the page has no data for the column

In this scenario, the first page of the PDF has the table headers, and the remaining pages continue the table data without repeating the headers.

When the extraction is performed on the remaining pages, it returns fewer columns if a page has no data for some of the columns.

This makes the data invalid, and in these cases it is not possible to identify which column is missing in order to fix the data.

image

[BUG] - {Lattice} SpreadsheetExtractionAlgorithm failing in capturing rows and cells

Describe the bug
While extracting a PDF I realized some tables were getting split because a row was not captured, as if it was considered a blank line. On other tables, the last cell in the row was skipped.

Screenshots
The screenshots are not from the original PDF, but they will hopefully illustrate the problem.

Given a table with a schema similar to the image below:
1-schema

I expected to capture the entire table at once:
2-expected

However, once the 5th row was skipped, I ended up with two distinct capture groups.
3-extracted

The other problem is that some cells are sometimes skipped. One example is the 1st capture, which has missing data in its last 3 rows:
4-skipped-data

Sometimes the data skipping happens also when the table is not split.

Rows from table in PDF return as one unbroken string (no individual cells)

When I use the old tabula-java, it splits the cells out of the table, but this is not working in tabula-sharp: I just get a whole row/line without the individual data broken out. Maybe this is because the table is non-uniform? (different column counts on different rows)

Example table (cannot attach PDF as it has personal info)
TableExample

I am using the latest version of PdfPig but that didn't seem to help. See the example code below; maybe I'm doing something wrong with the syntax, I'm just trying to iterate through the rows.

using (PdfDocument document = PdfDocument.Open(path, new ParsingOptions() { ClipPaths = false }))
{
    ObjectExtractor oe = new ObjectExtractor(document);
    PageArea page = oe.Extract(Page);

    // detect candidate table zones
    SimpleNurminenDetectionAlgorithm detector = new SimpleNurminenDetectionAlgorithm();
    var regions = detector.Detect(page);

    IExtractionAlgorithm ea = new BasicExtractionAlgorithm();
    List<Table> tables = ea.Extract(page.GetArea(regions[0].BoundingBox)); // take first candidate area
    var table = tables[0];
    var rows = table.Rows;

    string result = "";
    string test = rows[0][0].GetText(); // <---- testing first cell
    Run.PrintLog("Test: " + test);

    foreach (var r in rows)
    {
        foreach (RectangularTextContainer txt in r)
        {
            result += txt.GetText() + "|";   // <---- for each cell (?)
        }
        result += System.Environment.NewLine;
    }
    Run.PrintLog("Tab result: " + result);
}

[BUG] - {Stream} PageArea has top=0 and bottom=-612

Describe the bug
ObjectExtractor.Extract() produces a PageArea with top=0 and bottom=-612. Only happens on certain PDFs, and I'm not sure what causes it.

To Reproduce
I can send the PDF privately. Don't want to expose any proprietary information online.

using (PdfDocument doc = PdfDocument.Open(filePath, new ParsingOptions() { ClipPaths = true }))
{
    foreach (UglyToad.PdfPig.Content.Page page in doc.GetPages())
    {
        var extractor = new ObjectExtractor(doc);
        var pageArea = extractor.Extract(page.Number);
    }
}

Expected behavior
I would expect top/bottom to always be positive. left and right attributes don't ever seem to display this behavior.

Screenshots
image

Additional context
My workaround is to detect this particular scenario and change the way I'm creating the PdfRectangle. I've also seen other scenarios where top=-612 and bottom=0, and height=-612. Probably a similar issue.

Merged columns when extracting tables

Hi @BobLd ,

Thanks for such an awesome library. In my pdf file, for some reason, two columns are merged (I attached a couple of images) when I'm trying to extract the table. I was wondering maybe you can help me to determine what might cause that issue. I can also send the pdf file via email since it has sensitive information. Thanks in advance.

pdf
dotnet

Originally posted by @emrebiber in #13 (comment)

garbled text

When Japanese-language forms are output, the characters may be garbled in Stream mode.

This may be caused by PdfPig.
When I extracted only text using PdfPig, the text was garbled as well.

Are you planning to use another library (such as DocNET) instead of PdfPig?
In the case of DocNET, the text was not garbled.

Sometimes SpreadsheetExtractionAlgorithm ignores last row.

This code in the Tabula.PageArea.GetArea method adds a horizontal ruling to the PageArea instance, going from right to left:

rv.AddRuling(new Ruling(
new PdfPoint(rv.Right, rv.Bottom),
new PdfPoint(rv.Left, rv.Bottom)));

This leads to a situation where Tabula.Ruling.SortObjectComparer sorts the objects in an invalid order.
As a result, the list of intersections returned by Tabula.Ruling.FindIntersections is invalid, and the result of Tabula.Extractors.SpreadsheetExtractionAlgorithm.FindCells does not contain some cells that it should.

As a fix, change the Tabula.PageArea.GetArea method so that the ruling goes from left to right:

rv.AddRuling(new Ruling(
    new PdfPoint(rv.Left, rv.Bottom),
    new PdfPoint(rv.Right, rv.Bottom)));
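The underlying idea, making sure a segment always runs from its left end to its right end before anything sorts it, can be sketched independently of the library. The `Point` and `Segment` types below are hypothetical stand-ins written for this illustration (C# 7.2 or later); they are not the Tabula.Ruling or PdfPoint types, and this is not how tabula-sharp itself handles it:

```csharp
using System;

// Hypothetical stand-in types for illustration only.
public readonly struct Point
{
    public Point(double x, double y) { X = x; Y = y; }
    public double X { get; }
    public double Y { get; }
}

public readonly struct Segment
{
    public Segment(Point start, Point end) { Start = start; End = end; }
    public Point Start { get; }
    public Point End { get; }

    // Returns an equivalent segment whose Start is the left-most point
    // (or the lower point when the segment is vertical).
    public Segment Normalized()
    {
        bool swap = Start.X > End.X || (Start.X == End.X && Start.Y > End.Y);
        return swap ? new Segment(End, Start) : this;
    }
}

public static class Program
{
    public static void Main()
    {
        // Bottom edge described right-to-left, as in the issue above.
        var bottomEdge = new Segment(new Point(500, 20), new Point(40, 20));
        var fixedEdge = bottomEdge.Normalized();
        Console.WriteLine($"({fixedEdge.Start.X}, {fixedEdge.Start.Y}) -> ({fixedEdge.End.X}, {fixedEdge.End.Y})");
        // prints "(40, 20) -> (500, 20)"
    }
}
```

Normalizing at construction time would make later comparers independent of the order in which callers pass the two end points.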

[BUG] - Stream: Area detection hangs on PDF page

Describe the bug
When attempting to extract tables from this 250+ page PDF, I found that it hangs on a specific page (98), in the 'Detect' method.

To Reproduce
Using 40927R03.pdf

I've tried with 0.1.3 and 0.1.4-alpha001, and it hangs in the same spot with both.

Using .NET 6.0, C#.

using var pdoc = PdfDocument.Open(content.Stream, new ParsingOptions { SkipMissingFonts = true, UseLenientParsing = true });
var da = new Tabula.Detectors.SimpleNurminenDetectionAlgorithm();

var area = Tabula.ObjectExtractor.ExtractPage(pdoc, 98 /* hangs on this page */);
var regions = da.Detect(area); // <-- this line hangs

Expected behavior
To properly parse all tables.
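Until the cause of such a hang is found, one possible stopgap is to bound the time spent on each page so that the rest of the document can still be processed. The sketch below is a generic pattern, not a tabula-sharp feature; `Guard` and `TryRun` are hypothetical names, and the lambdas in `Main` are stand-ins for a call such as `da.Detect(area)`:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class Guard
{
    // Runs 'work' on a thread-pool thread and waits up to 'timeout' for it to finish.
    // Returns false on timeout. The work itself is NOT cancelled: it keeps running
    // in the background. If 'work' throws, Wait rethrows it as an AggregateException.
    public static bool TryRun<T>(Func<T> work, TimeSpan timeout, out T result)
    {
        Task<T> task = Task.Run(work);
        if (task.Wait(timeout))
        {
            result = task.Result;
            return true;
        }
        result = default(T);
        return false;
    }
}

public static class Program
{
    public static void Main()
    {
        // Stand-in for a page that is processed quickly.
        bool fast = Guard.TryRun(() => 42, TimeSpan.FromSeconds(1), out int value);

        // Stand-in for a page that takes far longer than the allowed time.
        bool slow = Guard.TryRun(() => { Thread.Sleep(2000); return 0; },
                                 TimeSpan.FromMilliseconds(100), out int _);

        Console.WriteLine($"fast={fast} value={value} slow={slow}");
    }
}
```

Because the timed-out work keeps consuming a thread and CPU, this only helps for skipping the occasional bad page; it does not fix the hang itself.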

Extract table issue - stream algo, no guessed area

Hi @BobLd

First of all, thanks so much for sharing this useful library!!

I used the BasicExtractionAlgorithm to try to extract tables from a PDF file (sample attached).
test3.pdf

The second-page output is just fine, as I expected:
image

But for the first page, because there is some text above the table, the column output is wrong (see the highlight in yellow).
image

How would you suggest fixing this? Maybe there is a way to ignore the non-table text on the first page?
