bobld / tabula-sharp

Extract tables from PDF files (port of tabula-java)

License: MIT License

C# 100.00%

extracting-tables pdfs extraction-engine tabula pdfpig pdfparser csharp netstandard table dotnet

tabula-sharp's Introduction

tabula-sharp

tabula-sharp is a library for extracting tables from PDF files. It is a port of tabula-java.

Supports .NET 6, .NET Core 3.1, .NET Standard 2.0, .NET Framework 4.5.2, 4.6, 4.6.1, 4.6.2, 4.7
No Java bindings

NuGet packages are available on the releases page and on www.nuget.org.

Differences with tabula-java

Uses PdfPig, and not PdfBox.
Coordinate system starts from the bottom left point (going up) of the page, and not from the top left point (going down).
The NurminenDetectionAlgorithm is replaced by SimpleNurminenDetectionAlgorithm, because the original requires an image management library.
Table results might be different because of the way PdfPig builds Letter bounding boxes.
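
The coordinate-system difference matters when reusing an area that was defined for tabula-java. As a minimal sketch, assuming the area is given as top, left, bottom, right measured from the top left of the page and that the page height in points is known, the vertical values can be flipped as shown below. The AreaConverter class and its method names are hypothetical and are not part of tabula-sharp or PdfPig.

```csharp
using System;

// Hypothetical helper for illustration; not part of tabula-sharp or PdfPig.
public static class AreaConverter
{
    // Converts a distance measured downward from the top edge of the page
    // into a y coordinate measured upward from the bottom edge.
    public static double FlipY(double yFromTop, double pageHeight)
    {
        return pageHeight - yFromTop;
    }

    // Converts an area with a top-left origin (top, left, bottom, right)
    // into (Left, Bottom, Right, Top) with a bottom-left origin.
    public static (double Left, double Bottom, double Right, double Top) ToBottomLeft(
        double top, double left, double bottom, double right, double pageHeight)
    {
        return (left, FlipY(bottom, pageHeight), right, FlipY(top, pageHeight));
    }
}
```

For a page that is 792 points high, an area with top=100 and bottom=300 in the top-left system becomes Bottom=492 and Top=692 in the bottom-left system. The horizontal values are unchanged.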

Usage

Stream mode - BasicExtractionAlgorithm

using (PdfDocument document = PdfDocument.Open("doc.pdf", new ParsingOptions() { ClipPaths = true }))
{
	ObjectExtractor oe = new ObjectExtractor(document);
	PageArea page = oe.Extract(1);
	
	// detect candidate table zones
	SimpleNurminenDetectionAlgorithm detector = new SimpleNurminenDetectionAlgorithm();
	var regions = detector.Detect(page);
	
	IExtractionAlgorithm ea = new BasicExtractionAlgorithm();
	List<Table> tables = ea.Extract(page.GetArea(regions[0].BoundingBox)); // take first candidate area
	var table = tables[0];
	var rows = table.Rows;
}

Lattice mode - SpreadsheetExtractionAlgorithm

using (PdfDocument document = PdfDocument.Open("doc.pdf", new ParsingOptions() { ClipPaths = true }))
{
	ObjectExtractor oe = new ObjectExtractor(document);
	PageArea page = oe.Extract(1);

	IExtractionAlgorithm ea = new SpreadsheetExtractionAlgorithm();
	List<Table> tables = ea.Extract(page);
	var table = tables[0];
	var rows = table.Rows;
}

Results

Stream mode - BasicExtractionAlgorithm

Lattice mode - SpreadsheetExtractionAlgorithm

tabula-sharp's Issues

Extract table issue - don't get an underscore

Hi @BobLd,

Many thanks also from me for the helpful library!

I used the SpreadsheetExtractionAlgorithm to try to extract tables from a PDF file; a sample is attached.

The names from the table are intended to represent variable names for a computer program.

Unfortunately, the result is shown without the underscore:

I get "Content" instead of "C_ontent"

After c.SetTextElements(TextElement.MergeWords(page.GetText(c.BoundingBox))); was called in SpreadsheetExtractionAlgorithm.cs, this.textElements = textElements; is executed in public void SetTextElements(List<T> textElements){} in RectangularTextContainer.cs.

I found out that the order is exchanged when assigning values in this.textElements = textElements;:

List of values before assignment:

List of values after assignment:

Later the back part will be cut off, I think.

Unfortunately I haven't found a solution. I would be very grateful for any help.

Example PDF.pdf

Continuous table columns ignored if the page has no data for the column

In the issue scenario, the first page of the PDF has the table headers, and the remaining pages continue the table data without the header.

When extraction is performed on the remaining pages, it returns fewer columns if some columns have no data on that page.

This makes the data invalid, and in these cases it is not possible to identify which column is missing in order to fix the data.

[BUG] - {Lattice} SpreadsheetExtractionAlgorithm failing in capturing rows and cells

Describe the bug
While extracting a PDF I realized some tables were getting split because a row was not captured, as if it was considered a blank line. On other tables, the last cell in the row was skipped.

Screenshots
The screenshots are not from the original PDF, but they will hopefully illustrate the problem.

Given a table with a schema similar to the image below:

I expected to capture the entire table at once:

However, once the 5th row was skipped, I ended up with two distinct capture groups.

The other problem is that sometimes some cells are skipped. One example is the 1st capture, which has missing data in the last 3 rows:

Sometimes the data skipping happens also when the table is not split.

Rows from table in PDF return as one unbroken string (no individual cells)

When I use the old tabula-java, it splits the table into cells, but this does not work in tabula-sharp: I just get a whole row/line without the individual cells broken out. Maybe this is because the table is non-uniform (different column counts on different rows)?

Example table (cannot attach PDF as it has personal info)

I am using the latest version of PdfPig, but that didn't seem to help. See the example code below; maybe I'm doing something wrong with the syntax. I'm just trying to iterate through the rows.

using (PdfDocument document = PdfDocument.Open(path, new ParsingOptions() { ClipPaths = false }))
{
    ObjectExtractor oe = new ObjectExtractor(document);
    PageArea page = oe.Extract(Page);

    // detect candidate table zones
    SimpleNurminenDetectionAlgorithm detector = new SimpleNurminenDetectionAlgorithm();
    var regions = detector.Detect(page);

    IExtractionAlgorithm ea = new BasicExtractionAlgorithm();
    List<Table> tables = ea.Extract(page.GetArea(regions[0].BoundingBox)); // take first candidate area
    var table = tables[0];
    var rows = table.Rows;

    string result = "";
    string test = rows[0][0].GetText(); // <---- testing first cell
    Run.PrintLog("Test: " + test);

    foreach (var r in rows)
    {
        foreach (RectangularTextContainer txt in r)
        {
            result += txt.GetText() + "|"; // <---- for each cell (?)
        }
        result += System.Environment.NewLine;
    }
    Run.PrintLog("Tab result: " + result);
}

[BUG] - {Stream} PageArea has top=0 and bottom=-612

Describe the bug
ObjectExtractor.Extract() produces a PageArea with top=0 and bottom=-612. Only happens on certain PDFs, and I'm not sure what causes it.

To Reproduce
I can send the PDF privately. Don't want to expose any proprietary information online.

using (PdfDocument doc = PdfDocument.Open(filePath, new ParsingOptions() { ClipPaths = true }))
{
    foreach (UglyToad.PdfPig.Content.Page page in doc.GetPages())
    {
        var extractor = new ObjectExtractor(doc);
        var pageArea = extractor.Extract(page.Number);
    }
}

Expected behavior
I would expect top/bottom to always be positive. left and right attributes don't ever seem to display this behavior.

Screenshots

Additional context
My workaround is to detect this particular scenario and change the way I'm creating the PdfRectangle. I've also seen other scenarios where top=-612 and bottom=0, and height=-612. Probably a similar issue.
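
The detection step of such a workaround could look roughly like the sketch below. It works on plain numbers rather than on PdfRectangle or PageArea, and the VerticalBounds class is hypothetical. It only flags the symptom and orders the two values; whether they should also be shifted into the positive range depends on the PDF and is not addressed here.

```csharp
using System;

// Hypothetical helper for illustration; not part of tabula-sharp or PdfPig.
public static class VerticalBounds
{
    // True when any of the reported values is negative, which is the
    // symptom described in this issue (for example top=0, bottom=-612).
    public static bool LooksSuspicious(double top, double bottom, double height)
    {
        return top < 0 || bottom < 0 || height < 0;
    }

    // Returns the two vertical values ordered so that Bottom <= Top,
    // together with a non-negative height.
    public static (double Bottom, double Top, double Height) Order(double top, double bottom)
    {
        double lower = Math.Min(top, bottom);
        double upper = Math.Max(top, bottom);
        return (lower, upper, upper - lower);
    }
}
```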

The money text is split

Which is better...tabulasharp or camelotsharp?

Hi @BobLd ,

I see you have ported two libraries to C#... tabula and camelot. Which is better? Do you simply use both and pick the best results?

Merged columns when extracting tables

Hi @BobLd ,

Thanks for such an awesome library. In my pdf file, for some reason, two columns are merged (I attached a couple of images) when I'm trying to extract the table. I was wondering maybe you can help me to determine what might cause that issue. I can also send the pdf file via email since it has sensitive information. Thanks in advance.

Originally posted by @emrebiber in #13 (comment)

garbled text

When Japanese-language forms are output, the characters may be garbled in Stream mode.

This may be caused by PdfPig.
When I extracted only text using PdfPig, the text was garbled as well.

Are you planning to use another library (such as DocNET) instead of PdfPig?
In the case of DocNET, the text was not garbled.

Sometimes SpreadsheetExtractionAlgorithm ignores last row.

This code in the Tabula.PageArea.GetArea method adds a horizontal ruling to the PageArea instance, running from right to left:

tabula-sharp/Tabula/PageArea.cs

Lines 161 to 163 in fe6e6e5

rv.AddRuling(new Ruling(
    new PdfPoint(rv.Right, rv.Bottom),
    new PdfPoint(rv.Left, rv.Bottom)));

This leads to a situation where Tabula.Ruling.SortObjectComparer sorts objects in an invalid order.
As a result, the list of intersections returned by Tabula.Ruling.FindIntersections is invalid, and the result of Tabula.Extractors.SpreadsheetExtractionAlgorithm.FindCells does not contain some cells that it should.

As a fix, change the Tabula.PageArea.GetArea method so that the ruling is added from left to right:

rv.AddRuling(new Ruling(
    new PdfPoint(rv.Left, rv.Bottom),
    new PdfPoint(rv.Right, rv.Bottom)));
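
A more defensive variant of the same idea is to order the two endpoints before a ruling is built, so that a horizontal ruling always runs left to right no matter how the caller passes the points. The sketch below uses plain tuples instead of the library's Ruling and PdfPoint types, and the RulingPoints class is hypothetical. The tie-break on Y for vertical lines is an arbitrary choice made for this sketch, not something taken from the library.

```csharp
using System;

// Hypothetical helper for illustration; not part of tabula-sharp or PdfPig.
public static class RulingPoints
{
    // Returns the two endpoints ordered by X, then by Y, so that the
    // start point is never to the right of the end point.
    public static ((double X, double Y) Start, (double X, double Y) End) Order(
        (double X, double Y) p1, (double X, double Y) p2)
    {
        bool swap = p1.X > p2.X || (p1.X == p2.X && p1.Y > p2.Y);
        return swap ? (p2, p1) : (p1, p2);
    }
}
```

With this helper, passing (Right, Bottom) followed by (Left, Bottom) yields the same ordered pair as passing them the other way round.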

[BUG] - Stream: Area detection hangs on PDF page

Describe the bug
When attempting to extract tables from this 250+ page PDF, I found that it hangs on a specific page (98), in the 'Detect' method.

To Reproduce
Using 40927R03.pdf

I've tried with 0.1.3 and 0.1.4-alpha001, and both hang in the same spot.

Using .NET 6.0, C#.

using var pdoc = PdfDocument.Open(content.Stream, new ParsingOptions { SkipMissingFonts = true, UseLenientParsing = true });
var da = new Tabula.Detectors.SimpleNurminenDetectionAlgorithm();

var area = Tabula.ObjectExtractor.ExtractPage(pdoc, 98 /* hangs on this page */);
var regions = da.Detect(area); // <-- this line hangs

Expected behavior
To properly parse all tables.
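
Until the cause of the hang is found, one possible way to keep a batch run moving is to time-box the detection call and skip pages that do not finish. The sketch below is a generic helper using only System.Threading.Tasks; the TimeBoxed class is hypothetical and not part of tabula-sharp. It does not cancel the hung work, which keeps running on a thread-pool thread, so it is a stopgap rather than a fix.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of tabula-sharp or PdfPig.
public static class TimeBoxed
{
    // Runs the work on a thread-pool thread and waits at most `timeout`.
    // Returns true and sets `result` if the work finished in time.
    // Returns false on timeout; the work is NOT cancelled and keeps running.
    // If the work throws, Task.Wait rethrows it wrapped in an AggregateException.
    public static bool TryRun<T>(Func<T> work, TimeSpan timeout, out T result)
    {
        Task<T> task = Task.Run(work);
        if (task.Wait(timeout))
        {
            result = task.Result;
            return true;
        }

        result = default;
        return false;
    }
}
```

With the variables from the snippet above, a call might look like TimeBoxed.TryRun(() => da.Detect(area), TimeSpan.FromSeconds(30), out var regions), where the 30 second limit is an arbitrary value chosen for illustration.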

Extract table issue - stream algo, no guessed area

Hi @BobLd

First of all, thanks so much for sharing this useful library!!

I used the BasicExtractionAlgorithm to try to extract tables from a PDF file; a sample is attached.
test3.pdf

The second-page output is just fine, as I expected:

but for the first page, because there is some text above the table, the column output is wrong (see the highlight in yellow).

How would you suggest fixing it? Maybe there is a way to ignore the non-table text on the first page?