bobld / tabula-sharp

Extract tables from PDF files (port of tabula-java)

License: MIT License

C# 100.00%

extracting-tables pdfs extraction-engine tabula pdfpig pdfparser csharp netstandard table dotnet tabula-sharp tabula-java extract-table extraction extract table-extraction pdf-table-extract pdf-table-extraction

tabula-sharp's Issues

Rows from table in PDF return as one unbroken string (no individual cells)

The old tabula-java splits the table into cells, but tabula-sharp does not: I just get each whole row as one line, without the individual values broken out. Maybe this is because the table is non-uniform (different column counts on different rows)?

Example table (cannot attach PDF as it has personal info)

I am using the latest version of PdfPig, but that didn't seem to help. See the example code below; maybe I'm doing something wrong with the syntax. I'm just trying to iterate through the rows.

using (PdfDocument document = PdfDocument.Open(path, new ParsingOptions() { ClipPaths = false }))
{
    ObjectExtractor oe = new ObjectExtractor(document);
    PageArea page = oe.Extract(Page);

    // detect candidate table zones
    SimpleNurminenDetectionAlgorithm detector = new SimpleNurminenDetectionAlgorithm();
    var regions = detector.Detect(page);

    IExtractionAlgorithm ea = new BasicExtractionAlgorithm();
    List<Table> tables = ea.Extract(page.GetArea(regions[0].BoundingBox)); // take first candidate area
    var table = tables[0];
    var rows = table.Rows;

    string result = "";
    string test = rows[0][0].GetText(); // <---- testing first cell
    Run.PrintLog("Test: " + test);

    foreach (var r in rows)
    {
        foreach (RectangularTextContainer txt in r)
        {
            result += txt.GetText() + "|"; // <---- for each cell (?)
        }
        result += System.Environment.NewLine;
    }
    Run.PrintLog("Tab result: " + result);
}

Which is better...tabulasharp or camelotsharp?

Hi @BobLd ,

I see you have ported two libraries to C#... tabula and camelot. Which is better? Do you simply use both and pick the best results?

Extract table issue - stream algo, no guessed area

Hi @BobLd

First of all, thanks so much for sharing this useful library!!

I used the BasicExtractionAlgorithm to try to extract tables from a PDF file; a sample is attached.
test3.pdf

The second-page output is just fine, as I expected:

But for the first page, because there is some text above the table, the column output is wrong (see the yellow highlight).

How would you suggest fixing this? Maybe there is a way to ignore the non-table text on the first page?

Sometimes SpreadsheetExtractionAlgorithm ignores last row.

This code in the Tabula.PageArea.GetArea method adds a horizontal ruling to the PageArea instance that runs from right to left:

tabula-sharp/Tabula/PageArea.cs

Lines 161 to 163 in fe6e6e5

rv.AddRuling(new Ruling(
    new PdfPoint(rv.Right, rv.Bottom),
    new PdfPoint(rv.Left, rv.Bottom)));

This causes Tabula.Ruling.SortObjectComparer to sort the objects in an invalid order.
As a result, the list of intersections returned by Tabula.Ruling.FindIntersections is invalid, and the result of Tabula.Extractors.SpreadsheetExtractionAlgorithm.FindCells does not contain some cells that it should.

As a fix, change the Tabula.PageArea.GetArea method so that the ruling runs from left to right:

rv.AddRuling(new Ruling(
    new PdfPoint(rv.Left, rv.Bottom),
    new PdfPoint(rv.Right, rv.Bottom)));
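
A more general way to guard against this kind of ordering problem is to normalize endpoint order before a ruling is stored. The following is only a minimal sketch of that idea, not code from tabula-sharp: `Point`, `Segment`, and `Normalized` are hypothetical stand-ins for `PdfPoint` and `Ruling`, and it assumes that downstream sorting expects horizontal segments to run left to right and vertical segments to run from the smaller Y to the larger Y.

```csharp
using System;

// Hypothetical stand-ins for illustration only; these are NOT the PdfPig or tabula-sharp types.
public readonly record struct Point(double X, double Y);

public readonly record struct Segment(Point Start, Point End)
{
    // Returns the same segment with its endpoints ordered by X, then by Y.
    // Assumption: consumers expect Start to be the left-most (or, for vertical
    // segments, the smaller-Y) endpoint.
    public Segment Normalized()
    {
        bool swap = Start.X > End.X || (Start.X == End.X && Start.Y > End.Y);
        return swap ? new Segment(End, Start) : this;
    }
}

public static class Demo
{
    public static void Main()
    {
        // A bottom edge built right to left, as in the snippet reported above.
        var bottomEdge = new Segment(new Point(500, 20), new Point(40, 20));
        var fixedEdge = bottomEdge.Normalized();

        // The normalized edge now starts at the left-most endpoint.
        Console.WriteLine($"{fixedEdge.Start.X} -> {fixedEdge.End.X}");
    }
}
```

Normalizing at construction time would make the result independent of how callers happen to order the two points, at the cost of losing the original drawing direction.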

[BUG] - {Stream} PageArea has top=0 and bottom=-612

Describe the bug
ObjectExtractor.Extract() produces a PageArea with top=0 and bottom=-612. Only happens on certain PDFs, and I'm not sure what causes it.

To Reproduce
I can send the PDF privately. Don't want to expose any proprietary information online.

using (PdfDocument doc = PdfDocument.Open(filePath, new ParsingOptions() { ClipPaths = true }))
{
    foreach (UglyToad.PdfPig.Content.Page page in doc.GetPages())
    {
        var extractor = new ObjectExtractor(doc);
        var pageArea = extractor.Extract(page.Number);
    }
}

Expected behavior
I would expect top/bottom to always be positive. left and right attributes don't ever seem to display this behavior.

Screenshots

Additional context
My workaround is to detect this particular scenario and change the way I'm creating the PdfRectangle. I've also seen other scenarios where top=-612 and bottom=0, and height=-612. Probably a similar issue.

garbled text

When Japanese-language forms are output, the characters may be garbled in Stream mode.

This may be caused by PdfPig.
When I extracted only text using PdfPig, the text was garbled as well.

Are you planning to use another library (such as DocNET) instead of PdfPig?
In the case of DocNET, the text was not garbled.

[BUG] - {Lattice} SpreadsheetExtractionAlgorithm failing in capturing rows and cells

Describe the bug
While extracting a PDF I realized some tables were getting split because a row was not captured, as if it was considered a blank line. On other tables, the last cell in the row was skipped.

Screenshots
The screenshots are not from the original PDF, but it will hopefully illustrate the problem.

Given a table with a schema similar to the image below:

I expected to capture the entire table at once:

However, once the 5th row was skipped, I ended up with two distinct capture groups.

The other problem is that some cells are sometimes skipped. One example is the 1st capture, which has missing data in the last 3 rows:

Sometimes the data skipping happens also when the table is not split.

Extract table issue - don't get an underscore

Hi @BobLd,

Many thanks also from me for the helpful library!

I used the SpreadsheetExtractionAlgorithm to try to extract tables from a PDF file; a sample is attached.

The names from the table are intended to represent variable names for a computer program.

Unfortunately, the result is shown without the underscore:

I get "Content" instead of "C_ontent"

After c.SetTextElements(TextElement.MergeWords(page.GetText(c.BoundingBox))); was called in SpreadsheetExtractionAlgorithm.cs, this.textElements = textElements; is executed in public void SetTextElements(List<T> textElements){} in RectangularTextContainer.cs.

I found out that the order is exchanged when assigning values in this.textElements = textElements;:

List of values before assignment:

List of values after assignment:

I think the trailing part is cut off later.

Unfortunately I haven't found a solution. I would be very grateful for any help.

Example PDF.pdf

[BUG] - Stream: Area detection hangs on PDF page

Describe the bug
When attempting to extract tables from this 250+ page PDF, I found that it hangs on a specific page (98), in the 'Detect' method.

To Reproduce
Using 40927R03.pdf

I've tried with 0.1.3 and 0.1.4-alpha001, and both hang in the same spot.

Using .NET 6.0, C#.

using var pdoc = PdfDocument.Open(content.Stream, new ParsingOptions { SkipMissingFonts = true, UseLenientParsing = true });
var da = new Tabula.Detectors.SimpleNurminenDetectionAlgorithm();

var area = Tabula.ObjectExtractor.ExtractPage(pdoc, 98 /* hangs on this page */);
var regions = da.Detect(area); // <-- this line hangs

Expected behavior
To properly parse all tables.

Merged columns when extracting tables

Hi @BobLd ,

Thanks for such an awesome library. In my PDF file, for some reason, two columns are merged when I try to extract the table (I attached a couple of images). Could you help me determine what might cause this? I can also send the PDF file via email, since it contains sensitive information. Thanks in advance.

Originally posted by @emrebiber in #13 (comment)

Continuous table columns ignored if the page has no data for the column

In this scenario, the first page of the PDF has the table headers, and the remaining pages continue the table data without the header.

When extraction is performed on the remaining pages, it returns fewer columns whenever a page has no data for some of them.

This makes the data invalid, and in these cases there is no way to identify which column is missing in order to fix the data.

The money text is split