bobld / tabula-sharp
Extract tables from PDF files (port of tabula-java)
License: MIT License
With the old tabula-java, the cells are split out of the table, but this does not work in tabula-sharp: I just get a whole row/line without the individual data broken out. Maybe this is because the table is non-uniform (different column counts on different rows)?
Example table (cannot attach the PDF as it has personal info)
I am using the latest version of PdfPig, but that didn't seem to help. See the example code below; maybe I'm doing something wrong with the syntax. I'm just trying to iterate through the rows.
using (PdfDocument document = PdfDocument.Open(path, new ParsingOptions() { ClipPaths = false }))
{
    ObjectExtractor oe = new ObjectExtractor(document);
    PageArea page = oe.Extract(Page);

    // detect candidate table zones
    SimpleNurminenDetectionAlgorithm detector = new SimpleNurminenDetectionAlgorithm();
    var regions = detector.Detect(page);

    IExtractionAlgorithm ea = new BasicExtractionAlgorithm();
    List<Table> tables = ea.Extract(page.GetArea(regions[0].BoundingBox)); // take first candidate area
    var table = tables[0];
    var rows = table.Rows;

    string result = "";
    string test = rows[0][0].GetText(); // <---- testing first cell
    Run.PrintLog("Test: " + test);

    foreach (var r in rows)
    {
        foreach (RectangularTextContainer txt in r)
        {
            result += txt.GetText() + "|"; // <---- for each cell (?)
        }
        result += System.Environment.NewLine;
    }
    Run.PrintLog("Tab result: " + result);
}
Hi @BobLd ,
I see you have ported two libraries to C#... tabula and camelot. Which is better? Do you simply use both and pick the best results?
Hi @BobLd
First of all, thanks so much for sharing this useful library!
I used the BasicExtractionAlgorithm to try to extract tables from a PDF file; a sample is attached.
test3.pdf
The second-page output is just fine, as I expected:
But for the first page, because there is some text above the table, the column output is wrong (see the yellow highlight).
What would you suggest to fix it? Maybe there is a way to ignore the non-table text on the first page?
This code in the Tabula.PageArea.GetArea method adds a horizontal ruling to the PageArea instance that runs from right to left:
tabula-sharp/Tabula/PageArea.cs
Lines 161 to 163 in fe6e6e5
As a result, Tabula.Ruling.SortObjectComparer puts the objects in an invalid order, the result of Tabula.Ruling.FindIntersections is invalid, and the result of Tabula.Extractors.SpreadsheetExtractionAlgorithm.FindCells does not contain some cells that it should.
As a fix:
Fix the Tabula.PageArea.GetArea method
rv.AddRuling(new Ruling(
    new PdfPoint(rv.Left, rv.Bottom),
    new PdfPoint(rv.Right, rv.Bottom)));
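As a complement to the fix above, a defensive option would be to normalize the endpoint order of any segment before it is sorted or intersected. The following is only a sketch of that idea: `Segment` and `Normalized` are hypothetical stand-ins invented for illustration, not the Tabula.Ruling type or any tabula-sharp API.

```csharp
using System;

// Hypothetical stand-in for a ruling line; NOT the Tabula.Ruling type.
public readonly record struct Segment(double X1, double Y1, double X2, double Y2)
{
    // Returns an equivalent segment whose start point is the left-most
    // endpoint (or the lower endpoint when the segment is vertical).
    public Segment Normalized()
    {
        bool swap = X1 > X2 || (X1 == X2 && Y1 > Y2);
        return swap ? new Segment(X2, Y2, X1, Y1) : this;
    }
}
```

With this, a horizontal segment created right to left, such as `new Segment(10, 5, 2, 5)`, comes back with its start at x = 2 and its end at x = 10, so code that assumes left-to-right order sees consistent input.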
Describe the bug
ObjectExtractor.Extract() produces a PageArea with top=0 and bottom=-612. Only happens on certain PDFs, and I'm not sure what causes it.
To Reproduce
I can send the PDF privately. Don't want to expose any proprietary information online.
using (PdfDocument doc = PdfDocument.Open(filePath, new ParsingOptions() { ClipPaths = true }))
{
    foreach (UglyToad.PdfPig.Content.Page page in doc.GetPages())
    {
        var extractor = new ObjectExtractor(doc);
        var pageArea = extractor.Extract(page.Number);
    }
}
Expected behavior
I would expect top/bottom to always be positive. The left and right attributes don't ever seem to display this behavior.
Additional context
My workaround is to detect this particular scenario and change the way I'm creating the PdfRectangle. I've also seen other scenarios where top=-612 and bottom=0, and height=-612. Probably a similar issue.
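The reporter's actual workaround is not shown. One possible shape for it, sketched below, is to normalize the vertical bounds before building a rectangle from them. This assumes that a negative value is purely a sign flip of the real coordinate, which may not hold for every PDF; `BoundsHelper` and `NormalizeVertical` are hypothetical names and not part of tabula-sharp or PdfPig.

```csharp
using System;

public static class BoundsHelper
{
    // Returns (Bottom, Top) with 0 <= Bottom <= Top.
    // Assumption: a negative input is only a sign flip of the real coordinate.
    public static (double Bottom, double Top) NormalizeVertical(double top, double bottom)
    {
        double a = Math.Abs(top);
        double b = Math.Abs(bottom);
        return (Math.Min(a, b), Math.Max(a, b));
    }
}
```

Both reported cases (top=0, bottom=-612 and top=-612, bottom=0) would map to a bottom of 0 and a top of 612 under this assumption.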
When Japanese-language forms are output, the characters may be garbled in Stream mode.
This may be caused by PdfPig.
When I extracted only text using PdfPig, the text was garbled as well.
Are you planning to use another library (such as DocNET) instead of PdfPig?
In the case of DocNET, the text was not garbled.
Describe the bug
While extracting a PDF, I noticed that some tables were getting split because a row was not captured, as if it were considered a blank line. In other tables, the last cell in a row was skipped.
Screenshots
The screenshots are not from the original PDF, but they will hopefully illustrate the problem.
Given a table with a schema similar to the image below:
I expected to capture the entire table at once:
However, because the 5th row was skipped, I ended up with two distinct capture groups.
The other problem is that some cells are sometimes skipped. One example is the 1st capture, which has missing data in its last 3 rows:
Sometimes the data skipping also happens when the table is not split.
Hi @BobLd,
Many thanks also from me for the helpful library!
I used the SpreadsheetExtractionAlgorithm to try to extract tables from a PDF file; a sample is attached.
The names in the table are intended to represent variable names for a computer program.
Unfortunately, the result is shown without the underscore:
I get "Content" instead of "C_ontent"
After c.SetTextElements(TextElement.MergeWords(page.GetText(c.BoundingBox))); is called in SpreadsheetExtractionAlgorithm.cs, this.textElements = textElements; is executed in public void SetTextElements(List<T> textElements){} in RectangularTextContainer.cs.
I found out that the order is exchanged when the values are assigned in this.textElements = textElements;:
List of values before assignment:
List of values after assignment:
I think the trailing part is cut off later.
Unfortunately, I haven't found a solution. I would be very grateful for any help.
Describe the bug
When attempting to extract tables from this 250+ page PDF, I found that it hangs on a specific page (98), in the 'Detect' method.
To Reproduce
Using 40927R03.pdf
I've tried with 0.1.3 and 0.1.4-alpha001, and it hangs in the same spot with both.
Using .NET 6.0, C#.
using var pdoc = PdfDocument.Open(content.Stream, new ParsingOptions { SkipMissingFonts = true, UseLenientParsing = true });
var da = new Tabula.Detectors.SimpleNurminenDetectionAlgorithm();
var area = Tabula.ObjectExtractor.ExtractPage(pdoc, 98 /* hangs on this page */);
var regions = da.Detect(area); // <-- this line hangs
Expected behavior
To properly parse all tables.
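Until the root cause is found, a caller could guard a call that may hang with a timeout so that the remaining pages still get processed. The sketch below is generic and does not use any tabula-sharp API; `TimeoutRunner` and `TryRun` are hypothetical names. It does not cancel the underlying work, which keeps running on a thread-pool thread after the timeout.

```csharp
using System;
using System.Threading.Tasks;

public static class TimeoutRunner
{
    // Runs `work` on a thread-pool thread and waits at most `timeout`.
    // Returns true and sets `result` if the work finished in time.
    // On timeout it returns false; the work itself is NOT cancelled.
    // If `work` throws, Wait rethrows it wrapped in an AggregateException.
    public static bool TryRun<T>(Func<T> work, TimeSpan timeout, out T result)
    {
        Task<T> task = Task.Run(work);
        if (task.Wait(timeout))
        {
            result = task.Result;
            return true;
        }
        result = default!;
        return false;
    }
}
```

With the names from the snippet above, a call might look like `TimeoutRunner.TryRun(() => da.Detect(area), TimeSpan.FromSeconds(30), out var regions)`, skipping the page when it returns false. The 30-second limit is an arbitrary value chosen for illustration.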
Hi @BobLd ,
Thanks for such an awesome library. In my pdf file, for some reason, two columns are merged (I attached a couple of images) when I'm trying to extract the table. I was wondering maybe you can help me to determine what might cause that issue. I can also send the pdf file via email since it has sensitive information. Thanks in advance.
Originally posted by @emrebiber in #13 (comment)
In the issue scenario, the first page of the PDF has the table headers, and the remaining pages contain the continuing table data without the header.
When extraction is performed on the remaining pages, it returns fewer columns whenever there is no data for some of them.
This makes the data invalid, and in these cases there is no way to identify which column is missing in order to fix the data.
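One way to approach this, sketched below, is to take the horizontal extents of the header columns from the first page and place each cell on later pages into the header column it overlaps most. This assumes that each extracted cell exposes its left and right coordinates and that the column positions stay the same across pages. `ColumnSpan`, `PositionedCell` and `ColumnAligner` are hypothetical stand-ins, not tabula-sharp types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins; NOT tabula-sharp types.
public readonly record struct ColumnSpan(double Left, double Right);
public readonly record struct PositionedCell(double Left, double Right, string Text);

public static class ColumnAligner
{
    // Places each cell in the header column it overlaps most horizontally.
    // Columns without a matching cell stay as empty strings, so every row
    // comes back with exactly columns.Count entries.
    public static string[] Align(IReadOnlyList<ColumnSpan> columns, IEnumerable<PositionedCell> cells)
    {
        string[] row = Enumerable.Repeat(string.Empty, columns.Count).ToArray();
        foreach (PositionedCell cell in cells)
        {
            int best = -1;
            double bestOverlap = 0;
            for (int i = 0; i < columns.Count; i++)
            {
                double overlap = Math.Min(cell.Right, columns[i].Right)
                               - Math.Max(cell.Left, columns[i].Left);
                if (overlap > bestOverlap)
                {
                    bestOverlap = overlap;
                    best = i;
                }
            }
            if (best >= 0)
            {
                // Join texts if two cells land in the same column.
                row[best] = row[best].Length == 0 ? cell.Text : row[best] + " " + cell.Text;
            }
        }
        return row;
    }
}
```

A row that was extracted with only two cells then still yields one entry per header column, with the missing column left empty, which makes it visible which column had no data.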