Comments (3)
Not sure.
If it's for tables: tables are not directly supported, but you can use Tabula Sharp or Camelot Sharp. As of 2023, Tabula Sharp is the most complete port.
It's worth checking out Tabula Sharp, as it tries to work out lines (for tables) and uses PdfPig under the covers. It might give you some inspiration.
Source: https://github.com/UglyToad/PdfPig/wiki
from pdfpig.
Thank you for your reply.
I just noticed that in the latest version (0.1.9-alpha-20240612-d2cae), the Line class appeared in PdfSubpath.
As I understand it, this is still experimental, but it seems to me that it could serve as the basis for a table extraction method. This is important to me, since neither Camelot nor Tabula (in either the C# or the Python version) suited me.
I have a set of rendered tables (usually 6 columns and many rows), but both of these tools in C# produce incorrect tables with extra columns that are not visible in the PDF.
Therefore, I want to develop my own solution to this problem.
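To make the line-based idea a bit more concrete, here is a minimal sketch of how straight segments could be turned into candidate row and column boundaries. It is only an illustration under assumptions: `Segment`, `RulingSketch` and the `tolerance` value are hypothetical names and do not come from PdfPig, and the segments are assumed to have already been read from the page's paths.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a straight path segment (not a PdfPig type).
public record Segment(double X1, double Y1, double X2, double Y2);

public static class RulingSketch
{
    // Sorts the values and keeps one only if it is more than `tolerance`
    // away from the last value kept, so near-duplicate rulings collapse.
    private static List<double> Cluster(IEnumerable<double> values, double tolerance)
    {
        var kept = new List<double>();
        foreach (var v in values.OrderBy(v => v))
        {
            if (kept.Count == 0 || v - kept[kept.Count - 1] > tolerance)
            {
                kept.Add(v);
            }
        }
        return kept;
    }

    // X positions of (near-)vertical segments: candidate column boundaries.
    public static List<double> ColumnBoundaries(IEnumerable<Segment> segments, double tolerance = 1.0)
    {
        var xs = segments
            .Where(s => Math.Abs(s.X1 - s.X2) <= tolerance && Math.Abs(s.Y1 - s.Y2) > tolerance)
            .Select(s => (s.X1 + s.X2) / 2d);
        return Cluster(xs, tolerance);
    }

    // Y positions of (near-)horizontal segments: candidate row boundaries.
    public static List<double> RowBoundaries(IEnumerable<Segment> segments, double tolerance = 1.0)
    {
        var ys = segments
            .Where(s => Math.Abs(s.Y1 - s.Y2) <= tolerance && Math.Abs(s.X1 - s.X2) > tolerance)
            .Select(s => (s.Y1 + s.Y2) / 2d);
        return Cluster(ys, tolerance);
    }
}
```

With N column boundaries you get at most N - 1 columns, so a six-column table would be expected to show seven distinct vertical positions. Whether this removes the extra columns depends on how the rulings are actually drawn in the file, which this sketch does not address.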
from pdfpig.
I think I had a similar problem. Below is my solution, which uses the cells' bounding boxes to recreate the table while ignoring blank cells. This solution doesn't use the lines.
Please let me know what you end up doing. I would be curious :)
It may also be worth raising a PR to Tabula Sharp or Camelot Sharp with your solution.
```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.CodeAnalysis;
using System.Linq;
using Tabula;
using Tabula.Extractors;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.Geometry;

public class TableExtractor
{
    private readonly ObjectExtractor _objectExtractor;
    private readonly IExtractionAlgorithm _tableExtractionAlgo;

    public TableExtractor(PdfDocument document)
    {
        _objectExtractor = new ObjectExtractor(document);
        _tableExtractionAlgo = new SpreadsheetExtractionAlgorithm();
    }

    // Returns one grid (rows of cell text) per table detected on the page.
    public List<List<List<string>>> ExtractTables(int pageNo)
    {
        var pageArea = _objectExtractor.Extract(pageNo);
        var tables = _tableExtractionAlgo.Extract(pageArea);
        var result = new List<List<List<string>>>();
        foreach (var table in tables)
        {
            var cells = table.Cells.Select(x => new CellBlockWrap(x));
            result.Add(ExtractTable(cells));
        }
        return result;
    }

    public static List<List<string>> ExtractTable(IEnumerable<CellBlockWrap> orderedCells)
    {
        // Drop blank cells, then drop cells that repeat the same text in an overlapping box.
        var cellsWithText = orderedCells.Where(x => !string.IsNullOrEmpty(x.Text)).ToList();
        var cellsNoDuplicates = cellsWithText.Distinct(new CellBlockWrapComparer()).ToList();
        if (cellsNoDuplicates.Count <= 1)
        {
            return new List<List<string>>();
        }
        var rows = ConstructRows(cellsNoDuplicates);
        var result = SortIntoColumns(rows);
        return result;
    }

    // Cells are given in an ordered manner. We recreate the rows by processing
    // them in order and starting a new row whenever a cell is not on the same line.
    private static List<List<CellBlockWrap>> ConstructRows(List<CellBlockWrap> cellsNoDuplicates)
    {
        var lastRow = new List<CellBlockWrap>();
        var rows = new List<List<CellBlockWrap>> { lastRow };
        foreach (var cell in cellsNoDuplicates)
        {
            var lastCellInRow = lastRow.LastOrDefault();
            // Base case: the very first cell
            if (lastCellInRow == null)
            {
                lastRow.Add(cell);
                continue;
            }
            if (IsOnSameLine(lastCellInRow.BoundingBox, cell.BoundingBox))
            {
                lastRow.Add(cell);
            }
            else
            {
                lastRow = new List<CellBlockWrap>() { cell };
                rows.Add(lastRow);
            }
        }
        return rows;
    }

    // Sort out columns.
    // In each round we take the first remaining cell of every row and check that it
    // occupies a similar area to the left-most of those cells.
    // If not, the cell stays in the pool and the row gets a blank for this column.
    private static List<List<string>> SortIntoColumns(List<List<CellBlockWrap>> rows)
    {
        var result = CreateEmptyWithRows(rows.Count);
        while (rows.Any(x => x.Count > 0))
        {
            var firstColumnCells = rows.Select(x => x.FirstOrDefault()).ToList();
            var colGuide = LeftMost(firstColumnCells);
            for (int rowIdx = 0; rowIdx < rows.Count; rowIdx++)
            {
                var candidateForRow = firstColumnCells[rowIdx];
                if (candidateForRow != null && IsOnSameColumnAs(candidateForRow.BoundingBox, colGuide.BoundingBox))
                {
                    result[rowIdx].Add(candidateForRow.Text);
                    rows[rowIdx].RemoveAt(0);
                }
                else
                {
                    // We do not remove the candidate for this row. It'll try in the next round
                    result[rowIdx].Add("");
                }
            }
        }
        return result;
    }

    // LeftMost was not shown in the post; this is an assumed implementation
    // that skips empty rows (null entries) and picks the smallest Left edge.
    private static CellBlockWrap LeftMost(IEnumerable<CellBlockWrap> cells)
    {
        return cells.Where(x => x != null).OrderBy(x => x.BoundingBox.Left).First();
    }

    private static List<List<string>> CreateEmptyWithRows(int rowCount)
    {
        var result = new List<List<string>>();
        for (int i = 0; i < rowCount; i++)
        {
            result.Add(new List<string>());
        }
        return result;
    }

    // Plain static helpers: extension methods (with `this`) are only allowed
    // in a top-level static class, not inside TableExtractor.
    public static bool IsOnSameLine(PdfRectangle first, PdfRectangle second)
    {
        if (first.Rotation != 0d || second.Rotation != 0d)
        {
            throw new ArgumentException("Pdf bounding boxes are rotated");
        }
        // Same line if the vertical centres differ by less than half the taller box.
        var bound = Math.Max(first.Height, second.Height) / 2d;
        return Math.Abs(first.Centroid.Y - second.Centroid.Y) < bound;
    }

    public static bool IsOnSameColumnAs(PdfRectangle first, PdfRectangle second)
    {
        if (first.Rotation != 0d || second.Rotation != 0d)
        {
            throw new ArgumentException("Pdf bounding boxes are rotated");
        }
        // Same column if the horizontal centres differ by less than half the wider box.
        var bound = Math.Max(first.Width, second.Width) / 2d;
        return Math.Abs(first.Centroid.X - second.Centroid.X) < bound;
    }

    public class CellBlockWrap : IBoundingBox
    {
        public CellBlockWrap(string text, PdfRectangle pdfRectangle)
        {
            BoundingBox = pdfRectangle;
            Text = text;
        }

        public CellBlockWrap(Tabula.Cell cell)
        {
            BoundingBox = cell.BoundingBox;
            Text = cell.GetText();
        }

        public PdfRectangle BoundingBox { get; set; }
        public string Text { get; set; }
    }

    // Two cells are duplicates if they carry the same text and their boxes overlap
    // (one contains the other, or one contains the other's centroid).
    private class CellBlockWrapComparer : IEqualityComparer<CellBlockWrap>
    {
        public bool Equals(CellBlockWrap first, CellBlockWrap second)
        {
            return first.Text == second.Text
                && (first.BoundingBox.Contains(second.BoundingBox)
                    || second.BoundingBox.Contains(first.BoundingBox)
                    || first.BoundingBox.Contains(second.BoundingBox.Centroid)
                    || second.BoundingBox.Contains(first.BoundingBox.Centroid));
        }

        public int GetHashCode([DisallowNull] CellBlockWrap obj)
        {
            return obj.Text.GetHashCode();
        }
    }
}
```
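To see what the row and column passes do without needing a PDF, here is a stripped-down sketch of the same centroid test on plain boxes. `Box` and `GridSketch` are hypothetical names used only for this illustration; they are not PdfPig or Tabula types, and the sketch assumes unrotated boxes with positive width that arrive in reading order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a table cell: text plus an axis-aligned box.
public record Box(string Text, double Left, double Bottom, double Right, double Top)
{
    public double CenterX => (Left + Right) / 2d;
    public double CenterY => (Bottom + Top) / 2d;
    public double Width => Right - Left;
    public double Height => Top - Bottom;
}

public static class GridSketch
{
    // Centres closer than half the larger height count as the same line.
    private static bool SameLine(Box a, Box b) =>
        Math.Abs(a.CenterY - b.CenterY) < Math.Max(a.Height, b.Height) / 2d;

    // Centres closer than half the larger width count as the same column.
    private static bool SameColumn(Box a, Box b) =>
        Math.Abs(a.CenterX - b.CenterX) < Math.Max(a.Width, b.Width) / 2d;

    public static List<List<string>> ToGrid(IEnumerable<Box> orderedCells)
    {
        // Pass 1: start a new row whenever a cell is not on the previous cell's line.
        var rows = new List<List<Box>>();
        foreach (var cell in orderedCells)
        {
            var current = rows.LastOrDefault();
            if (current != null && SameLine(current[current.Count - 1], cell))
            {
                current.Add(cell);
            }
            else
            {
                rows.Add(new List<Box> { cell });
            }
        }

        // Pass 2: per round, the left-most remaining cell defines the column;
        // rows whose next cell does not line up with it get a blank.
        var grid = rows.Select(_ => new List<string>()).ToList();
        while (rows.Any(r => r.Count > 0))
        {
            var guide = rows.Where(r => r.Count > 0).Select(r => r[0]).OrderBy(b => b.Left).First();
            for (int i = 0; i < rows.Count; i++)
            {
                if (rows[i].Count > 0 && SameColumn(rows[i][0], guide))
                {
                    grid[i].Add(rows[i][0].Text);
                    rows[i].RemoveAt(0);
                }
                else
                {
                    grid[i].Add("");
                }
            }
        }
        return grid;
    }
}
```

For example, feeding it "A" at x 0..10 and "B" at x 10..20 on the top line (y 10..20), followed by a lone "D" at x 10..20 on the line below (y 0..10), gives the rows ["A", "B"] and ["", "D"]: the second row is padded with a blank because "D" does not line up with "A".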
from pdfpig.