Comments (9)
I'm not the maintainer of this repo -- just watching this because I find this interesting, and may start building some kind of viewer based on this and Avalonia. :)
from mupdfcore.
Hi! Yes, that is correct, MuPDF has a way of returning the text in the page, but there is currently no managed API in MuPDFCore to access that information.
TL;DR: implementing this requires a good amount of work; I will try to look at it when I have time, but I have no idea when that will be.
Implementing such an API would most likely not be exactly trivial:
- First you have to run the text through a "structured text device" to get a `fz_stext_page`.
- Then, you need to extract a list of `fz_stext_block`s from the `fz_stext_page`.
- Each `fz_stext_block` has a bounding box and contains a single image or a list of lines.
  - If the block contains a list of lines, these should be extracted as `fz_stext_line`s.
    - Each `fz_stext_line` has a bounding box, a direction (which is useful e.g. if the text has been rotated and is not horizontal), and contains a list of `fz_stext_char`s.
    - Each `fz_stext_char` has a Unicode code point, a colour, an origin (i.e. the point at which the start of the glyph's baseline is located), a `quad` (which is like a bounding box - except that its sides are not necessarily parallel to the x/y axes), a size and a font.
  - If the block contains an image, this is also associated with a matrix transform (probably used to position/rotate the image appropriately).
On the C# side, you would have to define a "Block" interface/abstract class, with two implementations ("TextBlock" for blocks containing lists of lines, and "ImageBlock", for blocks containing images). The "Block" interface would define a bounding box property.
A "TextBlock" would contain a list of "TextLine"s, whose properties would be a bounding box, a direction, a string representing the content of the line and possibly a few arrays of attributes of the glyphs (e.g. a list of colours, origins etc). The font structure would probably be too complex to pass it to C# in a useful way without additional libraries. Converting between a Unicode code point and a C# `char` will probably be fun.
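To illustrate why the code point conversion is not a plain cast: a C# `char` is a single UTF-16 code unit, so any code point above U+FFFF has to be split into a surrogate pair. Here is a minimal sketch of that arithmetic, written in Python for brevity; the function name `to_utf16_units` is made up for this example. On the C# side, `char.ConvertFromUtf32` should do the equivalent.

```python
from typing import List


def to_utf16_units(code_point: int) -> List[int]:
    """Split a Unicode code point into one or two UTF-16 code units."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not valid code points on their own")
    if code_point < 0x10000:
        # Basic Multilingual Plane: fits in a single 16-bit unit.
        return [code_point]
    # Supplementary planes: 20 bits spread over a high and a low surrogate.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]
```

This means that the number of `char`s in a line's string is not necessarily the number of glyphs, so any per-glyph array (colours, origins etc.) cannot simply be indexed by string position.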
An "ImageBlock" would have the transform matrix as an additional property, and could contain a raw binary representation of the image data. If this is the case, somewhere in the extraction process there must be a flag to avoid collecting this data if not necessary, to avoid the associated waste in memory and time.
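The object model described above could look roughly like this. It is only a sketch, in Python rather than C# to keep it short, and every name and field here is hypothetical (this is not an existing MuPDFCore API). In C#, `Block` would be the interface or abstract class.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Point = Tuple[float, float]


@dataclass
class TextLine:
    bbox: Rect
    direction: Point                 # e.g. (1, 0) for horizontal text
    text: str
    origins: List[Point] = field(default_factory=list)  # one per glyph


@dataclass
class Block:
    bbox: Rect                       # common to every kind of block


@dataclass
class TextBlock(Block):
    lines: List[TextLine] = field(default_factory=list)


@dataclass
class ImageBlock(Block):
    # Six-number affine matrix (a, b, c, d, e, f); identity by default.
    transform: Tuple[float, ...] = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
    # Raw image bytes, left as None unless the caller asked for them.
    data: Optional[bytes] = None
```

Leaving `data` as `None` by default is one way to express the "do not collect the image unless necessary" flag: the extraction code would only fill it in on request.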
Once all the relevant stuff has been passed to managed objects and the raw pointers are not needed, the unmanaged code should free all resources that it has allocated (e.g. the device, text page etc.), which means that some pointers will need to be passed back and forth from managed to unmanaged code to keep track of the references.
All in all this is an interesting problem, and it would probably not be impossible, but it does require a fair amount of work (also to make sure that exceptions are handled correctly, there are no memory leaks etc). I will see if I can have a look when I have time, but I don't know when that will happen.
However, extracting the text and bounding boxes from the PDF is only half the work: once you have those, you need to figure out what the user selected, based e.g. on the point where they clicked at the beginning and the current position of the mouse (if they are dragging the selection).
A helper method to figure out which glyph (of which line of which block) a certain point corresponds to should be easy to write, but once you get the "start glyph" and "end glyph", you need to decide which glyphs are "between" those two... That is easy if the start and end are both on the same line, but it gets tricky if they are on different lines or different blocks (especially if you have text that flows in multiple directions like left-to-right, right-to-left, vertical, rotated by 45 degrees etc).
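A rough sketch of the easy part, in Python. It assumes (as a simplification) that every glyph has an axis-aligned rectangle rather than a quad, and that a page is just a nested list of blocks, lines and glyph rectangles; `hit_test` and `glyphs_between` are names invented for this example. "Between" is taken to mean "between the two glyphs in the order in which blocks, lines and glyphs are stored", which is exactly the assumption that breaks down for text flowing in unusual directions.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Address = Tuple[int, int, int]            # (block, line, glyph) indices
Page = List[List[List[Rect]]]             # blocks -> lines -> glyph rects


def hit_test(page: Page, x: float, y: float) -> Optional[Address]:
    """Return the address of the first glyph whose rectangle contains (x, y)."""
    for b, block in enumerate(page):
        for l, line in enumerate(block):
            for g, (x0, y0, x1, y1) in enumerate(line):
                if x0 <= x <= x1 and y0 <= y <= y1:
                    return (b, l, g)
    return None


def glyphs_between(page: Page, start: Address, end: Address) -> List[Address]:
    """All glyph addresses from start to end (inclusive) in stored order."""
    lo, hi = sorted((start, end))  # allow dragging "backwards"
    return [
        (b, l, g)
        for b, block in enumerate(page)
        for l, line in enumerate(block)
        for g in range(len(line))
        if lo <= (b, l, g) <= hi   # tuple comparison follows stored order
    ]
```

The start would come from `hit_test` at the point where the user pressed the mouse button, and the end from `hit_test` at the current pointer position.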
Then, you need to figure out how to show the selection: you could highlight the text by painting a semi-transparent rectangle in front of it (like SumatraPDF does), but you need to decide the correct shape of the rectangle, as different glyphs have different sizes... You could start by drawing a separate rectangle for each glyph, but that would be ugly (and probably slow for large amounts of text); otherwise, you could draw the smallest rectangle that contains all the glyphs in one line, but you need some non-trivial maths to take care of lines with arbitrarily rotated text. Then, you also need a way to "join" overlapping rectangles to avoid the overlap being painted twice - and this is also annoying because the union of two rectangles is not necessarily a rectangle.
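For the simplest case only (horizontal text, axis-aligned glyph rectangles), the "one rectangle per line" option could be sketched like this in Python. The function names are made up for the example, and it deliberately does not handle rotated text or join overlapping rectangles, which are the hard parts mentioned above.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def bounding_rect(rects: List[Rect]) -> Rect:
    """Smallest axis-aligned rectangle containing all the given rectangles."""
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


def highlight_rects(selected_lines: List[List[Rect]]) -> List[Rect]:
    """One highlight rectangle per line that has at least one selected glyph."""
    return [bounding_rect(line) for line in selected_lines if line]
```

Here `selected_lines` is assumed to hold, for each line, the rectangles of the selected glyphs in that line (an empty list if none are selected).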
For example, look at this screenshot from SumatraPDF:
The word "Acidobacteria" is actually split over six lines (note the number of rectangles that make up the selection shape) and if you try to copy and paste it you get:
Ac
id
ob
ac
te
ria
Quite interestingly, Adobe Reader actually manages to get the copy-paste right, although the way it highlights every glyph seems weird:
All of this boils down to the fact that the PDF format does not have any notion of a "body of text", because "text" in a PDF is nothing more than a series of individually positioned and painted glyphs. MuPDF (as any other PDF library) uses some heuristics to try and get this right, and these work acceptably well in the most "vanilla" cases, but you cannot rely on them too much in general. Also, I have never been exposed to documents written in anything other than Latin script, but I imagine these issues would be even worse if you are dealing with Middle-Eastern and Asian languages that do not use a simple left-to-right, top-to-bottom layout...
Then, you need to figure out how to show the selection: you could highlight the text by painting a semi-transparent rectangle in front of it (like SumatraPDF does), but you need to decide the correct shape of the rectangle, as different glyphs have different sizes... You could start by drawing a separate rectangle for each glyph, but that would be ugly (and probably slow for large amounts of text); otherwise, you could draw the smallest rectangle that contains all the glyphs in one line, but you need some non-trivial maths to take care of lines with arbitrarily rotated text. Then, you also need a way to "join" overlapping rectangles to avoid the overlap being painted twice - and this is also annoying because the union of two rectangles is not necessarily a rectangle.
I guess for PDF and HTML alike there's one straightforward way of implementing selection -- just rely on the order of the text block in the serialized representation. Consider a DOM:
<html><body>
<div> <p> Paragraph 1 </p> <div> <p> Paragraph 2 </p> <div> <p> Paragraph 3 </p> </div> </div> </div>
<div> <p> Paragraph 4 </p> </div>
</body></html>
If the hit test says the range spans from paragraph 2 to paragraph 4, then paragraph 3 is included.
The textual selection would be the concatenation of these blocks, and the visual representation would be the union of the bounding boxes (maybe relaxed a little bit to allow easier merging).
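As a small sketch of the idea in Python: the DOM above is modelled as nested lists with the paragraphs as leaves, and `flatten` and `select` are names made up for this example. The selection is simply a slice of the leaves in serialized (depth-first) order.

```python
from typing import Iterator, List, Union

Node = Union[str, List["Node"]]

# The DOM from the example: nested <div>s, with paragraphs as leaves.
dom: Node = [
    ["Paragraph 1", ["Paragraph 2", ["Paragraph 3"]]],
    ["Paragraph 4"],
]


def flatten(node: Node) -> Iterator[str]:
    """Yield the leaves in the order they appear in the serialized tree."""
    if isinstance(node, str):
        yield node
    else:
        for child in node:
            yield from flatten(child)


def select(root: Node, first: str, last: str) -> List[str]:
    """Every leaf from `first` to `last` (inclusive) in serialized order."""
    leaves = list(flatten(root))
    i, j = sorted((leaves.index(first), leaves.index(last)))
    return leaves[i:j + 1]
```

With this, a range from "Paragraph 2" to "Paragraph 4" also picks up "Paragraph 3", regardless of how deeply it is nested.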
The good thing is that this implementation is very straightforward. The bad thing is, this is unfortunately why sometimes text selection doesn't work that well, and selects something far away for no apparent reason.
I guess mupdf has access to the text -- a list of text objects anchored to the pages.
Thanks, I have some ideas now, and it seems you didn't provide an API to get text and its boundary information?
Got it! Thanks for your detailed reply!
Yes, that would probably be a sensible way to deal with it.
The problem is always that the text in a PDF does not necessarily have to appear in the "source" in the same order as it appears in the finished document (and actually the same is also true with an HTML page: for example, you could use CSS styles to move paragraph 4 above paragraph 3 or hide paragraph 2).
However, MuPDF maintains that the blocks it returns should be in "natural reading order", hence I expect that it should still be possible to obtain reasonable results with this approach - at least in simple cases.
Ok, I think I have managed to get a reasonably decent implementation. v1.2.0 now supports generating a `MuPDFStructuredTextPage` containing structured text information (with support for hit-testing, searching, and delimiting text regions). The `MuPDFRenderer` now also does text selection and searching. Let me know what you think!
The order of the text is the same as what MuPDF returns, which is apparently (according to a comment in the source code) "the order in which text appears in the file, so may not be accurate". At least, it appears to be the same as SumatraPDF, the Chrome PDF viewer, and Adobe Reader.
I have seen people somewhere suggesting to sort blocks/lines/words/glyphs top-to-bottom and left-to-right (e.g. https://github.com/pymupdf/PyMuPDF/wiki/How-to-extract-text-in-natural-reading-order-(up2down,-left2right) or https://www.tallcomponents.com/pdfkit4/extract-glyphs-from-pdf-and-sort), but this clearly does not work when there are multiple columns or, worse, a single page has both a full-width section as well as a section with columns (e.g. the first page in many scientific papers).
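A toy example of why that sorting fails with columns, in Python. The lines are made-up `(x, y, text)` tuples for a two-column page, and `naive_reading_order` is just a name for the top-to-bottom, left-to-right sort:

```python
from typing import List, Tuple

Line = Tuple[float, float, str]  # (x, y, text), with y growing downwards

# A two-column page: the left column should be read in full
# before moving on to the right column.
two_columns: List[Line] = [
    (0, 0, "left 1"),
    (0, 10, "left 2"),
    (100, 0, "right 1"),
    (100, 10, "right 2"),
]


def naive_reading_order(lines: List[Line]) -> List[str]:
    """Sort top-to-bottom first, then left-to-right."""
    return [text for _, _, text in sorted(lines, key=lambda l: (l[1], l[0]))]


# The sort interleaves the two columns instead of finishing one first:
# ['left 1', 'right 1', 'left 2', 'right 2']
print(naive_reading_order(two_columns))
```

Fixing this would need some kind of column detection before sorting, which is where the heuristics come in.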
I think it would be an interesting problem to get an AI involved with, but I assume that if neither the developers of MuPDF, nor those at Google, nor those at Adobe have managed to get around this issue, it is certainly way out of my league.
COOL!