Hello, Any plans for implementing more of the API such as for gettin

Extract and set metadata? (go-fitz issue, closed)

gen2brain commented on July 17, 2024

Extract and set metadata?

from go-fitz.

Comments (11)

gen2brain commented on July 17, 2024

Hello,

ToC() and Metadata() are added in c281116 .
No plans to add functions for editing.

gennaios commented on July 17, 2024

Thank you very much. I will test it out soon. If I need editing, I'll try looking at your code and maybe adding it.

gen2brain commented on July 17, 2024

No problem. PRs welcome.

gennaios commented on July 17, 2024

Tested the TOC. Works great. Looking at the code for PyMuPDF, 1 is added to all pages where the page isn't equal to -1. Seems correct. Looks like MuPDF returns page numbers with an index of 0, meaning 0 is physical page 1.

For pages with a value of -1, it seems PDFs store additional bookmark info. I have at least one problematic PDF whose TOC most apps have trouble reading. Jpfbookmarks, an open-source Java tool, reads the entries correctly by processing some additional info, but even Acrobat doesn't, so perhaps it's not worth trying to process them, since other PDF readers such as macOS Preview also fail to link them properly. Unsure about that.
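
The page-number adjustment described above could be sketched like this. It is only an illustration: the Outline type here is a hypothetical stand-in with just the fields needed, not the go-fitz type, and DisplayPage is a made-up helper name.

```go
package main

import "fmt"

// Outline is a hypothetical stand-in for a ToC entry (not the go-fitz type).
type Outline struct {
	Level int
	Title string
	Page  int // 0-based index as MuPDF reports it, or -1 if unresolved
}

// DisplayPage converts a 0-based page index to a 1-based page number.
// Unresolved entries (negative values such as -1) are returned unchanged.
func DisplayPage(page int) int {
	if page < 0 {
		return page
	}
	return page + 1
}

func main() {
	toc := []Outline{
		{Level: 1, Title: "Cover", Page: 0},
		{Level: 1, Title: "Chapter 1", Page: 4},
		{Level: 2, Title: "Unresolved link", Page: -1},
	}
	for _, o := range toc {
		fmt.Printf("%s -> %d\n", o.Title, DisplayPage(o.Page))
	}
}
```

Keeping -1 as-is lets a caller still tell "no page found" apart from a real page.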

gennaios commented on July 17, 2024

FYI, I'm not too familiar with Golang, though some note that ReadAll might not be the best way to read in files. I'm curious about performance with large PDFs, in the hundreds of MBs. I'm also not familiar with the PDF spec. My question: if I am processing a lot of PDFs to read only their metadata, for example to create an external catalog, is the PDF file structure such that one could seek directly to the relevant part of the file and read that data without reading in the whole file? I'm unsure whether that is possible, or whether MuPDF itself supports it if so. Will look into it later.

gen2brain commented on July 17, 2024

ReadAll is used only when you call NewFromMemory or NewFromReader. The default, New, opens the document from the file path, and if you use New and just read the ToC, it knows what to do and will read just that. At least that is how I imagine it works; I don't know all the internals, only what I have noticed so far from the MuPDF API.

gennaios commented on July 17, 2024

Thank you for the explanation. Indeed, I was using New(). I just started using Go and this module, so I am not too familiar with the API.

About the problematic TOC entries in a sample PDF: cpdf outputs all entries, showing the problematic ones as page 0 (-1 in the original item dictionary, as MuPDF reports), and then at the end prints, for those entries:

Warning: Could not read destination G [null/XYZ 0 276 0]
Warning: Could not read destination D a.indd:1509617304794_12
…

That suggests that the outline dictionary may contain additional entries of this kind. Each one ending in "…_X", such as the 2nd (1509617304794_12), points to the correct page (+1). Looking at the PDF spec, I'm not sure whether those correspond to one of the following possible keys or to something else:

A: (Optional; PDF 1.1; shall not be present if a Dest entry is present) The action that shall be performed when this item is activated.

SE: (Optional; PDF 1.3; shall be an indirect reference) The structure element to which the item refers (see 14.7.2, “Structure Hierarchy”). (PDF 1.0) An item may also specify a destination (Dest) corresponding to an area of a page where the contents of the designated structure element are displayed.

I haven't looked at the MuPDF code to see whether this is exposed. It could be that the problematic PDF does not follow the spec and it's not worth trying to parse additional links to find the correct page. Up to you; I might look more into it in the future.

FYI, I hadn't noticed that go-fitz and MuPDF also process EPUBs. I also needed to catalog EPUB outlines, and that works great too. Very happy it's there.

I haven't tested the author/title/etc. metadata; I probably will in a few weeks. If you're confident it works, no problem. For the ToC, I think that once 1 is added to each page that is not equal to -1, everything is there and you can close the issue.

A question: is it perhaps worth eventually making the go-fitz API match the MuPDF API more closely, e.g., loading the outline as LoadOutline? As the go-fitz API grows, that might help others already familiar with MuPDF learn go-fitz more easily. That is also your decision to make; perhaps you have already decided not to follow it.

gennaios commented on July 17, 2024

Testing with an EPUB 3.0, and displaying output on the console, I get "warning: unknown epub version: 3.0". Perhaps MuPDF reads only the outlines in the EPUB 2.0 .NCX (usually included for backwards compatibility) and ignores the 3.0 NAV document. No problem with that. At first glance, I'm not sure where the warning text comes from. Does go-fitz catch that? So far, I don't see it being reported by New() or ToC(). Maybe it's some console warning from MuPDF. Not sure. It'd be nice to be able to catch that from wherever MuPDF seems to report it.

gen2brain commented on July 17, 2024

When I started go-fitz, the idea was not to replicate every function of the MuPDF API and create bindings for the library, but just to write a wrapper that gets an image.Image from a PDF. Later it was expanded to include more. I am not opposed to the idea; I just don't have the time, and what I needed is already included.

MuPDF uses something like fz_try/fz_catch for errors. I don't know how that will work with CGO/Go, or whether it needs a lot of work to implement. It is not something I am used to, and I usually don't see that kind of error handling in C libs.

gennaios commented on July 17, 2024

Ok.

As for other things mentioned, I may look into it in the future as needed. For me, getting metadata is the main thing I needed.

I think it's a good idea to add 1 to the page number when it isn't equal to -1. Unless you prefer the page number to be as MuPDF reports, starting at 0 for page 1. Either way, then we can consider this issue resolved if you like.

As for naming methods to be more consistent with the MuPDF API: others may want to extend go-fitz in the future, so it may be a good idea as the API grows and others contribute someday.

gennaios commented on July 17, 2024

Found this about warnings from CGO; it may help:

https://groups.google.com/forum/#!topic/golang-nuts/mM9PeFsDfQ8

I'm not that familiar with cgo and MuPDF. I have an issue where, when retrieving metadata from a PDF and trying to insert it into SQLite, I get an error: "unrecognized token: "'Title" …. Something is inserting a single quote at the beginning of the value and I'm not sure why. Perhaps you have some idea? I'm unsure whether it's encoding or something else. Looking into it.
