Extract and set metadata? (go-fitz issue, closed)

gen2brain commented on July 17, 2024
Extract and set metadata?

from go-fitz.

Comments (11)

gen2brain commented on July 17, 2024

Hello,

ToC() and Metadata() were added in c281116.
No plans to add functions for editing.

gennaios commented on July 17, 2024

Thank you very much. I will test it out soon. If I need editing, I'll try looking at your code and maybe adding it.

gen2brain commented on July 17, 2024

No problem. PRs welcome.

gennaios commented on July 17, 2024

Tested the TOC. Works great. Looking at the PyMuPDF code, it adds 1 to every page number that isn't equal to -1, which seems correct. It looks like MuPDF returns zero-based page numbers, meaning 0 is physical page 1.

For pages with a value of -1, it seems PDFs store additional bookmark info. I have at least one problematic PDF whose TOC most apps have trouble reading. JPdfBookmarks, an open source Java tool, reads the entries correctly by processing some additional info, but even Acrobat doesn't, so perhaps it's not worth trying to process them, since other PDF readers such as macOS Preview also do not link them properly. Unsure about that.
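
The page-number adjustment described above can be sketched in Go roughly as follows. This is only an illustration of the convention discussed here (zero-based page indexes, with -1 marking an entry whose destination could not be resolved). The outlineEntry type and the displayPage helper are hypothetical names made up for this sketch and are not part of go-fitz or MuPDF.

```go
package main

import "fmt"

// outlineEntry is a hypothetical stand-in for a ToC entry; it is not the
// go-fitz type, and only the fields needed for this sketch are included.
type outlineEntry struct {
	Title string
	Page  int // zero-based page index, or -1 when the destination is unresolved
}

// displayPage converts a zero-based page index to a one-based page number.
// The sentinel -1 ("no resolvable page") is passed through unchanged.
func displayPage(page int) int {
	if page == -1 {
		return page
	}
	return page + 1
}

func main() {
	toc := []outlineEntry{
		{Title: "Cover", Page: 0},
		{Title: "Chapter 1", Page: 4},
		{Title: "Unresolved link", Page: -1},
	}
	for _, e := range toc {
		fmt.Printf("%s -> %d\n", e.Title, displayPage(e.Page))
	}
}
```

Keeping -1 as-is lets a caller still tell unresolved entries apart from real pages after the conversion.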

gennaios commented on July 17, 2024

FYI, I'm not too familiar with Golang, though some note that ReadAll might not be the best way to read in files. I'm curious about performance with large PDFs in the hundreds of MBs. I'm also not familiar with the PDF spec. My question: if I am processing a lot of PDFs to read only metadata, such as when creating an external catalog, does the PDF file structure store some info that lets one seek directly to the relevant part of the file and read that data without reading in the whole file? I'm unsure whether that is possible, or whether MuPDF itself supports it if it is. Will look into it later.
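
For context on the seeking question: per the PDF specification, a PDF file ends with a startxref keyword, the byte offset of the last cross-reference section, and an %%EOF marker, and the trailer reached from there can reference the Info dictionary. The following is a minimal Go sketch, using only the standard library, of reading just the tail of a file and extracting that offset. It is not how MuPDF or go-fitz read documents, it stops at locating the offset (it does not parse the cross-reference data or the metadata itself), and readTail and findStartXref are hypothetical helper names.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"os"
	"strconv"
)

// readTail returns up to n bytes from the end of the file at path,
// without reading the rest of the file.
func readTail(path string, n int64) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return nil, err
	}
	size := info.Size()
	if n > size {
		n = size
	}
	buf := make([]byte, n)
	// ReadAt reads at an absolute offset; io.EOF at the end of file is fine.
	if _, err := f.ReadAt(buf, size-n); err != nil && !errors.Is(err, io.EOF) {
		return nil, err
	}
	return buf, nil
}

// findStartXref scans a PDF tail for the last "startxref" keyword and
// returns the byte offset written after it.
func findStartXref(tail []byte) (int64, error) {
	const keyword = "startxref"
	i := bytes.LastIndex(tail, []byte(keyword))
	if i < 0 {
		return 0, errors.New("startxref keyword not found")
	}
	fields := bytes.Fields(tail[i+len(keyword):])
	if len(fields) == 0 {
		return 0, errors.New("no offset after startxref")
	}
	return strconv.ParseInt(string(fields[0]), 10, 64)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pass the path of a PDF file as the first argument")
		return
	}
	tail, err := readTail(os.Args[1], 1024)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	off, err := findStartXref(tail)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("cross-reference data starts at byte", off)
}
```

Only the last 1024 bytes are read here, regardless of file size. The 1024-byte window is an arbitrary choice for this sketch.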

gen2brain commented on July 17, 2024

ReadAll is used only if you use NewFromMemory or NewFromReader. The default with New is to open the document from the file path, and if you use New and just read the ToC, it knows what to do and will read just that. At least that is how I imagine it works; I don't know all the internals, just what I have noticed so far from the MuPDF API.

gennaios commented on July 17, 2024

Thank you for the explanation. Indeed, I was using New(). I just started using Go and this module, so I am not too familiar with the API.

About the problematic TOC entries in a sample PDF: cpdf outputs all entries, showing the problematic ones as page 0 (-1 in the original item dictionary, as MuPDF reports), and then prints these warnings at the end for those entries:

Warning: Could not read destination G [null/XYZ 0 276 0]
Warning: Could not read destination D a.indd:1509617304794_12

That suggests that the outline dictionary perhaps contains additional entries of this kind. The trailing "…_X" of each one, such as in the second (1509617304794_12), points to the correct page (+1). Looking at the PDF spec, I'm not sure whether those correspond to the possible key A ("(Optional; PDF 1.1; shall not be present if a Dest entry is present) The action that shall be performed when this item is activated."), to the key SE ("(Optional; PDF 1.3; shall be an indirect reference) The structure element to which the item refers (see 14.7.2, “Structure Hierarchy”). (PDF 1.0) An item may also specify a destination (Dest) corresponding to an area of a page where the contents of the designated structure element are displayed."), or to something else.

I haven't looked at the MuPDF code to see if this is exposed. It could be that the problematic PDF does not follow the spec and it's not worth trying to parse additional links to find the correct page. Up to you; I might look more into it in the future.

FYI, I hadn't noticed that go-fitz and MuPDF also process EPUBs. I also needed to catalog EPUB outlines, and that works great too. Very happy it's there.

I haven't tested the author/title/etc. metadata; I probably will in a few weeks. If you're confident it works, no problem. I think that for the ToC, once 1 is added to each page that is not equal to -1, everything seems to be there, and you can close the issue.

A question: is it perhaps worth eventually making the go-fitz API match the MuPDF API more closely, e.g., loading the outline as LoadOutline? As the go-fitz API grows, that might help others already familiar with MuPDF learn go-fitz more easily. That's also a decision you can make; perhaps you have already decided not to follow it.

gennaios commented on July 17, 2024

Testing with an EPUB 3.0, and displaying output on the console, I get "warning: unknown epub version: 3.0". Perhaps MuPDF reads only the outlines in the EPUB 2.0 .NCX (usually included for backwards compatibility) and ignores the 3.0 NAV document. No problem with that. At first glance, I'm not sure where the warning text comes from. Does go-fitz catch that? So far, I don't see it being reported by New() or ToC(). Maybe it's some console warning from MuPDF. Not sure. It'd be nice to be able to catch that from wherever MuPDF seems to report it.

gen2brain commented on July 17, 2024

When I started go-fitz, the idea was not to replicate every function of the MuPDF API and create bindings for the library, but just to write a wrapper that gets an image.Image from a PDF. Later it was expanded to include more. Not that I am opposed to the idea; I just don't have the time, and what I needed is already included.

MuPDF uses something like fz_try/fz_catch for errors. I don't know how that would work with CGO/Go or whether it would need a lot of work to implement. It is not something I am used to, and I usually don't see errors handled that way in C libs.

gennaios commented on July 17, 2024

Ok.

As for other things mentioned, I may look into it in the future as needed. For me, getting metadata is the main thing I needed.

I think it's a good idea to add 1 to the page number when it isn't equal to -1. Unless you prefer the page number to be as MuPDF reports, starting at 0 for page 1. Either way, then we can consider this issue resolved if you like.

As for naming methods more consistently with the MuPDF API: others may want to extend go-fitz features in the future, so it may be a good idea as the API grows and others contribute someday.

gennaios commented on July 17, 2024

Found this about warnings from CGO; it may help:

https://groups.google.com/forum/#!topic/golang-nuts/mM9PeFsDfQ8

I'm not that familiar with cgo and MuPDF. I have an issue: when retrieving metadata from a PDF and trying to insert it into SQLite, I get an error: "unrecognized token: "'Title" …. It's inserting a single quote at the beginning of the value, and I'm not sure why. Perhaps you have some idea? I'm unsure whether it's encoding or something else. Looking into it.
