abejellinek / pdfdata

7.0 7.0 2.0 6.69 MB

Home Page: https://pdf.abe.im

License: Other

Java 96.72% HTML 3.21% JavaScript 0.07%

pdfdata's People

Forkers

hjellinek fritexvz

pdfdata's Issues

Update examples

Current examples are broken with new storage format.

Handle embedding XLS files

Allow attaching XLS and XLSX files in addition to CSV.

PDFData Read produces a ZIP file named "upload" with no extension

It would be convenient if the download file had the ".zip" extension.

It would be icing on the cake if the main part of the file name bore some relationship to the original name of the embedded CSV, CSV fragment name, or even the PDF file's name. In that last case, if the input PDF were health.pdf, the extracted CSV could be named health.csv.zip, for instance.
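The naming idea above can be sketched in a few lines. This is only an illustration of deriving the download name from the uploaded PDF's name; `DownloadNames` and `zipNameFor` are hypothetical names, not part of PDFData, and the fallback to "upload" is an assumption.

```java
import java.util.Locale;

// Hypothetical helper, not part of PDFData's actual code.
final class DownloadNames {
    private DownloadNames() {}

    /** Derives a download name such as "health.csv.zip" from "health.pdf". */
    static String zipNameFor(String pdfFileName) {
        // Keep only the last path segment in case the client sent a full path.
        String name = pdfFileName.replace('\\', '/');
        name = name.substring(name.lastIndexOf('/') + 1);
        // Strip a trailing ".pdf", ignoring case.
        if (name.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
            name = name.substring(0, name.length() - 4);
        }
        // Assumed fallback: reuse the current default base name.
        if (name.isEmpty()) {
            name = "upload";
        }
        return name + ".csv.zip";
    }

    public static void main(String[] args) {
        System.out.println(zipNameFor("health.pdf"));
    }
}
```

A server would presumably pass the result to the `Content-Disposition` header of the download response; that wiring is left out here.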

Web service for this functionality?

Would it be possible to put up a simple web service for trying out this functionality?

Thanks!
Walter Chang
Adobe Research

demonstrate file attachment naming methods

There's a discussion about the naming of attached files on the CG mailing list.
Implement one or more of the proposed naming methods to assess usability.

Here's my latest idea on file naming:

If you add in the CSVW metadata in JSON, there will be two files for each table, one in CSV and the other in JSON, each with file names which reliably identify them as to their role and relation. So the CSV file could be called

csvname = pdfname ["-" pagenum ["-" top "." left "." w "." d]] ".csv"

The name of the CSV file is made by filling a template with the PDF's name, an optional page number, and an optional viewport on the page. This lets you figure out on which page and in which region the table or chart (first) appears.

The metadata files are named after the CSV files:

metadatafilename = csvname "-metadata.json"
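As a rough sketch, the template could be filled like this. `AttachmentNames` and its methods are hypothetical. The code assumes that `pdfname` is passed in exactly as it should appear (the proposal does not say whether ".pdf" is stripped) and that the viewport values are integers in some unspecified unit.

```java
// Hypothetical helper illustrating the proposed naming template.
final class AttachmentNames {
    private AttachmentNames() {}

    /** pdfname ".csv" */
    static String csvName(String pdfName) {
        return pdfName + ".csv";
    }

    /** pdfname "-" pagenum ".csv" */
    static String csvName(String pdfName, int page) {
        return pdfName + "-" + page + ".csv";
    }

    /** pdfname "-" pagenum "-" top "." left "." w "." d ".csv" */
    static String csvName(String pdfName, int page, int top, int left, int w, int d) {
        return pdfName + "-" + page + "-" + top + "." + left + "." + w + "." + d + ".csv";
    }

    /** csvname "-metadata.json" */
    static String metadataName(String csvName) {
        return csvName + "-metadata.json";
    }

    public static void main(String[] args) {
        String csv = csvName("health", 3, 100, 50, 400, 200);
        System.out.println(csv);
        System.out.println(metadataName(csv));
    }
}
```

The overloads mirror the nesting in the template: a viewport can only be given together with a page number.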

demonstrate use of manifest

I think it's worth experimenting with the idea of a 'manifest', a particular metadata XMP property for each data source.

Server error on embedding 5th CSV in PDF

My Statistical Abstract example uses one chapter of the Statistical Abstract of the US (2012) and 5 of the chapter's tables, converted from XLS to CSV.

As luck would have it, I get an error when embedding file number 5.

PDF with first 4 files embedded

The fatal CSV, compressed

output should default to returning triples

Similar to what the HTML/RDFa renderer does, the output should preferably be expressed as triples, in Turtle, OWL, or JSON-LD. Perhaps the library could offer a callback to generate triples.

Regression: PDFData produces or extracts gibberish

I had previously encountered a crash bug when embedding a CSV in a file that already had one embedded in it. The fix for that was to update PDFData to use the latest PDFBox.

The good news is that the server no longer crashes. The bad news is that the server is now producing bad output.

Test case:

I used the PDFData server to embed the attached CSV in the attached PDF.
I used the PDFData server to upload the PDF and view the CSV.
I used the PDFData server to upload the PDF and download the CSV.

Alas, the CSV output is gibberish in both the view and download cases.

Input PDF: health.pdf
The input CSV (GitHub forced me to compress it in order to attach it to this report):
12s0214.xls

Here's the resulting PDF with bad embedded data:
health_data1-bad.pdf

store CSV but generate triples for file attachments

You need some other information to generate qualified triples, such as pointers to the vocabulary items that define what "Galway" and "20 degrees" mean.

Error using PDFData to view data embedded in a PDF produced by PDFData tool

I'm using https://pdf.abe.im.

I downloaded a section of the Statistical Abstract of the US (2012) and used the PDFData tool to embed one of its source tables in it, then took the output of that and added a second CSV to it. I did not add fragment identifiers to either one.
health_data_data.pdf.zip

When I uploaded the resulting file (attached to this issue) to https://pdf.abe.im/read/upload to view it, I received this error:

Whitelabel Error Page

This application has no explicit mapping for /error, so you are seeing this as a fallback.
Fri Sep 02 22:58:37 UTC 2016
There was an unexpected error (type=Internal Server Error, status=500).
Error: Expected a long type at offset 4390893, instead got 'olumn_right'

Mention WCAG for PDF as a role model

What documents should we produce?
Consider the analogy with accessibility: there's WCAG (Web Content Accessibility Guidelines) which establishes various levels of accessibility, and there are notes defining how to accomplish those levels for various formats, including PDF (https://www.w3.org/TR/2014/NOTE-WCAG20-TECHS-20140408/pdf.html) but also others.

For linked data, there are levels of 'open data' (right now we have 5 stars), and then there are techniques for accomplishing those.

This community group could focus on 'techniques for Data in PDF' (covering directions for adding to a generic format), but this would provide a more inclusive roadmap.

update interface to store/retrieve multiple files -- handle CSV cases

http://w3c.github.io/csvw/html-note/
is the place to start -- see https://lists.w3.org/Archives/Public/public-pdf-open-data/2016Sep/0000.html
for a description.

Get ready for demo

Documentation
Support all formats
Better design
Attachment location (nameddest?)

default reader should generate triples based on PDF's XMP

For actual document metadata, such as title, author, and dates, the reader should enumerate the attributes as triples.
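To make the idea concrete, here is a minimal sketch that turns already-extracted metadata into N-Triples lines. It does not read XMP itself. `MetadataTriples` is a hypothetical name, the document IRI is a made-up example, and mapping title and author to Dublin Core property IRIs is an assumption about how the reader might choose predicates.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: serializes property/value pairs as N-Triples.
final class MetadataTriples {
    private MetadataTriples() {}

    /** Escapes the characters that must not appear raw in an N-Triples literal. */
    static String escape(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    /** One triple per property: document IRI, property IRI, plain string literal. */
    static List<String> toNTriples(String documentIri, Map<String, String> properties) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : properties.entrySet()) {
            lines.add("<" + documentIri + "> <" + e.getKey() + "> \""
                    + escape(e.getValue()) + "\" .");
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        // Assumed predicate choice: Dublin Core elements.
        props.put("http://purl.org/dc/elements/1.1/title", "Health");
        props.put("http://purl.org/dc/elements/1.1/creator", "U.S. Census Bureau");
        for (String line : toNTriples("http://example.org/health.pdf", props)) {
            System.out.println(line);
        }
    }
}
```

All values are emitted as plain string literals; typed literals for dates would need datatype IRIs, which this sketch leaves out.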

build issues

Maybe make a separate BUILD.md to isolate build instructions from README.md?

Currently, I'm getting these errors:

Error:(3, 34) java: package com.fasterxml.jackson.core does not exist
Error:(4, 38) java: package com.fasterxml.jackson.databind does not exist
Error:(85, 54) java: cannot find symbol
symbol: class JsonProcessingException
location: class me.abje.xmptest.Table
Error:(97, 24) java: cannot find symbol
symbol: class ObjectMapper
location: class me.abje.xmptest.Table
Error:(99, 24) java: cannot find symbol
symbol: class ObjectMapper
location: class me.abje.xmptest.Table

Replace "Whitelabel Error Page" with something a little less generic

I just triggered another error, and it struck me that the error page could use a better title and explanatory text, if not a better font!

Write requirements document

We should have a document that describes the problem we're solving. It should be specific about the benchmark of "PDF with data is as good as HTML with RDFa, more or less", identifying the workflows we think these tools will support.

one reader, multiple writers

This probably turns into "write a design document" and then, after review, some implementation tasks.

The general idea is that there's a single "get data from PDF file" which looks for data in any of the ways it might be stored. And multiple "modify PDF to have data" utilities, at least one for each kind of storage.

"get data from PDF file" returns data in one of several formats (controlled by some output-type). JSON-LD is my favorite (or a stream of JSON-LD data), but other formats for RDF triples are Turtle or, for diehards, RDF/XML.

"get data from PDF file" looks in the PDF's XMP for "hasData" attributes. This is a kind of extensible index of ways in which data is stored.

Attachment -- attachments could have special names that indicate they're data attachments.
Annotations -- annotations could start with a special text string that indicates they're data annotations.

There could be a special kind of attachment and we could use annotations as an alternate representation, with a little utility that translated PDF with text annotations <-> PDF with data attachments.

That would give people a workflow for creating PDF data attachments by hand, similar to what they can do when adding RDFa to HTML.
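A possible shape for the "one reader" side, as a sketch only: a single entry point walks the "hasData" index and dispatches to one extractor per storage kind. Everything here is hypothetical, including the `PdfView` stand-in (a real reader would fill it from a PDF library), the kind names "attachment" and "annotation", and the "data-" and "#data " markers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Stand-in for a parsed PDF; a real reader would populate this via a PDF library. */
final class PdfView {
    final List<String> hasData = new ArrayList<>();                // storage kinds listed in XMP
    final Map<String, String> attachments = new LinkedHashMap<>(); // file name -> content
    final List<String> annotations = new ArrayList<>();            // annotation text
}

final class DataReader {
    // Hypothetical conventions marking data attachments and data annotations.
    static final String ATTACHMENT_PREFIX = "data-";
    static final String ANNOTATION_PREFIX = "#data ";

    private final Map<String, Function<PdfView, List<String>>> extractors = new LinkedHashMap<>();

    DataReader() {
        extractors.put("attachment", DataReader::fromAttachments);
        extractors.put("annotation", DataReader::fromAnnotations);
    }

    /** Single entry point: consults the hasData index and dispatches per storage kind. */
    List<String> read(PdfView pdf) {
        List<String> out = new ArrayList<>();
        for (String kind : pdf.hasData) {
            Function<PdfView, List<String>> extractor = extractors.get(kind);
            if (extractor != null) {
                out.addAll(extractor.apply(pdf)); // unknown kinds are skipped
            }
        }
        return out;
    }

    static List<String> fromAttachments(PdfView pdf) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : pdf.attachments.entrySet()) {
            if (e.getKey().startsWith(ATTACHMENT_PREFIX)) {
                out.add(e.getValue());
            }
        }
        return out;
    }

    static List<String> fromAnnotations(PdfView pdf) {
        List<String> out = new ArrayList<>();
        for (String text : pdf.annotations) {
            if (text.startsWith(ANNOTATION_PREFIX)) {
                out.add(text.substring(ANNOTATION_PREFIX.length()));
            }
        }
        return out;
    }
}
```

Adding a new storage method then means registering one more extractor and writing the matching "modify PDF to have data" utility, without touching `read`. Output formatting (JSON-LD, Turtle, RDF/XML) would sit on top of the extracted data and is not shown.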

standalone JAR file for experimenting?

Would it be possible to release a stand-alone JAR file to try out this functionality?

Thanks,
Walter Chang
Adobe Research