abejellinek / pdfdata
Home Page: https://pdf.abe.im
License: Other
Current examples are broken with new storage format.
Allow attaching XLS and XLSX files in addition to CSV.
It would be convenient if the download file had the ".zip" extension.
It would be icing on the cake if the main part of the file name bore some relationship to the original name of the embedded CSV, the CSV fragment name, or even the PDF file's name. In that last case, if the input PDF were health.pdf, the extracted CSV could be named health.csv.zip, for instance.
Would it be possible to put up a simple web service for trying out this functionality?
Thanks!
Walter Chang
Adobe Research
There's a discussion about the naming of attached files on the CG mailing list.
Follow one or more of those, to assess usability.
Here's my latest idea on file naming:
If you add in the CSVW metadata in JSON, there will be two files for each table, one in CSV and the other in JSON, each with file names which reliably identify them as to their role and relation. So the CSV file could be called
csvname = pdfname ["-" pagenum ["-" top "." left "." w "." d]] ".csv"
The name of the CSV file is made by filling a template with the PDF's name, an optional page number, and an optional viewport on the page. This would let you figure out on which page and in which region the table/chart (first) appears.
The metadata files are named after the csv files:
metadatafilename = csvname "-metadata.json"
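The naming template above can be sketched in a few lines of Python. This is only an illustration, not code from PDFData: the function names `csv_name` and `metadata_name` are hypothetical, and stripping a trailing ".pdf" from the PDF name is an assumption (the template does not say whether pdfname includes the extension).

```python
from typing import Optional, Tuple


def csv_name(pdf_name: str,
             page: Optional[int] = None,
             viewport: Optional[Tuple[int, int, int, int]] = None) -> str:
    """Fill the template pdfname ["-" pagenum ["-" top "." left "." w "." d]] ".csv"."""
    if viewport is not None and page is None:
        # The template only allows a viewport nested inside a page number.
        raise ValueError("a viewport requires a page number")
    # Assumption: drop a trailing ".pdf" so health.pdf yields health.csv.
    stem = pdf_name[:-4] if pdf_name.lower().endswith(".pdf") else pdf_name
    parts = [stem]
    if page is not None:
        parts.append(str(page))
        if viewport is not None:
            # viewport is (top, left, w, d), joined with dots as in the template.
            parts.append(".".join(str(v) for v in viewport))
    return "-".join(parts) + ".csv"


def metadata_name(csv_file_name: str) -> str:
    """Name the metadata file after the CSV file it describes."""
    return csv_file_name + "-metadata.json"
```

For example, `csv_name("health.pdf", 3, (10, 20, 300, 150))` gives `health-3-10.20.300.150.csv`, and passing that to `metadata_name` gives `health-3-10.20.300.150.csv-metadata.json`.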
I think it's worth experimenting with the idea of a 'manifest', a particular metadata XMP property for each data source.
My Statistical Abstract example uses one chapter of the Statistical Abstract of the US (2012) and 5 of the chapter's tables, converted from XLS to CSV.
As luck would have it, I get an error when embedding file number 5.
Similar to what the HTML/RDFa renderer does, the output should preferably be in terms of triples, in Turtle or OWL or JSON-LD. Perhaps the library has a callback to generate triples.
I had previously encountered a crash bug when embedding a CSV in a file that already had one embedded in it. The fix for that was to update PDFData to use the latest PDFBox.
The good news is that the server no longer crashes. The bad news is that the server is now producing bad output.
Test case:
Alas, the CSV output is gibberish in both the view and download cases.
Input PDF: health.pdf
The input CSV (GitHub forced me to compress it in order to attach it to this report):
12s0214.xls
Here's the resulting PDF with bad embedded data:
health_data1-bad.pdf
You need to know some other info to generate qualified triples, like the pointer to the vocabulary items that define what "Galway" and "20 degrees" mean.
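To make the point about vocabulary pointers concrete, here is a minimal sketch of turning one CSV row into triples. Everything in it is an assumption for illustration: the `row_to_triples` and `to_ntriples` helpers are hypothetical, the `http://example.org/...` IRIs are placeholders rather than a real vocabulary, and all values are emitted as plain string literals in N-Triples syntax. Columns with no known vocabulary term are skipped, because without that pointer no qualified triple can be produced.

```python
from typing import Dict, Iterator, Tuple

Triple = Tuple[str, str, str]


def row_to_triples(subject_iri: str,
                   row: Dict[str, str],
                   column_predicates: Dict[str, str]) -> Iterator[Triple]:
    """Yield (subject, predicate, literal) for each column that has a vocabulary term."""
    for column, value in row.items():
        predicate = column_predicates.get(column)
        if predicate is None:
            continue  # no vocabulary pointer for this column, so no triple
        yield (subject_iri, predicate, value)


def to_ntriples(triple: Triple) -> str:
    """Serialize one triple as an N-Triples line with a plain string literal."""
    s, p, o = triple
    escaped = (o.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))
    return f'<{s}> <{p}> "{escaped}" .'


# Hypothetical row and column-to-vocabulary mapping.
row = {"city": "Galway", "temperature": "20", "note": "unmapped"}
predicates = {
    "city": "http://example.org/vocab#city",
    "temperature": "http://example.org/vocab#temperatureCelsius",
}
lines = [to_ntriples(t) for t in row_to_triples("http://example.org/obs/1", row, predicates)]
```

The mapping from column names to predicates is exactly the extra information the CSV alone does not carry; in practice it would have to come from metadata stored alongside the table.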
I'm using https://pdf.abe.im.
I downloaded a section of the Statistical Abstract of the US (2012) and used the PDFData tool to embed one of its source tables in it, then took the output of that and added a second CSV to it. I did not add fragment identifiers to either one.
health_data_data.pdf.zip
When I uploaded the resulting file (attached to this issue) to https://pdf.abe.im/read/upload to view it, I received this error:
Whitelabel Error Page
This application has no explicit mapping for /error, so you are seeing this as a fallback.
Fri Sep 02 22:58:37 UTC 2016
There was an unexpected error (type=Internal Server Error, status=500).
Error: Expected a long type at offset 4390893, instead got 'olumn_right'
What documents should we produce?
Consider the analogy with accessibility: there's WCAG (Web Content Accessibility Guidelines) which establishes various levels of accessibility, and there are notes defining how to accomplish those levels for various formats, including PDF (https://www.w3.org/TR/2014/NOTE-WCAG20-TECHS-20140408/pdf.html) but also others.
For linked data, there are levels of 'open data' (right now we have 5 stars), and then there are techniques for accomplishing those.
This community group could focus on 'techniques for Data in PDF' (covering directions for adding to a generic format), but this would provide a more inclusive roadmap.
http://w3c.github.io/csvw/html-note/ is the place to start -- see https://lists.w3.org/Archives/Public/public-pdf-open-data/2016Sep/0000.html for a description.
(For actual document metadata, like title, author, and dates, the reader should enumerate the attributes as triples.)
Maybe make a separate BUILD.md to isolate build instructions from README.md?
Currently, I'm getting these errors:
Error:(3, 34) java: package com.fasterxml.jackson.core does not exist
Error:(4, 38) java: package com.fasterxml.jackson.databind does not exist
Error:(85, 54) java: cannot find symbol
symbol: class JsonProcessingException
location: class me.abje.xmptest.Table
Error:(97, 24) java: cannot find symbol
symbol: class ObjectMapper
location: class me.abje.xmptest.Table
Error:(99, 24) java: cannot find symbol
symbol: class ObjectMapper
location: class me.abje.xmptest.Table
We should have a document which describes the problem we're solving. It should be specific about the benchmark of "PDF with data is as good as HTML with RDFa, more or less", identifying the workflows we think these tools will support.
This probably turns into "write a design document" and then, after review, some implementation tasks.
The general idea is that there's a single "get data from PDF file" which looks for data in any of the ways it might be stored. And multiple "modify PDF to have data" utilities, at least one for each kind of storage.
"get data from PDF file" returns data in one of several formats (controlled by some output-type). JSON-LD is my favorite (or a stream of JSON-LD data), but other formats for RDF triples are Turtle or, for diehards, RDF/XML.
"get data from PDF file" looks in the PDF's XMP for "hasData" attributes. This is a kind of extensible index of ways in which data is stored.
Attachments -- attachments could have special names that indicate they're data attachments.
Annotations -- annotations could start with a special text string that indicates they're data annotations.
There could be a special kind of attachment and we could use annotations as an alternate representation, with a little utility that translated PDF with text annotations <-> PDF with data attachments.
That would give a workflow for human creation of PDF data attachments similar to what people can do when adding RDFa to HTML.
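The "single reader, multiple storage kinds" idea could look roughly like the sketch below. It is a design illustration under stated assumptions, not PDFData's API: the PDF is modeled as a plain dictionary rather than parsed from a real file, the `hasData` list stands in for the XMP attributes, and the `data-` name prefix and `DATA:` annotation marker are invented conventions.

```python
from typing import Callable, Dict, List

# Hypothetical conventions for marking data attachments and data annotations.
ATTACHMENT_PREFIX = "data-"
ANNOTATION_MARKER = "DATA:"

Extractor = Callable[[dict], List[str]]
EXTRACTORS: Dict[str, Extractor] = {}


def register(kind: str) -> Callable[[Extractor], Extractor]:
    """Register an extractor for one kind of data storage."""
    def wrap(fn: Extractor) -> Extractor:
        EXTRACTORS[kind] = fn
        return fn
    return wrap


@register("attachment")
def from_attachments(pdf: dict) -> List[str]:
    # Only attachments whose names carry the special prefix count as data.
    return [a["content"] for a in pdf.get("attachments", [])
            if a["name"].startswith(ATTACHMENT_PREFIX)]


@register("annotation")
def from_annotations(pdf: dict) -> List[str]:
    # Only annotations that start with the marker count as data.
    return [text[len(ANNOTATION_MARKER):].strip()
            for text in pdf.get("annotations", [])
            if text.startswith(ANNOTATION_MARKER)]


def get_data(pdf: dict) -> List[str]:
    """Consult the hasData index and run the extractor for each listed kind."""
    results: List[str] = []
    for kind in pdf.get("hasData", []):
        extractor = EXTRACTORS.get(kind)
        if extractor is None:
            continue  # unknown storage kind: skip rather than fail
        results.extend(extractor(pdf))
    return results


# Stand-in for a parsed PDF; a real reader would obtain these from the file.
example_pdf = {
    "hasData": ["attachment", "annotation", "future-kind"],
    "attachments": [
        {"name": "data-health.csv", "content": "year,value\n2012,1"},
        {"name": "logo.png", "content": "(binary)"},
    ],
    "annotations": ["DATA: city,Galway", "Reviewer comment"],
}
found = get_data(example_pdf)
```

Because the index is just a list of kinds, a new storage method only needs a new registered extractor and a matching "modify PDF to have data" utility; the reader itself stays unchanged.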
Would it be possible to release a stand-alone JAR file to try out this functionality?
Thanks,
Walter Chang
Adobe Research