pdfdata's People

Contributors

abejellinek, hjellinek

pdfdata's Issues

PDFData Read produces a ZIP file named "upload" with no extension

It would be convenient if the download file had the ".zip" extension.

It would be icing on the cake if the main part of the file name bore some relationship to the original name of the embedded CSV, CSV fragment name, or even the PDF file's name. In that last case, if the input PDF were health.pdf, the extracted CSV could be named health.csv.zip, for instance.
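A minimal sketch of the last naming option, deriving the download name from the PDF's file name. The function name `download_name`, the optional `fragment` argument, and the `-` separator are assumptions made for illustration; they are not part of PDFData.

```python
from pathlib import PurePath


def download_name(pdf_name, fragment=None):
    """Suggest a ZIP download name based on the uploaded PDF's file name.

    Falls back to "upload" when no usable name is available.
    The "-fragment" part is a hypothetical convention.
    """
    stem = PurePath(pdf_name).stem or "upload"
    if fragment:
        stem = f"{stem}-{fragment}"
    return f"{stem}.csv.zip"


print(download_name("health.pdf"))            # health.csv.zip
print(download_name("health.pdf", "table1"))  # health-table1.csv.zip
```

On the server side, a name like this would typically be sent in the response's Content-Disposition header so that browsers save the file with the right extension.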

demonstrate file attachment naming methods

There's a discussion about the naming of attached files on the CG mailing list.
Try one or more of the methods proposed there, to assess usability.

Here's my latest idea on file naming:

If you add in the CSVW metadata in JSON, there will be two files for each table, one in CSV and the other in JSON, each with file names which reliably identify them as to their role and relation. So the CSV file could be called

csvname = pdfname ["-" pagenum ["-" top "." left "." w "." d]] ".csv"

The name of the CSV file is made by filling a template with the PDF's name, an optional page number, and an optional viewport on the page. This would let you figure out on which page and in which region the table or chart (first) appears.

The metadata files are named after the csv files:

Metadatafilename = csvname "-metadata.json"
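The two templates can be sketched as a pair of small functions. This is only an illustration: the function names are hypothetical, and it assumes that pdfname is the PDF's base name without the ".pdf" extension and that the viewport is given as a (top, left, w, d) tuple.

```python
def csv_name(pdf_name, page=None, viewport=None):
    """csvname = pdfname ["-" pagenum ["-" top "." left "." w "." d]] ".csv"

    The viewport part is nested inside the page part, so a viewport
    without a page number is rejected.
    """
    name = pdf_name
    if page is not None:
        name += f"-{page}"
        if viewport is not None:
            top, left, w, d = viewport
            name += f"-{top}.{left}.{w}.{d}"
    elif viewport is not None:
        raise ValueError("a viewport requires a page number")
    return name + ".csv"


def metadata_name(csvname):
    """Metadatafilename = csvname "-metadata.json" """
    return csvname + "-metadata.json"


table = csv_name("health", page=3, viewport=(100, 50, 400, 200))
print(table)                 # health-3-100.50.400.200.csv
print(metadata_name(table))  # health-3-100.50.400.200.csv-metadata.json
```

Appending "-metadata.json" to the full CSV name means a reader can find the metadata file from the CSV name alone, without parsing the page or viewport parts.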

demonstrate use of manifest

I think it's worth experimenting with the idea of a 'manifest', a particular metadata XMP property for each data source.

output should default to returning triples

Similar to what the HTML/RDFa renderer does, the output should preferably be in terms of triples, in Turtle or OWL or JSON-LD. Perhaps the library has a call-back to generate triples.
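A rough sketch of what a triple call-back could look like, assuming the table has already been extracted into a header and rows. The function names, the `#row=N` subject URIs, the per-column predicate URIs, and the example.org base are all illustrative assumptions, not an existing PDFData API. The output is serialized as N-Triples, which is also valid Turtle.

```python
def emit_table_triples(table_uri, header, rows, on_triple):
    """Call on_triple(subject, predicate, value) once per table cell."""
    for row_number, row in enumerate(rows, start=1):
        subject = f"{table_uri}#row={row_number}"
        for column, value in zip(header, row):
            # Assumes column names are simple tokens that are safe in a URI.
            on_triple(subject, f"{table_uri}#{column}", value)


def to_ntriples(triples):
    """Serialize (subject, predicate, literal) tuples as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        escaped = o.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)


collected = []
emit_table_triples(
    "http://example.org/health.csv",
    ["year", "total"],
    [["2010", "42"]],
    lambda s, p, o: collected.append((s, p, o)),
)
print(to_ntriples(collected))
```

Keeping the call-back separate from the serializer lets the same extraction code feed Turtle, JSON-LD, or an in-memory store.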

Regression: PDFData produces or extracts gibberish

I had previously encountered a crash bug when embedding a CSV in a file that already had one embedded in it. The fix for that was to update PDFData to use the latest PDFBox.

The good news is that the server no longer crashes. The bad news is that the server is now producing bad output.

Test case:

  • I used the PDFData server to embed the attached CSV in the attached PDF.
  • I used the PDFData server to upload the PDF and view the CSV.
  • I used the PDFData server to upload the PDF and download the CSV.

Alas, the CSV output is gibberish in both the view and download cases.

Input PDF: health.pdf
The input CSV (GitHub forced me to compress it in order to attach it to this report):
12s0214.xls

Here's the resulting PDF with bad embedded data:
health_data1-bad.pdf

Error using PDFData to view data embedded in a PDF produced by PDFData tool

I'm using https://pdf.abe.im.

I downloaded a section of the Statistical Abstract of the US (2012) and used the PDFData tool to embed one of its source tables in it, then took the output of that and added a second CSV to it. I did not add fragment identifiers to either one.
health_data_data.pdf.zip

When I uploaded the resulting file (attached to this issue) to https://pdf.abe.im/read/upload to view it, I received this error:

Whitelabel Error Page

This application has no explicit mapping for /error, so you are seeing this as a fallback.
Fri Sep 02 22:58:37 UTC 2016
There was an unexpected error (type=Internal Server Error, status=500).
Error: Expected a long type at offset 4390893, instead got 'olumn_right'

Mention WCAG for PDF as a role model

What documents should we produce?
Consider the analogy with accessibility: there's WCAG (Web Content Accessibility Guidelines) which establishes various levels of accessibility, and there are notes defining how to accomplish those levels for various formats, including PDF (https://www.w3.org/TR/2014/NOTE-WCAG20-TECHS-20140408/pdf.html) but also others.

For linked data, there are levels of 'open data' (right now we have 5 stars), and then there are techniques for accomplishing those.

This community group could focus on 'techniques for Data in PDF' (covering directions for adding to a generic format), but this would provide a more inclusive roadmap.

Get ready for demo

  • Documentation
  • Support all formats
  • Better design
  • Attachment location (nameddest?)

build issues

Maybe make a separate BUILD.md to isolate build instructions from README.md?

Currently, I'm getting these errors:

Error:(3, 34) java: package com.fasterxml.jackson.core does not exist
Error:(4, 38) java: package com.fasterxml.jackson.databind does not exist
Error:(85, 54) java: cannot find symbol
symbol: class JsonProcessingException
location: class me.abje.xmptest.Table
Error:(97, 24) java: cannot find symbol
symbol: class ObjectMapper
location: class me.abje.xmptest.Table
Error:(99, 24) java: cannot find symbol
symbol: class ObjectMapper
location: class me.abje.xmptest.Table

Write requirements document

We should have a document which describes the problem we're solving. It should be specific about the benchmark ("PDF with data is as good as HTML with RDF/a, more or less"), identifying the workflows we think these tools will support.

one reader, multiple writers

This probably turns into "write a design document" and then, after review, some implementation tasks.

The general idea is that there's a single "get data from PDF file" utility which looks for data in any of the ways it might be stored, and multiple "modify PDF to have data" utilities, at least one for each kind of storage.

"get data from PDF file" returns data in one of several formats (controlled by some output-type). JSON-LD is my favorite (or a stream of JSON-LD data), but other formats for RDF triples are Turtle or, for diehards, RDF/XML.

"get data from PDF file" looks in the PDF's XMP for "hasData" attributes. This is a kind of extensible index of ways in which data is stored.

Attachments -- attachments could have special names that indicate they're data attachments.
Annotations -- annotations could start with a special text string that indicates they're data annotations.

There could be a special kind of attachment and we could use annotations as an alternate representation, with a little utility that translated PDF with text annotations <-> PDF with data attachments.

That would give a workflow for human creation of PDF data attachments similar to what they can do for adding RDF/a to HTML.
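The "one reader" side of this design can be sketched as a dispatcher driven by the "hasData" index. To stay self-contained, the sketch models a PDF as a plain dictionary rather than parsing a real file. The dictionary layout, the ".csv" attachment suffix, the "@data " annotation prefix, and all function names are hypothetical choices made only for illustration.

```python
DATA_ATTACHMENT_SUFFIX = ".csv"    # hypothetical naming convention
DATA_ANNOTATION_PREFIX = "@data "  # hypothetical marker string


def read_attachments(pdf):
    """Return the contents of attachments whose names mark them as data."""
    return [content for name, content in pdf.get("attachments", {}).items()
            if name.endswith(DATA_ATTACHMENT_SUFFIX)]


def read_annotations(pdf):
    """Return the payload of annotations that start with the data marker."""
    return [text[len(DATA_ANNOTATION_PREFIX):]
            for text in pdf.get("annotations", [])
            if text.startswith(DATA_ANNOTATION_PREFIX)]


# One reader per kind of storage; a new writer adds one entry here.
READERS = {"attachment": read_attachments, "annotation": read_annotations}


def get_data(pdf):
    """Look up each storage kind listed under "hasData" and collect its data."""
    found = []
    for kind in pdf.get("xmp", {}).get("hasData", []):
        reader = READERS.get(kind)
        if reader is not None:  # unknown kinds are skipped, not errors
            found.extend(reader(pdf))
    return found


sample = {
    "xmp": {"hasData": ["attachment", "annotation"]},
    "attachments": {"health.csv": "year,total\n2010,42\n", "logo.png": "..."},
    "annotations": ["@data year,total", "just a comment"],
}
print(get_data(sample))
```

Because the reader only consults the index, each "modify PDF to have data" utility has to do two things: store the data in its own way and add its kind to "hasData".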
