readpdf's Introduction

This package provides functionality to work with XML documents representing the contents of PDF documents. These XML documents are generated by pdftohtml, specifically our extended version.

The functionality includes

  • parsing the document with an associated class
  • easy access to pages as if the document were a list, e.g. doc[[1]]
  • loop over the pages with lapply()/sapply()
  • extract the title of the document
  • extract the dates an academic article was submitted, revised, published
  • determine if the PDF document was scanned
  • get the header and footer for pages
  • get the text or words for a page or entire document
  • get the text arranged by column
  • get the location for each text segment
  • get font information for each piece of text
  • display the contents of a page, including lines, rectangles, and image boxes

Installation

  1. Install the development version of the XML package

Currently, ReadPDF requires the development (GitHub) version of the XML package. It can be installed in R using the devtools package:

devtools::install_github("omegahat/XML")

  2. (Recommended) Install the extended version of pdftohtml

The package will work with other versions of pdftohtml, but some functions will not work without our extended version.

Clone or download our extended version of pdftohtml.

Then build the binary executable (requires make and a C++ compiler):

cd pdftohtml
make

Once built, you can move the binary to your system directory (e.g., /usr/bin on Unix systems),

cp src/pdftohtml /usr/bin

or you can specify the location of this binary in R via the PDFTOHTML option:

options(PDFTOHTML = "path/to/pdftohtml")

  3. Install ReadPDF

devtools::install_github("dsidavis/ReadPDF")
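
To check which pdftohtml binary an R session would pick up, something like the following may help. It is a minimal sketch in base R: findPdftohtml is a hypothetical helper, not part of ReadPDF, and it assumes the lookup order is the PDFTOHTML option first and the system PATH second, which may differ from what the package actually does.

```r
# Hypothetical helper (not part of ReadPDF): report where pdftohtml
# would be found, assuming the PDFTOHTML option takes precedence over PATH.
findPdftohtml <- function() {
  opt <- getOption("PDFTOHTML", default = "")
  if (is.character(opt) && nzchar(opt)) {
    return(opt)
  }
  # Sys.which() returns "" when the program is not on the PATH.
  unname(Sys.which("pdftohtml"))
}

options(PDFTOHTML = "path/to/pdftohtml")
findPdftohtml()
```

An empty string from this helper would suggest that neither the option nor the PATH points at a binary.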

readpdf's People

Contributors

duncantl, jcarlen, mespe

readpdf's Issues

findsectionheaders

The findSectionHeaders function splits headers that are on the same line and have the same font into separate elements. A reproducible example:

testassay = convertPDF2XML('Andriamandimby-2011-Crimean-Congo hemorrhagic.pdf')
findSectionHeaders(testassay)

[[6]]
<text top="230" left="94" width="34" height="12" font="23" rotation="0.000000">Study</text> 

[[7]]
<text top="230" left="131" width="39" height="12" font="23" rotation="0.000000">design</text> 
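
One way to think about a fix is to merge adjacent segments that share the same top and font values and sit close together horizontally. The sketch below illustrates that idea in base R on a plain data frame. mergeSameLine and its maxGap threshold are hypothetical, the first two rows copy the attributes from the output above, and the third row is made up to show a segment that should stay separate. It is not how findSectionHeaders is implemented.

```r
# Hypothetical helper: merge text segments that share `top` and `font`
# and whose horizontal gap is at most `maxGap` units.
mergeSameLine <- function(segs, maxGap = 10) {
  segs <- segs[order(segs$top, segs$left), ]
  n <- nrow(segs)
  if (n < 2) {
    return(segs$text)
  }
  prev <- seq_len(n - 1)
  nxt <- prev + 1
  # Gap between the right edge of one segment and the left edge of the next.
  gap <- segs$left[nxt] - (segs$left[prev] + segs$width[prev])
  sameRun <- segs$top[nxt] == segs$top[prev] &
    segs$font[nxt] == segs$font[prev] &
    gap <= maxGap
  # Start a new group whenever a segment does not continue the previous one.
  group <- cumsum(c(TRUE, !sameRun))
  unname(vapply(split(segs$text, group), paste, character(1), collapse = " "))
}

segs <- data.frame(
  top   = c(230, 230, 260),
  left  = c(94, 131, 94),
  width = c(34, 39, 45),
  font  = c("23", "23", "23"),
  text  = c("Study", "design", "Results"),  # third row is illustrative only
  stringsAsFactors = FALSE
)
mergeSameLine(segs)
```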

ConvertPDF2XML XMLParse

In Chua's PDF there is a recurring /001 exit character, which is a special hyphen the authors use in the PDF. The error occurs in the headers, between the numbers 265-275, on every page. It is also prevalent in the reference section. Another exit character occurred where the author used a + symbol on page 7. After replacing these exit characters with gsub, the document can be parsed with xmlParse as normal.
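
The gsub step could look roughly like this. It is a sketch under assumptions: stripInvalidXMLChars is a hypothetical helper, the sample string is made up, and the character class covers the control characters that XML 1.0 does not allow (everything below 0x20 except tab, newline, and carriage return), which may be broader than what the issue author actually removed.

```r
# Hypothetical helper: replace control characters that are not legal in
# XML 1.0 before handing the text to an XML parser.
stripInvalidXMLChars <- function(txt, replacement = "") {
  gsub("[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F]", replacement, txt, perl = TRUE)
}

raw <- "<text>265\001275</text>"               # made-up sample containing char value 1
clean <- stripInvalidXMLChars(raw, replacement = "-")
clean
# The cleaned string could then be parsed, e.g. XML::xmlParse(clean, asText = TRUE).
```

Passing replacement = "-" keeps a visible hyphen where the special one was. The default simply drops the character.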

convertPDF2XML

When using this function to convert .pdf files to XML, I got several errors and warnings.
There are two different types:
1.
Error in if (file.exists(file) == FALSE) if (!missing(asText) && asText == :
argument is of length zero
In addition: Warning message:
running command 'pdftohtml -q -xml -stdout '54664A345348.pdf'' had status 1
2.
"PCDATA invalid Char value 1"
For pdfs:
[2] "Chua-2003-Nipah virus outbreak in Malaysia1.pdf"
[3] "Chua-2003-Nipah virus outbreak in Malaysia.pdf"
[4] "de Thoisy-2004-Wild terrestrial rainforest ma1.pdf"
[5] "de Thoisy-2004-Wild terrestrial rainforest mam.pdf"
[6] "Nakgoi-2014-Dengue, Japanese Encephalitis and.pdf"
[7] "Oliveira-2009-Genetic characterization of a J1.pdf"
[8] "Oliveira-2009-Genetic characterization of a Ju.pdf"
[9] "Sottosanti-2005-Serological study of the lymph.pdf"
[10] "Switzer-2005-The epidemiology of simian immuno.pdf"
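
When converting many files, it can help to record which ones fail and with what message instead of stopping at the first error. The following is a minimal sketch: collectFailures is a hypothetical helper, and fakeConvert is a stand-in used only to make the example runnable. In practice the converter passed in would be convertPDF2XML.

```r
# Hypothetical helper: run `convert` on each file and return the messages
# for the files that raised a warning or an error, named by file.
collectFailures <- function(files, convert) {
  status <- vapply(files, function(f) {
    tryCatch({
      convert(f)
      "ok"
    },
    # Note: a warning aborts the conversion of that file here.
    warning = function(w) paste("warning:", conditionMessage(w)),
    error = function(e) paste("error:", conditionMessage(e)))
  }, character(1))
  status[status != "ok"]
}

# Stand-in converter for illustration only.
fakeConvert <- function(f) {
  if (grepl("^bad", f)) stop("PCDATA invalid Char value 1")
  invisible(f)
}

failures <- collectFailures(c("good.pdf", "bad.pdf"), fakeConvert)
failures
```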

isScanned2 misidentification

isScanned2 returns TRUE for documents that appear to be scanned (pixelated font on zooming in, etc.) but actually contain text in the XML. Other functions, e.g. getPublicationDates, return valid results for these documents.

Example documents:

[1] "internal-pdf://1141406970/Rollin-1995-Isolation of black creek canal vir.pdf"
[2] "internal-pdf://1670255814/Hendra.pdf"
[3] "internal-pdf://3552799091/Timoney-1976-Encephalitis caused by louping il.pdf"
[4] "internal-pdf://1615163449/Kaplan-1980-Evidence of infection by viruses i.pdf"
[5] "internal-pdf://4021195741/Shepherd-1987-Antibody to Crimean-Congo hemorr.pdf"
[6] "internal-pdf://2088887942/Hanson-1952-The natural history of vesicular s.pdf"
[7] "internal-pdf://3332319627/Monath-1980-Yellow fever in the Gambia, 1978--.pdf"
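
A possible guard against this kind of misidentification is to treat a document as scanned only when its pages contain little or no extracted text. The sketch below shows that rule on plain word counts. looksScanned and the minWords threshold are hypothetical, and this is not how isScanned2 works.

```r
# Hypothetical check: `wordsPerPage` is a numeric vector with the number of
# words extracted from each page. A document with no pages, or with every
# page below `minWords`, is treated as scanned.
looksScanned <- function(wordsPerPage, minWords = 20) {
  length(wordsPerPage) == 0 || all(wordsPerPage < minWords)
}

looksScanned(c(0, 3, 0))       # almost no text on any page
looksScanned(c(412, 388, 15))  # most pages carry real text
```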

XPath error when using ReadPDF library functions

I get the following error when trying to run some of the library's functions, such as
txt = getSectionText(f) and

lm = readPDFXML(f)
h = findSectionHeaders(lm)

from the sections.R script.

xmlXPathCompOpEval: function lower-case not found
Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces,  : 
  error evaluating xpath expression //text[(contains(lower-case(normalize-space(.)), 'introduction') and isNum(normalize-space(.))) or (contains(lower-case(normalize-space(.)), 'background') and isNum(normalize-space(.))) or (contains(lower-case(normalize-space(.)), 'conclusions') and isNum(normalize-space(.))) or (contains(lower-case(normalize-space(.)), 'discussion') and isNum(normalize-space(.))) or (contains(lower-case(normalize-space(.)), 'materials and methods') and isNum(normalize-space(.))) or (contains(lower-case(normalize-space(.)), 'literature cited') and isNum(normalize-space(.))) or (contains(lower-case(normalize-space(.)), 'references cited') and isNum(normalize-space(.))) or (contains(lower-case(normalize-space(.)), 'the study') and isNum(normalize-space(.)))]
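
For context, lower-case() is an XPath 2.0 function, while libxml2 (which the XML package uses) implements XPath 1.0, so the function has to be supplied separately. One possible cause of the error is that the installed XML package is not the development version mentioned under Installation. Another way around it, sketched here in base R, is to build an expression that lower-cases with the XPath 1.0 translate() function. The helper names are hypothetical, and the isNum() condition from the original expression is left out because it is likewise not a standard XPath 1.0 function.

```r
# Hypothetical helpers: build an XPath 1.0 expression that matches text
# nodes containing any of `words`, ignoring ASCII letter case.
lowerCaseXPath <- function(expr) {
  sprintf("translate(%s, '%s', '%s')", expr,
          paste(LETTERS, collapse = ""), paste(letters, collapse = ""))
}

makeHeaderXPath <- function(words) {
  conds <- sprintf("contains(%s, '%s')",
                   lowerCaseXPath("normalize-space(.)"), words)
  sprintf("//text[%s]", paste(conds, collapse = " or "))
}

xp <- makeHeaderXPath(c("introduction", "discussion"))
xp
```

The resulting string uses only XPath 1.0 functions, so it could be passed to an XPath query without registering extra functions.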

getDatePublished

The XPath selection for Received:, Published:, and Accepted: does not match the lower-case versions of those words, and the colon prevents matching the words on their own. This results in errors on these PDFs: J Infect Dis.-2015-Ogawa-infdis-jiv063, Holsomback-2009-Bayou virus detected in non-Or.pdf, and Frances-2004-Occurrence of Ross River virus an.pdf. Removing the capitals and colons for these PDFs seemed to allow the function to find the dates. I also had a problem with the line of code that calls the structure function, which led to an lapply bug in my code. Replacing that call with return(val) let me retrieve the line with the published, received, and accepted dates.
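
The matching described above, case-insensitive and with the colon optional, can be illustrated on plain strings. This is a sketch in base R rather than a patch to the XPath expression: findDateLines is a hypothetical helper and the sample lines are made up.

```r
# Hypothetical helper: keep the lines that contain one of the keywords as a
# whole word, in any letter case, with or without a trailing colon.
findDateLines <- function(lines,
                          keywords = c("received", "accepted", "published")) {
  pattern <- sprintf("\\b(%s)\\b", paste(keywords, collapse = "|"))
  lines[grepl(pattern, lines, ignore.case = TRUE, perl = TRUE)]
}

lines <- c("Received: 3 May",        # made-up sample lines
           "accepted 10 June",
           "Study design",
           "PUBLISHED online",
           "Unpublished data")
findDateLines(lines)
```

The word boundaries keep "Unpublished data" from matching, which a plain substring search for "published" would not.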
