dsidavis / readpdf

Tools for working with PDF documents, currently converted to XML via a modified pdftohtml

R 99.71% Makefile 0.20% TeX 0.09%

readpdf's Introduction

This package provides functionality to work with XML documents representing the contents of PDF documents. These XML documents are generated by pdftohtml, specifically our extended version.

The functionality includes

parsing the document with an associated class
easy access to pages as if the document were a list, e.g. doc[[1]]
loop over the pages with lapply()/sapply()
extract the title of the document
extract the dates an academic article was submitted, revised, published
determine if the PDF document was scanned
get the header and footer for pages
get the text or words for a page or entire document
get the text arranged by column
get the location for each text segment
get font information for each piece of text
display the contents of a page, including lines, rectangles, and image boxes

Installation

Install dev version of the XML package

Currently, ReadPDF requires the development (GitHub) version of the XML package. This can be installed in R using the devtools package:

devtools::install_github("omegahat/XML")

(Recommended) Install extended version of pdftohtml

Additionally, while the package will work with other versions of pdftohtml, some functions will not work without our extended version.

Clone or download our extended version of pdftohtml

Then, build the binary executable (requires make and a C++ compiler):

cd pdftohtml
make

You can move the binary (once built) to your system directory (e.g., /usr/bin on Unix systems),

cp src/pdftohtml /usr/bin

or you can specify the location of this binary in R via the PDFTOHTML option.

options(PDFTOHTML = "path/to/pdftohtml")
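To check which binary R would pick up, a small helper along the following lines can be useful. This is only a sketch: locatePdftohtml is a hypothetical name, not a ReadPDF function, and the lookup order (the PDFTOHTML option, then an environment variable of the same name, then the PATH) is an assumption that may differ from what the package does internally.

```r
# Hypothetical helper, not part of ReadPDF: report which pdftohtml would be used.
# Assumed lookup order: R option, then environment variable, then the PATH.
locatePdftohtml <- function() {
    exe <- getOption("PDFTOHTML", default = "")
    if (!nzchar(exe)) exe <- Sys.getenv("PDFTOHTML", unset = "")
    if (!nzchar(exe)) exe <- unname(Sys.which("pdftohtml"))
    if (nzchar(exe)) exe else NA_character_   # NA when nothing was found
}

options(PDFTOHTML = "path/to/pdftohtml")
locatePdftohtml()
```

A result of NA means neither setting is present and no pdftohtml is on the PATH.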

Install ReadPDF

devtools::install_github("dsidavis/ReadPDF")

readpdf's Issues

findsectionheaders

The findSectionHeaders function splits headers that are on the same line and have the same font. A reproducible example:

testassay = convertPDF2XML('Andriamandimby-2011-Crimean-Congo hemorrhagic.pdf')
findSectionHeaders(testassay)

[[6]]
<text top="230" left="94" width="34" height="12" font="23" rotation="0.000000">Study</text> 

[[7]]
<text top="230" left="131" width="39" height="12" font="23" rotation="0.000000">design</text>
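One possible post-processing step is to merge segments that share the same top and font values. The sketch below works on a plain data frame holding the attribute values from the two nodes above. mergeSameLine is a hypothetical helper, not part of ReadPDF, and it assumes that an identical top and font always means "same header", which would wrongly join text from separate columns on the same line.

```r
# Attribute values copied from the two <text> nodes in the report.
segs <- data.frame(top = c(230, 230), left = c(94, 131), width = c(34, 39),
                   font = c("23", "23"), text = c("Study", "design"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: join segments with the same top and font,
# reading left to right, and recompute the bounding box.
mergeSameLine <- function(segs) {
    segs <- segs[order(segs$top, segs$left), ]
    key <- paste(segs$top, segs$font)
    groups <- split(segs, factor(key, levels = unique(key)))
    merged <- lapply(groups, function(g) {
        data.frame(top = g$top[1],
                   left = min(g$left),
                   width = max(g$left + g$width) - min(g$left),
                   font = g$font[1],
                   text = paste(g$text, collapse = " "),
                   stringsAsFactors = FALSE)
    })
    out <- do.call(rbind, merged)
    rownames(out) <- NULL
    out
}

mergeSameLine(segs)
```

A stricter version would also require the horizontal gap between neighbouring segments to be small before joining them.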

In Chua's PDF there is a recurring /001 exit character, which is a special hyphen the authors use in the PDF. The error occurs in the page headers, between the numbers 265 and 275, on every page, and is also prevalent in the reference section. Another exit character occurred where the author used a + symbol on page 7. After handling these exit characters with gsub, we can xmlParse the document as normal.
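A minimal sketch of that gsub step, assuming the offending characters are control codes that XML 1.0 does not allow (everything below the space character except tab, newline, and carriage return). stripInvalidXMLChars is a hypothetical name, and the sample string is made up for illustration.

```r
# Hypothetical helper: replace control characters that XML 1.0 forbids.
# Tab (0x09), newline (0x0A) and carriage return (0x0D) are left alone.
stripInvalidXMLChars <- function(txt, replacement = "") {
    gsub("[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F]", replacement, txt, perl = TRUE)
}

# Made-up fragment with character value 1 standing in for a hyphen.
raw <- "<text>pp. 265\001275</text>"
clean <- stripInvalidXMLChars(raw, replacement = "-")

# Parse the cleaned text if the XML package is available.
if (requireNamespace("XML", quietly = TRUE)) {
    doc <- XML::xmlParse(clean, asText = TRUE)
    print(XML::xmlValue(XML::xmlRoot(doc)))
}
```

For a whole file, the same function can be applied to the output of readLines() and the result collapsed with paste(..., collapse = "\n") before parsing.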

convertPDF2XML

When using this function to convert .pdf files to XML, I got errors and warnings.
There are two different types:
1.
Error in if (file.exists(file) == FALSE) if (!missing(asText) && asText == :
argument is of length zero
In addition: Warning message:
running command 'pdftohtml -q -xml -stdout '54664A345348.pdf'' had status 1
2.
"PCDATA invalid Char value 1"
The affected PDFs are:
[2] "Chua-2003-Nipah virus outbreak in Malaysia1.pdf"
[3] "Chua-2003-Nipah virus outbreak in Malaysia.pdf"
[4] "de Thoisy-2004-Wild terrestrial rainforest ma1.pdf"
[5] "de Thoisy-2004-Wild terrestrial rainforest mam.pdf"
[6] "Nakgoi-2014-Dengue, Japanese Encephalitis and.pdf"
[7] "Oliveira-2009-Genetic characterization of a J1.pdf"
[8] "Oliveira-2009-Genetic characterization of a Ju.pdf"
[9] "Sottosanti-2005-Serological study of the lymph.pdf"
[10] "Switzer-2005-The epidemiology of simian immuno.pdf"
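When converting many files in a batch, it can help to check the exit status of pdftohtml directly, so that one failing PDF produces a clear error instead of a zero-length value further down. The following is a sketch under stated assumptions: runPdftohtml and convertAll are hypothetical helpers, not ReadPDF functions, and the command-line flags are the ones shown in the warning message above.

```r
# Hypothetical wrapper: run pdftohtml and fail loudly on a non-zero status.
runPdftohtml <- function(pdf, exe = "pdftohtml") {
    if (!file.exists(pdf)) stop("no such file: ", pdf)
    out <- suppressWarnings(
        system2(exe, c("-q", "-xml", "-stdout", shQuote(pdf)),
                stdout = TRUE, stderr = FALSE))
    status <- attr(out, "status")   # only set when the command failed
    if (!is.null(status) && status != 0)
        stop("pdftohtml exited with status ", status, " for ", pdf)
    if (length(out) == 0)
        stop("pdftohtml produced no output for ", pdf)
    paste(out, collapse = "\n")
}

# Convert several files, keeping the error object for each failure.
convertAll <- function(pdfs, exe = "pdftohtml") {
    res <- lapply(pdfs, function(f)
        tryCatch(runPdftohtml(f, exe), error = function(e) e))
    names(res) <- pdfs
    res
}

res <- convertAll("no-such-file.pdf")
failed <- names(res)[vapply(res, inherits, logical(1), what = "error")]
failed
```

The failed vector then lists the files that need a closer look, while the successful entries hold the XML text.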

ReadPDF Dependency

We need to declare that ReadPDF depends on the XML package.

isScanned2 misidentification

isScanned2 returns "TRUE" for documents which have the appearance of being scanned (pixelated font on zooming in, etc.), but actually contain text in the XML. Other functions return valid results for these documents, e.g. getPublicationDates.

Example documents:

[1] "internal-pdf://1141406970/Rollin-1995-Isolation of black creek canal vir.pdf"
[2] "internal-pdf://1670255814/Hendra.pdf"
[3] "internal-pdf://3552799091/Timoney-1976-Encephalitis caused by louping il.pdf"
[4] "internal-pdf://1615163449/Kaplan-1980-Evidence of infection by viruses i.pdf"
[5] "internal-pdf://4021195741/Shepherd-1987-Antibody to Crimean-Congo hemorr.pdf"
[6] "internal-pdf://2088887942/Hanson-1952-The natural history of vesicular s.pdf"
[7] "internal-pdf://3332319627/Monath-1980-Yellow fever in the Gambia, 1978--.pdf"

XPath error when using ReadPDF library functions

I get the following error when trying to run some of the library's functions, such as
txt = getSectionText(f) and

lm = readPDFXML(f)
h = findSectionHeaders(lm)

from the sections.R script.

xmlXPathCompOpEval: function lower-case not found
Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces,  : 
  error evaluating xpath expression //text[(contains(lower-case(normalize-space(.)), 'introduction') and isNum(normalize-space(.))) or (contains(lower-case(normalize-space(.)), 'background') and isNum(normalize-space(.))) or (contains(lower-case(normalize-space(.)), 'conclusions') and isNum(normalize-space(.))) or (contains(lower-case(normalize-space(.)), 'discussion') and isNum(normalize-space(.))) or (contains(lower-case(normalize-space(.)), 'materials and methods') and isNum(normalize-space(.))) or (contains(lower-case(normalize-space(.)), 'literature cited') and isNum(normalize-space(.))) or (contains(lower-case(normalize-space(.)), 'references cited') and isNum(normalize-space(.))) or (contains(lower-case(normalize-space(.)), 'the study') and isNum(normalize-space(.)))]
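For context, lower-case() is an XPath 2.0 function, while libxml2 evaluates XPath 1.0 natively, so the function has to be supplied from elsewhere for this expression to work. A portable alternative in XPath 1.0 is translate(), sketched below. ciContains is a hypothetical helper, the sample XML is made up, and this is not necessarily how the package itself addresses the error. The sketch also leaves out the isNum() test from the original expression.

```r
# Hypothetical helper: case-insensitive contains() using only XPath 1.0.
# translate() maps each upper-case ASCII letter to its lower-case form.
ciContains <- function(word) {
    sprintf("contains(translate(normalize-space(.), '%s', '%s'), '%s')",
            paste(LETTERS, collapse = ""),
            paste(letters, collapse = ""),
            tolower(word))
}

words <- c("introduction", "background")
xp <- paste0("//text[",
             paste(vapply(words, ciContains, character(1)), collapse = " or "),
             "]")

# Try the expression on a made-up document if the XML package is installed.
if (requireNamespace("XML", quietly = TRUE)) {
    doc <- XML::xmlParse(
        "<page><text>1. INTRODUCTION</text><text>Results</text></page>",
        asText = TRUE)
    hits <- XML::getNodeSet(doc, xp)
    print(length(hits))
}
```

Search words containing a single quote would need escaping before being placed in the expression.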

getDatePublished

The XPath selection for Received:, Published:, and Accepted: does not match the lower-case versions of those words, and the colon prevents matching the words on their own. This results in errors on these PDFs: J Infect Dis.-2015-Ogawa-infdis-jiv063, Holsomback-2009-Bayou virus detected in non-Or.pdf, and Frances-2004-Occurrence of Ross River virus an.pdf. Removing the capitals and colons seemed to allow the function to find the dates for these PDFs. I also had a problem with the line of code that calls structure(): it led to an lapply bug in my code. Simply replacing that call with return(val) let me retrieve the line containing the published, received, and accepted dates.
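The matching rule described here (ignore case, do not require the colon) can be illustrated on plain text lines with a regular expression. This is a sketch only: findDateLines is a hypothetical helper, the sample lines are made up, and the package's own implementation works on XML nodes via XPath rather than on character vectors.

```r
# Hypothetical helper: keep lines that mention a submission-related keyword,
# in any letter case and with or without a trailing colon.
findDateLines <- function(lines) {
    pat <- "\\b(received|accepted|published)\\b"
    lines[grepl(pat, lines, ignore.case = TRUE, perl = TRUE)]
}

# Made-up lines covering the variants mentioned in the report.
lines <- c("Received: 12 March 2004",
           "accepted 3 June 2004",
           "PUBLISHED ONLINE 1 July 2004",
           "Materials and methods")
dateLines <- findDateLines(lines)
dateLines
```

The word boundaries keep the pattern from matching inside longer words, while leaving the colon out of the pattern makes it optional.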