This package provides functionality to work with XML documents representing the contents of PDF documents. These XML documents are generated by pdftohtml, specifically our extended version.
The functionality includes:
- parsing the document with an associated class
- easy access to pages as if the document were a list, e.g. doc[[1]]
- loop over the pages with lapply()/sapply()
- extract the title of the document
- extract the dates an academic article was submitted, revised, published
- determine if the PDF document was scanned
- get the header and footer for pages
- get the text or words for a page or entire document
- get the text arranged by column
- get the location for each text segment
- get font information for each piece of text
- display the contents of a page, including lines, rectangles, and image boxes
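To make the list-like page access concrete, here is a minimal sketch that works directly with the XML package rather than with ReadPDF's own functions, so it should not be read as ReadPDF's API. It assumes the `page` and `text` element names and the `top`, `left`, `width`, `height` attributes produced by stock `pdftohtml -xml`; the sample document is hand-written for illustration.

```r
library(XML)

# Hypothetical, hand-written sample in the style of `pdftohtml -xml` output.
xml_txt <- '<pdf2xml>
  <page number="1" height="1188" width="918">
    <fontspec id="0" size="16" family="Times" color="#000000"/>
    <text top="100" left="120" width="300" height="20" font="0">A Sample Title</text>
    <text top="140" left="120" width="200" height="14" font="0">First line of text</text>
  </page>
  <page number="2" height="1188" width="918">
    <text top="90" left="120" width="180" height="14" font="0">Second page text</text>
  </page>
</pdf2xml>'

doc <- xmlParse(xml_txt, asText = TRUE)

# Collect the page nodes so they can be indexed like a list.
pages <- getNodeSet(doc, "//page")

# Loop over the pages: one character vector of text segments per page.
page_text <- lapply(pages, function(p) xpathSApply(p, "./text", xmlValue))

# Location of each text segment on the first page, one row per segment.
first_page_bbox <- do.call(rbind, lapply(getNodeSet(pages[[1]], "./text"), function(node) {
  a <- xmlAttrs(node)
  data.frame(
    left   = as.numeric(a[["left"]]),
    top    = as.numeric(a[["top"]]),
    width  = as.numeric(a[["width"]]),
    height = as.numeric(a[["height"]]),
    text   = xmlValue(node),
    stringsAsFactors = FALSE
  )
}))
```

ReadPDF wraps this kind of access in a document class, so with the package itself the same ideas are expressed as doc[[1]] and lapply() over the document.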
Installation
- Install dev version of the XML package
Currently, ReadPDF requires the development (GitHub) version of the XML package. This can be installed in R using the devtools package:
devtools::install_github("omegahat/XML")
- (Recommended) Install extended version of pdftohtml
The package works with other versions of pdftohtml, but some functions require our extended version.
Clone or download our extended version of pdftohtml
Then build the binary executable (requires make and a C++ compiler):
cd pdftohtml
make
You can move the binary (once built) to a system directory (e.g., /usr/bin on Unix systems),
cp src/pdftohtml /usr/bin
or you can specify the location of this binary in R via the PDFTOHTML option:
options(PDFTOHTML = "path/to/pdftohtml")
- Install ReadPDF
devtools::install_github("dsidavis/ReadPDF")
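Once everything is installed, it can help to check which pdftohtml binary R will find. The sketch below uses only base R. The helper name `find_pdftohtml` is hypothetical, and the lookup order (R option, then environment variable, then the system PATH) is an assumption for illustration, not necessarily the order ReadPDF uses internally.

```r
# Hypothetical helper: resolve the path to the pdftohtml binary.
# Assumed lookup order: R option, then environment variable, then PATH.
find_pdftohtml <- function() {
  exe <- getOption("PDFTOHTML", default = "")
  if (!nzchar(exe)) {
    exe <- Sys.getenv("PDFTOHTML", unset = "")
  }
  if (!nzchar(exe)) {
    # Sys.which() returns "" when the program is not on the PATH.
    exe <- unname(Sys.which("pdftohtml"))
  }
  exe
}

# Point R at a specific binary, then confirm what would be used.
options(PDFTOHTML = "path/to/pdftohtml")
find_pdftohtml()
```

An empty string from this helper means no binary was found by any of the three routes, in which case setting the PDFTOHTML option explicitly is the simplest fix.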