allancameron / pdfr

An R package to extract text from pdf.

License: Other

R 3.78% C++ 96.22%

pdf-format pdf extract-text data-scientists

pdfr's Introduction

PDFR

The goal of PDFR is to aid data scientists who need the ability to extract data from files in pdf format. PDFR is a new C++ based R library to extract usable text from portable document format (pdf) files.

The majority of the code base is written in C++ with a view to being ported to other languages, but at present it is constructed to be built as an R package.

Installation

You can install the development version of PDFR from GitHub with:

# install.packages("pak")
pak::pkg_install("AllanCameron/PDFR")

Usage

The main function used to extract all data from a pdf page to an R data frame is pdfpage(). This accepts either the path to a pdf or a raw data vector representing a pdf. For example, this is how you extract all text from page 1 of the barcodes PDF from pdfr_paths:

library(PDFR)

barcodes <- system.file("extdata", "barcodes.pdf", package = "PDFR")
pdfpage(barcodes, 1)
#>                               text  left right bottom   top    font size
#> 1                             None  53.5  74.4  774.2 782.2 Courier    8
#> 2                   Acute medicine 187.4 255.9  774.2 782.2 Courier    8
#> 3                                / 258.8 264.8  774.2 782.2 Courier    8
#> 4                             ward 267.8 288.6  774.2 782.2 Courier    8
#> 5                               11 291.6 303.5  774.2 782.2 Courier    8
#> 6 [email protected] 318.3 470.1  774.2 782.2 Courier    8
#> 7                              211 473.0 490.9  774.2 782.2 Courier    8
#> 8                             5719 493.9 514.7  774.2 782.2 Courier    8

Background

The current version is at an early stage of development. It will work with most pdfs, but there are some unsupported features which may lead to some pdfs producing runtime errors.

Documents encrypted using the standard method and which can be opened without a password are supported. Password-based encryption is currently unsupported.

If there are any suggestions for development please submit a feature request, or let me know about pdfs that break the package.

Motivation

Extracting useful data from pdf is difficult for two reasons. Firstly, the pdf format primarily consists of binary data, which is laid out in such a way as to provide quick random access to pdf objects as required by a pdf reader. The text elements as seen on the page are usually encoded in a binary stream within the document. Even when the binary stream is decoded, the text items exist as individual elements within a page description program, which has to be parsed before the text can be extracted. It is therefore not a trivial matter to extract the “raw text” from a pdf file into a format in which it can be read by R, though there exist some excellent tools that can do this quickly. In particular, pdftools provides an R interface to some of Poppler’s pdf tools, and can quickly and reliably extract text wholesale from pdf.

The second problem is that, unlike some other common file types used to exchange information on the internet (e.g. html, xml, csv, JSON), the raw text extracted from a pdf does not have a fixed structure to provide semantic information about the data to allow it to be processed easily by a data scientist.

Humans can read data from pdfs easily, yet the format is difficult to convert into machine-readable data. This mismatch arises because humans use the structure of the page layout to provide the semantic context for the data. When that structure is lost (as it often is when copying and pasting from a pdf), the text becomes very difficult even for a human reader to interpret. The computer does not know how to interpret the characters’ positions, so it cannot classify the characters by semantics as a human reader (usually) can.

The idea behind PDFR is to try to extract raw text then use the positioning and formatting data from the extracted text to reconstruct some of the semantic content that would otherwise be lost. For example, identifying and grouping letters into words, words into paragraphs or into tables.
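As a rough illustration of the grouping idea, the sketch below joins glyphs on one text line into words by looking at the horizontal gap between neighbours. It is not PDFR's actual implementation: the `Glyph` and `Word` structs, the `GroupIntoWords` function, and the gap threshold of a quarter of the font size are all assumptions made for this example.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical glyph record: one character plus its horizontal extent.
struct Glyph {
  std::string text;  // UTF-8 text of the glyph
  float left;
  float right;
  float size;        // font size in points
};

// Hypothetical word record built from adjacent glyphs on one text line.
struct Word {
  std::string text;
  float left;
  float right;
};

// Groups glyphs that share a baseline into words. A new word starts when the
// gap to the previous glyph exceeds gap_fraction * font size. The threshold
// is an assumed heuristic, not a value taken from PDFR.
std::vector<Word> GroupIntoWords(std::vector<Glyph> glyphs,
                                 float gap_fraction = 0.25f) {
  // Work left to right so that each glyph is compared with its neighbour.
  std::sort(glyphs.begin(), glyphs.end(),
            [](const Glyph& a, const Glyph& b) { return a.left < b.left; });

  std::vector<Word> words;
  for (const Glyph& g : glyphs) {
    const bool start_new =
        words.empty() || (g.left - words.back().right) > gap_fraction * g.size;
    if (start_new) {
      words.push_back({g.text, g.left, g.right});
    } else {
      words.back().text += g.text;   // extend the current word
      words.back().right = g.right;  // and its right edge
    }
  }
  return words;
}
```

Calling `GroupIntoWords` on the glyphs of a single line returns one `Word` per run of closely spaced glyphs. The same gap test could be applied vertically to group lines into paragraphs, though a real document would also need to handle rotated text and multiple columns.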

Ultimately, to extract useful data, the user will need the option to control how and to what extent text elements are grouped. For example, they may need the fine control of having every letter’s position on the page (e.g. to accurately reconstruct a part of the document on a plot), or may wish to extract a corpus of plain text from a book as a set of paragraphs or even whole pages.

PDFR is written in C++11 and has no external dependencies, but makes extensive use of the C++ standard library. Rather than being based on an existing library such as xpdf or Poppler, it was written from scratch with the specific goal of making text extraction easier for R users. Most of the design is new, an attempt to implement the text extraction elements of the pdf standard ISO 32000, though it borrows some concepts from existing open-source libraries such as Poppler and pdfjs.

Clearly, the package would not exist without the excellent Rcpp package. Much of the pdf parsing would take too long to do in R, but having the facility to write C++ extensions makes pdf parsing feasible, and even pretty quick in some cases.

Related projects

pdftools: Text Extraction, Rendering and Converting of PDF Documents.
qpdf: Content-preserving transformations of PDF files such as split, combine, and compress. This package interfaces directly to the ‘qpdf’ C++ API and does not require any command line utilities.
tabulizer: Bindings for Tabula PDF Table Extractor Library
PDE: The PDE (Pdf Data Extractor) allows the extraction of information and tables optionally based on search words from PDF (Portable Document Format) files and enables the visualization of the results, both by providing a convenient user-interface.
xmpdf: Edit XMP metadata and PDF bookmarks/documentation info.

pdfr's People

Forkers

elipousson myeongseongpark nemochina2008

pdfr's Issues

Add documentation and minor refactoring to allow PDFR to pass `devtools::check()` without errors

Hello @AllanCameron! I really appreciate you creating this package. I'm using it to extract text from some older reports at work and reformat the text into a tabular structure. When I went to look up the documentation for pdfpage(), I realized that exporting the documentation is one of the things that remained incomplete with the package.

I just forked the package and went ahead and made the following changes to get the documentation filled in and get the package to pass devtools::check() without errors. Here are all the changes I made:

Remove existing NAMESPACE file to replace with NAMESPACE generated by roxygen2
Add httr and grDevices to Imports and move Rcpp11 to Suggests
Add package level documentation to handle imports from ggplot2, Rcpp, and httr
Add package level metadata with codemetar::write_codemeta()
Update function documentation to import from grid and grDevices
Reformatted the DESCRIPTION with use_tidy_description() (and added URLs + Authors)
Update license with use_mit_license()
Added a utils.R file to add a call to utils::globalVariables()
Replace the markdown README with a Rmd file
Exported the testfiles data with use_data()
Disabled execution of a broken test in test-pdrf.R
Disabled execution of a broken example for draw_glyph()

If this all looks good, I'm happy to open a pull request.

I also noticed that the size parameter in pdfgraphics() may need to be changed to linewidth to work with the most recent version of ggplot2 (see this post for more information). I can test this out and add it to the same pull request or open a separate issue if you'd like to discuss.

Couldn't find string in dictionary

I've just installed this package and tried to use pdfpage() and pdfdoc() to read in some text.
In both cases I immediately get the error

Error in .pdfdoc(pdf) : Couldn't find string in dictionary.

and no other output.

Can you help?

Couldn't open file.

Downloaded PDFR. It seems to be installed correctly. Tried pdfboxes("my.pdf", 1) and received Error: Couldn't open file. Note that pdftools can read this file.

Get encodings from type 1 fonts

Some fonts have no encodings specified except in the font program (e.g. some book-form pdfs from Project Gutenberg). Although a chunk of the font program is binary, the header contains a text-form encoding map that may be used for ligatures etc.
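One possible approach, sketched below, relies on the fact that the clear-text header of a Type 1 font program typically lists its encoding as a series of `dup <code> /<glyphname> put` entries. The function name `ParseType1Encoding` and the helper `IsCharCode` are hypothetical, and the sketch ignores fonts that use `StandardEncoding def` or other variations.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Returns true when the token is a 1-3 digit decimal number.
static bool IsCharCode(const std::string& token) {
  return !token.empty() && token.size() <= 3 &&
         std::all_of(token.begin(), token.end(),
                     [](unsigned char c) { return std::isdigit(c) != 0; });
}

// Scans the text header of a Type 1 font program for entries of the form
//   dup 174 /fi put
// and returns a map from character code to glyph name (without the slash).
// Hypothetical helper for illustration; not part of PDFR's API.
std::map<int, std::string> ParseType1Encoding(const std::string& header) {
  // Split the header into whitespace-separated tokens.
  std::istringstream in(header);
  std::vector<std::string> tokens;
  for (std::string t; in >> t;) tokens.push_back(t);

  std::map<int, std::string> encoding;
  for (std::size_t i = 0; i + 3 < tokens.size(); ++i) {
    const std::string& name = tokens[i + 2];
    if (tokens[i] == "dup" && IsCharCode(tokens[i + 1]) &&
        name.size() > 1 && name[0] == '/' && tokens[i + 3] == "put") {
      const int code = std::stoi(tokens[i + 1]);
      if (code <= 255) encoding[code] = name.substr(1);
    }
  }
  return encoding;
}
```

The resulting glyph names (such as `fi` or `space`) would still need to be mapped to Unicode, for example through a glyph-name-to-Unicode table.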

Installation Problem: Failed to build PDFR 0.1.0

I am using the latest release of R Studio on a Surface Pro 7. Windows 10.
RStudio 2023.03.0+386 "Cherry Blossom" Release (3c53477afb13ab959aeb5b34df1f10c237b256c3, 2023-03-09) for Windows
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) RStudio/2023.03.0+386 Chrome/108.0.5359.179 Electron/22.0.3 Safari/537.36

install.packages("pak")
Installing package into 'C:/Users/sbalakrishnan/AppData/Local/R/win-library/4.2'
(as 'lib' is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.2/pak_0.4.0.zip'
Content type 'application/zip' length 11106824 bytes (10.6 MB)
downloaded 10.6 MB

package 'pak' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
C:\Users\sbalakrishnan\AppData\Local\Temp\Rtmp6bsTg6\downloaded_packages

pak::pkg_install("AllanCameron/PDFR")
! Using bundled GitHub PAT. Please add your own PAT using gitcreds::gitcreds_set().
✔ Updated metadata database: 4.94 MB in 12 files.
✔ Updating metadata database ... done
→ Will install 1 package.
→ Will download 1 package with unknown size.

PDFR 0.1.0 [bld][cmp][dl] (GitHub: 9d9806c)
ℹ Getting 1 pkg with unknown size
✔ Got PDFR 0.1.0 (source) (2.07 MB)
✔ Downloaded 1 package (2.07 MB)in 1.9s
ℹ Packaging PDFR 0.1.0
✔ Packaged PDFR 0.1.0 (17.9s)
ℹ Building PDFR 0.1.0
✖ Failed to build PDFR 0.1.0
Error:
! error in pak subprocess
Caused by error in stop_task_build(state, worker):
! Failed to build source package 'PDFR'
Full installation output:

installing source package 'PDFR' ...
** using non-staged installation via StagedInstall field
** libs
g++ -std=gnu++11 -I"C:/PROGRA~1/R/R-42~1.3/include" -DNDEBUG -I'C:/Users/sbalakrishnan/AppData/Local/R/win-library/4.2/Rcpp/include' -I'C:/Users/sbalakrishnan/AppData/Local/R/win-library/4.2/testthat/include' -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include" -O2 -Wall -mfpmath=sse -msse2 -mstackrealign -Wall -pedantic -fdiagnostics-color=always -c RcppExports.cpp -o RcppExports.o
g++ -std=gnu++11 -I"C:/PROGRA~1/R/R-42~1.3/include" -DNDEBUG -I'C:/Users/sbalakrishnan/AppData/Local/R/win-library/4.2/Rcpp/include' -I'C:/Users/sbalakrishnan/AppData/Local/R/win-library/4.2/testthat/include' -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include" -O2 -Wall -mfpmath=sse -msse2 -mstackrealign -Wall -pedantic -fdiagnostics-color=always -c adobetounicode.cpp -o adobetounicode.o
g++ -std=gnu++11 -I"C:/PROGRA~1/R/R-42~1.3/include" -DNDEBUG -I'C:/Users/sbalakrishnan/AppData/Local/R/win-library/4.2/Rcpp/include' -I'C:/Users/sbalakrishnan/AppData/Local/R/win-library/4.2/testthat/include' -I"C:/rtools42/x86_64-w64-mingw32.static.posix/include" -O2 -Wall -mfpmath=sse -msse2 -mstackrealign -Wall -pedantic -fdiagnostics-color=always -c box.cpp -o box.o
In file included from box.cpp:13
box.h: In constructor 'Box::Box(std::vector)':
box.h:119:40: error: 'runtime_error' is not a member of 'std'
119 | if (floats.size() != 4) throw std::runtime_error needs four floats");
| ^~~~~~~~~~~~~
box.h: In member function 'float Box::Edge(int) const':
box.h:145:27: error: 'runtime_error' is not a member of 'std'
145 | default: throw std::runtime_erroralid box index");
| ^~~~~~~~~~~~~
make: *** [C:/PROGRA~1/R/R-42~1.3/etc/x64/Makeconf:260: box.o] Error 1
ERROR: compilation failed for package 'PDFR'
removing 'C:/Users/SBALAK~1/AppData/Local/Temp/RtmpMzQTtA/pkg-lib52243e9e7641/PDFR'
Type .Last.error to see the more details.
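A likely cause of the two errors in this log is that `std::runtime_error` is declared in the standard header `<stdexcept>`, and some compilers no longer pull that header in through other includes. The sketch below shows the pattern with the header included explicitly. `BoxSketch` is a hypothetical stand-in written for this example, not the package's own `Box` class, and the exception messages are placeholders.

```cpp
#include <cstddef>
#include <stdexcept>  // declares std::runtime_error
#include <utility>
#include <vector>

// Hypothetical stand-in for a four-edged box, used only to illustrate
// the explicit <stdexcept> include.
class BoxSketch {
 public:
  explicit BoxSketch(std::vector<float> floats) : edges_(std::move(floats)) {
    // Placeholder message; the original message is truncated in the log.
    if (edges_.size() != 4) throw std::runtime_error("box needs four floats");
  }

  // Returns the edge at the given index (0 to 3), throwing on anything else.
  float Edge(int index) const {
    if (index < 0 || index > 3) throw std::runtime_error("invalid box index");
    return edges_[static_cast<std::size_t>(index)];
  }

 private:
  std::vector<float> edges_;
};
```

If this diagnosis is right, adding `#include <stdexcept>` near the top of box.h should let the file compile with the toolchain shown in the log.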

Fix ligatures

Need to convert the Unicode ligatures and digraphs to letter pairs
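A minimal sketch of one way to do this is a lookup table from each ligature or digraph code point to its plain-letter expansion, applied to text held as UTF-32. The function name `ExpandLigatures` and the choice of which code points to cover are assumptions for illustration. The table below handles only the Latin ligatures U+FB00 to U+FB06 and a few digraphs.

```cpp
#include <map>
#include <string>

// Replaces Unicode ligature and digraph code points with their constituent
// letters. Hypothetical helper; the table is deliberately small.
std::u32string ExpandLigatures(const std::u32string& input) {
  static const std::map<char32_t, std::u32string> kExpansions = {
      {U'\uFB00', U"ff"},  {U'\uFB01', U"fi"},  {U'\uFB02', U"fl"},
      {U'\uFB03', U"ffi"}, {U'\uFB04', U"ffl"},
      {U'\uFB05', U"st"},  // long s + t ligature, written here as "st"
      {U'\uFB06', U"st"},
      {U'\u0132', U"IJ"},  {U'\u0133', U"ij"},
      {U'\u01C7', U"LJ"},  {U'\u01C9', U"lj"},
      {U'\u01CA', U"NJ"},  {U'\u01CC', U"nj"}};

  std::u32string output;
  output.reserve(input.size());
  for (char32_t c : input) {
    const auto it = kExpansions.find(c);
    if (it != kExpansions.end()) {
      output += it->second;  // expand the ligature into separate letters
    } else {
      output += c;           // copy every other code point unchanged
    }
  }
  return output;
}
```

A fuller solution would probably apply Unicode compatibility normalization (NFKC) instead of a hand-written table, at the cost of also changing other compatibility characters.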