This is the software toolchain used in the conference paper "A Scoping Review of Engineering Textbooks to Quantify the Teaching of Uncertainty". The toolchain uses Python code to consolidate digitized textbook indexes into a single spreadsheet (`masterlist.csv`).
Download the contents of this repository (or clone it locally) and run the following steps:
- (Install Python dependencies for textbook scraper) Using a Python installation with a working pip, install all Python dependencies listed in `requirements.txt`, e.g. by running `pip install -r requirements.txt` from your terminal.
- If you haven't used a terminal before, you may want to check out this tutorial
- (Prepare your source PDFs) Collect PDFs of the indexes you want to scrape. Make sure to truncate these PDFs so only the pages of the index are in each file. (You can use a PDF editor to do this.)
- Name each PDF with the following format: `lastname_ocrN_isbn13.pdf`
  For instance, Sheppard, Sheri D., Thalia Anagnos, and Sarah L. Billington. Engineering mechanics: Statics: Modeling and analyzing systems in equilibrium. Wiley Global Education, 2017. ISBN-13: 978-1119725138 would translate to: `sheppard_ocrN_9781119725138.pdf`
  Note: We use the `ocr` flag to denote PDFs that need Optical Character Recognition (OCR) to deal with scans that do not have digital text. This is part of an automated OCR workflow that we tested but could not get working reliably.
- (Collect your source PDFs) Place all of your trimmed and properly named PDFs in the `data_pdf` folder.
- Run the scraping tool with the terminal invocation `python scrape_all.py`. This will either generate or overwrite the file `data_proc/masterlist.csv`.

After running the `scrape_all.py` utility, the `masterlist.csv` file will contain all lines from all indexes in a single table. For instance, this is the top of our table:

| Term | ISBN |
| --- | --- |
| 726 Index | 9780521883030 |
| Virtual movement | 9780521883030 |
| see also | 9780521883030 |
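
The naming convention in the steps above is easy to get wrong by hand, so it may help to check the `data_pdf` folder before running the scraper. The sketch below is not part of the repository. The helper names `parse_pdf_name` and `find_misnamed` are hypothetical, and the pattern encodes our reading of the convention: a lowercase last name, `ocr` followed by a single flag character, and a 13-digit ISBN with the hyphens removed.

```python
import re
from pathlib import Path

# Assumed reading of the README convention `lastname_ocrN_isbn13.pdf`:
# lowercase letters for the last name, "ocr" plus one flag character,
# and exactly 13 digits for the ISBN. Adjust if your names differ.
PDF_NAME = re.compile(
    r"^(?P<lastname>[a-z]+)_ocr(?P<flag>[A-Za-z0-9])_(?P<isbn13>\d{13})\.pdf$"
)


def parse_pdf_name(filename):
    """Return (lastname, flag, isbn13) if the name fits the pattern, else None."""
    match = PDF_NAME.match(filename)
    if match is None:
        return None
    return match.group("lastname", "flag", "isbn13")


def find_misnamed(folder="data_pdf"):
    """List the PDF file names in `folder` that do not fit the assumed pattern."""
    directory = Path(folder)
    if not directory.is_dir():
        return []
    return sorted(
        path.name
        for path in directory.glob("*.pdf")
        if parse_pdf_name(path.name) is None
    )
```

Calling `find_misnamed()` from the repository root returns the file names that would need renaming; an empty list means every PDF fits the assumed pattern.
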
We used a variety of R scripts to analyze the scraped index data; you may find some useful code in `analysis/analyze_terms.Rmd`.
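
If you would rather take a first look at the scraped data in Python, a minimal sketch follows. It is not the analysis used in the paper (that lives in the R scripts). It assumes the CSV header uses the column names `Term` and `ISBN` shown in the table above, and the helper names `load_rows` and `count_terms` are hypothetical.

```python
import csv
from collections import Counter


def load_rows(path="data_proc/masterlist.csv"):
    """Read the scraped table into a list of dicts, one per index line."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def count_terms(rows, keyword="uncertainty"):
    """Count, per ISBN, the index lines whose term contains `keyword`.

    Matching is a case-insensitive substring test, so scraping noise such as
    page headers is ignored unless it happens to contain the keyword.
    """
    needle = keyword.lower()
    counts = Counter()
    for row in rows:
        # Column names are assumed from the README table.
        term = (row.get("Term") or "").lower()
        if needle in term:
            counts[row.get("ISBN")] += 1
    return counts
```

For example, `count_terms(load_rows()).most_common()` lists the textbooks (by ISBN) in order of how many index lines mention the keyword. Substring matching is a deliberately crude choice here; it does not follow "see also" cross-references or merge multi-line index entries.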