textbook-scraper

This is the software toolchain used in the conference paper "A Scoping Review of Engineering Textbooks to Quantify the Teaching of Uncertainty". This toolchain uses Python code to consolidate digitized textbook indexes into a single spreadsheet (the masterlist.csv).

Scraping Workflow (Python)

Download the contents of this repository (or clone it locally) and run the following steps:

  1. (Install Python dependencies for textbook scraper) Using a Python installation with a working pip, install all Python dependencies listed in requirements.txt, e.g. by running pip install -r requirements.txt from your terminal.
  • If you haven't used a terminal before, you may want to check out this tutorial
  2. (Prepare your source PDFs) Collect PDFs of the indexes you want to scrape. Trim each PDF so that it contains only the index pages. (You can use a PDF editor to do this.)
  • Name each PDF with the following format:

lastname_ocrN_isbn13.pdf

For instance, Sheppard, Sheri D., Thalia Anagnos, and Sarah L. Billington. Engineering mechanics: Statics: Modeling and analyzing systems in equilibrium. Wiley Global Education, 2017. ISBN-13: 978-1119725138 would translate to:

sheppard_ocrN_9781119725138.pdf

Note: We use the ocr flag to denote PDFs that need Optical Character Recognition (OCR) to deal with scans that do not have digital text. This is part of an automated OCR workflow that we tested but could not get working reliably.
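The naming convention above can be checked before scraping. The following is a minimal sketch, not part of the repository: the helper name parse_index_filename and the regular expression are hypothetical, and it assumes the character after "ocr" is a single letter or digit, since the README does not define the allowed flag values.

```python
import re
from pathlib import Path

# Assumed pattern: <lastname>_ocr<flag>_<13-digit ISBN>.pdf
# The set of valid flag characters is an assumption, not taken from the repository.
FILENAME_RE = re.compile(
    r"^(?P<lastname>[^_]+)_ocr(?P<ocr>[A-Za-z0-9])_(?P<isbn13>\d{13})\.pdf$"
)

def parse_index_filename(name):
    """Return the filename parts as a dict, or None if the name does not match."""
    match = FILENAME_RE.match(Path(name).name)
    if match is None:
        return None
    return match.groupdict()

# A correctly named file parses; a misspelled flag ("orc") is rejected.
good = parse_index_filename("sheppard_ocrN_9781119725138.pdf")
bad = parse_index_filename("sheppard_orcN_9781119725138.pdf")
print(good)
print(bad)
```

Running a check like this over the PDFs you have collected can catch misspelled names before the scraper runs.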

  3. (Collect your source PDFs) Place all of your trimmed and properly named PDFs in the data_pdf folder.

  4. (Run the scraper) Run the scraping tool with the terminal invocation python scrape_all.py. This creates the file data_proc/masterlist.csv, or overwrites it if it already exists. After the scrape_all.py utility has run, masterlist.csv contains every line from every index in a single table. For instance, this is the top of our table:

| Term | ISBN |
| --- | --- |
| 726 Index | 9780521883030 |
| Virtual movement | 9780521883030 |
| see also | 9780521883030 |
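As a quick sanity check on the scraper output, the table can be grouped by ISBN. This is a minimal sketch using only the standard library: it assumes masterlist.csv is comma-separated with a header row naming the Term and ISBN columns shown above, and the function terms_by_isbn is a hypothetical helper, not part of the repository. The sample rows are the three lines quoted from the table.

```python
import csv
import io
from collections import defaultdict

def terms_by_isbn(csv_file):
    """Group index terms by ISBN from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    grouped = defaultdict(list)
    for row in reader:
        # Column names "Term" and "ISBN" are assumed to match the file header.
        grouped[row["ISBN"]].append(row["Term"])
    return dict(grouped)

# In-memory sample standing in for data_proc/masterlist.csv.
sample = io.StringIO(
    "Term,ISBN\n"
    "726 Index,9780521883030\n"
    "Virtual movement,9780521883030\n"
    "see also,9780521883030\n"
)
grouped = terms_by_isbn(sample)
for isbn, terms in grouped.items():
    print(isbn, len(terms), terms)
```

To run it on real output, pass a file opened with open("data_proc/masterlist.csv", newline="", encoding="utf-8") in place of the sample. Rows such as "726 Index" and "see also" show that the table holds raw index lines, so some filtering is likely needed before analysis.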

Suggestions on analysis

We used a variety of R scripts to analyze the scraped index data; you may find some useful code in analysis/analyze_terms.Rmd.

Contributors

zdelrosario, madans2984
