Collection of open textbooks for use in language modeling. The focus of this repo is computable and easily downloadable representations of open textbooks, including their source PDF, text (converted using pdftotext), and key metadata (e.g., licensing).
- Open Textbook Library 802/1221
- OpenStax 75/75
- B.C. Open Collection 179/301
We downloaded every PDF that was directly accessible via the hosting website. Note that some textbooks are not available as PDFs or require more complex download procedures, so they are not included.
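As a rough illustration of the text conversion step, the sketch below calls the `pdftotext` command-line tool on one downloaded PDF. It assumes `pdftotext` (from Poppler or Xpdf) is installed and on the `PATH`. The helper names and the choice to write the `.txt` file next to the PDF are hypothetical, and no extra `pdftotext` options are passed because the options used for this repo are not documented here.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path=None):
    """Build the argument list for `pdftotext PDF-file text-file`.

    Hypothetical helper: by default the output is written next to the
    PDF with a .txt suffix.
    """
    pdf_path = Path(pdf_path)
    if txt_path is None:
        txt_path = pdf_path.with_suffix(".txt")
    return ["pdftotext", str(pdf_path), str(txt_path)]


def convert_pdf(pdf_path, txt_path=None):
    """Run pdftotext on one PDF and return the path of the text file.

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and FileNotFoundError if pdftotext is not installed.
    """
    cmd = pdftotext_command(pdf_path, txt_path)
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])
```

For example, `convert_pdf("book.pdf")` would write `book.txt` in the same directory, provided `book.pdf` exists and `pdftotext` is available.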
- metadata.tsv: key, source, title, category, license, url
- data.tsv: key, source, text
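The two files share the `key` column, so the text of each book can be matched to its license and other metadata. The following is a minimal sketch of that join using only the Python standard library. It assumes that each record occupies one line, that fields are tab-separated without quoting, and that a header row, if present, repeats the column names listed above; none of this is confirmed by the repo. The function names are hypothetical.

```python
import csv

METADATA_COLUMNS = ["key", "source", "title", "category", "license", "url"]
DATA_COLUMNS = ["key", "source", "text"]

# A full textbook in one field is far larger than csv's default field limit.
csv.field_size_limit(2**31 - 1)


def read_tsv(path, columns):
    """Read a tab-separated file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(
            f, fieldnames=columns, delimiter="\t", quoting=csv.QUOTE_NONE
        )
        rows = list(reader)
    # Assumption: if the first row just repeats the column names, it is a header.
    if rows and list(rows[0].values()) == columns:
        rows = rows[1:]
    return rows


def join_on_key(metadata_rows, data_rows):
    """Attach each book's metadata to its text, matching rows by `key`."""
    meta_by_key = {row["key"]: row for row in metadata_rows}
    joined = []
    for row in data_rows:
        meta = meta_by_key.get(row["key"])
        if meta is not None:  # skip text rows with no metadata entry
            joined.append({**meta, "text": row["text"]})
    return joined
```

A typical use would be `books = join_on_key(read_tsv("metadata.tsv", METADATA_COLUMNS), read_tsv("data.tsv", DATA_COLUMNS))`, after which the records can be filtered by their `license` field before training.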
The original PDFs are available via Google Drive.