Internet Archive's Projects
Python interface to ACS4
Common components and utilities for the Archiving & Data Services (ADS) team at the Internet Archive
Parse OCR result files for page numbers, tables of contents, etc.
The Hypothesis web-based annotation client.
Stores and updates only the build directory of annotate-client.
PDF.js + Hypothesis viewer / annotator
Web application for distributed compute analysis of Archive-It web archive collections.
Tools to analyze web archives
Efficient hOCR tooling
Fast PDF generation and compression. Deals with millions of pages daily.
archive.org e2e tests
An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at Internet Archive.
ARK minter, binder, resolver
The most powerful and flexible mocking framework for PHPUnit / Codeception.
This package contains a tool for automatically cropping and deskewing images of book pages captured by an Internet Archive Scribe bookscanner.
Experimenting with Apache Pig.
The Internet Archive BookReader
Archive.org OPDS Bookserver - A standard for digital book distribution
brozzler - distributed browser-based web crawler
Command line retrieval of torrents using transmission-daemon (via transmission-remote)
Summarize web archive capture index (CDX) files.
Python script to create CDX index files of WARC data
Go library for connecting to CertStream
Wikibase bot for updating identifiers and citation relationships
Journal-level metadata munging. Part of the fatcat project.