Gutenberg Book Normalize
Normalize Project Gutenberg books into a format that is easier for statistical models and machine learning to consume.
Installation
git clone git@github.com:ChristianMurphy/gutenberg-book-normalize.git
cd gutenberg-book-normalize
npm install
Usage
Download books
Downloads all English-language Project Gutenberg books in HTML format.
Follows the recommendations in Project Gutenberg's official robot access guide: https://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages
npm run gutenberg-download
Extract books
Unzips content into files and folders
npm run gutenberg-extract
Normalize books
Normalizes the HTML content into a JSON format that is easier to process
npm run gutenberg-normalize
Example output:
{
  "type": "book",
  "title": "lorem ipsum",
  "author": "lorem ipsum",
  "children": [
    {
      "type": "chapter",
      "title": "lorem ipsum",
      "level": "h2",
      "children": [
        {
          "type": "paragraph",
          "value": "lorem ipsum"
        }
      ]
    }
  ]
}
The output format conforms to unist, so any of the unist utilities can be used to further process the content.
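
Because the output is a plain tree of nodes with `type`, `children`, and `value` fields, it can also be walked without any dependencies. The sketch below is one possible way to collect paragraph text from a normalized book. The `visit` and `paragraphTexts` helpers and the inline `book` object are hypothetical names introduced for illustration and are not part of this project; the `book` object simply mirrors the example output. A published utility such as `unist-util-visit` provides a more complete version of the same traversal.

```javascript
// Hypothetical sample tree, shaped like the example output of the normalize step.
const book = {
  type: 'book',
  title: 'lorem ipsum',
  author: 'lorem ipsum',
  children: [
    {
      type: 'chapter',
      title: 'lorem ipsum',
      level: 'h2',
      children: [{ type: 'paragraph', value: 'lorem ipsum' }]
    }
  ]
};

// Depth-first walk: call `visitor` on every node whose `type` matches.
// (Illustrative helper, not part of this project.)
function visit(node, type, visitor) {
  if (node.type === type) visitor(node);
  for (const child of node.children || []) {
    visit(child, type, visitor);
  }
}

// Collect the text of every paragraph node in document order.
function paragraphTexts(tree) {
  const texts = [];
  visit(tree, 'paragraph', (node) => texts.push(node.value));
  return texts;
}

console.log(paragraphTexts(book));
```

The same `visit` helper can be pointed at `'chapter'` nodes to gather chapter titles, or used to build a flat list of paragraphs as input for tokenization.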