gnd / archive_downloader

A node.js book downloader from Archive.org

License: GNU General Public License v3.0

JavaScript 52.12% Shell 47.88%

ebook ebook-process ocr archive-dot-org pdf pdf-converter

archive_downloader's Introduction

Archive_downloader

A node.js book downloader from Archive.org

Install

To download borrowed books from Archive.org you will first need:

apt-get install npm
npm install sleep
npm install request

To convert and OCR the downloaded images into a pdf with make_pdf.sh you will also need:

apt-get install imagemagick tesseract-ocr poppler-utils

Downloading a book:

Install EditThisCookie for Chrome, or use something else for cookie extraction
Log in to archive.org
Borrow a book
Copy your cookies:

In the EditThisCookie options, first set the preferred export format to 'Semicolon separated name=value pairs'
Click export and paste just the cookies (without comments) between the quotes of cookies = ''; in node_dl.js
If you are using another way to retrieve your cookies, just put your cookies into the cookies variable in node_dl.js
Set other variables like ua (user-agent), pages (how many pages the book has), local_name (where to download and how to name the files)

You might want to create a directory for the files, e.g. books/my_book. In that case local_name should be books/my_book/book_name
Run node node_dl.js
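
The settings above can be sanity-checked before a long download run. The following is only a minimal sketch, not the code in node_dl.js: the helper names parseCookies and buildPageJobs are hypothetical, url_stub is borrowed from the issue report further down, and the URL and file naming layout (stub plus zero-padded page number) is an assumption made for illustration.

```javascript
// Sketch only: field names mirror the README's variables, but the URL and
// file naming layout below is an assumption, not the repo's behaviour.

// Split 'name=value; name2=value2' into pairs and reject an empty export.
function parseCookies(cookies) {
  const pairs = String(cookies || '')
    .split(';')
    .map((part) => part.trim())
    .filter((part) => part.includes('='));
  if (pairs.length === 0) {
    throw new Error('cookies is empty; paste the exported name=value pairs');
  }
  return pairs;
}

// Build one job per page: where to fetch from, which headers to send,
// and which local file to write.
function buildPageJobs(settings) {
  const pages = Number(settings.pages);
  if (!Number.isInteger(pages) || pages < 1) {
    throw new Error('pages must be a positive integer');
  }
  const cookieHeader = parseCookies(settings.cookies).join('; ');
  const jobs = [];
  for (let i = 1; i <= pages; i++) {
    const page = String(i).padStart(4, '0');
    jobs.push({
      url: settings.url_stub + page + '.jp2', // hypothetical URL layout
      headers: { Cookie: cookieHeader, 'User-Agent': settings.ua },
      file: settings.local_name + '_' + page + '.jp2', // hypothetical naming
    });
  }
  return jobs;
}

const jobs = buildPageJobs({
  cookies: 'name1=value1; name2=value2', // placeholder values
  ua: 'Mozilla/5.0',
  pages: 3,
  local_name: 'books/my_book/book_name',
  url_stub: 'https://example.invalid/book_name_', // placeholder, not a real endpoint
});
console.log(jobs.map((job) => job.file));
```

Failing early on an empty cookie string or a non-numeric page count is cheaper than discovering the problem after the files have been written.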

Converting downloaded files into searchable OCR'ed pdf:

Run make_pdf.sh books/my_book output_name
This converts every jp2 file in books/my_book to jpg, OCRs each jpg into its own pdf, and finally joins all the pdfs into output_name.pdf
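
The three stages can be pictured as a list of commands per page plus one final join. This is a rough sketch that only prints such commands and does not run them; planPdfCommands is a hypothetical helper, and the tool chosen for each stage (ImageMagick convert, tesseract, pdfunite from poppler-utils) is an assumption based on the packages listed under Install, not a copy of make_pdf.sh.

```javascript
// Sketch only: plans the commands a jp2 -> jpg -> pdf pipeline could run.
// Nothing is executed; the tool used per step is an assumption.
const path = require('path');

function planPdfCommands(bookDir, jp2Files, outputName) {
  const commands = [];
  const pagePdfs = [];
  // Sort so pages are joined in order (assumes zero-padded file names).
  for (const file of [...jp2Files].sort()) {
    const base = path.posix.join(bookDir, path.posix.basename(file, '.jp2'));
    // 1. ImageMagick: JPEG 2000 -> JPEG
    commands.push(['convert', `${base}.jp2`, `${base}.jpg`]);
    // 2. Tesseract: OCR the JPEG into a searchable single-page PDF
    commands.push(['tesseract', `${base}.jpg`, base, '-l', 'eng', 'pdf']);
    pagePdfs.push(`${base}.pdf`);
  }
  // 3. poppler-utils: join the per-page PDFs into one file
  commands.push(['pdfunite', ...pagePdfs, `${outputName}.pdf`]);
  return commands;
}

const plan = planPdfCommands(
  'books/my_book',
  ['book_name_0002.jp2', 'book_name_0001.jp2'],
  'output_name'
);
for (const cmd of plan) console.log(cmd.join(' '));
```

Keeping each command as an argument array rather than one string avoids quoting problems if a book directory contains spaces.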

archive_downloader's Issues

undefined cookie

Empty .jp2 files

Hello, guys!
Does it still work?
Because I get empty .jp2 files and all of them are the same - 49 bytes of nothing...

Maybe I did something wrong?
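
One way to check what such small files really contain is to look at their first bytes. A JP2 file starts with a fixed 12-byte signature box, and a bare JPEG 2000 codestream starts with the SOC and SIZ markers; anything else is not an image. The sketch below is not part of the downloader; looksLikeJpeg2000 and describeDownload are hypothetical helpers, and the sample error text is the one quoted in another issue on this page.

```javascript
// JP2 files begin with a 12-byte signature box; a bare JPEG 2000
// codestream begins with the SOC (FF 4F) and SIZ (FF 51) markers instead.
const JP2_SIGNATURE = Buffer.from([
  0x00, 0x00, 0x00, 0x0c, 0x6a, 0x50, 0x20, 0x20, 0x0d, 0x0a, 0x87, 0x0a,
]);
const J2K_CODESTREAM = Buffer.from([0xff, 0x4f, 0xff, 0x51]);

function looksLikeJpeg2000(buffer) {
  const startsWith = (magic) =>
    buffer.length >= magic.length &&
    buffer.subarray(0, magic.length).equals(magic);
  return startsWith(JP2_SIGNATURE) || startsWith(J2K_CODESTREAM);
}

// Returns a short description of what a downloaded page file contains.
function describeDownload(buffer) {
  if (looksLikeJpeg2000(buffer)) return 'image';
  // A tiny non-image payload may be a text error body from the server,
  // so show the first bytes as text.
  const preview = buffer.subarray(0, 80).toString('utf8').trim();
  return 'not an image: ' + JSON.stringify(preview);
}

// In-memory examples; for a real file, pass fs.readFileSync(pathToFile).
console.log(describeDownload(Buffer.concat([JP2_SIGNATURE, Buffer.alloc(8)])));
console.log(
  describeDownload(Buffer.from('Error serving request:\n  Image error: not found'))
);
```

If every file turns out to hold the same short text, the request itself is being rejected (for example because of expired cookies or a changed URL), and the problem is not in the conversion step.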

improvements to be worked in

> > --> not poppler-tools, but poppler-utils
> > 
> > besides those few things I had to install manually:  
> >> npm install sleep
> > otherwise it threw:
> > internal/modules/cjs/loader.js:638
> >     throw err;
> > Error: Cannot find module 'sleep'
> > 
> > and this as well: 
> >> npm install --save readline-sync
> > otherwise it threw:
> > internal/modules/cjs/loader.js:638
> >     throw err;
> >     ^
> > Error: Cannot find module 'readline-sync'
> > 
> > readme.md says the cookie is pasted into node_dl.js, but the file is actually called archive_dl.js (maybe you renamed it in the meantime)
> > 
> > a somewhat critical bug is that "cookies" in config.json is mistakenly written as "cookie"
> > 
> > config.json also needs url_stub, which is not mentioned in readme.md, although it is in the comments in archive_dl.js
> > 
> > and in archive_dl.js I had to change
> > /* for (var i = 1; i < config.pages + 1; i++) {*/
> > to:
> >   for (var i = 1; i <= config.pages; i++) {
> > otherwise for some reason it ignored that variable and went up to 50 regardless of it
> > 
> > in make_pdf.sh the OCR did not work for me until I changed:
> > #        tesseract $BOOK/$img -l eng $BOOK/$imgname pdf
> >         tesseract $BOOK/$img $BOOK/$imgname -l eng --dpi 150 pdf
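
One possible explanation for the loop behaviour described above, offered as a guess and not a diagnosis: if pages was stored in config.json as a string such as "5", then config.pages + 1 concatenates to "51" and the original loop runs to 50, while i <= config.pages compares against the number 5. The snippet below only demonstrates that JavaScript behaviour; the "pages": "5" value is an assumption, not something taken from the reporter's config.

```javascript
// Assumption: config.json stored pages as a string, e.g. "pages": "5".
const config = JSON.parse('{"pages": "5"}');

// "5" + 1 concatenates to "51", and i < "51" then compares numerically,
// so the original loop runs 50 times.
let original = 0;
for (let i = 1; i < config.pages + 1; i++) original++;

// i <= "5" converts "5" to the number 5, so the changed loop runs 5 times.
let changed = 0;
for (let i = 1; i <= config.pages; i++) changed++;

// Converting once up front makes both loop forms behave the same.
const pages = Number(config.pages);
let converted = 0;
for (let i = 1; i < pages + 1; i++) converted++;

console.log(original, changed, converted); // 50 5 5
```

Writing the value as a bare number in config.json, or converting it with Number() when the config is loaded, removes the ambiguity either way.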

broken

The downloader says "Image downloaded", but every downloaded file contains:

Error serving request:
  Image error: not found

maybe the query params are missing after the image file names

this works: https://github.com/MiniGlome/Archive.org-Downloader
set email, password, url ... and get the pdf. simple!

Image quality?

Hi, is the image quality with your downloader better than if I borrow normally (e.g. download the book via ACSM)?
