gnd / archive_downloader

A node.js book downloader from Archive.org

License: GNU General Public License v3.0

JavaScript 52.12% Shell 47.88%
ebook ebook-process ocr archive-dot-org pdf pdf-converter

archive_downloader's Introduction

Archive_downloader

Install

For downloading borrowed books from Archive.org you will first need:

apt-get install npm
npm install sleep
npm install request

To convert and OCR the downloaded images into a pdf with make_pdf.sh you will also need:

apt-get install imagemagick tesseract-ocr poppler-utils

Downloading a book:

  1. Install EditThisCookie for Chrome, or use another tool for cookie extraction
  2. Log in to archive.org
  3. Borrow a book
  4. Copy your cookies:
  • In the EditThisCookie options, first set the preferred export format to 'Semicolon separated name=value pairs'
  • Click export and paste just the cookies (without comments) into cookies = ''; in node_dl.js
  • If you retrieve your cookies another way, just put them into the cookies variable in node_dl.js
  5. Set the other variables: ua (user-agent), pages (how many pages the book has), local_name (where to download the files and how to name them)
  6. You might want to create a directory for the files, e.g. books/my_book. In that case local_name should be books/my_book/book_name
  7. Run node node_dl.js
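
The exported cookie string is a list of name=value pairs separated by semicolons, so it can be worth checking that it parses cleanly before pasting it into the script. The following is a minimal sketch and not part of node_dl.js; the helper name parseCookieString and the sample cookie names and values are hypothetical.

```javascript
// Hypothetical helper: split a 'name=value; name2=value2' string into an object.
function parseCookieString(cookieString) {
  const cookies = {};
  for (const part of cookieString.split(';')) {
    const pair = part.trim();
    if (pair === '') continue; // tolerate a trailing semicolon
    const eq = pair.indexOf('=');
    if (eq <= 0) {
      throw new Error('Not a name=value pair: ' + pair);
    }
    // Split on the first '=' only, because values may contain '=' themselves.
    cookies[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return cookies;
}

// Made-up cookie names and values, for illustration only.
const parsed = parseCookieString('example-user=alice; example-token=abc=123;');
console.log(Object.keys(parsed)); // lists the two cookie names
```

If the function throws, the pasted text probably still contains comments or other non-cookie content from the export.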

Converting downloaded files into searchable OCR'ed pdf:

  1. Run make_pdf.sh books/my_book output_name
  2. This converts all jp2 files in the folder books/my_book into JPGs, runs OCR on each JPG to produce a separate PDF per page, and finally joins all PDFs into output_name.pdf
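
The three stages (convert, OCR, join) can be sketched as a list of commands. This illustrates the pipeline only and is not the contents of make_pdf.sh; the function planPdfCommands and the page file names are hypothetical, and the sketch assumes ImageMagick's convert, tesseract, and poppler's pdfunite are the tools involved.

```javascript
// Hypothetical sketch: build the shell commands for each stage without running them.
function planPdfCommands(bookDir, pageNames, outputName) {
  const commands = [];
  const pagePdfs = [];
  for (const name of pageNames) {
    const base = `${bookDir}/${name}`;
    // 1. Convert the JPEG 2000 page into a JPG (ImageMagick).
    commands.push(`convert ${base}.jp2 ${base}.jpg`);
    // 2. OCR the JPG; tesseract appends ".pdf" to the output base itself.
    commands.push(`tesseract ${base}.jpg ${base} -l eng pdf`);
    pagePdfs.push(`${base}.pdf`);
  }
  // 3. Join the per-page PDFs in page order (pdfunite ships with poppler).
  commands.push(`pdfunite ${pagePdfs.join(' ')} ${outputName}.pdf`);
  return commands;
}

const plan = planPdfCommands('books/my_book', ['page_0001', 'page_0002'], 'output_name');
console.log(plan.join('\n'));
```

Printing the plan before running anything makes it easy to confirm that the page order and output name are what you expect.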

archive_downloader's Issues

Empty .jp2 files

Hello, guys!
Does it still work?
Because I get empty .jp2 files and all of them are the same - 49 bytes of nothing...

Maybe I did something wrong?

improvements to be worked in

> > --> not poppler-tools, but poppler-utils
> > 
> > besides those few things I had to install manually:
> >> npm install sleep
> > otherwise it threw:
> > internal/modules/cjs/loader.js:638
> >     throw err;
> > Error: Cannot find module 'sleep'
> > 
> > and this as well:
> >> npm install --save readline-sync
> > otherwise it threw:
> > internal/modules/cjs/loader.js:638
> >     throw err;
> >     ^
> > Error: Cannot find module 'readline-sync'
> > 
> > readme.md says the cookie is pasted into node_dl.js, but the file is called archive_dl.js (maybe you renamed it in the meantime)
> > 
> > a somewhat critical error is that "cookies" in config.json is mistakenly written as "cookie"
> > 
> > url_stub also has to be added to config.json; it is not mentioned in readme.md, although it is in the comments in archive_dl.js
> > 
> > and in archive_dl.js I had to change
> > /* for (var i = 1; i < config.pages + 1; i++) {*/
> > to:
> >   for (var i = 1; i <= config.pages; i++) {
> > otherwise it ignored that variable for some reason and went up to 50 regardless
> > 
> > in make_pdf.sh the OCR would not run until I changed:
> > #        tesseract $BOOK/$img -l eng $BOOK/$imgname pdf
> >         tesseract $BOOK/$img $BOOK/$imgname -l eng --dpi 150 pdf
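
One possible explanation for the loop running to 50, assuming pages was stored in config.json as a string such as "5" (the issue does not confirm this): in JavaScript, + with a string operand concatenates, so "5" + 1 is the string "51", whereas <= converts the string to a number. The sketch below uses a hypothetical config object and a hypothetical countIterations helper to show the difference.

```javascript
// Assumption: pages came from config.json as a string, not a number.
const config = { pages: '5' };

// Count how many times a loop starting at 1 runs under a given condition.
function countIterations(withinLimit) {
  let count = 0;
  for (let i = 1; withinLimit(i); i++) count++;
  return count;
}

// '5' + 1 is the string '51'; i < '51' then compares numerically against 51.
const original = countIterations((i) => i < config.pages + 1);
// i <= '5' converts '5' to the number 5.
const changed = countIterations((i) => i <= config.pages);

console.log(original, changed); // 50 5
```

Under the same assumption, converting the value once with Number(config.pages) before the loop would make both forms of the condition behave identically.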

broken

downloader says Image downloaded but all downloaded files are

Error serving request:
  Image error: not found

maybe the query params are missing after the image file names

this works: https://github.com/MiniGlome/Archive.org-Downloader
set email, password, url ... and get the pdf. simple!
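
Both reports above describe files that carry a .jp2 name but contain a short error message instead of image data. A quick way to tell the two apart is to compare the first bytes of a file against the JPEG 2000 signatures. This is a minimal sketch and not part of the downloader; the function names startsWith and looksLikeJpeg2000 are hypothetical.

```javascript
// JP2 container signature box: 00 00 00 0C 6A 50 20 20 0D 0A 87 0A
const JP2_SIGNATURE = [0x00, 0x00, 0x00, 0x0c, 0x6a, 0x50, 0x20, 0x20, 0x0d, 0x0a, 0x87, 0x0a];
// Raw JPEG 2000 codestream: SOC marker followed by SIZ marker.
const J2K_SIGNATURE = [0xff, 0x4f, 0xff, 0x51];

// True when the byte array begins with the given signature.
function startsWith(bytes, signature) {
  return bytes.length >= signature.length &&
    signature.every((value, index) => bytes[index] === value);
}

function looksLikeJpeg2000(bytes) {
  return startsWith(bytes, JP2_SIGNATURE) || startsWith(bytes, J2K_SIGNATURE);
}

// The error text quoted in the report, encoded as bytes.
const errorBody = new TextEncoder().encode('Error serving request:\n  Image error: not found');
console.log(looksLikeJpeg2000(errorBody)); // false
console.log(looksLikeJpeg2000(Uint8Array.from(JP2_SIGNATURE))); // true
```

In Node, the Buffer returned by fs.readFileSync(path) can be passed to looksLikeJpeg2000 directly, since a Buffer is indexable like a Uint8Array.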

Image quality?

Hi, is the image quality with your downloader better than if I borrow normally (e.g. download the book via ACSM)?
