miniglome / archive.org-downloader
Python3 script to download archive.org books in PDF format
I didn't know where else to say thanks. Your dev effort is really amazing.
Thanks a lot, @MiniGlome... Wish you the best!
When working with many books from a .txt list, the script accumulates memory after every book, so usage climbs steadily up to 8000000 or so, and then it hangs.
I believe this shouldn't happen; it should free memory after every book, but I don't have enough knowledge to figure out the cause.
Maybe you could look into this.
And thanks for the great script!
PS: I believe the problem is somewhere in the PDF converter, because when I work with the -j flag I don't get this extra memory usage issue.
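If the converter really is holding on to each book's pages, one mitigation would be to drop the image data explicitly before moving to the next book. This is only a sketch with hypothetical injected helpers, not the script's real functions:

```python
import gc

def process_books(urls, download_book, make_pdf):
    """Download and convert each book in turn, releasing its page
    images before the next one. download_book and make_pdf are
    injected callables standing in for the script's real logic."""
    for url in urls:
        images = download_book(url)   # e.g. a list of JPEG byte strings
        make_pdf(images, url)
        del images                    # drop the only reference to the pages
        gc.collect()                  # reclaim memory before the next book
```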
Feature Request 1:
A way to bulk-download multiple books. I have hard-coded the login details and book quality; all that's missing is some method to pass multiple URLs to the script. Maybe from a text file?
Feature Request 2:
Output to individual JPGs in a folder or a compressed ZIP file, rather than a PDF.
Feature Request 3:
Ability to return the book after its images have been successfully saved.
(Your script works AMAZINGLY well! These 3 features would make it... literally the best!)
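Feature request 1 could start from something as small as this (read_url_list is a hypothetical helper, and treating lines starting with # as comments is my own convention, not the script's):

```python
def read_url_list(path):
    """Return non-empty, non-comment URLs from a text file,
    one URL per line, with surrounding whitespace stripped."""
    with open(path) as f:
        return [line.strip() for line in f
                if line.strip() and not line.lstrip().startswith("#")]
```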
I made a modified version of the script with added clipboard support (to grab the URL instantly without any typing) and other improvements, including some of the unmerged pull requests from here.
You can check it out at https://github.com/maximka1812/AD---Archive-Download-Tool
[00:02<00:00, 7.08it/s]
Is "it/s" some weird unit, or is it Mbit/s or similar that's being oddly cropped?
I'm using the script on macOS 12.4 with Python 3.9.x (currently 3.9.13). I recently upgraded the script after being behind a few versions and the last two versions of the script have a bug where the directory that is created for each file isn't usually deleted. Instead, the next file's directory is created inside of that one, and the next one inside of that one, and so on.
- downloadFileDirOne/
  - downloadFileDirTwo/
    - downloadFileDirThree/
It seems to be deleting only the directory of the last file downloaded from a download list, and not each directory in turn as they are emptied after the PDFs are made. The expected behavior would be that each directory is deleted after its PDF is made.
I don't know enough about this, but my guess is that directory isn't being properly defined when looping for shutil.rmtree(directory) in line 225. Replacing this code at lines 209-216:
directory = os.path.join(directory, title)
# Handle the case where multiple books with the same name are downloaded
i = 1
d = directory
while os.path.isdir(directory):
    directory = f"{d}({i})"
    i += 1
os.makedirs(directory)
with this code from an earlier version of the script solves the problem:
directory = os.path.join(os.getcwd(), title)
if not os.path.isdir(directory):
    os.makedirs(directory)
The "handle the case where multiple books with the same name are downloaded" doesn't seem to be necessary at 209-216, at least according to my testing, because the case is already handled at lines 141-145 with:
# Handle the case where multiple books with the same name are downloaded
i = 1
while os.path.isfile(os.path.join(directory, file)):
    file = f"{title}({i}).pdf"
    i += 1
Additionally, if the nest of folders is too deep, then an error occurs:
Traceback (most recent call last):
  File "/Users/username/Archive.org-Downloader/archive-org-downloader.py", line 216, in <module>
    os.makedirs(directory)
  File "/usr/local/Cellar/[email protected]/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
OSError: [Errno 63] File name too long: "
I don't know the name for this issue, but it's the same one described here: https://www.tenforums.com/general-support/165181-sort-problem-i-get-1-10-11-2-rather-than-1-2-how-do-i-fix.html
The 1, 2, 3, ..., 20 order will end up as 1, 10, 11, 12, and so on when the pages are re-assembled into a PDF with another program.
This isn't an issue with the other program, just a classic filename-sorting problem. I thought I had solved it with IrfanView batch rename, but I messed up enough times to come and ask for a fix.
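A downloader-side fix would be to zero-pad the page numbers when saving, so plain alphabetical order equals numeric order. A sketch (padded_name is a hypothetical helper, not part of the script):

```python
def padded_name(page_index, total_pages, ext="jpg"):
    """Zero-pad page numbers to the width of the page count, so
    lexicographic sorting matches numeric order (01, 02, ..., 10, 11)."""
    width = len(str(total_pages))
    return f"{page_index:0{width}d}.{ext}"
```

With 20 pages this yields 01.jpg through 20.jpg, which any program will reassemble in the right order.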
I followed the exact steps and everything looked fine, but running the script on this book (https://archive.org/details/oraldiagnosis0000kerr) fails.
The exact query is as follows:
python3 archive-org-downloader.py -e [email protected] -p password -r 0 -u https://archive.org/details/oraldiagnosis0000kerr
1 Book(s) to download
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 421, in _make_request
six.raise_from(e, None)
File "", line 3, in raise_from
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 416, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.8/http/client.py", line 1344, in getresponse
response.begin()
File "/usr/lib/python3.8/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.8/http/client.py", line 276, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 719, in urlopen
retries = retries.increment(
File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 400, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/lib/python3/dist-packages/six.py", line 702, in reraise
raise value.with_traceback(tb)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 421, in _make_request
six.raise_from(e, None)
File "", line 3, in raise_from
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 416, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.8/http/client.py", line 1344, in getresponse
response.begin()
File "/usr/lib/python3.8/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.8/http/client.py", line 276, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "archive-org-downloader.py", line 191, in
session = login(email, password)
File "archive-org-downloader.py", line 50, in login
response = session.post("https://archive.org/account/login", data=data, headers=headers)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 581, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
I got such an error and found the reason: URLs in the TXT file can sometimes contain trailing spaces.
It is best to add url = url.rstrip() inside the main loop, as the error is very hard for a user to spot.
I'd also strongly advise printing the current book name between two quotation marks, so that any such stray whitespace can be seen:
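Both suggestions together, as a small sketch (clean_urls and announce are my own hypothetical helpers, not functions in the script):

```python
def clean_urls(lines):
    """Strip stray whitespace from each URL (trailing spaces and
    newlines are a common copy-paste artifact) and drop blank lines."""
    return [url.strip() for url in lines if url.strip()]

def announce(url):
    """Print the current book between quotation marks, so any
    leftover whitespace would be visible in the output."""
    print(f'Current book: "{url}"')
```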
1 Book(s) to download
[+] Successful login
========================================
Current book: https://archive.org/details/artofhungarianco00benn/
Something went wrong when trying to borrow the book, maybe you can't borrow this book
<Response [400]>
{"error":"No identifier provided."}
This is the error I get no matter what, whether I've borrowed the book or not.
When trying to download files with long titles, such as this one, https://archive.org/details/weiblauesschwarz0000fend, the file names are too long (on macOS 11.6, which has a 255-character file name limit). Adding title = title[:251]
at line 18 trims the title when it is longer than 251 characters, leaving enough room for the ".pdf" extension added later in the process.
I've tried the solution suggested above but it didn't work. Please help... my young brother has sent me a list of the books needed for his drama class and I need your help.
The command I ran:
$ python3 archive-org-downloader.py -e [email protected] -p 00000000 -r 0 -u https://archive.org/details/bullshtartistlea0000klei/page/5/mode/2up
1 Book(s) to download
Traceback (most recent call last):
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 710, in urlopen
chunked=chunked,
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "", line 3, in raise_from
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 1369, in getresponse
response.begin()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 310, in begin
version, status, reason = self._read_status()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 279, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\requests\adapters.py", line 450, in send
timeout=timeout
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 786, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\util\retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\packages\six.py", line 769, in reraise
raise value.with_traceback(tb)
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 710, in urlopen
chunked=chunked,
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "", line 3, in raise_from
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 1369, in getresponse
response.begin()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 310, in begin
version, status, reason = self._read_status()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 279, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "archive-org-downloader.py", line 191, in
session = login(email, password)
File "archive-org-downloader.py", line 50, in login
response = session.post("https://archive.org/account/login", data=data, headers=headers)
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\requests\sessions.py", line 577, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\requests\sessions.py", line 529, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\requests\sessions.py", line 645, in send
r = adapter.send(request, **kwargs)
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\requests\adapters.py", line 501, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Hi,
Many thanks for this highly useful tool!
Currently I download using --jpg
and manually rename to the correct order (see also #53). Then to get a PDF with selectable/searchable text I use Tesseract OCR to analyse the images and make a PDF. The process I used last time (on Debian) was:
# download bookname with -r 0 -j
cd bookname
rename s/^/0/ ?.jpg
rename s/^/0/ ??.jpg
# rename s/^/0/ ???.jpg # repeat as needed for books with 000s of pages...
cd ..
ls -1 bookname/*.jpg > index.txt
tesseract index.txt bookname pdf
# output in bookname.pdf
It would be great if this could be automated. I might attempt to implement it myself.
Advantages of Tesseract: selectable searchable text
Disadvantages of Tesseract: can be much slower
Both img2pdf and Tesseract keep JPGs as-is without re-encoding at all.
Cheers!
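The shell steps above could be automated in Python roughly like this. A sketch under my assumptions only: the page files have plain numeric names like 1.jpg, and the tesseract binary is on PATH (the run parameter exists so the Tesseract call can be stubbed out):

```python
import subprocess
from pathlib import Path

def ocr_book_to_pdf(book_dir, run=subprocess.run):
    """Zero-pad the numbered page JPGs so they sort correctly,
    write the index file, then invoke Tesseract to produce
    <book_dir name>.pdf with a searchable text layer."""
    book_dir = Path(book_dir)
    pages = sorted(book_dir.glob("*.jpg"), key=lambda p: int(p.stem))
    width = len(str(len(pages)))
    renamed = []
    for page in pages:
        target = page.with_name(f"{int(page.stem):0{width}d}.jpg")
        page.rename(target)
        renamed.append(target)
    index = book_dir / "index.txt"
    index.write_text("\n".join(str(p) for p in renamed) + "\n")
    # Tesseract reads the list of images and emits <book name>.pdf
    run(["tesseract", str(index), book_dir.name, "pdf"], check=True)
    return index
```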
Is OCR the pdf possible using something like Tesseract or OCRmyPDF?
Preface: I attempted to run this script on this setup:
When trying to run the script to grab a currently borrowed book, I kept getting the same error referenced in (#36). Below are two examples of the variations that I typed in an attempt to fix the error:
I still get an 'Invalid credentails!' error for both, so I am unsure what I'm doing wrong here.
Hello,
Thanks a lot for this package. I tried downloading two books from archive.org. First worked successfully, for the second I get an error regarding img2pdf. All requirements seem to be met.
[+] Successful login
[+] Successful loan
[+] Found 262 pages
Donwloading pages...
100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 262/262 [02:59<00:00, 1.46it/s]
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/img2pdf.py", line 1349, in read_images
imgdata = Image.open(im)
File "/usr/lib/python3.9/site-packages/PIL/Image.py", line 2958, in open
raise UnidentifiedImageError(
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7f46f76bd450>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/Downloads/archive_dl/Archive.org-Downloader/downloader.py", line 123, in
pdf = img2pdf.convert(images)
File "/usr/lib/python3.9/site-packages/img2pdf.py", line 2032, in convert
) in read_images(rawdata, kwargs["colorspace"], kwargs["first_frame_only"]):
File "/usr/lib/python3.9/site-packages/img2pdf.py", line 1353, in read_images
raise ImageOpenError(
img2pdf.ImageOpenError: cannot read input image (not jpeg2000). PIL: error reading image: cannot identify image file <_io.BytesIO object at 0x7f46f76bd450>
Happy to share the archive.org book url, not sure if that violates github TOS.
For example, this book has more than one file:
https://archive.org/details/bdrc-W1KG16651/bdrc-W1KG16651-11/page/478/mode/1up
Archive.org-Downloader can only download the first file.
Hi, I used to download a few books with this script a couple of months ago, but now it always gives me a weird, very long output with HTML lines and a bunch of numbers, without downloading anything. I suppose Archive.org might have changed something?
It connects and identifies the book right and borrows it, but doesn't download.
I was having trouble downloading some books, like the one at this link: https://archive.org/details/brainsexrealdiff00moir/page/n263/mode/2up
It does the first 9 pages, and after that every page comes back "temporarily unavailable" :(
If for some reason a download session is interrupted, or if fetched images are incomplete, any attempt to continue downloading always starts from zero again. This is very unfortunate with large books of hundreds of pages, where multiple download attempts multiply the actual download size.
It would be great if a re-get feature for incomplete downloads, like the one we are used to with wget/curl, could be added.
It would be just great for people with an unstable, slow, or volume-limited internet connection.
Alternatively, if downloads could be limited to a specific page range or specific single pages, then incomplete downloads could be selectively corrected.
Thanks!
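A resume feature could boil down to checking the disk before each request. A sketch, assuming pages are saved as <index>.jpg inside the book's directory (which may not match the script's exact naming):

```python
from pathlib import Path

def pages_to_fetch(directory, n_pages, ext="jpg"):
    """Return only the page indices that are not already on disk,
    so an interrupted run can resume instead of restarting."""
    directory = Path(directory)
    return [i for i in range(n_pages)
            if not (directory / f"{i}.{ext}").is_file()]
```

The alternative mentioned above, a page-range flag, could reuse the same check restricted to the requested range.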
Receive the error with all examples and my own:
archive-org-downloader.py: error: At least one of --url and --file required
Linux Mint 19
Python 3.7.11
For books in the public domain, etc.
For example, trying to download this:
https://archive.org/details/dli.ernet.247978
Gives this error:
Something went wrong when trying to borrow the book, maybe you can't borrow this book
<Response [400]>
{"error":"You do not currently have this book borrowed."}
Granted, on the book's page you can simply download a zip with the individual images, but if I'm just copying the URLs of several books, I'm usually not checking individually whether they need to be borrowed or not. In the case of this author, for instance, this book does not need to be borrowed, but the rest do:
https://archive.org/search.php?query=creator%3A(Mure%20Pierre)%20AND%20mediatype%3A(texts)
As the title says, this is not an issue, but I don't know how to say thank you except by creating one (I don't own any Bitcoin, unfortunately).
You can't imagine how happy I was: I could not find any way to access a book on Archive.org that has no PDF available, until I found your app.
I downloaded it, installed it, ran it and boom! The PDF file is on my desktop! Fantastic!
Again, thank you very much for your hard work!
Wish you all the best in your life!
It downloads the book but does not borrow it, and most pages are unavailable. If I manually borrow the book I can see all pages, but running the code does nothing except repeatedly display that the book does not need to be borrowed.
Trying to get https://archive.org/details/workbookforwheel00paul and https://archive.org/details/wheelockslatinre00whee.
It worked with other books without any problems!
Thanks for everything anyway!
If I try to download any book, say this one from the README:
python3 -m downloader -e [email protected] -p mypassword -u https://archive.org/details/elblabladelosge00gaut
I get hit with a big fat error message:
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/jack/Code/ArchiveDownloader/Codebase/downloader.py", line 190, in <module>
title, links = get_book_infos(session, url)
File "/home/jack/Code/ArchiveDownloader/Codebase/downloader.py", line 15, in get_book_infos
response = session.get(infos_url)
File "/home/jack/.local/lib/python3.8/site-packages/requests/sessions.py", line 555, in get
return self.request('GET', url, **kwargs)
File "/home/jack/.local/lib/python3.8/site-packages/requests/sessions.py", line 528, in request
prep = self.prepare_request(req)
File "/home/jack/.local/lib/python3.8/site-packages/requests/sessions.py", line 456, in prepare_request
p.prepare(
File "/home/jack/.local/lib/python3.8/site-packages/requests/models.py", line 316, in prepare
self.prepare_url(url, params)
File "/home/jack/.local/lib/python3.8/site-packages/requests/models.py", line 393, in prepare_url
raise InvalidURL("Invalid URL %r: No host supplied" % url)
requests.exceptions.InvalidURL: Invalid URL 'https:TYPE html>\n<html lang="en">\n<!-- __ _ _ _ __| |_ (_)__ _____\n / _` | \'_/ _| \' \\| |\\ V / -_)\n \\__,_|_| \\__|_||_|_| \\_/\\___| -->\n <head data-release=af39621e>\n <title>El blablΓ‘ de los gemelos : Gauthier, Bertrand, 1945- : Free Download, Borrow, and Streaming : Internet Archive</title>\n\n <meta name="viewport"
... snip - a gigantic amount of HTML ...
});\n </script>\n </div>\n': No host supplied
When trying to use the downloader I get an error that the module requests doesn't exist, but when I try to install it, it shows me that it is already installed. I'm using Python 3.10.4.
When using the file-list option to download four volumes from the same series, which have the same name on the Internet Archive, they are given the same name by this downloader when the PDF is created, and therefore overwrite each other.
For example, the first four results on this search are all different volumes, even though they have the same title on their respective pages.
https://archive.org/search.php?query=Schweizer+lexikon&and[]=mediatype%3A%22texts%22
If you add all four URLs to your download list, at the end you will end up with just the PDF of the final volume.
For the moment, to drastically decrease the chances of this happening, I have used an available variable to add the page count to the file name when writing the PDF. This won't always prevent overwrites, however.
The changed code:
def make_pdf(pdf, title):
    with open(f"{title}-{len(links)}pp.pdf", "wb") as f:
        f.write(pdf)
    print(f"[+] PDF saved as \"{title}-{len(links)}pp.pdf\"")
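An alternative that should avoid collisions entirely: derive the file name from the archive.org identifier in the URL, which is unique per item. A sketch (pdf_name is my own helper, and it assumes plain /details/<identifier> URLs without /page/... suffixes):

```python
def pdf_name(url, title):
    """Append the unique archive.org identifier (last path segment
    of a /details/ URL) to the title, so same-titled volumes get
    distinct file names."""
    identifier = url.rstrip("/").split("/")[-1]
    return f"{title}-{identifier}.pdf"
```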
I'm trying to download a rather large book but when it finishes only 10 or so pages are complete, the rest are a page that says "Page Temporarily unavailable. This page is part of a limited preview. Please try again tomorrow. Use your free account to borrow this book and gain access to all pages."
https://archive.org/details/shinmeikaikokugo0000unse/page/n9/mode/2up
here's the book if that helps.
Any help would be appreciated.
A single Ctrl-C doesn't do much; the downloads continue (tested on Debian Linux).
Repeatedly mashing Ctrl-C does eventually work, but with a flood of stack traces in the terminal.
This seems to be a common issue with Python 3 thread pools; I'm currently trying to find out what the best fix is...
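One possible fix, sketched here rather than taken from the script's actual structure: funnel the downloads through a pool whose queued work is cancelled on the first KeyboardInterrupt (the cancel_futures argument needs Python 3.9+):

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(tasks, worker, max_workers=10):
    """Run `worker` over `tasks` in a thread pool, but make a single
    Ctrl-C effective by cancelling all queued futures instead of
    letting them start."""
    executor = ThreadPoolExecutor(max_workers=max_workers)
    try:
        futures = [executor.submit(worker, t) for t in tasks]
        results = [f.result() for f in futures]
    except KeyboardInterrupt:
        # Drop all not-yet-started downloads, then re-raise so the
        # program exits after one interrupt instead of many.
        executor.shutdown(wait=False, cancel_futures=True)
        raise
    executor.shutdown()
    return results
```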
Hello, how are you?
Sorry to bother you, but I'd need a little help.
You published a very interesting script for downloading private files from "archive.org", but unfortunately I don't have much knowledge of Python. I didn't understand where I should enter the data (email, password, desired link, quality and file type); every time I run the program, it just prints the options and exits.
Could you please tell me where I should replace this information to perform the download?
Or maybe you could modify the code so that the terminal asks for email, password, desired link and quality? That way many people who don't know Python could enjoy your great code.
By the way, I know that for you this seems very easy and that it would be up to me to study Python; it isn't for lack of will or commitment, but I don't have much aptitude for programming and I can't make progress.
Thank you very much.
Best regards,
Montoro.
PS1: Do I need to borrow the book for your code to work?
PS2: img2pdf==0.4.0 doesn't install at all
Thanks for the tool, it works beautifully. I'm just wondering if it would be possible to add a flag that disables the auto-return of loaned books. I was trying to download both the PDF and the JPGs of a particular book and had to re-loan the book to make the two downloads. It would be nice to be able to do multiple runs, in case something goes wrong, and then choose to return the book. Thanks!
Archive.org-Downloader-main>archive-org-downloader.py -e email -p password -u https://archive.org/details/masteropticaltec0000deva -r 0
1 Book(s) to download
[+] Successful login
========================================
Current book: https://archive.org/details/masteropticaltec0000deva
[+] Successful loan
Traceback (most recent call last):
File "Archive.org-Downloader-main\archive-org-downloader.py", line 209, in <module>
title, links = get_book_infos(session, url)
File "Archive.org-Downloader-main\archive-org-downloader.py", line 22, in get_book_infos
response = session.get(infos_url)
File "AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py", line
542, in get
return self.request('GET', url, **kwargs)
File "AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py", line
515, in request
prep = self.prepare_request(req)
File "AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py", line
443, in prepare_request
p.prepare(
File "AppData\Local\Programs\Python\Python310\lib\site-packages\requests\models.py", line 3
18, in prepare
self.prepare_url(url, params)
File "AppData\Local\Programs\Python\Python310\lib\site-packages\requests\models.py", line 3
95, in prepare_url
raise InvalidURL("Invalid URL %r: No host supplied" % url)
The URL for the file is https://archive.org/details/masteropticaltec0000deva
I had been using this program successfully for some time, even downloading this very file in JPEG form before, so I'm not sure what's changed. I got this error with the version I was using, then updated and found no difference.
Might I have uninstalled something required? I'm pretty sure I haven't since I last used it, just checking if it's that sort of mistake. I tried an extra slash at the end of the URL, then leaving off the resolution flag: nothing.
I am on Windows 10 64 bit. The rest of the error is long and is attached. Thanks for any help!
restoferror.txt
Hello. Personally I have zero knowledge of coding. I tried to follow the indicated steps, but I couldn't download books. Maybe I should have done something else before following the instructions. Can you add a more detailed description of the steps for those who need an ELI5?
Hi,
First of all, many thanks for this script! It works perfectly.
One small spelling error in the code:
Archive.org-Downloader/downloader.py
Line 119 in f1a72b3
"Donwloading" > "Downloading"
Just a suggestion: allow the user to specify the output directory. I see you already have the directory variable in main, so it is really just a matter of adding it to the args being handled, plus passing it as a parameter to make_pdf.
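A sketch of what that could look like (the -d/--dir flag name is hypothetical, and only the new option is shown, not the script's existing arguments):

```python
import argparse
import os

def build_parser():
    """Minimal parser showing only a hypothetical output-directory
    option; the script's real flags (-e, -p, -u, ...) would be
    registered alongside it."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--dir", default=os.getcwd(),
                        help="Directory to write the PDF to (default: current directory)")
    return parser
```

make_pdf would then join args.dir with the file name instead of writing into the current directory.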
When I tried downloading a book at various resolution levels, and also just borrowed the book and downloaded it into Adobe Digital Editions, the Adobe Digital Editions version is about 50 MB, while downloading with -r 3 gives 151 MB and with -r 4 gives 48 MB. The 48 MB PDF is quite a bit smaller when fully zoomed in compared to the Adobe Digital Editions version fully zoomed in.
It would be great to be able to download a version that matches what you get from Adobe Digital Editions, both in size in MB and in size in inches when fully zoomed in.
Is this possible?
Thanks!
Initially I used an older version, but it also does not work on the latest version.
The command I inputted was (with email and password redacted):
python3 archive-org-downloader.py -e [email protected] -p password -u https://archive.org/details/anarchistvoiceso0000avri
It gave the following output before quitting without generating any files.
1 Book(s) to download
[+] Successful login
========================================
Current book: https://archive.org/details/anarchistvoiceso0000avri
[+] Successful loan
Traceback (most recent call last):
File "archive-org-downloader.py", line 209, in <module>
title, links = get_book_infos(session, url)
File "archive-org-downloader.py", line 22, in get_book_infos
response = session.get(infos_url)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 546, in get
return self.request('GET', url, **kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 519, in request
prep = self.prepare_request(req)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 452, in prepare_request
p.prepare(
File "/usr/lib/python3/dist-packages/requests/models.py", line 313, in prepare
self.prepare_url(url, params)
File "/usr/lib/python3/dist-packages/requests/models.py", line 390, in prepare_url
raise InvalidURL("Invalid URL %r: No host supplied" % url)
requests.exceptions.InvalidURL: Invalid URL 'https:TYPE html>\n<html lang="en">\n<!-- __ _ _ _ __| |_ (_)__ _____\n / _` | \'_/ _| \' \\| |\\ V / -_)\n \\__,_|_| \\__|_||_|_| \\_/\\___| -->\n <head data-release=8a2d548c>\n ....
I redacted the rest after "...." because it seems to be all html and it doesn't fit into the comment.
Great job, @MiniGlome ! :)
It always gives me an error in MINGW32:
bash: pip: command not found
It would be nice to see a video explaining how to install this first and download one book. Thanks.
The title basically covers it. It doesn't seem like there's a check to see whether a folder that a book is being downloaded to already exists. I forget which actual books this happened to me with, but it's easily reproduced by downloading a book (with -j) and then immediately downloading it again.
Also, I don't know if you want to mention this in your documentation, but I managed to get this running in Cygwin (after installing all the dependencies, which as a novice was no easy feat in itself), but only after commenting out "import img2pdf", because img2pdf doesn't compile in Cygwin.
Hi,
This is the error message that I get. My credentials are OK, since I'm logged in on the website.
Edited later: my password contains special characters, like 'P@:ssW0rd'.
Current book: https://archive.org/details/fromempiretoeuro0000owen
[+] Successful loan
[+] Found 550 pages
Traceback (most recent call last):
File "archive-org-downloader.py", line 197, in
os.makedirs(directory)
File "C:\Users\Sivn\AppData\Local\Programs\Python\Python38\lib\os.py", line 223, in makedirs
mkdir(name, mode)
NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\Users\Sivn\Archive.org-Downloader\From_empire_to_Europe_:_the_decline_and_revival_of_British_industry_since_the_Second_World_War'
I'm new to python and coding, so I'm not entirely sure what's causing this :(
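The ':' in the book's title looks like the culprit here: Windows forbids it in file and directory names, hence the "directory name is invalid" error. A sketch of a sanitizer that could be applied to the title before os.makedirs (the function name is my own, not the script's):

```python
import re

def sanitize_title(title):
    """Replace the characters Windows forbids in file/directory
    names (< > : " / \ | ? *) and strip trailing dots/spaces,
    which Windows also rejects."""
    return re.sub(r'[<>:"/\\|?*]', "_", title).strip(" .")
```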
1 Book(s) to download
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 421, in _make_request
six.raise_from(e, None)
File "", line 3, in raise_from
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 416, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.8/http/client.py", line 1348, in getresponse
response.begin()
File "/usr/lib/python3.8/http/client.py", line 316, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.8/http/client.py", line 285, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 719, in urlopen
retries = retries.increment(
File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 400, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/lib/python3/dist-packages/six.py", line 702, in reraise
raise value.with_traceback(tb)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 421, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 416, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.8/http/client.py", line 1348, in getresponse
response.begin()
File "/usr/lib/python3.8/http/client.py", line 316, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.8/http/client.py", line 285, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "archive-org-downloader.py", line 200, in <module>
session = login(email, password)
File "archive-org-downloader.py", line 52, in login
response = session.post("https://archive.org/account/login", data=data, headers=headers)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 581, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
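`RemoteDisconnected` during the login POST usually means the server dropped the connection transiently. A hedged sketch of adding automatic retries to the session with urllib3's `Retry` (standard `requests` adapter-mounting, not code from this repository; `allowed_methods` needs urllib3 >= 1.26 — older versions spell it `method_whitelist`):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries: int = 5) -> requests.Session:
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=1,          # waits 1s, 2s, 4s, ... between attempts
        status_forcelist=(500, 502, 503, 504),
        allowed_methods=None,      # retry POST as well as GET
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

The script's `login()` could then be handed such a session instead of a bare `requests.Session()`.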
Hi, I'm wondering if it's possible for this code to be applied to something more automated that downloads all of the original high-res scanned JPG files after a publication is borrowed for the 1 hour.
For example, pasting the archive.org link into an online downloader, a Firefox add-on, or a plugin for JDownloader.
I'm not a developer or coder, so I'm searching for a simpler solution.
Here's a link to the type of publications I need to grab:
https://archive.org/details/pub_interview?sort=-addeddate
I download a lot of books in other languages whose titles include many non-ASCII special characters. The script as written strips out everything but ASCII letters and numbers. However, if I remove the code that does the stripping, it seems to handle non-ASCII characters just fine. Change line 17 to title = "".join([c for c in data['brOptions']['bookTitle']]).
Here's an example book: https://archive.org/details/dictionnairetymo0000bloc
It has Γ© and Γ§ in the title and they are preserved in the file name after the script change.
Here are some other example books that have non-ASCII characters in their titles which worked with this script modification:
https://archive.org/details/bdrc-W1AB6
https://archive.org/details/morisasakihindik0000unse
https://archive.org/details/hindijapanesedic00kazu
https://archive.org/details/kainantohogenkis008800
I'm guessing the ASCII-only stripping is really only needed for Python 2 users, since Python 2 does not handle Unicode encodings automatically the way Python 3 does. But the examples invoke python3, so that's what users should be running anyway.
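Rather than keeping every character, a middle-ground sketch: preserve non-ASCII letters but replace only the characters that are actually invalid in Windows file names (the cause of the NotADirectoryError reported in another issue here). `title_to_dirname` is a hypothetical helper, not the script's own code:

```python
import re

def title_to_dirname(book_title: str) -> str:
    # Keep Unicode letters (é, ç, CJK, ...) but replace characters
    # that are invalid on Windows or are control characters.
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", book_title)
    return cleaned.strip(". ") or "untitled"
```

This keeps titles like "Dictionnaire étymologique" intact while still producing portable folder names.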
Hi, I'd like to use this script.
I installed Python and Git, and Python is already configured in my environment variables.
When I check the Python version inside Git Bash, I can see that 3.10.7 is installed.
Nevertheless, when I run your script it says "python not found".
I followed your instructions to the letter but still couldn't use the script.
I admit I am totally illiterate in programming. Thank you.
Sometimes multiple variants of the same book are available, and it can be desirable to download all of them for comparison, in order to choose the best-quality version.
Unfortunately, the download folder's unique name is normally replaced by the long book title. Since variants often share the very same title, downloading several of them results in each overwriting the previous one.
It would be preferable to retain the unique identifier for each downloaded variant, especially since that also makes it possible to clearly identify the original download source later.
The following modification does the trick:
--- archive-org-downloader.py	2021-10-21 08:35:41.589757183 +0200
+++ myarchive-org-downloader.py	2021-12-07 06:51:12.078410887 +0200
@@ -197,7 +197,7 @@
     session = loan(session, book_id)
     title, links = get_book_infos(session, url)
-    directory = os.path.join(os.getcwd(), title)
+    directory = os.path.join(os.getcwd(), book_id)
     if not os.path.isdir(directory):
         os.makedirs(directory)
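A variant of the patch above that keeps the human-readable title while still guaranteeing uniqueness: append the identifier to the title. This is a sketch only; `title` and `book_id` are the names used in the diff:

```python
import os

def unique_book_dir(base: str, title: str, book_id: str) -> str:
    # "Title (identifier)" stays human-readable but cannot collide
    # across variants of the same book, and records the source id.
    return os.path.join(base, f"{title} ({book_id})")
```

For example, two scans with the same title would land in "Title (id-one)" and "Title (id-two)" instead of overwriting each other.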