miniglome / archive.org-downloader
Python3 script to download archive.org books in PDF format
I didn't know where else to say thanks. Your dev effort is really amazing.
Thanks a lot, @MiniGlome... Wish you the best!
When working with many books from a .txt list, the script accumulates memory after every book, so usage climbs steadily up to 8000000 or so, and then it hangs.
I believe this shouldn't happen; it should free memory after every book, but I don't have enough knowledge to figure out the cause.
Maybe you could look into this.
And thanks for the great script!
PS: I believe the problem is somewhere in the PDF converter, because when I work with the -j flag I don't get this extra memory usage issue.
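If the converter really is holding on to each book's pages, one mitigation would be to drop the image data explicitly before moving to the next book. This is only a sketch with hypothetical injected helpers, not the script's real functions:

```python
import gc

def process_books(urls, download_book, make_pdf):
    """Download and convert each book in turn, releasing its page
    images before the next one. download_book and make_pdf are
    injected callables standing in for the script's real logic."""
    for url in urls:
        images = download_book(url)   # e.g. a list of JPEG byte strings
        make_pdf(images, url)
        del images                    # drop the only reference to the pages
        gc.collect()                  # reclaim memory before the next book
```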
Feature Request 1:
A way to bulk-download multiple books. I have hard-coded the login details and book quality; all that's missing is some method to pass multiple URLs to the script. Maybe from a text file?
Feature Request 2:
Output to individual JPGs in a folder or a compressed ZIP file, rather than a PDF.
Feature Request 3:
Ability to return the book after its images have been successfully saved.
(Your script works AMAZINGLY well! These 3 features would make it... literally the best!)
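Feature request 1 could start from something as small as this (read_url_list is a hypothetical helper, and treating lines starting with # as comments is my own convention, not the script's):

```python
def read_url_list(path):
    """Return non-empty, non-comment URLs from a text file,
    one URL per line, with surrounding whitespace stripped."""
    with open(path) as f:
        return [line.strip() for line in f
                if line.strip() and not line.lstrip().startswith("#")]
```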
I made a modified version of the script with added clipboard support (to grab the URL instantly without any typing) and other improvements, including some of the unmerged pull requests from here.
You can check it out at https://github.com/maximka1812/AD---Archive-Download-Tool
[00:02<00:00, 7.08it/s]
Is "it/s" some weird unit, or is it Mbit/s or similar that's being oddly cropped?
I'm using the script on macOS 12.4 with Python 3.9.x (currently 3.9.13). I recently upgraded the script after being behind a few versions and the last two versions of the script have a bug where the directory that is created for each file isn't usually deleted. Instead, the next file's directory is created inside of that one, and the next one inside of that one, and so on.
- downloadFileDirOne/
  - downloadFileDirTwo/
    - downloadFileDirThree/
It seems to be deleting only the directory of the last file downloaded from a download list, and not each directory in turn as they are emptied after the PDFs are made. The expected behavior would be that each directory is deleted after its PDF is made.
I don't know enough about this, but my guess is that directory isn't being properly defined when looping for shutil.rmtree(directory) in line 225. Replacing this code at lines 209-216:
directory = os.path.join(directory, title)
# Handle the case where multiple books with the same name are downloaded
i = 1
d = directory
while os.path.isdir(directory):
    directory = f"{d}({i})"
    i += 1
os.makedirs(directory)
with this code from an earlier version of the script solves the problem:
directory = os.path.join(os.getcwd(), title)
if not os.path.isdir(directory):
    os.makedirs(directory)
The "handle the case where multiple books with the same name are downloaded" doesn't seem to be necessary at 209-216, at least according to my testing, because the case is already handled at lines 141-145 with:
# Handle the case where multiple books with the same name are downloaded
i = 1
while os.path.isfile(os.path.join(directory, file)):
    file = f"{title}({i}).pdf"
    i += 1
Additionally, if the nest of folders is too deep, then an error occurs:
Traceback (most recent call last):
  File "/Users/username/Archive.org-Downloader/archive-org-downloader.py", line 216, in <module>
    os.makedirs(directory)
  File "/usr/local/Cellar/[email protected]/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
OSError: [Errno 63] File name too long: "
I don't know the name for this issue, but it's the same one described here: https://www.tenforums.com/general-support/165181-sort-problem-i-get-1-10-11-2-rather-than-1-2-how-do-i-fix.html
The 1, 2, 3, ..., 20 order will end up as 1, 10, 11, 12, and so on when the pages are re-assembled into a PDF with another program.
This isn't an issue with the other program, just a classic filename-sorting problem. I thought I had solved it with IrfanView batch rename, but I messed up enough times to come and ask for a fix.
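A downloader-side fix would be to zero-pad the page numbers when saving, so plain alphabetical order equals numeric order. A sketch (padded_name is a hypothetical helper, not part of the script):

```python
def padded_name(page_index, total_pages, ext="jpg"):
    """Zero-pad page numbers to the width of the page count, so
    lexicographic sorting matches numeric order (01, 02, ..., 10, 11)."""
    width = len(str(total_pages))
    return f"{page_index:0{width}d}.{ext}"
```

With 20 pages this yields 01.jpg through 20.jpg, which any program will reassemble in the right order.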
I followed the exact steps and everything looked fine, but running the script on this book (https://archive.org/details/oraldiagnosis0000kerr) fails.
The exact query is as follows:
python3 archive-org-downloader.py -e [email protected] -p password -r 0 -u https://archive.org/details/oraldiagnosis0000kerr
1 Book(s) to download
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 421, in _make_request
six.raise_from(e, None)
File "", line 3, in raise_from
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 416, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.8/http/client.py", line 1344, in getresponse
response.begin()
File "/usr/lib/python3.8/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.8/http/client.py", line 276, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 719, in urlopen
retries = retries.increment(
File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 400, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/lib/python3/dist-packages/six.py", line 702, in reraise
raise value.with_traceback(tb)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 421, in _make_request
six.raise_from(e, None)
File "", line 3, in raise_from
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 416, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.8/http/client.py", line 1344, in getresponse
response.begin()
File "/usr/lib/python3.8/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.8/http/client.py", line 276, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "archive-org-downloader.py", line 191, in
session = login(email, password)
File "archive-org-downloader.py", line 50, in login
response = session.post("https://archive.org/account/login", data=data, headers=headers)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 581, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
I got such an error and found the reason: URLs in the TXT file can sometimes contain trailing spaces.
It is best to add url = url.rstrip() inside the main loop, as the error is very hard for a user to spot.
I'd also strongly advise printing the current book name between two quotation marks, so that any such stray whitespace can be seen:
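Both suggestions together, as a small sketch (clean_urls and announce are my own hypothetical helpers, not functions in the script):

```python
def clean_urls(lines):
    """Strip stray whitespace from each URL (trailing spaces and
    newlines are a common copy-paste artifact) and drop blank lines."""
    return [url.strip() for url in lines if url.strip()]

def announce(url):
    """Print the current book between quotation marks, so any
    leftover whitespace would be visible in the output."""
    print(f'Current book: "{url}"')
```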
1 Book(s) to download
[+] Successful login
========================================
Current book: https://archive.org/details/artofhungarianco00benn/
Something went wrong when trying to borrow the book, maybe you can't borrow this book
<Response [400]>
{"error":"No identifier provided."}
This is the error I get no matter what, whether I've borrowed the book or not.
When trying to download files with long titles, such as this one, https://archive.org/details/weiblauesschwarz0000fend, the file names are too long (on macOS 11.6, which has a 255-character file name limit). Adding title = title[:251]
at line 18 trims the title when it is longer than 251 characters, leaving enough room for the ".pdf" extension added later in the process.
I've tried the solution suggested above but it didn't work. Please help... my young brother has sent me a list of the books needed for his drama class and I need your help.
The command I ran:
$ python3 archive-org-downloader.py -e [email protected] -p 00000000 -r 0 -u https://archive.org/details/bullshtartistlea0000klei/page/5/mode/2up
1 Book(s) to download
Traceback (most recent call last):
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 710, in urlopen
chunked=chunked,
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "", line 3, in raise_from
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 1369, in getresponse
response.begin()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 310, in begin
version, status, reason = self._read_status()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 279, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\requests\adapters.py", line 450, in send
timeout=timeout
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 786, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\util\retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\packages\six.py", line 769, in reraise
raise value.with_traceback(tb)
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 710, in urlopen
chunked=chunked,
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "", line 3, in raise_from
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 1369, in getresponse
response.begin()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 310, in begin
version, status, reason = self._read_status()
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 279, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "archive-org-downloader.py", line 191, in
session = login(email, password)
File "archive-org-downloader.py", line 50, in login
response = session.post("https://archive.org/account/login", data=data, headers=headers)
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\requests\sessions.py", line 577, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\requests\sessions.py", line 529, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\requests\sessions.py", line 645, in send
r = adapter.send(request, **kwargs)
File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\requests\adapters.py", line 501, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Hi,
Many thanks for this highly useful tool!
Currently I download using --jpg
and manually rename to the correct order (see also #53). Then to get a PDF with selectable/searchable text I use Tesseract OCR to analyse the images and make a PDF. The process I used last time (on Debian) was:
# download bookname with -r 0 -j
cd bookname
rename s/^/0/ ?.jpg
rename s/^/0/ ??.jpg
# rename s/^/0/ ???.jpg # repeat as needed for books with 000s of pages...
cd ..
ls -1 bookname/*.jpg > index.txt
tesseract index.txt bookname pdf
# output in bookname.pdf
It would be great if this could be automated. I might attempt to implement it myself.
Advantages of Tesseract: selectable searchable text
Disadvantages of Tesseract: can be much slower
Both img2pdf and Tesseract keep JPGs as-is without re-encoding at all.
Cheers!
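The shell steps above could be automated in Python roughly like this. A sketch under my assumptions only: the page files have plain numeric names like 1.jpg, and the tesseract binary is on PATH (the run parameter exists so the Tesseract call can be stubbed out):

```python
import subprocess
from pathlib import Path

def ocr_book_to_pdf(book_dir, run=subprocess.run):
    """Zero-pad the numbered page JPGs so they sort correctly,
    write the index file, then invoke Tesseract to produce
    <book_dir name>.pdf with a searchable text layer."""
    book_dir = Path(book_dir)
    pages = sorted(book_dir.glob("*.jpg"), key=lambda p: int(p.stem))
    width = len(str(len(pages)))
    renamed = []
    for page in pages:
        target = page.with_name(f"{int(page.stem):0{width}d}.jpg")
        page.rename(target)
        renamed.append(target)
    index = book_dir / "index.txt"
    index.write_text("\n".join(str(p) for p in renamed) + "\n")
    # Tesseract reads the list of images and emits <book name>.pdf
    run(["tesseract", str(index), book_dir.name, "pdf"], check=True)
    return index
```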
Is OCR the pdf possible using something like Tesseract or OCRmyPDF?
Preface: I attempted to run this script on this setup:
When trying to run the script to grab a currently borrowed book, I kept getting the same error referenced in (#36). Below are two examples of the variations that I typed in an attempt to fix the error:
I still get an 'Invalid credentails!' error for both, so I am unsure what I'm doing wrong here.
Hello,
Thanks a lot for this package. I tried downloading two books from archive.org. First worked successfully, for the second I get an error regarding img2pdf. All requirements seem to be met.
[+] Successful login
[+] Successful loan
[+] Found 262 pages
Donwloading pages...
100%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ| 262/262 [02:59<00:00, 1.46it/s]
Traceback (most recent call last):
File "/usr/lib/python3.9/site-packages/img2pdf.py", line 1349, in read_images
imgdata = Image.open(im)
File "/usr/lib/python3.9/site-packages/PIL/Image.py", line 2958, in open
raise UnidentifiedImageError(
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7f46f76bd450>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/Downloads/archive_dl/Archive.org-Downloader/downloader.py", line 123, in
pdf = img2pdf.convert(images)
File "/usr/lib/python3.9/site-packages/img2pdf.py", line 2032, in convert
) in read_images(rawdata, kwargs["colorspace"], kwargs["first_frame_only"]):
File "/usr/lib/python3.9/site-packages/img2pdf.py", line 1353, in read_images
raise ImageOpenError(
img2pdf.ImageOpenError: cannot read input image (not jpeg2000). PIL: error reading image: cannot identify image file <_io.BytesIO object at 0x7f46f76bd450>
Happy to share the archive.org book url, not sure if that violates github TOS.
For example, this book has more than one file:
https://archive.org/details/bdrc-W1KG16651/bdrc-W1KG16651-11/page/478/mode/1up
Archive.org-Downloader can only download the first file.
Hi, I used to download a few books with this script a couple of months ago, but now it always gives me a weird, very long output with HTML lines and a bunch of numbers, without downloading anything. I suppose Archive.org might have changed something?
It connects and identifies the book right and borrows it, but doesn't download.
I was having trouble downloading some books, like the one at this link: https://archive.org/details/brainsexrealdiff00moir/page/n263/mode/2up
It does the first 9 pages, and after that every page comes back "temporarily unavailable" :(
If for some reason a download session is interrupted, or if fetched images are incomplete, any attempt to continue downloading always starts from zero again. This is very unfortunate with large books of hundreds of pages, where multiple download attempts multiply the actual download size.
It would be great if a re-get feature for incomplete downloads, like the one we are used to with wget/curl, could be added.
It would be just great for people with an unstable, slow, or volume-limited internet connection.
Alternatively, if downloads could be limited to a specific page range or specific single pages, then incomplete downloads could be selectively corrected.
Thanks!
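A resume feature could boil down to checking the disk before each request. A sketch, assuming pages are saved as <index>.jpg inside the book's directory (which may not match the script's exact naming):

```python
from pathlib import Path

def pages_to_fetch(directory, n_pages, ext="jpg"):
    """Return only the page indices that are not already on disk,
    so an interrupted run can resume instead of restarting."""
    directory = Path(directory)
    return [i for i in range(n_pages)
            if not (directory / f"{i}.{ext}").is_file()]
```

The alternative mentioned above, a page-range flag, could reuse the same check restricted to the requested range.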
Receive the error with all examples and my own:
archive-org-downloader.py: error: At least one of --url and --file required
Linux Mint 19
Python 3.7.11
For books in the public domain, etc.
For example, trying to download this:
https://archive.org/details/dli.ernet.247978
Gives this error:
Something went wrong when trying to borrow the book, maybe you can't borrow this book
<Response [400]>
{"error":"You do not currently have this book borrowed."}
Granted, on the book's page you can simply download a zip with the individual images, but if I'm just copying the URLs of several books, I'm usually not checking individually whether they need to be borrowed or not. In the case of this author, for instance, this book does not need to be borrowed, but the rest do:
https://archive.org/search.php?query=creator%3A(Mure%20Pierre)%20AND%20mediatype%3A(texts)
As the title says, this is not an issue, but I don't know how to say thank you except by creating one (I don't own any Bitcoin, unfortunately).
You can't imagine how happy I was: I could not find any way to access a book on Archive.org that has no PDF available, until I found your app.
I downloaded it, installed it, ran it and boom! The PDF file is on my desktop! Fantastic!
Again, thank you very much for your hard work!
Wish you all the best in your life!
It downloads the book but does not borrow it, and most pages are unavailable. If I manually borrow the book I can see all pages, but running the code does nothing except repeatedly display that the book does not need to be borrowed.
Trying to get https://archive.org/details/workbookforwheel00paul and https://archive.org/details/wheelockslatinre00whee.
It worked with other books without any problems!
Thanks for everything anyway!
If I try to download any book, say this one from the README:
python3 -m downloader -e [email protected] -p mypassword -u https://archive.org/details/elblabladelosge00gaut
I get hit with a big fat error message:
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/jack/Code/ArchiveDownloader/Codebase/downloader.py", line 190, in <module>
title, links = get_book_infos(session, url)
File "/home/jack/Code/ArchiveDownloader/Codebase/downloader.py", line 15, in get_book_infos
response = session.get(infos_url)
File "/home/jack/.local/lib/python3.8/site-packages/requests/sessions.py", line 555, in get
return self.request('GET', url, **kwargs)
File "/home/jack/.local/lib/python3.8/site-packages/requests/sessions.py", line 528, in request
prep = self.prepare_request(req)
File "/home/jack/.local/lib/python3.8/site-packages/requests/sessions.py", line 456, in prepare_request
p.prepare(
File "/home/jack/.local/lib/python3.8/site-packages/requests/models.py", line 316, in prepare
self.prepare_url(url, params)
File "/home/jack/.local/lib/python3.8/site-packages/requests/models.py", line 393, in prepare_url
raise InvalidURL("Invalid URL %r: No host supplied" % url)
requests.exceptions.InvalidURL: Invalid URL 'https:TYPE html>\n<html lang="en">\n<!-- __ _ _ _ __| |_ (_)__ _____\n / _` | \'_/ _| \' \\| |\\ V / -_)\n \\__,_|_| \\__|_||_|_| \\_/\\___| -->\n <head data-release=af39621e>\n <title>El blablΓ‘ de los gemelos : Gauthier, Bertrand, 1945- : Free Download, Borrow, and Streaming : Internet Archive</title>\n\n <meta name="viewport"
... snip - a gigantic amount of HTML ...
});\n </script>\n </div>\n': No host supplied
When trying to use the downloader I get an error that the module requests doesn't exist, but when I try to install it, it shows me that it is already installed. I'm using Python 3.10.4.
When using the file-list option to download four volumes from the same series, which have the same name on the Internet Archive, they are given the same name by this downloader when the PDF is created, and therefore overwrite each other.
For example, the first four results on this search are all different volumes, even though they have the same title on their respective pages.
https://archive.org/search.php?query=Schweizer+lexikon&and[]=mediatype%3A%22texts%22
If you add all four URLs to your download list, at the end you will end up with just the PDF of the final volume.
For the moment, to drastically decrease the chances of this happening, I have used an available variable to add the page count to the file name when writing the PDF. This won't always prevent overwrites, however.
The changed code:
def make_pdf(pdf, title):
    with open(f"{title}-{len(links)}pp.pdf", "wb") as f:
        f.write(pdf)
    print(f"[+] PDF saved as \"{title}-{len(links)}pp.pdf\"")
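An alternative that should avoid collisions entirely: derive the file name from the archive.org identifier in the URL, which is unique per item. A sketch (pdf_name is my own helper, and it assumes plain /details/<identifier> URLs without /page/... suffixes):

```python
def pdf_name(url, title):
    """Append the unique archive.org identifier (last path segment
    of a /details/ URL) to the title, so same-titled volumes get
    distinct file names."""
    identifier = url.rstrip("/").split("/")[-1]
    return f"{title}-{identifier}.pdf"
```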
I'm trying to download a rather large book but when it finishes only 10 or so pages are complete, the rest are a page that says "Page Temporarily unavailable. This page is part of a limited preview. Please try again tomorrow. Use your free account to borrow this book and gain access to all pages."
https://archive.org/details/shinmeikaikokugo0000unse/page/n9/mode/2up
here's the book if that helps.
Any help would be appreciated.
A single Ctrl-C doesn't do much; the downloads continue (tested on Debian Linux).
Repeatedly mashing Ctrl-C does eventually work, but with a flood of stack traces in the terminal.
This seems to be a common issue with Python 3 thread pools; I'm currently trying to find out what the best fix is...
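One possible fix, sketched here rather than taken from the script's actual structure: funnel the downloads through a pool whose queued work is cancelled on the first KeyboardInterrupt (the cancel_futures argument needs Python 3.9+):

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(tasks, worker, max_workers=10):
    """Run `worker` over `tasks` in a thread pool, but make a single
    Ctrl-C effective by cancelling all queued futures instead of
    letting them start."""
    executor = ThreadPoolExecutor(max_workers=max_workers)
    try:
        futures = [executor.submit(worker, t) for t in tasks]
        results = [f.result() for f in futures]
    except KeyboardInterrupt:
        # Drop all not-yet-started downloads, then re-raise so the
        # program exits after one interrupt instead of many.
        executor.shutdown(wait=False, cancel_futures=True)
        raise
    executor.shutdown()
    return results
```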
Hello, how are you?
Sorry to bother you, but I'd need a little help.
You published a very interesting script for downloading private files from "archive.org", but unfortunately I don't have much knowledge of Python. I didn't understand where I should enter the data (email, password, desired link, quality and file type); every time I run the program, it just prints the options and exits.
Could you please tell me where I should replace this information to perform the download?
Or maybe you could modify the code so that the terminal asks for email, password, desired link and quality? That way many people who don't know Python could enjoy your great code.
By the way, I know that for you this seems very easy and that it would be up to me to study Python; it isn't for lack of will or commitment, but I don't have much aptitude for programming and I can't make progress.
Thank you very much.
Best regards,
Montoro.
PS1: Do I need to borrow the book for your code to work?
PS2: img2pdf==0.4.0 doesn't install at all
Thanks for the tool, it works beautifully. I'm just wondering if it would be possible to add a flag that disables the auto-return of loaned books. I was trying to download both the PDF and the JPGs of a particular book and had to re-loan the book to make the two downloads. It would be nice to be able to do multiple runs, in case something goes wrong, and then choose to return the book. Thanks!
Archive.org-Downloader-main>archive-org-downloader.py -e email -p password -u https://archive.org/details/masteropticaltec0000deva -r 0
1 Book(s) to download
[+] Successful login
========================================
Current book: https://archive.org/details/masteropticaltec0000deva
[+] Successful loan
Traceback (most recent call last):
File "Archive.org-Downloader-main\archive-org-downloader.py", line 209, in <module>
title, links = get_book_infos(session, url)
File "Archive.org-Downloader-main\archive-org-downloader.py", line 22, in get_book_infos
response = session.get(infos_url)
File "AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py", line
542, in get
return self.request('GET', url, **kwargs)
File "AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py", line
515, in request
prep = self.prepare_request(req)
File "AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py", line
443, in prepare_request
p.prepare(
File "AppData\Local\Programs\Python\Python310\lib\site-packages\requests\models.py", line 3
18, in prepare
self.prepare_url(url, params)
File "AppData\Local\Programs\Python\Python310\lib\site-packages\requests\models.py", line 3
95, in prepare_url
raise InvalidURL("Invalid URL %r: No host supplied" % url)
The URL for the file is https://archive.org/details/masteropticaltec0000deva
I had been using this program successfully for some time, even downloading this very file in JPEG form before, so I'm not sure what's changed. I got this error with the version I was using, then updated and found no difference.
Might I have uninstalled something required? I'm pretty sure I haven't since I last used it, just checking if it's that sort of mistake. I tried an extra slash at the end of the URL, then leaving off the resolution flag: nothing.
I am on Windows 10 64 bit. The rest of the error is long and is attached. Thanks for any help!
restoferror.txt
Hello. Personally I have zero knowledge of coding. I tried to follow the indicated steps, but I couldn't download books. Maybe I should have done something else before following the instructions. Can you add a more detailed description of the steps for those who need an ELI5?
Hi,
First of all, many thanks for this script! It works perfectly.
One small spelling error in the code:
Archive.org-Downloader/downloader.py
Line 119 in f1a72b3
"Donwloading" > "Downloading"
Just a suggestion: allow the user to specify the output directory. I see you already have the directory variable in main, so it is really just a matter of adding it to the args being handled, plus passing it as a parameter to make_pdf.
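A sketch of what that could look like (the -d/--dir flag name is hypothetical, and only the new option is shown, not the script's existing arguments):

```python
import argparse
import os

def build_parser():
    """Minimal parser showing only a hypothetical output-directory
    option; the script's real flags (-e, -p, -u, ...) would be
    registered alongside it."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--dir", default=os.getcwd(),
                        help="Directory to write the PDF to (default: current directory)")
    return parser
```

make_pdf would then join args.dir with the file name instead of writing into the current directory.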
When I tried downloading a book at various resolution levels, and also just borrowed the book and downloaded it into Adobe Digital Editions, the Adobe Digital Editions version is about 50 MB, while downloading with -r 3 gives 151 MB and with -r 4 gives 48 MB. The 48 MB PDF is quite a bit smaller when fully zoomed in compared to the Adobe Digital Editions version fully zoomed in.
It would be great to be able to download a version that matches what you get from Adobe Digital Editions, both in size in MB and in size in inches when fully zoomed in.
Is this possible?
Thanks!
Initially I used an older version, but it also does not work on the latest version.
The command I inputted was (with email and password redacted):
python3 archive-org-downloader.py -e [email protected] -p password -u https://archive.org/details/anarchistvoiceso0000avri
It gave the following output before quitting without generating any files.
1 Book(s) to download
[+] Successful login
========================================
Current book: https://archive.org/details/anarchistvoiceso0000avri
[+] Successful loan
Traceback (most recent call last):
File "archive-org-downloader.py", line 209, in <module>
title, links = get_book_infos(session, url)
File "archive-org-downloader.py", line 22, in get_book_infos
response = session.get(infos_url)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 546, in get
return self.request('GET', url, **kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 519, in request
prep = self.prepare_request(req)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 452, in prepare_request
p.prepare(
File "/usr/lib/python3/dist-packages/requests/models.py", line 313, in prepare
self.prepare_url(url, params)
File "/usr/lib/python3/dist-packages/requests/models.py", line 390, in prepare_url
raise InvalidURL("Invalid URL %r: No host supplied" % url)
requests.exceptions.InvalidURL: Invalid URL 'https:TYPE html>\n<html lang="en">\n<!-- __ _ _ _ __| |_ (_)__ _____\n / _` | \'_/ _| \' \\| |\\ V / -_)\n \\__,_|_| \\__|_||_|_| \\_/\\___| -->\n <head data-release=8a2d548c>\n ....
I redacted the rest after "...." because it seems to be all html and it doesn't fit into the comment.
Great job, @MiniGlome ! :)
It always gives me an error in MINGW32:
bash: pip: command not found
It would be nice to see a video explaining how to install this first and download one book. Thanks.
The title basically covers it. It doesn't seem like there's a check to see whether a folder that a book is being downloaded to already exists. I forget which actual books this happened to me with, but it's easily reproduced by downloading a book (with -j) and then immediately downloading it again.
Also, I don't know if you want to mention this in your documentation, but I managed to get this running in Cygwin (after installing all the dependencies, which as a novice was no easy feat in itself), but only after commenting out "import img2pdf", because img2pdf doesn't compile in Cygwin.
Hi,
This is the error message that I get. My credentials are OK, since I'm logged in on the website.
Edited later: my password contains special characters, like 'P@:ssW0rd'.
Current book: https://archive.org/details/fromempiretoeuro0000owen
[+] Successful loan
[+] Found 550 pages
Traceback (most recent call last):
File "archive-org-downloader.py", line 197, in
os.makedirs(directory)
File "C:\Users\Sivn\AppData\Local\Programs\Python\Python38\lib\os.py", line 223, in makedirs
mkdir(name, mode)
NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\Users\Sivn\Archive.org-Downloader\From_empire_to_Europe_:_the_decline_and_revival_of_British_industry_since_the_Second_World_War'
I'm new to python and coding, so I'm not entirely sure what's causing this :(
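The ':' in the book's title looks like the culprit here: Windows forbids it in file and directory names, hence the "directory name is invalid" error. A sketch of a sanitizer that could be applied to the title before os.makedirs (the function name is my own, not the script's):

```python
import re

def sanitize_title(title):
    """Replace the characters Windows forbids in file/directory
    names (< > : " / \ | ? *) and strip trailing dots/spaces,
    which Windows also rejects."""
    return re.sub(r'[<>:"/\\|?*]', "_", title).strip(" .")
```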
1 Book(s) to download
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 421, in _make_request
six.raise_from(e, None)
File "", line 3, in raise_from
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 416, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.8/http/client.py", line 1348, in getresponse
response.begin()
File "/usr/lib/python3.8/http/client.py", line 316, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.8/http/client.py", line 285, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 719, in urlopen
retries = retries.increment(
File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 400, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/lib/python3/dist-packages/six.py", line 702, in reraise
raise value.with_traceback(tb)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 421, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 416, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.8/http/client.py", line 1348, in getresponse
response.begin()
File "/usr/lib/python3.8/http/client.py", line 316, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.8/http/client.py", line 285, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "archive-org-downloader.py", line 200, in <module>
session = login(email, password)
File "archive-org-downloader.py", line 52, in login
response = session.post("https://archive.org/account/login", data=data, headers=headers)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 581, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
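`RemoteDisconnected` during the login POST usually means the server dropped the connection transiently. A hedged sketch of adding automatic retries to the session with urllib3's `Retry` (standard `requests` adapter-mounting, not code from this repository; `allowed_methods` needs urllib3 >= 1.26 — older versions spell it `method_whitelist`):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries: int = 5) -> requests.Session:
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=1,          # waits 1s, 2s, 4s, ... between attempts
        status_forcelist=(500, 502, 503, 504),
        allowed_methods=None,      # retry POST as well as GET
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

The script's `login()` could then be handed such a session instead of a bare `requests.Session()`.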
Hi, I'm wondering if it's possible for this code to be applied to something more automated that downloads all of the original high-res scanned JPG files after a publication is borrowed for the 1 hour.
For example, pasting the archive.org link into an online downloader, a Firefox add-on, or a plugin for JDownloader.
I'm not a developer or coder, so I'm searching for a simpler solution.
Here's a link to the type of publications I need to grab:
https://archive.org/details/pub_interview?sort=-addeddate
I download a lot of books in other languages whose titles include many non-ASCII special characters. The script as written strips out everything but ASCII letters and numbers. However, if I remove the code that does the stripping, it seems to handle non-ASCII characters just fine. Change line 17 to title = "".join([c for c in data['brOptions']['bookTitle']]).
Here's an example book: https://archive.org/details/dictionnairetymo0000bloc
It has Γ© and Γ§ in the title and they are preserved in the file name after the script change.
Here are some other example books that have non-ASCII characters in their titles which worked with this script modification:
https://archive.org/details/bdrc-W1AB6
https://archive.org/details/morisasakihindik0000unse
https://archive.org/details/hindijapanesedic00kazu
https://archive.org/details/kainantohogenkis008800
I'm guessing the ASCII-only stripping is really only needed for Python 2 users, since Python 2 does not handle Unicode encodings automatically the way Python 3 does. But the examples invoke python3, so that's what users should be running anyway.
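Rather than keeping every character, a middle-ground sketch: preserve non-ASCII letters but replace only the characters that are actually invalid in Windows file names (the cause of the NotADirectoryError reported in another issue here). `title_to_dirname` is a hypothetical helper, not the script's own code:

```python
import re

def title_to_dirname(book_title: str) -> str:
    # Keep Unicode letters (é, ç, CJK, ...) but replace characters
    # that are invalid on Windows or are control characters.
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", book_title)
    return cleaned.strip(". ") or "untitled"
```

This keeps titles like "Dictionnaire étymologique" intact while still producing portable folder names.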
Hi, I'd like to use this script.
I installed Python and Git, and Python is already configured in my environment variables.
When I check the Python version inside Git Bash, I can see that 3.10.7 is installed.
Nevertheless, when I run your script it says "python not found".
I followed your instructions to the letter but still couldn't use the script.
I admit I am totally illiterate in programming. Thank you.
Sometimes multiple variants of the same book are available, and it can be desirable to download all of them for comparison, in order to choose the best-quality version.
Unfortunately, the download folder's unique name is normally replaced by the long book title. Since variants often share the very same title, downloading several of them results in each overwriting the previous one.
It would be preferable to retain the unique identifier for each downloaded variant, especially since that also makes it possible to clearly identify the original download source later.
The following modification does the trick:
--- archive-org-downloader.py	2021-10-21 08:35:41.589757183 +0200
+++ myarchive-org-downloader.py	2021-12-07 06:51:12.078410887 +0200
@@ -197,7 +197,7 @@
     session = loan(session, book_id)
     title, links = get_book_infos(session, url)
-    directory = os.path.join(os.getcwd(), title)
+    directory = os.path.join(os.getcwd(), book_id)
     if not os.path.isdir(directory):
         os.makedirs(directory)
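A variant of the patch above that keeps the human-readable title while still guaranteeing uniqueness: append the identifier to the title. This is a sketch only; `title` and `book_id` are the names used in the diff:

```python
import os

def unique_book_dir(base: str, title: str, book_id: str) -> str:
    # "Title (identifier)" stays human-readable but cannot collide
    # across variants of the same book, and records the source id.
    return os.path.join(base, f"{title} ({book_id})")
```

For example, two scans with the same title would land in "Title (id-one)" and "Title (id-two)" instead of overwriting each other.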