
archive.org-downloader's People

Contributors

bigchipbag, cerumo, claudeha, milahu, miniglome


archive.org-downloader's Issues

Gratefulness

I didn't know where else to say thanks. Your dev effort is really amazing.

Thanks a lot, @MiniGlome ... Wish you the best πŸ’―

Memory usage

When working through many books from a .txt list, the script accumulates memory after every book, climbing steadily (to around 8000000 in my process monitor) until it hangs.
I believe this shouldn't happen: it should free memory after every book, but I don't have enough knowledge to figure out the cause.
Maybe you could look into this.
And thanks for the great script!

PS. I believe the problem is somewhere in the PDF converter, because when I run with the -j flag I don't get this extra memory usage.
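A hedged workaround, assuming the leak really does sit in the in-process PDF conversion: run each book's conversion in a short-lived child process, so whatever memory it accumulates is returned to the OS when the process exits. `run_in_child` is an illustrative helper, not part of the script:

```python
import multiprocessing

def run_in_child(fn, *args):
    """Run fn(*args) in a child process and wait for it; any memory the
    call allocates is reclaimed by the OS when the child exits."""
    p = multiprocessing.Process(target=fn, args=args)
    p.start()
    p.join()
    return p.exitcode == 0
```

Per book, something like `run_in_child(make_pdf, images, title)` would replace the direct conversion call; the page downloads themselves could stay in the parent process.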

3 Feature Requests - Bulk Download | Save as JPG | Return Book

Feature Request 1:
A way to bulk download multiple books. I have hard-coded the login details and book quality; all that's missing is some method to pass multiple URLs through the script. Maybe from a text file?

Feature Request 2:
Output to individual JPGs in a folder or a compressed zip file, rather than a PDF.

Feature Request 3:
Ability to return the book after its images have been saved successfully.

(Your script works AMAZING! These 3 features would make it... literally the best!)
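For Feature Request 1, a minimal sketch of the missing piece, assuming one URL per line in a plain text file (`read_url_list` is a name I made up):

```python
def read_url_list(path):
    """Return the archive.org URLs listed one per line in `path`,
    skipping blank lines and stray surrounding whitespace."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```

The main download loop could then simply iterate over `read_url_list("books.txt")`; later issues here mention that the script gained a file-list option along these lines.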

Weird units...

[00:02<00:00, 7.08it/s]

Is it/s some weird unit, or is it Mbit/s or similar that's being weirdly cropped?
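For context: `it/s` is the default rate unit of tqdm, the progress-bar library, and means iterations per second. Here one iteration is one downloaded page, so 7.08 it/s is roughly seven pages per second, not a bandwidth figure. A minimal illustration:

```python
import time
from tqdm import tqdm  # the progress-bar library behind the script's output

# Each loop iteration counts as one "it"; the bar reports the rate as it/s.
for _ in tqdm(range(20), desc="pages"):
    time.sleep(0.01)
```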

Download directories are nested instead of being deleted for each downloaded file

I'm using the script on macOS 12.4 with Python 3.9.x (currently 3.9.13). I recently upgraded the script after being behind a few versions and the last two versions of the script have a bug where the directory that is created for each file isn't usually deleted. Instead, the next file's directory is created inside of that one, and the next one inside of that one, and so on.

 - downloadFileDirOne
   - downloadFileDirTwo
     - downloadFileDirThree

It seems to be deleting only the directory of the last file downloaded from a download list, and not each directory in turn as they are emptied after PDFs are made. The expected behavior would be that each directory is deleted after its PDF is made.

I don't know enough about this, but my guess is that `directory` isn't being properly defined for the shutil.rmtree(directory) call at line 225. Replacing this code at lines 209-216

		directory = os.path.join(directory, title)
		# Handle the case where multiple books with the same name are downloaded
		i = 1
		d = directory
		while os.path.isdir(directory):
			directory = f"{d}({i})"
			i += 1
		os.makedirs(directory)

with this code from an earlier version of the script solves the problem:

		directory = os.path.join(os.getcwd(), title)
		if not os.path.isdir(directory):
			os.makedirs(directory)

The "handle the case where multiple books with the same name are downloaded" doesn't seem to be necessary at 209-216, at least according to my testing, because the case is already handled at lines 141-145 with:

	# Handle the case where multiple books with the same name are downloaded
	i = 1
	while os.path.isfile(os.path.join(directory, file)):
		file = f"{title}({i}).pdf"
		i += 1

Additionally, if the nest of folders is too deep, then an error occurs:

Traceback (most recent call last):
  File "/Users/username/Archive.org-Downloader/archive-org-downloader.py", line 216, in <module>
    os.makedirs(directory)
  File "/usr/local/Cellar/python@3.9/3.9.13_1/Frameworks/Python.framework/Versions/3.9/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
OSError: [Errno 63] File name too long: "
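A sketch that combines both behaviours, so the fix above doesn't have to drop collision handling: keep the base directory fixed per book (preventing nesting) while still renaming on collision. `unique_dir` is an illustrative helper, not the script's actual code:

```python
import os

def unique_dir(base, title):
    """Create base/title, falling back to base/title(1), base/title(2), ...
    if it already exists. base never changes, so directories cannot nest."""
    path = os.path.join(base, title)
    candidate, i = path, 1
    while os.path.isdir(candidate):
        candidate = f"{path}({i})"
        i += 1
    os.makedirs(candidate)
    return candidate
```

Calling it as `directory = unique_dir(os.getcwd(), title)` for every book would reproduce the old behaviour plus duplicate-name handling.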

Page numbering of .JPGs causing wrong order upon re-assembly

I don't know the name / terms for this issue, but it's the same described here https://www.tenforums.com/general-support/165181-sort-problem-i-get-1-10-11-2-rather-than-1-2-how-do-i-fix.html

The 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 order will end up as 1, 11, 12 etc when re-assembled into a pdf with another program.

This isn't an issue with another program, just a classic filename issue. I thought I had solved this with Irfanview batch but I messed up enough times to come and ask for a fix πŸ˜›
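This is plain lexicographic sorting: "10" sorts before "2" when compared as strings. A hedged fix on the downloader side, assuming pages are saved as 1.jpg, 2.jpg, ... (the helper name is mine):

```python
import os

def zero_pad_jpgs(folder, width=4):
    """Rename 1.jpg ... 550.jpg to 0001.jpg ... 0550.jpg so lexicographic
    order matches numeric page order in any external tool."""
    for name in os.listdir(folder):
        stem, ext = os.path.splitext(name)
        if ext.lower() == ".jpg" and stem.isdigit():
            os.rename(os.path.join(folder, name),
                      os.path.join(folder, f"{int(stem):0{width}d}{ext}"))
```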

Error on the described query

I followed the exact steps and everything looked fine, but running the command on this book (https://archive.org/details/oraldiagnosis0000kerr) fails.
The exact query is as follows:
python3 archive-org-downloader.py -e [email protected] -p password -r 0 -u https://archive.org/details/oraldiagnosis0000kerr

1 Book(s) to download
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 421, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 416, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.8/http/client.py", line 1344, in getresponse
    response.begin()
  File "/usr/lib/python3.8/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.8/http/client.py", line 276, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 719, in urlopen
    retries = retries.increment(
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 400, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 702, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 421, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 416, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.8/http/client.py", line 1344, in getresponse
    response.begin()
  File "/usr/lib/python3.8/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.8/http/client.py", line 276, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "archive-org-downloader.py", line 191, in <module>
    session = login(email, password)
  File "archive-org-downloader.py", line 50, in login
    response = session.post("https://archive.org/account/login", data=data, headers=headers)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 581, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

"This book doesn't need to be borrowed" error for some books

Got this error and found the reason: sometimes URLs in the TXT file have trailing spaces after the URL.

It would be best to add url = url.rstrip() inside the main loop, as the error is very hard for the user to spot.

I'd also strongly advise printing the current book name between quotation marks, so any stray characters become visible.
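The suggestion amounts to something like this inside the main loop (a sketch; the names are illustrative):

```python
def normalize_url(raw):
    """Trim the whitespace that sometimes trails URLs in a .txt list."""
    return raw.strip()

# In the main loop, quoting the name makes stray characters visible:
#   url = normalize_url(raw_line)
#   print(f'Current book: "{url}"')
```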

Something went wrong trying to borrow this book

1 Book(s) to download
[+] Successful login
========================================
Current book: https://archive.org/details/artofhungarianco00benn/
Something went wrong when trying to borrow the book, maybe you can't borrow this book
<Response [400]>
{"error":"No identifier provided."}

This is the error I get no matter what, whether I've borrowed the book or not.

Remote end closed connection without response

I've tried the solution suggested by human but it didn't work. Please help: my young brother has sent me a list of the books needed for his drama class and I need your help.
My command:

$ python3 archive-org-downloader.py -e [email protected] -p 00000000 -r 0 -u https://archive.org/details/bullshtartistlea0000klei/page/5/mode/2up
1 Book(s) to download
Traceback (most recent call last):
  File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 1369, in getresponse
    response.begin()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 310, in begin
    version, status, reason = self._read_status()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 279, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\requests\adapters.py", line 450, in send
    timeout=timeout
  File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 786, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\util\retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\packages\six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\urllib3\connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 1369, in getresponse
    response.begin()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 310, in begin
    version, status, reason = self._read_status()
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.2544.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 279, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "archive-org-downloader.py", line 191, in <module>
    session = login(email, password)
  File "archive-org-downloader.py", line 50, in login
    response = session.post("https://archive.org/account/login", data=data, headers=headers)
  File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\requests\sessions.py", line 577, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\requests\sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\requests\sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\Kristen\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\requests\adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

suggestion: use tesseract instead of img2pdf

Hi,

Many thanks for this highly useful tool!

Currently I download using --jpg and manually rename to the correct order (see also #53). Then to get a PDF with selectable/searchable text I use Tesseract OCR to analyse the images and make a PDF. The process I used last time (on Debian) was:

# download bookname with -r 0 -j
cd bookname
rename s/^/0/ ?.jpg
rename s/^/0/ ??.jpg
# rename s/^/0/ ???.jpg  # repeat as needed for books with 000s of pages...
cd ..
ls -1 bookname/*.jpg > index.txt
tesseract index.txt bookname pdf
# output in bookname.pdf

It would be great if this could be automated. I might attempt to implement it myself.
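A sketch of that automation, assuming the `tesseract` binary is on PATH and pages were saved as plain numbered JPGs; sorting by numeric stem makes the zero-padding renames above unnecessary (both function names are mine):

```python
import subprocess
from pathlib import Path

def numeric_jpg_order(folder):
    """Page images sorted by their numeric stem, so 2.jpg precedes 10.jpg."""
    return sorted(Path(folder).glob("*.jpg"), key=lambda p: int(p.stem))

def jpgs_to_searchable_pdf(folder, out_stem):
    """Write an index file listing the pages in reading order, then let
    Tesseract OCR them into <out_stem>.pdf with a selectable text layer."""
    index = Path(folder) / "index.txt"
    index.write_text("\n".join(str(p) for p in numeric_jpg_order(folder)))
    subprocess.run(["tesseract", str(index), out_stem, "pdf"], check=True)
```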

Advantages of Tesseract: selectable searchable text
Disadvantages of Tesseract: can be much slower

Both img2pdf and Tesseract keep JPGs as-is without re-encoding at all.

Cheers!

OCR

Is OCRing the PDF possible using something like Tesseract or OCRmyPDF?

"Invalid credentials" error

Preface: I attempted to run this script on this setup:

  • Windows 7 Ultimate x64 machine, fully updated up to current via ESU
  • Python ver. 3.8.10; other necessary components like git were installed and ran correctly.

When trying to run the script to grab a currently borrowed book, I kept getting the same error referenced in (#36). Below are two examples of the variations that I typed in an attempt to fix the error:

I still get an 'Invalid credentails!' error for both, so I am unsure what I'm doing wrong here.

img2pdf.ImageOpenError after downloading

Hello,

Thanks a lot for this package. I tried downloading two books from archive.org. First worked successfully, for the second I get an error regarding img2pdf. All requirements seem to be met.

[+] Successful login
[+] Successful loan
[+] Found 262 pages
Donwloading pages...
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 262/262 [02:59<00:00, 1.46it/s]
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/img2pdf.py", line 1349, in read_images
    imgdata = Image.open(im)
  File "/usr/lib/python3.9/site-packages/PIL/Image.py", line 2958, in open
    raise UnidentifiedImageError(
PIL.UnidentifiedImageError: cannot identify image file <_io.BytesIO object at 0x7f46f76bd450>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/Downloads/archive_dl/Archive.org-Downloader/downloader.py", line 123, in <module>
    pdf = img2pdf.convert(images)
  File "/usr/lib/python3.9/site-packages/img2pdf.py", line 2032, in convert
    ) in read_images(rawdata, kwargs["colorspace"], kwargs["first_frame_only"]):
  File "/usr/lib/python3.9/site-packages/img2pdf.py", line 1353, in read_images
    raise ImageOpenError(
img2pdf.ImageOpenError: cannot read input image (not jpeg2000). PIL: error reading image: cannot identify image file <_io.BytesIO object at 0x7f46f76bd450>

Happy to share the archive.org book url, not sure if that violates github TOS.

Is it still working?

Hi, I used to download a few books with this script a couple of months ago, but now it always gives me a weird, very long output with HTML lines and a bunch of numbers, without downloading anything. I suppose Archive.org might have changed something?
It connects, identifies the book and borrows it, but doesn't download.

reget for interrupted downloads

If a download session is interrupted for some reason, or if fetched images are incomplete, any attempt to continue downloading always starts from zero again. This is very unfortunate for large books with hundreds of pages, where multiple download attempts multiply the actual download size.
It would be great if a reget feature for incomplete downloads, like the one we are used to with wget/curl, could be added.
It would be a boon for people with an unstable, slow, or volume-limited internet connection.
Alternatively, if downloads could be limited to a specific page range or to specific single pages, incomplete downloads could be selectively corrected.
Thanks!
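A minimal sketch of the resume check, assuming each page is saved as <n>.jpg inside the book's directory (the function name is mine):

```python
import os

def pages_to_fetch(directory, n_pages):
    """Page numbers whose image is missing on disk; downloading only
    these lets an interrupted run resume instead of restarting."""
    return [i for i in range(n_pages)
            if not os.path.isfile(os.path.join(directory, f"{i}.jpg"))]
```

A size or checksum check would also be needed to catch truncated images left behind by a dropped connection.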

Attempting to download books that don't need to be borrowed results in an error

For books in the public domain, etc.
For example, trying to download this:
https://archive.org/details/dli.ernet.247978
Gives this error:
Something went wrong when trying to borrow the book, maybe you can't borrow this book
<Response [400]>
{"error":"You do not currently have this book borrowed."}

Granted, on the book's page you can simply download a zip with the individual images, but if I'm just copying the URLs of several books, I'm usually not checking individually whether they need to be borrowed or not. In the case of this author, for instance, this book does not need to be borrowed, but the rest do:
https://archive.org/search.php?query=creator%3A(Mure%20Pierre)%20AND%20mediatype%3A(texts)

Thank you so much! Your app is great!

As shown in the title, this is not an issue, but I don't know how to say thank you except by creating one (I don't own any Bitcoin, unfortunately).
You can't imagine how happy I was: I could not find a way to access a book on Archive.org with no PDF available, until I found your app.
Downloaded it, installed it, ran it and boom! The PDF file is on my desktop! Fantastic!

Again, thank you very much for your hard work!
Wish you all the best in your life!

"No host supplied" when trying to download any book

If I try to download any book, say this one from the README:

python3 -m downloader -e [email protected] -p mypassword -u https://archive.org/details/elblabladelosge00gaut

I get hit with a big fat error message:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/jack/Code/ArchiveDownloader/Codebase/downloader.py", line 190, in <module>
    title, links = get_book_infos(session, url)
  File "/home/jack/Code/ArchiveDownloader/Codebase/downloader.py", line 15, in get_book_infos
    response = session.get(infos_url)
  File "/home/jack/.local/lib/python3.8/site-packages/requests/sessions.py", line 555, in get
    return self.request('GET', url, **kwargs)
  File "/home/jack/.local/lib/python3.8/site-packages/requests/sessions.py", line 528, in request
    prep = self.prepare_request(req)
  File "/home/jack/.local/lib/python3.8/site-packages/requests/sessions.py", line 456, in prepare_request
    p.prepare(
  File "/home/jack/.local/lib/python3.8/site-packages/requests/models.py", line 316, in prepare
    self.prepare_url(url, params)
  File "/home/jack/.local/lib/python3.8/site-packages/requests/models.py", line 393, in prepare_url
    raise InvalidURL("Invalid URL %r: No host supplied" % url)
requests.exceptions.InvalidURL: Invalid URL 'https:TYPE html>\n<html lang="en">\n<!-- __ _ _ _ __| |_ (_)__ _____\n    / _` | \'_/ _| \' \\| |\\ V / -_)\n    \\__,_|_| \\__|_||_|_| \\_/\\___| -->\n  <head data-release=af39621e>\n    <title>El blablΓ‘ de los gemelos : Gauthier, Bertrand, 1945- : Free Download, Borrow, and Streaming : Internet Archive</title>\n\n          <meta name="viewport" 

... snip - a gigantic amount of HTML ...

        });\n      </script>\n      </div>\n': No host supplied

script error when trying load the downloader

When trying to use the downloader, I get an error that the module requests doesn't exist, but when I try to install it, it shows that it is already installed. I'm using Python 3.10.4.

Using --file for different URLs but with the same name overwrites the previous PDFs

When using the file list option and downloading four volumes from the same series β€” which have the same name in Internet Archive β€” they are given the same name by this downloader when the PDF is created, and therefore will overwrite each other.

For example, the first four results of this search are all different, even though they have the same title on their respective pages.

https://archive.org/search.php?query=Schweizer+lexikon&and[]=mediatype%3A%22texts%22

If you add all four URLs to your download list, at the end you will end up with just the PDF of the final volume.

For the moment, to drastically decrease the chances of this happening, I have used an available variable to add the page count to the file name when writing the PDF. This won't always prevent overwrites, however.

The changed code:

	def make_pdf(pdf, title):
		with open(f"{title}-{len(links)}pp.pdf", "wb") as f:
			f.write(pdf)
		print(f"[+] PDF saved as \"{title}-{len(links)} pp.pdf\"")

interrupting download is harder than necessary

A single Ctrl-C doesn't do much; the downloads continue (tested on Debian Linux).

Repeatedly mashing Ctrl-C does eventually work, but with a flood of stack traces in the terminal.

This seems to be a common issue with Python 3 thread pools; I'm currently trying to find out what the best fix is.
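One common fix, sketched here under the assumption that the downloads run through concurrent.futures (`cancel_futures` needs Python 3.9+; the function name is mine):

```python
import concurrent.futures

def download_all(pages, worker, threads=10):
    """Run worker(page) across a thread pool, but react to Ctrl-C:
    pending (not-yet-started) downloads are cancelled instead of draining."""
    with concurrent.futures.ThreadPoolExecutor(threads) as pool:
        futures = [pool.submit(worker, p) for p in pages]
        try:
            return [f.result() for f in futures]
        except KeyboardInterrupt:
            pool.shutdown(wait=False, cancel_futures=True)  # Python 3.9+
            raise
```

Workers that are already mid-request still finish their current page; only the queued ones are dropped, so a single Ctrl-C exits after at most one in-flight page per thread.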

archive.org downloader, help

Hello, how are you?
Sorry to bother you, but I'd need a little help.

You published a very interesting code for downloading borrowed files from "archive.org", but unfortunately I don't have much knowledge of Python. I didn't understand where I should enter the data [email, password, desired link, quality and file type]; every time I run the program, it just prints the options and exits.

Could you please tell me where I should replace this information to perform the download?
Or maybe modify the code so that in the terminal it asks: "email, pass, desired link and quality"? In this way many people who don't know python can enjoy your great code.

By the way, I know that for you it seems very easy and it would be up to me to study python, however it wasn't for lack of will or commitment, but I don't have much aptitude for programming and I can't advance.

Thank you very much.

Best regards,
Montoro.

PS1: Do I need to borrow the book for your code to work?
PS2: img2pdf==0.4.0 doesn't install at all

Option to not return book?

Thanks for the tool, it works beautifully. Just wondering if it would be possible to add a flag that disables the auto-return of loaned books? I was trying to download both the PDF and the JPGs of a particular book and had to re-loan it between the two downloads. It would be nice to be able to do multiple runs, in case something goes wrong, and only then choose to return the book. Thanks!

InvalidURL "no host supplied" error after successful loan, doesn't download

Archive.org-Downloader-main>archive-org-downloader.py -e email -p password -u https://archive.org/details/masteropticaltec0000deva -r 0
1 Book(s) to download
[+] Successful login
========================================
Current book: https://archive.org/details/masteropticaltec0000deva
[+] Successful loan
Traceback (most recent call last):
  File "Archive.org-Downloader-main\archive-org-downloader.py", line 209, in <module>
    title, links = get_book_infos(session, url)
  File "Archive.org-Downloader-main\archive-org-downloader.py", line 22, in get_book_infos
    response = session.get(infos_url)
  File "AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py", line 542, in get
    return self.request('GET', url, **kwargs)
  File "AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py", line 515, in request
    prep = self.prepare_request(req)
  File "AppData\Local\Programs\Python\Python310\lib\site-packages\requests\sessions.py", line 443, in prepare_request
    p.prepare(
  File "AppData\Local\Programs\Python\Python310\lib\site-packages\requests\models.py", line 318, in prepare
    self.prepare_url(url, params)
  File "AppData\Local\Programs\Python\Python310\lib\site-packages\requests\models.py", line 395, in prepare_url
    raise InvalidURL("Invalid URL %r: No host supplied" % url)

The URL for the file is https://archive.org/details/masteropticaltec0000deva

I had been using this program successfully for some time now, even downloading this file beforehand in jpeg form, so not sure what's changed- I got this error with the version I was using, then updated and found no difference.

Might I have uninstalled something required? I'm pretty sure I haven't since I last used it, just checking if it's that sort of mistake. I tried an extra slash at the end of the url, then left off the resolution flag- nothing.

I am on Windows 10 64 bit. The rest of the error is long and is attached. Thanks for any help!
restoferror.txt

Suggestion

Hello. Personally I have zero knowledge of coding. I tried to apply the indicated steps, but I couldn't download books. Maybe I should have done something else before following the instructions. Could you add a more detailed description of the steps for those who need an ELI5?

Suggestion: Add option for output directory

Just a suggestion to allow the user to specify the output directory. I see you already have the directory variable in main so it is actually a matter of adding it to the args being handled, plus adding it as a parameter for make_pdf.
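The change could look roughly like this (a sketch: the -d/--outdir flag name is my invention, and make_pdf's extra parameter would mirror it):

```python
import argparse
import os

parser = argparse.ArgumentParser()
# ... the script's existing -e/-p/-u/-r/-j options would sit here ...
parser.add_argument("-d", "--outdir", default=os.getcwd(),
                    help="directory where PDFs and page images are written")

args = parser.parse_args(["--outdir", "/tmp/books"])  # demo invocation
print(args.outdir)
```

make_pdf (and the per-book image directory) would then join paths against `args.outdir` instead of the current working directory.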

Way to set resolution to whatever the normal would be when downloading a book from archive.org

When I tried downloading a book with various resolution levels, and then also just borrowed the book and downloaded it to adobe digital editions, the version that downloaded is about 50 MB, and downloading with -r 3 is 151 MB and with -r 4 is 48 MB. The 48 MB pdf is quite a bit smaller when fully zoomed in compared to the adobe digital editions version fully zoomed in.

I guess it would be great to be able to download a version that is the same size as what you get when downloading to adobe digital editions--both size in MB and size in inches when fully zoomed in.

Is this possible?

Thanks!

Script no longer works when before it used to work

Initially I used an older version, but it also does not work on the latest version.

The command I inputted was (with email and password redacted):

python3 archive-org-downloader.py -e [email protected] -p password -u https://archive.org/details/anarchistvoiceso0000avri

It gave the following output before quitting without generating any files.

1 Book(s) to download
[+] Successful login
========================================
Current book: https://archive.org/details/anarchistvoiceso0000avri
[+] Successful loan
Traceback (most recent call last):
  File "archive-org-downloader.py", line 209, in <module>
    title, links = get_book_infos(session, url)
  File "archive-org-downloader.py", line 22, in get_book_infos
    response = session.get(infos_url)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 519, in request
    prep = self.prepare_request(req)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 452, in prepare_request
    p.prepare(
  File "/usr/lib/python3/dist-packages/requests/models.py", line 313, in prepare
    self.prepare_url(url, params)
  File "/usr/lib/python3/dist-packages/requests/models.py", line 390, in prepare_url
    raise InvalidURL("Invalid URL %r: No host supplied" % url)
requests.exceptions.InvalidURL: Invalid URL 'https:TYPE html>\n<html lang="en">\n<!-- __ _ _ _ __| |_ (_)__ _____\n    / _` | \'_/ _| \' \\| |\\ V / -_)\n    \\__,_|_| \\__|_||_|_| \\_/\\___| -->\n  <head data-release=8a2d548c>\n   .... 

I redacted the rest after "...." because it seems to be all html and it doesn't fit into the comment.

pip install -r requirements.txt

It always gives me an error in MINGW32:
bash: pip: command not found

It would be nice to see a video explaining how to install this first and then download one book. Thanks.

When downloading without converting to PDF, books in folders with the same name get overwritten

The title basically covers it. It doesn't seem like there's a check to see whether a folder that a book is being downloaded to already exists. I forget which actual books this happened to me with, but it's easily reproduced by downloading a book (with -j) and then immediately downloading it again.

Also, I don't know if you want to mention this in your documentation, but I managed to get this running in Cygwin (after installing all the dependencies, which as a novice was no easy feat in itself), but only after commenting out "import img2pdf", because img2pdf doesn't compile in Cygwin.

Invalid credential

Hi,
This is the error message that I get. My credentials are OK since I'm logged in on the website.

Edited later:

  • It works for passwords that don't have special characters.
  • For those that do (and for n00bs): you should specify that such passwords must be passed in quotes, like 'P@:ssW0rd'.

NotADirectoryError: directory name is invalid?

1 Book(s) to download
[+] Successful login

Current book: https://archive.org/details/fromempiretoeuro0000owen
[+] Successful loan
[+] Found 550 pages
Traceback (most recent call last):
File "archive-org-downloader.py", line 197, in
os.makedirs(directory)
File "C:\Users\Sivn\AppData\Local\Programs\Python\Python38\lib\os.py", line 223, in makedirs
mkdir(name, mode)
NotADirectoryError: [WinError 267] The directory name is invalid: 'C:\Users\Sivn\Archive.org-Downloader\From_empire_to_Europe_:_the_decline_and_revival_of_British_industry_since_the_Second_World_War'

I'm new to python and coding, so I'm not entirely sure what's causing this :(
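The crash comes from the colon in `..._Europe_:_the_decline...`: Windows forbids : (and a few other characters) in directory names, so os.makedirs fails with WinError 267. A minimal sanitizer sketch that replaces the reserved characters before creating the directory (the helper is illustrative, not the script's code):

```python
import re

# Characters that Windows rejects in file and directory names.
_WINDOWS_RESERVED = r'[<>:"/\\|?*]'

def safe_dirname(title):
    """Replace reserved characters and trim trailing dots/spaces,
    both of which Windows refuses in directory names."""
    cleaned = re.sub(_WINDOWS_RESERVED, "_", title)
    return cleaned.rstrip(". ")
```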

Error while downloading a book.

1 Book(s) to download
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 421, in _make_request
six.raise_from(e, None)
File "", line 3, in raise_from
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 416, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.8/http/client.py", line 1348, in getresponse
response.begin()
File "/usr/lib/python3.8/http/client.py", line 316, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.8/http/client.py", line 285, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 719, in urlopen
retries = retries.increment(
File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 400, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/lib/python3/dist-packages/six.py", line 702, in reraise
raise value.with_traceback(tb)
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 665, in urlopen
httplib_response = self._make_request(
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 421, in _make_request
six.raise_from(e, None)
File "", line 3, in raise_from
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 416, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.8/http/client.py", line 1348, in getresponse
response.begin()
File "/usr/lib/python3.8/http/client.py", line 316, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.8/http/client.py", line 285, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "archive-org-downloader.py", line 200, in
session = login(email, password)
File "archive-org-downloader.py", line 52, in login
response = session.post("https://archive.org/account/login", data=data, headers=headers)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 581, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python3/dist-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
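The Connection aborted / RemoteDisconnected failure here is a transient network error during the login POST. requests can be told to retry such failures automatically via urllib3's Retry; a sketch, assuming urllib3 >= 1.26 (older versions spell the allowed_methods parameter method_whitelist):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_retrying_session(retries=5, backoff=1.0):
    """Build a requests.Session that retries dropped connections
    and common transient HTTP status codes with exponential backoff."""
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=backoff,
        status_forcelist=(429, 500, 502, 503, 504),
        # POST is included because the failing call is the login POST;
        # allowed_methods needs urllib3 >= 1.26.
        allowed_methods=frozenset(["GET", "POST"]),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

Using such a session in place of the plain `requests.Session()` would let the script ride out the occasional dropped connection instead of crashing.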

request: online downloader, addon extension or plugin for jdownloader?

Hi, I'm wondering if it's possible for this code to be applied to something more automated that downloads all of the original high-res scanned JPG files after a publication is borrowed for the 1 hour.

For example, by copying the archive.org link into an online downloader, a Firefox add-on/extension, or a plugin for JDownloader.

I'm not a developer or coder so I'm searching for a simpler solution.

Here's a link to the type of publications I need to grab:
https://archive.org/details/pub_interview?sort=-addeddate

Allowing the script to handle non-ASCII special characters

I download a lot of books in other languages which include a lot of non-ASCII special characters in their titles. The script as written strips out all but ASCII characters and numbers. However, if I remove the code that does the stripping, it seems to handle non-ASCII characters just fine. Change line 17 to title = "".join([c for c in data['brOptions']['bookTitle']]).

Here's an example book: https://archive.org/details/dictionnairetymo0000bloc

It has Γ© and Γ§ in the title and they are preserved in the file name after the script change.

Here are some other example books that have non-ASCII characters in their titles which worked with this script modification:

https://archive.org/details/bdrc-W1AB6
https://archive.org/details/morisasakihindik0000unse
https://archive.org/details/hindijapanesedic00kazu
https://archive.org/details/kainantohogenkis008800

I'm guessing Python 2 users would really need the ASCII-only stripping, as it does not handle Unicode encodings automatically like Python 3 does. Yet the examples invoke python3, so that's what users should be using.

can't run the script

Hi, I'd like to use this script.
I installed Python and Git, and Python is already configured in my environment variables.
When I check the Python version inside Git Bash, I can see I have version 3.10.7 installed.
Nevertheless, at the moment of running your script it says "python not found".
I followed your instructions to the letter but still couldn't use the script.
I admit I am totally illiterate in programming. Thank you.

preserve unique identifier of download item

Sometimes multiple variants of the same book are available and it might be desirable to download all of them for comparison purposes, in order to choose the best quality version.

Unfortunately, the download folder is normally renamed from its unique identifier to the long book title. As each unique variant often carries the very same title, downloading multiple variants results in them overwriting each other.

It is preferable to retain the unique identifier for each downloaded variant, especially since it also makes it possible to clearly identify the original download source later.

To ensure this, the following modification does the trick:

--- archive-org-downloader.py    2021-10-21 08:35:41.589757183 +0200
+++ myarchive-org-downloader.py  2021-12-07 06:51:12.078410887 +0200
@@ -197,7 +197,7 @@
                session = loan(session, book_id)
                title, links = get_book_infos(session, url)
 
-               directory = os.path.join(os.getcwd(), title)
+               directory = os.path.join(os.getcwd(), book_id)
                if not os.path.isdir(directory):
                        os.makedirs(directory)
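A middle ground, if a readable folder name is still wanted, is to append the unique identifier to the title, so distinct variants of the same title never collide. A sketch assuming the script's title and book_id variables:

```python
import os

def variant_directory(title, book_id):
    """Human-readable directory name, but unique per archive.org item."""
    return os.path.join(os.getcwd(), f"{title}_{book_id}")
```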
