
archivenow's People


a-mabe, evil-wayback, ibnesayeed, lebnan, machawk1, myano, ruebot, shawnmjones, veloute, waybackarchiver



archivenow's Issues

[Feature request] Add retry logic


The program should have retry logic in case the request to the archive service fails. In my experience, this happens a lot with The Internet Archive. For example:

$ archivenow --ia --is
Error (The Internet Archive): HTTPSConnectionPool(host='', port=443): Max retries exceeded with url: /save/ (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f6d61a34d10>, 'Connection to timed out. (connect timeout=120)'))

I would prefer that the command does not complete before it actually succeeds with the requests to all of the given archive services, or at least before a certain number of maximum retries (per service) is reached. The retry count should be configurable, via a command line option (e.g. --max-retries 20), and it should have a reasonable default (5?) in case the option isn’t given by the user.

Currently, the user has to manually issue new archivals for the services for which the request was unsuccessful.
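A minimal sketch of the per-service retry behaviour requested here, not archivenow's actual code. The names push_with_retries and push_once are hypothetical, and the default of 5 attempts simply mirrors the default suggested above.

```python
import time


def push_with_retries(push_once, max_retries=5, delay=0.0):
    """Call push_once() until it succeeds or max_retries attempts are used.

    push_once is a hypothetical zero-argument callable that submits one URI
    to one archive service and raises an exception on failure.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return push_once()
        except Exception as exc:  # e.g. a connection timeout
            last_error = exc
            if attempt < max_retries:
                time.sleep(delay)  # wait before the next attempt
    raise RuntimeError(f"gave up after {max_retries} attempts") from last_error
```

A --max-retries command line option could pass its value straight through as max_retries, with each archive service wrapped separately so that one failing service does not block the others.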

Archive images in IA

It would be nice if the tool would also archive embedded content for Internet Archive requests. This could be done by downloading the archived page and searching for any /save/_embed/[^"'<>\(\)]* URLs in the page source.

(It would also be nice if the tool could download lazy-loaded files and/or any linked media files, although even the Wayback Machine can't really do that in many cases.)
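The search step proposed in this issue could be sketched as follows. It uses the regular expression quoted above; find_embed_urls is a hypothetical helper name, and downloading the archived page is left out.

```python
import re

# Pattern taken from the issue text: an embed path runs until a quote,
# angle bracket, or parenthesis.
EMBED_RE = re.compile(r"""/save/_embed/[^"'<>()]*""")


def find_embed_urls(page_source):
    """Return the distinct /save/_embed/ paths found in the page source."""
    return sorted(set(EMBED_RE.findall(page_source)))
```

Each path returned would then have to be requested from the archive in turn for the embedded resource to be captured.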

Docker and

The docker image contains a very outdated version of /app/archivenow/handlers/

There are three versions of this file: The one in Docker (original), the one in pip install (more updates), and the one on this repo (fully redone with Selenium support).

If you have a key to avoid captchas, the version you want is the one from pip install. If you have no key, you can try the Selenium support, although some users have reported that it does not work.

ModuleNotFoundError: No module named '__init__'

When installed on Heroku with pip here's the error I get.

from archivenow import archivenow

Triggers this:

Traceback (most recent call last):
  File "", line 22, in <module>
  File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/", line 364, in execute_from_command_line
  File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/", line 356, in execute
  File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/", line 330, in execute
    output = self.handle(*args, **options)
  File "/app/archive/management/commands/", line 10, in handle
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/", line 191, in __call__
    return self._get_current_object()(*a, **kw)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/app/", line 380, in __call__
    return*args, **kwargs)
  File "/app/archive/", line 13, in is_memento
    from archivenow import archivenow
  File "/app/.heroku/python/lib/python3.6/site-packages/archivenow/", line 10, in <module>
    from __init__ import __version__ as archiveNowVersion
ModuleNotFoundError: No module named '__init__'

Web Service?

This is an amazing tool, thank you for building and publishing it! Do you by chance know if anyone is hosting a web service that uses this tool to let users paste a URL once and generate archives across all of the supported archive service providers in one go? That would be amazing. If not, I may be interested in building such a tool. Let me know what you think.

bug(windows): Error: No enabled archive handler found

1. Summary

I can't get the archivenow CLI to work on Windows.

2. Environment

  • Windows 10 Enterprise LTSB 64-bit EN
  • Python 3.7.2
  • archivenow 2019.

3. Steps to reproduce

I install archivenow to virtual environment:

D:\SashaDebugging>mkvirtualenv archivenowenv
Using base prefix 'c:\\python37'
New python executable in C:\Users\SashaChernykh\Envs\archivenowenv\Scripts\python.exe
Installing setuptools, pip, wheel…

(archivenowenv) D:\SashaDebugging>toggleglobalsitepackages

    Disabled global site-packages

(archivenowenv) D:\SashaDebugging>pip install archivenow
Collecting archivenow
  Using cached
Collecting flask (from archivenow)
  Using cached
Collecting requests (from archivenow)
  Using cached
Collecting itsdangerous>=0.24 (from flask->archivenow)
  Using cached
Collecting Werkzeug>=0.14 (from flask->archivenow)
  Using cached
Collecting Jinja2>=2.10 (from flask->archivenow)
  Using cached
Collecting click>=5.1 (from flask->archivenow)
  Using cached
Collecting idna<2.9,>=2.5 (from requests->archivenow)
  Using cached
Collecting urllib3<1.25,>=1.21.1 (from requests->archivenow)
  Using cached
Collecting chardet<3.1.0,>=3.0.2 (from requests->archivenow)
  Using cached
Collecting certifi>=2017.4.17 (from requests->archivenow)
  Using cached
Collecting MarkupSafe>=0.23 (from Jinja2>=2.10->flask->archivenow)
  Using cached
Installing collected packages: itsdangerous, Werkzeug, MarkupSafe, Jinja2, click, flask, idna, urllib3, chardet, certifi, requests, archivenow
Successfully installed Jinja2-2.10 MarkupSafe-1.1.0 Werkzeug-0.14.1 archivenow-2019. certifi-2018.11.29 chardet-3.0.4 click-7.0 flask-1.0.2 idna-2.8 itsdangerous-1.1.0 requests-2.21.0 urllib3-1.24.1

I try to run commands as in the examples.

4. Expected behavior

Save web-pages on archiving services.

5. Actual behavior

I get Error: No enabled archive handler found any time.

(archivenowenv) D:\SashaDebugging>archivenow

 Error: No enabled archive handler found

(archivenowenv) D:\SashaDebugging>archivenow

 Error: No enabled archive handler found

(archivenowenv) D:\SashaDebugging>archivenow -all

 Error: No enabled archive handler found

(archivenowenv) D:\SashaDebugging>archivenow -ia

 Error: No enabled archive handler found


Add a Dockerfile

In order to deploy it on our server, please add a Dockerfile to the repository and publish a corresponding image on DockerHub.

nesting

Support the nesting of Hypothesis links as an archive link.
where is the prefix and is the URL (encodeURIComponent in JS function)
Same should be applied to all sites
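A rough Python equivalent of the encodeURIComponent step, assuming the nested link is simply the prefix followed by the percent-encoded URL. The function name nest_url and the prefix used in the usage note are hypothetical, since the actual prefix is not shown in this issue.

```python
from urllib.parse import quote


def nest_url(prefix, url):
    """Append url to prefix, percent-encoded like JavaScript's encodeURIComponent."""
    # encodeURIComponent leaves letters, digits and - _ . ! ~ * ' ( ) unescaped;
    # quote() already keeps letters, digits and - _ . ~, so only the rest is listed.
    return prefix + quote(url, safe="!~*'()")
```

For example, nest_url("https://prefix.example/", "http://example.com/a?x=1") gives "https://prefix.example/http%3A%2F%2Fexample.com%2Fa%3Fx%3D1".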

Add Support for

If possible, please add support for

I've found some snippets of code around the Internet but when I've tried doing requests with the information from these projects, I always get URL as the res.url and nothing useful in the res.headers in the response from the server.

I've tried replicating the cookies back, but sometimes I get the error "「Cookieが無効な状態」" ("cookies are disabled") from their server, which means it is complaining about them.

Anyone have any thoughts on how to submit URLs to in Python?

Archive Web Site

Can you add the ability to archive a complete web site:

  • spidering from a given directory to any depth or a specified depth
  • up to a certain depth for links outside the site

Some files may be documents (e.g. doc or pdf) that themselves contain links.
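One way to read the two depth limits above is a breadth-first walk with separate hop limits for same-site and off-site links. This is only a sketch under that assumption: collect_uris, get_links, max_depth and external_depth are hypothetical names, and fetching pages and extracting their links is left to the get_links callback.

```python
from collections import deque
from urllib.parse import urlparse


def collect_uris(start, get_links, max_depth=2, external_depth=0):
    """Return the URIs to archive, walking breadth-first from start.

    get_links(uri) is a hypothetical callback returning absolute URIs
    linked from uri. Links on the starting host are followed up to
    max_depth hops from start; links to other hosts only up to
    external_depth hops.
    """
    host = urlparse(start).netloc
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        uri, depth = queue.popleft()
        for link in get_links(uri):
            limit = max_depth if urlparse(link).netloc == host else external_depth
            if link not in seen and depth + 1 <= limit:
                seen.add(link)
                queue.append((link, depth + 1))
    return sorted(seen)
```

Each URI in the returned list could then be submitted to the chosen archives one by one.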

Archive sites in addition to submitting URIs

One of the use cases in is to grab a site's contents using wget and then run the tool to create a WARC file from the local file contents. It would be useful for a tool called "archivenow" to do more than submit URIs and to perform some form of archiving itself.

I would like to propose replicating this model in the archivenow tool, but in a single command. For example, running archivenow --warc=news.warc --agent=wget --ia would use wget to create a WARC of and store it locally at news.warc but also submit the URI to IA.

fails, the site presents a captcha challenge

When trying to archive a URL to through archivenow --is URL, it always returns:

Error (The 429 Client Error: Too Many Requests for url:

I have Firefox and geckodriver installed and available in my PATH.

When submitting a URL on the site regularly through a browser, the site returns 429 on submit and requires the completion of a reCAPTCHA challenge, and then proceeds to archive the URL.

Reduce nesting

Too much nesting is generally a bad idea unless it is necessary. Try refactoring your code to reduce nesting in general. The file can use glob method to filter files with specific name pattern and reduce the nesting while making the code more readable.

Self-report module version number

In #7 I had to resort to pip to verify the version of the library I was using. The version is reported on installation, but it is common for a module to be able to self-report its version.

Allow archivenow -v and archivenow --version to print the version of the module to stdout. This should help with debugging.
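A minimal sketch of how these two flags could be wired up with argparse's built-in version action. The __version__ value here is a placeholder and build_parser is a hypothetical name; the real package would supply its own version string and parser.

```python
import argparse

__version__ = "0.0.0"  # placeholder, not the real archivenow version


def build_parser():
    """Build a parser whose -v/--version prints the version and exits."""
    parser = argparse.ArgumentParser(prog="archivenow")
    parser.add_argument(
        "-v", "--version",
        action="version",
        version="%(prog)s " + __version__,
    )
    return parser
```

With this action, argparse prints the version line to stdout and exits with status 0 as soon as the flag is seen.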

Will submit pr: submit

Cool site that uses webrecorder render and also archives videos.

The endpoint for submitting an archive is "/archive", and it is a POST request. Once the request is submitted, it will redirect (302) you to the URL where the archive will be stored.

I will submit a PR myself, but are there any objections before I do?

ImportError: No module named pathlib

In Ubuntu 18.04
pip install archivenow
Successfully installed Jinja2-2.11.1 MarkupSafe-1.1.1 Werkzeug-1.0.0 archivenow-2019. certifi-2019.11.28 chardet-3.0.4 click-7.1.1 flask-1.1.1 idna-2.9 itsdangerous-1.1.0 requests-2.23.0 urllib3-1.25.8
I tried running a test with
archivenow --all
The response was:

Traceback (most recent call last):
  File "/home/myusername/.local/bin/archivenow", line 7, in <module>
    from archivenow.archivenow import args_parser
  File "/home/myusername/.local/lib/python2.7/site-packages/archivenow/", line 13, in <module>
    from pathlib import Path
ImportError: No module named pathlib

documentation incorrect on how to pass parameters to a handler

Current readme suggests this code for use in Python:
But actually, when push() is called directly, it seems to expect additional parameters in object form, e.g.:

502 bad gateway error


I am getting the following error both from python and cli usage. archive.cli is working fine though.

Error (The Internet Archive): 502 Server Error: Bad Gateway for url:

I used it successfully last month. However, currently I am getting this error. I tried using the web page and it is working fine.


Handle case where no optional parameters are specified

I attempted to run the tool with only the positional URI parameter and no optional parameters:

archivenow http://some-urir

and was shown the command-line help. It would be better to handle this usage in a smarter manner, i.e., triggering the "--all" or "--ia" flags when no archive is explicitly specified.
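The fallback described here could be sketched with argparse as below. This is an illustration rather than archivenow's real argument parser: the parse function is hypothetical, only the --ia, --is and --all flags mentioned in these issues are modelled, and choosing --ia as the default is an assumption.

```python
import argparse


def parse(argv):
    """Parse arguments, defaulting to --ia when no archive flag is given."""
    parser = argparse.ArgumentParser(prog="archivenow")
    parser.add_argument("uri")
    parser.add_argument("--ia", action="store_true")
    # "is" is a Python keyword, so the value is stored under a different name.
    parser.add_argument("--is", dest="is_", action="store_true")
    parser.add_argument("--all", action="store_true")
    args = parser.parse_args(argv)
    if not (args.ia or args.is_ or args.all):
        args.ia = True  # no archive chosen: fall back to the Internet Archive
    return args
```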

Support Python 3

It is now safe and preferred to write code for Python 3 as Python 2 is reaching the end of its extended life. There is no need to support Python 2 anymore in libraries like these.

archivenow --ia "" returns incorrect URL

archivenow --ia "" returns the following for me:!1m0!3m2!1sen!2sus!4v1492711765912!6m8!1m7!1sUl8AEIci9YYO2dP_SwO1oQ!2m2!1d40.71478671950204!2d-73.99018606424495!3f303.3614672677981!4f-5.896130905148539!5f0.7820865974627469

Which is an embed on the site, but not the top level site itself — I'd expect it to return something like:


Better defaults in the UI

For better user experience you might want to:

  • make the first three checkboxes checked by default
  • add a link to the page where API key can be generated, and
  • make the API key persist in user's browser's localstorage (if entered)

Restructure the response JSON

I would suggest that the response

{
	"results": [
		"Error (The Archive): An API KEY is required"
	]
}

should be changed to

{
	"uri": "",
	"request-datetime": "20170209143321",
	"mementos": {
		"": "",
		"": "",
		"": "",
		"": "Error: An API KEY is required"
	}
}

Archiving resources with relative Content-Location

archivenow --ia

See also from curl where a resource returns Content-Location:

curl -I
content-location: Overview.html

in comparison to the ones that don't:

curl -I

So, when I do something like:

curl -ki '' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:69.0) Gecko/20100101 Firefox/69.0' -H 'Accept: */*' -H 'Accept-Language: en-CA,en;q=0.7,en-US;q=0.3' --compressed -H 'Referer: https://localhost:8443/' -H 'Origin: https://localhost:8443' -H 'Connection: keep-alive' -H 'DNT: 1' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache'

I get:

content-location: Overview.html

And that kind of screws things up for me because I can't figure out the actual snapshot location from the headers. It is fine if a JS-enabled agent is making the request, because it eventually redirects, but that's not what I want: I'm making this call from a client-side application and only want to work with headers (or whatever proper structured data is available), as opposed to scraping.

This is in comparison to say:

curl -ki '' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:69.0) Gecko/20100101 Firefox/69.0' -H 'Accept: */*' -H 'Accept-Language: en-CA,en;q=0.7,en-US;q=0.3' --compressed -H 'Referer: https://localhost:8443/' -H 'Origin: https://localhost:8443' -H 'Connection: keep-alive' -H 'DNT: 1' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache'

which gives a nice workable:

content-location: /web/20190708123256/
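A client that only looks at headers could at least tell the two cases apart. The sketch below assumes that a usable snapshot path has the form /web/<14-digit timestamp>/..., as in the header shown above; snapshot_datetime is a hypothetical helper name.

```python
import re

# Assumed snapshot path shape: /web/ followed by a 14-digit timestamp.
SNAPSHOT_RE = re.compile(r"^/web/(\d{14})/")


def snapshot_datetime(content_location):
    """Return the 14-digit timestamp if the header is a snapshot path, else None."""
    match = SNAPSHOT_RE.match(content_location)
    return match.group(1) if match else None
```

A value such as Overview.html yields None, which signals that the snapshot location has to be obtained some other way.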


Problems with pushing mementos into Internet Archive

I noticed this when I was using ArchiveNow this morning.

# archivenow
Error (The Internet Archive): 445 Client Error:  for url:

If I add a user agent to the arguments to the requests.get on line 15 of archivenow/archivenow/handlers/ then it works.

r = requests.get(uri, timeout=120, allow_redirects=True)

I'm uncertain as to how you want to handle the user specifying their own user agent. The existing --agent argument appears to be for specifying which tool the user desires to employ for creating WARCs. Also, there doesn't appear to be a way to submit changes to any of the request headers in archivenow/

As I'm calling ArchiveNow within Python code, I would prefer an available parameter to the push function on line 129 of archivenow/

def push(URI, arc_id, p_args={}):
    global handlers
    global res_uris
    try:
        # push to all possible archives
        res_uris_idx = str(uuid.uuid4())
        res_uris[res_uris_idx] = []
        ### if arc_id == 'all':
        ###     for handler in handlers:
        ###         if (handlers[handler].api_required):
        # pass args like key API
        ###             res.append(handlers[handler].push(str(URI), p_args))
        ###         else:
        ###             res.append(handlers[handler].push(str(URI)))
        ### else:
        # push to the chosen archives
        threads = []
        for handler in handlers:
            if (arc_id == handler) or (arc_id == 'all'):
                ### if (arc_id == handler): ### and (handlers[handler].api_required):
                #res.append(handlers[handler].push(str(URI), p_args))
                #push_proxy( handlers[handler], str(URI), p_args, res_uris_idx)
                threads.append(Thread(target=push_proxy, args=(handlers[handler],str(URI), p_args, res_uris_idx,)))
                ### elif (arc_id == handler):
                ###     res.append(handlers[handler].push(str(URI)))
        # start every thread, then wait for all of them to finish
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        res = res_uris[res_uris_idx]
        del res_uris[res_uris_idx]
        return res
    except:
        del res_uris[res_uris_idx]
        return ["bad request"]

For example, we could have:

def push(URI, arc_id, p_args={}, headers={}):

where the user can override any of the request headers by assigning them as a dictionary to the headers parameter. This dictionary would have to be re-submitted through the code on line 154 to the function executed via multithreading.

I haven't submitted a pull request yet because all handlers would need to be updated to receive and act on this parameter. I'm not sure of the implications of that.
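To make the proposal concrete, here is a self-contained sketch of a headers parameter being merged with defaults and handed to each handler thread. It is not the archivenow implementation: EchoHandler, DEFAULT_HEADERS and this simplified push signature are stand-ins invented for illustration, and a real handler would pass the merged dictionary on to its HTTP request.

```python
import threading

# Hypothetical default; a real tool would pick its own User-Agent string.
DEFAULT_HEADERS = {"User-Agent": "archivenow-sketch/0.0"}


class EchoHandler:
    """Stand-in handler that reports which User-Agent it was given."""

    def push(self, uri, p_args=None, headers=None):
        return f"{uri} via {headers['User-Agent']}"


def push(uri, handlers, p_args=None, headers=None):
    """Push uri to every handler in its own thread, forwarding merged headers."""
    merged = {**DEFAULT_HEADERS, **(headers or {})}  # caller's values win
    results = []
    lock = threading.Lock()

    def worker(handler):
        out = handler.push(uri, p_args or {}, merged)
        with lock:
            results.append(out)

    threads = [threading.Thread(target=worker, args=(h,)) for h in handlers.values()]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

Using None rather than {} as the default for p_args and headers avoids sharing one mutable dictionary between calls.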
