ikreymer / pywb-webrecorder

Check out https://github.com/webrecorder/webrecorder for newer version matching https://webrecorder.io

License: MIT License

pywb-webrecorder's Introduction

pywb Wayback Web Recorder (Archiver)

Note: this is an older prototype. We suggest taking a look at https://github.com/webrecorder/webrecorder, the Docker deployment for https://webrecorder.io/, which includes improved features and will be better maintained than this prototype.

This project provides a bare-bones example of how to create a simple web recording and replay system.

This project demonstrates how to create a simple web recorder tool by combining pywb (python wayback) web archive replay tools and warcprox HTTP/S recording WARC proxy.

For additional reference, please consult the pywb and warcprox docs.

For more reference, https://webrecorder.io is a hosted service built using some of the same tools.

Basic Usage

To start, simply install with pip install -r requirements.txt under a Python 2.7.x environment.

Then, run python pywb-webrecorder.py

The pywb-webrecorder.py script will start an instance of pywb, an instance of warcprox and a timed CDX index updater. By default, pywb runs on port 8080 and warcprox on port 9001.

warcprox will store each WARC that is being written to (one at a time) into the ./recording/ directory. Once completed (or on shutdown), WARCs will be moved to the ./done/ directory.

(All settings can be adjusted in config.yaml)

The pywb web app running on port 8080 will have the following endpoints available:

/live/url -- Fetch a live version of url (same as live-rewrite-server in pywb)
/record/url -- Fetch a live version of url but through warcprox recording proxy, recording all traffic.
/replay/url -- Replay an archived version of url if found from ./recording or ./done dirs. Display 404 if not archived. Standard pywb Wayback behavior.
/replay-record/url -- Replay an archived version of url if found from ./recording or ./done dirs. If not available, internally call the /record/ handler to record a new copy of url.
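
As a rough illustration of the endpoint scheme above, the following sketch builds the address for a target URL under each endpoint. The helper name endpoint_url and the default host localhost are assumptions made for this example; only the four endpoint names and port 8080 come from the description above.

```python
# Hypothetical helper, not part of pywb-webrecorder itself.
ENDPOINTS = ("live", "record", "replay", "replay-record")


def endpoint_url(endpoint, url, host="localhost", port=8080):
    """Build the pywb-webrecorder address for `url` under one endpoint."""
    if endpoint not in ENDPOINTS:
        raise ValueError("unknown endpoint: %r" % endpoint)
    # The target URL is appended as-is after the endpoint prefix.
    return "http://%s:%d/%s/%s" % (host, port, endpoint, url)


if __name__ == "__main__":
    for name in ENDPOINTS:
        print(endpoint_url(name, "http://example.com/"))
```

Opening the printed /record/ address in a browser should fetch the page through warcprox, while the /replay/ address only works once the page has been archived and indexed.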

Archive On-Demand

The replay-record endpoint demonstrates a way to auto-record any resources missing from an existing archive.

The first time a resource is requested, it will be recorded. On each subsequent request (after the cdx has been updated), it will be replayed from an existing WARC.

The banner will contain either live fetch or archived page to indicate whether the page was live or archived.

How it Works

pywb features a 'live rewrite' replay mode which fetches live web content and displays it the same way as if it were read from an archive file. (See the live-rewrite-server tool).

With pywb >= 0.5.0, it is now possible to specify a proxy server for the live fetching. This allows the live fetching to go through warcprox, which proxies HTTP/S traffic and records it to WARC files.

The /record/ endpoint is configured to fetch live content via the proxy on port 9001, while /live/ access point just fetches live without recording.

In some cases, it is useful to record only when content is missing from an archive. pywb 0.5.0 includes a new fallback mechanism which allows pywb to call a different handler instead of showing a 404.

The /replay-record/ endpoint uses this feature to replay archived content from WARCs in either ./recording or ./done. However, if a resource is not found, the request is delegated to /record/ and a new recording is made. (The /replay/ endpoint just provides regular replay without auto-recording.)
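
The fallback idea can be sketched in a few lines of plain Python. This is not pywb's actual handler code: the names NotFound, replay, record and replay_record are hypothetical, a dict stands in for the WARC files and CDX index, and `fetch` stands in for the live fetch through the recording proxy. Unlike the real system, the dict is visible to replay immediately, with no wait for a CDX update.

```python
class NotFound(Exception):
    """Raised by a handler that has no capture for the requested URL."""


def replay(url, archive):
    # `archive` stands in for the CDX lookup: a dict of url -> recorded body.
    try:
        return "archived", archive[url]
    except KeyError:
        raise NotFound(url)


def record(url, archive, fetch):
    # Fetch live content and store it, as the recording proxy would.
    body = fetch(url)
    archive[url] = body
    return "live", body


def replay_record(url, archive, fetch):
    # Try replay first; on a miss, delegate to the record handler
    # instead of reporting a 404.
    try:
        return replay(url, archive)
    except NotFound:
        return record(url, archive, fetch)
```

The first element of the returned pair plays the role of the banner text: the first request for a URL comes back as live, later requests as archived.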

Index Updating

All the above functionality is provided by pywb and warcprox side-by-side.

The last missing piece is automatically updating the CDX index for pywb.

pywb does not provide a way to add CDX indexes dynamically. However, since the CDX is read on each request, it is possible (and more efficient) to simply update an existing CDX index while pywb is running.

pywb starts with two cdx files ./recording/index.cdx and ./done/index.cdx, which may be updated as new content is recorded.

The pywb-webrecorder.py bootstrap script launches pywb and warcprox as subprocesses, then starts a periodic CDX updater that runs every few seconds (configured by the update_freq property in config.yaml).

Of course, there are many ways to do this. For simplicity, the following approach is taken:

The periodic updater finds the latest WARC open by warcprox, a file ending in .warc.gz.open, and checks to see if it has been updated. If it has, the updater calls the pywb cdx-indexer on the open file to create a new sorted ./recording/index.cdx.

When warcprox is finished with a file, the .open extension is dropped. The updater also checks for any .warc.gz files, moves them to the ./done directory and regenerates ./done/index.cdx. This happens on startup, on shutdown, or whenever the current open file is no longer accessible.
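
One pass of such an updater might look like the sketch below. It is an illustration under stated assumptions, not the code in pywb-webrecorder.py: the function names are hypothetical, change detection uses the file's modification time, and the actual indexing is left to a caller-supplied `reindex` callable (which could, for example, invoke the pywb cdx-indexer).

```python
import glob
import os
import shutil


def latest_open_warc(recording_dir):
    """Return the most recently modified *.warc.gz.open file, or None."""
    candidates = glob.glob(os.path.join(recording_dir, "*.warc.gz.open"))
    return max(candidates, key=os.path.getmtime) if candidates else None


def update_once(recording_dir, done_dir, reindex, last_seen):
    """Run one updater pass and return the new (path, mtime) state.

    `reindex(cdx_path, warc_paths)` is a caller-supplied callable that
    rebuilds a sorted CDX file from the given WARCs.
    """
    # Move finished WARCs (no .open suffix) to done_dir and reindex it.
    finished = glob.glob(os.path.join(recording_dir, "*.warc.gz"))
    for path in finished:
        shutil.move(path, os.path.join(done_dir, os.path.basename(path)))
    if finished:
        done_warcs = sorted(glob.glob(os.path.join(done_dir, "*.warc.gz")))
        reindex(os.path.join(done_dir, "index.cdx"), done_warcs)

    # Reindex the open WARC only when it is new or has changed on disk.
    current = latest_open_warc(recording_dir)
    if current is None:
        return None
    state = (current, os.path.getmtime(current))
    if state != last_seen:
        reindex(os.path.join(recording_dir, "index.cdx"), [current])
    return state
```

A caller would invoke update_once in a loop, sleeping for the configured interval between passes and feeding the returned state back in as last_seen.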

On graceful shutdown (with SIGTERM), pywb-webrecorder.py also shuts down pywb and warcprox.

After a graceful shutdown, the ./done/ directory should contain all the finished WARCs and ./recording/ should be empty.
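
The launch-and-shutdown pattern described above can be sketched with the standard library alone. The function names here are hypothetical and the real script may be organized differently; the sketch only shows child processes being started, and SIGTERM being passed on to them when the parent receives it.

```python
import signal
import subprocess
import sys


def start_children(commands):
    """Launch each command as a subprocess and return the Popen handles."""
    return [subprocess.Popen(cmd) for cmd in commands]


def stop_children(children):
    """Ask every running child to terminate, then wait for each to exit."""
    for child in children:
        if child.poll() is None:
            child.terminate()  # delivers SIGTERM on POSIX systems
    for child in children:
        child.wait()


def install_sigterm_handler(children):
    """Stop the children when the parent process itself receives SIGTERM."""
    def handler(signum, frame):
        stop_children(children)
        sys.exit(0)
    signal.signal(signal.SIGTERM, handler)
```

In this arrangement the parent would call start_children with the pywb and warcprox command lines from the config, install the handler, and then run the periodic updater until it is signalled.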

Other Settings

The config.yaml file contains the command line settings for starting pywb and warcprox. Please refer to the warcprox README for command line options, such as changing the max WARC size, the idle time before rotating WARCs, filenames, etc.

The max WARC size and max idle time options may be especially useful for adjusting how long a WARC file remains open and when it is moved to ./done/ directory.

For instance, to have a WARC file considered done when no new content has been recorded for 60 seconds OR when its size exceeds 1 KB, the recorder_exec setting in the config can be modified as follows: recorder_exec: 'warcprox --rollover-idle-time 60 -s 1000 ...

uWSGI is used to run pywb but other WSGI containers can of course be used instead.

The config also demonstrates use of custom home page and error pages with pywb:

index.html is a simple custom home page for pywb-webrecorder
error.html modifies the standard pywb error page to also include an explicit /record/ link for 'not found' errors (only makes sense when using /replay/ endpoint).

A note on Dedup and Revisits

warcprox uses its own dedup db, written to dedup.db by default. The dedup scheme is decoupled from the actual WARC file being present/available. Thus, if removing WARCs from ./done, be sure to also delete dedup.db to avoid revisit records that point to WARCs that no longer exist (unless that is the intent).

By default, dedup.db is persisted when pywb-webrecorder is shut down. When starting pywb-webrecorder, dedup.db can be automatically deleted and created anew via the -f flag: python pywb-webrecorder.py -f

Contributions

This project is intended as a demo of different web recording scenarios that could be used by combining pywb and warcprox. The project is under the MIT license and can be used freely (although pywb and warcprox may have different licenses).

Changes and adaptations to different use cases are encouraged. Feedback and pull requests are welcome!

pywb-webrecorder's Issues

wrongly rewritten URLs

This relatively simple page's image URLs are rewritten wrongly:
http://speichern-unter.net/alle-einreichungen.html

Plaintext URLs in Facebook posts get rewritten with the URL of the webrecorder being used

If a facebook post contains plaintext links just pasted in by the user, like http://rhizome.org, they become the full recorder URL, like http://localhost:8080/record/http://rhizome.org, in the visible text.

timezone in live/record view not correct

Only a minor bug.
It seems that the timezone in the live and record views is not correct; in replay and replay-record it is shown correctly.

For example on record:
This is a live page loaded on Thu, Jul 31 2014 19:56:50

on replay:
This is an archived page from Thu, Jul 31 2014 21:56:50

The timezone also seems to be incorrect on webrecorder.io. (Recorded on Thu, Jul 31 2014 20:07:09)

4chan button issues

On 4chan.org, the 'I Agree' Buttons when entering a channel from the frontpage are not clickable.

On channel pages, the paging buttons don't work. Clicking on page numbers works fine.

No WARCs recorded when browsing

I was able to install by following the instructions in README.md. On execution, I received the below (1) output. When browsing, nothing occurs in the log. After some time, whether I have browsed or sat and stared at the log, I receive an exception (2) on the command-line. Only zero byte index.cdx files reside in /done and /recording. MacOS X 10.9.5, Python 2.7.6

(1)
mymachine:pywb-webrecorder mkelly$ python pywb-webrecorder.py
[uWSGI] getting INI configuration from uwsgi.ini
*** Starting uWSGI 2.0.6 (64bit) on [Tue Oct 28 10:20:10 2014] ***
compiled with version: 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51) on 28 October 2014 10:12:18
os: Darwin-13.4.0 Darwin Kernel Version 13.4.0: Sun Aug 17 19:50:11 PDT 2014; root:xnu-2422.115.4~1/RELEASE_X86_64
nodename: mymachine.local
machine: x86_64
clock source: unix
detected number of CPU cores: 8
current working directory: /Users/mkelly/Desktop/pywb-webrecorder
detected binary path: /usr/local/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
your processes number limit is 709
your memory page size is 4096 bytes
detected max file descriptor number: 256
lock engine: OSX spinlocks
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to TCP address :8080 fd 3
Python version: 2.7.6 (default, Mar 11 2014, 13:16:04) [GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.38)]
*** Python threads support is disabled. You can enable it with --enable-threads ***
Python main interpreter initialized at 0x7ff70b700500
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 1476189 bytes (1441 KB) for 10 cores
*** Operational MODE: preforking ***
2014-10-28 10:20:11,118 3827 INFO MainThread warcprox.DedupDb.init(warcprox.py:676) creating new deduplication database ./dedup.db
2014-10-28 10:20:11,134 3827 INFO MainThread warcprox.CertificateAuthority._read_ca(warcprox.py:128) read CA key+cert from ./ca-cert.pem
2014-10-28 10:20:11,136 3827 INFO MainThread warcprox.WarcProxy.server_activate(warcprox.py:493) WarcProxy listening on 127.0.0.1:9001
2014-10-28 10:20:11,137 3827 INFO MainThread warcprox.WarcproxController.run_until_shutdown(warcprox.py:1107) SIGTERM will initiate graceful shutdown
2014-10-28 10:20:11,137 3827 INFO WarcWriterThread warcprox.WarcWriterThread.run(warcprox.py:930) WarcWriterThread starting, directory=/Users/mkelly/Desktop/pywb-webrecorder/recording gzip=True rollover_size=1000000000 rollover_idle_time=None prefix=rec port=9001
2014-10-28 10:20:11,487: [DEBUG]:
2014-10-28 10:20:11,495: [DEBUG]: Adding Search Page: ui/search.html
2014-10-28 10:20:11,496: [DEBUG]: Adding Frame Insert: ui/frame_insert.html
2014-10-28 10:20:11,523: [DEBUG]: Live Rewrite via proxy http://localhost:9001
2014-10-28 10:20:11,524: [DEBUG]: Adding HeadInsert: ui/head_insert.html, Banner banner.html
2014-10-28 10:20:11,524: [DEBUG]: Adding Search Page: ui/search.html
2014-10-28 10:20:11,524: [DEBUG]: Adding Frame Insert: ui/frame_insert.html
2014-10-28 10:20:11,543: [DEBUG]: Live Rewrite Direct (no proxy)
2014-10-28 10:20:11,543: [DEBUG]: Adding HeadInsert: ui/head_insert.html, Banner banner.html
2014-10-28 10:20:11,544: [DEBUG]: Adding Captures Page: ui/query.html
2014-10-28 10:20:11,544: [DEBUG]: CDX Surt-Ordered? True
2014-10-28 10:20:11,587: [DEBUG]: CustomCanonilizer? True
2014-10-28 10:20:11,587: [DEBUG]: FuzzyMatcher? True
2014-10-28 10:20:11,587: [DEBUG]: Adding CDX Source: CDX File - ./recording/index.cdx
2014-10-28 10:20:11,587: [DEBUG]: Adding CDX Source: CDX File - ./done/index.cdx
2014-10-28 10:20:11,587: [DEBUG]: Adding Search Page: ui/search.html
2014-10-28 10:20:11,588: [DEBUG]: Adding Frame Insert: ui/frame_insert.html
2014-10-28 10:20:11,588: [DEBUG]: Adding Archive Path Source: ./recording/
2014-10-28 10:20:11,588: [DEBUG]: Adding Archive Path Source: ./done/
2014-10-28 10:20:11,607: [DEBUG]: Adding HeadInsert: ui/head_insert.html, Banner banner.html
2014-10-28 10:20:11,607: [DEBUG]: Adding Collection: replay
2014-10-28 10:20:11,607: [DEBUG]: Adding Captures Page: ui/query.html
2014-10-28 10:20:11,608: [DEBUG]: CDX Surt-Ordered? True
2014-10-28 10:20:11,646: [DEBUG]: CustomCanonilizer? True
2014-10-28 10:20:11,647: [DEBUG]: FuzzyMatcher? True
2014-10-28 10:20:11,647: [DEBUG]: Adding CDX Source: CDX File - ./recording/index.cdx
2014-10-28 10:20:11,647: [DEBUG]: Adding CDX Source: CDX File - ./done/index.cdx
2014-10-28 10:20:11,647: [DEBUG]: Adding Search Page: ui/search.html
2014-10-28 10:20:11,647: [DEBUG]: Adding Frame Insert: ui/frame_insert.html
2014-10-28 10:20:11,647: [DEBUG]: Adding Archive Path Source: ./recording/
2014-10-28 10:20:11,647: [DEBUG]: Adding Archive Path Source: ./done/
2014-10-28 10:20:11,666: [DEBUG]: Adding HeadInsert: ui/head_insert.html, Banner banner.html
2014-10-28 10:20:11,666: [DEBUG]: Adding Collection: replay-record
2014-10-28 10:20:11,713: [DEBUG]: Adding Home Page: ./html/index.html
2014-10-28 10:20:11,714: [DEBUG]: Adding Error Page: ./html/error.html
2014-10-28 10:20:11,714: [DEBUG]: *** pywb app inited with config from "create_wb_router"!

WSGI app 0 (mountpoint='') ready in 1 seconds on interpreter 0x7ff70b700500 pid: 3828 (default app)
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 3828)
spawned uWSGI worker 1 (pid: 3829, cores: 1)
spawned uWSGI worker 2 (pid: 3830, cores: 1)
spawned uWSGI worker 3 (pid: 3831, cores: 1)
spawned uWSGI worker 4 (pid: 3832, cores: 1)
spawned uWSGI worker 5 (pid: 3833, cores: 1)
spawned uWSGI worker 6 (pid: 3834, cores: 1)
spawned uWSGI worker 7 (pid: 3835, cores: 1)
spawned uWSGI worker 8 (pid: 3836, cores: 1)
spawned uWSGI worker 9 (pid: 3837, cores: 1)
spawned uWSGI worker 10 (pid: 3838, cores: 1)

(2)
Exception in thread WarcWriterThread:
Traceback (most recent call last):
File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/warcprox/warcprox.py", line 968, in run
self.dedup_db.sync()
File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/warcprox/warcprox.py", line 684, in sync
self.db.sync()

Logged-in facebook capturing works on webrecorder.io, but not on the local pywb-webrecorder

When recording locally, a 'Please Log In' alert box is shown constantly.

This might be because facebook checks for https protocol?

4chan URL issue

When typing the URL 4chan.org in the record field of webrecorder, the user is forwarded to the address http://<domain>/record/http://chan.org, the 4 is lost.

Using the protocol in front of the URL (so actually providing a formally valid URL) prevents this behaviour.

gzipped WARCs not indexable in OpenWayback 2.1 until unzipped

We used WebRecorder to record http://www.nsi.bg/census2011/pagebg2.php?p2=175&sp2=218 and the linked PDF pages one hop out but were unable to index the resulting WARCs into OpenWayback 2.1 until we unzipped them. Not sure whether this reflects an issue with WebRecorder WARC writing or OpenWayback 2.1 indexing. It's also possible that this isn't an issue in OpenWayback 2.2. Happy to provide a link to the WARC files in question, if that's helpful.

Logged-in Facebook: List of Likes 'Show More' doesn't work

When clicking on a '299 Likes' link (just a number above the hundreds), Facebook displays a modal with a list of users who liked an item.

If the number is very high, the list has a 'Show More' button at the bottom, but clicking it doesn't work.

WARC always open

At the moment, recorded URLs are only recognized after a restart, when the .warc.gz.open file is renamed to .warc.gz. Is it possible to do this at runtime, once the recording of a URL is finished? The updater only moves the file after .warc.gz.open has been renamed to .warc.gz.