ReproZip for the Preservation of Web Applications

Home Page: https://reprozip-web.readthedocs.io/en/latest/

License: BSD 3-Clause "New" or "Revised" License

Python 99.58% HTML 0.42%
reprozip preservation data-journalism

reprozip-web's Introduction

Prototype Web Archiving App

A work-in-progress app that leverages ReproZip and Webrecorder to capture archival packages of data journalism websites.

Prerequisites

You will need to install the Docker server and have it running on your system. See https://docs.docker.com

You will also need Python 3 and pip. One way to get them is with pyenv. For example, on macOS (using Homebrew):

brew install pyenv

On Debian/Ubuntu:

sudo apt install python3.7 python3.7-dev virtualenv docker.io

Development Install

At some point the app will likely be installed from a registry, like most Python libraries. For now, it must be installed from a local directory.

Recommendation: Use pyenv and virtualenv (or pipenv) to create a self-contained virtual environment:

$ pyenv local 3.7
$ pip install virtualenv
$ virtualenv .
$ source bin/activate

Now clone the repo and cd into it:

$ git clone https://github.com/reprozip-news-apps/reprozip-web
$ cd reprozip-web

Now install the dependencies and the app into your virtualenv. Note that reprounzip-docker must be installed from GitHub for now.

$ pip install -r requirements.txt
$ pip install -e .

Step 1: Package a site using ReproZip

Skip to step 2 if you already have an RPZ package. Otherwise, see the ReproZip documentation:

https://reprozip.readthedocs.io/en/1.0.x/packing.html

Step 2: Record the site assets from the RPZ using Webrecorder

You need an RPZ package and you need to know what port the packaged application runs on.

For example:

reprounzip dj record web-app.rpz target --port 3000

Note that the port number depends on the web server you captured in step 1. A Rails app will likely run on port 3000, while a NodeJS app will likely run on port 8000.
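If the original site is still running on the machine where it was traced, one way to confirm its port is to probe the likely candidates. The sketch below is only an illustration: the helper names `port_is_open` and `first_open_port` are hypothetical, and the candidate ports are simply the two mentioned above.

```python
import socket


def port_is_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def first_open_port(candidates=(3000, 8000), host="127.0.0.1"):
    """Return the first candidate port that accepts connections, or None."""
    for port in candidates:
        if port_is_open(port, host):
            return port
    return None
```

Calling `first_open_port()` while the site is up returns the port to pass to `--port`, or `None` if neither candidate is listening.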

You should see the WARC_DATA directory in the package now. For example:

$ tar -t -v -f web-app.rpz
-rw-------  0 root   root 729415801 Mar  9  2017 DATA.tar.gz
-rw-------  0 root   root        19 Mar  9  2017 METADATA/version
-rw-r--r--  0 root   root   5912576 Mar  9  2017 METADATA/trace.sqlite3
-rw-------  0 root   root    293142 Mar  9  2017 METADATA/config.yml
-rw-r--r--  0 hoffman staff   807498 Jan 11 09:16 WARC_DATA/rec-20190111141622981410-anything.local.warc.gz
-rw-r--r--  0 hoffman staff    37089 Jan 11 09:16 WARC_DATA/autoindex.cdxj
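The same check can be scripted. This is a minimal sketch, assuming the RPZ is a plain tar archive as the listing above suggests. `list_warc_members` and `has_recording` are hypothetical helper names, not part of reprounzip.

```python
import tarfile


def list_warc_members(rpz_path):
    """Return the names of archive members stored under WARC_DATA/."""
    with tarfile.open(rpz_path) as rpz:  # default mode detects compression
        return [name for name in rpz.getnames()
                if name.startswith("WARC_DATA/")]


def has_recording(rpz_path):
    """Return True if the package contains at least one .warc.gz file."""
    return any(name.endswith(".warc.gz")
               for name in list_warc_members(rpz_path))
```

For example, `has_recording("web-app.rpz")` should be true once step 2 has completed.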

Step 3: Replay the site and verify fidelity

$ reprounzip dj playback web-app.rpz target --port 3000

Now tab to your Chromium browser, turn off your wifi, and hit reload! Press Enter in your terminal session to shut everything down.

Skipping reprounzip unpacking on subsequent runs

When you finish recording, or exit a playback session, the unpacked container will be destroyed. You can prevent that from happening by using the --skip-destroy flag:

$ reprounzip dj playback web-app.rpz target --port 3000 --skip-destroy

Then you can reuse the container on another playback session:

$ reprounzip dj playback web-app.rpz target --port 3000 --skip-setup --skip-run

Packing and Recording Simultaneously

You can run reprozip trace and record at the same time, using two different terminals (both on the site host, or one on the site host and one on a different host).

Terminal 1:

$ cd /path/to/your/project
$ reprozip trace .runserver

Terminal 2:

$ mkdir /path/to/target
$ reprounzip dj live-record http://localhost:3000 /path/to/target

Wait for the recorder to finish, then go back to Terminal 1 and press CTRL-C.

Terminal 1:

$ reprozip pack /path/to/captured-site.rpz

The final step is to merge the recorded data into the reprozip package:

$ reprounzip dj record /path/to/captured-site.rpz /path/to/target --skip-record

Using Wayback as a standalone frontend

If you don't want to use a bespoke browser, or want to share an archive over the web, you can use the --standalone flag to play the site back like any other WARC collection:

$ reprounzip dj playback web-app.rpz target --port 3000 --standalone
$ curl http://localhost:8080/http://web-app.rpz

reprozip-web's People

Contributors

bofei5675, fchirigati, quoideneuf, remram44, vickyrampin

Forkers

quoideneuf

reprozip-web's Issues

Error when following the packing and recording simultaneously instructions

Hi, nice work creating this prototype!
I was trying to follow the instructions for doing the recording + packing option:

reprounzip dj live-record http://localhost:3000 /path/to/target

and this resulted in the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/reprozip-web/py37/bin/reprounzip", line 11, in <module>
    load_entry_point('reprounzip==1.0.13', 'console_scripts', 'reprounzip')()
  File "/home/ubuntu/reprozip-web/py37/lib/python3.7/site-packages/reprounzip/main.py", line 144, in main
    args.func(args)
  File "/home/ubuntu/reprozip-web/py37/lib/python3.7/site-packages/reprozip_web-0.1-py3.7.egg/reprounzip/unpackers/dj.py", line 494, in live_record
    record(args)
  File "/home/ubuntu/reprozip-web/py37/lib/python3.7/site-packages/reprozip_web-0.1-py3.7.egg/reprounzip/unpackers/dj.py", line 452, in record
    WARCPacker.no_second_pass(args.pack[0])
AttributeError: 'Namespace' object has no attribute 'pack'
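
The traceback suggests that record reads args.pack, which the live-record subcommand may not define. Below is a minimal sketch of that failure mode with plain argparse. The parser layout and the `pack_path` helper are hypothetical and not taken from dj.py.

```python
import argparse


def build_parser():
    """Two subcommands; only "record" defines a positional "pack"."""
    parser = argparse.ArgumentParser(prog="demo")
    sub = parser.add_subparsers(dest="command")

    record = sub.add_parser("record")
    record.add_argument("pack", nargs=1)
    record.add_argument("target", nargs=1)

    live = sub.add_parser("live-record")
    live.add_argument("url")
    live.add_argument("target", nargs=1)
    return parser


def pack_path(args):
    """Return the pack path, or None when the subcommand has no "pack"."""
    pack = getattr(args, "pack", None)  # avoids AttributeError on Namespace
    return pack[0] if pack else None
```

Parsing `live-record http://localhost:3000 /path/to/target` with this parser yields a Namespace without a `pack` attribute, so reading `args.pack[0]` directly would raise the same AttributeError. Whether a guard like `pack_path` is the right fix for dj.py is an assumption.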

Preserve host name

Right now the reproduced app is getting the host name "rpzdj-repl.ay" which might not work.

Renames

This should be reprozip-web

Things to rename:

  • GitHub repo/organization
    • Should there be a reprozip organization?
  • PyPI package
  • ReadTheDocs

Clean up container after recording

Currently when reprounzip dj record exits, the target directory and the reprounzip_image_xxx image still exist (from reprounzip-docker), and a container is still running (reprounzip_detached_xxx).

Those should probably get cleaned up automatically (I'm fine with a --skip-destroy here, matching --skip-setup)
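
Until cleanup is automatic, the leftovers can be found by name. This is a minimal sketch, assuming the reprounzip_detached_ and reprounzip_image_ prefixes described above and a docker CLI on the PATH. The function names are hypothetical.

```python
import subprocess

# Name prefixes taken from the issue text above.
PREFIXES = ("reprounzip_detached_", "reprounzip_image_")


def leftover_names(names, prefixes=PREFIXES):
    """Keep only the names that start with one of the reprounzip prefixes."""
    return [name for name in names if name.startswith(prefixes)]


def docker_names(*docker_args):
    """Run a docker listing command and return its output as a list of names."""
    result = subprocess.run(["docker", *docker_args], check=True,
                            capture_output=True, text=True)
    return result.stdout.split()


def find_leftovers():
    """Return (containers, images) left behind by reprounzip-docker."""
    containers = docker_names("ps", "-a", "--format", "{{.Names}}")
    images = docker_names("images", "--format", "{{.Repository}}")
    return leftover_names(containers), leftover_names(images)
```

The results of `find_leftovers()` could then be passed to `docker rm -f` and `docker rmi` by hand.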
