mwscrape's Introduction

mwscrape downloads rendered articles from MediaWiki sites via web API and stores them in CouchDB to enable further offline processing.

Installation

mwscrape depends on Python 2.7, virtualenv, git (to install from source), and CouchDB.

Consult your operating system documentation and these projects’ websites for installation instructions.

For example, on Ubuntu 18.04, the following command installs the required packages:

sudo apt-get install python2.7 python-virtualenv git

To install CouchDB, first enable the Apache CouchDB package repository:

echo "deb https://apache.bintray.com/couchdb-deb bionic main" | sudo tee -a /etc/apt/sources.list

Then install the repository key:

curl -L https://couchdb.apache.org/repo/bintray-pubkey.asc | sudo apt-key add -

Finally, update the repository cache and install the package:

sudo apt-get update && sudo apt-get install couchdb

By default, CouchDB uses snappy for file compression. Change the file_compression configuration parameter in the couchdb config section to deflate_6. This significantly reduces database disk space usage.
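
Besides editing the configuration file by hand, the setting can be changed through CouchDB's HTTP configuration API. The sketch below (Python 3, standard library only) just builds the request without sending it. It assumes a CouchDB 2.x or later server, where configuration lives under /_node/{node-name}/_config, and uses the _local node alias, which only recent releases accept; on older servers pass the real node name, and on CouchDB 1.x the path is /_config instead. The helper name file_compression_request is hypothetical.

```python
import json
import urllib.request


def file_compression_request(couch_url, node="_local", value="deflate_6"):
    """Build (but do not send) a PUT request that sets couchdb/file_compression."""
    url = "{}/_node/{}/_config/couchdb/file_compression".format(
        couch_url.rstrip("/"), node
    )
    # The configuration API expects the new value as a JSON string in the body.
    body = json.dumps(value).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = file_compression_request("http://localhost:5984")
print(req.get_method(), req.full_url)
```

Actually sending the request (for example with urllib.request.urlopen) requires CouchDB admin credentials.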

Create a new Python virtual environment:

virtualenv env-mwscrape -p python2.7

Activate it:

source env-mwscrape/bin/activate

Install mwscrape from source:

pip install git+https://github.com/itkach/mwscrape

Usage

   
usage: mwscrape [-h] [--site-path SITE_PATH] [--site-ext SITE_EXT] [-c COUCH]
                [--db DB] [--titles TITLES [TITLES ...]] [--start START]
                [--changes-since CHANGES_SINCE] [--recent-days RECENT_DAYS]
                [--recent] [--timeout TIMEOUT] [-S] [-r [SESSION ID]]
                [--sessions-db-name SESSIONS_DB_NAME] [--desc]
                [--delete-not-found] [--speed {0,1,2,3,4,5}]
                [site]

positional arguments:
  site                  MediaWiki site to scrape (host name), e.g.
                        en.m.wikipedia.org

optional arguments:
  -h, --help            show this help message and exit
  --site-path SITE_PATH
                        MediaWiki site API path. Default: /w/
  --site-ext SITE_EXT   MediaWiki site API script extension. Default: .php
  -c COUCH, --couch COUCH
                        CouchDB server URL. Default: http://localhost:5984
  --db DB               CouchDB database name. If not specified, the name will
                        be derived from Mediawiki host name.
  --titles TITLES [TITLES ...]
                        Download article pages with these names (titles). If a
                        name starts with @, it is interpreted as the name of a
                        file containing titles, one per line, UTF-8 encoded.
  --start START         Download all article pages beginning with this name
  --changes-since CHANGES_SINCE
                        Download all article pages that changed since the
                        specified time. Timestamp format is yyyymmddhhmmss. See
                        https://www.mediawiki.org/wiki/Timestamp. Hours,
                        minutes and seconds can be omitted
  --recent-days RECENT_DAYS
                        Number of days to look back for recent changes
  --recent              Download recently changed articles only
  --timeout TIMEOUT     Network communications timeout. Default: 30.0s
  -S, --siteinfo-only   Fetch or update siteinfo, then exit
  -r [SESSION ID], --resume [SESSION ID]
                        Resume previous scrape session. This relies on stats
                        saved in mwscrape database.
  --sessions-db-name SESSIONS_DB_NAME
                        Name of database where session info is stored.
                        Default: mwscrape
  --desc                Request all pages in descending order
  --delete-not-found    Remove non-existing pages from the database
  --speed {0,1,2,3,4,5}
                        Scrape speed
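
The --changes-since value uses the yyyymmddhhmmss timestamp format described above. As a small illustration (Python 3, standard library only), such a value could be produced from a datetime as follows. The helper name changes_since_arg is hypothetical, and the conversion to UTC is an assumption based on the MediaWiki API working with UTC timestamps.

```python
from datetime import datetime, timedelta, timezone


def changes_since_arg(moment):
    """Format a datetime as yyyymmddhhmmss for use with --changes-since."""
    # Assumption: the timestamp is interpreted as UTC, so convert first.
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


example = datetime(2019, 3, 1, 12, 30, 0, tzinfo=timezone.utc)
print(changes_since_arg(example))  # 20190301123000

# Build a command line for everything changed in the last seven days.
week_ago = datetime.now(timezone.utc) - timedelta(days=7)
print("mwscrape en.m.wiktionary.org --changes-since", changes_since_arg(week_ago))
```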

For example, to get English Wiktionary:

mwscrape en.m.wiktionary.org

To get the same, but work through the list of titles in reverse order:

mwscrape en.m.wiktionary.org --desc
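
When --db is not given, the database name is derived from the host name. Judging only from the database names that appear in this README (simple-m-wikipedia-org, en-m-wikipedia-org), dots appear to become hyphens. The sketch below (Python 3) illustrates that assumed rule with a hypothetical helper, guess_db_name; it is not taken from mwscrape's source.

```python
def guess_db_name(site):
    """Guess the CouchDB database name for a MediaWiki host name.

    Assumption: every dot in the host name is replaced with a hyphen,
    as the database names shown in this README suggest.
    """
    return site.replace(".", "-")


print(guess_db_name("en.m.wiktionary.org"))
```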

Some sites expose the MediaWiki API at a path different from Wikipedia’s default. Specify it with --site-path:

mwscrape lurkmore.to --site-path=/
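
The --titles option also accepts a file reference prefixed with @, where the file lists one title per line, UTF-8 encoded. A minimal sketch of preparing such a file (Python 3; the file name titles.txt and the sample titles are made up for illustration):

```python
import os
import tempfile

titles = ["apple", "café", "naïve"]

# Write one title per line, UTF-8 encoded, into a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "titles.txt")
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write("\n".join(titles) + "\n")

# The file is then passed to mwscrape with an @ prefix.
print("mwscrape en.m.wiktionary.org --titles @" + path)
```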

mwscrape compares page revisions reported by the MediaWiki API with the revisions of previously scraped pages in CouchDB and requests parsed page data if a new revision is available.

CouchDB data dumps (compressed with xz) for some Wikipedia sites are available at http://dl.aarddict.org/mwcouch. Download a database file to CouchDB’s data directory (e.g. /var/lib/couchdb) and decompress it.
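
A downloaded dump can be decompressed with the xz command-line tool or, as sketched here, with the lzma module from Python 3's standard library. The helper name decompress_xz and the file names are hypothetical; the demonstration first compresses a small stand-in file so that it runs without a real dump.

```python
import lzma
import os
import shutil
import tempfile


def decompress_xz(src, dst):
    """Stream-decompress the xz file src into dst without loading it all into memory."""
    with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


# Demonstration with a tiny stand-in file instead of a real database dump.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "example.couch.xz")
dst = os.path.join(workdir, "example.couch")
with lzma.open(src, "wb") as f:
    f.write(b"stand-in for CouchDB file contents")

decompress_xz(src, dst)
print(os.path.getsize(dst), "bytes written to", dst)
```

For a real dump, dst would be a path inside CouchDB's data directory, and the resulting file typically needs to be readable and writable by the user CouchDB runs as.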

mwscrape also creates a CouchDB design document, w, with a show function, html, that lets you view the article HTML returned by the MediaWiki API and navigate to the HTML of other collected articles. For example, to view the rendered HTML for article A in the database simple-m-wikipedia-org, open the following address in a web browser (assuming CouchDB is running on localhost):

http://127.0.0.1:5984/simple-m-wikipedia-org/_design/w/_show/html/A
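
The address follows a fixed pattern, so it can be built for any article. The sketch below (Python 3, standard library only) uses a hypothetical helper, show_html_url, and assumes that percent-encoding the title, including any slashes, is how titles with spaces or special characters should be passed in the URL path.

```python
from urllib.parse import quote


def show_html_url(couch_url, db, title):
    """Build the URL of the html show function in design document w."""
    # Percent-encode the title so spaces and non-ASCII characters are URL-safe.
    return "{}/{}/_design/w/_show/html/{}".format(
        couch_url.rstrip("/"), db, quote(title, safe="")
    )


print(show_html_url("http://127.0.0.1:5984", "simple-m-wikipedia-org", "A"))
# http://127.0.0.1:5984/simple-m-wikipedia-org/_design/w/_show/html/A
```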

If databases are combined via replication, articles with the same title will be stored as conflicts. The mwresolvec script is provided to merge conflicting versions (combine aliases, select the highest MediaWiki article revision, discard other revisions). Usage:

mwresolvec [-h] [-s START] [-b BATCH_SIZE] [-w WORKERS] [-v] couch_url

positional arguments:
  couch_url

optional arguments:
  -h, --help            show this help message and exit
  -s START, --start START
  -b BATCH_SIZE, --batch-size BATCH_SIZE
  -w WORKERS, --workers WORKERS
  -v, --verbose

Example:

mwresolvec http://localhost:5984/en-m-wikipedia-org
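
The merge rule described above (combine aliases, keep the highest MediaWiki revision, discard the rest) can be illustrated on plain dictionaries. This is only a sketch of the idea in Python 3, not mwresolvec's code; the field names revid and aliases and the function merge_versions are assumptions made for the example.

```python
def merge_versions(versions):
    """Merge conflicting versions of one article into a single document."""
    # Keep a copy of the version with the highest MediaWiki revision id.
    winner = dict(max(versions, key=lambda doc: doc["revid"]))
    # Combine aliases from every version, dropping duplicates.
    aliases = set()
    for doc in versions:
        aliases.update(doc.get("aliases", []))
    winner["aliases"] = sorted(aliases)
    return winner


conflicts = [
    {"title": "A", "revid": 101, "aliases": ["a"]},
    {"title": "A", "revid": 205, "aliases": ["Letter A"]},
    {"title": "A", "revid": 150, "aliases": ["a", "AA"]},
]
merged = merge_versions(conflicts)
print(merged["revid"], merged["aliases"])
```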

mwscrape's People

Contributors

itkach, sklart, korhoj
