manas-tiwari / mwoffliner

This project forked from openzim/mwoffliner

Scrape any online MediaWiki-powered wiki (like Wikipedia) to your local filesystem

Home Page: https://www.npmjs.com/package/mwoffliner

License: GNU General Public License v3.0

JavaScript 2.31% CSS 3.32% HTML 1.54% Dockerfile 0.37% Shell 0.61% TypeScript 91.69% PHP 0.17%

MWoffliner

MWoffliner is a tool for making a local offline HTML snapshot of any online MediaWiki instance. It goes through all online articles (or a selection, if specified) and creates the corresponding ZIM file. It has mainly been tested against Wikimedia projects such as Wikipedia and Wiktionary, but it should also work with any recent MediaWiki.

Read CONTRIBUTING.md to know more about MWoffliner development.

Features

  • Scrape with or without image thumbnail
  • Scrape with or without audio/video multimedia content
  • S3 cache (optional)
  • Image size optimiser / Webp converter
  • Scrape all articles in the selected namespaces, or only the articles in a title list
  • Specify additional/non-main namespaces to scrape

Run mwoffliner --help to get all the possible options.

Prerequisites

  • *NIX Operating System (GNU/Linux, macOS, ...)
  • Redis
  • NodeJS version 10 or greater
  • Libzim (On GNU/Linux & macOS we automatically download it)
  • Various build tools which are probably already installed on your machine (packages libjpeg-dev, autoconf, automake, gcc on Debian/Ubuntu)

... and an online MediaWiki instance with its API available.
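The Node.js requirement above can be checked before installing. The following is a minimal sketch, not part of MWoffliner: the names `MIN_NODE_MAJOR`, `nodeMajor` and `meetsNodeRequirement` are hypothetical helpers introduced only for illustration.

```javascript
// Minimal prerequisite check: is the running Node.js at least version 10?
// MIN_NODE_MAJOR and both helper functions are illustrative, not part of MWoffliner.
const MIN_NODE_MAJOR = 10;

// Extract the major version number from a string such as "10.24.1" or "v18.2.0".
function nodeMajor(version) {
    return parseInt(String(version).replace(/^v/, '').split('.')[0], 10);
}

// True when the given version satisfies the documented minimum.
function meetsNodeRequirement(version) {
    return nodeMajor(version) >= MIN_NODE_MAJOR;
}

if (meetsNodeRequirement(process.versions.node)) {
    console.log(`Node.js ${process.versions.node} satisfies the version requirement.`);
} else {
    console.error(`Node.js ${process.versions.node} is older than version ${MIN_NODE_MAJOR}.`);
}
```

The check only covers Node.js; Redis and the build tools still have to be verified separately.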

Usage

To install MWoffliner globally:

npm i -g mwoffliner

You might need to run this command with sudo, depending on how your npm is configured.

npm permission checking can be a bit annoying for a newcomer. Please read the documentation carefully if you hit problems: https://docs.npmjs.com/cli/v7/using-npm/scripts#user

Then to run it:

mwoffliner --help

To use MWoffliner with an S3 cache, provide an S3 URL like this:

--optimisationCacheUrl="https://wasabisys.com/?bucketName=my-bucket&keyId=my-key-id&secretAccessKey=my-sac"
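When the bucket name or credentials come from variables, the URL can be assembled programmatically rather than by hand. This is a sketch only: `buildCacheUrl` and its argument names are hypothetical, and it assumes MWoffliner accepts percent-encoded query values, which the text above does not confirm.

```javascript
// Build an S3 cache URL of the shape shown above.
// buildCacheUrl is an illustrative helper, not part of MWoffliner.
function buildCacheUrl(endpoint, bucketName, keyId, secretAccessKey) {
    const url = new URL(endpoint);
    // searchParams percent-encodes each value, e.g. "+" and "/" inside a secret.
    url.searchParams.set('bucketName', bucketName);
    url.searchParams.set('keyId', keyId);
    url.searchParams.set('secretAccessKey', secretAccessKey);
    return url.toString();
}

// Placeholder values taken from the example above.
const cacheUrl = buildCacheUrl('https://wasabisys.com/', 'my-bucket', 'my-key-id', 'my-sac');
console.log(`--optimisationCacheUrl="${cacheUrl}"`);
```

Using `URL` avoids quoting mistakes when a secret contains characters such as `&` that would otherwise split the query string.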

API

MWoffliner also provides an API and can therefore be used as a NodeJS library. Here is a stub example:

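const mwoffliner = require('mwoffliner');

const parameters = {
    mwUrl: "https://es.wikipedia.org",  // wiki to scrape
    adminEmail: "[email protected]",     // replace with your own contact address
    verbose: true,
    format: "nopic",                    // snapshot without pictures
    articleList: "./articleList"        // restrict the scrape to the listed articles
};

// execute() returns a Promise; report a failure instead of leaving it unhandled.
mwoffliner.execute(parameters)
    .catch((err) => {
        console.error(err);
        process.exitCode = 1;
    });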
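The `articleList` parameter in the stub points to a local file. The sketch below shows one way to generate such a file, assuming the expected format is plain UTF-8 text with one article title per line (an assumption; check `mwoffliner --help` for the authoritative description). `writeArticleList` and the sample titles are hypothetical.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Write article titles to a plain-text file, one title per line.
// writeArticleList is an illustrative helper, not part of MWoffliner.
function writeArticleList(filePath, titles) {
    // Trim whitespace, then drop blank entries and duplicates.
    const unique = [...new Set(titles.map((t) => t.trim()).filter((t) => t.length > 0))];
    fs.writeFileSync(filePath, unique.join('\n') + '\n', 'utf8');
    return unique.length;
}

// Sample titles are placeholders; a temporary directory keeps the sketch free of side effects.
const listPath = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'mwoffliner-')), 'articleList');
const written = writeArticleList(listPath, ['España', 'Madrid', ' Madrid ', '']);
console.log(`${written} titles written to ${listPath}`);
```

The resulting path can then be passed as the `articleList` value in the parameters object.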

Background

Complementary information about MWoffliner:

  • MediaWiki software is used by thousands of wikis, the most famous ones being the Wikimedia ones, including Wikipedia.
  • MediaWiki is a PHP wiki runtime engine.
  • Wikitext is the name of the markup language that MediaWiki uses.
  • MediaWiki includes a parser that converts Wikitext into HTML; this parser creates the HTML pages displayed in your browser.
  • There is another Wikitext parser, called Parsoid, implemented in JavaScript/NodeJS. MWoffliner uses Parsoid.
  • Parsoid is planned to eventually become the main parser for MediaWiki.
  • MWoffliner calls Parsoid and then post-processes the results for offline format.

GNU/Linux - Debian-based distributions

Install NodeJS:

curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.33.11/install.sh | bash && \
source ~/.bashrc && \
nvm install stable && \
node --version

Install Redis:

sudo apt-get install redis-server

Troubleshooting

Older GNU/Linux distributions and/or versions of Node.js might ship with a deprecated version of npm. Older versions of npm have incompatibilities with certain versions of Node.js and might simply fail to install the mwoffliner package.

We recommend using a recent version of npm. Recent versions work fine with the older Node.js 10. First install the packaged version of npm, then use it to install a newer version:

sudo npm install --unsafe-perm -g npm

Don't forget to remove the packaged version of npm afterward.

License

GPLv3 or later, see LICENSE for more details.
