WikiClean

WikiClean is a Java library that converts Wikipedia markup to plain text. It takes Wikipedia XML dumps whose articles are written in MediaWiki markup and generates clean plain text.

Why?

Text processing applications often need plain text that is free of MediaWiki markup. Producing it is surprisingly non-trivial, because Wikipedia articles are full of complexities such as references, image captions, tables, and infoboxes, which are not useful for many applications.

Before setting out to write this package, I explored many of the existing Java alternatives for parsing Wikipedia pages and found none of them adequate for generating clean plain text. The primary challenge is that most of these packages aspire to be complete Wikipedia parsers (e.g., for rendering), whereas WikiClean was designed with a much simpler goal: wiki markup to plain text conversion (nothing more, nothing less).
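
To illustrate why even this simpler goal takes some care, here is a minimal sketch of one of the complexities mentioned above: templates such as infoboxes can nest, so a single regular expression does not remove them cleanly. The class and method names (TemplateStripper, stripTemplates) are hypothetical, and this is not WikiClean's actual implementation; it only shows the brace-depth idea.

```java
class TemplateStripper {
    /** Removes {{...}} templates, including nested ones, by tracking brace depth. */
    static String stripTemplates(String wikitext) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // number of currently open "{{"
        int i = 0;
        while (i < wikitext.length()) {
            if (wikitext.startsWith("{{", i)) {
                depth++;
                i += 2;
            } else if (depth > 0 && wikitext.startsWith("}}", i)) {
                depth--;
                i += 2;
            } else {
                // Keep characters only when we are outside every template.
                if (depth == 0) {
                    out.append(wikitext.charAt(i));
                }
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "A {{infobox|x={{nested}}|y=1}}B";
        // A non-greedy regex stops at the first "}}" and leaves debris behind.
        System.out.println(raw.replaceAll("\\{\\{.*?\\}\\}", "")); // A |y=1}}B
        // Depth tracking removes the whole nested template.
        System.out.println(stripTemplates(raw)); // A B
    }
}
```

References, tables, and image captions each need similar special handling, which is why a dedicated cleaner is convenient.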

Usage

It's simple to use WikiClean:

WikiClean cleaner = new WikiClean.Builder().build();
String content = cleaner.clean(raw);

Here, raw is the raw Wikipedia XML.

The builder allows you to specify a few options:

  • withTitle specifies whether to prepend the article title to the plain text.
  • withFooter specifies whether to keep the sections "See also", "References", "Further reading", and "External links".

By default, both options are set to false.
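
As a rough illustration of what dropping the footer means, the sketch below truncates wiki markup at the first footer heading. It assumes the footer sections appear as level-2 headings (for example ==See also==) and that everything from the first such heading onward can be discarded. The names FooterTrimmer and trimFooter are hypothetical, and WikiClean's own logic may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FooterTrimmer {
    // Assumed heading form: a level-2 heading naming one of the footer sections.
    private static final Pattern FOOTER_HEADING = Pattern.compile(
        "^==\\s*(See also|References|Further reading|External links)\\s*==\\s*$",
        Pattern.MULTILINE);

    /** Returns the markup up to (not including) the first footer heading. */
    static String trimFooter(String wikitext) {
        Matcher m = FOOTER_HEADING.matcher(wikitext);
        if (!m.find()) {
            return wikitext; // no footer section found, nothing to drop
        }
        // Cut at the heading and remove the whitespace left before it.
        return wikitext.substring(0, m.start()).replaceAll("\\s+$", "");
    }

    public static void main(String[] args) {
        String raw = "Body text.\n\n==See also==\n* [[Other]]\n";
        System.out.println(trimFooter(raw));
    }
}
```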

Also, use withLanguage to set the language. Currently, 17 languages are supported; the corresponding classes are in org.wikiclean.languages.

Contributions that add support for additional languages are welcome!

Putting everything together, the default builder is equivalent to:

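import org.wikiclean.WikiClean;
import org.wikiclean.languages.English;

// Spell out the builder defaults explicitly.
WikiClean cleaner =
    new WikiClean.Builder()
        .withLanguage(new English())
        .withTitle(false)
        .withFooter(false)
        .build();
String content = cleaner.clean(raw);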

Sample command-line invocation to read a Wikipedia dump and output plain text:

./scripts/wikipedia-articles-dump -input enwiki-20161220-pages-articles.xml.bz2 | less
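
If you would rather drive the cleaner from your own code, you first need the raw XML of each page. The sketch below is one way to split an uncompressed dump into page blocks, assuming that each <page> and </page> tag sits on its own line, as it does in standard dumps. PageSplitter and forEachPage are hypothetical names and not part of WikiClean; reading a .bz2 file directly would additionally need a decompression library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;

class PageSplitter {
    /** Passes each <page>...</page> block of the dump to the handler as one string. */
    static void forEachPage(Reader dump, Consumer<String> handler) throws IOException {
        BufferedReader reader = new BufferedReader(dump);
        StringBuilder current = null; // non-null while inside a <page> element
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (trimmed.equals("<page>")) {
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(line).append('\n');
                if (trimmed.equals("</page>")) {
                    // Hand over the finished page so it is never held longer than needed.
                    handler.accept(current.toString());
                    current = null;
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String dump = "<mediawiki>\n  <page>\n    <title>A</title>\n  </page>\n</mediawiki>\n";
        forEachPage(new StringReader(dump), System.out::print);
    }
}
```

Each string passed to the handler could then be given to cleaner.clean(raw) as shown in the usage section.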

Maven Artifacts

Latest releases of Maven artifacts are available at Maven Central.

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0

Contributors

chongwf, fabrichter, jimmy0017, lintool, rosequ, sebkur, shivahr, tballison
