Wikimedia To Text Corpus

Wikimedia is the organization behind Wikipedia. It provides a monthly full backup of all the data on Wikipedia and its sister properties. The purpose of this repo is to convert a Wikimedia dump from its native format into the text corpus format we use, i.e.:

  • The full corpus consisting of one or more TXT files in a single folder
  • One or more articles in a single TXT file
  • Each article will have a header in the form "--- {id} ---"
  • Each article will have its abstract and body extracted
  • One sentence per line
  • Paragraphs are separated by a blank line
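
For illustration, a file holding two hypothetical articles (the ids and sentences below are invented) would look like this:

--- 12 ---
This is the first sentence of the abstract.
This is the second sentence of the abstract.

This is the first sentence of the body.
This is the second sentence of the body.

--- 15 ---
This is the only sentence of the abstract.

This is the only sentence of the body.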

Operation

Install

Install the package from an admin prompt:

pip uninstall wikimedia
python -OO -m pip install -v git+https://github.com/TextCorpusLabs/wikimedia.git

or, if you have the code locally:

pip uninstall wikimedia
python -OO -m pip install -v c:/repos/TextCorpusLabs/wikimedia

Run

You are responsible for getting the source files. They can be found on Wikimedia's dump site (https://dumps.wikimedia.org). From there, navigate into the particular wiki you want to download.

You are responsible for decompressing and validating the source files. I recommend using 7zip. I installed my copy using Chocolatey.

The reason these steps are left to you is that each dump is a single massive file. Sometimes Wikimedia's servers are busy and the download is slow; modern browsers support resuming a download for exactly this case. As of 2023/01/22 the English dump is over 90 GB in .xml form, so make sure you have enough disk space before you start.
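
As a sketch, assuming you downloaded the standard single-file English dump (the file name below follows Wikimedia's usual naming and may differ for your snapshot), installing 7zip and extracting might look like this:

choco install 7zip
7z x enwiki-latest-pages-articles.xml.bz2 -od:/data/wiki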

All the commands below assume the corpus is a single extracted .xml file.

  1. Extract the metadata from the corpus.
wikimedia metadata -source d:/data/wiki/enwiki.xml -dest d:/data/wiki/enwiki.meta.csv

The following are required parameters:

  • source is the .xml file sourced from Wikimedia.
  • dest is the CSV file used to store the metadata.

The following are optional parameters:

  • log is the folder used to save raw XML chunks that failed to process. It defaults to empty (not saved).
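
Before moving on, you can sanity check the metadata file with a few lines of Python. This is just a sketch; it assumes only that the first CSV row is a header (the exact columns are whatever the tool emits):

import csv

with open('d:/data/wiki/enwiki.meta.csv', 'r', encoding='utf-8', newline='') as fp:
    reader = csv.reader(fp)
    header = next(reader)          # first row holds the column names
    rows = sum(1 for _ in reader)  # count the remaining data rows
print(header)
print(f'{rows} metadata rows')
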
  2. Convert the data to our standard format.
wikimedia convert -source d:/data/wiki/enwiki.xml -dest d:/data/wiki.std

The following are required parameters:

  • source is the .xml file sourced from Wikimedia.
  • dest is the folder for the converted TXT files.

The following are optional parameters:

  • lines is the number of lines per TXT file. The default is 1000000.
  • dest_pattern is the format of the TXT file name. It defaults to wikimedia.{id:04}.txt, where id is a counter that increments after lines lines are stored in a file.
  • log is the folder used to save raw XML chunks that failed to process. It defaults to empty (not saved).
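
The default dest_pattern behaves like a standard Python format string. A minimal sketch of the resulting file names (assuming the default pattern and that id starts at 1):

pattern = 'wikimedia.{id:04}.txt'
print(pattern.format(id=1))   # wikimedia.0001.txt
print(pattern.format(id=12))  # wikimedia.0012.txt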

Debug/Test

The code in this repo is set up as a module. Debugging and testing assume the module is already installed. In order to debug (F5) or run the tests (Ctrl + ; Ctrl + A), make sure to install the module as editable (see below).

pip uninstall wikimedia
python -m pip install -e c:/repos/TextCorpusLabs/wikimedia

Academic boilerplate

Below is the suggested text to add to the "Methods and Materials" section of your paper when using this process. The references can be found here.

The 2022/10/01 English version of Wikipedia [@wikipedia2020] was downloaded using Wikimedia's download service [@wikimedia2020]. The single-file data dump was then converted to a corpus of plain text articles using the process described in [@wikicorpus2020].
