Wikimedia To Text Corpus

Wikimedia is the organization behind Wikipedia. It provides a monthly full backup of all the data on Wikipedia and its sister properties. The purpose of this repo is to convert a Wikimedia dump from its native format into the text corpus format we use, i.e.:

  • The full corpus consisting of one or more TXT files in a single folder
  • One or more articles in a single TXT file
  • Each article will have a header in the form "--- {id} ---"
  • Each article will have its abstract and body extracted
  • One sentence per line
  • Paragraphs are separated by a blank line
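
For illustration, a file holding two hypothetical articles (the ids and sentences below are invented) would look like this:

--- 12 ---
This is the first sentence of the abstract.
This is the second sentence of the abstract.

This is the first sentence of the body.
This is the second sentence of the body.

--- 15 ---
This is the only sentence of the abstract.

This is the only sentence of the body.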

Operation

Install

Install the package from an admin prompt:

pip uninstall wikimedia
python -OO -m pip install -v git+https://github.com/TextCorpusLabs/wikimedia.git

or, if you have the code locally:

pip uninstall wikimedia
python -OO -m pip install -v c:/repos/TextCorpusLabs/wikimedia

Run

You are responsible for getting the source files. They can be found on Wikimedia's dump site (https://dumps.wikimedia.org). From there, navigate into the particular wiki you want to download.

You are responsible for decompressing and validating the source files. I recommend using 7zip. I installed my copy using Chocolatey.

The reason these steps are left to you is that each dump is a single massive file. Sometimes Wikimedia's servers are busy and the download is slow; modern browsers support resuming a download for exactly this case. As of 2023/01/22 the English dump is over 90 GB in .xml form, so make sure you have enough disk space before you start.
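
As a sketch, assuming you downloaded the standard single-file English dump (the file name below follows Wikimedia's usual naming and may differ for your snapshot), installing 7zip and extracting might look like this:

choco install 7zip
7z x enwiki-latest-pages-articles.xml.bz2 -od:/data/wiki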

All the commands below assume the corpus is a single extracted .xml file.

  1. Extract the metadata from the corpus.
wikimedia metadata -source d:/data/wiki/enwiki.xml -dest d:/data/wiki/enwiki.meta.csv

The following are required parameters:

  • source is the .xml file sourced from Wikimedia.
  • dest is the CSV file used to store the metadata.

The following are optional parameters:

  • log is the folder used to save raw XML chunks that failed to process. It defaults to empty (not saved).
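
Before moving on, you can sanity check the metadata file with a few lines of Python. This is just a sketch; it assumes only that the first CSV row is a header (the exact columns are whatever the tool emits):

import csv

with open('d:/data/wiki/enwiki.meta.csv', 'r', encoding='utf-8', newline='') as fp:
    reader = csv.reader(fp)
    header = next(reader)          # first row holds the column names
    rows = sum(1 for _ in reader)  # count the remaining data rows
print(header)
print(f'{rows} metadata rows')
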
  2. Convert the data to our standard format.
wikimedia convert -source d:/data/wiki/enwiki.xml -dest d:/data/wiki.std

The following are required parameters:

  • source is the .xml file sourced from Wikimedia.
  • dest is the folder for the converted TXT files.

The following are optional parameters:

  • lines is the number of lines per TXT file. The default is 1000000.
  • dest_pattern is the format of the TXT file name. It defaults to wikimedia.{id:04}.txt, where id is a counter that increments after lines lines are stored in a file.
  • log is the folder used to save raw XML chunks that failed to process. It defaults to empty (not saved).
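
The default dest_pattern behaves like a standard Python format string. A minimal sketch of the resulting file names (assuming the default pattern and that id starts at 1):

pattern = 'wikimedia.{id:04}.txt'
print(pattern.format(id=1))   # wikimedia.0001.txt
print(pattern.format(id=12))  # wikimedia.0012.txt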

Debug/Test

The code in this repo is set up as a module. Debugging and testing assume the module is already installed. In order to debug (F5) or run the tests (Ctrl + ; Ctrl + A), make sure to install the module as editable (see below).

pip uninstall wikimedia
python -m pip install -e c:/repos/TextCorpusLabs/wikimedia

Academic boilerplate

Below is the suggested text to add to the "Methods and Materials" section of your paper when using this process. The references can be found here.

The 2022/10/01 English version of Wikipedia [@wikipedia2020] was downloaded using Wikimedia's download service [@wikimedia2020]. The single-file data dump was then converted to a corpus of plain text articles using the process described in [@wikicorpus2020].
