
macbre / mediawiki-dump


Python package for working with MediaWiki XML content dumps

Home Page: https://pypi.org/project/mediawiki_dump/

License: MIT License

Topics: wikipedia, wikipedia-corpus, wikipedia-dump, wikia, fandom, python3-library, mediawiki-dump, xml-dump, python

mediawiki-dump's Introduction

mediawiki-dump


pip install mediawiki_dump

Python 3 package for working with MediaWiki XML content dumps.

Wikipedia (bz2 compressed) and Wikia (7zip) content dumps are supported.

Dependencies

In order to read 7zip archives (used by Wikia's XML dumps) you need to install libarchive:

sudo apt install libarchive-dev
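
To verify that the native library can be loaded from Python, a minimal check (an assumption here: the libarchive-c binding, which is imported as libarchive and can be installed with pip install libarchive-c, is the one in use):

import libarchive  # fails at import time if the binding or the native libarchive library is missing

print('libarchive loaded from', libarchive.__file__)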

API

Tokenizer

Allows you to clean up the wikitext:

from mediawiki_dump.tokenizer import clean
clean('[[Foo|bar]] is a link')
'bar is a link'

And then tokenize the text:

from mediawiki_dump.tokenizer import tokenize
tokenize('11. juni 2007 varð kunngjørt, at Svínoyar kommuna verður løgd saman við Klaksvíkar kommunu eftir komandi bygdaráðsval.')
['juni', 'varð', 'kunngjørt', 'at', 'Svínoyar', 'kommuna', 'verður', 'løgd', 'saman', 'við', 'Klaksvíkar', 'kommunu', 'eftir', 'komandi', 'bygdaráðsval']
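
clean and tokenize compose naturally; for example, a quick word-frequency count over a snippet of wikitext (a small sketch using only the two helpers above plus the standard library):

from collections import Counter

from mediawiki_dump.tokenizer import clean, tokenize

wikitext = '[[Foo|bar]] is a link, and [[Foo]] is a link too'
words = tokenize(clean(wikitext))

# count the most frequent tokens
print(Counter(words).most_common(3))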

Dump reader

Fetch and parse dumps (using a local file cache):

from mediawiki_dump.dumps import WikipediaDump
from mediawiki_dump.reader import DumpReader

dump = WikipediaDump('fo')
pages = DumpReader().read(dump)

[page.title for page in pages][:10]

['Main Page', 'Brúkari:Jon Harald Søby', 'Forsíða', 'Ormurin Langi', 'Regin smiður', 'Fyrimynd:InterLingvLigoj', 'Heimsyvirlýsingin um mannarættindi', 'Bólkur:Kvæði', 'Bólkur:Yrking', 'Kjak:Forsíða']

The read method yields a DumpEntry object for each revision.
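
For instance, to peek at the first revision in the stream (a small sketch; title and content are the entry attributes used throughout this README, and the DumpEntry repr shown further below indicates each entry also carries its author and timestamp):

from mediawiki_dump.dumps import WikipediaDump
from mediawiki_dump.reader import DumpReader

dump = WikipediaDump('fo')

for entry in DumpReader().read(dump):
    # inspect just the first entry yielded by the reader
    print(entry.title, len(entry.content))
    break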

By using the DumpReaderArticles class you can read article pages only:

import logging; logging.basicConfig(level=logging.INFO)

from mediawiki_dump.dumps import WikipediaDump
from mediawiki_dump.reader import DumpReaderArticles

dump = WikipediaDump('fo')
reader = DumpReaderArticles()
pages = reader.read(dump)

print([page.title for page in pages][:25])

print(reader.get_dump_language())  # fo

Will give you:

INFO:DumpReaderArticles:Parsing XML dump...
INFO:WikipediaDump:Checking /tmp/wikicorpus_62da4928a0a307185acaaa94f537d090.bz2 cache file...
INFO:WikipediaDump:Fetching fo dump from <https://dumps.wikimedia.org/fowiki/latest/fowiki-latest-pages-meta-current.xml.bz2>...
INFO:WikipediaDump:HTTP 200 (14105 kB will be fetched)
INFO:WikipediaDump:Cache set
...
['WIKIng', 'Føroyar', 'Borðoy', 'Eysturoy', 'Fugloy', 'Forsíða', 'Løgmenn í Føroyum', 'GNU Free Documentation License', 'GFDL', 'Opið innihald', 'Wikipedia', 'Alfrøði', '2004', '20. juni', 'WikiWiki', 'Wiki', 'Danmark', '21. juni', '22. juni', '23. juni', 'Lívfrøði', '24. juni', '25. juni', '26. juni', '27. juni']
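
A typical next step is turning the article dump into a plain-text corpus; a minimal sketch combining DumpReaderArticles with the clean helper from the tokenizer module:

from mediawiki_dump.dumps import WikipediaDump
from mediawiki_dump.reader import DumpReaderArticles
from mediawiki_dump.tokenizer import clean

dump = WikipediaDump('fo')

with open('corpus.txt', mode='wt', encoding='utf-8') as fp:
    for page in DumpReaderArticles().read(dump):
        # strip the wikitext markup and write one article per block
        fp.write(clean(page.content).strip() + '\n\n')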

Reading Wikia's dumps

import logging; logging.basicConfig(level=logging.INFO)

from mediawiki_dump.dumps import WikiaDump
from mediawiki_dump.reader import DumpReaderArticles

dump = WikiaDump('plnordycka')
pages = DumpReaderArticles().read(dump)

print([page.title for page in pages][:25])

Will give you:

INFO:DumpReaderArticles:Parsing XML dump...
INFO:WikiaDump:Checking /tmp/wikicorpus_f7dd3b75c5965ee10ae5fe4643fb806b.7z cache file...
INFO:WikiaDump:Fetching plnordycka dump from <https://s3.amazonaws.com/wikia_xml_dumps/p/pl/plnordycka_pages_current.xml.7z>...
INFO:WikiaDump:HTTP 200 (129 kB will be fetched)
INFO:WikiaDump:Cache set
INFO:WikiaDump:Reading wikicorpus_f7dd3b75c5965ee10ae5fe4643fb806b file from dump
...
INFO:DumpReaderArticles:Parsing completed, entries found: 615
['Nordycka Wiki', 'Strona główna', '1968', '1948', 'Ormurin Langi', 'Mykines', 'Trollsjön', 'Wyspy Owcze', 'Nólsoy', 'Sandoy', 'Vágar', 'Mørk', 'Eysturoy', 'Rakfisk', 'Hákarl', '1298', 'Sztokfisz', '1978', '1920', 'Najbardziej na północ', 'Svalbard', 'Hamferð', 'Rok w Skandynawii', 'Islandia', 'Rissajaure']

Fetching full history

Pass full_history=True to the BaseDump constructor to fetch the XML content dump with full history:

import logging; logging.basicConfig(level=logging.INFO)

from mediawiki_dump.dumps import WikiaDump
from mediawiki_dump.reader import DumpReaderArticles

dump = WikiaDump('macbre', full_history=True)  # fetch full history, including old revisions
pages = DumpReaderArticles().read(dump)

print('\n'.join([repr(page) for page in pages]))

Will give you:

INFO:DumpReaderArticles:Parsing completed, entries found: 384
<DumpEntry "Macbre Wiki" by Default at 2016-10-12T19:51:06+00:00>
<DumpEntry "Macbre Wiki" by Wikia at 2016-10-12T19:51:05+00:00>
<DumpEntry "Macbre Wiki" by Macbre at 2016-11-04T10:33:20+00:00>
<DumpEntry "Macbre Wiki" by FandomBot at 2016-11-04T10:37:17+00:00>
<DumpEntry "Macbre Wiki" by FandomBot at 2017-01-25T14:47:37+00:00>
<DumpEntry "Macbre Wiki" by Ryba777 at 2017-04-10T11:20:25+00:00>
<DumpEntry "Macbre Wiki" by Ryba777 at 2017-04-10T11:21:20+00:00>
<DumpEntry "Macbre Wiki" by Macbre at 2018-03-07T12:51:12+00:00>
<DumpEntry "Main Page" by Wikia at 2016-10-12T19:51:05+00:00>
<DumpEntry "FooBar" by Anonymous at 2016-11-08T10:15:33+00:00>
<DumpEntry "FooBar" by Anonymous at 2016-11-08T10:15:49+00:00>
...
<DumpEntry "YouTube tag" by FANDOMbot at 2018-06-05T11:45:44+00:00>
<DumpEntry "Maps" by Macbre at 2018-06-06T08:51:24+00:00>
<DumpEntry "Maps" by Macbre at 2018-06-07T08:17:13+00:00>
<DumpEntry "Maps" by Macbre at 2018-06-07T08:17:36+00:00>
<DumpEntry "Scary transclusion" by Macbre at 2018-07-24T14:52:20+00:00>
<DumpEntry "Lua" by Macbre at 2018-09-11T14:04:15+00:00>
<DumpEntry "Lua" by Macbre at 2018-09-11T14:14:24+00:00>
<DumpEntry "Lua" by Macbre at 2018-09-11T14:14:37+00:00>
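
Since the full history yields one entry per revision, counting revisions per page becomes a one-liner with the standard library (a sketch based on the example above):

from collections import Counter

from mediawiki_dump.dumps import WikiaDump
from mediawiki_dump.reader import DumpReaderArticles

dump = WikiaDump('macbre', full_history=True)
pages = DumpReaderArticles().read(dump)

# count how many revisions each page has and show the most edited ones
revisions_per_page = Counter(page.title for page in pages)
print(revisions_per_page.most_common(5))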

Reading dumps of selected articles

You can use the mwclient Python library to fetch "live" dumps of selected articles from any MediaWiki-powered site.

import mwclient
site = mwclient.Site('vim.fandom.com', path='/')

from mediawiki_dump.dumps import MediaWikiClientDump
from mediawiki_dump.reader import DumpReaderArticles

dump = MediaWikiClientDump(site, ['Vim documentation', 'Tutorial'])

pages = DumpReaderArticles().read(dump)

print('\n'.join([repr(page) for page in pages]))

Will give you:

<DumpEntry "Vim documentation" by Anonymous at 2019-07-05T09:39:47+00:00>
<DumpEntry "Tutorial" by Anonymous at 2019-07-05T09:41:19+00:00>
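
Entries fetched this way behave like the ones read from XML dumps, so you can, for example, strip the markup with the tokenizer's clean helper (a small sketch reusing the example above):

import mwclient

from mediawiki_dump.dumps import MediaWikiClientDump
from mediawiki_dump.reader import DumpReaderArticles
from mediawiki_dump.tokenizer import clean

site = mwclient.Site('vim.fandom.com', path='/')
dump = MediaWikiClientDump(site, ['Vim documentation', 'Tutorial'])

for page in DumpReaderArticles().read(dump):
    # print the first 100 characters of the cleaned-up article text
    print(page.title, '->', clean(page.content)[:100])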

Finding pages with a specific parser tag

Let's find pages where the no-longer-supported <place> tag is still used:

import logging; logging.basicConfig(level=logging.INFO)

from mediawiki_dump.dumps import WikiaDump
from mediawiki_dump.reader import DumpReader

dump = WikiaDump('plpoznan')
pages = DumpReader().read(dump)

with_places_tag = [
    page.title
    for page in pages
    if '<place ' in page.content
]

logging.info('Pages found: %d', len(with_places_tag))

with open("pages.txt", mode="wt", encoding="utf-8") as fp:
    for entry in with_places_tag:
        fp.write(entry + "\n")

logging.info("pages.txt file created")
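
A plain substring match misses <place> tags written without attributes; a slightly more robust variant uses a regular expression (a sketch, same reader API as above):

import re

from mediawiki_dump.dumps import WikiaDump
from mediawiki_dump.reader import DumpReader

# match both <place> and <place foo="bar">, case-insensitively
PLACE_TAG = re.compile(r'<place[\s>]', re.IGNORECASE)

dump = WikiaDump('plpoznan')

with_places_tag = [
    page.title
    for page in DumpReader().read(dump)
    if PLACE_TAG.search(page.content)
]

print('Pages found:', len(with_places_tag))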

Reading dumps from local files

You can also read dumps from local, non-compressed XML files:

from mediawiki_dump.dumps import LocalFileDump
from mediawiki_dump.reader import DumpReader

dump = LocalFileDump(dump_file="test/fixtures/dump.xml")
reader = DumpReader()

pages = [entry.title for entry in reader.read(dump)]
print(dump, pages)

Reading dumps from compressed local files

Or from any other iterator (such as an HTTP response):

import bz2

from mediawiki_dump.dumps import IteratorDump
from mediawiki_dump.reader import DumpReader

def get_content(file_name: str):
    with bz2.open(file_name, mode="r") as fp:
        yield from fp

dump = IteratorDump(iterator=get_content(file_name="test/fixtures/dump.xml.bz2"))
reader = DumpReader()

pages = [entry.title for entry in reader.read(dump)]
print(dump, pages)
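
Because IteratorDump only needs an iterator yielding chunks of the XML, the same approach works for a streamed HTTP response. A sketch, assuming the requests library, a placeholder DUMP_URL serving uncompressed XML, and that the reader accepts arbitrary byte chunks:

import requests

from mediawiki_dump.dumps import IteratorDump
from mediawiki_dump.reader import DumpReader

DUMP_URL = 'https://example.org/dump.xml'  # placeholder URL for an uncompressed XML dump

def get_content(url: str):
    # stream the response and yield it chunk by chunk
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        yield from response.iter_content(chunk_size=64 * 1024)

dump = IteratorDump(iterator=get_content(DUMP_URL))
reader = DumpReader()

pages = [entry.title for entry in reader.read(dump)]
print(pages)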

mediawiki-dump's People

Contributors

bobotig, dependabot[bot], e1mo, macbre


mediawiki-dump's Issues

Wikipedia - use 7zip for full content dumps

fowiki-latest-pages-meta-history.xml.7z            03-Nov-2018 06:06            44014056
fowiki-latest-pages-meta-history.xml.7z-rss.xml    03-Nov-2018 06:06                 790
fowiki-latest-pages-meta-history.xml.bz2           03-Nov-2018 06:06            68574198

65 MiB (bz2) vs 42 MiB (7z)

Implement a simple parser of template parameters

https://lyrics.wikia.com/wiki/Hamfer%C3%B0?action=edit

{{ArtistHeader
|star       = Bronze
|homepage   = 
|facebook   = Hamferd
|myspace    = hamferd
|twitter    = hamferd
|wikipedia  = fo:Hamferð
|wikipedia2 = en:Hamferð
|country    = Faroe Islands
|state      = 
|hometown   = Tórshavn
}}

==[[Hamferð:Vilst Er Síðsta Fet (2010)|Vilst er síðsta fet (2010)]]==
{{Album Art||Vilst er síðsta fet}}
# '''[[Hamferð:Harra Guð Títt Dýra Navn Og Æra|Harra Guð títt dýra navn og æra]]'''
# '''[[Hamferð:Vráin|Vráin]]'''
# '''[[Hamferð:Aldan Revsar Eitt Vargahjarta|Aldan revsar eitt vargahjarta]]'''
# '''[[Hamferð:At Enda|At enda]]'''
{{clear}}

==[[Hamferð:Evst (2013)|Evst (2013)]]==
{{Album Art||Evst}}
# '''[[Hamferð:Evst|Evst]]'''
# '''[[Hamferð:Deyðir Varðar|Deyðir varðar]]'''
# '''[[Hamferð:Við Teimum Kvirru Gráu|Við teimum kvirru gráu]]'''
# '''[[Hamferð:At Jarða Tey Elskaðu|At jarða tey elskaðu]]'''
# '''[[Hamferð:Sinnisloysi|Sinnisloysi]]'''
# '''[[Hamferð:Ytst|Ytst]]'''
{{clear}}

==[[Hamferð:Támsins Likam (2018)|Támsins likam (2018)]]==
{{Album Art||Támsins likam}}
# '''[[Hamferð:Fylgisflog|Fylgisflog]]'''
# '''[[Hamferð:Stygd|Stygd]]'''
# '''[[Hamferð:Tvístevndur Meldur|Tvístevndur meldur]]'''
# '''[[Hamferð:Frosthvarv|Frosthvarv]]'''
# '''[[Hamferð:Hon Syndrast|Hon syndrast]]'''
# '''[[Hamferð:Vápn Í Anda|Vápn í anda]]'''
{{Clear}}

{{ArtistFooter
|fLetter     = H
|asin        = 
|iTunes      = 415357866
|allmusic    = mn0003690108
|discogs     = 3574951
|musicbrainz = 98e9584e-ace8-453e-8374-961584908f65
|spotify     = 4Okrb4EaAYXpRu2cx2ipa4
|youtube     = hamferdofficial
}}
{{MetalArchives|3540318967}}

Extract information like:

  • band's country
  • band's Spotify ID
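
A rough sketch of what such a parser could look like, using only regular expressions from the standard library (an illustration, not part of the package; it assumes one parameter per line, as in the wikitext above):

import re

def parse_template_params(wikitext: str, template: str) -> dict:
    """Extract the |name = value parameters of the first {{template ...}} occurrence."""
    match = re.search(
        r'\{\{' + re.escape(template) + r'(.*?)\}\}',
        wikitext,
        flags=re.DOTALL | re.IGNORECASE,
    )
    if not match:
        return {}

    params = {}
    for line in match.group(1).splitlines():
        line = line.strip().lstrip('|')
        if '=' in line:
            name, _, value = line.partition('=')
            params[name.strip()] = value.strip()
    return params

# e.g. with the wikitext quoted above:
#   parse_template_params(content, 'ArtistHeader').get('country')   # 'Faroe Islands'
#   parse_template_params(content, 'ArtistFooter').get('spotify')   # '4Okrb4EaAYXpRu2cx2ipa4'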

Allow fetching a dump of selected Wikipedia page(s)

E.g. https://en.wikipedia.org/wiki/Special:Export/The_Deep_(2012_film)

Use the mwclient package.

>>> site = mwclient.Site(('https', 'en.wikipedia.org'))
>>> page = site.pages[u'Leipäjuusto']
>>> page.text()

Usage example

import logging; logging.basicConfig(level=logging.DEBUG)

import mwclient
site = mwclient.Site('vim.fandom.com', path='/')

from mediawiki_dump.dumps import MediaWikiClientDump
dump = MediaWikiClientDump(site, ['Vim documentation', 'Tutorial'])

print(dump.get_content())
