
readwiki-zh's Introduction

ReadWiki-ZH

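Extracts valid entries from a Chinese Wiki dump and converts them to text files or Markdown files.
Valid entries: entries that are not of type Template, Category, Wikipedia, File, Topic, Portal, MediaWiki, Draft, Help, or similar. When several entries are synonyms, only one of them is kept.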
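The rule above can be sketched as a title filter. This is only an illustration, not the project's code: it assumes that an entry's type shows up as a `Type:` prefix in the title and that synonyms are represented as redirects to a canonical title. The names `EXCLUDED_PREFIXES`, `is_valid_title`, and `canonical_titles` are hypothetical.

```python
# Hypothetical sketch: the prefix list mirrors the types named above;
# the real project may detect entry types differently.
EXCLUDED_PREFIXES = frozenset(
    name.lower()
    for name in (
        "Template", "Category", "Wikipedia", "File", "Topic",
        "Portal", "MediaWiki", "Draft", "Help",
    )
)


def is_valid_title(title: str) -> bool:
    """Return False when the title starts with an excluded 'Type:' prefix."""
    prefix, sep, _ = title.partition(":")
    if not sep:
        return True  # no prefix at all, so it is an ordinary entry
    return prefix.strip().lower() not in EXCLUDED_PREFIXES


def canonical_titles(pages):
    """Collapse synonyms so that each canonical title is kept once.

    `pages` is an iterable of (title, redirect_target_or_None) pairs,
    an input shape made up for this illustration.
    """
    kept = {}
    for title, target in pages:
        canonical = target or title
        if is_valid_title(canonical):
            kept.setdefault(canonical, None)  # dict keeps first-seen order
    return list(kept)
```

A title such as `2001: A Space Odyssey` still passes, because only the listed prefixes are rejected.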

1. Environment Setup

Tested with: Python 3.7.4, Ubuntu 18.04, Windows 7
Install the dependencies in a virtual environment:

pip install -r requirements.txt

2. Download the Chinese Wiki Dump

2.1 Download with Wget

Wget must be installed.

from readwiki.wiki_download import WIKIDownload

# Choose the dump index (archive date) and the output folder
archive = '20200220'
output_dir = './dump'
print('Downloading dump:', archive)

# Download the dump files with Wget
downloader = WIKIDownload(output_dir)
xml_path, txt_path = downloader.run(archive, verbose=True)

print('Index txt:', txt_path)
print('Content xml:', xml_path)
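For readers who want to see what such a download step might involve, here is a minimal sketch. It is not the implementation of `WIKIDownload`, whose internals are not shown here. It assumes the files are served from `https://dumps.wikimedia.org/zhwiki/<archive>/` and that `wget` is on the PATH. The helper names `dump_urls`, `wget_command`, and `download` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed download host for this sketch
BASE_URL = "https://dumps.wikimedia.org/zhwiki"


def dump_urls(archive: str) -> list:
    """Build the URLs of the content XML and the index TXT for one archive date."""
    stem = f"zhwiki-{archive}-pages-articles-multistream"
    return [
        f"{BASE_URL}/{archive}/{stem}.xml.bz2",
        f"{BASE_URL}/{archive}/{stem}-index.txt.bz2",
    ]


def wget_command(url: str, output_dir: str) -> list:
    """Return the wget argument list: -c resumes, -P sets the target folder."""
    return ["wget", "-c", "-P", output_dir, url]


def download(archive: str, output_dir: str = "./dump") -> None:
    """Fetch both dump files with wget, failing early if wget is missing."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget was not found on PATH")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for url in dump_urls(archive):
        subprocess.run(wget_command(url, output_dir), check=True)
```

Calling `download('20200220')` would start two large transfers, so the URL and command builders are kept separate and can be inspected without touching the network.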

2.2 Manual Download

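On the Chinese Dump Index page, choose an archive date. The newer the archive date, the more entries the dump contains.
After choosing 20200220 (any other date works as well), download the following two files and extract them into the dump folder.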

zhwiki-20200220-pages-articles-multistream.xml.bz2 1.9 GB
zhwiki-20200220-pages-articles-multistream-index.txt.bz2 26.9 MB
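The smaller of the two files is a plain-text index of the larger one. In the multistream dump format, each index line has the form `offset:page_id:title`, where `offset` is the byte position of the compressed stream that holds the page. The sketch below reads such lines; the names `IndexEntry`, `parse_index`, and `read_index` are hypothetical and are not part of readwiki.

```python
import bz2
from typing import Iterable, Iterator, NamedTuple


class IndexEntry(NamedTuple):
    offset: int   # byte offset of the bz2 stream that holds the page
    page_id: int
    title: str


def parse_index(lines: Iterable[str]) -> Iterator[IndexEntry]:
    """Parse 'offset:page_id:title' lines, skipping empty ones."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split only twice, because titles may contain ':' themselves
        offset, page_id, title = line.split(":", 2)
        yield IndexEntry(int(offset), int(page_id), title)


def read_index(path: str) -> Iterator[IndexEntry]:
    """Read an index file, compressed (.bz2) or already extracted."""
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        yield from parse_index(handle)
```

Counting the items yielded by `read_index` gives the total number of entries in a dump without opening the much larger XML file.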

3. Extract Valid Entries to Files

from readwiki.wiki_parse2doc import WIKIParse2Doc

# Path to the dump file
xml_path = './dump/zhwiki-20200220-pages-articles-multistream.xml.bz2'

# Extract the first 100 valid entries to TXT files
WIKIParse2Doc(xml_path, './docs/words_txt').run(num=100)
# Extract the first 100 valid entries to Markdown files
WIKIParse2Doc(xml_path, './docs/words_md', markdown=True).run(num=100)

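Set num=None to extract all valid entries.
The dump contains 3,430,255 entries in total, of which 1,098,595 (about 32%) are valid.
After extraction finishes, the output files can be found in the docs folder.
Some output examples: 数学 (Mathematics), 开放源代码 (Open source), 邓丽君 (Teresa Teng)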
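To give an idea of how a limit such as `num=100` can be applied without loading the whole dump, the sketch below streams pages from the compressed XML using only the standard library. It is not the implementation of `WIKIParse2Doc`. It assumes the usual MediaWiki export layout (`page` elements containing `title` and `revision/text`), and the function names are hypothetical.

```python
import bz2
import xml.etree.ElementTree as ET
from itertools import islice


def local_name(tag: str) -> str:
    """Strip the '{namespace}' part that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_bz2_path: str):
    """Yield (title, wikitext) pairs one page at a time."""
    with bz2.open(xml_bz2_path, "rb") as stream:
        for _, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            title, text = None, ""
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text
                elif name == "text":
                    text = child.text or ""
            yield title, text
            elem.clear()  # drop the finished subtree to limit memory use


def first_pages(xml_bz2_path: str, num=None):
    """Return the first `num` pages, or all pages when num is None."""
    return list(islice(iter_pages(xml_bz2_path), num))
```

Note that this sketch counts every page, while the `num` argument described above counts valid entries only, so a real pipeline would filter the titles before applying the limit.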

readwiki-zh's People

Contributors

quqixun


readwiki-zh's Issues

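Hi, I found a problem

During conversion, tables are removed and only blank space is left in their place.
In addition, the references section is missing for some entries.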
