wikidata-corpus's Introduction

wikidata

wikidata.org

Download

STORE_PATH=data
DATA_URL=http://download.wikipedia.com/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

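# Download into $STORE_PATH without changing directory,
# so that the data/... paths used in later steps still resolve.
mkdir -p $STORE_PATH
wget -P $STORE_PATH $DATA_URL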

Extract articles

WikiExtractor.py -b 5000M \
    -o data/zhwiki-latest-pages-articles.extracted \
    data/zhwiki-latest-pages-articles.xml.bz2
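
In its default (non-JSON) mode, WikiExtractor wraps each article in a `<doc ...>` ... `</doc>` element. The following is a minimal Python sketch for iterating over that output. `iter_docs` is a hypothetical helper, not part of WikiExtractor, and the attribute names shown in the usage note (`id`, `title`) are assumed to be present in the header line.

```python
import re

# Opening line of an article block, e.g. <doc id="13" url="..." title="...">
DOC_OPEN = re.compile(r"<doc\s+([^>]*)>")
# key="value" pairs inside the opening tag
ATTR = re.compile(r'(\w+)="([^"]*)"')


def iter_docs(lines):
    """Yield (attrs, text) for each <doc>...</doc> block found in `lines`."""
    attrs, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        match = DOC_OPEN.match(line)
        if match:
            # Start of a new article: remember its attributes, reset the body.
            attrs, body = dict(ATTR.findall(match.group(1))), []
        elif line.strip() == "</doc>" and attrs is not None:
            yield attrs, "\n".join(body).strip()
            attrs = None
        elif attrs is not None:
            body.append(line)
```

To use it on the extracted file, pass an open file object, for example `open("data/zhwiki-latest-pages-articles.extracted/AA/wiki_00", encoding="utf-8")`, and read `attrs["title"]` for each article.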

Convert Traditional Chinese to Simplified Chinese

opencc -i data/zhwiki-latest-pages-articles.extracted/AA/wiki_00  \
    -o data/zhwiki-latest-pages-articles.0620.chs \
    -c t2s.json

Download t2s.json.

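At this point, most of the Traditional-to-Simplified conversion is done.

Handling other cases

  1. Wikipedia's own Traditional/Simplified conversion is driven by a word list, supplemented by manual corrections. Manually corrected text uses the format below, mostly to handle terminology that differs between regions: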

他的主要成就包括Emacs及後來的GNU Emacs,GNU C 編譯器及-{zh-hant:GNU 除錯器;zh-hans:GDB 调试器}-。

This case can be handled with a simple regular expression. The variant tag for Simplified Chinese is usually zh-hans or zh-cn.
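
The regular-expression approach could be sketched as follows in Python. This is an illustration under stated assumptions, not the project's own code: `resolve_variants` and `pick_simplified` are hypothetical names, the pattern assumes `-{...}-` blocks are not nested, and falling back to the first listed variant when no Simplified one exists is an arbitrary choice.

```python
import re

# One -{ ... }- conversion block (non-greedy; nesting is not handled).
VARIANT_BLOCK = re.compile(r"-\{(.*?)\}-")
# Variant tags that denote Simplified Chinese, in order of preference.
PREFERRED = ("zh-hans", "zh-cn")


def pick_simplified(match):
    """Return the Simplified Chinese text of one conversion block."""
    inner = match.group(1)
    options = {}
    for part in inner.split(";"):
        tag, sep, text = part.partition(":")
        if sep:  # skip empty or untagged fragments
            options[tag.strip().lower()] = text.strip()
    for tag in PREFERRED:
        if tag in options:
            return options[tag]
    # No Simplified variant: keep untagged text as-is, else the first variant.
    return inner if not options else next(iter(options.values()))


def resolve_variants(text):
    """Replace every -{...}- block in `text` with its chosen variant."""
    return VARIANT_BLOCK.sub(pick_simplified, text)
```

Applied to the example sentence, the block `-{zh-hant:GNU 除錯器;zh-hans:GDB 调试器}-` is reduced to `GDB 调试器`.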

  2. When Wikipedia Extractor extracts the body text, it strips specially marked foreign-language text outright, which leaves body text like this:

西方语言中“数学”(;)一词源自于古希腊语的()

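The sentence above no longer reads properly, but sentences like this have little effect on the problem I am working on, so I ignore them for now. As a final step, replace the 「」『』 symbols with quotation marks and delete the empty parentheses along the way.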

python2 fix_special_symbols.py data/zhwiki-latest-pages-articles.0620.chs

When the program finishes, it writes its output to: data/zhwiki-latest-pages-articles.0620.chs.normalized
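
fix_special_symbols.py itself is not reproduced here. A minimal Python 3 sketch of the normalization described above (corner brackets turned into quotation marks, empty parentheses removed) might look like this. The choice of curly quotes, the set of leftover punctuation that counts as "empty", and the name `normalize_symbols` are all assumptions.

```python
import re

# Corner brackets mapped to curly quotation marks (assumed target style).
QUOTE_MAP = str.maketrans({"「": "“", "」": "”", "『": "‘", "』": "’"})

# Parentheses, full-width or ASCII, that contain nothing but whitespace
# or separator punctuation left behind by the extractor.
EMPTY_PARENS = re.compile(r"[((][\s;;,,、]*[))]")


def normalize_symbols(text):
    """Replace corner brackets with quotes and drop empty parentheses."""
    text = text.translate(QUOTE_MAP)
    return EMPTY_PARENS.sub("", text)
```

Parentheses that still hold real content are left untouched, since only whitespace and separators are matched between the brackets.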

Inspect the file

head data/zhwiki-latest-pages-articles.0620.chs.normalized

Word segmentation

  • Run the script
export PYTHONIOENCODING="UTF-8"
python3 wordseg.py > data/zhwiki-latest-pages-articles.0620.chs.normalized.wordseg
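
wordseg.py is not shown here, and the segmenter it uses is not stated. To illustrate what the step produces (space-separated tokens that word2vec can consume), here is a toy forward-maximum-matching segmenter in Python. It is a stand-in technique chosen for illustration, with a hypothetical `vocab` set and function name, and is not necessarily what the script does.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation.

    At each position, take the longest substring (up to `max_len`
    characters) that appears in `vocab`; fall back to one character.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens
```

Joining the result with spaces, as in `" ".join(forward_max_match(line, vocab))`, gives one segmented line per input line. A production pipeline would normally rely on a dedicated segmentation library with a full dictionary.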

word2vec

Uses the official word2vec implementation.

./word2vec_c_format_train.sh

Usage of word2vec model

  • word2vec cli
distance, compute-accuracy, word-analogy
  • python
python3 word2vec_gensim_similarity.py
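
The word2vec `distance` tool ranks words by cosine similarity to a query word. The sketch below shows the same idea in plain Python, assuming the vectors were saved in word2vec's text format (a `vocab_size dim` header line followed by one `word v1 v2 ...` row per word). Whether the training script writes text or binary output is not stated, and the function names are hypothetical.

```python
import math


def load_text_vectors(lines):
    """Parse word2vec text format into a {word: [floats]} dict."""
    rows = iter(lines)
    _vocab_size, dim = map(int, next(rows).split())
    vectors = {}
    for row in rows:
        parts = row.rstrip().split(" ")
        if len(parts) == dim + 1:  # skip malformed rows
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, word, topn=5):
    """Return the `topn` (word, similarity) pairs closest to `word`."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]
```

This scans the whole vocabulary on every query, which is fine for inspection but slow for large models; a vectorized library is the usual choice there.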

TF-IDF

  • plain code

train

python3 tfidf_plain.py

After it runs, the script dumps the words, weights, and IDF values into a pickle file.
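
tfidf_plain.py is not shown here. As a rough sketch of what a plain TF-IDF computation involves, the following uses relative term frequency and an unsmoothed `log(N / df)` IDF. Both choices are assumptions, since the script's exact weighting is not stated, and `tfidf` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute TF-IDF for tokenized documents.

    docs: list of token lists.
    Returns (idf, weights) where idf maps term -> IDF value and
    weights is a list of {term: tf-idf} dicts, one per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero on empty documents
        weights.append({term: (n / total) * idf[term]
                        for term, n in counts.items()})
    return idf, weights
```

Storing each document as a dict keeps only its non-zero entries, and the resulting `idf` and `weights` objects can be written out with `pickle.dump`.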

This currently runs into a sparse-matrix problem; the workaround is to use a restricted vocabulary.

python3 tfidf_sklearn.py
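
A restricted vocabulary bounds the number of matrix columns. One common way to build it, sketched below in plain Python, is to keep only the terms with the highest document frequency. The thresholds and function names are illustrative assumptions, not values taken from tfidf_sklearn.py.

```python
from collections import Counter


def build_vocabulary(docs, max_features=50000, min_df=2):
    """Map the most widespread terms to column indices.

    Keeps terms that occur in at least `min_df` documents, ordered by
    document frequency (ties broken alphabetically), capped at
    `max_features` entries.
    """
    doc_freq = Counter(term for doc in docs for term in set(doc))
    kept = [(term, df) for term, df in doc_freq.items() if df >= min_df]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return {term: idx for idx, (term, _df) in enumerate(kept[:max_features])}


def restrict(doc, vocabulary):
    """Drop tokens that are not in the vocabulary."""
    return [term for term in doc if term in vocabulary]
```

scikit-learn's `TfidfVectorizer` exposes the same idea through its `max_features`, `min_df`, and `vocabulary` parameters.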

Related projects

Synonyms, a Chinese synonym library, builds its synonym tables from word vectors trained with wikidata-corpus.

References

http://licstar.net/archives/328
http://licstar.net/archives/tag/wikipedia-extractor

Contributors

hailiang-wang
