elisa-aleman / mecab-python


Example usage of the Python wrappers for the MeCab Japanese parser on Mac OS X.

License: MIT License


MeCab Japanese parser usage example in Python

MeCab is a parsing tool for the Japanese language, and as such it is discussed and documented mostly in Japanese. If you wish to parse Japanese text but aren't confident reading the documentation, this example might be of use to you.

I am working in Python 3, but this tutorial also has examples for Python 2. I am also working on Mac OS X (Linux instructions are included too), so the installation steps might be different in your case.

Install MeCab

If you are on MacOSX:

From the terminal, using homebrew:

brew install mecab
brew install mecab-ipadic

If you are on a Debian-based Linux distribution:

sudo apt-get install libmecab-dev
sudo apt-get install mecab mecab-ipadic-utf8
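
Before moving on to the wrapper, it can help to confirm that the install step actually put the mecab command on your PATH. The snippet below is only a minimal sketch using the Python standard library; the helper name find_mecab is my own and is not part of MeCab or its wrappers.

```python
import shutil


def find_mecab(command="mecab"):
    """Return the full path of the given executable, or None if it is not on PATH."""
    # find_mecab is a hypothetical helper name used only for this example.
    return shutil.which(command)


path = find_mecab()
if path is None:
    print("mecab was not found on PATH; check the install step above")
else:
    print("mecab found at", path)
```

If the command is not found, re-run the brew or apt-get commands above before installing the Python wrapper.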

Install Python wrapper

Python 3:

There is now a library on PyPI dedicated to wrapping MeCab for Python 3. The original project, copyright and install files can be found at:

https://pypi.python.org/pypi/mecab-python3

It is also available on GitHub at:

https://github.com/SamuraiT/mecab-python3

Note: I do not claim any ownership of these projects; my code is only a usage example.

Install with pip:

pip3 install mecab-python3

Python 2:

In Python 2, the available wrapper cannot be installed by name with pip, so we download it manually from its PyPI page:

https://pypi.python.org/pypi/mecab-python/0.996

Download the file mecab-python-0.996.tar.gz to ~/

Then, from ~/ in the terminal:

pip2 install mecab-python-0.996.tar.gz
rm mecab-python-0.996.tar.gz

Then in Python:

import MeCab

# Use MeCab's default dictionary
mecab_tagger = MeCab.Tagger('')

text = 'これは日本語の形態素解析のテストです。動詞の形も一般化できるようになっています。'

# parse() returns one 'surface<TAB>features' line per token, then a final 'EOS' line
parsed = []
for chunk in mecab_tagger.parse(text).splitlines()[:-1]:  # skip the trailing 'EOS'
    surface, features = chunk.split('\t', 1)
    parsed.append([surface, tuple(features.split(','))])

###
# The output layout is as follows:
# parsed --> [[surface, feature] for each word in text]
# surface = 'surface'
# feature = ('part-of-speech', 'sub-class 1', 'sub-class 2', 'sub-class 3',
#            'conjugation type', 'conjugation form', 'root-form', 'reading', 'pronunciation')
###

# >>> for i in parsed: print(i)
# ...
# ['これ', ('名詞', '代名詞', '一般', '*', '*', '*', 'これ', 'コレ', 'コレ')]
# ['は', ('助詞', '係助詞', '*', '*', '*', '*', 'は', 'ハ', 'ワ')]
# ['日本語', ('名詞', '一般', '*', '*', '*', '*', '日本語', 'ニホンゴ', 'ニホンゴ')]
# ['の', ('助詞', '連体化', '*', '*', '*', '*', 'の', 'ノ', 'ノ')]
# ['形態素', ('名詞', '一般', '*', '*', '*', '*', '形態素', 'ケイタイソ', 'ケイタイソ')]
# ['解析', ('名詞', 'サ変接続', '*', '*', '*', '*', '解析', 'カイセキ', 'カイセキ')]
# ['の', ('助詞', '連体化', '*', '*', '*', '*', 'の', 'ノ', 'ノ')]
# ['テスト', ('名詞', 'サ変接続', '*', '*', '*', '*', 'テスト', 'テスト', 'テスト')]
# ['です', ('助動詞', '*', '*', '*', '特殊・デス', '基本形', 'です', 'デス', 'デス')]
# ['。', ('記号', '句点', '*', '*', '*', '*', '。', '。', '。')]
# ['動詞', ('名詞', '一般', '*', '*', '*', '*', '動詞', 'ドウシ', 'ドーシ')]
# ['の', ('助詞', '連体化', '*', '*', '*', '*', 'の', 'ノ', 'ノ')]
# ['形', ('名詞', '一般', '*', '*', '*', '*', '形', 'カタチ', 'カタチ')]
# ['も', ('助詞', '係助詞', '*', '*', '*', '*', 'も', 'モ', 'モ')]
# ['一般', ('名詞', '一般', '*', '*', '*', '*', '一般', 'イッパン', 'イッパン')]
# ['化', ('名詞', '接尾', 'サ変接続', '*', '*', '*', '化', 'カ', 'カ')]
# ['できる', ('動詞', '自立', '*', '*', '一段', '基本形', 'できる', 'デキル', 'デキル')]
# ['よう', ('名詞', '非自立', '助動詞語幹', '*', '*', '*', 'よう', 'ヨウ', 'ヨー')]
# ['に', ('助詞', '格助詞', '一般', '*', '*', '*', 'に', 'ニ', 'ニ')]
# ['なっ', ('動詞', '自立', '*', '*', '五段・ラ行', '連用タ接続', 'なる', 'ナッ', 'ナッ')]
# ['て', ('助詞', '接続助詞', '*', '*', '*', '*', 'て', 'テ', 'テ')]
# ['い', ('動詞', '非自立', '*', '*', '一段', '連用形', 'いる', 'イ', 'イ')]
# ['ます', ('助動詞', '*', '*', '*', '特殊・マス', '基本形', 'ます', 'マス', 'マス')]
# ['。', ('記号', '句点', '*', '*', '*', '*', '。', '。', '。')]
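
Once you have the parsed list, a common next step is to pick out the root forms, so that an inflected verb such as なっ is counted as なる. The following is a minimal sketch of one way to do that, not part of the original example. The Token field names and the helpers to_tokens and root_forms are my own illustrative labels for the nine feature columns shown above. It also assumes that some entries (for example words not in the dictionary) may come back with fewer than nine features, so short tuples are padded with '*'.

```python
from collections import namedtuple

# Illustrative field names for the surface plus the nine feature columns (not an official API).
Token = namedtuple(
    "Token",
    "surface pos sub1 sub2 sub3 conj_type conj_form root reading pronunciation",
)


def to_tokens(parsed):
    """Convert [[surface, feature_tuple], ...] into Token records."""
    tokens = []
    for surface, feature in parsed:
        # Pad short feature tuples with '*' so every Token has nine feature fields.
        padded = tuple(feature) + ("*",) * (9 - len(feature))
        tokens.append(Token(surface, *padded[:9]))
    return tokens


def root_forms(tokens, keep_pos=("名詞", "動詞")):
    """Return root forms of tokens whose part of speech is in keep_pos.

    Falls back to the surface when the root form is missing ('*').
    """
    return [t.root if t.root != "*" else t.surface for t in tokens if t.pos in keep_pos]


# A few entries copied from the output above, so this runs without MeCab installed.
sample = [
    ['なっ', ('動詞', '自立', '*', '*', '五段・ラ行', '連用タ接続', 'なる', 'ナッ', 'ナッ')],
    ['て', ('助詞', '接続助詞', '*', '*', '*', '*', 'て', 'テ', 'テ')],
    ['い', ('動詞', '非自立', '*', '*', '一段', '連用形', 'いる', 'イ', 'イ')],
]

print(root_forms(to_tokens(sample)))  # ['なる', 'いる']
```

With a real run you would pass the parsed list built earlier instead of sample, and adjust keep_pos to the parts of speech you care about.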
