mocobeta / janome

Japanese morphological analysis engine written in pure Python

Home Page: https://mocobeta.github.io/janome/en/

License: Apache License 2.0

Shell 1.23% Python 98.75% Batchfile 0.02%
python nlp-library japanese-language

janome's Introduction

Janome

Janome is a Japanese morphological analysis engine written in pure Python.

General documentation:

https://mocobeta.github.io/janome/en/ (English)

https://mocobeta.github.io/janome/ (Japanese)

Requirements

Python 3.7+ is required.

Install

[Note] Building the package consumes about 500 MB of memory.

(venv) $ pip install janome

Run

(venv) $ python
>>> from janome.tokenizer import Tokenizer
>>> t = Tokenizer()
>>> for token in t.tokenize('すもももももももものうち'):
...     print(token)
...
すもも 名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も    助詞,係助詞,*,*,*,*,も,モ,モ
もも  名詞,一般,*,*,*,*,もも,モモ,モモ
も    助詞,係助詞,*,*,*,*,も,モ,モ
もも  名詞,一般,*,*,*,*,もも,モモ,モモ
の    助詞,連体化,*,*,*,*,の,ノ,ノ
うち  名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ

License

Licensed under Apache License 2.0 and uses the MeCab-IPADIC dictionary/statistical model.

See LICENSE.txt and NOTICE.txt for license details.

Acknowledgement

Special thanks to @ikawaha, @takuyaa, @nakagami and @janome_oekaki.

Copyright(C) 2015-2023, Tomoko Uchida. All rights reserved.

janome's People

Contributors

andriyor, bastianzim, ikawaha, kamatari, mocobeta, nakagami, norihitoishida, onishik88, roy-ht, saito400, sueki1242, syou6162, takahi-i, takeshi0406, uezo

janome's Issues

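Accuracy of morphological analysis

When the text contains words such as '12', '123', and '123456',
the analysis changes from run to run: sometimes it yields '12' '3', and other times '12' '123' '123456'.
I would like the number of tokens to be consistent. Do I have to register every such word in a dictionary?
If possible, I would like an argument that fixes the number of tokens.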

Show progress when compiling user dictionary

Compiling Neologd as a user dictionary takes a very long time.
Showing a progress indicator, especially while create_minimum_transducer is running, would help users decide whether to continue or abort the compilation.

I will send a pull request for this issue; however, I am not confident that my approach is the best one from an architectural point of view.
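
The kind of progress indicator described above could be sketched as a thin generator wrapper around the entries being compiled. This is only an illustration: with_progress and its parameters are hypothetical names, and it is not the change proposed in the pull request.

```python
import sys


def with_progress(items, total, every=1000, stream=sys.stderr):
    """Yield items unchanged while reporting how many have been consumed.

    Hypothetical helper: `every` controls how often the counter is redrawn.
    """
    for count, item in enumerate(items, start=1):
        if count % every == 0 or count == total:
            # "\r" returns to the start of the line so the counter overwrites itself
            stream.write(f"\r{count}/{total} entries processed")
            stream.flush()
        yield item
    stream.write("\n")
```

A compile loop would iterate over with_progress(entries, total=len(entries)) instead of entries, so the user can see whether the build is still advancing.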

[Help wanted] Make tokenize() faster

If you are interested in optimizing and speeding up Python code, why not test your skills 💪 on janome? PRs are welcome 🙏

(What follows also serves as a memo to myself.)

Goal

Make Tokenizer.tokenize() faster.

Reference 1: Profiling

Profiling script: https://gist.github.com/mocobeta/d909efd82510a9147bbb383cc342dc8e#file-profile_tokenize-py
Text to analyze ("Lemon"): https://gist.github.com/mocobeta/d909efd82510a9147bbb383cc342dc8e#file-lemon_utf8-txt

Below is the result of running Tokenizer.tokenize() under the simple profiling script above, with the methods sorted in descending order of total time consumed (tottime). (Core i5-2300 CPU @ 2.80GHz, fedora25, python 3.6, janome 0.3.4)

$ python profile_tokenize.py 
Sun Jul 30 22:03:19 2017    /tmp/pstats

         1152879 function calls in 1.268 seconds

   Ordered by: internal time
   List reduced from 51 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   129738    0.275    0.000    0.367    0.000 /home/moco/tmp/venv36/lib/python3.6/site-packages/janome/fst.py:432(next_arc)
    10676    0.253    0.000    0.638    0.000 /home/moco/tmp/venv36/lib/python3.6/site-packages/janome/fst.py:358(_run)
    28058    0.168    0.000    0.221    0.000 /home/moco/tmp/venv36/lib/python3.6/site-packages/janome/lattice.py:130(add)
   432608    0.098    0.000    0.098    0.000 {built-in method _struct.unpack}
       11    0.079    0.007    1.263    0.115 /home/moco/tmp/venv36/lib/python3.6/site-packages/janome/tokenizer.py:206(__tokenize_partial)
     6339    0.072    0.000    0.072    0.000 /home/moco/tmp/venv36/lib/python3.6/site-packages/janome/dic.py:300(get_char_categories)
       11    0.062    0.006    0.062    0.006 /home/moco/tmp/venv36/lib/python3.6/site-packages/janome/lattice.py:124(<listcomp>)
     5338    0.053    0.000    0.694    0.000 /home/moco/tmp/venv36/lib/python3.6/site-packages/janome/fst.py:349(run)
    26662    0.036    0.000    0.058    0.000 /home/moco/tmp/venv36/lib/python3.6/site-packages/janome/lattice.py:87(__init__)
   150604    0.034    0.000    0.034    0.000 /home/moco/tmp/venv36/lib/python3.6/site-packages/janome/dic.py:209(get_trans_cost)
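  • fst.py performs dictionary lookup (it searches sysdic/fst.data, a dictionary whose surface forms are encoded as an FST, for candidate morphemes)
  • lattice.py derives the minimum-cost path with the Viterbi algorithm

The top three functions account for more than half of the execution time, so speeding up any one of them should cut the analysis time substantially.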
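
A report ordered by tottime, like the one above, can be produced with the standard cProfile and pstats modules. The sketch below is not the gist script; profile_by_tottime and stand_in_workload are hypothetical names, and the workload is a placeholder so the example runs without janome installed.

```python
import cProfile
import io
import pstats


def profile_by_tottime(func, *args, limit=10):
    """Run func(*args) under cProfile and return a report sorted by internal time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    out = io.StringIO()
    # "tottime" sorts by time spent inside each function, excluding callees
    pstats.Stats(profiler, stream=out).sort_stats("tottime").print_stats(limit)
    return out.getvalue()


def stand_in_workload(n):
    # Placeholder for the real call being measured
    return sum(i * i for i in range(n))


report = profile_by_tottime(stand_in_workload, 10000)
print(report)
```

To measure janome itself, the workload would be replaced with something like lambda: list(Tokenizer().tokenize(text)); wrapping the call in list() matters if tokenize() returns a generator, because otherwise no tokenization work runs inside the profiled call.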

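A somewhat longer text ("Botchan") that takes about 20 seconds to analyze is also available.
https://gist.github.com/mocobeta/d909efd82510a9147bbb383cc342dc8e#file-bocchan_utf8-txt

Reference 2: Development and testing

https://github.com/mocobeta/janome/wiki

Rules of the game

janome has no development policy worth the name, but two things are settled:

  • It must be pure Python and must not depend on external libraries
  • It must run cross-platform (Python 2.7/3.3+, Windows/Mac/Linux)

Thanks

If your PR speeds up the analysis and is merged, I will, as a small token of thanks, list your name (account name) in the README (after asking your permission first).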

How to fix "compileFST assert next_addr is not None" Error

en:
When I try to create a user dictionary with Janome's UserDictionary() API,
a stack trace ending in /lib/site-packages/janome/fst.py compileFST assert next_addr is not None is printed,
and the user dictionary cannot be created.
How can I generate the dictionary?

The source data for the user dictionary is in IPA dictionary format.
I wrote a check in the same Python code that counts the commas in the input file.
Every line has 12 commas, so I believe the CSV file is valid IPA dictionary format.

The CSV file looks like this:
<About 20 characters mixed with numbers, symbols and alphabets>快特,-1,-1,1000,名詞,固有名詞,一般,*,*,*,快特,*,*
<About 20 characters mixed with numbers, symbols and alphabets>特快,-1,-1,1000,名詞,固有名詞,一般,*,*,*,特快,*,*
followed by about 20,000 more words.

Up to about 9,000 words can be registered without any problems.
If the problem can be solved, the dictionary will grow to tens of millions of lines or more.

Environment:
Janome 3.10
python 2.7.18 32bit
Windows 8.1 64bit

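Cannot install on AWS EC2

It works fine on my local macOS, but when I tried to deploy to Linux on AWS I found that it had not been installed. When I logged in to the server and installed it manually, I got the error below.

I am using Python 3 with pyenv.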

$ pip install janome
Collecting janome
Downloading Janome-0.2.8.tar.gz (13.6MB)
100% |████████████████████████████████| 13.6MB 102kB/s
Building wheels for collected packages: janome
Running setup.py bdist_wheel for janome ... done
Stored in directory: /home/webmaster/.cache/pip/wheels/8b/08/f2/9e1d9300c6041925ad32148eb67bc997ec91f0d7ffc999573a
Successfully built janome
Installing collected packages: janome
/usr/local/.pyenv/pyenv.d/exec/pip-rehash/pip: line 20: 19382 Killed "$PYENV_COMMAND_PATH" "$@"

Limit lattice size

When tokenizing large documents, the lattice grows and consumes a large amount of memory.
To limit memory consumption (and computing cost), backtrace() needs to be called when the size of the lattice exceeds a fixed maximum buffer size (i.e., the input text is split into multiple chunks).

Texts can be split at:

  • linefeed code
  • punctuation (especially '。', '、')

If no suitable split point is found, the text will be split at an arbitrary point. That could yield inaccurate results, but I think we should prioritize safety...
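
The splitting rule described above can be sketched independently of the lattice code. This is a minimal illustration, not janome's implementation: split_into_chunks, SPLIT_CHARS, and max_chunk_size are hypothetical names, and the size is counted in characters rather than lattice nodes.

```python
SPLIT_CHARS = ("\n", "。", "、")  # preferred split points: linefeed and punctuation


def split_into_chunks(text, max_chunk_size=1024):
    """Yield chunks of at most max_chunk_size characters.

    Each chunk ends just after the last preferred split character in its
    window; if the window contains none, it is cut at an arbitrary point.
    """
    if max_chunk_size < 1:
        raise ValueError("max_chunk_size must be positive")
    start = 0
    while start < len(text):
        end = min(start + max_chunk_size, len(text))
        if end < len(text):
            # rfind returns -1 when the character is absent from the window
            cut = max(text.rfind(ch, start, end) for ch in SPLIT_CHARS)
            if cut >= start:
                end = cut + 1  # keep the delimiter with the chunk it closes
        yield text[start:end]
        start = end


print(list(split_into_chunks("あいう。えおか、きくけこ", max_chunk_size=5)))
# ['あいう。', 'えおか、', 'きくけこ']
```

Joining the chunks always reproduces the input, so the only risk is the one named above: an arbitrary cut may fall in the middle of a word.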

Some katakana words have no pronunciation

I need to segment some sentences and get their pronunciations. Some katakana words don't seem to have any pronunciation information. I can of course transcribe them using the katakana pronunciation rules, but I'm wondering whether this is by design or a bug.

Here's the code to produce the error

from janome.tokenizer import Tokenizer
toker = Tokenizer()

stc = "米国上院では、エドワード・ケネディー上院議員、ジョン・マッケイン上院議員共著による議案についても検討される。"
for token in toker.tokenize(stc):
    print(token)

And here's the output

米国    名詞,固有名詞,地域,国,*,*,米国,ベイコク,ベイコク
上院    名詞,固有名詞,組織,*,*,*,上院,ジョウイン,ジョーイン
で      助詞,格助詞,一般,*,*,*,で,デ,デ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
、      記号,読点,*,*,*,*,、,、,、
エドワード      名詞,固有名詞,人名,名,*,*,エドワード,エドワード,エドワード
・      記号,一般,*,*,*,*,・,・,・
ケネディー      名詞,一般,*,*,*,*,ケネディー,*,*
上院    名詞,固有名詞,組織,*,*,*,上院,ジョウイン,ジョーイン
議員    名詞,一般,*,*,*,*,議員,ギイン,ギイン
、      記号,読点,*,*,*,*,、,、,、
ジョン  名詞,固有名詞,人名,名,*,*,ジョン,ジョン,ジョン
・      記号,一般,*,*,*,*,・,・,・
マッケイン      名詞,一般,*,*,*,*,マッケイン,*,*
上院    名詞,固有名詞,組織,*,*,*,上院,ジョウイン,ジョーイン
議員    名詞,一般,*,*,*,*,議員,ギイン,ギイン
共著    名詞,一般,*,*,*,*,共著,キョウチョ,キョーチョ
による  助詞,格助詞,連語,*,*,*,による,ニヨル,ニヨル
議案    名詞,一般,*,*,*,*,議案,ギアン,ギアン
について        助詞,格助詞,連語,*,*,*,について,ニツイテ,ニツイテ
も      助詞,係助詞,*,*,*,*,も,モ,モ
検討    名詞,サ変接続,*,*,*,*,検討,ケントウ,ケントー
さ      動詞,自立,*,*,サ変・スル,未然レル接続,する,サ,サ
れる    動詞,接尾,*,*,一段,基本形,れる,レル,レル
。      記号,句点,*,*,*,*,。,。,。

The reading and pronunciation columns for ケネディー and マッケイン are "*", while エドワード and ジョン have that information.
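
One workaround is to fall back to the surface form when the reading is "*" and the surface is already written in katakana. The helper below is a hypothetical sketch that works on plain strings (for example token.surface and token.reading); it does not say whether the missing readings are by design.

```python
def is_katakana(ch):
    # The Unicode Katakana block (U+30A0 to U+30FF) includes "ー" and "・"
    return "\u30a0" <= ch <= "\u30ff"


def reading_or_surface(surface, reading):
    """Return the reading, or the surface itself for katakana-only words.

    Returns None when the reading is unknown and the surface is not pure katakana.
    """
    if reading != "*":
        return reading
    if surface and all(is_katakana(ch) for ch in surface):
        return surface
    return None


print(reading_or_surface("ケネディー", "*"))       # ケネディー
print(reading_or_surface("上院", "ジョウイン"))    # ジョウイン
```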

Added Conda download option

Hi,

I just wanted to let everybody know that I added the package to conda-forge, so that the package can also be downloaded using conda.

Repo: https://github.com/conda-forge/janome-feedstock
Anaconda: https://anaconda.org/conda-forge/janome

If there is interest, I can add a badge or install section to the readme, but I wanted to check here first if that is wanted.

Also, in case any of the maintainers would like to be added as maintainers to the Conda-forge repo, feel free to ping me here or add an issue/pr to the feedstock repo.

Thanks!

Tokenizer causes MemoryError on Windows 32bit

In a 32-bit Windows environment, Tokenizer raises a MemoryError exception.

  • Installation succeeds.
  • But the code below raises MemoryError:
from janome.tokenizer import Tokenizer
t = Tokenizer()

>>>Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\path\to\venv\lib\site-packages\janome\tokenizer.py", line 49, in __init__
    from sysdic import SYS_DIC
  File "C:\path\to\venv\lib\site-packages\sysdic\__init__.py", line 6, in <module>
    from . import entries, connections, chardef, unknowns
MemoryError

On 64-bit Windows there is no error.
(I tested janome on Windows 7 32-bit and Windows 8.1 64-bit.)

Expected argument type for `POSStopFilter()` and `POSKeepFilter()` is unclear

Thanks for a great library!

It seems that POSStopFilter() and POSKeepFilter() don't expect str.

(v0.3.6)

>>> from janome.analyzer import Analyzer
>>> from janome.tokenfilter import *
>>> 
>>> s = '吾輩は猫である'
>>> 
>>> a = Analyzer(token_filters=[POSKeepFilter(u'助動詞')])
>>> 
>>> for token in a.analyze(s):
...     print(token)
... 
	助詞,係助詞,*,*,*,*,,,
	助動詞,*,*,*,特殊,連用形,,,
ある	助動詞,*,*,*,五段ラ行アル,基本形,ある,アル,アル
>>> 
>>> a = Analyzer(token_filters=[POSKeepFilter([u'助動詞'])])
>>> 
>>> for token in a.analyze(s):
...     print(token)
... 
	助動詞,*,*,*,特殊,連用形,,,
ある	助動詞,*,*,*,五段ラ行アル,基本形,ある,アル,アル

If str is not expected, it's better to update example code in docs to avoid confusion.

If str is expected, tokenfilter.py should be updated.

For example:

self.pos_list = pos_list

->

if type(pos_list) == str:
	self.pos_list = [pos_list]
else:
	self.pos_list = pos_list
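
Why a bare str also lets 助詞 through can be reproduced without janome, under the assumption (not confirmed in this thread) that the filter keeps a token when its part of speech starts with any element of pos_list. Iterating over a str yields single characters, so '助動詞' behaves like ['助', '動', '詞']. The names keeps and normalize_pos_list are hypothetical.

```python
def keeps(pos_list, part_of_speech):
    # Assumed matching rule: prefix match against every element of pos_list
    return any(part_of_speech.startswith(pos) for pos in pos_list)


def normalize_pos_list(pos_list):
    # Wrap a bare string so it is treated as one tag, not as characters
    return [pos_list] if isinstance(pos_list, str) else list(pos_list)


print(keeps('助動詞', '助詞,係助詞,*,*'))                      # True: '助' is a prefix
print(keeps(['助動詞'], '助詞,係助詞,*,*'))                    # False
print(keeps(normalize_pos_list('助動詞'), '助詞,係助詞,*,*'))  # False
```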

How to change Half-width symbol from 名詞,サ変 to 記号,一般

en:

I want half-width symbols to be classified as "記号,一般" instead of "名詞,サ変".
The definition is in /Lib/site-packages/janome/sysdic/unknowns.py.
Should I rewrite the definition of SYMBOL to 'SYMBOL': [(1283,1283,17585,u'\u8a18\u53f7,\u4e00\u822c,*,*')]?

Environment:
janome version 3.10
python 2.7.18 (32bit)
Windows 8.1 64bit

Boundary-value is not tokenized properly

If tests/text_large.txt is given, I think 「その他 名詞,代名詞,一般,*,*,*,その他,ソノタ,ソノタ」 should be returned, but the following tokens are returned instead.

そ 名詞,特殊,助動詞語幹,*,*,*,そ,ソ,ソ
の 助詞,連体化,*,*,*,*,の,ノ,ノ
他 名詞,非自立,副詞可能,*,*,*,他,ホカ,ホカ

Words defined in user dictionary are not tokenized in input text

I have a large set of samples with their correct tokenization. When the samples are tokenized by Janome, the result differs from the expected tokenization.

Many of the failing cases have two consecutive katakana words in the input: Janome does not recognize them as two words and produces a single token with both words joined together.

I collected the words and added them to the user-defined dictionary in the simplified format. This corrected the tokenization of about half of my samples, but many samples are still tokenized incorrectly, including many cases with consecutive katakana words.

I tried to find a minimal example, shown below, that replicates my problem.

Please let me know if there is something else I need to do to improve the tokenization and ensure the words I added to the user-defined dictionary are tokenized correctly. Thanks :)

My environment:

  • Windows 7
  • Python 3.5
  • No virtual environment, I am using a python interpreter directly installed on windows and installed Janome in it.
  • Janome version 0.3.9

Minimal code replicating my problem:

from janome.tokenizer import Tokenizer as JanomeTokenizer

'''
defined words in test.csv:
コミミックス,カスタム名詞,コミミックス
シンクロナイズド,カスタム名詞,シンクロナイズド
ダイビング,カスタム名詞,ダイビング
'''

string = 'コミミックスシンクロナイズドダイビング'

tok = JanomeTokenizer('test.csv', udic_type='simpledic', udic_enc='utf8')
toks = tok.tokenize(string, wakati=True)
tlist = [ tok for tok in toks if tok != ' ' ]
print ('\n'.join(tlist))

How to handle entries with the same surface form, part of speech, and word cost?

It seems there are entries that share the same surface form, part of speech, and word cost, e.g.:

内浜,1293,1293,8676,名詞,固有名詞,地域,一般,*,*,内浜,ウチハマ,ウチハマ
内浜,1293,1293,8676,名詞,固有名詞,地域,一般,*,*,内浜,ウチバマ,ウチバマ

The lattice graph that is built during analysis also shows the problem here:

Essentially, janome cannot decide which entry should be selected, so the result is not stable (to me it is a subtle bug in the mecab-ipadic dictionary / language model).

For consistent analysis output, we might be able to choose one of them by some criterion (e.g., internal morph id) and discard the others completely. I don't know how many such entries there are, so an investigation would be needed first.
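
The investigation and the selection rule could be sketched as follows. This is a hypothetical illustration working on raw IPADIC-style CSV rows: the position of a row in the list stands in for the internal morph id, and find_ambiguous_entries and pick_one are made-up names, not janome functions.

```python
from collections import defaultdict


def find_ambiguous_entries(rows):
    """Group rows that share surface, left/right ids, cost, and the four POS fields."""
    groups = defaultdict(list)
    for morph_id, row in enumerate(rows):
        fields = row.split(",")
        key = tuple(fields[:8])  # surface, left id, right id, cost, POS levels 1-4
        groups[key].append((morph_id, row))
    return {key: group for key, group in groups.items() if len(group) > 1}


def pick_one(group):
    # Deterministic criterion: keep the entry with the smallest id
    return min(group)[1]


rows = [
    "内浜,1293,1293,8676,名詞,固有名詞,地域,一般,*,*,内浜,ウチハマ,ウチハマ",
    "内浜,1293,1293,8676,名詞,固有名詞,地域,一般,*,*,内浜,ウチバマ,ウチバマ",
]
for key, group in find_ambiguous_entries(rows).items():
    print(key[0], len(group), pick_one(group))
```

Counting the groups returned by find_ambiguous_entries over the whole dictionary would answer how many such entries exist before deciding on a criterion.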

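"IndexError: list index out of range" is displayed

I am studying Python while building a crawler for my graduation project, and I am stuck on the error below. I have no idea what is causing it. Could you help me?
Thank you in advance.

I suspect the problem lies in the value passed as a string, but I do not understand what it means for an index to go out of range of the list.

Error message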

Traceback (most recent call last):
  File "/vagrant/pysearch-master/manage.py", line 15, in <module>
    crawl_web('https://news.google.com/news/headlines?hl=ja&ned=jp', 3)
  File "/vagrant/pysearch-master/web_crawler/crawler.py", line 137, in crawl_web
    add_page_to_index(page_url, html)
  File "/vagrant/pysearch-master/web_crawler/crawler.py", line 121, in add_page_to_index
    for word in _split_to_word(line): #janomeで日本語形態解析して単語をwordに代入
  File "/vagrant/pysearch-master/web_crawler/crawler.py", line 45, in _split_to_word
    return [token.surface for token in t.tokenize(text)]
  File "/home/vagrant/.virtualenvs/dev/local/lib/python3.4/site-packages/janome/tokenizer.py", line 193, in tokenize
    return list(self.__tokenize_stream(text, wakati))
  File "/home/vagrant/.virtualenvs/dev/local/lib/python3.4/site-packages/janome/tokenizer.py", line 200, in __tokenize_stream
    tokens, pos = self.__tokenize_partial(text[processed:], wakati)
  File "/home/vagrant/.virtualenvs/dev/local/lib/python3.4/site-packages/janome/tokenizer.py", line 254, in __tokenize_partial
    lattice.end()
  File "/home/vagrant/.virtualenvs/dev/local/lib/python3.4/site-packages/janome/lattice.py", line 154, in end
    self.add(eos)
  File "/home/vagrant/.virtualenvs/dev/local/lib/python3.4/site-packages/janome/lattice.py", line 140, in add
    node.index = len(self.snodes[self.p])
IndexError: list index out of range

Pass the text inside the tags under body in the HTML to the analyzer

def add_page_to_index(url, html):
    body_soup = BeautifulSoup(html, "html.parser").find('body')
    # Collect every tag under <body> (<a>, <th>, and so on), one at a time, into child_tag.
    # Start with everything below body, then the divs under it, then the uls, drilling down step by step.
    #if body_soup.findChildren() is not None:
    for child_tag in body_soup.findChildren():
        # Use BeautifulSoup to get just the tag name. Scripts are skipped: the rest of the loop body is bypassed.
        if child_tag.name == 'script':  # child_tag.name is the name of the tag
            continue
        # .text is the content of the tag: for <a>link</a> it yields only "link".
        child_text = child_tag.text
        for line in child_text.split('\n'):  # split the string at newlines
            line = line.strip()  # splitting alone leaves whitespace at both ends; strip() returns a copy without it
            for word in _split_to_word(line):  # run Japanese morphological analysis with janome and bind each word to word
                add_to_index(word, url)

Analyze the text received as a string from add_page_to_index and return it as words

def _split_to_word(text):
    """Japanese morphological analysis with janome.
    Splitting text and creating words list.
    """
    t = Tokenizer()
    # token.surface gives just the surface string of each token. For example, 車は高い yields "車" "は" "高い".
    return [token.surface for token in t.tokenize(text)]
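
The traceback does not show which input triggered the IndexError, so a first step is to find it. The sketch below is a hypothetical debugging aid, not a fix: find_failing_lines takes any tokenize callable, skips empty lines, and records the lines for which tokenization raises IndexError.

```python
def find_failing_lines(lines, tokenize):
    """Return (line number, text, exception) for every line that raises IndexError."""
    failures = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # nothing to analyze in an empty line
        try:
            list(tokenize(text))  # list() forces evaluation if tokenize is lazy
        except IndexError as exc:
            failures.append((number, text, exc))
    return failures
```

In the crawler this could be called as find_failing_lines(child_text.split('\n'), Tokenizer().tokenize), and the reported text can then be included in the bug report.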

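Error when using the NEologd dictionary together with a simplified dictionary in janome tokenization

This is SKIYO. I contacted you on Twitter.

As instructed, here are the code and the error.
(Parts that might touch on privacy are replaced with ---.)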

▼Code

import codecs as cd
import numpy as np
import math
import re
import xlrd
import xlsxwriter
from collections import Counter
from itertools import chain
from janome.tokenizer import Tokenizer
import os
import csv
import pandas as pd


### Prepare the dataset
with cd.open(r"\\Saigo\RedirectedFolders\---.csv", "r", "utf-8") as file:
    df = pd.read_csv(file)

## Drop rows
df.dropna(how='any', inplace=True)
# how='any': drop rows (or columns, with axis=1) that contain at least one NaN (Not a Number); how='all' drops only those that are entirely NaN
# inplace=True: the original df is modified

## Sort in ascending order
df.sort_values(by='顧客ID', inplace=True)
# axis=1: sort along columns; ascending=False: sort in descending order; by='column name': the values in that column are the sort key


### Set up the user dictionary
t = Tokenizer(r"C:\Box Sync\Desktop\---\dict_simple_utf8sig.csv", udic_type="simpledic", udic_enc="utf-8-sig", mmap=True)  # initialize the Tokenizer


### Tokenize
data = []
each_data = []
ID_data = []
for i in range(len(df.index)):
    ID = df.iat[i, 0]
    value = df.iat[i, 1]  # .iat[row number, column number]; .at['row label', 'column label'] also works
    tokens = t.tokenize(value)
    for token in tokens:
        partOfSpeech = token.part_of_speech.split(',')[0]  # .part_of_speech.split(',')[0]: part of speech
        # [1] to [3] are part-of-speech subcategories 1 to 3
        # Also available: .infl_type: inflection type, .infl_form: inflection form, .base_form: base form, .reading: reading, .phonetic: pronunciation
        if partOfSpeech == u'名詞':  # extract nouns
            each_data.append(token.surface)  # .surface: surface form (the token itself)
    if i != 0:
        if ID != df.iat[i-1, 0]:
            each_data.insert(0, ID)
            data.append(each_data)
            each_data = []
    if len(data) == 10:
        break


### Create the Excel file
# Create the workbook
output_Exl = xlsxwriter.Workbook(r"result_pd\morphology_wUserdict.xlsx")
# Create the worksheet
output_sht = output_Exl.add_worksheet('tokens')

for row in range(len(data)):
    for i in range(len(data[row])):
        output_sht.write(row, i, data[row][i])  # (row, column, data to write)


## Put all tokens in data (a list of lists) into a single list
chain_data = list(chain.from_iterable(data))  # chain.from_iterable(): flattens an iterable of iterables into a single iterable
#http://coolpythontips.blogspot.com/2016/02/itertoolschain.html

c = Counter(chain_data)  # Counter is a dict subclass with elements as keys and occurrence counts as values
result_ranking = c.most_common()  # returns a list of (element, count) tuples in descending order of count; passing n limits it to the top n
#https://note.nkmk.me/python-collections-counter/

ranking = output_Exl.add_worksheet('count')
for row in range(len(result_ranking)):
    for i in range(len(result_ranking[row])):
        ranking.write(row, i, result_ranking[row][i])
#
output_Exl.close()  # save the Excel file

▼error

Traceback (most recent call last):
  File ".\tokenizing_SFDC_pd.py", line 32, in <module>
    t = Tokenizer(r"C:\Box Sync\Desktop\---\dict_simple_utf8sig.csv", udic_type="simpledic", udic_enc="utf-8-sig", mmap=True)#Tokenizer初期化
  File "C:\Users\---\AppData\Local\Programs\Python\Python36\lib\site-packages\janome\tokenizer.py", line 168, in __init__
    self.user_dic = UserDictionary(udic, udic_enc, udic_type, connections)
  File "C:\Users\---\AppData\Local\Programs\Python\Python36\lib\site-packages\janome\dic.py", line 374, in __init__
    compiledFST, entries = build_method(user_dict, enc)
  File "C:\Users\---\AppData\Local\Programs\Python\Python36\lib\site-packages\janome\dic.py", line 404, in buildsimpledic
    surface, pos_major, reading = line.split(',')
ValueError: too many values to unpack (expected 3)
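
The last frame of the traceback unpacks line.split(',') into exactly three names, so any row of the simplified dictionary that does not split into three fields raises this ValueError. A small pre-check can list the offending rows before the file is handed to Tokenizer. find_bad_simpledic_rows is a hypothetical helper, not part of janome.

```python
def find_bad_simpledic_rows(lines):
    """Return (line number, field count, text) for rows that do not have 3 fields."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        fields = text.split(",")  # same split as shown in the traceback
        if len(fields) != 3:
            bad.append((number, len(fields), text))
    return bad


sample = [
    "コミミックス,カスタム名詞,コミミックス\n",
    "東京,スカイツリー,カスタム名詞,トウキョウスカイツリー\n",  # stray comma: 4 fields
]
print(find_bad_simpledic_rows(sample))
```

With a real file, the lines can come from open(path, encoding="utf-8-sig"), matching the udic_enc value used above.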

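janome pip error

To reproduce: on a MacBook Air (macOS Sierra, 2 GB of memory), I set up Ubuntu 14.04 as a virtual machine with Vagrant, created a Python virtual environment with virtualenv, and tried to install with pip. The installation failed with an error.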

------------------------------------------------------------
/home/vagrant/.virtualenvs/dev/bin/pip run on Tue Jul 25 17:36:38 2017
Downloading/unpacking janome
Getting page https://pypi.python.org/simple/janome/
URLs to search for versions for janome:

  • https://pypi.python.org/simple/janome/
    Analyzing links from page https://pypi.python.org/simple/janome/
    Found link https://pypi.python.org/packages/0a/90/225d6f18c08d2de316dde27e19dbf012372d6e890216e00770c9d061b5b0/Janome-0.3.3.zip#md5=586568ca8f08d1ee907e87e1122f3a9d (from https://pypi.python.org/simple/janome/), version: 0.3.3
    Found link https://pypi.python.org/packages/17/ac/63644f3355ed05fab6adea0f554c230949d62a916338854fb0bf3559a098/Janome-0.3.0.tar.gz#md5=6c039e65de3928189442bed76d366379 (from https://pypi.python.org/simple/janome/), version: 0.3.0
    Found link https://pypi.python.org/packages/1c/2b/a1257f672a8b8340654988bd33f655282c26d1a44823774aa1debdcf7c90/Janome-0.2.2.tar.gz#md5=578836b6b33b7d0cd563de4e6008418b (from https://pypi.python.org/simple/janome/), version: 0.2.2
    Found link https://pypi.python.org/packages/1e/f3/ab90fea3333a8ab4f062b8315a24c0159fca9ce5791ce4a0f38e527f018e/Janome-0.2.7.tar.gz#md5=9554dea7850682675d4358016e64006f (from https://pypi.python.org/simple/janome/), version: 0.2.7
    Found link https://pypi.python.org/packages/2d/ae/9d698657f5922e9f94e61122a1c1181a4de878ddf52e706240ff226f6ece/Janome-0.2.8.tar.gz#md5=46fc4b9c4c856e0ab62f7cbfbb0f83d6 (from https://pypi.python.org/simple/janome/), version: 0.2.8
    Found link https://pypi.python.org/packages/3c/3f/035da75079b423731e32171357cbb0d55b3705e57217baa1e903892c8061/Janome-0.3.1.tar.gz#md5=26261e9592c6b39a6cff26b99feb730d (from https://pypi.python.org/simple/janome/), version: 0.3.1
    Found link https://pypi.python.org/packages/52/81/e4724be188cd194fef4c0641ac7c90aa241f069fdb1a039a736bfc1795e1/Janome-0.2.4.tar.gz#md5=e45ad15e264749cf34eea09179fa77dd (from https://pypi.python.org/simple/janome/), version: 0.2.4
    Found link https://pypi.python.org/packages/75/9e/17fae6a6a2a77204918eed49bf66278cdd7299962d2fe7ab535ff3dda3af/Janome-0.2.3.tar.gz#md5=4cd1f791b138c9125dc0ae00ff484a8b (from https://pypi.python.org/simple/janome/), version: 0.2.3
    Found link https://pypi.python.org/packages/8e/61/faa737d50c1c573d376fd5f3b4d4749f269ba5950787b8c816f879eafc1c/Janome-0.1.3.tar.gz#md5=70d53b05d7c9d1fad696a33d113986aa (from https://pypi.python.org/simple/janome/), version: 0.1.3
    Found link https://pypi.python.org/packages/aa/8c/5a455b754ab9b1e9af1391434459eace55c7eaef6ac826c2c1c01bccd7f4/Janome-0.1.4.tar.gz#md5=0cd2d0cae4128b6a2b8a98a083705f0a (from https://pypi.python.org/simple/janome/), version: 0.1.4
    Found link https://pypi.python.org/packages/aa/d7/c4415785abecb97c7cfc8c93be9a1a421162b3d4b123b09e88a8fa714337/Janome-0.3.2.tar.gz#md5=8c8cba5d1b2dd8678fa70223eca7abf9 (from https://pypi.python.org/simple/janome/), version: 0.3.2
    Found link https://pypi.python.org/packages/cd/95/1059c3573d9fd61daec10d1de7960395e5a5b993500171d446cef42debb5/Janome-0.1.2.tar.gz#md5=fa0db0275f19397d79c9ddb1a62de5ca (from https://pypi.python.org/simple/janome/), version: 0.1.2
    Found link https://pypi.python.org/packages/d5/12/6ccc44da6d439cd4302862d4f8b49f0211e6c86342631fa5d2087e436302/Janome-0.2.0.tar.gz#md5=215a333ac1d000cbd894d30387df0004 (from https://pypi.python.org/simple/janome/), version: 0.2.0
    Found link https://pypi.python.org/packages/e5/d8/5397fc8088b9d6c712fb7a2e0bda3a78ff14ef48a7d409d31c1512e19d82/Janome-0.2.6.tar.gz#md5=72f9bb34d7cbdecc3083330b990f044a (from https://pypi.python.org/simple/janome/), version: 0.2.6
    Found link https://pypi.python.org/packages/ec/11/7c805455996444394cdbd125672fb087c7959dbf01a8fa6455bb9e29e267/Janome-0.2.5.tar.gz#md5=ae6c970243aec1c4cf218ad1bca83c80 (from https://pypi.python.org/simple/janome/), version: 0.2.5
    Using version 0.3.3 (newest of versions: 0.3.3, 0.3.2, 0.3.1, 0.3.0, 0.2.8, 0.2.7, 0.2.6, 0.2.5, 0.2.4, 0.2.3, 0.2.2, 0.2.0, 0.1.4, 0.1.3, 0.1.2)
    Downloading from URL https://pypi.python.org/packages/0a/90/225d6f18c08d2de316dde27e19dbf012372d6e890216e00770c9d061b5b0/Janome-0.3.3.zip#md5=586568ca8f08d1ee907e87e1122f3a9d (from https://pypi.python.org/simple/janome/)
    Running setup.py (path:/home/vagrant/.virtualenvs/dev/build/janome/setup.py) egg_info for package janome
    running egg_info
    creating pip-egg-info/Janome.egg-info
    writing pip-egg-info/Janome.egg-info/PKG-INFO
    writing dependency_links to pip-egg-info/Janome.egg-info/dependency_links.txt
    writing top-level names to pip-egg-info/Janome.egg-info/top_level.txt
    writing manifest file 'pip-egg-info/Janome.egg-info/SOURCES.txt'
    warning: manifest_maker: standard file '-c' not found

    reading manifest file 'pip-egg-info/Janome.egg-info/SOURCES.txt'
    writing manifest file 'pip-egg-info/Janome.egg-info/SOURCES.txt'
    Source in ./.virtualenvs/dev/build/janome has version 0.3.3, which satisfies requirement janome
    Installing collected packages: janome
    Running setup.py install for janome
    Running command /home/vagrant/.virtualenvs/dev/bin/python3 -c "import setuptools, tokenize;__file__='/home/vagrant/.virtualenvs/dev/build/janome/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-mb9su4aw-record/install-record.txt --single-version-externally-managed --compile --install-headers /home/vagrant/.virtualenvs/dev/include/site/python3.4
    running install
    running build
    running build_py
    creating build
    creating build/lib
    creating build/lib/janome
    copying janome/__init__.py -> build/lib/janome
    copying janome/dic.py -> build/lib/janome
    copying janome/fst.py -> build/lib/janome
    copying janome/lattice.py -> build/lib/janome
    copying janome/tokenizer.py -> build/lib/janome
    creating build/lib/sysdic
    copying sysdic/entries_extra5_idx.py -> build/lib/sysdic
    copying sysdic/entries_extra4_idx.py -> build/lib/sysdic
    copying sysdic/entries_compact9.py -> build/lib/sysdic
    copying sysdic/entries_compact1_idx.py -> build/lib/sysdic
    copying sysdic/entries_compact3_idx.py -> build/lib/sysdic
    copying sysdic/entries_extra1_idx.py -> build/lib/sysdic
    copying sysdic/entries_extra3_idx.py -> build/lib/sysdic
    copying sysdic/entries_compact4.py -> build/lib/sysdic
    copying sysdic/entries_extra0_idx.py -> build/lib/sysdic
    copying sysdic/entries_extra1.py -> build/lib/sysdic
    copying sysdic/entries_extra6.py -> build/lib/sysdic
    copying sysdic/entries_compact2_idx.py -> build/lib/sysdic
    copying sysdic/entries_extra3.py -> build/lib/sysdic
    copying sysdic/entries_compact8_idx.py -> build/lib/sysdic
    copying sysdic/entries_extra9_idx.py -> build/lib/sysdic
    copying sysdic/entries_extra2.py -> build/lib/sysdic
    copying sysdic/entries_buckets.py -> build/lib/sysdic
    copying sysdic/entries_extra7.py -> build/lib/sysdic
    copying sysdic/entries_compact3.py -> build/lib/sysdic
    copying sysdic/entries_compact0.py -> build/lib/sysdic
    copying sysdic/entries_extra0.py -> build/lib/sysdic
    copying sysdic/entries_compact8.py -> build/lib/sysdic
    copying sysdic/__init__.py -> build/lib/sysdic
    copying sysdic/entries_compact5_idx.py -> build/lib/sysdic
    copying sysdic/entries_compact4_idx.py -> build/lib/sysdic
    copying sysdic/entries_compact7.py -> build/lib/sysdic
    copying sysdic/entries_extra6_idx.py -> build/lib/sysdic
    copying sysdic/entries_compact1.py -> build/lib/sysdic
    copying sysdic/entries_extra5.py -> build/lib/sysdic
    copying sysdic/entries_compact7_idx.py -> build/lib/sysdic
    copying sysdic/entries_extra9.py -> build/lib/sysdic
    copying sysdic/connections1.py -> build/lib/sysdic
    copying sysdic/entries_extra4.py -> build/lib/sysdic
    copying sysdic/unknowns.py -> build/lib/sysdic
    copying sysdic/connections2.py -> build/lib/sysdic
    copying sysdic/entries_extra8_idx.py -> build/lib/sysdic
    copying sysdic/entries_extra8.py -> build/lib/sysdic
    copying sysdic/entries_compact6_idx.py -> build/lib/sysdic
    copying sysdic/entries_compact0_idx.py -> build/lib/sysdic
    copying sysdic/chardef.py -> build/lib/sysdic
    copying sysdic/entries_compact5.py -> build/lib/sysdic
    copying sysdic/entries_extra7_idx.py -> build/lib/sysdic
    copying sysdic/entries_compact2.py -> build/lib/sysdic
    copying sysdic/entries_compact6.py -> build/lib/sysdic
    copying sysdic/entries_compact9_idx.py -> build/lib/sysdic
    copying sysdic/entries_extra2_idx.py -> build/lib/sysdic
    copying sysdic/fst.data.0 -> build/lib/sysdic
    copying sysdic/fst.data.1 -> build/lib/sysdic
    running build_scripts
    creating build/scripts-3.4
    copying and adjusting bin/janome -> build/scripts-3.4
    changing mode of build/scripts-3.4/janome from 664 to 775
    running install_lib
    copying build/lib/sysdic/entries_extra5_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra4_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact9.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact1_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact3_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra1_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra3_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact4.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra0_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra1.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra6.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact2_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra3.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact8_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra9_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra2.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_buckets.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra7.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact3.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact0.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra0.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact8.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/init.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact5_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact4_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact7.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra6_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact1.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra5.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact7_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra9.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/connections1.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra4.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/unknowns.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/connections2.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra8_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra8.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/fst.data.0 -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact6_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/fst.data.1 -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact0_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/chardef.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact5.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra7_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact2.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact6.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_compact9_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/sysdic/entries_extra2_idx.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic
    copying build/lib/janome/fst.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/janome
    copying build/lib/janome/init.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/janome
    copying build/lib/janome/lattice.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/janome
    copying build/lib/janome/tokenizer.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/janome
    copying build/lib/janome/dic.py -> /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/janome
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_extra5_idx.py to entries_extra5_idx.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_extra4_idx.py to entries_extra4_idx.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_compact9.py to entries_compact9.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_compact1_idx.py to entries_compact1_idx.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_compact3_idx.py to entries_compact3_idx.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_extra1_idx.py to entries_extra1_idx.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_extra3_idx.py to entries_extra3_idx.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_compact4.py to entries_compact4.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_extra0_idx.py to entries_extra0_idx.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_extra1.py to entries_extra1.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_extra6.py to entries_extra6.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_compact2_idx.py to entries_compact2_idx.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_extra3.py to entries_extra3.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_compact8_idx.py to entries_compact8_idx.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_extra9_idx.py to entries_extra9_idx.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_extra2.py to entries_extra2.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_buckets.py to entries_buckets.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_extra7.py to entries_extra7.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_compact3.py to entries_compact3.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_compact0.py to entries_compact0.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_extra0.py to entries_extra0.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_compact8.py to entries_compact8.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/init.py to init.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_compact5_idx.py to entries_compact5_idx.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_compact4_idx.py to entries_compact4_idx.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_compact7.py to entries_compact7.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_extra6_idx.py to entries_extra6_idx.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_compact1.py to entries_compact1.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_extra5.py to entries_extra5.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_compact7_idx.py to entries_compact7_idx.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_extra9.py to entries_extra9.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/connections1.py to connections1.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/entries_extra4.py to entries_extra4.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/unknowns.py to unknowns.cpython-34.pyc
    byte-compiling /home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/sysdic/connections2.py to connections2.cpython-34.pyc
Cleaning up...
Removing temporary dir /home/vagrant/.virtualenvs/dev/build...
Command /home/vagrant/.virtualenvs/dev/bin/python3 -c "import setuptools, tokenize;file='/home/vagrant/.virtualenvs/dev/build/janome/setup.py';exec(compile(getattr(tokenize, 'open', open)(file).read().replace('\r\n', '\n'), file, 'exec'))" install --record /tmp/pip-mb9su4aw-record/install-record.txt --single-version-externally-managed --compile --install-headers /home/vagrant/.virtualenvs/dev/include/site/python3.4 failed with error code -9 in /home/vagrant/.virtualenvs/dev/build/janome
Exception information:
Traceback (most recent call last):
  File "/home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/pip/basecommand.py", line 122, in main
    status = self.run(options, args)
  File "/home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/pip/commands/install.py", line 283, in run
    requirement_set.install(install_options, global_options, root=options.root_path)
  File "/home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/pip/req.py", line 1435, in install
    requirement.install(install_options, global_options, *args, **kwargs)
  File "/home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/pip/req.py", line 706, in install
    cwd=self.source_dir, filter_stdout=self._filter_install, show_stdout=False)
  File "/home/vagrant/.virtualenvs/dev/lib/python3.4/site-packages/pip/util.py", line 697, in call_subprocess
    % (command_desc, proc.returncode, cwd))
pip.exceptions.InstallationError: Command /home/vagrant/.virtualenvs/dev/bin/python3 -c "import setuptools, tokenize;file='/home/vagrant/.virtualenvs/dev/build/janome/setup.py';exec(compile(getattr(tokenize, 'open', open)(file).read().replace('\r\n', '\n'), file, 'exec'))" install --record /tmp/pip-mb9su4aw-record/install-record.txt --single-version-externally-managed --compile --install-headers /home/vagrant/.virtualenvs/dev/include/site/python3.4 failed with error code -9 in /home/vagrant/.virtualenvs/dev/build/janome

Inappropriate property names: infl_form, infl_type

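The values of the Token properties infl_form ("連用形") and infl_type ("連用型") need to be swapped. At present infl_form returns the conjugation class (e.g. 五段マ行) and infl_type returns the conjugated form (e.g. 連用タ接続):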

>>> from janome.tokenizer import Tokenizer
>>> t = Tokenizer()
>>> data1 = (u"尼崎に住んでると言った。")
>>> for token in t.tokenize(data1):
...   print(token.infl_form)
... 
*
*
五段マ行
一段
*
五段ワ行促音便
特殊
*

>>> for token in t.tokenize(data1):
...   print(token.infl_type)
... 
*
*
連用タ接続
基本形
*
連用タ接続
基本形
*

The expected output is:

>>> from janome.tokenizer import Tokenizer
>>> t = Tokenizer()
>>> data1 = (u"尼崎に住んでると言った。")
>>> for token in t.tokenize(data1):
...   print(token.infl_form)
... 
*
*
連用タ接続
基本形
*
連用タ接続
基本形
*
>>> for token in t.tokenize(data1):
...   print(token.infl_type)
... 
*
*
五段マ行
一段
*
五段ワ行促音便
特殊
*
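For reference, the expected mapping can be sketched against the MeCab IPADIC feature layout, where the comma-separated columns are four part-of-speech levels, then the inflection type (活用型), the inflection form (活用形), the base form, the reading and the pronunciation. The helper and its field names below are hypothetical and written only for illustration; this is not janome's own parsing code, and the feature string is an illustrative IPADIC-style entry.

```python
# -*- coding: utf-8 -*-
from collections import namedtuple

# Field names are chosen for this sketch; only the column order follows
# the MeCab IPADIC feature layout.
Features = namedtuple(
    "Features",
    ["part_of_speech", "infl_type", "infl_form", "base_form", "reading", "phonetic"],
)


def parse_ipadic_features(feature_csv):
    """Split an IPADIC-style feature string into named fields."""
    cols = feature_csv.split(",")
    # Unknown words may carry fewer columns; pad the missing ones with '*'.
    cols += ["*"] * (9 - len(cols))
    return Features(",".join(cols[:4]), cols[4], cols[5], cols[6], cols[7], cols[8])


parsed = parse_ipadic_features(u"動詞,自立,*,*,五段・マ行,連用タ接続,住む,スン,スン")
print(parsed.infl_type)  # the conjugation class
print(parsed.infl_form)  # the conjugated form
```

With this column order, the conjugation class lands in infl_type and the conjugated form in infl_form, which matches the expected output above.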

Support Python 3.x and 2.7 with the same codebase

I'd like to merge janome and janomePy2.
The obstacle is fst.py: janome's implementation depends heavily on the bytes type introduced in Python 3.

To support user-defined dictionaries, we also need to keep compiled dictionary data binary-compatible between 3.x and 2.7.
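As a generic illustration of the kind of difference involved (not taken from janome's fst.py): iterating over a bytes object yields integers on Python 3 but one-character strings on Python 2, whereas bytearray yields integers on both. The helper name below is hypothetical.

```python
# -*- coding: utf-8 -*-
def byte_values(text, encoding="utf-8"):
    """Return the encoded bytes of `text` as a list of ints on Python 2 and 3."""
    # bytearray iterates as integers on both major versions.
    return list(bytearray(text.encode(encoding)))


print(byte_values(u"も"))  # prints [227, 130, 130]
```

Code that only ever sees integer byte values in this way behaves the same on both interpreters, which is one possible route to a shared implementation.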

token.part_of_speech is returned as both str and unicode

janome's Tokenizer returns tokens whose part_of_speech member is sometimes str and sometimes unicode. The script below reproduces this in my environment.

# -*- coding: utf-8 -*-
from __future__ import print_function

from janome.tokenizer import Tokenizer

def main():
    t = Tokenizer()
    for token in t.tokenize(u'すもも,も.もも/も:もものうち'):
        print('token.surface:', token.surface, type(token.surface))
        print('token.part_of_speech:', token.part_of_speech, type(token.part_of_speech))

if __name__ == '__main__':
    main()

It seems part_of_speech is unicode for symbol tokens and str for all other tokens.

Output:

token.surface: すもも <type 'unicode'>
token.part_of_speech: 名詞,一般,*,* <type 'str'>
token.surface: , <type 'unicode'>
token.part_of_speech: 名詞,サ変接続,*,* <type 'unicode'>
token.surface: も <type 'unicode'>
token.part_of_speech: 助詞,係助詞,*,* <type 'str'>
token.surface: . <type 'unicode'>
token.part_of_speech: 名詞,サ変接続,*,* <type 'unicode'>
token.surface: も <type 'unicode'>
token.part_of_speech: 助詞,係助詞,*,* <type 'str'>
token.surface: も <type 'unicode'>
token.part_of_speech: 助詞,係助詞,*,* <type 'str'>
token.surface: / <type 'unicode'>
token.part_of_speech: 名詞,サ変接続,*,* <type 'unicode'>
token.surface: も <type 'unicode'>
token.part_of_speech: 助詞,係助詞,*,* <type 'str'>
token.surface: : <type 'unicode'>
token.part_of_speech: 名詞,サ変接続,*,* <type 'unicode'>
token.surface: もも <type 'unicode'>
token.part_of_speech: 名詞,一般,*,* <type 'str'>
token.surface: の <type 'unicode'>
token.part_of_speech: 助詞,連体化,*,* <type 'str'>
token.surface: うち <type 'unicode'>
token.part_of_speech: 名詞,非自立,副詞可能,* <type 'str'>

My environment is as follows.

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.4 LTS
Release:    14.04
Codename:   trusty

Python 2.7.10
Janome 0.2.6
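Until the types are made consistent, a caller could normalize the values on its side. The following is a minimal sketch, assuming the str values are UTF-8 encoded byte strings; to_text is a hypothetical helper, not part of janome.

```python
# -*- coding: utf-8 -*-
def to_text(value, encoding="utf-8"):
    """Return `value` as text, decoding it first if it is a byte string.

    On Python 2 `bytes` is an alias of `str`, so this turns str into unicode;
    on Python 3 it turns bytes into str. Text values pass through unchanged.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = u"名詞,一般,*,*".encode("utf-8")  # stands in for a byte-string part_of_speech
print(to_text(raw) == to_text(u"名詞,一般,*,*"))  # prints True
```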

Add version information to `__init__.py`

Hello @mocobeta and janome-contributors !
Thank you for developing such a great library.

I am using janome in an interactive environment like jupyter-lab.

I'd be very happy if janome.__version__ could show the version, like other major libraries (Numpy, statsmodels, PyTorch, Optuna, etc.). This could be achieved by modifying __init__.py.

It would also be helpful to import classes such as Tokenizer in __init__.py, so that dir(janome) shows which classes the package provides.

However, this has the disadvantage of increasing the time and memory that import janome takes.

What are the opinions of developers regarding the implementation of this?
If there are no problems, I would like to send a pull request.

Best regards.

How to register words containing "," (comma) in the dictionary


Is there a way to register words that contain "," (comma) in the user dictionary?
For example, I want to register the following words.

  • 3,3-Dimethylpentane

With MeCab, a word containing a comma can be registered by enclosing it in double quotation marks.
With Janome, I get a "ValueError: too many values to unpack" error, as follows.

Traceback (most recent call last):
  File "janome_make_usrdic.py", line 86, in <module>
    user_dict = UserDictionary("janome_sample_dic.csv", "cp932", "ipadic", sysdic.connections)
  File "C:\Python27\lib\site-packages\janome\dic.py", line 393, in __init__
    compiledFST, entries = build_method(user_dict, enc)
  File "C:\Python27\lib\site-packages\janome\dic.py", line 405, in buildipadic
    line.split(',')
ValueError: too many values to unpack

Since each line is split with line.split(','), does that mean words containing commas cannot be handled?
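The difference between the two parsing strategies can be shown with Python's standard csv module. This is only a sketch of the behavior being asked about; it is not janome's implementation, and whether janome accepts quoted fields is not confirmed here. The dictionary row, including its IDs and cost, is made up for illustration.

```python
# -*- coding: utf-8 -*-
import csv

# A made-up 13-column user dictionary row whose surface and base form contain a comma.
line = u'"3,3-Dimethylpentane",1288,1288,5000,名詞,固有名詞,一般,*,*,*,"3,3-Dimethylpentane",*,*'

# Naive splitting breaks the quoted fields apart.
naive_fields = line.split(",")

# csv.reader honors the double quotes and keeps each quoted field whole.
quoted_fields = next(csv.reader([line]))

print(len(naive_fields), len(quoted_fields))
print(quoted_fields[0])
```

A parser based on quote-aware splitting would see the expected 13 columns, while a plain split on commas sees more and fails when unpacking.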

Environment:
Janome 0.3.10
python 2.7.18 32bit
Windows 8.1 64bit

Entries defined in a user dictionary are not extracted as expected

I want to extract specific proper nouns using a userdic.csv created by the function below.

import pandas as pd

def generate_userdic_janome(words):
    userdic = []
    for word in words:
        entry = f"{word},1288,1288,-100000,名詞,固有名詞,一般,*,*,*,{word},{word},{word}"
        userdic.append(entry.split(","))
    df = pd.DataFrame(userdic)
    df.to_csv("userdic.csv", header=False, index=False, encoding="utf-8-sig")
    print("generate userdic.csv")

Here is the generated dictionary.

クラウド,1288,1288,-100000,名詞,固有名詞,一般,*,*,*,クラウド,クラウド,クラウド

When I then run the following with the generated dictionary:

from janome.tokenizer import Tokenizer

t = Tokenizer("userdic.csv", udic_enc="utf8")
for token in t.tokenize('クラウド利用のお客様'):
    print(token)

the result is:

クラ	名詞,固有名詞,一般,*,*,*,クラ,クラ,クラ
ウド	名詞,一般,*,*,*,*,ウド,ウド,ウド
利用	名詞,サ変接続,*,*,*,*,利用,リヨウ,リヨー
の	助詞,連体化,*,*,*,*,の,ノ,ノ
お客様	名詞,一般,*,*,*,*,お客様,オキャクサマ,オキャクサマ

クラウド was not extracted as expected.
I would appreciate it if you could point out whether the problem lies in the program or in something like the cost setting.
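
One detail in the report that may be worth checking is the encoding mismatch: the file is written with "utf-8-sig" but read with udic_enc="utf8". The sketch below uses only the standard library to show what that mismatch does to the first field of the first row. Whether this explains the reported result is an assumption; it depends on Janome reading the file with the given encoding and not stripping the BOM. The helper name first_surface is hypothetical.

```python
import csv
import os
import tempfile

# The dictionary row from the report, as a list of fields.
entry = ["クラウド", "1288", "1288", "-100000", "名詞", "固有名詞", "一般",
         "*", "*", "*", "クラウド", "クラウド", "クラウド"]


def first_surface(write_enc, read_enc):
    """Write the row with write_enc, read it back with read_enc,
    and return the first field (the surface form)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "userdic.csv")
        with open(path, "w", encoding=write_enc, newline="") as f:
            csv.writer(f).writerow(entry)
        with open(path, encoding=read_enc) as f:
            return f.readline().split(",")[0]


with_bom = first_surface("utf-8-sig", "utf8")
without_bom = first_surface("utf-8", "utf8")
print(repr(with_bom))     # surface form prefixed with '\ufeff'
print(repr(without_bom))  # plain surface form
```

If the BOM does end up in the surface form, writing the file with encoding="utf-8" (or reading it with "utf-8-sig") would be one thing to try before adjusting costs.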

'Too many open files' with mmap=True

  • macOS 11.6
  • Python: 3.9.7
  • Janome: 0.4.1

With otherwise identical code, initializing Tokenizer with mmap=True raises "Too many open files".

from janome.tokenizer import Tokenizer

for i in range(10):
    t = Tokenizer(mmap=True)  # or False
    print(i)
    for _ in t.tokenize('すもももももももものうち'):
        pass

$ ulimit -n
256
$ python sample.py  # mmap=False
0 1 2 3 4 5 6 7 8 9 
$ python sample.py  # mmap=True
0 1 2 3 4 5 Traceback (most recent call last):
  File "/Users/takanori/Project/manaviria/djangoapp/apps/work/sample.py", line 4, in <module>
  File "/Users/takanori/Project/manaviria/djangoapp/apps/work/env/lib/python3.9/site-packages/janome/tokenizer.py", line 177, in __init__
  File "/Users/takanori/Project/manaviria/djangoapp/apps/work/env/lib/python3.9/site-packages/janome/sysdic/__init__.py", line 83, in mmap_entries
OSError: [Errno 24] Too many open files
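
The limit shown by ulimit -n can also be inspected from inside the Python process. A minimal sketch using the standard library resource module (available on Unix-like systems only); the helper name nofile_limits is hypothetical:

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = nofile_limits()
print(f"soft limit: {soft}, hard limit: {hard}")
```

Printing this before the loop makes it easier to relate the iteration at which the error occurs to the soft limit in effect.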

Stop supporting Python 2

Extended support for Python 2 will finally end on January 1st, 2020.
It's a good time to stop supporting Python 2 and drop all the code like if PY3: code_for_py3; else: code_for_py2 (a big thanks to them!).
We will then be able to simplify our code and make future development easier.

However, there is no need to rush, so I'm planning to drop support for Python 2 as of the first release in 2020.

Too many open files

I am using Janome version 0.4.1.
I created many Janome tokenizers and received the following error.

Traceback (most recent call last):
  File "janome_test.py", line 6, in <module>
  File "/tmp/test/venv/lib/python3.8/site-packages/janome/tokenizer.py", line 177, in __init__
  File "/tmp/test/venv/lib/python3.8/site-packages/janome/sysdic/__init__.py", line 92, in mmap_entries
OSError: [Errno 24] Too many open files

The code that reproduces it is the following.

from janome.tokenizer import Tokenizer

ts = []

for i in range(100):
    ts.append(Tokenizer())

for t in ts:
    t.tokenize('すもももももももものうち')
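
A possible workaround, sketched under the assumption that each Tokenizer instance holds its own open files (as the mmap_entries frame in the traceback suggests), is to build one tokenizer and reuse it everywhere. The helper names make_shared and _build_tokenizer are hypothetical and not part of Janome.

```python
from functools import lru_cache


def _build_tokenizer():
    # Imported lazily so that the helper below does not depend on Janome itself.
    from janome.tokenizer import Tokenizer
    return Tokenizer()


def make_shared(factory):
    """Wrap a zero-argument factory so that it runs at most once;
    later calls return the same cached object."""
    @lru_cache(maxsize=1)
    def get():
        return factory()
    return get


# Every caller goes through get_tokenizer() instead of calling Tokenizer() directly.
get_tokenizer = make_shared(_build_tokenizer)
```

Usage would then look like get_tokenizer().tokenize('すもももももももものうち') at each call site, so only one instance is ever created in the process.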
