neologd / mecab-ipadic-neologd


Neologism dictionary based on the language resources on the Web for mecab-ipadic

License: Other

Shell 92.13% Perl 7.87%
mecab-ipadic named-entities dictionary furigana neologism-dictionary mecab language-resources japanese-language

mecab-ipadic-neologd's Introduction

NEologd : Neologism dictionary generator

NEologd generates a neologism dictionary from various language resources.

Each entry of the neologism dictionary has the following four columns.

  • Surface
  • Phonetic signs
    • IPA (International Phonetic Alphabet)
    • kana indicating the pronunciation (In Japanese)
  • Base form of Surface
  • Part-Of-Speech (POS) tags
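
As a rough illustration of how these columns appear in practice, the seed lines quoted in the issues below are comma-separated rows with 13 fields in the usual mecab-ipadic layout (surface, left ID, right ID, cost, four POS levels, conjugation type, conjugation form, base form, reading, pronunciation). The sketch below parses one such row; the `SeedEntry` class and `parse_seed_line` function are hypothetical names introduced here, and it assumes the rows follow ordinary CSV quoting.

```python
import csv
from typing import NamedTuple, Tuple


class SeedEntry(NamedTuple):
    """One dictionary row (hypothetical container, not part of NEologd)."""
    surface: str
    left_id: int
    right_id: int
    cost: int
    pos: Tuple[str, str, str, str]
    conj_type: str
    conj_form: str
    base_form: str
    reading: str
    pronunciation: str


def parse_seed_line(line: str) -> SeedEntry:
    """Parse a single 13-field mecab-ipadic style CSV row."""
    fields = next(csv.reader([line]))
    if len(fields) != 13:
        raise ValueError(f"expected 13 fields, got {len(fields)}")
    return SeedEntry(
        surface=fields[0],
        left_id=int(fields[1]),
        right_id=int(fields[2]),
        cost=int(fields[3]),
        pos=(fields[4], fields[5], fields[6], fields[7]),
        conj_type=fields[8],
        conj_form=fields[9],
        base_form=fields[10],
        reading=fields[11],
        pronunciation=fields[12],
    )


# Row taken from one of the issue reports further down this page.
entry = parse_seed_line(
    "佐々木貞清,1289,1289,2337,名詞,固有名詞,人名,一般,*,*,佐々木貞清,ササキ,ササキ"
)
print(entry.surface, entry.pos, entry.base_form, entry.reading)
```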

NEologd keeps up with newly coined words so that you do not have to.

For Japanese

README.ja.md is written in Japanese.

Application example

Copyrights

Copyright (c) 2015 Toshinori Sato (@overlast) All rights reserved.

mecab-ipadic-neologd's People

Contributors

felixonmars, neologd, overlast, pecorarista


mecab-ipadic-neologd's Issues

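Not really an issue, but...

I am very sorry to ask, but I understand that using this dictionary together with MeCab's existing dictionary is recommended. Could you tell me how to use both of them?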

What is the correct way to customize the pos-id.def file in mecab-ipadic-neologd?

Hi,
I'm trying to modify the pos-id.def that ships with the neologd dictionary. After changing that file, I get an error whether I run
sudo ./mecab-dict-index -f UTF8 -t UTF8 -d /usr/lib/mecab/dic/mecab-ipadic-neologd
or
sudo ./mecab-dict-index -f UTF8 -t UTF8 -d .../build/mecab-ipadic-2.7.0-20070801-neologd-20170710
The errors are

dictionary_compiler.cpp(133) [dic.size()] no dictionaries are specified
and
char_property.cpp(236) [unk.find(it->first) != unk.end()] category [ALPHA] is undefined in ...../mecab-ipadic-neologd/build/mecab-ipadic-2.7.0-20070801-neologd-20170710/unk.def

respectively.

Could anyone tell me the correct way to compile the new pos-id.def for the neologd dictionary? Any hint is appreciated. Thanks.

Nothing happened after "Download original mecab-ipadic file"

I tried to install ipadic-neologd on a Mac by following the steps, but after the "Download original mecab-ipadic file" step nothing happened and the program seems to have stopped. Can you help? Thanks.

./bin/install-mecab-ipadic-neologd -n
[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : Check the existance of libraries
[install-mecab-ipadic-NEologd] : find => ok
[install-mecab-ipadic-NEologd] : sort => ok
[install-mecab-ipadic-NEologd] : head => ok
[install-mecab-ipadic-NEologd] : cut => ok
[install-mecab-ipadic-NEologd] : egrep => ok
[install-mecab-ipadic-NEologd] : mecab => ok
[install-mecab-ipadic-NEologd] : mecab-config => ok
[install-mecab-ipadic-NEologd] : make => ok
[install-mecab-ipadic-NEologd] : curl => ok
[install-mecab-ipadic-NEologd] : sed => ok
[install-mecab-ipadic-NEologd] : cat => ok
[install-mecab-ipadic-NEologd] : diff => ok
[install-mecab-ipadic-NEologd] : tar => ok
[install-mecab-ipadic-NEologd] : unxz => ok
[install-mecab-ipadic-NEologd] : xargs => ok
[install-mecab-ipadic-NEologd] : grep => ok
[install-mecab-ipadic-NEologd] : iconv => ok
[install-mecab-ipadic-NEologd] : patch => ok
[install-mecab-ipadic-NEologd] : which => ok
[install-mecab-ipadic-NEologd] : file => ok
[install-mecab-ipadic-NEologd] : openssl => ok
[install-mecab-ipadic-NEologd] : awk => ok

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd is already up-to-date

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd will be install to /usr/local/lib/mecab/dic/mecab-ipadic-neologd

[install-mecab-ipadic-NEologd] : Make mecab-ipadic-NEologd
[make-mecab-ipadic-NEologd] : Start..
[make-mecab-ipadic-NEologd] : Check local seed directory
[make-mecab-ipadic-NEologd] : Check local seed file
[make-mecab-ipadic-NEologd] : Check local build directory
[make-mecab-ipadic-NEologd] : Download original mecab-ipadic file
MacBook-Pro:mecab-ipadic-neologd User$

Most "}" entries are unnecessary

I think most "}" entries are unnecessary.

ag } mecab-user-dict-seed.20200130.csv 
47452:2015年{{lang|zh|深圳}}土砂崩事故,1288,1288,3753,名詞,固有名詞,一般,*,*,*,2015年{{lang|zh|深圳}}土砂崩事故,ニセンジュウゴネンシンセンドシャクズレジコ,ニセンジュウゴネンシンセンドシャクズレジコ
388354:}★,1288,1288,8142,名詞,固有名詞,一般,*,*,*,}★,ワルグチトワルコメボクメツダンタ,ワルグチトワルコメボクメツダンタ
655423:カジタタカアキ,1289,1289,4374,名詞,固有名詞,人名,一般,*,*,梶田隆章{{R|nichigai}},カジタタカアキ,カジタタカーキ
842080:ザウィドウ{{}}真実を求めて{{}},1288,1288,4068,名詞,固有名詞,一般,*,*,*,ザ・ウィドウ{{~}}真実を求めて{{~}},ザウィドウシンジツヲモトメテ,ザウィドウシンジツオモトメテ

Thank you for providing and keeping a good dictionary.
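
A search like the `ag }` above can also be sketched in Python so that it only looks at the surface and base-form columns instead of the whole line. This is a minimal sketch, assuming the 13-field mecab-ipadic CSV layout seen in the rows above; `find_brace_entries` is a hypothetical helper name, not part of NEologd.

```python
import csv
from typing import Iterable, List, Tuple

BRACES = ("{", "}")


def find_brace_entries(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line number, surface, base form) for rows containing braces."""
    hits = []
    for line_no, fields in enumerate(csv.reader(lines), start=1):
        if len(fields) < 11:
            continue  # skip malformed or short rows
        surface, base_form = fields[0], fields[10]
        if any(b in surface or b in base_form for b in BRACES):
            hits.append((line_no, surface, base_form))
    return hits


# Rows copied from this page; the third one has no braces.
sample = [
    "}★,1288,1288,8142,名詞,固有名詞,一般,*,*,*,}★,ワルグチトワルコメボクメツダンタ,ワルグチトワルコメボクメツダンタ",
    "カジタタカアキ,1289,1289,4374,名詞,固有名詞,人名,一般,*,*,梶田隆章{{R|nichigai}},カジタタカアキ,カジタタカーキ",
    "佐々木貞清,1289,1289,2337,名詞,固有名詞,人名,一般,*,*,佐々木貞清,ササキ,ササキ",
]
for line_no, surface, base_form in find_brace_entries(sample):
    print(line_no, surface, base_form)
```

For a real seed file, the same function can be fed an open file object, for example `open(path, encoding="utf-8", newline="")` on the decompressed CSV.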

Download common-nouns.csv of specific date

Motivation

  • Extract the nouns newly added to the dictionary by comparing the current common-nouns.csv with last year's common-nouns.csv

Goal

  • Download the latest common-nouns.csv and last year's common-nouns.csv.
  • Is there any way to download the common-nouns.csv of a specific date?
    I have looked into the /seeds/ directory, but it seems to contain only the 2017/02 common-nouns.csv.
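
Once two snapshots are available, the comparison described under Motivation could look roughly like the sketch below. It assumes that the surface form is the first CSV column of each row; `added_surfaces` is a hypothetical helper, and the rows in the example are toy data, not real dictionary content.

```python
import csv
from typing import Iterable, List, Set


def surfaces(lines: Iterable[str]) -> Set[str]:
    """Collect the first column (assumed to be the surface form) of each row."""
    return {fields[0] for fields in csv.reader(lines) if fields}


def added_surfaces(old_lines: Iterable[str], new_lines: Iterable[str]) -> List[str]:
    """Return surfaces present in the new snapshot but not in the old one."""
    return sorted(surfaces(new_lines) - surfaces(old_lines))


# Toy rows for illustration only.
old_snapshot = ["alpha,1,1,100", "beta,1,1,100"]
new_snapshot = ["alpha,1,1,100", "beta,1,1,100", "gamma,1,1,100", "delta,1,1,100"]
print(added_surfaces(old_snapshot, new_snapshot))
```

With real files, both arguments can be file objects opened with `encoding="utf-8"` and `newline=""`.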

With best regards

A small script to find wrong yomigana entries

Hello,

First of all, your mecab-ipadic-neologd is amazing.
Thank you so much!

I wrote a small script and found some wrong yomigana entries.
find-neologd-error-entries.rb.txt
neologd-error-entries.txt

$ ruby find-neologd-error-entries.rb mecab-user-dict-seed.20160111.csv
It generates "neologd-error-entries.txt".

e.g.

  • 京都市上京区西町,1293,1293,-1319,名詞,固有名詞,地域,一般,*,*,京都市上京区西町,キョウトシカミギョウクニシマチニシマチニシマチニシマチニシマチニシマチニシマチ,キョートシカミギョークニシマチニシマチニシマチニシマチニシマチニシマチニシマチ
  • 神津小学校,1288,1288,352,名詞,固有名詞,一般,*,*,*,神津小学校,カミツショウガッコウコウヅショウガッコウ,カミツショーガッコーコーズショウガッコー

I know we can't get yomigana perfectly, but neologd may have some errors in zip code data splitting.
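
The attached Ruby script is not reproduced on this page, so the following is not its logic. It is only a minimal Python sketch of one possible heuristic for the first kind of error above: flag a reading in which a chunk of three or more characters is immediately repeated. The name `repeated_chunk` and the three-character threshold are assumptions made for this example.

```python
import re
from typing import Optional

# A chunk of at least three characters that is immediately followed by itself.
REPEAT = re.compile(r"(.{3,}?)\1")


def repeated_chunk(reading: str) -> Optional[str]:
    """Return the first immediately repeated chunk in a reading, or None."""
    match = REPEAT.search(reading)
    return match.group(1) if match else None


bad = "キョウトシカミギョウクニシマチニシマチニシマチニシマチニシマチニシマチニシマチ"
print(repeated_chunk(bad))
print(repeated_chunk("ササキ"))
```

This heuristic is deliberately narrow. It would not flag the second example, where two alternative readings of the same word appear to be concatenated rather than one chunk being repeated back to back.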

mecab-ipadic-NEologd won't be updated when running the installer with the full path

I ran the installer from a cron job, using its full path and the -n option, like this:

00 03 * * 2 /opt/mecab/mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -y

Then the following errors occurred.

fatal: Not a git repository (or any of the parent directories): .git
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

This occurred in the following code because the current directory was not a git repository.

if [ `git log refs/heads/master --pretty=%H | head -1` = `git ls-remote origin -h refs/heads/master |cut -f1` ]; then
    echo "$ECHO_PREFIX mecab-ipadic-NEologd is already up-to-date"

In this case, the condition is always true because both of the results are empty.
Therefore, the message "mecab-ipadic-NEologd is already up-to-date" is always displayed.
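
The installer itself is a shell script, so the following Python sketch is only an illustration of the logic, not a patch. It shows two ideas under stated assumptions: running git against an explicit repository directory with `git -C`, and treating an empty or missing hash as "unknown" rather than "equal". The helper names `git_first_token` and `is_up_to_date` are hypothetical.

```python
import subprocess
from typing import Optional


def git_first_token(repo_dir: str, *args: str) -> Optional[str]:
    """Run git inside repo_dir and return the first token of its output."""
    result = subprocess.run(
        ["git", "-C", repo_dir, *args],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # e.g. "Not a git repository"
    tokens = result.stdout.split()
    return tokens[0] if tokens else None


def is_up_to_date(local_hash: Optional[str], remote_hash: Optional[str]) -> bool:
    """Two hashes only count as equal when both are actually present."""
    if not local_hash or not remote_hash:
        return False
    return local_hash == remote_hash


# Intended use (not executed here because it needs a clone and network access):
#   local = git_first_token(repo, "log", "-1", "--pretty=%H", "refs/heads/master")
#   remote = git_first_token(repo, "ls-remote", "origin", "-h", "refs/heads/master")
#   print(is_up_to_date(local, remote))
print(is_up_to_date("", ""))
```

With this guard, the case reported above (both commands failing and producing nothing) is no longer reported as up-to-date.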

Wrong entry for ササキ

佐々木貞清,1289,1289,2337,名詞,固有名詞,人名,一般,*,*,佐々木貞清,ササキ,ササキ

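Installation fails

After cloning the repository, running the installer with the following command fails with an error because the hash value of mecab-ipadic-2.7.0-20070801.tar.gz does not match.

$ ./bin/install-mecab-ipadic-neologd -n

Log at the time of the error: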

[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : Check the existance of libraries
[install-mecab-ipadic-NEologd] :     find => ok
[install-mecab-ipadic-NEologd] :     sort => ok
[install-mecab-ipadic-NEologd] :     head => ok
[install-mecab-ipadic-NEologd] :     cut => ok
[install-mecab-ipadic-NEologd] :     egrep => ok
[install-mecab-ipadic-NEologd] :     mecab => ok
[install-mecab-ipadic-NEologd] :     mecab-config => ok
[install-mecab-ipadic-NEologd] :     make => ok
[install-mecab-ipadic-NEologd] :     curl => ok
[install-mecab-ipadic-NEologd] :     sed => ok
[install-mecab-ipadic-NEologd] :     cat => ok
[install-mecab-ipadic-NEologd] :     diff => ok
[install-mecab-ipadic-NEologd] :     tar => ok
[install-mecab-ipadic-NEologd] :     unxz => ok
[install-mecab-ipadic-NEologd] :     xargs => ok
[install-mecab-ipadic-NEologd] :     grep => ok
[install-mecab-ipadic-NEologd] :     iconv => ok
[install-mecab-ipadic-NEologd] :     patch => ok
[install-mecab-ipadic-NEologd] :     which => ok
[install-mecab-ipadic-NEologd] :     file => ok
[install-mecab-ipadic-NEologd] :     openssl => ok
[install-mecab-ipadic-NEologd] :     awk => ok

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd is already up-to-date

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd will be install to /usr/lib/mecab/dic/mecab-ipadic-neologd

[install-mecab-ipadic-NEologd] : Make mecab-ipadic-NEologd
[make-mecab-ipadic-NEologd] : Start..
[make-mecab-ipadic-NEologd] : Check local seed directory
[make-mecab-ipadic-NEologd] : Check local seed file
[make-mecab-ipadic-NEologd] : Check local build directory
[make-mecab-ipadic-NEologd] : create /mecab-ipadic-neologd/libexec/../build
[make-mecab-ipadic-NEologd] : Download original mecab-ipadic file
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3435    0  3435    0     0   7797      0 --:--:-- --:--:-- --:--:--  7896
[make-mecab-ipadic-NEologd] : Fail to download /mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801.tar.gz
[make-mecab-ipadic-NEologd] : You should remove /mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801.tar.gz before retrying to install mecab-ipadic-NEologd
[make-mecab-ipadic-NEologd] :        rm -rf /mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801
[make-mecab-ipadic-NEologd] :        rm /mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801.tar.gz

Accessing the following URL, where the tar file in question is hosted, shows a Google Drive error, which may be what is causing this.

https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM

e-mail and URL tokenization

Motivation and Goal

Instead of breaking an email address or a URL into pieces, it would be useful to have an option that identifies email addresses and URLs as single tokens. The example below compares the current behavior with the suggested one.

Sample code

import MeCab
mecab = MeCab.Tagger("-Ochasen -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd")

text = "中川さんのメールはnakagawa@xxxx.co.jpです"
print(mecab.parse(text))

Output

中川    ナカガワ    中川    名詞-固有名詞-人名-
さん    サン    さん    名詞-接尾-人名
の    ノ    の    助詞-連体化
メール    メール    メール    名詞-サ変接続
は    ハ    は    助詞-係助詞
nakagawa    nakagawa    nakagawa    名詞-固有名詞-組織
@    @    @    記号-一般
xxxx    イエナイ    XXXX    名詞-固有名詞-一般
.    .    .    記号-一般
co.jp    シーオージェイピー    co.jp    名詞-固有名詞-一般
です    デス    です    助動詞    特殊    基本形
EOS

Desirable output

中川    ナカガワ    中川    名詞-固有名詞-人名-
さん    サン    さん    名詞-接尾-人名
の    ノ    の    助詞-連体化
メール    メール    メール    名詞-サ変接続
は    ハ    は    助詞-係助詞
nakagawa@xxxx.co.jp    [...]
です    デス    です    助動詞    特殊    基本形
EOS
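
Until the dictionary offers such an option, one workaround is to split the text before it reaches MeCab. The sketch below is an assumption-laden illustration, not a NEologd feature: it marks email addresses and URLs as atomic segments with regular expressions, so that only the remaining segments need to be passed to the tagger. `split_atomic` and both patterns are hypothetical and intentionally simple.

```python
import re
from typing import List, Tuple

# Deliberately simple ASCII-only patterns; real-world addresses can be messier.
URL = r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+"
EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
ATOMIC = re.compile(f"{URL}|{EMAIL}")


def split_atomic(text: str) -> List[Tuple[str, bool]]:
    """Split text into (segment, is_atomic) pairs, keeping order."""
    segments = []
    pos = 0
    for match in ATOMIC.finditer(text):
        if match.start() > pos:
            segments.append((text[pos:match.start()], False))
        segments.append((match.group(0), True))
        pos = match.end()
    if pos < len(text):
        segments.append((text[pos:], False))
    return segments


for segment, is_atomic in split_atomic("中川さんのメールはnakagawa@xxxx.co.jpです"):
    print(is_atomic, segment)
```

Segments flagged as atomic would be emitted as single tokens by the caller, while the others would go through `mecab.parse(...)` as in the sample code above. Splitting first can change analysis at the segment boundaries, so this is a trade-off rather than a drop-in replacement.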

すもももももももものうち

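Training a dictionary myself is a hassle, so I was looking for a reasonably recent dictionary and came across mecab-ipadic-neologd. Text that used to be chopped into small pieces is now recognized as single words, and it seems to work well. However, there is one problem. When I analyze 「すもももももももものうち」, the whole string is analyzed as a single common noun, 「すもももももももものうち」.

Is this because something was missing when I ran make to build the dictionary? Or is this the intended behavior?

I have confirmed that mecab-unidic-neologd likewise analyzes it as a common noun.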

How to use on Windows10 and Python?

I do not speak Japanese. First, I installed MeCab from this page:
https://pypi.org/project/mecab-python3/

although it did not create a mecab folder on my PC.

In my Python file I wrote wakati = MeCab.Tagger("-Owakati") and it worked well. However, I have been told that mecab-ipadic-neologd is better and that I should use it, but all the guides are written for Linux and macOS. Please help.

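Some thoughts after trying neologd

Hello.
Thank you for letting me use this dictionary.
Entries that mix digits, Latin letters, or symbols are tagged as noun / proper noun.
In ipadic, the feature shows that a token is a numeric element.
In the same way, could you add an element such as 度量衡 (weights and measures) to nouns and proper nouns that contain digits?
If there is a better idea, another approach is fine too.
On their own, ¥ is a symbol, while %, kg, cm, and Ⅼ (which cannot be read as "liter") are nouns.
For example:
    4カ月    名詞、固有名詞、一般、度量衡
    5つ      名詞、一般、    、度量衡
    A型      名詞、固有名詞、一般、度量衡
    35℃     名詞、固有名詞、一般、度量衡
    70%     名詞、固有名詞、一般、度量衡
    65kg    名詞、固有名詞、一般、度量衡
    180cm   名詞、固有名詞、一般、度量衡
    500m3   (this one gets split)
    8個(ケ)  (this one gets split)
    5l      (can be read as "liter", but gets split)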
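
As a stopgap outside the dictionary, a token that is a number followed by a unit can be recognized with a regular expression after analysis. The sketch below is only an illustration of that idea; the unit list is an assumption taken from the examples above, and `split_quantity` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Units taken from the examples in this issue; extend as needed.
UNITS = ("カ月", "つ", "℃", "%", "kg", "cm", "m3", "個", "l")
QUANTITY = re.compile(
    r"^(\d+(?:\.\d+)?)(" + "|".join(re.escape(u) for u in UNITS) + r")$"
)


def split_quantity(token: str) -> Optional[Tuple[str, str]]:
    """Return (number, unit) if the token is a number plus a known unit."""
    match = QUANTITY.match(token)
    return (match.group(1), match.group(2)) if match else None


for token in ["4カ月", "65kg", "500m3", "A型"]:
    print(token, split_quantity(token))
```

Tokens such as A型 are not numeric and are left alone by this sketch; covering them would need a separate rule.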

How to produce mecab-user-dict-seed.YYYYMMDD.csv.xz?

Hi, I love and appreciate this helpful dictionary!

A quick question: how do you produce the seed file mecab-user-dict-seed.YYYYMMDD.csv.xz?
I suppose you use some scripts to do it; if so, are those scripts also uploaded to the git repo?

I'm looking for a way to build a slightly customized version of the dictionary.

Thanks in advance!

It cannot parse である correctly

When I use mecab with the default dictionary, it parses this sentence correctly.

対象者はゼロであるが、実施する。
対象    名詞,一般,*,*,*,*,対象,タイショウ,タイショー
者      名詞,接尾,一般,*,*,*,者,シャ,シャ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
ゼロ    名詞,数,*,*,*,*,ゼロ,ゼロ,ゼロ
で      助動詞,*,*,*,特殊・ダ,連用形,だ,デ,デ
ある    助動詞,*,*,*,五段・ラ行アル,基本形,ある,アル,アル
が      助詞,接続助詞,*,*,*,*,が,ガ,ガ
、      記号,読点,*,*,*,*,、,、,、
実施    名詞,サ変接続,*,*,*,*,実施,ジッシ,ジッシ
する    動詞,自立,*,*,サ変・スル,基本形,する,スル,スル
。      記号,句点,*,*,*,*,。,。,。

But when I use mecab with the neologd dictionary (commit 0700f47), 「である」 is treated as 固有名詞 (a proper noun).

対象者はゼロであるが、実施する。
対象者  名詞,固有名詞,一般,*,*,*,対象者,タイショウシャ,タイショーシャ
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
ゼロ    名詞,数,*,*,*,*,ゼロ,ゼロ,ゼロ
である  名詞,固有名詞,一般,*,*,*,である,デアル,デアル
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
、      記号,読点,*,*,*,*,、,、,、
実施    名詞,サ変接続,*,*,*,*,実施,ジッシ,ジッシ
する    動詞,自立,*,*,サ変・スル,基本形,する,スル,スル
。      記号,句点,*,*,*,*,。,。,。

Is this a bug, or is the sentence grammatically wrong?
Thanks.
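
Cases like this can be hunted for mechanically. The following is a rough heuristic sketch, not a NEologd tool: it scans MeCab output in the format shown above and flags tokens tagged 名詞,固有名詞 whose surface consists only of hiragana, which is how the 「である」 entry shows up. The function names are hypothetical, and hiragana-only proper nouns do exist, so every hit still needs a manual check.

```python
from typing import List


def is_hiragana(text: str) -> bool:
    """True if every character is in the hiragana block (U+3041 to U+309F)."""
    return bool(text) and all("\u3041" <= ch <= "\u309f" for ch in text)


def suspicious_tokens(mecab_output: str) -> List[str]:
    """Return hiragana-only surfaces that were tagged as proper nouns."""
    flagged = []
    for line in mecab_output.splitlines():
        line = line.strip()
        if not line or line == "EOS":
            continue
        parts = line.split(None, 1)  # surface, then the comma-separated feature
        if len(parts) != 2:
            continue
        surface, feature = parts
        pos = feature.split(",")
        if pos[:2] == ["名詞", "固有名詞"] and is_hiragana(surface):
            flagged.append(surface)
    return flagged


output = """対象者  名詞,固有名詞,一般,*,*,*,対象者,タイショウシャ,タイショーシャ
である  名詞,固有名詞,一般,*,*,*,である,デアル,デアル
が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
EOS"""
print(suspicious_tokens(output))
```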

Missing Japanese names

These names are missing in mecab-user-dict-seed.20181112.csv and mecab-ipadic-2.7.0-20070801.
I think they are famous/common names.

サンペイ 三瓶
ソウシゲル 宗茂
タケユタカ 武豊
ユウト 勇人
リンカ 梨花

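Organization names 「日生協」 / 「日本生活協同組合連合会」

I have a request about adding words.
Please add 「日生協」, the abbreviation for what is commonly called 「生協」 (COOP), and its official name 「日本生活協同組合連合会」.

At present only 「日生」 exists in the dictionary, as a family name, so processing 「日生協」
splits it into 「日生」 and 「協」.

I could not think of a good way to turn this into a patch, so I am reporting it as an issue.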

installer reports error

installer reports fatal: ambiguous argument '...refs/heads/master^': unknown revision or path not in the working tree.

system env:

~% $SHELL '--version'
zsh 5.0.2 (x86_64-pc-linux-gnu)

~% git --version
git version 2.5.0

detail logs:

....

fatal: ambiguous argument '...refs/heads/master^': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
[install-mecab-ipadic-neologd] : Get the newest updated information using git
./bin/install-mecab-ipadic-neologd: line 199: [: =: unary operator expected
HEAD is now at f987dff Fix typo

[install-mecab-ipadic-neologd] : mecab-ipadic-neologd will be install to /usr/lib64/mecab/dic/mecab-ipadic-neologd

....

「世界の秘密」

Hello,
This dictionary is very good!
I use it every day.
Thank you so much.

By the way, the phrase "世界の秘密" is analyzed as a single token by this dictionary.
The phrase is the name of a quiz TV program.
However, that program ended after only five months.
I think the phrase should be analyzed as "世界" + "の" + "秘密".

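Error 「line 525: 6288 Killed ${MECAB_LIBEXEC_DIR}/mecab-dict-index -f UTF8 -t UTF8」 during build

Error description

After running git clone on the latest repository, an error is displayed and the installation fails.
It looks as if the wrong directory is being referenced; I would appreciate any advice.

Situation

・I am using a Dockerfile.
・The build runs after git clone inside the Dockerfile.

Code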

# Dockerfile

FROM python:3.6
WORKDIR /code
ENV PYTHONUNBUFFERED 1
COPY requirements.txt /code/
RUN apt-get update -y&&\
    apt-get upgrade -y&&\
    apt-get install mecab libmecab-dev mecab-ipadic mecab-ipadic-utf8 sudo -y&&\
    apt-get install git make curl xz-utils file&&\
    git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git&&\
    /code/mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -y &&\
    mkdir /code/media && \
    mkdir /code/static &&\
    python -m pip install --upgrade pip &&\
    pip install -r requirements.txt
COPY . /code/

Full error output

The output before this point is unrelated, so it is omitted.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
emitting double-array: 100% |###########################################| 
/usr/share/mecab/dic/ipadic/model.def is not found. skipped.
reading /usr/share/mecab/dic/ipadic/Adnominal.csv ... 135
reading /usr/share/mecab/dic/ipadic/Postp-col.csv ... 91
reading /usr/share/mecab/dic/ipadic/Filler.csv ... 19
reading /usr/share/mecab/dic/ipadic/Others.csv ... 2
reading /usr/share/mecab/dic/ipadic/Noun.nai.csv ... 42
reading /usr/share/mecab/dic/ipadic/Noun.others.csv ... 151
reading /usr/share/mecab/dic/ipadic/Noun.verbal.csv ... 12146
reading /usr/share/mecab/dic/ipadic/Noun.proper.csv ... 27328
reading /usr/share/mecab/dic/ipadic/Conjunction.csv ... 171
reading /usr/share/mecab/dic/ipadic/Adj.csv ... 27210
reading /usr/share/mecab/dic/ipadic/Postp.csv ... 146
reading /usr/share/mecab/dic/ipadic/Noun.number.csv ... 42
reading /usr/share/mecab/dic/ipadic/Noun.name.csv ... 34202
reading /usr/share/mecab/dic/ipadic/Noun.place.csv ... 72999
reading /usr/share/mecab/dic/ipadic/Noun.csv ... 60477
reading /usr/share/mecab/dic/ipadic/Interjection.csv ... 252
reading /usr/share/mecab/dic/ipadic/Auxil.csv ... 199
reading /usr/share/mecab/dic/ipadic/Noun.demonst.csv ... 120
reading /usr/share/mecab/dic/ipadic/Adverb.csv ... 3032
reading /usr/share/mecab/dic/ipadic/Noun.adverbal.csv ... 795
reading /usr/share/mecab/dic/ipadic/Verb.csv ... 130750
reading /usr/share/mecab/dic/ipadic/Prefix.csv ... 221
reading /usr/share/mecab/dic/ipadic/Suffix.csv ... 1393
reading /usr/share/mecab/dic/ipadic/Symbol.csv ... 208
reading /usr/share/mecab/dic/ipadic/Noun.adjv.csv ... 3328
reading /usr/share/mecab/dic/ipadic/Noun.org.csv ... 16668
emitting double-array: 100% |###########################################| 
reading /usr/share/mecab/dic/ipadic/matrix.def ... 1316x1316
emitting matrix      : 100% |###########################################| 

done!
Setting up mecab-ipadic-utf8 (2.7.0-20070801+main-2.1) ...
Compiling IPA dictionary for Mecab.  This takes long time...
reading /usr/share/mecab/dic/ipadic/unk.def ... 40
emitting double-array: 100% |###########################################| 
/usr/share/mecab/dic/ipadic/model.def is not found. skipped.
reading /usr/share/mecab/dic/ipadic/Adnominal.csv ... 135
reading /usr/share/mecab/dic/ipadic/Postp-col.csv ... 91
reading /usr/share/mecab/dic/ipadic/Filler.csv ... 19
reading /usr/share/mecab/dic/ipadic/Others.csv ... 2
reading /usr/share/mecab/dic/ipadic/Noun.nai.csv ... 42
reading /usr/share/mecab/dic/ipadic/Noun.others.csv ... 151
reading /usr/share/mecab/dic/ipadic/Noun.verbal.csv ... 12146
reading /usr/share/mecab/dic/ipadic/Noun.proper.csv ... 27328
reading /usr/share/mecab/dic/ipadic/Conjunction.csv ... 171
reading /usr/share/mecab/dic/ipadic/Adj.csv ... 27210
reading /usr/share/mecab/dic/ipadic/Postp.csv ... 146
reading /usr/share/mecab/dic/ipadic/Noun.number.csv ... 42
reading /usr/share/mecab/dic/ipadic/Noun.name.csv ... 34202
reading /usr/share/mecab/dic/ipadic/Noun.place.csv ... 72999
reading /usr/share/mecab/dic/ipadic/Noun.csv ... 60477
reading /usr/share/mecab/dic/ipadic/Interjection.csv ... 252
reading /usr/share/mecab/dic/ipadic/Auxil.csv ... 199
reading /usr/share/mecab/dic/ipadic/Noun.demonst.csv ... 120
reading /usr/share/mecab/dic/ipadic/Adverb.csv ... 3032
reading /usr/share/mecab/dic/ipadic/Noun.adverbal.csv ... 795
reading /usr/share/mecab/dic/ipadic/Verb.csv ... 130750
reading /usr/share/mecab/dic/ipadic/Prefix.csv ... 221
reading /usr/share/mecab/dic/ipadic/Suffix.csv ... 1393
reading /usr/share/mecab/dic/ipadic/Symbol.csv ... 208
reading /usr/share/mecab/dic/ipadic/Noun.adjv.csv ... 3328
reading /usr/share/mecab/dic/ipadic/Noun.org.csv ... 16668
emitting double-array: 100% |###########################################| 
reading /usr/share/mecab/dic/ipadic/matrix.def ... 1316x1316
emitting matrix      : 100% |###########################################| 

done!
update-alternatives: using /var/lib/mecab/dic/ipadic-utf8 to provide /var/lib/mecab/dic/debian (mecab-dictionary) in auto mode
Processing triggers for libc-bin (2.28-10) ...
Reading package lists...
Building dependency tree...
Reading state information...
curl is already the newest version (7.64.0-4+deb10u1).
file is already the newest version (1:5.35-4+deb10u1).
git is already the newest version (1:2.20.1-2+deb10u3).
make is already the newest version (4.2.1-1.2).
xz-utils is already the newest version (5.2.4-1).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Cloning into 'mecab-ipadic-neologd'...
[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : Check the existance of libraries
[install-mecab-ipadic-NEologd] :     find => ok
[install-mecab-ipadic-NEologd] :     sort => ok
[install-mecab-ipadic-NEologd] :     head => ok
[install-mecab-ipadic-NEologd] :     cut => ok
[install-mecab-ipadic-NEologd] :     egrep => ok
[install-mecab-ipadic-NEologd] :     mecab => ok
[install-mecab-ipadic-NEologd] :     mecab-config => ok
[install-mecab-ipadic-NEologd] :     make => ok
[install-mecab-ipadic-NEologd] :     curl => ok
[install-mecab-ipadic-NEologd] :     sed => ok
[install-mecab-ipadic-NEologd] :     cat => ok
[install-mecab-ipadic-NEologd] :     diff => ok
[install-mecab-ipadic-NEologd] :     tar => ok
[install-mecab-ipadic-NEologd] :     unxz => ok
[install-mecab-ipadic-NEologd] :     xargs => ok
[install-mecab-ipadic-NEologd] :     grep => ok
[install-mecab-ipadic-NEologd] :     iconv => ok
[install-mecab-ipadic-NEologd] :     patch => ok
[install-mecab-ipadic-NEologd] :     which => ok
[install-mecab-ipadic-NEologd] :     file => ok
[install-mecab-ipadic-NEologd] :     openssl => ok
[install-mecab-ipadic-NEologd] :     awk => ok

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd is already up-to-date

[install-mecab-ipadic-NEologd] : mecab-ipadic-NEologd will be install to /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd

[install-mecab-ipadic-NEologd] : Make mecab-ipadic-NEologd
[make-mecab-ipadic-NEologd] : Start..
[make-mecab-ipadic-NEologd] : Check local seed directory
[make-mecab-ipadic-NEologd] : Check local seed file
[make-mecab-ipadic-NEologd] : Check local build directory
[make-mecab-ipadic-NEologd] : create /code/mecab-ipadic-neologd/libexec/../build
[make-mecab-ipadic-NEologd] : Download original mecab-ipadic file
[make-mecab-ipadic-NEologd] : Try to access to https://ja.osdn.net
[make-mecab-ipadic-NEologd] : Try to download from https://ja.osdn.net/frs/g_redir.php?m=kent&f=mecab%2Fmecab-ipadic%2F2.7.0-20070801%2Fmecab-ipadic-2.7.0-20070801.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 11.6M  100 11.6M    0     0  7350k      0  0:00:01  0:00:01 --:--:-- 7731k
Hash value of /code/mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801.tar.gz matched
[make-mecab-ipadic-NEologd] : Decompress original mecab-ipadic file
mecab-ipadic-2.7.0-20070801/
mecab-ipadic-2.7.0-20070801/README
mecab-ipadic-2.7.0-20070801/AUTHORS
mecab-ipadic-2.7.0-20070801/COPYING
mecab-ipadic-2.7.0-20070801/ChangeLog
mecab-ipadic-2.7.0-20070801/INSTALL
mecab-ipadic-2.7.0-20070801/Makefile.am
mecab-ipadic-2.7.0-20070801/Makefile.in
mecab-ipadic-2.7.0-20070801/NEWS
mecab-ipadic-2.7.0-20070801/aclocal.m4
mecab-ipadic-2.7.0-20070801/config.guess
mecab-ipadic-2.7.0-20070801/config.sub
mecab-ipadic-2.7.0-20070801/configure
mecab-ipadic-2.7.0-20070801/configure.in
mecab-ipadic-2.7.0-20070801/install-sh
mecab-ipadic-2.7.0-20070801/missing
mecab-ipadic-2.7.0-20070801/mkinstalldirs
mecab-ipadic-2.7.0-20070801/Adj.csv
mecab-ipadic-2.7.0-20070801/Adnominal.csv
mecab-ipadic-2.7.0-20070801/Adverb.csv
mecab-ipadic-2.7.0-20070801/Auxil.csv
mecab-ipadic-2.7.0-20070801/Conjunction.csv
mecab-ipadic-2.7.0-20070801/Filler.csv
mecab-ipadic-2.7.0-20070801/Interjection.csv
mecab-ipadic-2.7.0-20070801/Noun.adjv.csv
mecab-ipadic-2.7.0-20070801/Noun.adverbal.csv
mecab-ipadic-2.7.0-20070801/Noun.csv
mecab-ipadic-2.7.0-20070801/Noun.demonst.csv
mecab-ipadic-2.7.0-20070801/Noun.nai.csv
mecab-ipadic-2.7.0-20070801/Noun.name.csv
mecab-ipadic-2.7.0-20070801/Noun.number.csv
mecab-ipadic-2.7.0-20070801/Noun.org.csv
mecab-ipadic-2.7.0-20070801/Noun.others.csv
mecab-ipadic-2.7.0-20070801/Noun.place.csv
mecab-ipadic-2.7.0-20070801/Noun.proper.csv
mecab-ipadic-2.7.0-20070801/Noun.verbal.csv
mecab-ipadic-2.7.0-20070801/Others.csv
mecab-ipadic-2.7.0-20070801/Postp-col.csv
mecab-ipadic-2.7.0-20070801/Postp.csv
mecab-ipadic-2.7.0-20070801/Prefix.csv
mecab-ipadic-2.7.0-20070801/Suffix.csv
mecab-ipadic-2.7.0-20070801/Symbol.csv
mecab-ipadic-2.7.0-20070801/Verb.csv
mecab-ipadic-2.7.0-20070801/char.def
mecab-ipadic-2.7.0-20070801/feature.def
mecab-ipadic-2.7.0-20070801/left-id.def
mecab-ipadic-2.7.0-20070801/matrix.def
mecab-ipadic-2.7.0-20070801/pos-id.def
mecab-ipadic-2.7.0-20070801/rewrite.def
mecab-ipadic-2.7.0-20070801/right-id.def
mecab-ipadic-2.7.0-20070801/unk.def
mecab-ipadic-2.7.0-20070801/dicrc
mecab-ipadic-2.7.0-20070801/RESULT
[make-mecab-ipadic-NEologd] : Configure custom system dictionary on /code/mecab-ipadic-neologd/libexec/../build/mecab-ipadic-2.7.0-20070801-neologd-20200813
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking whether make sets $(MAKE)... yes
checking for working aclocal-1.4... missing
checking for working autoconf... found
checking for working automake-1.4... missing
checking for working autoheader... found
checking for working makeinfo... missing
checking for a BSD-compatible install... /usr/bin/install -c
checking for mecab-config... /usr/bin/mecab-config
configure: creating ./config.status
config.status: creating Makefile
[make-mecab-ipadic-NEologd] : Encode the character encoding of system dictionary resources from EUC_JP to UTF-8
./../../libexec/iconv_euc_to_utf8.sh ./Adnominal.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Postp-col.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Filler.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Others.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.nai.csv
./../../libexec/iconv_euc_to_utf8.sh ./Noun.others.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.verbal.csv
./../../libexec/iconv_euc_to_utf8.sh ./Noun.proper.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Conjunction.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Adj.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Postp.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.number.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.name.csv
./../../libexec/iconv_euc_to_utf8.sh ./Noun.place.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Interjection.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Auxil.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.demonst.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Adverb.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.adverbal.csv
./../../libexec/iconv_euc_to_utf8.sh ./Verb.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Prefix.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Suffix.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Symbol.csv 
./../../libexec/iconv_euc_to_utf8.sh ./Noun.adjv.csv
./../../libexec/iconv_euc_to_utf8.sh ./Noun.org.csv 
rm ./Adnominal.csv 
rm ./Postp-col.csv 
rm ./Filler.csv 
rm ./Others.csv
rm ./Noun.nai.csv 
rm ./Noun.others.csv
rm ./Noun.verbal.csv
rm ./Noun.proper.csv
rm ./Conjunction.csv
rm ./Adj.csv
rm ./Postp.csv
rm ./Noun.number.csv
rm ./Noun.name.csv 
rm ./Noun.place.csv
rm ./Noun.csv
rm ./Interjection.csv
rm ./Auxil.csv
rm ./Noun.demonst.csv
rm ./Adverb.csv
rm ./Noun.adverbal.csv
rm ./Verb.csv 
rm ./Prefix.csv
rm ./Suffix.csv 
rm ./Symbol.csv
rm ./Noun.adjv.csv
rm ./Noun.org.csv
./../../libexec/iconv_euc_to_utf8.sh ./right-id.def 
./../../libexec/iconv_euc_to_utf8.sh ./left-id.def 
./../../libexec/iconv_euc_to_utf8.sh ./feature.def 
./../../libexec/iconv_euc_to_utf8.sh ./unk.def 
./../../libexec/iconv_euc_to_utf8.sh ./rewrite.def 
./../../libexec/iconv_euc_to_utf8.sh ./pos-id.def
./../../libexec/iconv_euc_to_utf8.sh ./matrix.def 
./../../libexec/iconv_euc_to_utf8.sh ./char.def 
rm ./right-id.def 
rm ./left-id.def 
rm ./feature.def
rm ./unk.def
rm ./rewrite.def
rm ./pos-id.def 
rm ./matrix.def
rm ./char.def
mv ./Postp.csv.utf8 ./Postp.csv 
mv ./Noun.org.csv.utf8 ./Noun.org.csv 
mv ./Prefix.csv.utf8 ./Prefix.csv
mv ./Noun.demonst.csv.utf8 ./Noun.demonst.csv
mv ./rewrite.def.utf8 ./rewrite.def
mv ./Others.csv.utf8 ./Others.csv 
mv ./matrix.def.utf8 ./matrix.def
mv ./pos-id.def.utf8 ./pos-id.def
mv ./Noun.others.csv.utf8 ./Noun.others.csv
mv ./Noun.adjv.csv.utf8 ./Noun.adjv.csv 
mv ./Interjection.csv.utf8 ./Interjection.csv
mv ./Adj.csv.utf8 ./Adj.csv
mv ./unk.def.utf8 ./unk.def
mv ./Auxil.csv.utf8 ./Auxil.csv
mv ./Noun.number.csv.utf8 ./Noun.number.csv 
mv ./char.def.utf8 ./char.def
mv ./Conjunction.csv.utf8 ./Conjunction.csv
mv ./feature.def.utf8 ./feature.def
mv ./Filler.csv.utf8 ./Filler.csv
mv ./Symbol.csv.utf8 ./Symbol.csv 
mv ./Postp-col.csv.utf8 ./Postp-col.csv
mv ./Noun.csv.utf8 ./Noun.csv
mv ./Adnominal.csv.utf8 ./Adnominal.csv 
mv ./Adverb.csv.utf8 ./Adverb.csv
mv ./Noun.nai.csv.utf8 ./Noun.nai.csv
mv ./Noun.name.csv.utf8 ./Noun.name.csv
mv ./Noun.adverbal.csv.utf8 ./Noun.adverbal.csv
mv ./Noun.proper.csv.utf8 ./Noun.proper.csv 
mv ./Noun.place.csv.utf8 ./Noun.place.csv
mv ./Suffix.csv.utf8 ./Suffix.csv
mv ./left-id.def.utf8 ./left-id.def
mv ./right-id.def.utf8 ./right-id.def
mv ./Noun.verbal.csv.utf8 ./Noun.verbal.csv
mv ./Verb.csv.utf8 ./Verb.csv 
[make-mecab-ipadic-NEologd] : Fix yomigana field of IPA dictionary
patching file Noun.csv
patching file Noun.place.csv
patching file Verb.csv
patching file Noun.verbal.csv
patching file Noun.name.csv
patching file Noun.adverbal.csv
patching file Noun.csv
patching file Noun.name.csv
patching file Noun.org.csv
patching file Noun.others.csv
patching file Noun.place.csv
patching file Noun.proper.csv
patching file Noun.verbal.csv
patching file Prefix.csv
patching file Suffix.csv
patching file Noun.proper.csv
patching file Noun.csv
patching file Noun.name.csv
patching file Noun.org.csv
patching file Noun.place.csv
patching file Noun.proper.csv
patching file Noun.verbal.csv
patching file Noun.name.csv
patching file Noun.org.csv
patching file Noun.place.csv
patching file Noun.proper.csv
patching file Suffix.csv
patching file Noun.demonst.csv
patching file Noun.csv
patching file Noun.name.csv
[make-mecab-ipadic-NEologd] : Copy user dictionary resource
[make-mecab-ipadic-NEologd] : Install adverb entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-adverb-dict-seed.20150623.csv.xz
[make-mecab-ipadic-NEologd] : Install interjection entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-interjection-dict-seed.20170216.csv.xz
[make-mecab-ipadic-NEologd] : Install noun orthographic variant entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-common-noun-ortho-variant-dict-seed.20170228.csv.xz
[make-mecab-ipadic-NEologd] : Install noun orthographic variant entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-proper-noun-ortho-variant-dict-seed.20161110.csv.xz
[make-mecab-ipadic-NEologd] : Install entries of orthographic variant of a noun used as verb form using /code/mecab-ipadic-neologd/libexec/../seed/neologd-noun-sahen-conn-ortho-variant-dict-seed.20160323.csv.xz
[make-mecab-ipadic-NEologd] : Install frequent adjective orthographic variant entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-adjective-std-dict-seed.20151126.csv.xz
[make-mecab-ipadic-NEologd] : Not install /code/mecab-ipadic-neologd/libexec/../seed/neologd-adjective-exp-dict-seed.20151126.csv.xz
[make-mecab-ipadic-NEologd] :     When you install neologd-adjective-exp-dict-seed.20151126.csv.xz, please set --install_adjective_exp option

[make-mecab-ipadic-NEologd] : Install adjective verb orthographic variant entries using /code/mecab-ipadic-neologd/libexec/../seed/neologd-adjective-verb-dict-seed.20160324.csv.xz
[make-mecab-ipadic-NEologd] : Not install /code/mecab-ipadic-neologd/libexec/../seed/neologd-date-time-infreq-dict-seed.20190415.csv.xz
[make-mecab-ipadic-NEologd] :     When you install neologd-date-time-infreq-dict-seed.20190415.csv.xz, please set --install_infreq_datetime option

[make-mecab-ipadic-NEologd] : Not install /code/mecab-ipadic-neologd/libexec/../seed/neologd-quantity-infreq-dict-seed.20190415.csv.xz
[make-mecab-ipadic-NEologd] :     When you install neologd-quantity-infreq-dict-seed.20190415.csv.xz, please set --install_infreq_quantity option

[make-mecab-ipadic-NEologd] : Install entries of ill formed words using /code/mecab-ipadic-neologd/libexec/../seed/neologd-ill-formed-words-dict-seed.20170127.csv.xz
[make-mecab-ipadic-NEologd] : Re-Index system dictionary
reading ./unk.def ... 40
emitting double-array: 100% |###########################################|
./model.def is not found. skipped.
reading ./Adnominal.csv ... 135
reading ./Postp-col.csv ... 91
reading ./Filler.csv ... 19
reading ./Others.csv ... 2
reading ./Noun.nai.csv ... 42
reading ./neologd-ill-formed-words-dict-seed.20170127.csv ... 60616
reading ./neologd-proper-noun-ortho-variant-dict-seed.20161110.csv ... 138379
reading ./Noun.others.csv ... 153
reading ./Noun.verbal.csv ... 12150
reading ./Noun.proper.csv ... 27493
reading ./Conjunction.csv ... 171
reading ./Adj.csv ... 27210
reading ./neologd-common-noun-ortho-variant-dict-seed.20170228.csv ... 152869
reading ./Postp.csv ... 146
reading ./Noun.number.csv ... 42
reading ./Noun.name.csv ... 34215
reading ./Noun.place.csv ... 73194
reading ./neologd-noun-sahen-conn-ortho-variant-dict-seed.20160323.csv ... 26058
reading ./Symbol.csv ... 208
reading ./neologd-adjective-verb-dict-seed.20160324.csv ... 20268
reading ./Noun.adjv.csv ... 3328
reading ./Noun.org.csv ... 17149
/code/mecab-ipadic-neologd/bin/../libexec/make-mecab-ipadic-neologd.sh: line 525:  6288 Killed
         ${MECAB_LIBEXEC_DIR}/mecab-dict-index -f UTF8 -t UTF8
ERROR: Service 'python-django' failed to build: The command '/bin/sh -c apt-get update -y&&    apt-get upgrade -y&&    apt-get install mecab libmecab-dev mecab-ipadic mecab-ipadic-utf8 sudo -y&&    apt-get install git make curl xz-utils file&&    git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git&&    /code/mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -y &&    mkdir /code/media &&     mkdir /code/static &&    python -m pip install --upgrade pip &&    pip install -r requirements.txt' returned a non-zero code: 137
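
A note on the last line of the log: by shell convention, an exit status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9. Together with the "Killed" line for mecab-dict-index, this often (though not always) points to the kernel out-of-memory killer in a memory-limited container. A minimal sketch for decoding such a status; the helper name is made up for illustration:

```python
import signal


def describe_exit_status(status):
    """Return the signal name for shell-style exit statuses above 128, else None."""
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            # Not a signal number known on this platform.
            return None
    return None


print(describe_exit_status(137))
```

If the signal turns out to be SIGKILL, checking the memory available to the build before changing anything in the dictionary is probably the cheapest next step.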

Make it possible to install mecab-ipadic-NEologd to a user directory without sudo privileges

Currently, I have to set the "--asuser" option to install mecab-ipadic-NEologd to a user directory without sudo privileges.
But I would like mecab-ipadic-NEologd to detect whether sudo privileges are required.

So I will implement the following features:

  • A check that compares the uid of the current user with the uid of the target directory's owner
  • An "assudo" option
    • It is required when I want to install using sudo privileges
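
The uid comparison described above could be sketched as follows. This is only an illustration in Python, not the installer's shell code; the function name `needs_sudo` and the fallback to the nearest existing parent directory are assumptions made for this sketch.

```python
import os


def needs_sudo(target_dir):
    """Guess whether installing into target_dir needs elevated privileges."""
    # The target may not exist yet, so walk up to the nearest existing ancestor.
    path = os.path.abspath(target_dir)
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:
            break
        path = parent
    owner_uid = os.stat(path).st_uid
    # If the current user owns the directory and can write to it, sudo is not needed.
    owned_by_me = owner_uid == os.getuid()
    return not (owned_by_me and os.access(path, os.W_OK))
```

Ownership alone does not cover group-writable directories, so a real implementation would probably rely on the write check rather than on the uid comparison by itself.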

Negative cost

Thanks first for the great database.

Motivation

I find some words in the data are assigned negative costs.

$ cat mecab-ipadic-neologd/build/mecab-ipadic-2.7.0-20070801-neologd-20191111/mecab-user-dict-seed.20191111.csv | grep "ファニチャーロウ"
ファニチャーロウレーシング,1288,1288,-5111,名詞,固有名詞,一般,*,*,*,ファニチャー・ロウ・レーシング,ファニチャーロウレーシング,ファニチャーロウレーシング
ファニチャー・ロウ・レーシング,1288,1288,-9029,名詞,固有名詞,一般,*,*,*,ファニチャー・ロウ・レーシング,ファニチャーロウレーシング,ファニチャーロウレーシング

Costs are lower for more frequent words. But the examples above do not seem frequent enough to be assigned such a low cost. I suspect this could be the result of an integer overflow or a sorting problem.
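
One way to probe the overflow hypothesis is to check whether the cost column stays inside a signed 16-bit range. The sketch below assumes that word costs are stored as signed 16-bit integers in the compiled dictionary (an assumption made here, not something confirmed in this thread) and that the cost is the fourth column of a seed line.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # assumed storage range for word costs


def cost_of(line):
    """Return the cost column (4th field) of a seed CSV line."""
    # A naive split is enough for these lines because no field contains a comma.
    return int(line.split(",")[3])


def out_of_int16(lines):
    """Return the lines whose cost would not fit into a signed 16-bit integer."""
    return [line for line in lines if not INT16_MIN <= cost_of(line) <= INT16_MAX]


samples = [
    "ファニチャーロウレーシング,1288,1288,-5111,名詞,固有名詞,一般,*,*,*,ファニチャー・ロウ・レーシング,ファニチャーロウレーシング,ファニチャーロウレーシング",
    "ファニチャー・ロウ・レーシング,1288,1288,-9029,名詞,固有名詞,一般,*,*,*,ファニチャー・ロウ・レーシング,ファニチャーロウレーシング,ファニチャーロウレーシング",
]
print([cost_of(line) for line in samples])
print(out_of_int16(samples))
```

Both sample costs fall inside that range, so this check alone neither confirms nor rules out an overflow earlier in the pipeline that generates the seed file.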

Goal

I would like to know:
(1) if this is a correct/intended result or a bug
(2) if correct/intended, how negative costs should be interpreted.

Can someone help me with this?

Add Left double quotation mark to Regexp.ja

Motivation

This issue is about Regexp.ja in a wiki.

Replace the following full-width symbols with half-width symbols
/!”#$%&’()*+,−./:;<>?@[¥]^_`{|}

It recommends replacing Right double quotation mark (U+201D) with Quotation mark (U+0022), but not Left double quotation mark (U+201C). I would prefer both the Right and Left double quotation marks to be replaced with Quotation mark in sentences like the one below.

ダブルクォテーションは日本語では“強調”のために使われる。
→ ダブルクォテーションは日本語では"強調"のために使われる。

Sorry if there is a specific reason why Left double quotation mark is not included in the rule.

Goal

My suggestion might look like this.

Replace the following full-width symbols with half-width symbols
!“”#$%&’()*+,−./:;<>?@[¥]^_`{|}

In addition to adding Left double quotation mark (U+201C), I removed the Slash (U+002F) at the head of the line, which is a half-width character. I guess it was included by mistake.
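
The proposed rule can be sketched in Python as a translation table that maps both curly double quotation marks to the plain quotation mark. This is a standalone illustration, not the code from the wiki; the names `QUOTE_MAP` and `normalize_double_quotes` are made up for this sketch.

```python
# Map both LEFT (U+201C) and RIGHT (U+201D) DOUBLE QUOTATION MARK to QUOTATION MARK (U+0022).
QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"'})


def normalize_double_quotes(text):
    """Replace curly double quotation marks with the plain ASCII one."""
    return text.translate(QUOTE_MAP)


print(normalize_double_quotes("日本語では\u201c強調\u201dのために使われる。"))
```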

'三重県' and '群馬県' are parsed as person names

Both 三重県 and 群馬県 are names of prefectures. Other prefectures are correctly analyzed as 名詞-固有名詞-地域-一般.

But these two prefectures are analyzed as 名詞-固有名詞-人名-一般, and I found that they are registered that way in the seed file. As far as I could find, there are no famous people named 三重県 or 群馬県.

I think both words should be analyzed as 名詞-固有名詞-地域-一般.

Result of analysis

茨城県  名詞,固有名詞,地域,一般,*,*,茨城県,イバラキケン,イバラキケン
栃木県  名詞,固有名詞,地域,一般,*,*,栃木県,トチギケン,トチギケン
群馬県  名詞,固有名詞,人名,一般,*,*,群馬県,グンマケン,グンマケン
愛知県  名詞,固有名詞,地域,一般,*,*,愛知県,アイチケン,アイチケン
岐阜県  名詞,固有名詞,地域,一般,*,*,岐阜県,ギフケン,ギフケン
三重県  名詞,固有名詞,人名,一般,*,*,三重県,ミエケン,ミエケン

Seed file

./build/mecab-ipadic-2.7.0-20070801-neologd-20190812/mecab-user-dict-seed.20190812.csv:三重県,1289,1289,-2894,名詞,固有名詞,人名,一般,*,*,三重県,ミエケン,ミエケン
./build/mecab-ipadic-2.7.0-20070801-neologd-20190812/mecab-user-dict-seed.20190812.csv:群馬県,1289,1289,1138,名詞,固有名詞,人名,一般,*,*,群馬県,グンマケン,グンマケン
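
To look for similar cases, a rough filter over the seed file could flag entries whose surface ends in 県 but whose POS subtype is 人名. The sketch below assumes the 13-column layout seen in the lines above (surface first, POS fields starting at the fifth column); the function name is hypothetical.

```python
def prefecture_like_person_names(lines):
    """Return surfaces ending in 県 that are tagged as person names (人名)."""
    hits = []
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) < 13:
            continue  # skip lines that do not match the assumed layout
        surface, pos_subtype2 = fields[0], fields[6]
        if surface.endswith("県") and pos_subtype2 == "人名":
            hits.append(surface)
    return hits


seed_lines = [
    "三重県,1289,1289,-2894,名詞,固有名詞,人名,一般,*,*,三重県,ミエケン,ミエケン",
    "群馬県,1289,1289,1138,名詞,固有名詞,人名,一般,*,*,群馬県,グンマケン,グンマケン",
]
print(prefecture_like_person_names(seed_lines))
```

A suffix test like this will also match real person names that happen to end in 県, so the results would need a manual look.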

hatena keyword doesn't have 16 or higher yomigana characters.

Yomigana taken from Hatena Keyword seem to be cut off at 16 characters, so longer readings are incomplete.

e.g.

  • うごめもしゅうへんのはてなでのも うごメモ周辺のはてなでの問題
  • おしえてはてなだいありーでんごん 教えてはてなダイアリー伝言板
  • しんはてなだいあらーえいがひゃく 真・はてなダイアラー映画百選

Hatena Keyword entries have the proper yomigana when the yomigana is 15 characters or fewer.

Some wrong yomigana/hyouki entries

// hyouki
mecab-user-dict-seed.20160222.csv:387971: ウグイスタケ,1288,1288,-1686,名詞,固有名詞,一般,*,*,*,鶯〓,ウグイスタケ,ウグイスタケ
mecab-user-dict-seed.20160222.csv:388991: ウチダヒャッケン,1288,1288,-5999,名詞,固有名詞,一般,*,*,*,内田百〓@6BE1@,ウチダヒャッケン,ウチダヒャッケン
(+87 "〓" entries)

// yomigana
mecab-user-dict-seed.20160222.csv:268129: けけ,1289,1289,7587,名詞,固有名詞,人名,一般,*,*,けけ,ケケヶ,ケケヶ
mecab-user-dict-seed.20160222.csv:274205: ずヾや株式会社,1288,1288,4587,名詞,固有名詞,一般,*,*,*,ずヾや株式会社,ズヾヤカブシキガイシャ,ズヾヤカブシキガイシャ

"ヶ" and "ヾ" are not good for Japanese yomigana.
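
Entries like these could be found mechanically. The sketch below treats a reading as valid only if it consists of katakana from ァ to ヴ plus the long vowel mark ー; that character set is an assumption made for this illustration, and some legitimate readings may need more characters. It also flags the geta mark 〓 in the written form.

```python
import re

# Assumed set of characters allowed in a reading: ァ..ヴ and the long vowel mark ー.
VALID_YOMI = re.compile(r"^[ァ-ヴー]+$")


def has_valid_yomi(yomi):
    """True if the reading uses only the assumed katakana set."""
    return bool(VALID_YOMI.match(yomi))


def has_geta(text):
    """True if the text contains the geta mark, a placeholder for a missing glyph."""
    return "〓" in text


print(has_valid_yomi("ウグイスタケ"), has_valid_yomi("ケケヶ"), has_geta("鶯〓"))
```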

Cannot Install mecab-python3 (unable to execute swig: no such file in directory)

Motivation

Hello,
I've successfully installed MeCab, mecab-ipadic, and the neological dictionary. However, I cannot install mecab-python3 to get MeCab talking with Python. Each time I've tried, I receive the following error:

unable to execute 'swig': No such file or directory
error: command 'swig' failed with exit status 1

From what I've gathered looking into the issue on Google, it seems to be an issue that resulted from the most recent update. Was wondering if there was a temporary fix until the issue gets resolved?

I ran across this online:

https://qiita.com/siraasagi/items/e07e0b271cb7cd679a70

but as I'm using a Mac, I cannot run apt in the command line. Brew also does not recognize the formulae when I substitute it with apt-get. Any help would be much appreciated!

Thanks for your time!

Goal

Goal is to use MeCab with Python to tokenize some Japanese text for NLP purposes.

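About the BibTeX in README.md

This is about the author field of the papers for the 2017 Association for Natural Language Processing (言語処理学会) and the 2016 Information Processing Society of Japan (情報処理学会).
I believe the name of 橋本泰一 is misspelled as Taiichi Hashimoro.
Taiichi Hashimoto should be correct.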

Pronunciations for 1日間 ~ 10日間 are wrong

The readings for "1日間" through "10日間" are wrong.
The 'カン' for '間' is missing.
And the furigana of "1日間" should be "イチニチカン", not "ツイタチカン".

1日間   名詞,固有名詞,一般,*,*,*,1日間,ツイタチ,ツイタチ
2日間   名詞,固有名詞,一般,*,*,*,2日間,フツカ,フツカ
3日間
4日間
...
10日間  名詞,固有名詞,一般,*,*,*,10日間,トオカ,トオカ

11日間 is correct.

11日間  名詞,固有名詞,一般,*,*,*,11日間,ジュウイチニチカン,ジュウイチニチカン
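
A quick way to list the affected entries is to compare the end of the surface with the end of the reading. The helper below is only a sketch with a made-up name; it covers the missing 'カン' but does not check the ツイタチ / イチニチ distinction.

```python
def reading_missing_kan(surface, yomi):
    """True if a surface ending in 日間 has a reading that does not end in カン."""
    return surface.endswith("日間") and not yomi.endswith("カン")


entries = [
    ("1日間", "ツイタチ"),
    ("2日間", "フツカ"),
    ("10日間", "トオカ"),
    ("11日間", "ジュウイチニチカン"),
]
print([surface for surface, yomi in entries if reading_missing_kan(surface, yomi)])
```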

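Numeric expressions are registered as proper nouns

$100,1288,1288,7806,名詞,固有名詞,一般,*,*,*,$100,ヒャクドル,ヒャクドル
昭和10年,1288,1288,6518,名詞,固有名詞,一般,*,*,*,昭和10年,ショウワジュウネン,ショーワジュウネン
10 years,1288,1288,4569,名詞,固有名詞,一般,*,*,*,10 years,テンイヤーズ,テンイヤーズ

Dictionary entries for numeric expressions like these have the part of speech 固有名詞 (proper noun), but I do not think they are proper nouns.
Could the part of speech be changed to something like 一般 (general)?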

Unnecessary variants for single address

grep -a "愛知県名古屋市南区豊田町" mecab-user-dict-seed.20160225.csv

名古屋市豊田町,1293,1293,-5820,名詞,固有名詞,地域,一般,*,*,愛知県名古屋市南区豊田町,ナゴヤシトヨダチョウ,ナゴヤシトヨダチョー
愛知県南区豊田町,1293,1293,-1981,名詞,固有名詞,地域,一般,*,*,愛知県名古屋市南区豊田町,アイチケンミナミクトヨダチョウ,アイチケンミナミクトヨダチョー
愛知県名古屋市南区豊田町,1293,1293,-19354,名詞,固有名詞,地域,一般,*,*,愛知県名古屋市南区豊田町,アイチケンナゴヤシミナミクトヨダチョウ,アイチケンナゴヤシミナミクトヨダチョー
愛知県名古屋市豊田町,1293,1293,-18608,名詞,固有名詞,地域,一般,*,*,愛知県名古屋市南区豊田町,アイチケンナゴヤシトヨダチョウ,アイチケンナゴヤシトヨダチョー

I think we don't need "名古屋市豊田町" "愛知県南区豊田町" "愛知県名古屋市豊田町".
https://www.google.co.jp/search?q="名古屋市豊田町"
4 results
https://www.google.co.jp/search?q="愛知県南区豊田町"
0 results
https://www.google.co.jp/search?q="愛知県名古屋市豊田町"
0 results

Wide "," is included in 原形 (base form)

$ ag アースウィンドアンドファイアー mecab-user-dict-seed.20200123.csv 
151101:Earth Wind & Fire,1288,1288,4131,名詞,固有名詞,一般,*,*,*,Earth,Wind&Fire,アースウィンドアンドファイアー,アースウィンドアンドファイアー

I think this is better.

- Earth,Wind&Fire
+ Earth, Wind&Fire

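Is there a mistake in normalize_neologd.py?

This is about normalize_neologd.py, which is listed on the Regexp.ja page of the wiki.

s = unicode_normalize('0−9A-Za-z。-゚', s)

In this line, the character between 0 and 9 is MINUS SIGN rather than HYPHEN-MINUS.

I expect normalizing 「10日」 to give 「10日」, but with the current source code it gives 「10日」 (confirmed with Python 2.7.9).

Could you merge the following change?

arosh/mecab-ipadic-neologd-wiki@0e5534d

(I did not know how to send a pull request against the wiki, so I am asking through an issue.)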
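
The effect can be reproduced with a small standalone sketch. The `unicode_normalize` below is a Python 3 reconstruction written for this illustration (apply NFKC only to runs matched by a character class) and is not copied from the wiki. Inside a character class, MINUS SIGN (U+2212) is just a literal character, so the class matches only fullwidth 0, the minus sign and fullwidth 9 instead of the whole digit range.

```python
import re
import unicodedata


def unicode_normalize(char_class, text):
    """Apply NFKC only to runs of characters matched by char_class."""
    pattern = re.compile("([{}]+)".format(char_class))
    parts = re.split(pattern, text)
    return "".join(
        unicodedata.normalize("NFKC", part) if pattern.match(part) else part
        for part in parts
    )


WITH_MINUS_SIGN = "\uff10\u2212\uff19"  # fullwidth 0, MINUS SIGN, fullwidth 9
WITH_HYPHEN_MINUS = "\uff10-\uff19"     # fullwidth 0, HYPHEN-MINUS, fullwidth 9
TEXT = "\uff11\uff10\u65e5"             # fullwidth "10" followed by 日

print(unicode_normalize(WITH_MINUS_SIGN, TEXT))    # only the fullwidth 0 is converted
print(unicode_normalize(WITH_HYPHEN_MINUS, TEXT))  # both digits are converted
```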

Download failed in China

Hi. Thank you for sharing a great dictionary!
Currently, we are using your dictionary for a Japanese text-to-speech system in our project.

Users in China reported that the download fails because the Google Drive service is blocked there.
espnet/espnet#606
Is there any plan to provide another download source for the installation?

Needs `yum install patch` on CentOS 7

The patch command has to be installed before running the installer on a CentOS 7 Minimal installation.

./bin/install-mecab-ipadic-neologd -n
[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : Check the existance of libraries
[install-mecab-ipadic-NEologd] :     find => ok
[install-mecab-ipadic-NEologd] :     sort => ok
[install-mecab-ipadic-NEologd] :     head => ok
[install-mecab-ipadic-NEologd] :     cut => ok
[install-mecab-ipadic-NEologd] :     egrep => ok
[install-mecab-ipadic-NEologd] :     mecab => ok
[install-mecab-ipadic-NEologd] :     mecab-config => ok
[install-mecab-ipadic-NEologd] :     make => ok
[install-mecab-ipadic-NEologd] :     curl => ok
[install-mecab-ipadic-NEologd] :     sed => ok
[install-mecab-ipadic-NEologd] :     cat => ok
[install-mecab-ipadic-NEologd] :     diff => ok
[install-mecab-ipadic-NEologd] :     tar => ok
[install-mecab-ipadic-NEologd] :     unxz => ok
[install-mecab-ipadic-NEologd] :     xargs => ok
[install-mecab-ipadic-NEologd] :     grep => ok
[install-mecab-ipadic-NEologd] :     iconv => ok
which: no patch in ($HOME/perl5/perlbrew/bin:/usr/local/bin:/usr/bin:$HOME/bin:/usr/local/sbin:/usr/sbin)
[install-mecab-ipadic-NEologd] :     patch is not found.

So the install instructions should be rewritten as follows:

$ sudo yum install mecab mecab-devel mecab-ipadic git make curl xz patch
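
The same pre-flight check can be run by hand before starting the installer. The sketch below uses the command names printed in the log above; the function name is made up, and this is not how the installer itself performs the check.

```python
import shutil

# Command names taken from the installer log above, plus patch.
REQUIRED_COMMANDS = [
    "find", "sort", "head", "cut", "egrep", "mecab", "mecab-config", "make",
    "curl", "sed", "cat", "diff", "tar", "unxz", "xargs", "grep", "iconv", "patch",
]


def missing_commands(names):
    """Return the commands from names that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


print(missing_commands(REQUIRED_COMMANDS))
```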

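Specifying the output encoding

I am using the dictionary in a Windows environment (C#, NMeCab), but since the output encoding is UTF-8, I cannot use it without some modification.

Building on Unix is fine for the time being, so it would help if the output encoding could be specified with an installer option.

Reference (my own blog): mecab-ipadic-neologdをNMeCab用にshift-jisでコンパイルした (Compiled mecab-ipadic-neologd as Shift_JIS for NMeCab) - 雲行きそらゆきココロイキ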
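
Until such an option exists, one workaround is to re-encode the source CSV files before compiling. The sketch below is an illustration only: the function name and the choice of `cp932` (Python's codec name for the Windows variant of Shift_JIS) are assumptions, and characters that cannot be represented in the target encoding are replaced with "?", so some entries would be damaged.

```python
def reencode_file(src_path, dst_path, src_encoding="utf-8", dst_encoding="cp932"):
    """Copy a text file while converting its encoding, line by line."""
    with open(src_path, encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, errors="replace", newline="") as dst:
        for line in src:
            # Characters missing from dst_encoding are written as "?".
            dst.write(line)
```

Because neologisms often contain characters outside Shift_JIS, it would be worth counting how many lines change under the conversion before relying on the result.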

"株式会社" should be split.

I think these strings should be split off:
(株), (株), 株式会社

neologd has 5 "あおい電子工業" variants.

  • あおい電子工業 株式会社,1292,1292,-14635,名詞,固有名詞,組織,*,*,*,あおい電子工業株式会社,アオイデンシコウギョウカブシキガイシャ,アオイデンシコーギョーカブシキガイシャ
  • あおい電子工業 (株),1292,1292,-10826,名詞,固有名詞,組織,*,*,*,あおい電子工業株式会社,アオイデンシコウギョウカブシキガイシャ,アオイデンシコーギョーカブシキガイシャ
  • あおい電子工業(株),1292,1292,6301,名詞,固有名詞,組織,*,*,*,あおい電子工業株式会社,アオイデンシコウギョウカブシキガイシャ,アオイデンシコーギョーカブシキガイシャ
  • あおい電子工業株式会社,1292,1292,-9787,名詞,固有名詞,組織,*,*,*,あおい電子工業株式会社,アオイデンシコウギョウカブシキガイシャ,アオイデンシコーギョーカブシキガイシャ
  • あおい電子工業(株),1292,1292,-6382,名詞,固有名詞,組織,*,*,*,あおい電子工業株式会社,アオイデンシコウギョウカブシキガイシャ,アオイデンシコーギョーカブシキガイシャ

I don't think we need 5 variants for "あおい電子工業",
and, more importantly, neologd doesn't have the basic entry "あおい電子工業 アオイデンシコウギョウ".

I think these entries are enough and we can reduce the dictionary size.
あおい電子工業 アオイデンシコウギョウ
株式会社 カブシキガイシャ
(株) カブシキガイシャ
(株) カブシキガイシャ

Regards.

`Android標準ブラウザ` related entries

Motivation

Fix incorrect entries

$ grep 'Android.*ブラウザ' mecab-user-dict-seed.20200709.csv
Android標準ブラウザ,1288,1288,4545,名詞,固有名詞,一般,*,*,*,Android標準ブラウザ,ブラウザ,ブラウザ
Android標準ブラウザー,1288,1288,5229,名詞,固有名詞,一般,*,*,*,Android標準ブラウザ,ブラウザー,ブラウザー
android標準ブラウザ,1288,1288,4545,名詞,固有名詞,一般,*,*,*,Android標準ブラウザ,ブラウザ,ブラウザ
ブラウザ,1288,1288,6395,名詞,固有名詞,一般,*,*,*,Android標準ブラウザ,ブラウザ,ブラウザ

$ mecab -d /usr/lib/mecab/dic/mecab-ipadic-neologd
ブラウザ
ブラウザ        名詞,固有名詞,一般,*,*,*,Android標準ブラウザ,ブラウザ,ブラウザ

ブラウザ is a generic word, but neologd treats it as Android標準ブラウザ :(

Release new version

Motivation

I hope people can install the latest package with fresh data.

I have packaged version 0.0.5 of mecab-ipadic-neologd, which was released on 2016-05-02, for Debian and derived distributions like Ubuntu, and I have also published the packaging files on both Launchpad PPA and Bintray, so people can easily install it with the command apt-get install mecab-ipadic-neologd.

Goal

  • Release new version for updated data

Some entries have wrong yomi and pronunciation

Some entries have wrong yomi and pronunciation.
For example, after building dictionary,

$ cd mecab_ipadic_neologd
$ grep '高橋みなみ,' ./**/*.csv
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:5代目高橋みなみ,1289,1289,4078,名詞,固有名詞,人名,一般,*,*,5代目高橋みなみ,ゴダイメタカハシミナミ,ゴダイメタカハシミナミ
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:ゴダイメタカハシミナミ,1289,1289,-951,名詞,固有名詞,人名,一般,*,*,5代目高橋みなみ,ゴダイメタカハシミナミ,ゴダイメタカハシミナミ
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:高橋みなみ,1289,1289,273,名詞,固有名詞,人名,一般,*,*,高橋みなみ,タカハシミナミ,タカハシミナミ
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:高橋みなみ,1289,1289,273,名詞,固有名詞,人名,一般,*,*,高橋みなみ,タカハシミナミエーケービーフォーティエイト,タカハシミナミエーケービーフォーティエイト
$ grep '日本料理,' ./**/*.csv
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:日本料理,1288,1288,3024,名詞,固有名詞,一般,*,*,*,日本料理,ニホンリョウリ,ニホンリョーリ
./build/mecab-ipadic-2.7.0-20070801-neologd-20170227/mecab-user-dict-seed.20170227.csv:日本料理,1288,1288,3024,名詞,固有名詞,一般,*,*,*,日本料理,ニホンリョウリニッポンリョウリ,ニホンリョーリニッポンリョウリ

Why?
It looks to me that 日本料理 has concatenated yomi and pronunciation.
Why does 高橋みなみ have エーケービーフォーティエイト?

My version is 20170228-01, but older versions have the same issues.

Thanks.

Improper proper nouns

I found that some clauses suffixed with "。" are incorrectly registered as 固有名詞 (proper noun).

$ echo '好きだ。' | mecab -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd
好きだ。	名詞,固有名詞,一般,*,*,*,好きだ。,スキダ,スキダ

Examples are listed below:

  • 好きだ。
  • 元気です。
  • おはよう。
  • あなた。
  • またね。
  • 娘。
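
Entries of this kind can be spotted in MeCab output by checking whether the surface or the base form ends with "。". The sketch assumes the output layout shown above (surface, whitespace, then comma-separated features with the base form as the seventh feature); the function name is made up.

```python
def ends_with_period(mecab_line):
    """True if the surface or the base form of a MeCab output line ends with 。."""
    surface, features = mecab_line.split(None, 1)
    base_form = features.split(",")[6]
    return surface.endswith("。") or base_form.endswith("。")


lines = [
    "好きだ。\t名詞,固有名詞,一般,*,*,*,好きだ。,スキダ,スキダ",
    "茨城県\t名詞,固有名詞,地域,一般,*,*,茨城県,イバラキケン,イバラキケン",
]
print([ends_with_period(line) for line in lines])
```

Some legitimate names do end in "。", so a filter like this only produces candidates for review rather than a list of entries to delete.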

Failed to build lucene-kuromoji because mecab-user-dict-seed.20190930.csv contains an invalid line.

How have you been over the past year?
mecab-user-dict-seed.20190930.csv contains a line in invalid CSV format, as follows.

line 1378761:
マスストランディング,1288,1288,-141,名詞,固有名詞,一般,*,**,,マス・ストランディング,マスストランディング,マスストランディング
マスストランディング,1288,1288,-141,名詞,固有名詞,一般,*,*,*,マス・ストランディング,マスストランディング,マスストランディング
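
A simple validator run over the seed file before a release could catch lines like the first one. The sketch below assumes 13 columns per line, non-empty fields, a single "*" as the only placeholder and integer ids and cost; these rules are inferred from the lines quoted here and are not taken from what lucene-kuromoji actually enforces.

```python
EXPECTED_COLUMNS = 13  # assumption based on the seed lines quoted in this issue


def csv_problems(line):
    """Return a list of human-readable problems found in one seed CSV line."""
    fields = line.rstrip("\n").split(",")
    problems = []
    if len(fields) != EXPECTED_COLUMNS:
        problems.append("expected %d columns, got %d" % (EXPECTED_COLUMNS, len(fields)))
    for index, value in enumerate(fields):
        if value == "":
            problems.append("empty field at column %d" % index)
        elif len(value) > 1 and set(value) == {"*"}:
            problems.append("malformed placeholder %r at column %d" % (value, index))
    # Left id, right id and cost are expected to be integers.
    for index in (1, 2, 3):
        if index < len(fields) and fields[index] != "":
            try:
                int(fields[index])
            except ValueError:
                problems.append("non-integer value at column %d" % index)
    return problems


bad = "マスストランディング,1288,1288,-141,名詞,固有名詞,一般,*,**,,マス・ストランディング,マスストランディング,マスストランディング"
good = "マスストランディング,1288,1288,-141,名詞,固有名詞,一般,*,*,*,マス・ストランディング,マスストランディング,マスストランディング"
print(csv_problems(bad))
print(csv_problems(good))
```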

Morphological analysis result of "夫婦" is wrong

Motivation

I think the morphological analysis result of "夫婦" is wrong.
(build version: mecab-ipadic-2.7.0-20070801-neologd-20190919)

echo "夫婦" | mecab -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/
夫婦    名詞,固有名詞,一般,*,*,*,夫婦。,フウフ,フーフ
  1. The base form (原形) of "夫婦" should be 夫婦, not 夫婦。
  2. The noun subtype (品詞細分類1) of "夫婦" should be 一般, not 固有名詞

Goal

  1. Fix the base form from 夫婦。 to 夫婦
  2. Fix the noun subtype from 固有名詞 to 一般

Could you deal with this issue for me?
