
Sudachi

Sudachi logo


日本語 README

Sudachi is a Japanese morphological analyzer. Morphological analysis consists mainly of the following tasks.

  • Segmentation
  • Part-of-speech tagging
  • Normalization

Tutorial

For a tutorial on installation, please refer to the tutorial page.

For a tutorial on the plugin, please refer to the plugin tutorial page.

For information on building Sudachi from source, or on development in general, see the Development page.

Features

Sudachi has the following features.

  • Multiple-length segmentation
    • You can switch the segmentation mode
    • Extract morphemes and named entities at once
  • Large lexicon
    • Based on UniDic and NEologd
  • Plugins
    • You can change the behavior of each processing step
  • Works closely with the synonym dictionary
    • We will release the synonym dictionary at a later date

Dictionaries

Sudachi has three types of dictionaries.

  • Small: includes only the vocabulary of UniDic
  • Core: includes basic vocabulary (default)
  • Full: includes miscellaneous proper nouns

Click here for pre-built dictionaries. For more details, see SudachiDict.

How to use the small / full dictionary

Run the command-line tool with a configuration string:

$ java -jar sudachi-XX.jar -s '{"systemDict":"system_small.dic"}'

Use on the command line

$ java -jar sudachi-XX.jar [-r conf] [-s json] [-m mode] [-a] [-d] [-f] [-o output] [file...]

Options

  • -r conf specifies the settings file (overrides -s)
  • -s json specifies additional settings in JSON (overrides -r)
  • -p directory specifies the root directory of resources
  • -m {A|B|C} specifies the splitting mode
  • -a also outputs the dictionary form and the reading form
  • -d dumps debug output
  • -o file specifies the output file (default: standard output)
  • -t separates words with spaces
  • -ts separates words with spaces and breaks the line after each sentence
  • -f ignores errors
  • --systemDict file specifies the path to the system dictionary (overrides other settings)
  • --userDict file adds a user dictionary (does not override other settings; each use adds another user dictionary)
  • --format class uses the given class for formatting output instead of the default configuration

Examples

$ echo 東京都へ行く | java -jar target/sudachi.jar
東京都  名詞,固有名詞,地名,一般,*,*     東京都
へ      助詞,格助詞,*,*,*,*     へ
行く    動詞,非自立可能,*,*,五段-カ行,終止形-一般       行く
EOS

$ echo 東京都へ行く | java -jar target/sudachi.jar -a
東京都  名詞,固有名詞,地名,一般,*,*     東京都  東京都  トウキョウト
へ      助詞,格助詞,*,*,*,*     へ      へ      エ
行く    動詞,非自立可能,*,*,五段-カ行,終止形-一般       行く    行く    イク
EOS

$ echo 東京都へ行く | java -jar target/sudachi.jar -m A
東京    名詞,固有名詞,地名,一般,*,*     東京
都      名詞,普通名詞,一般,*,*,*        都
へ      助詞,格助詞,*,*,*,*     へ
行く    動詞,非自立可能,*,*,五段-カ行,終止形-一般       行く
EOS

$ echo 東京都へ行く | java -jar target/sudachi.jar -t
東京都 へ 行く

How to use the API

You can find details in the Javadoc.

To compile an application with the Sudachi API, declare a dependency on Sudachi in your Maven project.

<dependency>
  <groupId>com.worksap.nlp</groupId>
  <artifactId>sudachi</artifactId>
  <version>0.5.3</version>
</dependency>
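The following sketch shows basic API usage (it assumes the default sudachi.json configuration can locate an installed system dictionary; the class and method names follow the published Javadoc, and the output depends on the dictionary in use):

```java
import java.util.List;

import com.worksap.nlp.sudachi.Dictionary;
import com.worksap.nlp.sudachi.DictionaryFactory;
import com.worksap.nlp.sudachi.Morpheme;
import com.worksap.nlp.sudachi.Tokenizer;

public class SudachiExample {
    public static void main(String[] args) throws Exception {
        // Build a Dictionary from the default settings (sudachi.json on the classpath).
        try (Dictionary dictionary = new DictionaryFactory().create()) {
            Tokenizer tokenizer = dictionary.create();

            // Tokenize in C mode (the longest units, including named entities).
            List<Morpheme> morphemes = tokenizer.tokenize(Tokenizer.SplitMode.C, "東京都へ行く");
            for (Morpheme m : morphemes) {
                System.out.println(m.surface() + "\t"
                        + String.join(",", m.partOfSpeech()) + "\t"
                        + m.normalizedForm());
            }
        }
    }
}
```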

The modes of splitting

Sudachi provides three modes of splitting. In A mode, texts are divided into the shortest units, equivalent to UniDic short units. In C mode, Sudachi extracts named entities. In B mode, texts are divided into intermediate units between A and C.

The following are examples with the core dictionary.

A:選挙/管理/委員/会
B:選挙/管理/委員会
C:選挙管理委員会

A:客室/乗務/員
B:客室/乗務員
C:客室乗務員

A:労働/者/協同/組合
B:労働者/協同/組合
C:労働者協同組合

A:機能/性/食品
B:機能性/食品
C:機能性食品

The following are examples with the full dictionary.

A:医薬/品/安全/管理/責任/者
B:医薬品/安全/管理/責任者
C:医薬品安全管理責任者

A:消費/者/安全/調査/委員/会
B:消費者/安全/調査/委員会
C:消費者安全調査委員会

A:さっぽろ/テレビ/塔
B:さっぽろ/テレビ塔
C:さっぽろテレビ塔

A:カンヌ/国際/映画/祭
B:カンヌ/国際/映画祭
C:カンヌ国際映画祭

In full-text search, using the A and B modes together can improve both precision and recall.
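Through the Java API, the split mode is chosen per call. The sketch below compares the three modes on one phrase (it assumes a system dictionary is configured; the exact splits depend on the dictionary in use):

```java
import com.worksap.nlp.sudachi.Dictionary;
import com.worksap.nlp.sudachi.DictionaryFactory;
import com.worksap.nlp.sudachi.Morpheme;
import com.worksap.nlp.sudachi.Tokenizer;

public class SplitModeExample {
    public static void main(String[] args) throws Exception {
        try (Dictionary dictionary = new DictionaryFactory().create()) {
            Tokenizer tokenizer = dictionary.create();
            String text = "選挙管理委員会";

            // Tokenize the same text once per mode (A, B, C).
            for (Tokenizer.SplitMode mode : Tokenizer.SplitMode.values()) {
                StringBuilder line = new StringBuilder(mode.name()).append(": ");
                for (Morpheme m : tokenizer.tokenize(mode, text)) {
                    line.append(m.surface()).append('/');
                }
                System.out.println(line.substring(0, line.length() - 1));
            }
        }
    }
}
```

A morpheme obtained in a coarser mode can also be re-split afterwards with Morpheme#split(Tokenizer.SplitMode).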

Plugins

You can use or make plugins which modify the behavior of Sudachi.

| Type of plugin   | Example                                |
| ---------------- | -------------------------------------- |
| Modify the input | Character normalization                |
| Make OOVs        | Considering script styles              |
| Connect words    | Inhibition, overwriting costs          |
| Modify the path  | Fixing person names, equalizing splits |

Prepared Plugins

We provide the following plugins.

| Type of plugin   | Plugin                                 | Description                          |
| ---------------- | -------------------------------------- | ------------------------------------ |
| Modify the input | Character normalization                | Full/half-width, cases, variants     |
|                  | Normalization of prolonged sound marks | Normalizes "~" and runs of "ー"      |
|                  | Remove yomigana                        | Removes yomigana in parentheses      |
| Make OOVs        | Make one-character OOVs                | Used as the fallback                 |
|                  | MeCab-compatible OOVs                  |                                      |
| Connect words    | Inhibition                             | Specified by part of speech          |
| Modify the path  | Join katakana OOVs                     |                                      |
|                  | Join numerics                          |                                      |
|                  | Equalization of splitting*             | Smooths splits of OOVs and non-OOVs  |
|                  | Normalize numerics                     | Normalizes kanji numerals and scales |
|                  | Estimate person names*                 |                                      |

* Will be released at a later date.

Normalized Form

Sudachi normalizes the following variations.

  • Okurigana
    • e.g. 打込む → 打ち込む
  • Script
    • e.g. かつ丼 → カツ丼
  • Variant
    • e.g. 附属 → 付属
  • Misspelling
    • e.g. シュミレーション → シミュレーション
  • Contracted form
    • e.g. ちゃあ → ては

Character Normalization

DefaultInputTextPlugin normalizes an input text in the following order.

  1. To lower case by Character.toLowerCase()
  2. Unicode normalization by NFKC
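As a rough sketch using only the JDK (this is not Sudachi's actual implementation and ignores the rewrite.def overrides), the two steps amount to:

```java
import java.text.Normalizer;

public class SudachiStyleNormalizer {
    // Step 1: lower-case each code point; Step 2: apply Unicode NFKC normalization.
    static String normalize(String input) {
        StringBuilder lowered = new StringBuilder();
        input.codePoints()
             .map(Character::toLowerCase)
             .forEach(lowered::appendCodePoint);
        return Normalizer.normalize(lowered, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Full-width upper-case letters become half-width lower-case ones.
        System.out.println(normalize("ＡＢＣ"));  // prints "abc"
    }
}
```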

When rewrite.def contains the following kinds of entries, DefaultInputTextPlugin skips the processing above and applies these rules instead.

  • Ignore
# single code point: this character is skipped in character normalization
髙
  • Replace
# rewrite rule: <target> <replacement>
A' Ā

If the number of characters increases as a result of character normalization, Sudachi may output morphemes whose length is 0 in the original input text.

User Dictionary

To create and use your own dictionaries, please refer to docs/user_dict.md.

Comparison with MeCab and Kuromoji

|                            | Sudachi | MeCab     | kuromoji   |
| -------------------------- | ------- | --------- | ---------- |
| Multiple segmentation      | Yes     | No        | Limited ^a |
| Normalization              | Yes     | No        | Limited ^b |
| Joining, correction        | Yes     | No        | Limited ^b |
| Multiple user dictionaries | Yes     | Yes       | No         |
| Memory efficiency          | Good ^c | Poor      | Good       |
| Accuracy                   | Good    | Good      | Good       |
| Speed                      | Good    | Excellent | Good       |

  • ^a: approximation with n-best
  • ^b: with Lucene filters
  • ^c: memory sharing among multiple Java VMs

Future Releases

  • Speeding up
  • Releasing plugins
  • Improving the accuracy
  • Adding more split information
  • Adding more normalized forms
  • Fixing reading forms (pronunciation → furigana)
  • Coordinating segmentation with the synonym dictionary

Licenses

Sudachi

Sudachi by Works Applications Co., Ltd. is licensed under the Apache License, Version 2.0.

Copyright (c) 2017 Works Applications Co., Ltd.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Logo

Sudachi Logo

This logo or a modified version may be used by anyone to refer to the morphological analyzer Sudachi, but does not indicate endorsement by Works Applications Co., Ltd.

Copyright (c) 2017 Works Applications Co., Ltd.

Elasticsearch

We also release a plugin for using Sudachi with Elasticsearch.

Python

An implementation of Sudachi in Python is also available.

Slack

We have a Slack workspace for developers and users to ask questions and discuss a variety of topics.

Citing Sudachi

We have published a paper about Sudachi and its language resources: "Sudachi: a Japanese Tokenizer for Business" (Takaoka et al., LREC 2018).

When citing Sudachi in papers, books, or services, please use the following BibTeX entry:

@InProceedings{TAKAOKA18.8884,
  author = {Kazuma Takaoka and Sorami Hisamoto and Noriko Kawahara and Miho Sakamoto and Yoshitaka Uchida and Yuji Matsumoto},
  title = {Sudachi: a Japanese Tokenizer for Business},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {may},
  date = {7-12},
  location = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {979-10-95546-00-9},
  language = {english}
  }



Issues

[Question] Thread safety

Hi. I want to create a multi-threaded program using Sudachi. Which class instances can be shared between threads? Or should I create separate instances for each thread?

I want to know about the following classes;

  • Dictionary (JapaneseDictionary)
  • Tokenizer (JapaneseTokenizer)

Update: I checked the implementation and found that JapaneseTokenizer mutates its member variables. So, the tokenizer is not thread-safe.

Thanks.
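Based on the finding above, one workable pattern (a sketch, not an official recommendation) is to share the Dictionary across threads and give each thread its own Tokenizer, for example via ThreadLocal. This assumes the default configuration and an available system dictionary:

```java
import com.worksap.nlp.sudachi.Dictionary;
import com.worksap.nlp.sudachi.DictionaryFactory;
import com.worksap.nlp.sudachi.Morpheme;
import com.worksap.nlp.sudachi.Tokenizer;

public class SharedDictionaryExample {
    public static void main(String[] args) throws Exception {
        // The Dictionary can be shared; create it once for the whole program.
        try (Dictionary dictionary = new DictionaryFactory().create()) {
            // Tokenizer instances mutate internal state, so give each thread its own.
            ThreadLocal<Tokenizer> localTokenizer =
                    ThreadLocal.withInitial(dictionary::create);

            Runnable task = () -> {
                Tokenizer tokenizer = localTokenizer.get();
                for (Morpheme m : tokenizer.tokenize(Tokenizer.SplitMode.C, "東京都へ行く")) {
                    System.out.println(m.surface() + "\t" + m.normalizedForm());
                }
            };

            Thread t1 = new Thread(task);
            Thread t2 = new Thread(task);
            t1.start(); t2.start();
            t1.join(); t2.join();
        }
    }
}
```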

potential bug at connection matrix

@kazuma-t

The row and column of the connection matrix look transposed, and I think setConnectCost in GrammarImpl contains a bug.

  1. setConnectCost looks properly used (rightId is right, leftId is left)

    grammar.setConnectCost(leftId, rightId, Grammar.INHIBITED_CONNECTION);

  2. getConnectCost looks improperly used (rightId is left, leftId is right)

    short connectCost = grammar.getConnectCost(lNode.rightId, rNode.leftId);

setConnectCost is not frequently used, so the effect is small.
The usage in 1 and 2 and the shape of the connection matrix should be made consistent.
Of course, the format of the dictionary file must be kept.

DoubleArray fails when key contains negative byte

Hi there. Thank you for your nice implementation.

I have been interested in Double Array Trie implementations for years, and studied a similar dart-clone-java implementation before. That implementation works well.

However, it seems that your implementation can't handle keys containing negative bytes. Example:

    public void testRaw() throws Exception
    {
        byte[][] keys = new byte[3][];
        keys[0] = new byte[]{1};
        keys[1] = new byte[]{1, 2, -1};
        keys[2] = "東京都".getBytes(StandardCharsets.UTF_8);
        DoubleArray doubleArray = new DoubleArray();
        doubleArray.build(keys, new int[]{0, 1, 2}, null);

        for (int i = 0; i < 3; ++i)
        {
            assertEquals(i, doubleArray.exactMatchSearch(keys[i])[0]);
            System.out.printf("Good for %s\n", Arrays.toString(keys[i]));
        }
    }

By the way, your repo is over its data quota.

~ git clone https://github.com/WorksApplications/Sudachi.git
Cloning into 'Sudachi'...
remote: Counting objects: 2353, done.
remote: Compressing objects: 100% (955/955), done.
remote: Total 2353 (delta 754), reused 2307 (delta 753), pack-reused 0
Receiving objects: 100% (2353/2353), 306.04 KiB | 86.00 KiB/s, done.
Resolving deltas: 100% (754/754), done.
Checking connectivity... done.
Downloading dictionary/sudachi_lex.csv (517 MB)
Error downloading object: dictionary/sudachi_lex.csv (acabc71): Smudge error: Error downloading dictionary/sudachi_lex.csv (acabc71c0b63936801fbdaf1749f8b91e7fa8b15532461e05fec9b98f5f474db): batch response: This repository is over its data quota. Purchase more data packs to restore access.

So I can't download the csv file or debug into your com.worksap.nlp.sudachi.dictionary.DictionaryBuilder. Is there any magic inside the DictionaryBuilder?

[Question] What do the part-of-speech values mean?

What are the different elements returned in the part-of-speech field?
For example, tokenizing 太郎 gives:

太郎	名詞,固有名詞,人名,名,*,*	太郎

I want to understand the various parts under:

名詞,固有名詞,人名,名,*,*

Does each of these comma-separated values have a descriptive name?

Sorry if it is mentioned somewhere in the code! I couldn't find it!

And, thank you for making such an awesome package and making it open source!!

Version management policy

Currently the version of Sudachi is 0.1-SNAPSHOT (perhaps major = 0, minor = 1), while the version of elasticsearch-sudachi is 1.0.0-SNAPSHOT (major = 1, minor = 0, patch = 0). It seems that they use different versioning policies. I recommend unifying them.

And if you follow semver 2.0, I recommend changing 1.0.0-SNAPSHOT to 0.1.0-SNAPSHOT. 0.x.y is known as the initial development phase, so you can break backward compatibility.

Ignore brackets and yomigana

We want to analyze 「行(おこな)う」 as 「行う」 because 「おこな」 is the pronunciation (yomigana).

Why not multi sub-module project?

I found a second pom.xml in the elasticsearch directory. Why don't we use a multi-module project?

If there is no clear reason, I will propose a PR to change this project to multi-module. Here are the conditions to do so:

  • use the same <version> for each module (not mandatory but seems possible)
  • move src, licenses and pom.xml into sudachi/src, sudachi/licenses and sudachi/pom.xml
    • the name of the subdirectory can be changed to sudachi-core or whatever you prefer

Add advanced validations for dictionary building

We plan to introduce dictionary build warnings, which will not abort the building of the dictionary, but will report that something was not good.

Warning-producing checks will be optional, but enabled by default.

Proposed list of warnings:

  • Surface forms are not normalized. Words with such surfaces cannot be looked up via the Trie index (this is the current behavior), and these problems seem to appear somewhat frequently with user dictionaries.
  • Word segmentation producing inconsistent splittings. The concatenation of the split surfaces should reproduce the surface of the original word.
  • Non-distinguishable dictionary entries (same left/right connection IDs plus surface). In this case the entry with the highest cost wins; otherwise the last dictionary entry wins. We will remove all other entries from the index.

invalid format at buildLexicon

Thanks for developing this tool. I tried to install Sudachi on a CentOS 6 server. I ran 'mvn package' but ended up with the following message. The files downloaded into './target/dictionary/unidic-mecab-2.1.2_src' seem to be OK. Is there any specific file that I should check?

[INFO] --- iterator-maven-plugin:0.5.1:iterator (build-system-dictionary) @ sudachi ---
[INFO] ------ (core) org.codehaus.mojo:exec-maven-plugin:1.6.0:java
reading the source file...Error: invalid format at line 1
[WARNING]
java.lang.IllegalArgumentException: invalid format
    at com.worksap.nlp.sudachi.dictionary.DictionaryBuilder.buildLexicon (DictionaryBuilder.java:114)
    at com.worksap.nlp.sudachi.dictionary.DictionaryBuilder.build (DictionaryBuilder.java:89)
    at com.worksap.nlp.sudachi.dictionary.DictionaryBuilder.main (DictionaryBuilder.java:432)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:497)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
    at java.lang.Thread.run (Thread.java:745)

Emoji after "。" causes StringIndexOutOfBoundsException

Description

When trying to analyze text containing an emoji after a "。", java.lang.StringIndexOutOfBoundsException is thrown.

Environment

  • Sudachi 0.4.2
  • system_core.dic @ 20200330
  • OpenJDK 8u252

Steps to reproduce

$ echo "。😀" | java -jar ./sudachi-0.4.2.jar
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 3
        at java.lang.String.substring(String.java:1963)
        at com.worksap.nlp.sudachi.UTF8InputText.getSubstring(UTF8InputText.java:82)
        at com.worksap.nlp.sudachi.MeCabOovProviderPlugin.provideOOV(MeCabOovProviderPlugin.java:101)
        at com.worksap.nlp.sudachi.OovProviderPlugin.getOOV(OovProviderPlugin.java:75)
        at com.worksap.nlp.sudachi.JapaneseTokenizer.buildLattice(JapaneseTokenizer.java:220)
        at com.worksap.nlp.sudachi.JapaneseTokenizer.tokenizeSentence(JapaneseTokenizer.java:162)
        at com.worksap.nlp.sudachi.JapaneseTokenizer.tokenizeSentences(JapaneseTokenizer.java:93)
        at com.worksap.nlp.sudachi.SudachiCommandLine.run(SudachiCommandLine.java:66)
        at com.worksap.nlp.sudachi.SudachiCommandLine.main(SudachiCommandLine.java:218)

oovPOS is invalid

I tried to start Sudachi in this way but ended up with the following.

target]$ java -jar sudachi-0.1.1-SNAPSHOT.jar
Exception in thread "main" java.lang.IllegalArgumentException: oovPOS is invalid:????,??,*,*,*,*
	at com.worksap.nlp.sudachi.SimpleOovProviderPlugin.setUp(SimpleOovProviderPlugin.java:63)
	at com.worksap.nlp.sudachi.JapaneseDictionary.<init>(JapaneseDictionary.java:89)
	at com.worksap.nlp.sudachi.JapaneseDictionary.<init>(JapaneseDictionary.java:56)
	at com.worksap.nlp.sudachi.DictionaryFactory.create(DictionaryFactory.java:44)
	at com.worksap.nlp.sudachi.SudachiCommandLine.main(SudachiCommandLine.java:135)

I've checked the content in ./src/main/resources/sudachi.json and the oovProviderPlugin part seems OK.

Words w/o normalization are ignored in user dictionary

The first column of a user dictionary source file is the headword used for the TRIE.
Because input texts are normalized by DefaultInputTextPlugin, the headwords must be normalized in the same way.
Either normalize with DefaultInputTextPlugin when building the dictionary, or add a caution to the documentation.

SentenceDetector splits at whitespace regardless of text length when no break point found

The current behavior:

> echo "東京都 へ行く" | java -jar target/sudachi-0.5.2.jar
東京都  名詞,固有名詞,地名,一般,*,*     東京都
        空白,*,*,*,*,*
EOS
へ      助詞,格助詞,*,*,*,*     へ
行く    動詞,非自立可能,*,*,五段-カ行,終止形-一般       行く
EOS

We want it to return the full text if it is shorter than the buffer size like:

> echo "東京都 へ行く" | java -jar target/sudachi-0.5.2.jar
東京都  名詞,固有名詞,地名,一般,*,*     東京都
        空白,*,*,*,*,*
へ      助詞,格助詞,*,*,*,*     へ
行く    動詞,非自立可能,*,*,五段-カ行,終止形-一般       行く
EOS

User dictionary source file creation from a token- and POS-mapped file

Is there any utility to generate the "user dictionary source file" from a raw file that has sentences together with their tokens and a POS mapping for each token?
I mean, given token and POS mappings, is there an easy way to generate the "user dictionary source file"?

For example, if we have a file as below, or any similar format, can we generate the "user dictionary source file"?

Configuration: Improve Error messages

Right now, if a resource does not exist, Sudachi throws a NullPointerException, which is unhelpful. Instead, it should report the exact configuration string and all the paths that were actually tried.

Adjust lengths of known word and OOV

A known word and an OOV are segmented differently although their word structures are the same.

全国的 名詞,普通名詞,形状詞可能,*,*,* 全国的

間接 名詞,普通名詞,一般,*,*,* 間接
的 接尾辞,形状詞的,*,*,*,*

Adjust them with a PathRewritePlugin.

Empty surface morpheme(s) before output "´" or "…"

SudachiCommandLine outputs empty-surface morphemes before outputting "´" or "…", depending on the following context. The ACUTE ACCENT character seems to cause the buggy behaviour, while the HORIZONTAL ELLIPSIS character is normalized to three DOT characters. The possibility of a morpheme having a zero-length surface should be noted in README.md, I think.

java -jar sudachi-0.1.2-SNAPSHOT.jar -a -m A
´
´	補助記号,一般,*,*,*,*	´	´	キゴウ	0
EOS
´´
	空白,*,*,*,*,*	 	 	キゴウ	0
´	補助記号,一般,*,*,*,*				-1	(OOV)
´	補助記号,一般,*,*,*,*	´	´	キゴウ	0
EOS
´。
	空白,*,*,*,*,*	 	 	キゴウ	0
´	補助記号,一般,*,*,*,*				-1	(OOV)
。	補助記号,句点,*,*,*,*	。	。	キゴウ	0
EOS
´あ
´	補助記号,一般,*,*,*,*	´	´	キゴウ	0
あ	感動詞,一般,*,*,*,*	あっ	あ	ア	0
EOS
…   
	補助記号,句点,*,*,*,*	.	.	キゴウ	0
	補助記号,句点,*,*,*,*	.	.	キゴウ	0
…	補助記号,句点,*,*,*,*	.	.	キゴウ	0
EOS

Remove prolonged sound mark

ご期待くださいーー!!
ご      接頭辞,*,*,*,*,*        御
期待    名詞,普通名詞,サ変可能,*,*,*    期待
くださ  動詞,一般,*,*,五段-サ行,未然形-一般     下す
いーー  感動詞,フィラー,*,*,*,* いー
!       補助記号,句点,*,*,*,*   !
!       補助記号,句点,*,*,*,*   !

Expected result:

ご期待くださいーー!!
ご      接頭辞,*,*,*,*,*        御
期待    名詞,普通名詞,サ変可能,*,*,*    期待
くださいーー  動詞,非自立可能,*,*,五段-ラ行,命令形 下さる
!       補助記号,句点,*,*,*,*   !
!       補助記号,句点,*,*,*,*   !

`抑える` is normalized to `押さえる`.

抑える is normalized to 押さえる. Is this the expected behavior?

  • Sudachi 0.7.0
  • sudachi-dictionary-20220729/system_core.dic
{
    "systemDict" : "/dbfs/FileStore/sudachi/system_core.dic",
    "oovProviderPlugin" : [
        { "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
          "oovPOS" : [ "名詞", "普通名詞", "一般", "*", "*", "*" ]}
    ]
}

Blank reading form in OOV

Currently, unknown (OOV) words all have a blank reading form ("").

Is there any way to make unknown words return their own surface as the reading form?

example:
text: サンドウィッチマン ライブ
output:

 text               POS               Base_form          Reading_form
 サンドウィッチマン  名詞-普通名詞-一般   サンドウィッチマン   
 ライブ            名詞-普通名詞-一般   ライブ             ライブ

expect:

 text               POS               Base_form          Reading_form
 サンドウィッチマン  名詞-普通名詞-一般   サンドウィッチマン   サンドウィッチマン
 ライブ            名詞-普通名詞-一般   ライブ             ライブ

I suggest adding a flag to SimpleOovProviderPlugin so that unknown words return their surface as the reading form:


@Override
public List<LatticeNode> provideOOV(InputText inputText, int offset, boolean hasOtherWords) {
    if (!hasOtherWords) {
        LatticeNode node = createNode();
        node.setParameter(leftId, rightId, cost);
        int length = inputText.getWordCandidateLength(offset);
        String s = inputText.getSubstring(offset, offset + length);
        WordInfo info = new WordInfo(s, (short) length, oovPOSId, s, s, "");
        node.setWordInfo(info);
        return Collections.singletonList(node);
    } else {
        return Collections.emptyList();
    }
}

Configuration: Merging Plugins

For plugin configuration, use the following strategy.

  1. The plugin order in the specified configuration completely overrides the plugin order in the default configuration.
  2. Each plugin's configuration is merged with the default configuration by replacing options from the default configuration.
  3. A class value of "remaining" allows placing the remaining default plugins at a specific point.

Clarify the definition of core and non_core lexicon

At ae2b047, the dictionary source was split into two files, but there is no clear definition of the two or of the difference between them. Could you add one to the README?

In addition, since all the examples of named entities in the README, such as "医薬品安全管理責任者" and "自転車安全整備士", are now included in non_core.csv, the default settings do not reproduce the three-mode differences shown in the README (both B and C end up with "医薬品/安全/管理/責任者"), and users need to specify system_full.dic as the system dictionary. It might be better to note that this requires the full dictionary, or to replace these with examples included in the core dictionary, to avoid confusion.

Adjust order of splitting and normalization for numerical expression

「にぎり寿司三億年」

=== Before rewriting:
0: 0 24 にぎり寿司三億年(1634131) 2 5142 5154 13355

=== After rewriting:
0: 0 9 にぎり(136394) 4 0 0 0
1: 9 15 寿司(700728) 4 0 0 0
2: 15 18 三(316771) 5 0 0 0
3: 18 21 億(434255) 5 0 0 0
4: 21 24 年(772367) 9 0 0 0

===
にぎり 名詞,普通名詞,一般,*,*,* 握り にぎり ニギリ 0
寿司 名詞,普通名詞,一般,*,*,* 寿司 寿司 ズシ 0
三 名詞,数詞,*,*,*,* 三 三 サン 0
億 名詞,数詞,*,*,*,* 億 億 オク 0
年 名詞,普通名詞,助数詞可能,*,*,* 年 年 ネン 0

`1人` is tokenized into a single token while `6人` is not.

新任1人を含む6人の取締役の選任など3議案を原案通りに可決した

Could someone tell me why 1人 in the sentence above is tokenized as the single token 1人, while 6人 is split into 6 and 人?

  • Sudachi: 0.7.0
  • sudachi-dictionary-20220729/system_core.dic

output

// surface: 新任, normalized: 新任, part of speech: 名詞,普通名詞,サ変可能,*,*,*, read: シンニン
// surface: 1人, normalized: 一人, part of speech: 名詞,普通名詞,副詞可能,*,*,*, read: ヒトリ
// surface: を, normalized: を, part of speech: 助詞,格助詞,*,*,*,*,  read: ヲ
// surface: 含む, normalized: 含む, part of speech: 動詞,一般,*,*,五段-マ行,連体形-一般,  read: フクム
// surface: 6, normalized: 6, part of speech: 名詞,数詞,*,*,*,*,  read: ロク
// surface: 人, normalized: 人, part of speech: 接尾辞,名詞的,一般,*,*,*, read: ニン
// ...

Node missing problem

In sudachi-0.1.1-SNAPSHOT, this problem may occur in some cases where the input sentence contains successive alphabets.

To reproduce

$ echo "阿qd" | java -jar sudachi-0.1.1-SNAPSHOT.jar -d
=== Input dump:
阿qd
=== Lattice dump:
0: 5 5 (null)(0) BOS/EOS 0 0 0: 
1: 0 4 阿Q(1441792) 名詞,固有名詞,人名,一般,*,* 4788 4788 9443: 1 
2: 0 0 (null)(0) BOS/EOS 0 0 0: 0 
Exception in thread "main" java.lang.IllegalStateException: EOS isn't connected to BOS
	at com.worksap.nlp.sudachi.LatticeImpl.getBestPath(LatticeImpl.java:150)
	at com.worksap.nlp.sudachi.JapaneseTokenizer.tokenize(JapaneseTokenizer.java:81)
	at com.worksap.nlp.sudachi.SudachiCommandLine.run(SudachiCommandLine.java:43)
	at com.worksap.nlp.sudachi.SudachiCommandLine.main(SudachiCommandLine.java:148)

In this example, it seems that the node containing the character "d" wasn't generated, so the path cannot reach the EOS node.

Replace System.err & System.out with proper logger API

Current implementation uses System.out and System.err to output debug information, but it should be avoided in production code. Because using stdout and/or stderr directly makes system maintenance difficult, it doesn't support filtering base on log level nor class. It is also hard to apply log rotation and other log management mechanism.

I cannot judge which logger API we should use; it is probably one of the following:

  1. SLF4J API, which is common in the Java ecosystem
  2. JUL (java.util.logging) API, which is standard in Java but a little hard to maintain
  3. Log4j 2, which is used by the Elasticsearch core
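As a minimal sketch of option 2 (JUL, which needs no extra dependency), the class name and messages below are made up for illustration; they show how a direct System.err call could be routed through a logger so handlers can filter by level:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LoggingSketch {
    // Hypothetical class; in Sudachi this would be e.g. the tokenizer class.
    private static final Logger LOG = Logger.getLogger(LoggingSketch.class.getName());

    public static void main(String[] args) {
        // Instead of System.err.println(...): goes through the logging pipeline.
        LOG.log(Level.WARNING, "could not connect EOS to BOS");
        // Debug dumps previously printed with System.out can use a finer level,
        // which default handlers suppress in production:
        LOG.log(Level.FINE, "lattice dump: {0}", "...");
    }
}
```

Level filtering and output destinations can then be controlled externally via `logging.properties` instead of code changes.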

This repository is over its data quota. Purchase more data packs to restore access.

Thank you for publishing a great tool.

I got this error while trying git lfs pull.

$ git lfs pull
Git LFS: (0 of 1 files) 0 B / 493.43 MB
batch response: This repository is over its data quota. Purchase more data packs to restore access.
error: failed to fetch some objects from 'https://github.com/WorksApplications/Sudachi.git/info/lfs'

Remove EditConnectionCost plugin

This functionality will be removed in 1.0 as nobody seems to be using it.

Please comment here if you are actually using it and do not want to have it removed.

Invalid space with voiced/semi-voiced sound mark

$ echo "愛゛の゛ム゛チ゛" | java -jar sudachi-0.1.1-SNAPSHOT.jar
愛      名詞,普通名詞,一般,*,*,*        愛
        空白,*,*,*,*,*
゛の    名詞,普通名詞,一般,*,*,*        ゙の
゛      補助記号,一般,*,*,*,*   ゛
ム      接頭辞,*,*,*,*,*        ム
゛      補助記号,一般,*,*,*,*   ゛
チ゛    名詞,普通名詞,一般,*,*,*        ヂ
EOS

Unexpected char 65.533

I don't know whether this is an error on my part or whether the dictionary files aren't working. If it is my fault, I'm sorry; I didn't understand how dictionaries work, and I thought I only had to load them to use the tokenizer.
I am trying to use Sudachi in my program to generate phrases in Anki for my studies.

When reading the "system_small.dic" or "system_core.dic" files, I get the following message in Eclipse. I downloaded the files found on the home page, and a test from the command line reports the same error, as shown in the image.

sudachi-dictionary-20200330-small.zip
sudachi-dictionary-20200330-core.zip

CMD: (screenshot showing the same error; image not included)

Eclipse:

Exception in thread "JavaFX Application Thread" java.lang.IllegalArgumentException: javax.json.stream.JsonParsingException: Unexpected char 65.533 at (line no=1, column no=1, offset=0)
at [email protected]/com.worksap.nlp.sudachi.Settings.parseSettings(Settings.java:115)
at [email protected]/com.worksap.nlp.sudachi.JapaneseDictionary.buildSettings(JapaneseDictionary.java:97)
at [email protected]/com.worksap.nlp.sudachi.JapaneseDictionary.(JapaneseDictionary.java:52)
at [email protected]/com.worksap.nlp.sudachi.JapaneseDictionary.(JapaneseDictionary.java:48)
at [email protected]/com.worksap.nlp.sudachi.DictionaryFactory.create(DictionaryFactory.java:47)
at TextosJapones/org.jisho.textosJapones.controller.TelaProcessarFrasesController.processaTexto(TelaProcessarFrasesController.java:217)
at TextosJapones/org.jisho.textosJapones.controller.TelaProcessarFrasesController.lambda$4(TelaProcessarFrasesController.java:454)
at javafx.base/com.sun.javafx.binding.ExpressionHelper$Generic.fireValueChangedEvent(ExpressionHelper.java:360)
at javafx.base/com.sun.javafx.binding.ExpressionHelper.fireValueChangedEvent(ExpressionHelper.java:80)
at javafx.base/javafx.beans.property.ReadOnlyBooleanPropertyBase.fireValueChangedEvent(ReadOnlyBooleanPropertyBase.java:72)
at javafx.graphics/javafx.scene.Node$FocusedProperty.notifyListeners(Node.java:8159)
at javafx.graphics/javafx.scene.Scene$12.invalidated(Scene.java:2158)
at javafx.base/javafx.beans.property.ObjectPropertyBase.markInvalid(ObjectPropertyBase.java:112)
at javafx.base/javafx.beans.property.ObjectPropertyBase.set(ObjectPropertyBase.java:147)
at javafx.graphics/javafx.scene.Scene$KeyHandler.setFocusOwner(Scene.java:4030)
at javafx.graphics/javafx.scene.Scene$KeyHandler.requestFocus(Scene.java:4077)
at javafx.graphics/javafx.scene.Scene.requestFocus(Scene.java:2125)
at javafx.graphics/javafx.scene.Node.requestFocus(Node.java:8320)
at javafx.controls/com.sun.javafx.scene.control.behavior.TextAreaBehavior.mousePressed(TextAreaBehavior.java:264)
at javafx.controls/javafx.scene.control.skin.TextAreaSkin$ContentView.lambda$new$0(TextAreaSkin.java:1201)
at javafx.base/com.sun.javafx.event.CompositeEventHandler$NormalEventHandlerRecord.handleBubblingEvent(CompositeEventHandler.java:218)
at javafx.base/com.sun.javafx.event.CompositeEventHandler.dispatchBubblingEvent(CompositeEventHandler.java:80)
at javafx.base/com.sun.javafx.event.EventHandlerManager.dispatchBubblingEvent(EventHandlerManager.java:238)
at javafx.base/com.sun.javafx.event.EventHandlerManager.dispatchBubblingEvent(EventHandlerManager.java:191)
at javafx.base/com.sun.javafx.event.CompositeEventDispatcher.dispatchBubblingEvent(CompositeEventDispatcher.java:59)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:58)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.EventUtil.fireEventImpl(EventUtil.java:74)
at javafx.base/com.sun.javafx.event.EventUtil.fireEvent(EventUtil.java:54)
at javafx.base/javafx.event.Event.fireEvent(Event.java:198)
at javafx.graphics/javafx.scene.Scene$MouseHandler.process(Scene.java:3862)
at javafx.graphics/javafx.scene.Scene.processMouseEvent(Scene.java:1849)
at javafx.graphics/javafx.scene.Scene$ScenePeerListener.mouseEvent(Scene.java:2590)
at javafx.graphics/com.sun.javafx.tk.quantum.GlassViewEventHandler$MouseEventNotification.run(GlassViewEventHandler.java:409)
at javafx.graphics/com.sun.javafx.tk.quantum.GlassViewEventHandler$MouseEventNotification.run(GlassViewEventHandler.java:299)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at javafx.graphics/com.sun.javafx.tk.quantum.GlassViewEventHandler.lambda$handleMouseEvent$2(GlassViewEventHandler.java:447)
at javafx.graphics/com.sun.javafx.tk.quantum.QuantumToolkit.runWithoutRenderLock(QuantumToolkit.java:411)
at javafx.graphics/com.sun.javafx.tk.quantum.GlassViewEventHandler.handleMouseEvent(GlassViewEventHandler.java:446)
at javafx.graphics/com.sun.glass.ui.View.handleMouseEvent(View.java:556)
at javafx.graphics/com.sun.glass.ui.View.notifyMouse(View.java:942)
at javafx.graphics/com.sun.glass.ui.win.WinApplication._runLoop(Native Method)
at javafx.graphics/com.sun.glass.ui.win.WinApplication.lambda$runLoop$3(WinApplication.java:174)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: javax.json.stream.JsonParsingException: Unexpected char 65.533 at (line no=1, column no=1, offset=0)
at org.glassfish.json.JsonTokenizer.unexpectedChar(JsonTokenizer.java:601)
at org.glassfish.json.JsonTokenizer.nextToken(JsonTokenizer.java:418)
at org.glassfish.json.JsonParserImpl$NoneContext.getNextEvent(JsonParserImpl.java:413)
at org.glassfish.json.JsonParserImpl.next(JsonParserImpl.java:363)
at org.glassfish.json.JsonReaderImpl.read(JsonReaderImpl.java:90)
at [email protected]/com.worksap.nlp.sudachi.Settings.parseSettings(Settings.java:106)
... 57 more

Overflow can occur when writing headwordLength to the system dictionary.

There is a headword line longer than 255 bytes in core_lex.csv.

The latest core_lex.csv: line 49605

あなたの幸せが私の幸せ世の為人の為人類幸福繋がり創造即ち我らの使命なり今まさに変革の時ここに熱き魂と愛と情鉄の勇気と利他の精神を持つ者が結集せり日々感謝喜び笑顔繋がりを確かな一歩とし地球の永続を約束する公益の志溢れる我らの足跡に歴史の花が咲くいざゆかん浪漫輝く航海へ

This headword is 399 bytes in UTF-8.

The length is stored in a short (2 bytes) when reading from the CSV:

(short)cols[0].getBytes(StandardCharsets.UTF_8).length,

But it is cast to a byte (1 byte) when writing to the dictionary, so overflow can occur.
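The overflow can be reproduced in isolation. The sketch below is a standalone illustration, not Sudachi's actual writer code: it stores a 399-byte UTF-8 length in a short and then narrows it to a byte, as the dictionary writer effectively does.

```java
import java.nio.charset.StandardCharsets;

public class HeadwordLengthOverflow {
    public static void main(String[] args) {
        // A 399-byte UTF-8 headword, like the one at line 49605 of core_lex.csv
        // (each of these kana characters is 3 bytes in UTF-8).
        String headword = "あ".repeat(133);
        short length = (short) headword.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(length);    // 399: fits in a short
        byte narrowed = (byte) length; // narrowing keeps only the low 8 bits
        System.out.println(narrowed);  // -113: 399 & 0xFF == 0x8F, read as a signed byte
    }
}
```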

Multiple blanks parsed as multiple words

related explosion/spaCy#3756 (comment)

echo "東京都  へ 行く" | java -jar target/sudachi-0.3.0.jar
東京都	名詞,固有名詞,地名,一般,*,*	東京都
 	空白,*,*,*,*,*	 
 	空白,*,*,*,*,*	 
へ	助詞,格助詞,*,*,*,*	へ
 	空白,*,*,*,*,*	 
行く	動詞,非自立可能,*,*,五段-カ行,終止形-一般	行く

Is this the expected result? Multiple blanks are parsed as multiple 空白,*,*,*,*,* tokens.

Dictionary file loading support from jar

If the dictionary file is kept inside a resource jar, then MMap.class cannot read the dictionary file.

So can this be modified to support BinaryDictionary.class.getClassLoader().getResourceAsStream("system_core.dic")?
Then from the InputStream we can get a ByteBuffer using logic like the following:
ByteBuffer byteBuffer = ByteBuffer.allocate(initialStream.available());
ReadableByteChannel channel = Channels.newChannel(initialStream);
IOUtils.readFully(channel, byteBuffer);
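A dependency-free variant of that idea is sketched below; it is an illustration, not Sudachi's API. Note that InputStream.available() is not a reliable total size (it only reports bytes readable without blocking), so this version copies in a loop instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

public class ResourceToByteBuffer {
    // Reads an InputStream (e.g. from getResourceAsStream) fully into a ByteBuffer.
    static ByteBuffer readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return ByteBuffer.wrap(out.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for BinaryDictionary.class.getClassLoader()
        //     .getResourceAsStream("system_core.dic")
        InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3, 4});
        ByteBuffer buf = readFully(in);
        System.out.println(buf.remaining()); // 4
    }
}
```

For dictionary-sized files (hundreds of MB) this buffers the whole content on the heap, which is the main trade-off versus memory-mapping.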
