
sudachi's Issues

`抑える` is normalized to `押さえる`.

抑える is normalized to 押さえる. Is this expected behavior?

  • Sudachi 0.7.0
  • sudachi-dictionary-20220729/system_core.dic
{
    "systemDict" : "/dbfs/FileStore/sudachi/system_core.dic",
    "oovProviderPlugin" : [
        { "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
          "oovPOS" : [ "名詞", "普通名詞", "一般", "*", "*", "*" ]}
    ]
}

Why not a multi-module project?

I found a second pom.xml in the elasticsearch directory. Why don't we use a multi-module project?

If there is no clear reason, I will propose a PR to change this project to a multi-module layout. Here are the conditions for doing so:

  • use the same <version> for each module (not mandatory, but it seems possible)
  • move src, licenses and pom.xml into sudachi/src, sudachi/licenses and sudachi/pom.xml
    • the name of the sub-directory can be changed to sudachi-core or whatever you prefer

Clarify the definition of core and non_core lexicon

At ae2b047, the dictionary source was split into two files, but there is no clear definition of each file or of the difference between them. Could you add one to the README?

In addition, since all the named-entity examples in the README, such as "医薬品安全管理責任者" and "自転車安全整備士", are now included in non_core.csv, the default settings no longer demonstrate the three-mode difference described in the README (both B and C end up with "医薬品/安全/管理/責任者"), and users need to specify system_full.dic as the system dictionary. It might be better to document that these examples require the full dictionary, or to replace them with examples included in the core dictionary, to avoid confusion.

Overflow can occur when writing headwordLength to the system dictionary.

There is a headword longer than 255 bytes in core_lex.csv.

The latest core_lex.csv: line 49605

あなたの幸せが私の幸せ世の為人の為人類幸福繋がり創造即ち我らの使命なり今まさに変革の時ここに熱き魂と愛と情鉄の勇気と利他の精神を持つ者が結集せり日々感謝喜び笑顔繋がりを確かな一歩とし地球の永続を約束する公益の志溢れる我らの足跡に歴史の花が咲くいざゆかん浪漫輝く航海へ

This headword is 399 bytes in UTF-8.

The length is stored in a short (2 bytes) when reading from the CSV:

(short)cols[0].getBytes(StandardCharsets.UTF_8).length,

But it is cast to a byte (1 byte) when writing the dictionary, so an overflow can occur.
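A minimal, self-contained illustration of the narrowing cast (the 399-byte length is taken from this report; the class and variable names are made up):

```java
public class HeadwordLengthOverflow {
    public static void main(String[] args) {
        // The headword above is 399 bytes in UTF-8; it fits in a short...
        short headwordLength = 399;
        // ...but the narrowing cast keeps only the low 8 bits, producing a
        // negative, meaningless length when written to the dictionary.
        byte written = (byte) headwordLength;
        System.out.println(written);
    }
}
```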

Node missing problem

In sudachi-0.1.1-SNAPSHOT, this problem may occur in some cases where the input sentence contains consecutive alphabetic characters.

To reproduce

$ echo "阿qd" | java -jar sudachi-0.1.1-SNAPSHOT.jar -d
=== Input dump:
阿qd
=== Lattice dump:
0: 5 5 (null)(0) BOS/EOS 0 0 0: 
1: 0 4 阿Q(1441792) 名詞,固有名詞,人名,一般,*,* 4788 4788 9443: 1 
2: 0 0 (null)(0) BOS/EOS 0 0 0: 0 
Exception in thread "main" java.lang.IllegalStateException: EOS isn't connected to BOS
	at com.worksap.nlp.sudachi.LatticeImpl.getBestPath(LatticeImpl.java:150)
	at com.worksap.nlp.sudachi.JapaneseTokenizer.tokenize(JapaneseTokenizer.java:81)
	at com.worksap.nlp.sudachi.SudachiCommandLine.run(SudachiCommandLine.java:43)
	at com.worksap.nlp.sudachi.SudachiCommandLine.main(SudachiCommandLine.java:148)

In this example, it seems that the node containing the character "d" was not generated, so the EOS node cannot be reached.

Adjust order of splitting and normalization for numerical expression

「にぎり寿司三億年」

=== Before rewriting:
0: 0 24 にぎり寿司三億年(1634131) 2 5142 5154 13355

=== After rewriting:
0: 0 9 にぎり(136394) 4 0 0 0
1: 9 15 寿司(700728) 4 0 0 0
2: 15 18 三(316771) 5 0 0 0
3: 18 21 億(434255) 5 0 0 0
4: 21 24 年(772367) 9 0 0 0

===
にぎり 名詞,普通名詞,一般,*,*,* 握り にぎり ニギリ 0
寿司 名詞,普通名詞,一般,*,*,* 寿司 寿司 ズシ 0
三 名詞,数詞,*,*,*,* 三 三 サン 0
億 名詞,数詞,*,*,*,* 億 億 オク 0
年 名詞,普通名詞,助数詞可能,*,*,* 年 年 ネン 0

Empty-surface morpheme(s) output before "´" or "…"

SudachiCommandLine outputs morpheme(s) with an empty surface before "´" or "…", depending on the following context. The ACUTE ACCENT character seems to cause the buggy behaviour, while the HORIZONTAL ELLIPSIS character is normalized to three DOT characters. I think the possibility of a morpheme having a zero-length surface should be noted in README.md.

java -jar sudachi-0.1.2-SNAPSHOT.jar -a -m A
´
´	補助記号,一般,*,*,*,*	´	´	キゴウ	0
EOS
´´
	空白,*,*,*,*,*	 	 	キゴウ	0
´	補助記号,一般,*,*,*,*				-1	(OOV)
´	補助記号,一般,*,*,*,*	´	´	キゴウ	0
EOS
´。
	空白,*,*,*,*,*	 	 	キゴウ	0
´	補助記号,一般,*,*,*,*				-1	(OOV)
。	補助記号,句点,*,*,*,*	。	。	キゴウ	0
EOS
´あ
´	補助記号,一般,*,*,*,*	´	´	キゴウ	0
あ	感動詞,一般,*,*,*,*	あっ	あ	ア	0
EOS
…   
	補助記号,句点,*,*,*,*	.	.	キゴウ	0
	補助記号,句点,*,*,*,*	.	.	キゴウ	0
…	補助記号,句点,*,*,*,*	.	.	キゴウ	0
EOS

invalid format at buildLexicon

Thanks for developing this tool. I tried to install Sudachi on a CentOS 6 server. I ran 'mvn package' but ended up with the following message. The files downloaded to './target/dictionary/unidic-mecab-2.1.2_src' seem to be OK. Is there any specific file that I should check?

[INFO] --- iterator-maven-plugin:0.5.1:iterator (build-system-dictionary) @ sudachi ---
[INFO] ------ (core) org.codehaus.mojo:exec-maven-plugin:1.6.0:java
reading the source file...Error: invalid format at line 1
[WARNING]
java.lang.IllegalArgumentException: invalid format
    at com.worksap.nlp.sudachi.dictionary.DictionaryBuilder.buildLexicon (DictionaryBuilder.java:114)
    at com.worksap.nlp.sudachi.dictionary.DictionaryBuilder.build (DictionaryBuilder.java:89)
    at com.worksap.nlp.sudachi.dictionary.DictionaryBuilder.main (DictionaryBuilder.java:432)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:497)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
    at java.lang.Thread.run (Thread.java:745)

Replace System.err & System.out with proper logger API

The current implementation uses System.out and System.err to output debug information, but this should be avoided in production code. Using stdout and/or stderr directly makes system maintenance difficult: it supports filtering neither by log level nor by class, and it is hard to apply log rotation or other log-management mechanisms.

I cannot judge which logger API we should use; it is probably one of the following:

  1. SLF4J API, which is common in Java ecosystem
  2. JUL (java.util.logging) API, which is standard in Java but a little hard to maintain
  3. Log4j 2, which is used by the Elasticsearch core
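For illustration, a minimal sketch of option 2 (JUL) showing per-logger level filtering, which raw System.err cannot provide. The logger name and messages are made up, and the collecting handler is only there to make the filtered output visible:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class LoggerSketch {
    public static void main(String[] args) {
        Logger log = Logger.getLogger("com.worksap.nlp.sudachi.demo");
        log.setUseParentHandlers(false); // do not also write to stderr
        List<String> lines = new ArrayList<>();
        log.addHandler(new Handler() {   // collect records instead of printing directly
            @Override public void publish(LogRecord r) {
                lines.add(r.getLevel() + ": " + r.getMessage());
            }
            @Override public void flush() { }
            @Override public void close() { }
        });
        log.setLevel(Level.WARNING);              // filter by level, per logger
        log.info("connection matrix loaded");     // suppressed: below WARNING
        log.warning("user dictionary not found"); // kept
        lines.forEach(System.out::println);
    }
}
```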

Configuration: Improve Error messages

Right now, if a resource does not exist, Sudachi throws a NullPointerException, which is unhelpful. Instead, it should report the exact configuration string and all of the paths that were actually tried.

Remove prolonged sound mark

ご期待くださいーー!!
ご      接頭辞,*,*,*,*,*        御
期待    名詞,普通名詞,サ変可能,*,*,*    期待
くださ  動詞,一般,*,*,五段-サ行,未然形-一般     下す
いーー  感動詞,フィラー,*,*,*,* いー
!       補助記号,句点,*,*,*,*   !
!       補助記号,句点,*,*,*,*   !

Expected result:

ご期待くださいーー!!
ご      接頭辞,*,*,*,*,*        御
期待    名詞,普通名詞,サ変可能,*,*,*    期待
くださいーー  動詞,非自立可能,*,*,五段-ラ行,命令形 下さる
!       補助記号,句点,*,*,*,*   !
!       補助記号,句点,*,*,*,*   !

DoubleArray fails when key contains negative byte

Hi, there. Thank you for your nice implementation.

I have been interested in Double-Array Trie implementations for years, and studied a similar dart-clone-java implementation before. That implementation works well.

But it seems that your implementation can't handle keys containing negative bytes. Example:

    // Assumed imports: java.nio.charset.StandardCharsets, java.util.Arrays,
    // com.worksap.nlp.dartsclone.DoubleArray, and JUnit's assertEquals.
    public void testRaw() throws Exception {
        byte[][] keys = new byte[3][];
        keys[0] = new byte[] { 1 };
        keys[1] = new byte[] { 1, 2, -1 }; // key containing a negative byte
        keys[2] = "東京都".getBytes(StandardCharsets.UTF_8);
        DoubleArray doubleArray = new DoubleArray();
        doubleArray.build(keys, new int[] { 0, 1, 2 }, null);

        for (int i = 0; i < 3; ++i) {
            assertEquals(i, doubleArray.exactMatchSearch(keys[i])[0]);
            System.out.printf("Good for %s%n", Arrays.toString(keys[i]));
        }
    }
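A likely cause is that Java's byte is signed, so a raw byte used as an array index can be negative; masking with & 0xFF treats it as unsigned. A self-contained illustration (not Sudachi code):

```java
public class SignedByteDemo {
    public static void main(String[] args) {
        byte[] key = { 1, 2, -1 };
        for (byte b : key) {
            int signed = b;          // used directly as an index, -1 is out of bounds
            int unsigned = b & 0xFF; // masked, -1 becomes 255, a valid index 0..255
            System.out.println(signed + " -> " + unsigned);
        }
    }
}
```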

By the way, your repo is over its data quota.

~ git clone https://github.com/WorksApplications/Sudachi.git
Cloning into 'Sudachi'...
remote: Counting objects: 2353, done.
remote: Compressing objects: 100% (955/955), done.
remote: Total 2353 (delta 754), reused 2307 (delta 753), pack-reused 0
Receiving objects: 100% (2353/2353), 306.04 KiB | 86.00 KiB/s, done.
Resolving deltas: 100% (754/754), done.
Checking connectivity... done.
Downloading dictionary/sudachi_lex.csv (517 MB)
Error downloading object: dictionary/sudachi_lex.csv (acabc71): Smudge error: Error downloading dictionary/sudachi_lex.csv (acabc71c0b63936801fbdaf1749f8b91e7fa8b15532461e05fec9b98f5f474db): batch response: This repository is over its data quota. Purchase more data packs to restore access.

So I can't download the csv file or debug into your com.worksap.nlp.sudachi.dictionary.DictionaryBuilder. Is there any magic inside the DictionaryBuilder?

oovPOS is invalid

I tried to start Sudachi in this way but ended up with the following.

target]$ java -jar sudachi-0.1.1-SNAPSHOT.jar
Exception in thread "main" java.lang.IllegalArgumentException: oovPOS is invalid:????,??,*,*,*,*
	at com.worksap.nlp.sudachi.SimpleOovProviderPlugin.setUp(SimpleOovProviderPlugin.java:63)
	at com.worksap.nlp.sudachi.JapaneseDictionary.<init>(JapaneseDictionary.java:89)
	at com.worksap.nlp.sudachi.JapaneseDictionary.<init>(JapaneseDictionary.java:56)
	at com.worksap.nlp.sudachi.DictionaryFactory.create(DictionaryFactory.java:44)
	at com.worksap.nlp.sudachi.SudachiCommandLine.main(SudachiCommandLine.java:135)

I've checked the content in ./src/main/resources/sudachi.json and the oovProviderPlugin part seems OK.

Emoji after "。" causes StringIndexOutOfBoundsException

Description

When trying to analyze text containing an emoji after a "。", java.lang.StringIndexOutOfBoundsException is thrown.

Environment

  • Sudachi 0.4.2
  • system_core.dic @ 20200330
  • OpenJDK 8u252

Steps to reproduce

$ echo "。😀" | java -jar ./sudachi-0.4.2.jar
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 3
        at java.lang.String.substring(String.java:1963)
        at com.worksap.nlp.sudachi.UTF8InputText.getSubstring(UTF8InputText.java:82)
        at com.worksap.nlp.sudachi.MeCabOovProviderPlugin.provideOOV(MeCabOovProviderPlugin.java:101)
        at com.worksap.nlp.sudachi.OovProviderPlugin.getOOV(OovProviderPlugin.java:75)
        at com.worksap.nlp.sudachi.JapaneseTokenizer.buildLattice(JapaneseTokenizer.java:220)
        at com.worksap.nlp.sudachi.JapaneseTokenizer.tokenizeSentence(JapaneseTokenizer.java:162)
        at com.worksap.nlp.sudachi.JapaneseTokenizer.tokenizeSentences(JapaneseTokenizer.java:93)
        at com.worksap.nlp.sudachi.SudachiCommandLine.run(SudachiCommandLine.java:66)
        at com.worksap.nlp.sudachi.SudachiCommandLine.main(SudachiCommandLine.java:218)

Configuration: Merging Plugins

For plugin configuration use the following strategy.

  1. The plugin order in the specified configuration completely overrides the plugin order in the default configuration.
  2. Each plugin's configuration is merged with the default configuration, with specified options replacing the default options.
  3. The class "remaining" allows placing the remaining default plugins at a specific point.
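A hypothetical user configuration illustrating rule 3 (the plugin class names are made up; "remaining" would expand to the default-configuration plugins not listed explicitly):

```json
{
    "inputTextPlugin" : [
        { "class" : "com.example.MyEarlyPlugin" },
        { "class" : "remaining" },
        { "class" : "com.example.MyLatePlugin" }
    ]
}
```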

`1人` is tokenized into a single token while `6人` is not.

新任1人を含む6人の取締役の選任など3議案を原案通りに可決した

Could someone tell me why 1人 in the sentence above is tokenized as the single token 1人, while 6人 is split into 6 and 人?

  • Sudachi: 0.7.0
  • sudachi-dictionary-20220729/system_core.dic

output

// surface: 新任, normalized: 新任, part of speech: 名詞,普通名詞,サ変可能,*,*,*, read: シンニン
// surface: 1人, normalized: 一人, part of speech: 名詞,普通名詞,副詞可能,*,*,*, read: ヒトリ
// surface: を, normalized: を, part of speech: 助詞,格助詞,*,*,*,*, read: ヲ
// surface: 含む, normalized: 含む, part of speech: 動詞,一般,*,*,五段-マ行,連体形-一般, read: フクム
// surface: 6, normalized: 6, part of speech: 名詞,数詞,*,*,*,*, read: ロク
// surface: 人, normalized: 人, part of speech: 接尾辞,名詞的,一般,*,*,*, read: ニン
// ...

Add advanced validations for dictionary building

We plan to introduce dictionary build warnings, which will not abort the build of the dictionary but will report that something is not right.

Warning-producing checks will be optional, but enabled by default.

Proposed list of warnings:

  • Surface forms are not normalized. Words with such surfaces cannot be looked up via the Trie index (this is the current behavior), and these problems seem to appear fairly frequently with user dictionaries.
  • Word segmentation produces inconsistent splits. The concatenation of the split surfaces should reproduce the surface of the original word.
  • Indistinguishable dictionary entries (same left/right connection IDs and the same surface). In this case the entry with the highest cost wins; otherwise the last dictionary entry wins. We will remove all other entries from the index.
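The second check above can be sketched in a few lines; the helper name and the sample entries are made up for illustration:

```java
import java.util.Arrays;
import java.util.List;

public class SplitConsistencyCheck {
    // Warning check sketch: concatenating the split surfaces must
    // reproduce the original word's surface.
    static boolean consistent(String surface, List<String> splitSurfaces) {
        return String.join("", splitSurfaces).equals(surface);
    }

    public static void main(String[] args) {
        System.out.println(consistent("医薬品安全", Arrays.asList("医薬品", "安全")));
        System.out.println(consistent("医薬品安全", Arrays.asList("医薬", "安全")));
    }
}
```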

[Question] Thread safety

Hi. I want to write a multi-threaded program using Sudachi. Which class instances can be shared between threads, and which should be created per thread?

I want to know about the following classes:

  • Dictionary (JapaneseDictionary)
  • Tokenizer (JapaneseTokenizer)

Update: I checked the implementation and found that JapaneseTokenizer mutates its member variables. So the tokenizer is not thread-safe.

Thanks.
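Given that, one common pattern is to share the dictionary and give each thread its own tokenizer, e.g. via ThreadLocal. A self-contained sketch with a stand-in tokenizer class (in real code, the shared Dictionary would create the per-thread instances):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PerThreadTokenizer {
    // Stand-in for a non-thread-safe tokenizer that mutates internal state.
    static class StubTokenizer {
        private final StringBuilder buffer = new StringBuilder(); // mutable member
        String tokenize(String text) {
            buffer.setLength(0);
            buffer.append(text);
            return buffer.toString();
        }
    }

    // One tokenizer per thread; the dictionary itself would be created once.
    static final ThreadLocal<StubTokenizer> TOKENIZER =
            ThreadLocal.withInitial(StubTokenizer::new);

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 8; i++) {
            pool.submit(() -> System.out.println(TOKENIZER.get().tokenize("東京都")));
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```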

Invalid space with voiced/semi-voiced sound mark

$ echo "愛゛の゛ム゛チ゛" | java -jar sudachi-0.1.1-SNAPSHOT.jar
愛      名詞,普通名詞,一般,*,*,*        愛
        空白,*,*,*,*,*
゛の    名詞,普通名詞,一般,*,*,*        ゙の
゛      補助記号,一般,*,*,*,*   ゛
ム      接頭辞,*,*,*,*,*        ム
゛      補助記号,一般,*,*,*,*   ゛
チ゛    名詞,普通名詞,一般,*,*,*        ヂ
EOS

Ignore brackets and yomigana

We want to analyze 「行(おこな)う」 as 「行う」, because 「おこな」 gives the pronunciation (yomigana).

User dictionary source File Creation from Token and POS mapped file

Is there any utility to generate the "User dictionary source file" from a raw file that has sentences and their tokens with a POS mapping for each token? I mean, given token and POS mappings, is there an easy way to generate the "User dictionary source file"?

For example, if we have a file as below, or any similar format, can we generate the "User dictionary source file"?

(image attached to the original issue)

SentenceDetector splits at whitespace regardless of text length when no break point found

The current behavior:

> echo "東京都 へ行く" | java -jar target/sudachi-0.5.2.jar
東京都  名詞,固有名詞,地名,一般,*,*     東京都
        空白,*,*,*,*,*
EOS
へ      助詞,格助詞,*,*,*,*     へ
行く    動詞,非自立可能,*,*,五段-カ行,終止形-一般       行く
EOS

We want it to return the full text if it is shorter than the buffer size like:

> echo "東京都 へ行く" | java -jar target/sudachi-0.5.2.jar
東京都  名詞,固有名詞,地名,一般,*,*     東京都
        空白,*,*,*,*,*
へ      助詞,格助詞,*,*,*,*     へ
行く    動詞,非自立可能,*,*,五段-カ行,終止形-一般       行く
EOS

potential bug at connection matrix

@kazuma-t

The rows and columns of the connection matrix look transposed, and I think setConnectionCost in GrammarImpl contains a bug.

  1. setConnectionCost looks properly used (rightId is right, leftId is left):

    grammar.setConnectCost(leftId, rightId, Grammar.INHIBITED_CONNECTION);

  2. getConnectionCost looks improperly used (rightId is left, leftId is right):

    short connectCost = grammar.getConnectCost(lNode.rightId, rNode.leftId);

setConnectionCost is not frequently used, so the effect is small.
It would be better to make 1 and 2 consistent and to agree on the shape of the connection matrix.
Of course, the format of the dictionary file must be kept.

Adjust lengths of known word and OOV

Known words and OOV words are segmented differently even though their word structures are the same.

全国的 名詞,普通名詞,形状詞可能,*,*,* 全国的

間接 名詞,普通名詞,一般,*,*,* 間接
的 接尾辞,形状詞的,*,*,*,*

Adjust them with a PathRewritePlugin.

This repository is over its data quota. Purchase more data packs to restore access.

Thank you for publishing a great tool.

I got this error while trying git lfs pull.

$ git lfs pull
Git LFS: (0 of 1 files) 0 B / 493.43 MB
batch response: This repository is over its data quota. Purchase more data packs to restore access.
error: failed to fetch some objects from 'https://github.com/WorksApplications/Sudachi.git/info/lfs'

[Question] What are different Parts of Speech values?

What are the different elements returned in the parts of speech?
For example, on tokenizing 太郎 we get:

太郎	名詞,固有名詞,人名,名,*,*	太郎

I want to understand the various parts under:

名詞,固有名詞,人名,名,*,*

Does each of these comma-separated values have a descriptive name?

Sorry, if it is mentioned somewhere in the code! I couldn't find it!

And, thank you for making such an awesome package and making it open source!!

Blank reading form in OOV

Currently, unknown (OOV) words have a blank string ("") as their reading form.

Is there any way for unknown words to return the surface itself as the reading form?

example:
text: サンドウィッチマン ライブ
output:

 text               POS               Base_form          Reading_form
 サンドウィッチマン  名詞-普通名詞-一般   サンドウィッチマン   
 ライブ            名詞-普通名詞-一般   ライブ             ライブ

expect:

 text               POS               Base_form          Reading_form
 サンドウィッチマン  名詞-普通名詞-一般   サンドウィッチマン   サンドウィッチマン
 ライブ            名詞-普通名詞-一般   ライブ             ライブ

I suggest adding a flag to SimpleOovProviderPlugin so that unknown words return the surface itself as the reading form:


@Override
public List<LatticeNode> provideOOV(InputText inputText, int offset, boolean hasOtherWords) {
    if (!hasOtherWords) {
        LatticeNode node = createNode();
        node.setParameter(leftId, rightId, cost);
        int length = inputText.getWordCandidateLength(offset);
        String s = inputText.getSubstring(offset, offset + length);
        WordInfo info = new WordInfo(s, (short) length, oovPOSId, s, s, "");
        node.setWordInfo(info);
        return Collections.singletonList(node);
    } else {
        return Collections.emptyList();
    }
}

Unexpected char 65.533

I don't know if it's an error on my part or if the dictionary files aren't working. In case it is my fault, I'm sorry; I didn't fully understand how the dictionaries work, I thought I just had to load them to use the tokenizer. I am trying to use Sudachi in a program that generates Anki phrases for my studies.

When reading the system_small.dic or system_core.dic files, I get the following message in Eclipse. I downloaded the files found on the home page, and a test from cmd gives the same error, as shown in the image.

sudachi-dictionary-20200330-small.zip
sudachi-dictionary-20200330-core.zip

CMD:
(screenshot attached to the original issue)

Eclipse:

Exception in thread "JavaFX Application Thread" java.lang.IllegalArgumentException: javax.json.stream.JsonParsingException: Unexpected char 65.533 at (line no=1, column no=1, offset=0)
at [email protected]/com.worksap.nlp.sudachi.Settings.parseSettings(Settings.java:115)
at [email protected]/com.worksap.nlp.sudachi.JapaneseDictionary.buildSettings(JapaneseDictionary.java:97)
at [email protected]/com.worksap.nlp.sudachi.JapaneseDictionary.<init>(JapaneseDictionary.java:52)
at [email protected]/com.worksap.nlp.sudachi.JapaneseDictionary.<init>(JapaneseDictionary.java:48)
at [email protected]/com.worksap.nlp.sudachi.DictionaryFactory.create(DictionaryFactory.java:47)
at TextosJapones/org.jisho.textosJapones.controller.TelaProcessarFrasesController.processaTexto(TelaProcessarFrasesController.java:217)
at TextosJapones/org.jisho.textosJapones.controller.TelaProcessarFrasesController.lambda$4(TelaProcessarFrasesController.java:454)
at javafx.base/com.sun.javafx.binding.ExpressionHelper$Generic.fireValueChangedEvent(ExpressionHelper.java:360)
at javafx.base/com.sun.javafx.binding.ExpressionHelper.fireValueChangedEvent(ExpressionHelper.java:80)
at javafx.base/javafx.beans.property.ReadOnlyBooleanPropertyBase.fireValueChangedEvent(ReadOnlyBooleanPropertyBase.java:72)
at javafx.graphics/javafx.scene.Node$FocusedProperty.notifyListeners(Node.java:8159)
at javafx.graphics/javafx.scene.Scene$12.invalidated(Scene.java:2158)
at javafx.base/javafx.beans.property.ObjectPropertyBase.markInvalid(ObjectPropertyBase.java:112)
at javafx.base/javafx.beans.property.ObjectPropertyBase.set(ObjectPropertyBase.java:147)
at javafx.graphics/javafx.scene.Scene$KeyHandler.setFocusOwner(Scene.java:4030)
at javafx.graphics/javafx.scene.Scene$KeyHandler.requestFocus(Scene.java:4077)
at javafx.graphics/javafx.scene.Scene.requestFocus(Scene.java:2125)
at javafx.graphics/javafx.scene.Node.requestFocus(Node.java:8320)
at javafx.controls/com.sun.javafx.scene.control.behavior.TextAreaBehavior.mousePressed(TextAreaBehavior.java:264)
at javafx.controls/javafx.scene.control.skin.TextAreaSkin$ContentView.lambda$new$0(TextAreaSkin.java:1201)
at javafx.base/com.sun.javafx.event.CompositeEventHandler$NormalEventHandlerRecord.handleBubblingEvent(CompositeEventHandler.java:218)
at javafx.base/com.sun.javafx.event.CompositeEventHandler.dispatchBubblingEvent(CompositeEventHandler.java:80)
at javafx.base/com.sun.javafx.event.EventHandlerManager.dispatchBubblingEvent(EventHandlerManager.java:238)
at javafx.base/com.sun.javafx.event.EventHandlerManager.dispatchBubblingEvent(EventHandlerManager.java:191)
at javafx.base/com.sun.javafx.event.CompositeEventDispatcher.dispatchBubblingEvent(CompositeEventDispatcher.java:59)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:58)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.EventUtil.fireEventImpl(EventUtil.java:74)
at javafx.base/com.sun.javafx.event.EventUtil.fireEvent(EventUtil.java:54)
at javafx.base/javafx.event.Event.fireEvent(Event.java:198)
at javafx.graphics/javafx.scene.Scene$MouseHandler.process(Scene.java:3862)
at javafx.graphics/javafx.scene.Scene.processMouseEvent(Scene.java:1849)
at javafx.graphics/javafx.scene.Scene$ScenePeerListener.mouseEvent(Scene.java:2590)
at javafx.graphics/com.sun.javafx.tk.quantum.GlassViewEventHandler$MouseEventNotification.run(GlassViewEventHandler.java:409)
at javafx.graphics/com.sun.javafx.tk.quantum.GlassViewEventHandler$MouseEventNotification.run(GlassViewEventHandler.java:299)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at javafx.graphics/com.sun.javafx.tk.quantum.GlassViewEventHandler.lambda$handleMouseEvent$2(GlassViewEventHandler.java:447)
at javafx.graphics/com.sun.javafx.tk.quantum.QuantumToolkit.runWithoutRenderLock(QuantumToolkit.java:411)
at javafx.graphics/com.sun.javafx.tk.quantum.GlassViewEventHandler.handleMouseEvent(GlassViewEventHandler.java:446)
at javafx.graphics/com.sun.glass.ui.View.handleMouseEvent(View.java:556)
at javafx.graphics/com.sun.glass.ui.View.notifyMouse(View.java:942)
at javafx.graphics/com.sun.glass.ui.win.WinApplication._runLoop(Native Method)
at javafx.graphics/com.sun.glass.ui.win.WinApplication.lambda$runLoop$3(WinApplication.java:174)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: javax.json.stream.JsonParsingException: Unexpected char 65.533 at (line no=1, column no=1, offset=0)
at org.glassfish.json.JsonTokenizer.unexpectedChar(JsonTokenizer.java:601)
at org.glassfish.json.JsonTokenizer.nextToken(JsonTokenizer.java:418)
at org.glassfish.json.JsonParserImpl$NoneContext.getNextEvent(JsonParserImpl.java:413)
at org.glassfish.json.JsonParserImpl.next(JsonParserImpl.java:363)
at org.glassfish.json.JsonReaderImpl.read(JsonReaderImpl.java:90)
at [email protected]/com.worksap.nlp.sudachi.Settings.parseSettings(Settings.java:106)
... 57 more

Multiple blanks parsed as multiple words

related explosion/spaCy#3756 (comment)

echo "東京都  へ 行く" | java -jar target/sudachi-0.3.0.jar
東京都	名詞,固有名詞,地名,一般,*,*	東京都
 	空白,*,*,*,*,*	 
 	空白,*,*,*,*,*	 
へ	助詞,格助詞,*,*,*,*	へ
 	空白,*,*,*,*,*	 
行く	動詞,非自立可能,*,*,五段-カ行,終止形-一般	行く

Is this the expected result? Multiple blanks are parsed into multiple 空白,*,*,*,*,* tokens.

Words w/o normalization are ignored in user dictionary

The first column of a user dictionary source is the headword for the TRIE.
Because input texts are normalized by DefaultInputTextPlugin, the headwords must be normalized in the same way.
Either normalize the headwords with DefaultInputTextPlugin when building the dictionary, or add a caution to the documentation.
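For illustration, an approximation of that normalization in plain Java, assuming it amounts to lowercasing plus Unicode NFKC (a simplification; the real plugin also applies its own rewrite rules):

```java
import java.text.Normalizer;

public class HeadwordNormalize {
    public static void main(String[] args) {
        // Full-width Latin letters and half-width katakana, as they might
        // appear in an unnormalized user dictionary headword.
        String headword = "ＡＢＣｱｲｳ";
        // Lowercase, then NFKC: full-width "ＡＢＣ" -> "abc",
        // half-width "ｱｲｳ" -> full-width "アイウ".
        String normalized = Normalizer.normalize(
                headword.toLowerCase(), Normalizer.Form.NFKC);
        System.out.println(normalized);
    }
}
```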

Version management policy

Currently the version of Sudachi is 0.1-SNAPSHOT (maybe major = 0, minor = 1), while the version of elasticsearch-sudachi is 1.0.0-SNAPSHOT (major = 1, minor = 0, patch = 0). They seem to use different versioning policies; I recommend unifying them.

And if you follow semver 2.0, I recommend changing 1.0.0-SNAPSHOT to 0.1.0-SNAPSHOT. 0.x.y is known as the initial development phase, so you can break backward compatibility.

Remove EditConnectionCost plugin

This functionality will be removed in 1.0 as nobody seems to be using it.

Please comment here if you are actually using it and do not want to have it removed.

Dictionary file Loading support from jar

If the dictionary file is kept inside a resource jar, MMap.class cannot read it.

Could this be modified to support BinaryDictionary.class.getClassLoader().getResourceAsStream("system_core.dic")?
Then from the InputStream we can get a ByteBuffer using logic like the following:

ByteBuffer byteBuffer = ByteBuffer.allocate(initialStream.available());
ReadableByteChannel channel = Channels.newChannel(initialStream);
IOUtils.readFully(channel, byteBuffer);
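One caveat with the snippet above: InputStream.available() is only an estimate and may be 0 for some stream types. A self-contained sketch that reads the whole stream without relying on it (demonstrated on an in-memory stream; a real caller would pass the resource stream instead):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

public class StreamToByteBuffer {
    // Read the entire stream into a ByteBuffer without trusting available().
    static ByteBuffer readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return ByteBuffer.wrap(out.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for getResourceAsStream(...) on the dictionary resource.
        InputStream in = new ByteArrayInputStream(new byte[] { 1, 2, 3, 4 });
        ByteBuffer buffer = readFully(in);
        System.out.println(buffer.remaining());
    }
}
```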
