
sudachi's Issues

`抑える` is normalized to `押さえる`.

抑える is normalized to 押さえる. Is this expected behavior?

  • Sudachi 0.7.0
  • sudachi-dictionary-20220729/system_core.dic
{
    "systemDict" : "/dbfs/FileStore/sudachi/system_core.dic",
    "oovProviderPlugin" : [
        { "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
          "oovPOS" : [ "名詞", "普通名詞", "一般", "*", "*", "*" ]}
    ]
}

Why not a multi-module project?

I found a second pom.xml in the elasticsearch directory. Why don't we use a multi-module project?

If there is no clear reason, I will propose a PR to change this project to a multi-module layout. Here are the conditions for doing so:

  • use the same <version> for each module (not mandatory, but it seems possible)
  • move src, licenses and pom.xml into sudachi/src, sudachi/licenses and sudachi/pom.xml
    • the name of the sub-directory can be changed to sudachi-core or whatever you prefer

Clarify the definition of core and non_core lexicon

At ae2b047, the dictionary source was split into two files, but there is no clear definition of each file or of the difference between them. Could you add one to the README?

In addition, since all the named-entity examples in the README, such as "医薬品安全管理責任者" and "自転車安全整備士", are now included in non_core.csv, the default settings no longer demonstrate the three-mode difference described in the README (both B and C end up with "医薬品/安全/管理/責任者"), and users need to specify system_full.dic as the system dictionary. It might be better to document that these examples require the full dictionary, or to replace them with examples included in the core dictionary, to avoid confusion.

Overflow can occur when writing headwordLength to the system dictionary.

There is a headword longer than 255 bytes in core_lex.csv.

The latest core_lex.csv: line 49605

あなたの幸せが私の幸せ世の為人の為人類幸福繋がり創造即ち我らの使命なり今まさに変革の時ここに熱き魂と愛と情鉄の勇気と利他の精神を持つ者が結集せり日々感謝喜び笑顔繋がりを確かな一歩とし地球の永続を約束する公益の志溢れる我らの足跡に歴史の花が咲くいざゆかん浪漫輝く航海へ

This headword is 399 bytes in UTF-8.

The length is stored in a short (2 bytes) when reading from the CSV:

(short)cols[0].getBytes(StandardCharsets.UTF_8).length,

But it is cast to a byte (1 byte) when writing the dictionary, so an overflow can occur.
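A minimal, self-contained illustration of the narrowing cast (the 399-byte length is taken from this report; the class and variable names are made up):

```java
public class HeadwordLengthOverflow {
    public static void main(String[] args) {
        // The headword above is 399 bytes in UTF-8; it fits in a short...
        short headwordLength = 399;
        // ...but the narrowing cast keeps only the low 8 bits, producing a
        // negative, meaningless length when written to the dictionary.
        byte written = (byte) headwordLength;
        System.out.println(written);
    }
}
```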

Node missing problem

In sudachi-0.1.1-SNAPSHOT, this problem may occur in some cases where the input sentence contains consecutive alphabetic characters.

To reproduce

$ echo "阿qd" | java -jar sudachi-0.1.1-SNAPSHOT.jar -d
=== Input dump:
阿qd
=== Lattice dump:
0: 5 5 (null)(0) BOS/EOS 0 0 0: 
1: 0 4 阿Q(1441792) 名詞,固有名詞,人名,一般,*,* 4788 4788 9443: 1 
2: 0 0 (null)(0) BOS/EOS 0 0 0: 0 
Exception in thread "main" java.lang.IllegalStateException: EOS isn't connected to BOS
	at com.worksap.nlp.sudachi.LatticeImpl.getBestPath(LatticeImpl.java:150)
	at com.worksap.nlp.sudachi.JapaneseTokenizer.tokenize(JapaneseTokenizer.java:81)
	at com.worksap.nlp.sudachi.SudachiCommandLine.run(SudachiCommandLine.java:43)
	at com.worksap.nlp.sudachi.SudachiCommandLine.main(SudachiCommandLine.java:148)

In this example, it seems that the node containing the character "d" was not generated, so the EOS node cannot be reached.

Adjust order of splitting and normalization for numerical expression

「にぎり寿司三億年」

=== Before rewriting:
0: 0 24 にぎり寿司三億年(1634131) 2 5142 5154 13355

=== After rewriting:
0: 0 9 にぎり(136394) 4 0 0 0
1: 9 15 寿司(700728) 4 0 0 0
2: 15 18 三(316771) 5 0 0 0
3: 18 21 億(434255) 5 0 0 0
4: 21 24 年(772367) 9 0 0 0

===
にぎり 名詞,普通名詞,一般,*,*,* 握り にぎり ニギリ 0
寿司 名詞,普通名詞,一般,*,*,* 寿司 寿司 ズシ 0
三 名詞,数詞,*,*,*,* 三 三 サン 0
億 名詞,数詞,*,*,*,* 億 億 オク 0
年 名詞,普通名詞,助数詞可能,*,*,* 年 年 ネン 0

Empty-surface morpheme(s) output before "´" or "…"

SudachiCommandLine outputs morpheme(s) with an empty surface before "´" or "…", depending on the following context. The ACUTE ACCENT character seems to cause the buggy behaviour, while the HORIZONTAL ELLIPSIS character is normalized to three DOT characters. I think the possibility of a morpheme having a zero-length surface should be noted in README.md.

java -jar sudachi-0.1.2-SNAPSHOT.jar -a -m A
´
´	補助記号,一般,*,*,*,*	´	´	キゴウ	0
EOS
´´
	空白,*,*,*,*,*	 	 	キゴウ	0
´	補助記号,一般,*,*,*,*				-1	(OOV)
´	補助記号,一般,*,*,*,*	´	´	キゴウ	0
EOS
´。
	空白,*,*,*,*,*	 	 	キゴウ	0
´	補助記号,一般,*,*,*,*				-1	(OOV)
。	補助記号,句点,*,*,*,*	。	。	キゴウ	0
EOS
´あ
´	補助記号,一般,*,*,*,*	´	´	キゴウ	0
あ	感動詞,一般,*,*,*,*	あっ	あ	ア	0
EOS
…   
	補助記号,句点,*,*,*,*	.	.	キゴウ	0
	補助記号,句点,*,*,*,*	.	.	キゴウ	0
…	補助記号,句点,*,*,*,*	.	.	キゴウ	0
EOS

invalid format at buildLexicon

Thanks for developing this tool. I tried to install Sudachi on a CentOS 6 server. I ran 'mvn package' but ended up with the following message. The files downloaded to './target/dictionary/unidic-mecab-2.1.2_src' seem to be OK. Is there any specific file that I should check?

[INFO] --- iterator-maven-plugin:0.5.1:iterator (build-system-dictionary) @ sudachi ---
[INFO] ------ (core) org.codehaus.mojo:exec-maven-plugin:1.6.0:java
reading the source file...Error: invalid format at line 1
[WARNING]
java.lang.IllegalArgumentException: invalid format
    at com.worksap.nlp.sudachi.dictionary.DictionaryBuilder.buildLexicon (DictionaryBuilder.java:114)
    at com.worksap.nlp.sudachi.dictionary.DictionaryBuilder.build (DictionaryBuilder.java:89)
    at com.worksap.nlp.sudachi.dictionary.DictionaryBuilder.main (DictionaryBuilder.java:432)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:497)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
    at java.lang.Thread.run (Thread.java:745)

Replace System.err & System.out with proper logger API

The current implementation uses System.out and System.err to output debug information, but this should be avoided in production code. Using stdout and/or stderr directly makes system maintenance difficult: it supports filtering neither by log level nor by class, and it is hard to apply log rotation or other log-management mechanisms.

I cannot judge which logger API we should use; it is probably one of the following:

  1. SLF4J API, which is common in Java ecosystem
  2. JUL (java.util.logging) API, which is standard in Java but a little hard to maintain
  3. Log4j 2, which is used by the Elasticsearch core
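For illustration, a minimal sketch of option 2 (JUL) showing per-logger level filtering, which raw System.err cannot provide. The logger name and messages are made up, and the collecting handler is only there to make the filtered output visible:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class LoggerSketch {
    public static void main(String[] args) {
        Logger log = Logger.getLogger("com.worksap.nlp.sudachi.demo");
        log.setUseParentHandlers(false); // do not also write to stderr
        List<String> lines = new ArrayList<>();
        log.addHandler(new Handler() {   // collect records instead of printing directly
            @Override public void publish(LogRecord r) {
                lines.add(r.getLevel() + ": " + r.getMessage());
            }
            @Override public void flush() { }
            @Override public void close() { }
        });
        log.setLevel(Level.WARNING);              // filter by level, per logger
        log.info("connection matrix loaded");     // suppressed: below WARNING
        log.warning("user dictionary not found"); // kept
        lines.forEach(System.out::println);
    }
}
```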

Configuration: Improve Error messages

Right now, if a resource does not exist, Sudachi throws a NullPointerException, which is unhelpful. Instead, it should report the exact configuration string and all of the paths that were actually tried.

Remove prolonged sound mark

ご期待くださいーー!!
ご      接頭辞,*,*,*,*,*        御
期待    名詞,普通名詞,サ変可能,*,*,*    期待
くださ  動詞,一般,*,*,五段-サ行,未然形-一般     下す
いーー  感動詞,フィラー,*,*,*,* いー
!       補助記号,句点,*,*,*,*   !
!       補助記号,句点,*,*,*,*   !

Expected result:

ご期待くださいーー!!
ご      接頭辞,*,*,*,*,*        御
期待    名詞,普通名詞,サ変可能,*,*,*    期待
くださいーー  動詞,非自立可能,*,*,五段-ラ行,命令形 下さる
!       補助記号,句点,*,*,*,*   !
!       補助記号,句点,*,*,*,*   !

DoubleArray fails when key contains negative byte

Hi, there. Thank you for your nice implementation.

I have been interested in Double-Array Trie implementations for years, and studied a similar dart-clone-java implementation before. That implementation works well.

But it seems that your implementation can't handle keys containing negative bytes. Example:

    // Assumed imports: java.nio.charset.StandardCharsets, java.util.Arrays,
    // com.worksap.nlp.dartsclone.DoubleArray, and JUnit's assertEquals.
    public void testRaw() throws Exception {
        byte[][] keys = new byte[3][];
        keys[0] = new byte[] { 1 };
        keys[1] = new byte[] { 1, 2, -1 }; // key containing a negative byte
        keys[2] = "東京都".getBytes(StandardCharsets.UTF_8);
        DoubleArray doubleArray = new DoubleArray();
        doubleArray.build(keys, new int[] { 0, 1, 2 }, null);

        for (int i = 0; i < 3; ++i) {
            assertEquals(i, doubleArray.exactMatchSearch(keys[i])[0]);
            System.out.printf("Good for %s%n", Arrays.toString(keys[i]));
        }
    }
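A likely cause is that Java's byte is signed, so a raw byte used as an array index can be negative; masking with & 0xFF treats it as unsigned. A self-contained illustration (not Sudachi code):

```java
public class SignedByteDemo {
    public static void main(String[] args) {
        byte[] key = { 1, 2, -1 };
        for (byte b : key) {
            int signed = b;          // used directly as an index, -1 is out of bounds
            int unsigned = b & 0xFF; // masked, -1 becomes 255, a valid index 0..255
            System.out.println(signed + " -> " + unsigned);
        }
    }
}
```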

By the way, your repo is over its data quota.

~ git clone https://github.com/WorksApplications/Sudachi.git
Cloning into 'Sudachi'...
remote: Counting objects: 2353, done.
remote: Compressing objects: 100% (955/955), done.
remote: Total 2353 (delta 754), reused 2307 (delta 753), pack-reused 0
Receiving objects: 100% (2353/2353), 306.04 KiB | 86.00 KiB/s, done.
Resolving deltas: 100% (754/754), done.
Checking connectivity... done.
Downloading dictionary/sudachi_lex.csv (517 MB)
Error downloading object: dictionary/sudachi_lex.csv (acabc71): Smudge error: Error downloading dictionary/sudachi_lex.csv (acabc71c0b63936801fbdaf1749f8b91e7fa8b15532461e05fec9b98f5f474db): batch response: This repository is over its data quota. Purchase more data packs to restore access.

So I can't download the csv file or debug into your com.worksap.nlp.sudachi.dictionary.DictionaryBuilder. Is there any magic inside the DictionaryBuilder?

oovPOS is invalid

I tried to start Sudachi in this way but ended up with the following.

target]$ java -jar sudachi-0.1.1-SNAPSHOT.jar
Exception in thread "main" java.lang.IllegalArgumentException: oovPOS is invalid:????,??,*,*,*,*
	at com.worksap.nlp.sudachi.SimpleOovProviderPlugin.setUp(SimpleOovProviderPlugin.java:63)
	at com.worksap.nlp.sudachi.JapaneseDictionary.<init>(JapaneseDictionary.java:89)
	at com.worksap.nlp.sudachi.JapaneseDictionary.<init>(JapaneseDictionary.java:56)
	at com.worksap.nlp.sudachi.DictionaryFactory.create(DictionaryFactory.java:44)
	at com.worksap.nlp.sudachi.SudachiCommandLine.main(SudachiCommandLine.java:135)

I've checked the content in ./src/main/resources/sudachi.json and the oovProviderPlugin part seems OK.

Emoji after "。" causes StringIndexOutOfBoundsException

Description

When trying to analyze text containing an emoji after a "。", java.lang.StringIndexOutOfBoundsException is thrown.

Environment

  • Sudachi 0.4.2
  • system_core.dic @ 20200330
  • OpenJDK 8u252

Steps to reproduce

$ echo "。😀" | java -jar ./sudachi-0.4.2.jar
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 3
        at java.lang.String.substring(String.java:1963)
        at com.worksap.nlp.sudachi.UTF8InputText.getSubstring(UTF8InputText.java:82)
        at com.worksap.nlp.sudachi.MeCabOovProviderPlugin.provideOOV(MeCabOovProviderPlugin.java:101)
        at com.worksap.nlp.sudachi.OovProviderPlugin.getOOV(OovProviderPlugin.java:75)
        at com.worksap.nlp.sudachi.JapaneseTokenizer.buildLattice(JapaneseTokenizer.java:220)
        at com.worksap.nlp.sudachi.JapaneseTokenizer.tokenizeSentence(JapaneseTokenizer.java:162)
        at com.worksap.nlp.sudachi.JapaneseTokenizer.tokenizeSentences(JapaneseTokenizer.java:93)
        at com.worksap.nlp.sudachi.SudachiCommandLine.run(SudachiCommandLine.java:66)
        at com.worksap.nlp.sudachi.SudachiCommandLine.main(SudachiCommandLine.java:218)

Configuration: Merging Plugins

For plugin configuration use the following strategy.

  1. The plugin order in the specified configuration completely overrides the plugin order in the default configuration.
  2. Each plugin's configuration is merged with the default configuration, with specified options replacing the default options.
  3. The class "remaining" allows placing the remaining default plugins at a specific point.
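A hypothetical user configuration illustrating rule 3 (the plugin class names are made up; "remaining" would expand to the default-configuration plugins not listed explicitly):

```json
{
    "inputTextPlugin" : [
        { "class" : "com.example.MyEarlyPlugin" },
        { "class" : "remaining" },
        { "class" : "com.example.MyLatePlugin" }
    ]
}
```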

`1人` is tokenized into a single token while `6人` is not.

新任1人を含む6人の取締役の選任など3議案を原案通りに可決した

Could someone tell me why 1人 in the sentence above is tokenized as the single token 1人, while 6人 is split into 6 and 人?

  • Sudachi: 0.7.0
  • sudachi-dictionary-20220729/system_core.dic

output

// surface: 新任, normalized: 新任, part of speech: 名詞,普通名詞,サ変可能,*,*,*, read: シンニン
// surface: 1人, normalized: 一人, part of speech: 名詞,普通名詞,副詞可能,*,*,*, read: ヒトリ
// surface: を, normalized: を, part of speech: 助詞,格助詞,*,*,*,*, read: ヲ
// surface: 含む, normalized: 含む, part of speech: 動詞,一般,*,*,五段-マ行,連体形-一般, read: フクム
// surface: 6, normalized: 6, part of speech: 名詞,数詞,*,*,*,*, read: ロク
// surface: 人, normalized: 人, part of speech: 接尾辞,名詞的,一般,*,*,*, read: ニン
// ...

Add advanced validations for dictionary building

We plan to introduce dictionary build warnings, which will not abort the build of the dictionary but will report that something is not right.

Warning-producing checks will be optional, but enabled by default.

Proposed list of warnings:

  • Surface forms are not normalized. Words with such surfaces cannot be looked up via the Trie index (this is the current behavior), and these problems seem to appear fairly frequently with user dictionaries.
  • Word segmentation produces inconsistent splits. The concatenation of the split surfaces should reproduce the surface of the original word.
  • Indistinguishable dictionary entries (same left/right connection IDs and the same surface). In this case the entry with the highest cost wins; otherwise the last dictionary entry wins. We will remove all other entries from the index.
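The second check above can be sketched in a few lines; the helper name and the sample entries are made up for illustration:

```java
import java.util.Arrays;
import java.util.List;

public class SplitConsistencyCheck {
    // Warning check sketch: concatenating the split surfaces must
    // reproduce the original word's surface.
    static boolean consistent(String surface, List<String> splitSurfaces) {
        return String.join("", splitSurfaces).equals(surface);
    }

    public static void main(String[] args) {
        System.out.println(consistent("医薬品安全", Arrays.asList("医薬品", "安全")));
        System.out.println(consistent("医薬品安全", Arrays.asList("医薬", "安全")));
    }
}
```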

[Question] Thread safety

Hi. I want to write a multi-threaded program using Sudachi. Which class instances can be shared between threads, and which should be created per thread?

I want to know about the following classes:

  • Dictionary (JapaneseDictionary)
  • Tokenizer (JapaneseTokenizer)

Update: I checked the implementation and found that JapaneseTokenizer mutates its member variables. So the tokenizer is not thread-safe.

Thanks.
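Given that, one common pattern is to share the dictionary and give each thread its own tokenizer, e.g. via ThreadLocal. A self-contained sketch with a stand-in tokenizer class (in real code, the shared Dictionary would create the per-thread instances):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PerThreadTokenizer {
    // Stand-in for a non-thread-safe tokenizer that mutates internal state.
    static class StubTokenizer {
        private final StringBuilder buffer = new StringBuilder(); // mutable member
        String tokenize(String text) {
            buffer.setLength(0);
            buffer.append(text);
            return buffer.toString();
        }
    }

    // One tokenizer per thread; the dictionary itself would be created once.
    static final ThreadLocal<StubTokenizer> TOKENIZER =
            ThreadLocal.withInitial(StubTokenizer::new);

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 8; i++) {
            pool.submit(() -> System.out.println(TOKENIZER.get().tokenize("東京都")));
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```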

Invalid space with voiced/semi-voiced sound mark

$ echo "愛゛の゛ム゛チ゛" | java -jar sudachi-0.1.1-SNAPSHOT.jar
愛      名詞,普通名詞,一般,*,*,*        愛
        空白,*,*,*,*,*
゛の    名詞,普通名詞,一般,*,*,*        ゙の
゛      補助記号,一般,*,*,*,*   ゛
ム      接頭辞,*,*,*,*,*        ム
゛      補助記号,一般,*,*,*,*   ゛
チ゛    名詞,普通名詞,一般,*,*,*        ヂ
EOS

Ignore brackets and yomigana

We want to analyze 「行(おこな)う」 as 「行う」, because 「おこな」 gives the pronunciation (yomigana).

User dictionary source File Creation from Token and POS mapped file

Is there any utility to generate the "User dictionary source file" from a raw file that has sentences and their tokens with a POS mapping for each token? I mean, given token and POS mappings, is there an easy way to generate the "User dictionary source file"?

For example, if we have a file as below, or any similar format, can we generate the "User dictionary source file"?

(image attached to the original issue)

SentenceDetector splits at whitespace regardless of text length when no break point found

The current behavior:

> echo "東京都 へ行く" | java -jar target/sudachi-0.5.2.jar
東京都  名詞,固有名詞,地名,一般,*,*     東京都
        空白,*,*,*,*,*
EOS
へ      助詞,格助詞,*,*,*,*     へ
行く    動詞,非自立可能,*,*,五段-カ行,終止形-一般       行く
EOS

We want it to return the full text if it is shorter than the buffer size like:

> echo "東京都 へ行く" | java -jar target/sudachi-0.5.2.jar
東京都  名詞,固有名詞,地名,一般,*,*     東京都
        空白,*,*,*,*,*
へ      助詞,格助詞,*,*,*,*     へ
行く    動詞,非自立可能,*,*,五段-カ行,終止形-一般       行く
EOS

potential bug at connection matrix

@kazuma-t

The rows and columns of the connection matrix look transposed, and I think setConnectionCost in GrammarImpl contains a bug.

  1. setConnectionCost looks properly used (rightId is right, leftId is left):

    grammar.setConnectCost(leftId, rightId, Grammar.INHIBITED_CONNECTION);

  2. getConnectionCost looks improperly used (rightId is left, leftId is right):

    short connectCost = grammar.getConnectCost(lNode.rightId, rNode.leftId);

setConnectionCost is not frequently used, so the effect is small.
It would be better to make 1 and 2 consistent and to agree on the shape of the connection matrix.
Of course, the format of the dictionary file must be kept.

Adjust lengths of known word and OOV

Known words and OOV words are segmented differently even though their word structures are the same.

全国的 名詞,普通名詞,形状詞可能,*,*,* 全国的

間接 名詞,普通名詞,一般,*,*,* 間接
的 接尾辞,形状詞的,*,*,*,*

Adjust them with a PathRewritePlugin.

This repository is over its data quota. Purchase more data packs to restore access.

Thank you for publishing a great tool.

I got this error while trying git lfs pull.

$ git lfs pull
Git LFS: (0 of 1 files) 0 B / 493.43 MB
batch response: This repository is over its data quota. Purchase more data packs to restore access.
error: failed to fetch some objects from 'https://github.com/WorksApplications/Sudachi.git/info/lfs'

[Question] What are different Parts of Speech values?

What are the different elements returned in the parts of speech?
For example, on tokenizing 太郎 we get:

太郎	名詞,固有名詞,人名,名,*,*	太郎

I want to understand the various parts under:

名詞,固有名詞,人名,名,*,*

Does each of these comma-separated values have a descriptive name?

Sorry, if it is mentioned somewhere in the code! I couldn't find it!

And, thank you for making such an awesome package and making it open source!!

Blank reading form in OOV

Currently, unknown (OOV) words have a blank string ("") as their reading form.

Is there any way for unknown words to return the surface itself as the reading form?

example:
text: サンドウィッチマン ライブ
output:

 text               POS               Base_form          Reading_form
 サンドウィッチマン  名詞-普通名詞-一般   サンドウィッチマン   
 ライブ            名詞-普通名詞-一般   ライブ             ライブ

expect:

 text               POS               Base_form          Reading_form
 サンドウィッチマン  名詞-普通名詞-一般   サンドウィッチマン   サンドウィッチマン
 ライブ            名詞-普通名詞-一般   ライブ             ライブ

I suggest adding a flag to SimpleOovProviderPlugin so that unknown words return the surface itself as the reading form:


@Override
public List<LatticeNode> provideOOV(InputText inputText, int offset, boolean hasOtherWords) {
    if (!hasOtherWords) {
        LatticeNode node = createNode();
        node.setParameter(leftId, rightId, cost);
        int length = inputText.getWordCandidateLength(offset);
        String s = inputText.getSubstring(offset, offset + length);
        WordInfo info = new WordInfo(s, (short) length, oovPOSId, s, s, "");
        node.setWordInfo(info);
        return Collections.singletonList(node);
    } else {
        return Collections.emptyList();
    }
}

Unexpected char 65.533

I don't know if it's an error on my part or if the dictionary files aren't working. In case it is my fault, I'm sorry; I didn't fully understand how the dictionaries work, I thought I just had to load them to use the tokenizer. I am trying to use Sudachi in a program that generates Anki phrases for my studies.

When reading the system_small.dic or system_core.dic files, I get the following message in Eclipse. I downloaded the files found on the home page, and a test from cmd gives the same error, as shown in the image.

sudachi-dictionary-20200330-small.zip
sudachi-dictionary-20200330-core.zip

CMD:
(screenshot attached to the original issue)

Eclipse:

Exception in thread "JavaFX Application Thread" java.lang.IllegalArgumentException: javax.json.stream.JsonParsingException: Unexpected char 65.533 at (line no=1, column no=1, offset=0)
at [email protected]/com.worksap.nlp.sudachi.Settings.parseSettings(Settings.java:115)
at [email protected]/com.worksap.nlp.sudachi.JapaneseDictionary.buildSettings(JapaneseDictionary.java:97)
at [email protected]/com.worksap.nlp.sudachi.JapaneseDictionary.<init>(JapaneseDictionary.java:52)
at [email protected]/com.worksap.nlp.sudachi.JapaneseDictionary.<init>(JapaneseDictionary.java:48)
at [email protected]/com.worksap.nlp.sudachi.DictionaryFactory.create(DictionaryFactory.java:47)
at TextosJapones/org.jisho.textosJapones.controller.TelaProcessarFrasesController.processaTexto(TelaProcessarFrasesController.java:217)
at TextosJapones/org.jisho.textosJapones.controller.TelaProcessarFrasesController.lambda$4(TelaProcessarFrasesController.java:454)
at javafx.base/com.sun.javafx.binding.ExpressionHelper$Generic.fireValueChangedEvent(ExpressionHelper.java:360)
at javafx.base/com.sun.javafx.binding.ExpressionHelper.fireValueChangedEvent(ExpressionHelper.java:80)
at javafx.base/javafx.beans.property.ReadOnlyBooleanPropertyBase.fireValueChangedEvent(ReadOnlyBooleanPropertyBase.java:72)
at javafx.graphics/javafx.scene.Node$FocusedProperty.notifyListeners(Node.java:8159)
at javafx.graphics/javafx.scene.Scene$12.invalidated(Scene.java:2158)
at javafx.base/javafx.beans.property.ObjectPropertyBase.markInvalid(ObjectPropertyBase.java:112)
at javafx.base/javafx.beans.property.ObjectPropertyBase.set(ObjectPropertyBase.java:147)
at javafx.graphics/javafx.scene.Scene$KeyHandler.setFocusOwner(Scene.java:4030)
at javafx.graphics/javafx.scene.Scene$KeyHandler.requestFocus(Scene.java:4077)
at javafx.graphics/javafx.scene.Scene.requestFocus(Scene.java:2125)
at javafx.graphics/javafx.scene.Node.requestFocus(Node.java:8320)
at javafx.controls/com.sun.javafx.scene.control.behavior.TextAreaBehavior.mousePressed(TextAreaBehavior.java:264)
at javafx.controls/javafx.scene.control.skin.TextAreaSkin$ContentView.lambda$new$0(TextAreaSkin.java:1201)
at javafx.base/com.sun.javafx.event.CompositeEventHandler$NormalEventHandlerRecord.handleBubblingEvent(CompositeEventHandler.java:218)
at javafx.base/com.sun.javafx.event.CompositeEventHandler.dispatchBubblingEvent(CompositeEventHandler.java:80)
at javafx.base/com.sun.javafx.event.EventHandlerManager.dispatchBubblingEvent(EventHandlerManager.java:238)
at javafx.base/com.sun.javafx.event.EventHandlerManager.dispatchBubblingEvent(EventHandlerManager.java:191)
at javafx.base/com.sun.javafx.event.CompositeEventDispatcher.dispatchBubblingEvent(CompositeEventDispatcher.java:59)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:58)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.EventUtil.fireEventImpl(EventUtil.java:74)
at javafx.base/com.sun.javafx.event.EventUtil.fireEvent(EventUtil.java:54)
at javafx.base/javafx.event.Event.fireEvent(Event.java:198)
at javafx.graphics/javafx.scene.Scene$MouseHandler.process(Scene.java:3862)
at javafx.graphics/javafx.scene.Scene.processMouseEvent(Scene.java:1849)
at javafx.graphics/javafx.scene.Scene$ScenePeerListener.mouseEvent(Scene.java:2590)
at javafx.graphics/com.sun.javafx.tk.quantum.GlassViewEventHandler$MouseEventNotification.run(GlassViewEventHandler.java:409)
at javafx.graphics/com.sun.javafx.tk.quantum.GlassViewEventHandler$MouseEventNotification.run(GlassViewEventHandler.java:299)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at javafx.graphics/com.sun.javafx.tk.quantum.GlassViewEventHandler.lambda$handleMouseEvent$2(GlassViewEventHandler.java:447)
at javafx.graphics/com.sun.javafx.tk.quantum.QuantumToolkit.runWithoutRenderLock(QuantumToolkit.java:411)
at javafx.graphics/com.sun.javafx.tk.quantum.GlassViewEventHandler.handleMouseEvent(GlassViewEventHandler.java:446)
at javafx.graphics/com.sun.glass.ui.View.handleMouseEvent(View.java:556)
at javafx.graphics/com.sun.glass.ui.View.notifyMouse(View.java:942)
at javafx.graphics/com.sun.glass.ui.win.WinApplication._runLoop(Native Method)
at javafx.graphics/com.sun.glass.ui.win.WinApplication.lambda$runLoop$3(WinApplication.java:174)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: javax.json.stream.JsonParsingException: Unexpected char 65.533 at (line no=1, column no=1, offset=0)
at org.glassfish.json.JsonTokenizer.unexpectedChar(JsonTokenizer.java:601)
at org.glassfish.json.JsonTokenizer.nextToken(JsonTokenizer.java:418)
at org.glassfish.json.JsonParserImpl$NoneContext.getNextEvent(JsonParserImpl.java:413)
at org.glassfish.json.JsonParserImpl.next(JsonParserImpl.java:363)
at org.glassfish.json.JsonReaderImpl.read(JsonReaderImpl.java:90)
at [email protected]/com.worksap.nlp.sudachi.Settings.parseSettings(Settings.java:106)
... 57 more

Multiple blanks parsed as multiple words

related explosion/spaCy#3756 (comment)

echo "東京都  へ 行く" | java -jar target/sudachi-0.3.0.jar
東京都	名詞,固有名詞,地名,一般,*,*	東京都
 	空白,*,*,*,*,*	 
 	空白,*,*,*,*,*	 
へ	助詞,格助詞,*,*,*,*	へ
 	空白,*,*,*,*,*	 
行く	動詞,非自立可能,*,*,五段-カ行,終止形-一般	行く

Is this the expected result? Multiple blanks are parsed into multiple 空白,*,*,*,*,* tokens.

Words w/o normalization are ignored in user dictionary

The first column of a user dictionary source is the headword for the TRIE.
Because input texts are normalized by DefaultInputTextPlugin, the headwords must be normalized in the same way.
Either normalize the headwords with DefaultInputTextPlugin when building the dictionary, or add a caution to the documentation.
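For illustration, an approximation of that normalization in plain Java, assuming it amounts to lowercasing plus Unicode NFKC (a simplification; the real plugin also applies its own rewrite rules):

```java
import java.text.Normalizer;

public class HeadwordNormalize {
    public static void main(String[] args) {
        // Full-width Latin letters and half-width katakana, as they might
        // appear in an unnormalized user dictionary headword.
        String headword = "ＡＢＣｱｲｳ";
        // Lowercase, then NFKC: full-width "ＡＢＣ" -> "abc",
        // half-width "ｱｲｳ" -> full-width "アイウ".
        String normalized = Normalizer.normalize(
                headword.toLowerCase(), Normalizer.Form.NFKC);
        System.out.println(normalized);
    }
}
```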

Version management policy

Currently the version of Sudachi is 0.1-SNAPSHOT (maybe major = 0, minor = 1), while the version of elasticsearch-sudachi is 1.0.0-SNAPSHOT (major = 1, minor = 0, patch = 0). They seem to use different versioning policies; I recommend unifying them.

And if you follow semver 2.0, I recommend changing 1.0.0-SNAPSHOT to 0.1.0-SNAPSHOT. 0.x.y is known as the initial development phase, so you can break backward compatibility.

Remove EditConnectionCost plugin

This functionality will be removed in 1.0 as nobody seems to be using it.

Please comment here if you are actually using it and do not want to have it removed.

Dictionary file Loading support from jar

If the dictionary file is kept inside a resource jar, MMap.class cannot read it.

Could this be modified to support BinaryDictionary.class.getClassLoader().getResourceAsStream("system_core.dic")?
Then from the InputStream we can get a ByteBuffer using logic like the following:

ByteBuffer byteBuffer = ByteBuffer.allocate(initialStream.available());
ReadableByteChannel channel = Channels.newChannel(initialStream);
IOUtils.readFully(channel, byteBuffer);
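One caveat with the snippet above: InputStream.available() is only an estimate and may be 0 for some stream types. A self-contained sketch that reads the whole stream without relying on it (demonstrated on an in-memory stream; a real caller would pass the resource stream instead):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

public class StreamToByteBuffer {
    // Read the entire stream into a ByteBuffer without trusting available().
    static ByteBuffer readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return ByteBuffer.wrap(out.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for getResourceAsStream(...) on the dictionary resource.
        InputStream in = new ByteArrayInputStream(new byte[] { 1, 2, 3, 4 });
        ByteBuffer buffer = readFully(in);
        System.out.println(buffer.remaining());
    }
}
```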
