A Japanese Tokenizer for Business
Allow users to define their own part-of-speech tags in the user dictionary.
抑える is normalized to 押さえる. Is this the expected behavior?
{
"systemDict" : "/dbfs/FileStore/sudachi/system_core.dic",
"oovProviderPlugin" : [
{ "class" : "com.worksap.nlp.sudachi.SimpleOovProviderPlugin",
"oovPOS" : [ "名詞", "普通名詞", "一般", "*", "*", "*" ]}
]
}
I found a second pom.xml in the elasticsearch directory. Why don't we use a multi-module project?
If there is no clear reason, I will propose a PR to convert this project to multi-module. The conditions to do so are:
- a separate <version> for each module (not mandatory but seems possible)
- move src, licenses and pom.xml into sudachi/src, sudachi/licenses and sudachi/pom.xml
- name the core module sudachi-core, or whatever you prefer
At ae2b047, the dictionary source was split into two files, but there is no clear definition of each file or of the difference between them. Could you add it to the README?
In addition, since all the examples of named entities in the README, such as "医薬品安全管理責任者" and "自転車安全整備士", are now included in non_core.csv, the default settings do not reproduce the three-mode difference shown in the README (both mode B and mode C end up with "医薬品/安全/管理/責任者"), and users need to specify system_full.dic as the system dictionary. It might be better to state that this requires the full dictionary, or to replace the examples with ones included in the core dictionary, to avoid confusion.
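If the full dictionary is needed for those examples, only the systemDict entry in sudachi.json has to change; a minimal sketch (the path is an assumption and depends on where system_full.dic was unpacked):

```json
{
    "systemDict" : "/path/to/system_full.dic"
}
```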
There is a headword longer than 255 bytes in core_lex.csv.
The latest core_lex.csv, line 49605:
あなたの幸せが私の幸せ世の為人の為人類幸福繋がり創造即ち我らの使命なり今まさに変革の時ここに熱き魂と愛と情鉄の勇気と利他の精神を持つ者が結集せり日々感謝喜び笑顔繋がりを確かな一歩とし地球の永続を約束する公益の志溢れる我らの足跡に歴史の花が咲くいざゆかん浪漫輝く航海へ
This headword is 399 bytes in UTF-8.
The length is stored in a short (2 bytes) when read from the CSV, but it is cast to a byte (1 byte) when written to the dictionary, so an overflow can occur.
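The narrowing cast can be reproduced in isolation; a minimal sketch (not Sudachi's actual writer code) of what happens to a 399-byte length:

```java
public class HeadwordLengthOverflow {
    // Simulates reading a headword length into a short and then
    // narrowing it to a byte, which keeps only the low 8 bits.
    public static byte narrow(short lengthFromCsv) {
        return (byte) lengthFromCsv;
    }

    public static void main(String[] args) {
        short length = 399; // the headword length at core_lex.csv line 49605
        System.out.println(narrow(length)); // prints -113, not 399
    }
}
```

Any length above 255 cannot fit in one byte at all, and lengths from 128 to 255 survive only if the reader masks the sign back with & 0xff, so either the writer needs a wider field or the builder should reject such headwords.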
Split text into sentences.
In sudachi-0.1.1-SNAPSHOT, this problem can occur when the input sentence contains consecutive alphabetic characters.
To reproduce:
$ echo "阿qd" | java -jar sudachi-0.1.1-SNAPSHOT.jar -d
=== Input dump:
阿qd
=== Lattice dump:
0: 5 5 (null)(0) BOS/EOS 0 0 0:
1: 0 4 阿Q(1441792) 名詞,固有名詞,人名,一般,*,* 4788 4788 9443: 1
2: 0 0 (null)(0) BOS/EOS 0 0 0: 0
Exception in thread "main" java.lang.IllegalStateException: EOS isn't connected to BOS
at com.worksap.nlp.sudachi.LatticeImpl.getBestPath(LatticeImpl.java:150)
at com.worksap.nlp.sudachi.JapaneseTokenizer.tokenize(JapaneseTokenizer.java:81)
at com.worksap.nlp.sudachi.SudachiCommandLine.run(SudachiCommandLine.java:43)
at com.worksap.nlp.sudachi.SudachiCommandLine.main(SudachiCommandLine.java:148)
In this example, it seems that the node containing the character "d" was not generated, so the path cannot reach the EOS node.
「にぎり寿司三億年」
=== Before rewriting:
0: 0 24 にぎり寿司三億年(1634131) 2 5142 5154 13355
=== After rewriting:
0: 0 9 にぎり(136394) 4 0 0 0
1: 9 15 寿司(700728) 4 0 0 0
2: 15 18 三(316771) 5 0 0 0
3: 18 21 億(434255) 5 0 0 0
4: 21 24 年(772367) 9 0 0 0
===
にぎり 名詞,普通名詞,一般,*,*,* 握り にぎり ニギリ 0
寿司 名詞,普通名詞,一般,*,*,* 寿司 寿司 ズシ 0
三 名詞,数詞,*,*,*,* 三 三 サン 0
億 名詞,数詞,*,*,*,* 億 億 オク 0
年 名詞,普通名詞,助数詞可能,*,*,* 年 年 ネン 0
In a user dictionary, splitting information can currently reference only words from the same user dictionary. Allow it to reference system dictionary words as well.
SudachiCommandLine outputs empty-surface morpheme(s) before "´" or "…", depending on the following context. The ACCENT character seems to cause the buggy behaviour, while the HORIZONTAL ELLIPSIS character is normalized to three DOT characters. I think the possibility of a morpheme having a zero-length surface should be noted in README.md.
java -jar sudachi-0.1.2-SNAPSHOT.jar -a -m A
´
´ 補助記号,一般,*,*,*,* ´ ´ キゴウ 0
EOS
´´
空白,*,*,*,*,* キゴウ 0
´ 補助記号,一般,*,*,*,* -1 (OOV)
´ 補助記号,一般,*,*,*,* ´ ´ キゴウ 0
EOS
´。
空白,*,*,*,*,* キゴウ 0
´ 補助記号,一般,*,*,*,* -1 (OOV)
。 補助記号,句点,*,*,*,* 。 。 キゴウ 0
EOS
´あ
´ 補助記号,一般,*,*,*,* ´ ´ キゴウ 0
あ 感動詞,一般,*,*,*,* あっ あ ア 0
EOS
…
補助記号,句点,*,*,*,* . . キゴウ 0
補助記号,句点,*,*,*,* . . キゴウ 0
… 補助記号,句点,*,*,*,* . . キゴウ 0
EOS
Thanks for developing this tool. I tried to install Sudachi on a CentOS 6 server. I ran 'mvn package' but ended up with the following message. The files downloaded to './target/dictionary/unidic-mecab-2.1.2_src' seem to be OK. Is there any specific file I should check?
[INFO] --- iterator-maven-plugin:0.5.1:iterator (build-system-dictionary) @ sudachi ---
[INFO] ------ (core) org.codehaus.mojo:exec-maven-plugin:1.6.0:java
reading the source file...Error: invalid format at line 1
[WARNING]
java.lang.IllegalArgumentException: invalid format
at com.worksap.nlp.sudachi.dictionary.DictionaryBuilder.buildLexicon (DictionaryBuilder.java:114)
at com.worksap.nlp.sudachi.dictionary.DictionaryBuilder.build (DictionaryBuilder.java:89)
at com.worksap.nlp.sudachi.dictionary.DictionaryBuilder.main (DictionaryBuilder.java:432)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:497)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run (ExecJavaMojo.java:282)
at java.lang.Thread.run (Thread.java:745)
The current implementation uses System.out and System.err to output debug information, but this should be avoided in production code. Writing directly to stdout and/or stderr makes system maintenance difficult: it supports no filtering by log level or class, and it is hard to apply log rotation and other log management mechanisms.
I cannot judge which logger API we should use; it is probably one of the following:
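Whatever API is picked, even the JDK's built-in java.util.logging covers the level-filtering concern; a minimal sketch of replacing the direct System.err calls (the logger name is an assumption):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LoggingSketch {
    // One logger per subsystem allows filtering by class or package name.
    static final Logger LOG = Logger.getLogger("com.worksap.nlp.sudachi");

    public static void main(String[] args) {
        LOG.setLevel(Level.WARNING);
        LOG.fine("reading the source file...");   // suppressed at WARNING level
        LOG.warning("invalid format at line 1");  // emitted
        System.out.println(LOG.isLoggable(Level.FINE));    // false
        System.out.println(LOG.isLoggable(Level.WARNING)); // true
    }
}
```

Log rotation and output redirection can then be configured through handlers instead of code changes.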
In the zip file, we can find both sudachi-0.0.1-SNAPSHOT.jar and sudachi-0.1-SNAPSHOT.jar. We don't need to package two jar files, so it's better to delete one of them.
Right now, if a resource does not exist, Sudachi throws a NullPointerException, which is unhelpful. Instead, it should report the exact configuration string and all the paths that were actually tried.
Normalize only numerical expression
加藤 名詞,固有名詞,人名,姓,*,* 加藤
一二三 名詞,固有名詞,人名,名,*,* 123
Change OOV to person name by context (part-of-speech or title)
ご期待くださいーー!!
ご 接頭辞,*,*,*,*,* 御
期待 名詞,普通名詞,サ変可能,*,*,* 期待
くださ 動詞,一般,*,*,五段-サ行,未然形-一般 下す
いーー 感動詞,フィラー,*,*,*,* いー
! 補助記号,句点,*,*,*,* !
! 補助記号,句点,*,*,*,* !
Expected result:
ご期待くださいーー!!
ご 接頭辞,*,*,*,*,* 御
期待 名詞,普通名詞,サ変可能,*,*,* 期待
くださいーー 動詞,非自立可能,*,*,五段-ラ行,命令形 下さる
! 補助記号,句点,*,*,*,* !
! 補助記号,句点,*,*,*,* !
Hi there. Thank you for your nice implementation.
I have been interested in double-array trie implementations for years, and previously studied a similar darts-clone-java implementation, which works well.
But it seems that your implementation can't handle keys containing negative bytes. Example:
// requires: java.nio.charset.StandardCharsets, java.util.Arrays, org.junit.Assert.assertEquals
public void testRaw() throws Exception
{
byte[][] keys = new byte[3][];
keys[0] = new byte[]{1};
keys[1] = new byte[]{1, 2, -1};
keys[2] = "東京都".getBytes(StandardCharsets.UTF_8);
DoubleArray doubleArray = new DoubleArray();
doubleArray.build(keys, new int[]{0, 1, 2}, null);
for (int i = 0; i < 3; ++i)
{
assertEquals(i, doubleArray.exactMatchSearch(keys[i])[0]);
System.out.printf("Good for %s\n", Arrays.toString(keys[i]));
}
}
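If the cause is Java's signed byte, the usual remedy when indexing trie transitions is to mask each key byte into the 0..255 range; a sketch of the masking (not the actual DoubleArray code):

```java
public class ByteMask {
    // Maps a signed Java byte to the unsigned 0..255 range
    // expected by a double-array transition table.
    public static int toUnsigned(byte b) {
        return b & 0xff;
    }

    public static void main(String[] args) {
        System.out.println(toUnsigned((byte) -1)); // 255, a valid index
        System.out.println(toUnsigned((byte) 1));  // 1
    }
}
```

Without the mask, a key byte of -1 would produce a negative transition index and the lookup fails.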
By the way, your repo is over its data quota.
~ git clone https://github.com/WorksApplications/Sudachi.git
Cloning into 'Sudachi'...
remote: Counting objects: 2353, done.
remote: Compressing objects: 100% (955/955), done.
remote: Total 2353 (delta 754), reused 2307 (delta 753), pack-reused 0
Receiving objects: 100% (2353/2353), 306.04 KiB | 86.00 KiB/s, done.
Resolving deltas: 100% (754/754), done.
Checking connectivity... done.
Downloading dictionary/sudachi_lex.csv (517 MB)
Error downloading object: dictionary/sudachi_lex.csv (acabc71): Smudge error: Error downloading dictionary/sudachi_lex.csv (acabc71c0b63936801fbdaf1749f8b91e7fa8b15532461e05fec9b98f5f474db): batch response: This repository is over its data quota. Purchase more data packs to restore access.
So I can't download the csv file or debug into your com.worksap.nlp.sudachi.dictionary.DictionaryBuilder. Is there any magic inside the DictionaryBuilder?
I tried to start Sudachi this way but ended up like this.
target]$ java -jar sudachi-0.1.1-SNAPSHOT.jar
Exception in thread "main" java.lang.IllegalArgumentException: oovPOS is invalid:????,??,*,*,*,*
at com.worksap.nlp.sudachi.SimpleOovProviderPlugin.setUp(SimpleOovProviderPlugin.java:63)
at com.worksap.nlp.sudachi.JapaneseDictionary.<init>(JapaneseDictionary.java:89)
at com.worksap.nlp.sudachi.JapaneseDictionary.<init>(JapaneseDictionary.java:56)
at com.worksap.nlp.sudachi.DictionaryFactory.create(DictionaryFactory.java:44)
at com.worksap.nlp.sudachi.SudachiCommandLine.main(SudachiCommandLine.java:135)
I've checked the content of ./src/main/resources/sudachi.json and the oovProviderPlugin part seems OK. (The ???? in the message appear to be Japanese characters garbled by the console encoding.)
Why: use a 0xf bit pattern as a marker for OOV words, to reduce the size of LatticeNodeImpl.
This project depends on UniDic, but its distribution contains no LICENSE file for it. Note that UniDic is triple-licensed (GPL/LGPL/BSD).
When trying to analyze text containing an emoji after a "。", a java.lang.StringIndexOutOfBoundsException is thrown.
$ echo "。😀" | java -jar ./sudachi-0.4.2.jar
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 3
at java.lang.String.substring(String.java:1963)
at com.worksap.nlp.sudachi.UTF8InputText.getSubstring(UTF8InputText.java:82)
at com.worksap.nlp.sudachi.MeCabOovProviderPlugin.provideOOV(MeCabOovProviderPlugin.java:101)
at com.worksap.nlp.sudachi.OovProviderPlugin.getOOV(OovProviderPlugin.java:75)
at com.worksap.nlp.sudachi.JapaneseTokenizer.buildLattice(JapaneseTokenizer.java:220)
at com.worksap.nlp.sudachi.JapaneseTokenizer.tokenizeSentence(JapaneseTokenizer.java:162)
at com.worksap.nlp.sudachi.JapaneseTokenizer.tokenizeSentences(JapaneseTokenizer.java:93)
at com.worksap.nlp.sudachi.SudachiCommandLine.run(SudachiCommandLine.java:66)
at com.worksap.nlp.sudachi.SudachiCommandLine.main(SudachiCommandLine.java:218)
For plugin configuration use the following strategy.
新任1人を含む6人の取締役の選任など3議案を原案通りに可決した
Could someone tell me why 1人 in the sentence above is tokenized as a single token 1人, while 6人 is split into 6 and 人?
output
// surface: 新任, normalized: 新任, part of speech: 名詞,普通名詞,サ変可能,*,*,*, read: シンニン
// surface: 1人, normalized: 一人, part of speech: 名詞,普通名詞,副詞可能,*,*,*, read: ヒトリ
// surface: を, normalized: を, part of speech: 助詞,格助詞,*,*,*,*, read: ヲ
// surface: 含む, normalized: 含む, part of speech: 動詞,一般,*,*,五段-マ行,連体形-一般, read: フクム
// surface: 6, normalized: 6, part of speech: 名詞,数詞,*,*,*,*, read: ロク
// surface: 人, normalized: 人, part of speech: 接尾辞,名詞的,一般,*,*,*, read: ニン
// ...
We plan to introduce dictionary build warnings, which will not abort the building of the dictionary, but will report that something was not good.
Warning-producing checks will be optional, but enabled by default.
Proposed list of warnings:
Should other OOV plugins have the same feature?
Hi. I want to create a multi-threaded program using Sudachi. Which class instances can be shared between threads, and should I create a separate instance for each thread?
I want to know about the following classes:
Dictionary (JapaneseDictionary)
Tokenizer (JapaneseTokenizer)
Update: I checked the implementation and found that JapaneseTokenizer mutates its member variables, so the tokenizer is not thread-safe.
Thanks.
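Given that, a common pattern is to share one Dictionary and give each thread its own Tokenizer, for example via ThreadLocal. A generic sketch of the pattern (MutableTokenizer is a stand-in for illustration, not the Sudachi API):

```java
import java.util.function.Supplier;

public class PerThreadInstances {
    // Stand-in for a class that mutates its own state and is
    // therefore not safe to share between threads.
    public static class MutableTokenizer {
        public int calls = 0;
        public String tokenize(String text) { calls++; return text; }
    }

    // Each thread lazily receives its own instance from the factory.
    public static ThreadLocal<MutableTokenizer> perThread(Supplier<MutableTokenizer> factory) {
        return ThreadLocal.withInitial(factory::get);
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadLocal<MutableTokenizer> tokenizers = perThread(MutableTokenizer::new);
        Runnable task = () -> tokenizers.get().tokenize("東京都へ行く");
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // The main thread's instance was never touched by t1 or t2.
        System.out.println(tokenizers.get().calls); // 0
    }
}
```

With Sudachi this would mean one shared JapaneseDictionary and something like ThreadLocal.withInitial(dictionary::create) for the tokenizers, assuming the Dictionary itself is safe to share.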
$ echo "愛゛の゛ム゛チ゛" | java -jar sudachi-0.1.1-SNAPSHOT.jar
愛 名詞,普通名詞,一般,*,*,* 愛
空白,*,*,*,*,*
゛の 名詞,普通名詞,一般,*,*,* ゙の
゛ 補助記号,一般,*,*,*,* ゛
ム 接頭辞,*,*,*,*,* ム
゛ 補助記号,一般,*,*,*,* ゛
チ゛ 名詞,普通名詞,一般,*,*,* ヂ
EOS
We want to analyze 「行(おこな)う」 as 「行う」, because 「おこな」 is the pronunciation (yomigana).
if there any Utility to generate the “User dictionary source File” from a raw file ,which has Sentence and its Tokens and POS Mapping for Each Token .
I mean if we have Token and POS mapping , if there any easy way to generate the “User dictionary source File”
For Example , if we have a file as below , or any similar format, can we generate the “User dictionary source File”
The current behavior:
> echo "東京都 へ行く" | java -jar target/sudachi-0.5.2.jar
東京都 名詞,固有名詞,地名,一般,*,* 東京都
空白,*,*,*,*,*
EOS
へ 助詞,格助詞,*,*,*,* へ
行く 動詞,非自立可能,*,*,五段-カ行,終止形-一般 行く
EOS
We want it to return the full text when it is shorter than the buffer size, like:
> echo "東京都 へ行く" | java -jar target/sudachi-0.5.2.jar
東京都 名詞,固有名詞,地名,一般,*,* 東京都
空白,*,*,*,*,*
へ 助詞,格助詞,*,*,*,* へ
行く 動詞,非自立可能,*,*,五段-カ行,終止形-一般 行く
EOS
The rows and columns of the connection matrix look transposed, and I think setConnectionCost in GrammarImpl contains a bug.
1. setConnectionCost looks properly used (rightId is right, leftId is left).
2. getConnectionCost looks improperly used (rightId is left, leftId is right).
setConnectionCost is not frequently used, so the effect is small.
It's better to make 1 and 2 consistent and agree on the shape of the connection matrix. Of course, the format of the dictionary file must be kept.
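The essential requirement can be sketched with a tiny matrix class: the setter and getter must place leftId and rightId in the same roles of one indexing formula, otherwise every lookup silently reads the transposed cell (the formula below is illustrative, not Sudachi's actual layout):

```java
public class ConnectionMatrix {
    private final short[] costs;
    private final int leftSize;

    public ConnectionMatrix(int leftSize, int rightSize) {
        this.leftSize = leftSize;
        this.costs = new short[leftSize * rightSize];
    }

    // set and get share one formula, so a value written for
    // (leftId, rightId) is read back from the same cell.
    public void setCost(int leftId, int rightId, short cost) {
        costs[leftId + rightId * leftSize] = cost;
    }

    public short getCost(int leftId, int rightId) {
        return costs[leftId + rightId * leftSize];
    }

    public static void main(String[] args) {
        ConnectionMatrix m = new ConnectionMatrix(3, 3);
        m.setCost(1, 2, (short) 42);
        System.out.println(m.getCost(1, 2)); // 42
        System.out.println(m.getCost(2, 1)); // 0: the transposed cell is a different entry
    }
}
```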
Known words and OOV words are segmented differently even when their word structures are the same.
全国的 名詞,普通名詞,形状詞可能,*,*,* 全国的
間接 名詞,普通名詞,一般,*,*,* 間接
的 接尾辞,形状詞的,*,*,*,* 的
Adjust them with a PathRewritePlugin.
Thank you for publishing a great tool.
I got this error while trying git lfs pull.
$ git lfs pull
Git LFS: (0 of 1 files) 0 B / 493.43 MB
batch response: This repository is over its data quota. Purchase more data packs to restore access.
error: failed to fetch some objects from 'https://github.com/WorksApplications/Sudachi.git/info/lfs'
What are the different elements returned in the parts of speech?
For example, on tokenizing 太郎
we get:
太郎 名詞,固有名詞,人名,名,*,* 太郎
I want to understand the various parts under:
名詞,固有名詞,人名,名,*,*
Does each of these comma-separated values have a descriptive name?
Sorry if it is mentioned somewhere in the code; I couldn't find it!
And, thank you for making such an awesome package and making it open source!!
The Slack invitation link in the README has expired.
I think the link in the SudachiPy repository has also expired.
Currently, unknown (OOV) words have an empty string ("") as their reading form.
Is there a way to make unknown words return their own surface as the reading form?
example:
text: サンドウィッチマン ライブ
output:
text POS Base_form Reading_form
サンドウィッチマン 名詞-普通名詞-一般 サンドウィッチマン
ライブ 名詞-普通名詞-一般 ライブ ライブ
expected:
text POS Base_form Reading_form
サンドウィッチマン 名詞-普通名詞-一般 サンドウィッチマン サンドウィッチマン
ライブ 名詞-普通名詞-一般 ライブ ライブ
I suggest adding a flag to SimpleOovProviderPlugin to make unknown words return their own surface as the reading form:
@Override
public List<LatticeNode> provideOOV(InputText inputText, int offset, boolean hasOtherWords) {
if (!hasOtherWords) {
LatticeNode node = createNode();
node.setParameter(leftId, rightId, cost);
int length = inputText.getWordCandidateLength(offset);
String s = inputText.getSubstring(offset, offset + length);
// the last argument is presumably the reading form; the proposed flag would pass s here instead of ""
WordInfo info = new WordInfo(s, (short) length, oovPOSId, s, s, "");
node.setWordInfo(info);
return Collections.singletonList(node);
} else {
return Collections.emptyList();
}
}
explosion/spaCy#3756 (comment)
Asking the PyPI organization to allow a 60 MB limit exception for the full and core dictionaries.
This issue is heavily related to https://github.com/WorksApplications/SudachiDict
I don't know whether it's an error on my part or the dictionary files aren't working. If it is my fault, I'm sorry; I didn't understand how the dictionaries work, and thought it was just a matter of loading them to use the tokenizer.
I am trying to use Sudachi in my program to generate phrases in Anki for my studies.
When reading the "system_small.dic" or "system_core.dic" files, I get the following message in Eclipse. I downloaded the files found on the home page, and a test from the command line produces the same error.
sudachi-dictionary-20200330-small.zip
sudachi-dictionary-20200330-core.zip
Eclipse:
Exception in thread "JavaFX Application Thread" java.lang.IllegalArgumentException: javax.json.stream.JsonParsingException: Unexpected char 65.533 at (line no=1, column no=1, offset=0)
at [email protected]/com.worksap.nlp.sudachi.Settings.parseSettings(Settings.java:115)
at [email protected]/com.worksap.nlp.sudachi.JapaneseDictionary.buildSettings(JapaneseDictionary.java:97)
at [email protected]/com.worksap.nlp.sudachi.JapaneseDictionary.(JapaneseDictionary.java:52)
at [email protected]/com.worksap.nlp.sudachi.JapaneseDictionary.(JapaneseDictionary.java:48)
at [email protected]/com.worksap.nlp.sudachi.DictionaryFactory.create(DictionaryFactory.java:47)
at TextosJapones/org.jisho.textosJapones.controller.TelaProcessarFrasesController.processaTexto(TelaProcessarFrasesController.java:217)
at TextosJapones/org.jisho.textosJapones.controller.TelaProcessarFrasesController.lambda$4(TelaProcessarFrasesController.java:454)
at javafx.base/com.sun.javafx.binding.ExpressionHelper$Generic.fireValueChangedEvent(ExpressionHelper.java:360)
at javafx.base/com.sun.javafx.binding.ExpressionHelper.fireValueChangedEvent(ExpressionHelper.java:80)
at javafx.base/javafx.beans.property.ReadOnlyBooleanPropertyBase.fireValueChangedEvent(ReadOnlyBooleanPropertyBase.java:72)
at javafx.graphics/javafx.scene.Node$FocusedProperty.notifyListeners(Node.java:8159)
at javafx.graphics/javafx.scene.Scene$12.invalidated(Scene.java:2158)
at javafx.base/javafx.beans.property.ObjectPropertyBase.markInvalid(ObjectPropertyBase.java:112)
at javafx.base/javafx.beans.property.ObjectPropertyBase.set(ObjectPropertyBase.java:147)
at javafx.graphics/javafx.scene.Scene$KeyHandler.setFocusOwner(Scene.java:4030)
at javafx.graphics/javafx.scene.Scene$KeyHandler.requestFocus(Scene.java:4077)
at javafx.graphics/javafx.scene.Scene.requestFocus(Scene.java:2125)
at javafx.graphics/javafx.scene.Node.requestFocus(Node.java:8320)
at javafx.controls/com.sun.javafx.scene.control.behavior.TextAreaBehavior.mousePressed(TextAreaBehavior.java:264)
at javafx.controls/javafx.scene.control.skin.TextAreaSkin$ContentView.lambda$new$0(TextAreaSkin.java:1201)
at javafx.base/com.sun.javafx.event.CompositeEventHandler$NormalEventHandlerRecord.handleBubblingEvent(CompositeEventHandler.java:218)
at javafx.base/com.sun.javafx.event.CompositeEventHandler.dispatchBubblingEvent(CompositeEventHandler.java:80)
at javafx.base/com.sun.javafx.event.EventHandlerManager.dispatchBubblingEvent(EventHandlerManager.java:238)
at javafx.base/com.sun.javafx.event.EventHandlerManager.dispatchBubblingEvent(EventHandlerManager.java:191)
at javafx.base/com.sun.javafx.event.CompositeEventDispatcher.dispatchBubblingEvent(CompositeEventDispatcher.java:59)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:58)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.BasicEventDispatcher.dispatchEvent(BasicEventDispatcher.java:56)
at javafx.base/com.sun.javafx.event.EventDispatchChainImpl.dispatchEvent(EventDispatchChainImpl.java:114)
at javafx.base/com.sun.javafx.event.EventUtil.fireEventImpl(EventUtil.java:74)
at javafx.base/com.sun.javafx.event.EventUtil.fireEvent(EventUtil.java:54)
at javafx.base/javafx.event.Event.fireEvent(Event.java:198)
at javafx.graphics/javafx.scene.Scene$MouseHandler.process(Scene.java:3862)
at javafx.graphics/javafx.scene.Scene.processMouseEvent(Scene.java:1849)
at javafx.graphics/javafx.scene.Scene$ScenePeerListener.mouseEvent(Scene.java:2590)
at javafx.graphics/com.sun.javafx.tk.quantum.GlassViewEventHandler$MouseEventNotification.run(GlassViewEventHandler.java:409)
at javafx.graphics/com.sun.javafx.tk.quantum.GlassViewEventHandler$MouseEventNotification.run(GlassViewEventHandler.java:299)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at javafx.graphics/com.sun.javafx.tk.quantum.GlassViewEventHandler.lambda$handleMouseEvent$2(GlassViewEventHandler.java:447)
at javafx.graphics/com.sun.javafx.tk.quantum.QuantumToolkit.runWithoutRenderLock(QuantumToolkit.java:411)
at javafx.graphics/com.sun.javafx.tk.quantum.GlassViewEventHandler.handleMouseEvent(GlassViewEventHandler.java:446)
at javafx.graphics/com.sun.glass.ui.View.handleMouseEvent(View.java:556)
at javafx.graphics/com.sun.glass.ui.View.notifyMouse(View.java:942)
at javafx.graphics/com.sun.glass.ui.win.WinApplication._runLoop(Native Method)
at javafx.graphics/com.sun.glass.ui.win.WinApplication.lambda$runLoop$3(WinApplication.java:174)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: javax.json.stream.JsonParsingException: Unexpected char 65.533 at (line no=1, column no=1, offset=0)
at org.glassfish.json.JsonTokenizer.unexpectedChar(JsonTokenizer.java:601)
at org.glassfish.json.JsonTokenizer.nextToken(JsonTokenizer.java:418)
at org.glassfish.json.JsonParserImpl$NoneContext.getNextEvent(JsonParserImpl.java:413)
at org.glassfish.json.JsonParserImpl.next(JsonParserImpl.java:363)
at org.glassfish.json.JsonReaderImpl.read(JsonReaderImpl.java:90)
at [email protected]/com.worksap.nlp.sudachi.Settings.parseSettings(Settings.java:106)
... 57 more
related explosion/spaCy#3756 (comment)
echo "東京都 へ 行く" | java -jar target/sudachi-0.3.0.jar
東京都 名詞,固有名詞,地名,一般,*,* 東京都
空白,*,*,*,*,*
空白,*,*,*,*,*
へ 助詞,格助詞,*,*,*,* へ
空白,*,*,*,*,*
行く 動詞,非自立可能,*,*,五段-カ行,終止形-一般 行く
Is this the expected result? Multiple blanks are parsed as multiple 空白,*,*,*,*,*.
The first column of a user dictionary source is the headword used for the TRIE.
Because input texts are normalized by DefaultInputTextPlugin, the headwords must be normalized in the same way.
Either normalize the headwords with DefaultInputTextPlugin when building the dictionary, or add a caution to the documentation.
Hi, as you may know, GitHub recently started providing a dependency checker for Ruby and Node; however, Java (Maven) is not supported yet.
If you want to keep your dependencies updated, I think https://dependabot.com/ can be a good solution. If you need it, please let me know and I'll enable it in this repository.
GitHub also recently introduced a platform for bots; if you want to introduce more of them, such as WIP, you are welcome to install them too.
The current version of Sudachi is 0.1-SNAPSHOT (maybe major = 0, minor = 1), while the version of elasticsearch-sudachi is 1.0.0-SNAPSHOT (major = 1, minor = 0, patch = 0). They seem to use different versioning policies, so I recommend unifying them.
And if you follow semver 2.0, I recommend changing 1.0.0-SNAPSHOT to 0.1.0-SNAPSHOT. 0.x.y is known as the initial development phase, so you can still break backward compatibility.
This functionality will be removed in 1.0 as nobody seems to be using it.
Please comment here if you are actually using it and do not want to have it removed.
If the dictionary file is kept inside a resource jar, then MMap cannot read it.
Could this be modified to support BinaryDictionary.class.getClassLoader().getResourceAsStream("system_core.dic")?
Then from the InputStream we can get a ByteBuffer using logic like the following:
ByteBuffer byteBuffer = ByteBuffer.allocate(initialStream.available()); // note: available() may not report the full size for every stream
ReadableByteChannel channel = Channels.newChannel(initialStream);
IOUtils.readFully(channel, byteBuffer); // org.apache.commons.io.IOUtils