Comments (14)
Update: I get the same exception using the jumandic version, so it's probably more related to kuromoji-core, which is where the IntegerArrayIO.readArray() is being handled.
from kuromoji.
Hi hk0i,
I'm having a look at this issue for you.
If you are testing using an Android Virtual Device, could you please give me an overview of its AVD configuration (version and memory size etc). If you are testing on a real device, a broad overview of that device would be very helpful.
from kuromoji.
I managed to reproduce the problem - it exists in Kuromoji 1.0 too.
It is caused by the nio wrappers around an InputStream altering its offset in a manner that differs from the regular JVM's nio implementation, presumably due to buffering and readaheads. When loading the WordIdMap, this causes the reading of the second of two arrays read from the same InputStream to fail.
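To make the failure mode concrete, here is a minimal sketch (not the Kuromoji code itself) of two length-prefixed int arrays being written to and read back from a single stream. The names `TwoArraysDemo`, `writeArray` and `readArray` are made up for illustration. Each array is read with `DataInputStream.readFully()`, which consumes exactly the requested number of bytes, so the second array starts at the expected offset no matter how the platform buffers the underlying stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class TwoArraysDemo {
    // Writes a 4-byte length followed by the values (big-endian, as DataOutputStream does).
    static void writeArray(DataOutputStream out, int[] values) throws IOException {
        out.writeInt(values.length);
        for (int v : values) {
            out.writeInt(v);
        }
    }

    // Reads one length-prefixed array, consuming exactly 4 + 4 * length bytes.
    static int[] readArray(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] raw = new byte[length * Integer.BYTES];
        in.readFully(raw); // returns only once every byte is read, else throws EOFException
        int[] result = new int[length];
        ByteBuffer.wrap(raw).asIntBuffer().get(result); // ByteBuffer is big-endian by default
        return result;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        writeArray(out, new int[] {1, 2, 3});
        writeArray(out, new int[] {40, 50});
        out.flush();

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        int[] first = readArray(in);
        int[] second = readArray(in); // only correct if the first read stopped at the right offset
        System.out.println(Arrays.toString(first) + " " + Arrays.toString(second));
    }
}
```

If the first read consumed more or fewer bytes than its array occupies, the second `readInt()` would pick up a garbage length, which matches the symptom described above.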
I have a workaround and have managed to tokenize a sentence successfully in an Android app. I'll improve the performance of this workaround and submit a patch shortly.
from kuromoji.
Thanks for the detailed analysis @hk0i and for the fix @gerryhocks. The fix has been merged to master, but I think this fix could warrant a 0.9.1 release to make Android users happy.
from kuromoji.
Sorry I couldn't get back to you sooner @gerryhocks. Looks like you figured it out either way.
For the record I was testing on an LG Nexus 5X, but as you said the issue was more generic than device-specific. Thanks for all the hard work everyone.
I was able to create a similar patch on my own as a quick temporary workaround, but the performance wasn't great, so I'd like to see what you came up with. Tokenization on my 5X takes around 2-3 seconds, so I have to do it asynchronously.
I'll dig through the revision history on master to see if I can find it.
from kuromoji.
No problem @hk0i - the problem manifested on my first attempt, even on an x86 image. Perhaps I was premature in asking for extra details.
The changes are not overly extensive and in the same flavour as the earlier version so should be easy to spot.
Out of interest, how much text are you tokenizing when you experience 2-3 seconds of processing?
from kuromoji.
I just took a look at your changes and—from memory—the hack that I put in to get things "working" was very different.
As for the amount of text to tokenize—not much. From my experience short sentences and long paragraphs were taking close enough to the same amount of time to tokenize.
The following text (on my implementation, not the fix that was just added) takes 2.52 seconds:
新しい言葉使いましょう
However, something longer—an excerpt from the Wikipedia entry on 日本語—takes 2.40 seconds:
日本語(にほんご、にっぽんご)は、主に日本国内や日本人同士の間で使われている言語である。日本は法令によって
Again this is not the official change from upstream but a different quick fix I found for this in another ticket.
I'll be able to pull the official kuromoji change from master and test the performance of that change later today within the next 5 hours or so.
from kuromoji.
Understood. Thanks!
Would I be right in assuming that the timing includes the creation of a Tokenizer, followed by the tokenize() call itself?
from kuromoji.
That's right, I'm also iterating over the tokens with a foreach loop, something like this:
Tokenizer tokenizer = new Tokenizer.Builder().build();
for (Token token : tokenizer.tokenize(mInputText)) {
    /* store tokens in a list... */
}
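Since the question above is whether the 2-3 seconds covers construction or tokenization, one way to find out is to time the two phases separately. The following is a rough sketch under stated assumptions: `PhaseTimer`, `Timed` and `time` are hypothetical helper names, and `Pattern.compile` / `split` are only stand-ins so the example runs without Kuromoji on the classpath. In the app, the two lambdas would wrap `new Tokenizer.Builder().build()` and `tokenizer.tokenize(mInputText)` instead.

```java
import java.util.function.Supplier;
import java.util.regex.Pattern;

public class PhaseTimer {
    // A computed value together with the wall-clock time it took to produce.
    static final class Timed<T> {
        final T value;
        final long millis;

        Timed(T value, long millis) {
            this.value = value;
            this.millis = millis;
        }
    }

    // Runs the given work once and records the elapsed time in milliseconds.
    static <T> Timed<T> time(Supplier<T> work) {
        long start = System.nanoTime();
        T value = work.get();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Timed<>(value, elapsedMillis);
    }

    public static void main(String[] args) {
        // Stand-in for the expensive construction step.
        Timed<Pattern> build = time(() -> Pattern.compile("\\s+"));
        // Stand-in for the per-input processing step.
        Timed<String[]> run = time(() -> build.value.split("one two three"));
        System.out.println("build: " + build.millis + " ms, run: " + run.millis
                + " ms, parts: " + run.value.length);
    }
}
```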
from kuromoji.
I just gave it a new shot using the code from your fix, @gerryhocks. It actually seems to be a few hundred milliseconds slower than the hot-fix I was using prior.
Testing with the same two sentences this is what I get...
Around 2.7–2.8 seconds for the first sentence:
新しい言葉使いましょう
and roughly the same amount of time for the second sentence (2.75–2.90s):
日本語(にほんご、にっぽんご)は、主に日本国内や日本人同士の間で使われている言語である。日本は法令によって
The hot fix I was using was less centralized than what you did and more focused on IntegerArrayIO.java specifically:
diff --git a/kuromoji-core/src/main/java/com/atilika/kuromoji/io/IntegerArrayIO.java b/kuromoji-core/src/main/java/com/atilika/kuromoji/io/IntegerArrayIO.java
index 55ab53e..862222d 100644
--- a/kuromoji-core/src/main/java/com/atilika/kuromoji/io/IntegerArrayIO.java
+++ b/kuromoji-core/src/main/java/com/atilika/kuromoji/io/IntegerArrayIO.java
@@ -36,8 +36,9 @@ public class IntegerArrayIO {
int length = dataInput.readInt();
ByteBuffer tmpBuffer = ByteBuffer.allocate(length * INT_BYTES);
- ReadableByteChannel channel = Channels.newChannel(dataInput);
- channel.read(tmpBuffer);
+ byte[] buffer = new byte[tmpBuffer.remaining()];
+ dataInput.readFully(buffer);
+ tmpBuffer.put(buffer, 0, buffer.length);
tmpBuffer.rewind();
IntBuffer intBuffer = tmpBuffer.asIntBuffer();
I'm guessing these costs become more time-critical on mobile devices; on my laptop the time taken for tokenization is not even noticeable.
Edit: Updated times for the second sentence. Could it possibly be object creation overhead? I think your solution also added some more loops, but I'm not sure how the internals of readFully() work.
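On the `readFully()` question: its documented contract is to keep reading until the destination array is completely filled, throwing `EOFException` if the stream ends first. The sketch below paraphrases that contract in plain Java. It is not the actual JDK or Android source, and `ReadFullySketch` and `TrickleStream` are illustrative names. `TrickleStream` deliberately hands back at most three bytes per call to show why a single `read()` cannot be relied on to fill a buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class ReadFullySketch {
    // Test helper: returns at most 3 bytes per read() call to simulate short reads.
    static final class TrickleStream extends FilterInputStream {
        TrickleStream(InputStream in) {
            super(in);
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, 3));
        }
    }

    // Loops until dest is full; fails loudly if the stream ends early.
    static void readFully(InputStream in, byte[] dest) throws IOException {
        int filled = 0;
        while (filled < dest.length) {
            int n = in.read(dest, filled, dest.length - filled);
            if (n < 0) {
                throw new EOFException("stream ended after " + filled + " bytes");
            }
            filled += n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] source = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        byte[] dest = new byte[source.length];
        readFully(new TrickleStream(new ByteArrayInputStream(source)), dest);
        System.out.println(Arrays.equals(source, dest));
    }
}
```

The loop itself is cheap. In the hot fix above, the extra cost per array is one temporary `byte[]` allocation plus the copy into the `ByteBuffer`.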
from kuromoji.
Thanks again for the detailed info @hk0i.
If you were using the complete master source (1.0), it's worth noting that it uses a different underlying data structure (an FST), whereas the 0.9 release uses a Double Array Trie.
The new structure has a very small performance difference when loading the default dictionaries, but is significantly faster when loading user dictionaries.
These changes could account for the difference in times you are experiencing.
The instantiation of the Tokenizer is going to take most of the time in your example. Depending on how you use the Tokenizer, it might be worth considering a static instance or a singleton so that the Tokenizer is only created once within the app. Naturally, this will prevent the tokenizer from being garbage collected, but it might be a reasonable trade-off for you.
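One possible shape for the "create once, reuse" suggestion is a small lazy holder. This is only a sketch: `Lazy` is a hypothetical helper, not part of Kuromoji, and the demo uses a `String` factory so it runs without the library. In the app, the factory would be something like `() -> new Tokenizer.Builder().build()`, using the builder call quoted earlier in this thread.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public final class Lazy<T> {
    private final Supplier<T> factory;
    private volatile T instance;

    public Lazy(Supplier<T> factory) {
        this.factory = factory;
    }

    // Builds the value on first use and returns the same instance afterwards.
    public T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    local = factory.get();
                    instance = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        AtomicInteger builds = new AtomicInteger();
        // Stand-in for an expensive object such as a tokenizer.
        Lazy<String> holder = new Lazy<>(() -> {
            builds.incrementAndGet();
            return "expensive";
        });
        holder.get();
        holder.get();
        System.out.println("factory calls: " + builds.get());
    }
}
```

Holding the `Lazy` in a `static final` field gives the trade-off described above: the first `get()` pays the construction cost, later calls do not, and the instance stays reachable for the life of the process.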
from kuromoji.
Thanks again @gerryhocks.
I gave it another shot with a static instance and the performance improves significantly: it went from 2-3 seconds for the first instantiation to 0-100 ms when reusing the same instance. I'll probably go forward with this.
from kuromoji.
Glad to hear you've found a workable solution, @hk0i. Best of luck with your project.
from kuromoji.
Thanks a lot for sorting this out. I'll close the issue.
from kuromoji.