Comments (14)

hk0i avatar hk0i commented on May 18, 2024

Update: I get the same exception using the jumandic version, so it's probably related to kuromoji-core, which is where the IntegerArrayIO.readArray() call is handled.

from kuromoji.

gerryhocks avatar gerryhocks commented on May 18, 2024

Hi hk0i,

I'm having a look at this issue for you.

If you are testing using an Android Virtual Device, could you please give me an overview of its AVD configuration (version and memory size etc). If you are testing on a real device, a broad overview of that device would be very helpful.

gerryhocks avatar gerryhocks commented on May 18, 2024

I managed to reproduce the problem - it exists in Kuromoji 1.0 too.

It is caused by Android's nio wrappers around an InputStream altering its offset in a manner that differs from the regular JVM's nio implementation, presumably due to buffering and read-aheads. When loading the WordIdMap, this causes the read of the second of two arrays from the same InputStream to fail.

I have a workaround and have managed to tokenize a sentence successfully in an Android app. I'll improve its performance and submit a patch shortly.

cmoen avatar cmoen commented on May 18, 2024

Thanks for the detailed analysis @hk0i and for the fix @gerryhocks. The fix has been merged to master, but I think this fix could warrant a 0.9.1 release to make Android users happy.

hk0i avatar hk0i commented on May 18, 2024

Sorry I couldn't get back to you sooner @gerryhocks. Looks like you figured it out either way.

For the record I was testing on an LG Nexus 5X, but as you said the issue was more generic than device-specific. Thanks for all the hard work everyone.

I was able to create a similar patch on my own as a quick temporary workaround, but the performance wasn't great, so I'd like to see what you came up with. Tokenization on my 5X takes around 2-3 seconds, so I have to do it asynchronously.
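A minimal sketch of that asynchronous approach (the class name and the stand-in token list are hypothetical, not from the actual app):

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: run the multi-second tokenize call off the main
// thread and collect the result via a Future. On Android the result would
// typically be posted back to the UI thread (e.g. via a Handler).
public class AsyncTokenizeSketch {
    public static void main(String[] args)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<List<String>> tokens = executor.submit(() -> {
            // stand-in for tokenizer.tokenize(mInputText)
            return List.of("新しい", "言葉");
        });
        System.out.println(tokens.get()); // blocks here; a UI would not
        executor.shutdown();
    }
}
```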

I'll dig through the revision history on master to see if I can find it.

gerryhocks avatar gerryhocks commented on May 18, 2024

No problem @hk0i - the problem manifested on my first attempt, even on an x86 image. Perhaps I was premature in asking for extra details.

The changes are not overly extensive and are in the same flavour as the earlier version, so they should be easy to spot.

Out of interest, how much text are you tokenizing when you experience 2-3 seconds of processing?

hk0i avatar hk0i commented on May 18, 2024

I just took a look at your changes and—from memory—the hack that I put in to get things "working" was very different.

As for the amount of text to tokenize—not much. In my experience, short sentences and long paragraphs take close to the same amount of time to tokenize.

The following text (on my implementation, not the fix that was just added) takes 2.52 seconds:

新しい言葉使いましょう

However, something longer—an excerpt from the Wikipedia entry on 日本語—takes 2.40 seconds:

日本語(にほんご、にっぽんご)は、主に日本国内や日本人同士の間で使われている言語である。日本は法令によって

Again this is not the official change from upstream but a different quick fix I found for this in another ticket.

I'll be able to pull the official kuromoji change from master and test the performance of that change later today, within the next 5 hours or so.

gerryhocks avatar gerryhocks commented on May 18, 2024

Understood. Thanks!

Would I be right in assuming that the timing includes the creation of a Tokenizer, followed by the tokenize() call itself?

hk0i avatar hk0i commented on May 18, 2024

That's right, I'm also iterating over the tokens with a foreach loop, something like this:

    Tokenizer tokenizer = new Tokenizer.Builder().build();
    for (Token token : tokenizer.tokenize(mInputText)) {
        /* store tokens in a list... */
    }

hk0i avatar hk0i commented on May 18, 2024

I just gave it a new shot using the code from your fix, @gerryhocks. It actually seems to be a few hundred milliseconds slower than the hot-fix I was using prior.

Testing with the same two sentences, this is what I get:

Around 2.7–2.8 seconds for the first sentence:

新しい言葉使いましょう

and roughly the same amount of time for the second sentence (2.75–2.90s, updated from an initial reading of 2.75–2.82s):

日本語(にほんご、にっぽんご)は、主に日本国内や日本人同士の間で使われている言語である。日本は法令によって

The hot fix I was using was less centralized than what you did, and more focused on IntegerArrayIO.java specifically:

diff --git a/kuromoji-core/src/main/java/com/atilika/kuromoji/io/IntegerArrayIO.java b/kuromoji-core/src/main/java/com/atilika/kuromoji/io/IntegerArrayIO.java
index 55ab53e..862222d 100644
--- a/kuromoji-core/src/main/java/com/atilika/kuromoji/io/IntegerArrayIO.java
+++ b/kuromoji-core/src/main/java/com/atilika/kuromoji/io/IntegerArrayIO.java
@@ -36,8 +36,9 @@ public class IntegerArrayIO {
         int length = dataInput.readInt();

         ByteBuffer tmpBuffer = ByteBuffer.allocate(length * INT_BYTES);
-        ReadableByteChannel channel = Channels.newChannel(dataInput);
-        channel.read(tmpBuffer);
+        byte[] buffer = new byte[tmpBuffer.remaining()];
+        dataInput.readFully(buffer);
+        tmpBuffer.put(buffer, 0, buffer.length);

         tmpBuffer.rewind();
         IntBuffer intBuffer = tmpBuffer.asIntBuffer();

I'm guessing that because I'm running this code on mobile devices, these things become more time-critical; on my laptop the time taken for tokenization isn't even noticeable.

Edit: Updated times for the second sentence. Could it be object-creation overhead? I think your solution may also have added some more loops; I'm not sure how the internals of readFully() work.
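The readFully() pattern from the diff above can be sketched as a standalone round-trip (class and method names here are illustrative, not kuromoji's actual code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class ReadArrayDemo {
    static final int INT_BYTES = 4;

    // Robust reader: readFully() guarantees the whole array is consumed,
    // so a second readArray() on the same stream starts at the right offset.
    static int[] readArray(DataInputStream dataInput) throws IOException {
        int length = dataInput.readInt();
        byte[] raw = new byte[length * INT_BYTES];
        dataInput.readFully(raw); // blocks until every byte has arrived
        int[] array = new int[length];
        ByteBuffer.wrap(raw).asIntBuffer().get(array); // big-endian, like DataOutput
        return array;
    }

    public static void main(String[] args) throws IOException {
        // Serialize two length-prefixed arrays back-to-back, as WordIdMap does.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (int[] array : new int[][]{{1, 2, 3}, {10, 20}}) {
            out.writeInt(array.length);
            for (int value : array) out.writeInt(value);
        }
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(Arrays.toString(readArray(in))); // [1, 2, 3]
        System.out.println(Arrays.toString(readArray(in))); // [10, 20]
    }
}
```

The key point is that a single `channel.read(buffer)` call may legally stop short of filling the buffer, whereas `readFully()` loops until it has every requested byte.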

gerryhocks avatar gerryhocks commented on May 18, 2024

Thanks again for the detailed info @hk0i.

If you were using the complete master source (1.0), it's worth noting that it uses a different underlying data structure (an FST), whereas the 0.9 release uses a Double Array Trie.

The new structure has a very small performance difference when loading the default dictionaries, but is significantly faster when loading user dictionaries.

These changes could account for the difference in times you are experiencing.

The instantiation of the Tokenizer is going to be taking most of the time in your example. Depending on how you use it, it might be worth using a static instance or a singleton so the Tokenizer is only created once within the app. Naturally this will prevent it from being garbage collected, but that might be a reasonable trade-off for you.
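A minimal sketch of such a cached single instance, written as a generic holder rather than against kuromoji's actual API (the `Lazy` class is hypothetical):

```java
import java.util.function.Supplier;

// Hypothetical sketch: cache one expensive instance per process using
// double-checked locking, so e.g. a Tokenizer's dictionary is loaded only
// on first use and every later call reuses the same object.
public final class Lazy<T> {
    private final Supplier<T> factory;
    private volatile T instance;

    public Lazy(Supplier<T> factory) { this.factory = factory; }

    public T get() {
        T result = instance; // single volatile read on the fast path
        if (result == null) {
            synchronized (this) {
                result = instance;
                if (result == null) {
                    instance = result = factory.get(); // built exactly once
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Lazy<String> dictionary = new Lazy<>(() -> {
            System.out.println("building (expensive)...");
            return "loaded";
        });
        System.out.println(dictionary.get()); // triggers the factory
        System.out.println(dictionary.get()); // reuses the cached instance
    }
}
```

In the app this might look like `static final Lazy<Tokenizer> TOKENIZER = new Lazy<>(() -> new Tokenizer.Builder().build());`, so every call site shares one instance.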

hk0i avatar hk0i commented on May 18, 2024

Thanks again @gerryhocks.

I gave it another shot with a static instance and the performance improves significantly: it went from 2-3 seconds for the first instantiation to 0-100 ms when reusing the same instance. I'll probably go forward with this.

gerryhocks avatar gerryhocks commented on May 18, 2024

Glad to hear you've found a workable solution, @hk0i. Best of luck with your project.

cmoen avatar cmoen commented on May 18, 2024

Thanks a lot for sorting this out. I'll close the issue.
