Default Normalizers not working (kumo issue, open)

kennycason commented on September 28, 2024
Default Normalizers not working

from kumo.

Comments (9)

kennycason commented on September 28, 2024

Thanks for posting this. I'll check it out. It seems like it should be a straightforward fix.

kennycason commented on September 28, 2024

Could you give me a sample input? Are you loading from a raw text file, or are you loading a "frequency file" in the following format?

100: frog
94: dog
43: cog
3: fog
1: log
1: pog
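
For readers unfamiliar with that layout, here is a minimal sketch of reading "count: word" lines into a map. It is not kumo's own loader; the class name `FrequencyLineParser` and its `parse` method are made up for illustration, and it assumes every count is a plain integer.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of kumo: parses "<count>: <word>" lines.
final class FrequencyLineParser {

    // Returns word -> count in input order; repeated words have their counts summed.
    static Map<String, Integer> parse(final List<String> lines) {
        final Map<String, Integer> frequencies = new LinkedHashMap<>();
        for (final String line : lines) {
            final int separator = line.indexOf(':');
            if (separator < 0) {
                continue; // no "count: word" separator on this line
            }
            final String count = line.substring(0, separator).trim();
            final String word = line.substring(separator + 1).trim();
            if (count.isEmpty() || word.isEmpty()) {
                continue; // skip incomplete entries
            }
            // Assumption: count is a valid integer; otherwise parseInt throws NumberFormatException.
            frequencies.merge(word, Integer.parseInt(count), Integer::sum);
        }
        return frequencies;
    }

    public static void main(final String[] args) {
        final List<String> lines = Arrays.asList("100: frog", "94: dog", "43: cog");
        System.out.println(parse(lines));
    }
}
```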

kennycason commented on September 28, 2024

I created a simple unit test with some odd text and could not immediately replicate your issue.
Test:

    @Test
    public void defaultTokenizerTrimTest() throws IOException {
        final FrequencyAnalyzer frequencyAnalyzer = new FrequencyAnalyzer();
        final List<WordFrequency> wordFrequencies = frequencyAnalyzer.load(
                Thread.currentThread().getContextClassLoader().getResourceAsStream("trim_test.txt"));

        final Map<String, WordFrequency> wordFrequencyMap = wordFrequencies
                .stream()
                .collect(Collectors.toMap(WordFrequency::getWord,
                                          Function.identity()));

        assertEquals(2, wordFrequencyMap.get("random").getFrequency());
        assertEquals(1, wordFrequencyMap.get("some").getFrequency());
        assertEquals(1, wordFrequencyMap.get("with").getFrequency());
        assertEquals(1, wordFrequencyMap.get("spaces").getFrequency());
        assertEquals(1, wordFrequencyMap.get("i'm").getFrequency());
    }

The contents of trim_test.txt:
I'm some random random text with spaces .

Feel free to post your raw text/file and I can add tests around it and help debug.

kennycason commented on September 28, 2024

I went ahead and pushed up the test since there was no existing FrequencyAnalyzerTest. https://github.com/kennycason/kumo/blob/master/kumo-core/src/test/java/com/kennycason/kumo/nlp/FrequencyAnalyzerTest.java

thomasegense commented on September 28, 2024

Here is an example text file with the bug.
(removed sample file)
It gives the same result whether I load from a text file or from an InputStream.
Most special characters are removed, but not "-". I am not sure whether this is intended.
As a result, I end up with distinct tokens such as:

-
--
---

etc.
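
Until this is settled, one possible workaround is to drop tokens that contain no letters or digits before counting them. The sketch below is only an illustration in plain Java; `PunctuationTokenFilter` and its methods are hypothetical names, not kumo API, and whether hyphen-only tokens should be dropped at all is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of kumo: removes tokens such as "-", "--", "---".
final class PunctuationTokenFilter {

    // True if the token has at least one letter or digit, e.g. "one-file" or "c64".
    static boolean hasLetterOrDigit(final String token) {
        return token.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    // Keeps only the tokens that contain a letter or digit, preserving order.
    static List<String> dropPunctuationOnly(final List<String> tokens) {
        return tokens.stream()
                .filter(PunctuationTokenFilter::hasLetterOrDigit)
                .collect(Collectors.toList());
    }

    public static void main(final String[] args) {
        final List<String> tokens = Arrays.asList("-", "--", "one-file", "---", "c64");
        System.out.println(dropPunctuationOnly(tokens));
    }
}
```

Tokens with an inner hyphen, such as "one-file", are kept because they still contain letters.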

kennycason commented on September 28, 2024

@thomasegense thanks for the sample! I'll check it out.

thomasegense commented on September 28, 2024

Hi again, can you reproduce the error?

kennycason commented on September 28, 2024

@thomasegense Hi, sorry, this week has been hectic for me at work. I'll try to look at it over this weekend. I have this tab open in my browser. :)

kennycason commented on September 28, 2024

I was able to replicate this error.

    @Test
    public void largeTextFileTest() throws IOException {
        final FrequencyAnalyzer frequencyAnalyzer = new FrequencyAnalyzer();
        final List<WordFrequency> wordFrequencies = frequencyAnalyzer.load(
                Thread.currentThread().getContextClassLoader().getResourceAsStream("text/csdb.txt"));

        wordFrequencies
                .forEach(wordFrequency ->
                                 System.out.println(
                                         String.format("[%s] -> [%d]", wordFrequency.getWord(), wordFrequency.getFrequency())));
    }

Result:

[  ] -> [258594]
[the] -> [251345]
[music] -> [82106]
[and] -> [69944]
[user] -> [66652]
[ crack] -> [55529]
[this] -> [54919]
[you] -> [54355]
[csdb] -> [53250]
[comment] -> [50887]
[submitted] -> [50417]
[for] -> [49680]
[graphics] -> [44411]
[scene] -> [40164]
[demo] -> [38855]
[crack  ] -> [37584]
[c64] -> [36656]
[crack] -> [35495]
[demo  ] -> [35339]
[can] -> [31646]
[made] -> [28503]
[commodore] -> [27584]
[find] -> [27268]
[all] -> [25895]
[one-file] -> [25843]
[intro] -> [25235]
[1990] -> [22883]
[about] -> [22095]
[out] -> [21743]
[1989] -> [21269]
[here] -> [21171]
[not] -> [21055]
[but] -> [21001]
[which] -> [20647]
[was] -> [20377]
[are] -> [20349]
[forum] -> [20110]
[release] -> [20101]
[search] -> [19774]
[sceners] -> [19406]
[page] -> [19343]
[home] -> [19306]
[1988] -> [19037]
[that] -> [18841]
[code] -> [18535]
[website] -> [18503]
[computer] -> [18459]
[] -> [18446]
[1991] -> [17545]
[comments] -> [17502]

Looking at [ crack] in the debugger shows character code 160 (U+00A0), which is a non-breaking space. Strictly speaking, it lies outside ASCII (0-127) and comes from Latin-1/Unicode.
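
For anyone who wants to check their own tokens without a debugger, here is a small sketch that prints each code point of a string. `CodePointDump` and `describe` are hypothetical names used only for this example. It also shows why such a token is easy to miss: Java's `String.trim()` only removes characters up to U+0020, and `Character.isWhitespace` deliberately excludes the non-breaking space.

```java
// Hypothetical diagnostic helper, not part of kumo.
final class CodePointDump {

    // Renders every code point as U+XXXX so that invisible characters stand out.
    static String describe(final String token) {
        final StringBuilder out = new StringBuilder();
        token.codePoints().forEach(codePoint -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", codePoint));
        });
        return out.toString();
    }

    public static void main(final String[] args) {
        final String token = "\u00A0crack"; // non-breaking space followed by "crack"
        System.out.println(describe(token));
        // trim() leaves the leading U+00A0 in place, so the length stays at 6.
        System.out.println(token.trim().length());
        // isWhitespace reports false for U+00A0, while isSpaceChar reports true.
        System.out.println(Character.isWhitespace('\u00A0'));
        System.out.println(Character.isSpaceChar('\u00A0'));
    }
}
```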

One unquestionable bug is the empty token, listed as [] -> [18446] in the result above.

I will consider how to handle these use cases. In the meantime, I recommend you strip character 160 (the non-breaking space) from your text file. Its hex code is A0, and a regex that matches it is \xA0.
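
As a rough sketch of that workaround in plain Java, assuming the text is preprocessed before it is handed to the analyzer: the class `NbspCleaner` and its methods are hypothetical names, not kumo API. The sketch replaces each non-breaking space with a regular space rather than deleting it, so that two words joined only by U+00A0 do not get glued together.

```java
// Hypothetical preprocessing helper, not part of kumo.
final class NbspCleaner {

    private static final char NBSP = '\u00A0';

    // Replaces every non-breaking space with a regular space (plain character replace).
    static String replaceNbsp(final String text) {
        return text.replace(NBSP, ' ');
    }

    // Same result using the regex \xA0 mentioned above.
    static String replaceNbspRegex(final String text) {
        return text.replaceAll("\\xA0", " ");
    }

    public static void main(final String[] args) {
        final String raw = "\u00A0crack\u00A0\u00A0";
        // After replacement, an ordinary trim() removes the padding.
        System.out.println("[" + replaceNbsp(raw).trim() + "]");
        System.out.println("[" + replaceNbspRegex(raw).trim() + "]");
    }
}
```

Applied to a whole file, this would be run on the text once before passing it to the frequency analyzer.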
