Comments (9)
Thanks for posting this. I'll check it out. It seems like it should be a straightforward fix.
from kumo.
Could you give me a sample input? Are you loading from a raw text file, or are you loading a "frequency file" in the following format?
100: frog
94: dog
43: cog
3: fog
1: log
1: pog
from kumo.
I created a simple unit test with some unusual text and could not immediately replicate your issue.
Test
@Test
public void defaultTokenizerTrimTest() throws IOException {
    final FrequencyAnalyzer frequencyAnalyzer = new FrequencyAnalyzer();
    final List<WordFrequency> wordFrequencies = frequencyAnalyzer.load(
            Thread.currentThread().getContextClassLoader().getResourceAsStream("trim_test.txt"));
    final Map<String, WordFrequency> wordFrequencyMap = wordFrequencies
            .stream()
            .collect(Collectors.toMap(WordFrequency::getWord,
                                      Function.identity()));
    assertEquals(2, wordFrequencyMap.get("random").getFrequency());
    assertEquals(1, wordFrequencyMap.get("some").getFrequency());
    assertEquals(1, wordFrequencyMap.get("with").getFrequency());
    assertEquals(1, wordFrequencyMap.get("spaces").getFrequency());
    assertEquals(1, wordFrequencyMap.get("i'm").getFrequency());
}
The contents of trim_test.txt:
I'm some random random text with spaces .
Feel free to post your raw text/file and I can add tests around it and help debug.
from kumo.
I went ahead and pushed up the test since there was no existing FrequencyAnalyzerTest. https://github.com/kennycason/kumo/blob/master/kumo-core/src/test/java/com/kennycason/kumo/nlp/FrequencyAnalyzerTest.java
from kumo.
Here is an example text file with the bug.
(removed sample file)
It gives the same result whether loading from a text file or from an InputStream.
Most special characters are removed, but not -. I am not sure whether this is intended.
But I end up with different tokens:
-
--
---
etc.
from kumo.
@thomasegense thanks for the sample! I'll check it out.
from kumo.
Hi again, can you reproduce the error?
from kumo.
@thomasegense Hi, sorry, this week has been hectic for me at work. I'll try to look at it over this weekend. I have this tab open in my browser. :)
from kumo.
I was able to replicate this error.
@Test
public void largeTextFileTest() throws IOException {
    final FrequencyAnalyzer frequencyAnalyzer = new FrequencyAnalyzer();
    final List<WordFrequency> wordFrequencies = frequencyAnalyzer.load(
            Thread.currentThread().getContextClassLoader().getResourceAsStream("text/csdb.txt"));
    wordFrequencies
            .forEach(wordFrequency ->
                    System.out.println(
                            String.format("[%s] -> [%d]", wordFrequency.getWord(), wordFrequency.getFrequency())));
}
Result:
[ ] -> [258594]
[the] -> [251345]
[music] -> [82106]
[and] -> [69944]
[user] -> [66652]
[ crack] -> [55529]
[this] -> [54919]
[you] -> [54355]
[csdb] -> [53250]
[comment] -> [50887]
[submitted] -> [50417]
[for] -> [49680]
[graphics] -> [44411]
[scene] -> [40164]
[demo] -> [38855]
[crack ] -> [37584]
[c64] -> [36656]
[crack] -> [35495]
[demo ] -> [35339]
[can] -> [31646]
[made] -> [28503]
[commodore] -> [27584]
[find] -> [27268]
[all] -> [25895]
[one-file] -> [25843]
[intro] -> [25235]
[1990] -> [22883]
[about] -> [22095]
[out] -> [21743]
[1989] -> [21269]
[here] -> [21171]
[not] -> [21055]
[but] -> [21001]
[which] -> [20647]
[was] -> [20377]
[are] -> [20349]
[forum] -> [20110]
[release] -> [20101]
[search] -> [19774]
[sceners] -> [19406]
[page] -> [19343]
[home] -> [19306]
[1988] -> [19037]
[that] -> [18841]
[code] -> [18535]
[website] -> [18503]
[computer] -> [18459]
[] -> [18446]
[1991] -> [17545]
[comments] -> [17502]
Looking at [ crack] in the debugger shows character code 160 (U+00A0), which is a non-breaking space. Strictly speaking, this is outside the 7-bit ASCII range.
One unquestionable bug is the empty token, which shows up as [] in the result above.
I will consider how to handle these cases. In the meantime, I recommend stripping character 160 (the non-breaking space) from your text file. Its hex code is A0, and the regex escape that matches it is \xA0.
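As a rough illustration of that workaround, here is a minimal sketch in plain Java that cleans the text before it is handed to the analyzer. The class and method names (NbspCleaner, replaceNbsp) are hypothetical and not part of kumo. Note that the sketch replaces each non-breaking space with a regular space rather than deleting it, on the assumption that words separated only by a non-breaking space should not be merged.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of kumo: pre-cleans text before analysis.
final class NbspCleaner {

    // \xA0 matches the character with code 160 (U+00A0, non-breaking space).
    private static final Pattern NBSP = Pattern.compile("\\xA0");

    // Replaces every non-breaking space with a regular space.
    static String replaceNbsp(final String text) {
        return NBSP.matcher(text).replaceAll(" ");
    }

    public static void main(final String[] args) {
        final String raw = "\u00A0crack demo\u00A0";

        // String.trim() only removes characters up to U+0020,
        // so the non-breaking spaces survive it.
        System.out.println(raw.trim().length());

        // After the replacement, trim() removes the leading and trailing spaces.
        System.out.println("[" + replaceNbsp(raw).trim() + "]");
    }
}
```

The cleaned string can then be wrapped in an InputStream (for example a ByteArrayInputStream over its UTF-8 bytes) and passed to frequencyAnalyzer.load as in the tests above.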
from kumo.