rxp90 / jsymspell

Java 8+ zero-dependency port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm

Home Page: https://medium.com/@wolfgarbe/1000x-faster-spelling-correction-algorithm-2012-8701fcd87a5f

License: MIT License

Java 100.00%
java symspell spelling-correction spelling spellcheck

jsymspell's Introduction

JSymSpell

JSymSpell is a zero-dependency Java 8+ port of SymSpell.

Overview

The Symmetric Delete spelling correction algorithm speeds up spelling correction by orders of magnitude.

It achieves this by generating delete-only candidates in advance from a given lexicon.
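
The idea can be sketched in a few lines. This is a minimal illustration of the delete-only precomputation, not JSymSpell's actual implementation; the class and method names (`DeleteIndexSketch`, `deletes`, `buildIndex`, `candidates`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DeleteIndexSketch {

    /** Returns the word plus every string reachable by deleting up to maxDistance characters. */
    static Set<String> deletes(String word, int maxDistance) {
        Set<String> result = new HashSet<>();
        result.add(word);
        Set<String> frontier = new HashSet<>();
        frontier.add(word);
        for (int d = 0; d < maxDistance; d++) {
            Set<String> next = new HashSet<>();
            for (String s : frontier) {
                for (int i = 0; i < s.length(); i++) {
                    // Remove the character at position i
                    next.add(s.substring(0, i) + s.substring(i + 1));
                }
            }
            result.addAll(next);
            frontier = next;
        }
        return result;
    }

    /** Precomputed once: maps each delete variant to the lexicon words that produce it. */
    static Map<String, List<String>> buildIndex(Iterable<String> lexicon, int maxDistance) {
        Map<String, List<String>> index = new HashMap<>();
        for (String word : lexicon) {
            for (String variant : deletes(word, maxDistance)) {
                index.computeIfAbsent(variant, k -> new ArrayList<>()).add(word);
            }
        }
        return index;
    }

    /** At lookup time: apply the same deletes to the input and collect matching lexicon words. */
    static Set<String> candidates(Map<String, List<String>> index, String input, int maxDistance) {
        Set<String> found = new HashSet<>();
        for (String variant : deletes(input, maxDistance)) {
            found.addAll(index.getOrDefault(variant, new ArrayList<>()));
        }
        return found;
    }
}
```

With a lexicon containing only "hello" and a maximum distance of 1, the input "hallo" finds "hello" because both words share the delete variant "hllo". The candidates returned this way still need to be checked with a real edit distance function, since sharing a delete variant does not guarantee that the true distance is within the limit.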

Setup

Add the latest JSymSpell dependency to your project. The artifact is published on Maven Central under the coordinates io.gitlab.rxp90:jsymspell.

Getting Started

To start, we'll load the data sets of unigrams and bigrams:

Map<Bigram, Long> bigrams = Files.lines(Paths.get("src/test/resources/bigrams.txt"))
                                 .map(line -> line.split(" "))
                                 .collect(Collectors.toMap(tokens -> new Bigram(tokens[0], tokens[1]), tokens -> Long.parseLong(tokens[2])));
Map<String, Long> unigrams = Files.lines(Paths.get("src/test/resources/words.txt"))
                                  .map(line -> line.split(","))
                                  .collect(Collectors.toMap(tokens -> tokens[0], tokens -> Long.parseLong(tokens[1])));

Let's now create an instance of SymSpell by using the builder and load these maps. For this example we'll limit the max edit distance to 2:

SymSpell symSpell = new SymSpellBuilder().setUnigramLexicon(unigrams)
                                         .setBigramLexicon(bigrams)
                                         .setMaxDictionaryEditDistance(2)
                                         .createSymSpell();

And we are ready!

int maxEditDistance = 2;
boolean includeUnknowns = false;
List<SuggestItem> suggestions = symSpell.lookupCompound("Nostalgiais truly one of th greatests human weakneses", maxEditDistance, includeUnknowns);
System.out.println(suggestions.get(0).getSuggestion());
// Output: nostalgia is truly one of the greatest human weaknesses
// ... only second to the neck!

Custom String Distance Algorithms

By default, JSymSpell calculates the Damerau-Levenshtein distance. Depending on your use case, you may want to use a different string distance algorithm.

Here's an example using Hamming distance:

SymSpell symSpell = new SymSpellBuilder().setUnigramLexicon(unigrams)
                                         .setStringDistanceAlgorithm((string1, string2, maxDistance) -> {
                                             // Hamming distance is only defined for strings of equal length
                                             if (string1.length() != string2.length()){
                                                 return -1;
                                             }
                                             char[] chars1 = string1.toCharArray();
                                             char[] chars2 = string2.toCharArray();
                                             int distance = 0;
                                             for (int i = 0; i < chars1.length; i++) {
                                                 if (chars1[i] != chars2[i]) {
                                                     distance += 1;
                                                 }
                                             }
                                             return distance;
                                         })
                                         .createSymSpell();
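
As another possible alternative, here is a sketch of plain Levenshtein distance (no transpositions) with the same three-argument shape as the lambda above. The class name `LevenshteinSketch` is hypothetical, and returning -1 when the distance exceeds `maxDistance` is an assumption modelled on the Hamming example, not a documented contract of the library.

```java
class LevenshteinSketch {

    /**
     * Plain Levenshtein distance (insertions, deletions, substitutions).
     * Returns -1 when the distance exceeds maxDistance (assumed convention).
     */
    static int distance(String s1, String s2, int maxDistance) {
        // The length difference is a lower bound on the distance
        if (Math.abs(s1.length() - s2.length()) > maxDistance) {
            return -1;
        }
        int[] previous = new int[s2.length() + 1];
        int[] current = new int[s2.length() + 1];
        for (int j = 0; j <= s2.length(); j++) {
            previous[j] = j; // distance from the empty prefix of s1
        }
        for (int i = 1; i <= s1.length(); i++) {
            current[0] = i; // distance to the empty prefix of s2
            for (int j = 1; j <= s2.length(); j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        int result = previous[s2.length()];
        return result <= maxDistance ? result : -1;
    }
}
```

It could then be passed to the builder as a method reference, for example `.setStringDistanceAlgorithm(LevenshteinSketch::distance)`, assuming the functional interface accepts two strings and an int as in the Hamming example.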

Custom character comparison

Let's say you are building a query engine for country names where the input form allows Unicode characters, but the database is all ASCII. You might want searches for Espana to return España entries with distance 0:

CharComparator customCharComparator = new CharComparator() {
    @Override
    public boolean areEqual(char ch1, char ch2) {
        if (ch1 == ch2) {
            return true;
        }
        // Treat 'ñ' and 'n' as the same character
        return (ch1 == 'ñ' && ch2 == 'n') || (ch1 == 'n' && ch2 == 'ñ');
    }
};
StringDistance damerauLevenshteinOSA = new DamerauLevenshteinOSA(customCharComparator);
SymSpell symSpell = new SymSpellBuilder().setUnigramLexicon(Map.of("España", 10L))
                                         .setStringDistanceAlgorithm(damerauLevenshteinOSA)
                                         .createSymSpell();
List<SuggestItem> suggestions = symSpell.lookup("Espana", Verbosity.ALL);
assertEquals(0, suggestions.get(0).getEditDistance());

Frequency dictionaries in other languages

As in the original SymSpell project, this port contains an English frequency dictionary that you can find at src/test/resources/words.txt. If you need a different one, you just need to compute a Map<String, Long> where the key is the word and the value is its frequency in the corpus.

Map<String, Long> unigrams = Arrays.stream("A B A B C A B A C A".split(" "))
                                   .collect(Collectors.groupingBy(String::toLowerCase, Collectors.counting()));
System.out.println(unigrams);
// Output: {a=5, b=3, c=2}

Built With

  • Maven - Dependency Management

Versioning

We use SemVer for versioning.

License

This project is licensed under the MIT License. See the LICENSE.md file for details.

Acknowledgments

  • Wolf Garbe

jsymspell's People

Contributors

indra-rosadi-rally, rxp90, samsieber

jsymspell's Issues

Race condition prevents library initialization ~1/100 times.

We've started using this at $WORK, and we've seen that sometimes initializing the library fails.

new SymSpellBuilder().setUnigramLexicon(unigrams)
                            .setBigramLexicon(bigrams)
                            .setMaxDictionaryEditDistance(2)
                            .createSymSpell();

Throws this exception:

        java.lang.ArrayIndexOutOfBoundsException
                at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
                at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
                at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
                at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
                at java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:598)
                at java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
                at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
                at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
                at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
                at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
                at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
                at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:650)
                at io.gitlab.rxp90.jsymspell.SymSpellImpl.<init>(SymSpellImpl.java:43)
                at io.gitlab.rxp90.jsymspell.SymSpellBuilder.createSymSpell(SymSpellBuilder.java:64)
                ... 
        Caused by: java.lang.ArrayIndexOutOfBoundsException
                at java.lang.System.arraycopy(Native Method)
                at java.util.ArrayList.addAll(ArrayList.java:586)
                at io.gitlab.rxp90.jsymspell.SymSpellImpl.lambda$null$1(SymSpellImpl.java:45)
                at java.util.HashMap.forEach(HashMap.java:1289)
                at io.gitlab.rxp90.jsymspell.SymSpellImpl.lambda$new$2(SymSpellImpl.java:45)
                at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
                at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1556)
                at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
                at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
                at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
                at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
                at java.util.concurrent.ForkJoinPool$WorkQueue.execLocalTasks(ForkJoinPool.java:1040)
                at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1058)
                at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
                at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)

We've wrapped the initialization code in a retry loop as a stopgap, but it appears the problem is here:

this.unigramLexicon.keySet().parallelStream().forEach((word) -> {
            Map<String, Collection<String>> edits = this.generateEdits(word);
            edits.forEach((string, suggestions) -> {
                ((Collection)this.deletes.computeIfAbsent(string, (ignored) -> {
                    return new ArrayList();
                })).addAll(suggestions);
            });
        });

The this.deletes collection is a ConcurrentHashMap, but the new ArrayList() it returns is not thread-safe. I believe switching the computeIfAbsent to a compute call that also adds the values to the ArrayList will fix it, but I'm not sure.
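
A minimal sketch of the compute-based approach suggested above, written as a standalone class rather than a patch to SymSpellImpl. The names `ConcurrentDeletesSketch`, `addAll` and `stressTest` are hypothetical. It relies on ConcurrentHashMap.compute performing the whole remapping function atomically for a given key, so the ArrayList is only ever mutated by one thread at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

class ConcurrentDeletesSketch {

    /** Adds suggestions under key; the list is mutated only inside compute, which is atomic per key. */
    static void addAll(ConcurrentHashMap<String, Collection<String>> deletes,
                       String key, Collection<String> suggestions) {
        deletes.compute(key, (k, existing) -> {
            Collection<String> list = existing != null ? existing : new ArrayList<>();
            list.addAll(suggestions);
            return list;
        });
    }

    /** Updates a single key from a parallel stream and returns the resulting list size. */
    static int stressTest(int tasks) {
        ConcurrentHashMap<String, Collection<String>> deletes = new ConcurrentHashMap<>();
        List<String> suggestions = Arrays.asList("a", "b", "c");
        IntStream.range(0, tasks).parallel()
                 .forEach(i -> addAll(deletes, "shared", suggestions));
        return deletes.get("shared").size();
    }

    public static void main(String[] args) {
        // Each task adds 3 entries, so no lost update means 3 * tasks entries
        System.out.println(stressTest(10_000));
    }
}
```

This only protects writers from each other. It is sufficient if the map is read only after construction has finished; if lookups could run while the index is still being built, a thread-safe collection would probably be needed for the values as well.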

Error importing from maven

To reproduce:

  1. Add implementation 'io.gitlab.rxp90:jsymspell:1.0' to Gradle and build.

Result:

error: cannot access Bigram

import io.gitlab.rxp90.jsymspell.api.Bigram;
^
bad class file: /Users/home/.gradle/caches/modules-2/files-2.1/io.gitlab.rxp90/jsymspell/1.0/8367b65ce9301a734bb6368a7d7149299ccb964d/jsymspell-1.0.jar(io/gitlab/rxp90/jsymspell/api/Bigram.class)
class file has wrong version 55.0, should be 52.0
Please remove or make sure it appears in the correct subdirectory of the classpath.
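
For context, class file major version 55 corresponds to Java 11 and 52 to Java 8, so the message indicates that the classes in the published jar target Java 11 while the build expects Java 8. The following sketch shows one way to check the version of a class file by reading its header; the class name `ClassVersionSketch` is hypothetical and this is not part of JSymSpell.

```java
import java.nio.ByteBuffer;

class ClassVersionSketch {

    /** Reads the major version from the first 8 bytes of a class file. */
    static int majorVersion(byte[] classBytes) {
        ByteBuffer buffer = ByteBuffer.wrap(classBytes); // big-endian by default
        if (buffer.remaining() < 8 || buffer.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("Not a class file");
        }
        buffer.getShort(); // skip the minor version
        return Short.toUnsignedInt(buffer.getShort());
    }

    public static void main(String[] args) {
        // Header of a class file compiled for Java 11: magic number, minor 0, major 55
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 55};
        System.out.println(majorVersion(header)); // prints 55
    }
}
```

In practice the bytes would come from a .class entry extracted from the jar, for example Bigram.class from the path in the error above.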
