Giter Site home page Giter Site logo

ascii1011 / liblevenshtein Goto Github PK

View Code? Open in Web Editor NEW

This project forked from universal-automata/liblevenshtein

0.0 3.0 0.0 515 KB

Various utilities regarding Levenshtein Automata and related.

Home Page: http://dylon.github.com/liblevenshtein

liblevenshtein's Introduction

libLevenshtein

A library for generating Finite State Transducers based on Levenshtein Automata.

Basic Usage:

Add the following <script> tag to the <head> section of your document:

<script type="text/javascript"
	src="http://dylon.github.com/liblevenshtein/javascripts/v1.1/liblevenshtein.min.js">
</script>

Then, within your JavaScript logic, you may use the library as follows:

var algorithm = "transposition"; // "standard", "transposition", or "merge_and_split"

var dictionary = [ /* some list of words */ ];
var transduce = levenshtein.transducer({dictionary: dictionary, algorithm: algorithm});

var query_term = "mispelled";
var max_edit_distance = 2;

var matches = transduce(query_term, max_edit_distance); // list of terms matching your query

var other_term = "oter";
var other_matches = transduce(other_term, max_edit_distance); // reuse the transducer

The default behavior of the transducer is to sort the results, ascendingly, in the following fashion: first according to the transduced terms' Levenshtein distances from the query term, then lexicographically, in a case insensitive manner. Each result is a pair consisting of the transduced term and its Levenshtein distance from the query term, as follows: [term, distance]

var pair, term, distance, i = 0;
while ((pair = matches[i]) !== undefined) {
	term = pair[0]; distance = pair[1];
	// do something with `term` and `distance`
	i += 1;
}

If you would prefer to sort the results yourself, or do not care about order, you may do the following:

var transduce = levenshtein.transducer({
  dictionary: dictionary,
  algorithm: algorithm,
  sort_matches: false
});

The sorting options are as follows:

  1. sort_matches := Whether to sort the transduced terms (boolean).
  2. include_distance := Whether to include the Levenshtein distances with the transduced terms (boolean).
  3. case_insensitive := Whether to sort the results in a case-insensitive manner (boolean).

Each option defaults to true. You can get the original behavior of the transducer by setting each option to false (where the original behavior was to return the terms unsorted and excluding their distances).

Background

Based largely on the works of Stoyan Mihov, Klaus Schulz, and Petar Nikolaev Mitankin, this library generates Levenshtein transducers using nothing more than an input list of dictionary terms. The referenced literature includes: "Fast String Correction with Levenshtein-Automata", which defines the algorithm used to generate the Levenshtein automata, "Universal Levenshtein Automata. Building and Properties", which provided many mathematical proofs that helped me understand the algorithm and supplied the recursive definitions upon which I based my distance functions, and "Incremental Construction of Minimal Acyclic Finite-State Automata", that defined an algorithm for constructing Minimal Acyclic Finite-State Automata in linear time (i.e. MA-FSA, also known as DAWGs: Directed Acyclic Word Graphs) which I use to store the dictionary of terms.

Upon construction, the list of dictionary terms is indexed as an MA-FSA and a transducer is initialized according to the maximum edit distance and algorithm provided. When queried against, the states of the Levenshtein automaton corresponding to the query term, maximum edit distance, and algorithm specified are constructed on-demand (lazily) as the automaton is evaluated, as described by the paper, "Fast String Correction with Levenshtein-Automata". The Levenshtein automaton is intersected with the dictionary MA-FSA, and every string accepted by both automata is emitted in a list of correction candidates for the query term.

In contrast to many other Levenshtein automata-based algorithms, the entire Levenshtein automaton needn't be constructed prior to evaluation, and only those states of the automaton which are actually required are derived, as needed, thereby greatly improving the efficiency of the transducer in terms of both space and time complexity.

The infamous blog post, "Damn Cool Algorithms: Levenshtein Automata", provided me with a good starting point for this transducer, but the algorithm proposed therein was too inefficient for my needs. Yet, it did reference the paper "Fast String Correction with Levenshtein-Automata", which I ultimately used as the basis of the Levenshtein automata. The same paper also serves as the basis of the Levenshtein automata used by the Apache projects, Lucene and Solr (Lucene's FuzzyQuery is 100 times faster in 4.0), which itself is based on the project by Jean-Philippe Barrette-LaPierre, Moman.

Steve Hanov pointed me to the paper, "Incremental Construction of Minimal Acyclic Finite-State Automata", in his blog post entitled, "Compressing dictionaries with a DAWG". Another of Steve's blogs also made an impact on me, namely "Fast and Easy Levenshtein distance using a Trie".

I've become heavily involved with the online movement regarding MOOCs (Massive Open Online Classrooms), and the following courses taught me much regarding the material within this library:

  1. Automata
  2. Compilers
  3. Natural Language Processing

Finally, I must credit the course which first introduced me to the field of Information Retrieval, "Mathematical Applications in Search Engine Design", taught by Dr. Rao Li at the University of South Carolina Aiken. I highly recommend that course if you are in the area.

liblevenshtein's People

Watchers

Christopher Harty avatar James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.