Giter Site home page Giter Site logo

bayesian-spelling-correction's Introduction

bayesian-spelling-correction

Data

For this problem, we have the following files:

  • words.txt

    • This file contains a large corpus of words to help train the model about the occurances of many words in the English language.
    • In particular, this file contains the full text of the The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle, encoded in ASCII
  • testwords.txt

    • This file contains a list of words, along with possible mispellings of those words

Understanding the problem

With the advent of small screens on smartphones, mistyping words is a common annoyance. So, what if we implemented a way to correct mispellings using the bayes rule? I will be doing exactly that here.


Formulating the problem

For a mistyped word, w, we can have many candidate words c. Amongst the candidate words, we may or may not have the actual correct word.

The goal is to use the Bayes rule to evaluate each candidate word c, and calculate the probability of the candidate word given a mistyped word. In mathematical terms, we are trying to maximise this expression.

Maximisation problem


Notice that P(w), that is the probability of a spelling error occuring, is constant across all of the candidates c. This is as we are comparing possible candidates c for the same mispelling w. So we can simplify our optimisation problem to:

Maximisation problem


P(c): Probability of candidate word

For a given candidate word c, P(c) is essentially the probability of that candidate word occuring. It makes intuitive sense that we have to take this into account for our calculation as if a candidate word is uncommon or doesn't exist (so P(c) ≈ 0), then it is lessly likely that the a user was trying to type that word.

To determine this value, we can take a look at words.txt and derive the following:

For a word c:

  • If word exists in words.txt:
    • P(c) = $\text{total occurances of c} \over \text{total occurances of all words}$
  • Else:
    • P(c) = 0

P(w|c): Probability of the mistyped word w, given a candidate c

It makes sense that a correct candidate c should not be too "far" from the mistyped word w. We will be using edit distance to measure the distance between words.


Edit distance aka Levenshtein distance

For any word, it can be transformed into another word only using the following operations:

  1. Insertion: Add a character to the string.
  2. Deletion: Remove a character from the string.
  3. Substitution: Replace a character in the string with another character.

Each of these operations are given a cost of 1. So the edit distance between two words A and B would simply be the sum of costs of the operations required to turn A into B.

Eg. A = Cat, B = Bat, C = Code, D = Ha

  • dist(A, B) = 1

    • Substitute C with B
  • dist(A, C) = 3

    • Substitute a with o
    • Substitute t with d
    • Insert e
  • dist(A, D) = 2

    • Substitute C with H
    • Delete t

Choosing P(w|c) based on edit distance

Tt makes more sense that for a mistyped word it is more likely to be have a smaller edit distance. So that implies that:

So we need to select a distribution that allows us to draw probabilities for a given edit distance x such that:

To achieve this, I chose an exponential distribution.

Choosing λ for the exponential distribution


λ was chosen based on out of sample test accuracy such that it solves the following constrainted optimisation problem:

where P is the percentage of correctly guessed words
           f is a function representing the test program that takes in values
           x is the edit distance
           λ is tuning parameter

Solving this problem using scipy.Optimization, I calculated the value λ = 9.08732

Notice that optimisation problem solves our original problem.

Results

So for our simple bayesian spelling correction, the out-of-sample performance at correcting errors was 76% at a speed of 10 words per second.

Considering the relatively simple nature of this algorithm, this is good performance. Of course, this solution can be tweaked to give better performance.

bayesian-spelling-correction's People

Contributors

jjasim avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.