Giter Site home page Giter Site logo

levenshteindistance's Introduction

The Levenshtein Distance

In the field of computational linguistics, the Levenshtein distance, also known as the edit distance, is utilized to compare the similarity between two words or phrases.

The Levenshtein distance between two strings $a$ and $b$, with lengths $|a|$ and $|b|$ respectively, can be calculated using the function $\text{lev}(a, b)$. This function is defined as:

$$ \text{lev}(a, b) = \begin{cases} |a| & \text{if } |b| = 0 \\ |b| & \text{if } |a| = 0 \\ \text{lev}(\text{tail}(a),\text{tail}(b)) & \text{if } a[0] = b[0] \\ 1 + \min \begin{cases} \text{lev}(\text{tail}(a), b) \\ \text{lev}(a, \text{tail}(b)) \\ \text{lev}(\text{tail}(a), \text{tail}(b)) \\ \end{cases} & \text{otherwise} \end{cases} $$

This is a piecewise function that is defined differently in different parts of its domain. In other words, it's like having multiple functions combined into one.

So, the Levenshtein distance is the minimum number of single-character edits required to transform one string into another. It measures the dissimilarity or difference between two words, where "similarity" is the resemblance between their spellings, pronunciations, meanings, or contexts.

The "distance between words" can be considered as the measure of how many changes are needed to transform one word into another. This can be calculated using various algorithms and techniques depending on the specific application or task.

Here are some examples to illustrate the concept of distance between words using the Levenshtein distance algorithm:

  • The Levenshtein distance between "cat" and "hat" is 1 because changing the first letter "c" to "h" is equivalent to a single edit operation (a substitution).
  • The Levenshtein distance between "cat" and "dog" is 3 because three edit operations (two substitutions and one deletion) are required to transform "cat" into "dog".
  • The Levenshtein distance between "kitten" and "sitting" is 3. This is because the following 3 edits change one into the other, and there is no way to do it with fewer than 3 edits: (1) kitten → sitten (substitution of "s" for "k"), (2) sitten → sittin (substitution of "i" for "e"), and (3) sittin → sitting (insertion of "g" at the end).

To calculate the Levenshtein distance, a script has been created that defines a function called levenshtein_distance. This function takes two strings s and t as inputs, finds the minimum number of operations required to transform the prefix of the first string into the prefix of the second string, and returns an integer representing the Levenshtein distance between them.


The Levenshtein distance is a useful tool for detecting fraud. A bank processing a large number of credit card applications can compare new applications to a database of legitimate past applications by calculating the Levenshtein distance. If the distance between a new application and the database exceeds a certain threshold, the bank may suspect fraud and investigate further. In fact, fraudulent transactions may have slight variations from genuine ones, and the Levenshtein distance can be used to measure the degree of difference between them.

For example, let's say a new application has the following information:

Name: John Doe
Address: 123 Main St, Anytown USA
Income: $50,000
Employment: Full-time

The bank calculates the Levenshtein distance between this application and the applications in their database. If they find an existing application with the same name, address, and employment information but a significantly different income, it may indicate that the new application is fraudulent and the income information has been falsified.

However, the Levenshtein distance algorithm is not perfect and may sometimes produce false positives or false negatives, depending on the specific use case and the quality of the data. Therefore, it's important to use this algorithm as a tool in conjunction with other techniques and human expertise to ensure accurate and reliable results.

Python

Languages and Tools

python

Requirements

Just Python 3.6.9

Test Coverage

Name Stmts Miss Branch BrPart Cover
levenshtein_distance.py 21 4 14 1 86%
levenshtein_distanceFinanceFraud.py 15 3 8 1 74%
levenshtein_distanceFinanceFraud_test.py 19 1 4 1 91%
levenshtein_distance_test.py 19 1 4 1 91%
TOTAL 74 9 30 4 86%

License

This project is licensed under the MIT License - see the LICENSE.md file for details


Connect with me

francescopaololezza francescopaololezza

levenshteindistance's People

Contributors

francescopaolol avatar

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.