aceakash / string-similarity

Finds degree of similarity between two strings, based on Dice's Coefficient, which is mostly better than Levenshtein distance.

License: MIT License

JavaScript 100.00%
javascript dice-coefficient string-comparison string-similarity strings

Contributors: aceakash, ascriver, awalin, dependabot-support, f-a-r-a-z, ludo97240, maxbachmann, rclai, tom-sap

string-similarity's Issues

Matching % seems incorrect

I just tried the example:

stringSimilarity.compareTwoStrings("healed", "sealed");
//0.8

=> 80% for a 1 letter change.

stringSimilarity.compareTwoStrings("healed", "ehaled");
//0.6

=> 60% for two swapped letters

OK, I get it, but now I try with another word, compared against a version with one letter less (5 characters vs 4):

stringSimilarity.compareTwoStrings("fuira", "fuia");
//0.57

=> 57% for a 1 letter change (just lost 23%)

stringSimilarity.compareTwoStrings("furia", "fuira");
//0.25

=> 25% for a 1 letter change (just lost 35%)

It seems to me that the shorter the string, the more severe the matching is.
Is there a way to make the score independent of the string length?
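
The scores reported above are consistent with a comparison over adjacent character pairs (bigrams): a string of n characters has only n - 1 pairs, so one changed character disturbs a larger share of a short string's pairs. Below is a minimal sketch, assuming multiset matching of bigrams. The names `bigrams` and `diceSketch` are illustrative and are not the library's source.

```javascript
// Illustrative sketch only; not the library's actual implementation.

// Return every adjacent character pair of a string, in order.
function bigrams(str) {
  const pairs = [];
  for (let i = 0; i < str.length - 1; i++) {
    pairs.push(str.slice(i, i + 2));
  }
  return pairs;
}

// Dice-style score: 2 * shared pairs / (pairs in a + pairs in b).
function diceSketch(a, b) {
  const pairsA = bigrams(a);
  const pairsB = bigrams(b);
  // Assumption: strings too short to have a pair fall back to equality.
  if (pairsA.length === 0 || pairsB.length === 0) return a === b ? 1 : 0;

  // Count the pairs of a, then consume them while scanning b.
  const remaining = new Map();
  for (const p of pairsA) remaining.set(p, (remaining.get(p) || 0) + 1);
  let shared = 0;
  for (const p of pairsB) {
    const count = remaining.get(p) || 0;
    if (count > 0) {
      remaining.set(p, count - 1);
      shared++;
    }
  }
  return (2 * shared) / (pairsA.length + pairsB.length);
}

console.log(diceSketch("healed", "sealed")); // 4 of 5 pairs shared: 0.8
console.log(diceSketch("furia", "fuira"));   // 1 of 4 pairs shared: 0.25
```

With this model, "furia" and "fuira" share only the pair "fu", which is why swapping two letters in a five-letter word costs so much.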

Strange/incorrect matching

stringSimilarity.findBestMatch('wall e', ['wall·e', 'wall']);
stringSimilarity.findBestMatch('wall-e', ['wall·e', 'wall']);
stringSimilarity.findBestMatch('wall_e', ['wall-e', 'wall']);

These all return 1, as though "wall" is the best match. They should all return 0, since the first candidate differs from the input by only one character and is more symbolically similar.

incompatible with uglifyjs

Hello, I'm getting this when building the production package:

....... from UglifyJs
Unexpected token: punc (,) ....

Perhaps there is a way to add build configuration to the package to fix this?
I've worked around it by copying the code into my utilities library.
Thanks!

IE support

IE doesn't understand ES6 (const, let and arrow functions are the ES6 things that I see in the package sources) and to provide IE compatibility (curse him) we need to have our vendor bundles in ES5. And it's not easy to transpile a specific library during bundling...

The common way is to have ./dist/compare-strings.js in the npm package repo and an npm build script for ES6 -> ES5 transpilation process. If it's ok, I can provide a PR covering this situation. What do you think?

Issue comparing a word against the same word plus a blank and another letter

Hello, I found an issue when comparing a word against the same word plus a blank and another letter.
E.g.:
"Iphone" compared with "Iphone X" gives me a match of 1, but the texts are not equal. It should be close to 1 but not 1.

I'm using version 2.0.0

findBestMatch('Iphone', ['Iphone 8', 'Iphone 10', 'Iphone X', 'Iphone XS'])

findBestMatch : accept a list of objects

findBestMatch is really cool, but it could use one extra layer of convenience....

You see...I have an array of objects, and one of the attributes is the string that I am comparing.
I need to find the object from the array with the best matching string.

As it is, I have to extract the strings from all objects, find the best matching string, and then go back and find the object whose string is the best matching string.

It would be easier for me to pass-in the list of objects, along with the name of the attribute, and get back a reference to the best object. I suspect that this pattern of usage might be very common.

Again, this is not a bug, just a suggestion to make the API easier to use.
Thanks for this software -- I am going to use it to solve a tricky problem in converting some very old insurance data.
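
The usage pattern described above can be sketched as a thin wrapper that scores one property of each object and returns the whole object. This is a hedged illustration: `findBestObject` and `exactOrPrefix` are hypothetical names, and the `compare` argument stands for any scoring function returning a number between 0 and 1 (the library's compareTwoStrings would be one candidate).

```javascript
// Hypothetical helper; not part of string-similarity.
// compare(a, b) is any scoring function returning a number in [0, 1].
function findBestObject(mainString, objects, key, compare) {
  let bestMatch = null;
  let rating = -Infinity; // stays -Infinity when `objects` is empty
  for (const obj of objects) {
    const score = compare(mainString, String(obj[key]));
    if (score > rating) {
      rating = score;
      bestMatch = obj; // keep a reference to the original object
    }
  }
  return { bestMatch, rating };
}

// Stand-in scorer so the sketch runs without any dependency.
const exactOrPrefix = (a, b) => (a === b ? 1 : b.startsWith(a) ? 0.5 : 0);

const records = [
  { name: "foo", otherProperty: 23 },
  { name: "bar", otherProperty: 27 },
  { name: "baz", otherProperty: 99 },
];

console.log(findBestObject("bar", records, "name", exactOrPrefix));
```

Ties are resolved in favour of the earliest object, because a later score must be strictly greater to replace the current best.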

Weird behavior

Hi,

I'm getting a weird result when comparing these two strings.
It always returns a rating of 0 despite them having 2 letters in common.

stringSimilarity.compareTwoStrings('NOS', 'NPS')
//0

stringSimilarity.findBestMatch('NOS', ['NPS'])
//{ ratings: [ { target: 'NPS', rating: 0 } ], bestMatch: { target: 'NPS', rating: 0 } }

https://runkit.com/588655d7fb7a220014a01b47/5886577d0629220014e341d7

Thanks.
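
A rating of 0 here is what a comparison over adjacent character pairs would produce, since the two strings share letters but no two-character sequence. A minimal sketch under that assumption; `bigrams` and `sharedBigrams` are hypothetical helpers, not library functions:

```javascript
// Illustrative only: list the adjacent character pairs of each string.
function bigrams(str) {
  const pairs = [];
  for (let i = 0; i < str.length - 1; i++) {
    pairs.push(str.slice(i, i + 2));
  }
  return pairs;
}

// Pairs of `a` that also occur somewhere in `b`.
function sharedBigrams(a, b) {
  const pairsB = bigrams(b);
  return bigrams(a).filter((pair) => pairsB.includes(pair));
}

console.log(bigrams("NOS"));              // [ 'NO', 'OS' ]
console.log(bigrams("NPS"));              // [ 'NP', 'PS' ]
console.log(sharedBigrams("NOS", "NPS")); // []
```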

compareTwoStrings returning 1 for small and different strings

Hello everyone, I hope everyone is doing well. I found a case in which two strings differ (search for <_15>): the first string has <_15>FLL while the second one has <_15>ORD, yet the function returns 1 as if it were a perfect match. The version used for this comparison was 4.0.1. Below is an example ready to be run in Node.js (system version 14.4.0):

const similarity = require("string-similarity");

const body1 = '<REQ><_0>MSG</_0><_1/><_2>55</_2><_3>ORG</_3><_4>F1</_4><_5>MIA</_5><_6>07560685</_6><_7>AC30</_7><_8>HFD</_8><_9>F1</_9><_10>T</_10><_11>US</_11><_12>USD</_12><_13>ZE</_13><_14>ODI</_14><_15>FLL</_15><_16>ORD</_16><_17>UNT</_17><_18>5</_18><_19>1</_19><_20>UNZ</_20><_21>1</_21><_22>000000</_22><_23/></REQ>';

const body2 = '<REQ><_0>MSG</_0><_1/><_2>55</_2><_3>ORG</_3><_4>F1</_4><_5>MIA</_5><_6>07560685</_6><_7>AC30</_7><_8>HFD</_8><_9>F1</_9><_10>T</_10><_11>US</_11><_12>USD</_12><_13>ZE</_13><_14>ODI</_14><_15>ORD</_15><_16>FLL</_16><_17>UNT</_17><_18>5</_18><_19>1</_19><_20>UNZ</_20><_21>1</_21><_22>000000</_22><_23/></REQ>';

console.log(similarity.compareTwoStrings(body1, body2));

Thanks!

Wrong bestMatch with game titles

Hey there!

I've noticed that string-similarity is having issues with game titles. Here's a little example:

var matches = stringSimilarity.findBestMatch('Portal 2', ['Portal', 'Portal 2']);

This example returns 'Portal' as bestMatch. However if I change the order of the targetStrings array, like this:

var matches = stringSimilarity.findBestMatch('Portal 2', ['Portal 2', 'Portal']);

Then bestMatch is Portal 2. While this sounds like the solution, searching for Portal would then lead to bestMatch = Portal 2.

Testing around, I also found that compareTwoStrings('Portal', 'Portal 2') returns 1, even though those 2 strings are obviously not exactly the same.

Is there any way to make the comparison more strict?

Adding an optional param for replacing string and case sensitivity.

I am using compareTwoStrings in my own projects and have found a lot of use in an optional parameter that takes a regex or string to strip from both inputs. In my recent implementation I needed to exclude special characters. Having the parameter allowed me to just add that bit of regex to my call, like so: compareTwoStrings('hello', 'hey', /[^\w\s]/gi). I also added an option to ignore case. I would love to contribute these features.
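
Until such a parameter exists, the same effect can be approximated by normalising both inputs before scoring. The sketch below is an assumption-laden illustration: `compareNormalized` and its option names `strip` and `ignoreCase` are hypothetical, and `compare` is whatever scoring function is in use.

```javascript
// Hypothetical wrapper; the option names are assumptions for illustration.
function compareNormalized(a, b, compare, options = {}) {
  const { strip = null, ignoreCase = false } = options;
  const normalize = (s) => {
    // Remove everything matched by `strip` (a regex or a string), if given.
    let out = strip ? s.replace(strip, "") : s;
    if (ignoreCase) out = out.toLowerCase();
    return out;
  };
  return compare(normalize(a), normalize(b));
}

// Stand-in scorer so the sketch runs on its own: 1 if equal, else 0.
const exactMatch = (x, y) => (x === y ? 1 : 0);

console.log(
  compareNormalized("Hello!", "hello", exactMatch, {
    strip: /[^\w\s]/g, // drop anything that is not a word character or whitespace
    ignoreCase: true,
  })
); // 1
```

Keeping the normalisation outside the scoring function means the same wrapper works with any comparison routine.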

compareTwoStrings return wrong output

I noticed that if we pass ababacac and abacabac to compareTwoStrings, it returns 1, which is wrong.

var stringSimilarity = require('string-similarity');
console.log(stringSimilarity.compareTwoStrings('ababacac', 'abacabac'));

Expected
Should not be 1

Output
1
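
One way to see where the 1 comes from: if the comparison works on adjacent character pairs (bigrams), as Dice-style measures usually do, the two strings contain exactly the same pairs the same number of times, only in a different order. A minimal sketch; `bigramCounts` is a hypothetical helper, not part of the library:

```javascript
// Illustrative only: count how often each adjacent character pair occurs.
function bigramCounts(str) {
  const counts = {};
  for (let i = 0; i < str.length - 1; i++) {
    const pair = str.slice(i, i + 2);
    counts[pair] = (counts[pair] || 0) + 1;
  }
  return counts;
}

console.log(bigramCounts("ababacac")); // { ab: 2, ba: 2, ac: 2, ca: 1 }
console.log(bigramCounts("abacabac")); // { ab: 2, ba: 2, ac: 2, ca: 1 }
```

Any score computed purely from these counts cannot tell the two strings apart.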

LICENSE.md and package.json disagree

The package.json reports the license to be ISC, while the LICENSE file reports it to be MIT.

It's quite important that this is fixed as license reporting tools will rightfully report this as problematic.

Handling of tiny strings not functioning as expected

For some reason, when both strings are 1 character long or shorter, compareTwoStrings returns Number.NaN instead of the expected 1 or 0.

Examples:

compareTwoStrings("", "")      // NaN
compareTwoStrings("a", "a")    // NaN
compareTwoStrings("a", "")     // NaN
compareTwoStrings("aa", "aa")  // 1
compareTwoStrings("aa", "")    // 0

This is a problem because every comparison involving Number.NaN evaluates to false,
e.g. the following always returns false, even though true is expected:
compareTwoStrings("", "") > 0.9

I have temporarily got around this in my own project, by simply using:

if(a.length <= 1 && b.length <= 1)return a === b ? 1 : 0;
return compareTwoStrings(a, b);

Cheers,
Josh

Does not seem to care about the order of words

I have the following two strings:
grid styling xs 1/12
grid styling xs 2/12

And my search input is

xs 2

Both strings get the exact same score, which doesn't seem right,
because xs 2 has a longer "direct match", in word order, with the second string than with the first one.
The first string only gets the same score because there is also a "2" in it. But that "2" is in a location that shouldn't influence the score as much as the "2" in the "right spot".

Weird result based on string case

I've tried to compare the strings as follows:

stringSimilarity.findBestMatch('bnp', ['BNP Paribas', absolutelyunrelated])

Both ratings are 0

stringSimilarity.findBestMatch('BNP', ['BNP Paribas', absolutelyunrelated])

BNP Paribas rating is 0.36363636363636365

I wouldn't expect 0 with bnp; I'd expect something around 0.2 or more. The difference based on string case is huge.

Feature suggestion - pass array of objects as targets for findBestMatch function

Use case:
Instead of wanting to compare ["foo","bar","baz"], it can be useful to pass in an array of objects for which you want to compare one property, i.e.

[
    { name: "foo", otherProperty: 23 },
    { name: "bar", otherProperty: 27 },
    { name: "baz", otherProperty: 99 }
]

and instruct the function to compare based on the name property, but return the whole object in the response.

I have already created a PR #124 for this. Just need approval

[HELP!!] Latest version does not detect spaces, and 2.0.0 version is not case sensitive.

I need to compare two strings completely, which means the comparison should also detect spaces and capitalization.

I have read the release notes, and I don't understand why you decided to disregard spaces from version 3.0.0. After running npm install --save string-similarity@2.0.0, it detects spaces, but it is not case sensitive.

Latest version:
stringSimilarity.compareTwoStrings("Te st", "Test"); //1.00

2.0 version
stringSimilarity.compareTwoStrings("TEST", "test"); //1.00

Please help, I need to get this done as soon as possible.
Thank you!

findBestMatch

Hello, how can I pass a key:value array as targetStrings to
findBestMatch(mainString, targetStrings) {
}
The match target is the key in the key:value array.

Thanks

Memoize results

It would be good to memoize the result, so that running the function again with the same arguments returns the result right away instead of repeating the whole calculation.
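
Memoization can also be layered on by the caller without changing the library. A minimal sketch, assuming the comparison is a pure function of its two string arguments; `memoizeCompare` is a hypothetical helper:

```javascript
// Hypothetical helper: wrap any two-argument scoring function with a cache.
function memoizeCompare(compare) {
  const cache = new Map();
  return function memoized(a, b) {
    // JSON.stringify keeps ("ab", "c") and ("a", "bc") as distinct keys.
    const key = JSON.stringify([a, b]);
    if (!cache.has(key)) {
      cache.set(key, compare(a, b));
    }
    return cache.get(key);
  };
}

// Stand-in scorer that counts its calls, to show the cache at work.
let calls = 0;
const countingCompare = (a, b) => {
  calls++;
  return a === b ? 1 : 0;
};

const fastCompare = memoizeCompare(countingCompare);
fastCompare("healed", "sealed");
fastCompare("healed", "sealed"); // served from the cache
console.log(calls); // 1
```

The cache in this sketch grows without bound, so a long-running process would want a size limit or an eviction policy.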

Huge difference comparing to Levenshtein Distance method

If we test with the calculator at https://planetcalc.com for:
source : Olive-green table for sale, in extremely good condition.
target : For sale: table in very good condition, olive green in colour.
the number of moves is 47

source : Olive-green table for sale, in extremely good condition.
target : Wanted: mountain bike with at least 21 gears..
the number of moves is 47

The two results look the same, which doesn't make sense. Sørensen–Dice is very accurate.
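
For readers who want to reproduce the comparison locally, here is a minimal sketch of the classic Levenshtein edit distance (insertions, deletions and substitutions, each costing 1). The function name is illustrative, and the online calculator may differ in details such as case handling, so the sketch is not claimed to reproduce its exact numbers.

```javascript
// Classic dynamic-programming Levenshtein distance, keeping two rows.
function levenshtein(a, b) {
  // prev[j] = distance between the empty prefix of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // distance from the first i chars of a to the empty string
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete a[i - 1]
        curr[j - 1] + 1,    // insert b[j - 1]
        prev[j - 1] + cost  // substitute, or keep when the characters match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(levenshtein("kitten", "sitting")); // 3
```

Because edit distance is tied to character positions, a sentence with its words reordered can cost as many edits as an unrelated sentence of similar length.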

Add browser support

It'd be great if this library were also usable in the browser. It currently uses require, which makes it impossible to use on the client side. 😞

is it possible to support chinese?

I have tested string-similarity with Chinese characters, but it does not seem to work; it returns "0".

var similarity = stringSimilarity.compareTwoStrings('布莱顿', '布赖顿');

plz advise. thx.

Is it possible to import it in a ES6 module ? (front-side)

I'm trying to use the package to implement a fuzzy search in a React component.
I don't want to use UMD.

I'm trying to import the module like so:

import stringSimilarity from 'stringSimilarity'

Node throws: Cannot find module 'stringSimilarity'.

Is it possible with this package ?

This is not the Dice coefficient

Your algorithm is not the Dice coefficient. It counts all bigram duplicates, whereas the Dice coefficient only counts distinct bigrams (as defined in Wikipedia).

As an example, let's compare two versions of the main file of this repo (https://github.com/aceakash/string-similarity/blob/2718c82bbbf5190ebb8e9c54d4cbae6d1259527a/compare-strings.js and the latest, https://github.com/aceakash/string-similarity/blob/eaeec5d74c98a6f6fcb1b06fad44ad7f3d8c2965/src/index.js). They have a Dice coefficient of 0.90, but this lib, string-similarity, outputs 0.74 when comparing these two files.

Please have a look at the implementations in Talisman, NLTK or in many languages in https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice%27s_coefficient
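
The distinction drawn above can be illustrated with two small variants: one over sets of distinct bigrams, and one that counts repeated bigrams. Both functions are sketches with hypothetical names, written from the description in this issue rather than copied from the library or from the linked implementations.

```javascript
// Illustrative only: adjacent character pairs of a string.
function bigrams(str) {
  const pairs = [];
  for (let i = 0; i < str.length - 1; i++) {
    pairs.push(str.slice(i, i + 2));
  }
  return pairs;
}

// Variant 1: Dice coefficient over sets of distinct bigrams.
function diceDistinct(a, b) {
  const setA = new Set(bigrams(a));
  const setB = new Set(bigrams(b));
  if (setA.size + setB.size === 0) return 0;
  let shared = 0;
  for (const pair of setA) {
    if (setB.has(pair)) shared++;
  }
  return (2 * shared) / (setA.size + setB.size);
}

// Variant 2: the same formula, but repeated bigrams count separately.
function diceWithDuplicates(a, b) {
  const pairsA = bigrams(a);
  const pairsB = bigrams(b);
  if (pairsA.length + pairsB.length === 0) return 0;
  const remaining = new Map();
  for (const pair of pairsA) remaining.set(pair, (remaining.get(pair) || 0) + 1);
  let shared = 0;
  for (const pair of pairsB) {
    const count = remaining.get(pair) || 0;
    if (count > 0) {
      remaining.set(pair, count - 1);
      shared++;
    }
  }
  return (2 * shared) / (pairsA.length + pairsB.length);
}

console.log(diceDistinct("aaaa", "aa"));       // 1: both sets are { "aa" }
console.log(diceWithDuplicates("aaaa", "aa")); // 0.5: 3 pairs vs 1 pair, 1 shared
```

The two variants agree whenever neither string repeats a bigram, and diverge as repetition grows, which is typical of long inputs such as source files.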

Doesn't work for strings with length 1

This looks like expected behavior, but it could be useful to fall back to a simple algorithm if one of the inputs is a length 1 string.

  if (first.length === 1 || second.length === 1) { // if either is a 1-letter string
    let [smaller, larger] = (first.length === 1)
      ? [first, second]
      : [second, first];
    return larger.includes(smaller) ? 2.0 / (larger.length + 1) : 0;
  }

This came up when I tried to use compareTwoStrings for a search ranking.

if (first.length < 2 || second.length < 2) return 0; // if either is a 1-letter string

isEdgeCaseWithOneOrZeroChars?

Hi :)

Just finished modifying this script so that I can use it in mongoDB (SpiderMonkey with some parts of ES6) without the lodash dependency, and noticed this unused method, isEdgeCaseWithOneOrZeroChars.

It was introduced here, but that's over a year ago, and it hasn't been used since then.

So I'm wondering whether it's some unfinished work that should be there, or just a stab at some approach deemed unnecessary and then accidentally left behind?

Cheers! :)

Daniel
