aceakash / string-similarity

Finds degree of similarity between two strings, based on Dice's Coefficient, which is mostly better than Levenshtein distance.

License: MIT License

JavaScript 100.00%

javascript dice-coefficient string-comparison string-similarity strings


string-similarity's Issues

Matching % seems incorrect

I just tried the example:

stringSimilarity.compareTwoStrings("healed", "sealed");
//0.8

=> 80% for a 1 letter change.

stringSimilarity.compareTwoStrings("healed", "ehaled");
//0.6

=> 60% for 2 letter switching

Ok I get it but now I just try with another word that contains 1 letter less (5 char length vs 6)

stringSimilarity.compareTwoStrings("fuira", "fuia");
//0.57

=> 57% for a 1 letter change (just lost 23%)

stringSimilarity.compareTwoStrings("furia", "fuira");
//0.25

=> 25% for swapping 2 letters (35 points lower than the 60% above)

It seems that the shorter the string, the more severely a change is penalized.
Is there a way to make the score independent of the string length?
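
The four scores above can be reproduced with a small bigram-overlap sketch. It assumes the score is twice the number of shared adjacent character pairs divided by the total number of pairs in both strings. The names `bigrams` and `diceScore` are illustrative only and are not the library's API.

```javascript
// Illustrative sketch only; not the library's actual source.
// Assumes both inputs have at least two characters.
function bigrams(str) {
  const pairs = [];
  for (let i = 0; i < str.length - 1; i++) {
    pairs.push(str.slice(i, i + 2));
  }
  return pairs;
}

function diceScore(a, b) {
  const pairsA = bigrams(a);
  const pairsB = bigrams(b);
  const unmatched = pairsB.slice(); // copy, so each pair in b is matched at most once
  let shared = 0;
  for (const pair of pairsA) {
    const idx = unmatched.indexOf(pair);
    if (idx !== -1) {
      shared++;
      unmatched.splice(idx, 1);
    }
  }
  return (2 * shared) / (pairsA.length + pairsB.length);
}

console.log(diceScore('healed', 'sealed').toFixed(2)); // 0.80
console.log(diceScore('healed', 'ehaled').toFixed(2)); // 0.60
console.log(diceScore('fuira', 'fuia').toFixed(2));    // 0.57
console.log(diceScore('furia', 'fuira').toFixed(2));   // 0.25
```

Under this assumption, swapping two letters in the middle of a five-letter word changes three of its four pairs, which would explain why short strings drop so sharply.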

Why is Sørensen–Dice coefficient better?

Finds degree of similarity between two strings, based on Dice's Coefficient, which is mostly better than Levenshtein distance.

What's the rationale behind this claim?

Strange/incorrect matching

stringSimilarity.findBestMatch('wall e', ['wall·e', 'wall']);
stringSimilarity.findBestMatch('wall-e', ['wall·e', 'wall']);
stringSimilarity.findBestMatch('wall_e', ['wall-e', 'wall']);

These all return 1, as though "wall" is the best match. They should all return 0, since they differ by only 1 character and are more symbolically similar.

incompatible with uglifyjs

Hello, I'm getting this when building the production package:

....... from UglifyJs
Unexpected token: punc (,) ....

Perhaps there is a way to add build configuration to the package to fix this?
I've worked around it by copying the code into my utilities library.
Thanks!

IE doesn't understand ES6 (const, let and arrow functions are the ES6 things that I see in the package sources) and to provide IE compatibility (curse him) we need to have our vendor bundles in ES5. And it's not easy to transpile a specific library during bundling...

The common way is to have ./dist/compare-strings.js in the npm package repo and an npm build script for ES6 -> ES5 transpilation process. If it's ok, I can provide a PR covering this situation. What do you think?

Issue comparing a word against the same word plus a blank and another letter

Hello, I found an issue when comparing a word against the same word plus a blank and another letter.
Eg:
"Iphone" compared with "Iphone X" gives me a match of 1, but the texts are not equal. It should be close to 1 but not 1.

I'm using version 2.0.0

findBestMatch('Iphone', ['Iphone 8', 'Iphone 10', 'Iphone X', 'Iphone XS'])

findBestMatch : accept a list of objects

findBestMatch is really cool, but it could use one extra layer of convenience....

You see...I have an array of objects, and one of the attributes is the string that I am comparing.
I need to find the object from the array with the best matching string.

As it is, I have to extract the strings from all objects, find the best matching string, and then go back and find the object whose string is the best matching string.

It would be easier for me to pass-in the list of objects, along with the name of the attribute, and get back a reference to the best object. I suspect that this pattern of usage might be very common.

Again, this is not a bug, just a suggestion to make the API easier to use.
Thanks for this software -- I am going to use it to solve a tricky problem in converting some very old insurance data.
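
A minimal sketch of the requested convenience, written as a wrapper outside the library. `findBestObject` and `scoreFn` are hypothetical names; in practice `scoreFn` would presumably be `stringSimilarity.compareTwoStrings`, but a stand-in scorer is used here so the example runs on its own.

```javascript
// Hypothetical helper; not part of the library.
// Returns the object whose `key` property scores highest against mainString.
function findBestObject(mainString, objects, key, scoreFn) {
  let bestMatch = null;
  let bestRating = -Infinity;
  for (const obj of objects) {
    const rating = scoreFn(mainString, obj[key]);
    if (rating > bestRating) { // on ties, the earliest object wins
      bestRating = rating;
      bestMatch = obj;
    }
  }
  return { bestMatch, rating: bestRating };
}

// Stand-in scorer for the demo: fraction of positions holding the same character.
function samePositionRatio(a, b) {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 1;
  let same = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    if (a[i] === b[i]) same++;
  }
  return same / longest;
}

const targets = [
  { name: 'foo', otherProperty: 23 },
  { name: 'bar', otherProperty: 27 },
  { name: 'baz', otherProperty: 99 },
];

console.log(findBestObject('boz', targets, 'name', samePositionRatio).bestMatch);
// { name: 'baz', otherProperty: 99 }
```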

compareTwoStrings() not case insensitive as documentation suggests

Code to replicate
var stringSimilarity = require('string-similarity');
console.log(stringSimilarity.compareTwoStrings('John Doe', 'john doe'));

Actual result
0.5

Expected result
1

Weird behavior

Hi,

I'm getting a weird result when comparing these two strings.
It always returns a rating of 0 despite them having 2 letters in common.

stringSimilarity.compareTwoStrings('NOS', 'NPS')
//0

stringSimilarity.findBestMatch('NOS', ['NPS'])
//{ ratings: [ { target: 'NPS', rating: 0 } ], bestMatch: { target: 'NPS', rating: 0 } }

https://runkit.com/588655d7fb7a220014a01b47/5886577d0629220014e341d7

Thanks.

compareTwoStrings returning 1 for small and different strings

Hello everyone, I hope you are all doing well. I found a case in which two different strings are scored as identical (search for <_15>): the first string has <_15>FLL while the second one has <_15>ORD, yet the function returns 1 as if it were a perfect match. The version used for this comparison was 4.0.1. Below is an example ready to be run in Node.js (system version 14.4.0):

const similarity = require("string-similarity");

const body1 = '<REQ><_0>MSG</_0><_1/><_2>55</_2><_3>ORG</_3><_4>F1</_4><_5>MIA</_5><_6>07560685</_6><_7>AC30</_7><_8>HFD</_8><_9>F1</_9><_10>T</_10><_11>US</_11><_12>USD</_12><_13>ZE</_13><_14>ODI</_14><_15>FLL</_15><_16>ORD</_16><_17>UNT</_17><_18>5</_18><_19>1</_19><_20>UNZ</_20><_21>1</_21><_22>000000</_22><_23/></REQ>';

const body2 = '<REQ><_0>MSG</_0><_1/><_2>55</_2><_3>ORG</_3><_4>F1</_4><_5>MIA</_5><_6>07560685</_6><_7>AC30</_7><_8>HFD</_8><_9>F1</_9><_10>T</_10><_11>US</_11><_12>USD</_12><_13>ZE</_13><_14>ODI</_14><_15>ORD</_15><_16>FLL</_16><_17>UNT</_17><_18>5</_18><_19>1</_19><_20>UNZ</_20><_21>1</_21><_22>000000</_22><_23/></REQ>';

console.log(similarity.compareTwoStrings(body1, body2));

Thanks!

Wrong bestMatch with game titles

Hey there!

I've noticed that string-similarity is having issues with game titles. Here's a little example:

var matches = stringSimilarity.findBestMatch('Portal 2', ['Portal', 'Portal 2']);

This example returns 'Portal' as bestMatch. However if I change the order of the targetStrings array, like this:

var matches = stringSimilarity.findBestMatch('Portal 2', ['Portal 2', 'Portal']);

Then bestMatch is Portal 2. While this sounds like the solution, searching Portal would lead to bestMatch = Portal 2.

While testing, I also found that compareTwoStrings('Portal', 'Portal 2') returns 1, even though those 2 strings are obviously not exactly the same.

Is there any way to make the comparison more strict?

Adding an optional param for replacing string and case sensitivity.

I am using compareTwoStrings in my own projects and have found a lot of use in having an optional parameter where you can give regex/string to replace by. In my recent implementation I needed to exclude special characters. Having the parameter allowed me to just add that bit of regex in my call like so: compareTwoStrings('hello', 'hey', /[^\w\s]/gi). I also added option to remove case sensitivity. I would love to contribute these features.
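
The proposal can also be approximated today with a thin wrapper that normalizes both strings before scoring. This is a sketch under assumptions: `compareNormalized`, `strip` and `ignoreCase` are hypothetical names, and `scoreFn` stands in for whichever comparison function is used.

```javascript
// Hypothetical wrapper; not part of the library.
function compareNormalized(first, second, scoreFn, options = {}) {
  const { strip = null, ignoreCase = false } = options;
  const normalize = (s) => {
    const stripped = strip ? s.replace(strip, '') : s; // drop unwanted characters
    return ignoreCase ? stripped.toLowerCase() : stripped;
  };
  return scoreFn(normalize(first), normalize(second));
}

// Stand-in scorer for the demo: 1 for identical strings, 0 otherwise.
const exactMatch = (a, b) => (a === b ? 1 : 0);

console.log(compareNormalized('Hello!', 'hello', exactMatch, {
  strip: /[^\w\s]/g,
  ignoreCase: true,
})); // 1
console.log(compareNormalized('Hello!', 'hello', exactMatch)); // 0
```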

compareTwoStrings return wrong output

I noticed that if we pass ababacac and abacabac to compareTwoStrings, it returns 1, which is wrong.

var stringSimilarity = require('string-similarity');
console.log(stringSimilarity.compareTwoStrings('ababacac', 'abacabac'));

Expected
Should not be 1

Output
1
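
A score of 1 here would be consistent with a measure that only counts adjacent character pairs and ignores where they occur. The sketch below assumes that kind of bigram counting (it is not taken from the library's source) and shows that the two strings contain exactly the same pairs the same number of times. `bigramCounts` and `describe` are illustrative names.

```javascript
// Illustrative sketch: count each adjacent character pair in a string.
function bigramCounts(str) {
  const counts = new Map();
  for (let i = 0; i < str.length - 1; i++) {
    const pair = str.slice(i, i + 2);
    counts.set(pair, (counts.get(pair) || 0) + 1);
  }
  return counts;
}

// Serialize the counts in sorted order so two results can be compared as strings.
function describe(counts) {
  return [...counts.entries()]
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([pair, n]) => `${pair}:${n}`)
    .join(' ');
}

console.log(describe(bigramCounts('ababacac'))); // ab:2 ac:2 ba:2 ca:1
console.log(describe(bigramCounts('abacabac'))); // ab:2 ac:2 ba:2 ca:1
```

Any score built only from these counts cannot tell the two strings apart, so distinguishing them would need an order-aware measure.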

Support for ie10

minor change, var instead of let and const.

PR:
#70

How is similarity calculated

How is similarity calculated?
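
Going by the project description (Dice's coefficient) and the scores reported in other issues, the rating presumably has the general form below, where X and Y are the collections of adjacent character pairs (bigrams) of the two strings. This is the textbook formula, not a statement about how the library handles whitespace, case or repeated pairs.

```latex
s(a, b) = \frac{2\,|X \cap Y|}{|X| + |Y|}
```

For "healed" and "sealed", each string has 5 bigrams and 4 of them are shared (ea, al, le, ed), which gives 2 · 4 / (5 + 5) = 0.8.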

LICENSE.md and package.json disagree

The package.json reports the license to be ISC, while the LICENSE file reports it to be MIT.

It's quite important that this is fixed as license reporting tools will rightfully report this as problematic.

Handling of tiny strings not functioning as expected

For some reason, when both strings are at most 1 character long, compareTwoStrings returns NaN instead of the expected 1 or 0.

Examples:

compareTwoStrings("", "")     // NaN
compareTwoStrings("a", "a")   // NaN
compareTwoStrings("a", "")    // NaN
compareTwoStrings("aa", "aa") // 1
compareTwoStrings("aa", "")   // 0

This is a problem because every ordering comparison involving NaN evaluates to false,
e.g. the following always returns false, even though true is expected:
compareTwoStrings("", "") > 0.9

I have temporarily got around this in my own project, by simply using:

if (a.length <= 1 && b.length <= 1) return a === b ? 1 : 0;
return compareTwoStrings(a, b);

Cheers,
Josh
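
For reference, a short sketch of how NaN behaves in comparisons, which is why a threshold check silently fails. `isUsableScore` is a hypothetical helper name, and the NaN value is a stand-in for the reported return value.

```javascript
const score = NaN; // stand-in for the value reported above

console.log(score > 0.9);         // false
console.log(score < 0.9);         // false
console.log(score === NaN);       // false: NaN is not equal to anything, including itself
console.log(Number.isNaN(score)); // true: the reliable way to detect it

// Hypothetical helper: true only for a real number between 0 and 1.
function isUsableScore(value) {
  return typeof value === 'number' && !Number.isNaN(value) && value >= 0 && value <= 1;
}

console.log(isUsableScore(score)); // false
console.log(isUsableScore(0.95));  // true
```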

Does not seem to care about the order of words

I have the following two strings:
grid styling xs 1/12
grid styling xs 2/12

And my search input is

xs 2

Both strings get the exact same score, which doesn't seem right.
The input xs 2 has a longer direct, in-order match with the second string than with the first one.
The first string only gets the same score because it also contains a "2", but in a position that shouldn't influence the score as much as the "2" in the right spot.

Weird result based on string case

I've tried to compare the strings, as follow:

stringSimilarity.findBestMatch('bnp', ['BNP Paribas', 'absolutelyunrelated'])

Both ratings are 0

stringSimilarity.findBestMatch('BNP', ['BNP Paribas', 'absolutelyunrelated'])

BNP Paribas rating is 0.36363636363636365

I wouldn't expect 0 with bnp; I'd expect something around 0.2 or more. The difference based on string case is huge.

Return best match with index

compareTwoStrings with arguments length lower than 3 since

Hello!

If I try to compare two strings like
compareTwoStrings("set", "st");
the result is always 0. Strings with length lower than 3 cannot be compared.

Is this normal?

Greetings!

Feature suggestion - pass array of objects as targets for findBestMatch function

Use case:
Instead of wanting to compare ["foo","bar","baz"], it can be useful to pass in an array of objects for which you want to compare one property, i.e.

[
    { name: "foo", otherProperty: 23 },
    { name: "bar", otherProperty: 27 },
    { name: "baz", otherProperty: 99 }
]

and instruct the function to compare based on the name property, but return the whole object in the response.

I have already created a PR #124 for this. Just need approval

[HELP!!] Latest version does not detect spaces, and 2.0.0 version is not case sensitive.

I need to compare two strings completely, that means it should also detect spaces and caps.

I have read the release notes, and I don't understand why you decided to disregard spaces from version 3.0.0. After running npm install --save [email protected], it detects spaces, but it is not case sensitive.

Latest version:
stringSimilarity.compareTwoStrings("Te st", "Test"); //1.00

2.0 version
stringSimilarity.compareTwoStrings("TEST", "test"); //1.00

Please help, I need to get this done as soon as possible.
Thank you!

findBestMatch

Hello, how can I pass a key:value array targetStrings to
findBestMatch(mainString, targetStrings) {
}
The match target is the key in the key:value array.

Thanks

Memoize results

It will be good if you memoize the result, so if you run the function with the same arguments, it will give the result right away instead of making the calculation all over again
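
Until something like this exists in the library, memoization can be layered on from the outside. The sketch below is one way to do it; `memoizeCompare` is a hypothetical name, and the demo uses a stand-in scorer that counts its own calls.

```javascript
// Hypothetical wrapper; not part of the library.
function memoizeCompare(scoreFn) {
  const cache = new Map();
  return function memoized(first, second) {
    const key = JSON.stringify([first, second]); // unambiguous key for the ordered pair
    if (!cache.has(key)) {
      cache.set(key, scoreFn(first, second));
    }
    return cache.get(key);
  };
}

// Demo with a stand-in scorer that records how often it actually runs.
let calls = 0;
const countingScorer = (a, b) => {
  calls++;
  return a === b ? 1 : 0;
};

const cachedCompare = memoizeCompare(countingScorer);
cachedCompare('healed', 'sealed');
cachedCompare('healed', 'sealed');
console.log(calls); // 1
```

Note that this cache grows without bound, so a long-running process comparing many distinct pairs would need some eviction policy.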

Huge difference comparing to Levenshtein Distance method

If we test on https://planetcalc.com with:
source : Olive-green table for sale, in extremely good condition.
target : For sale: table in very good condition, olive green in colour.
number of movement is 47

source : Olive-green table for sale, in extremely good condition.
target : Wanted: mountain bike with at least 21 gears..
number of movement is 47

They look the same lol :D but that doesn't make sense. Sørensen–Dice is very accurate.

Add browser support

It'd be great if this library were also usable in the browser. It currently uses require, which makes it impossible to use on the client side. 😞

is it possible to support chinese?

I have tested string-similarity with Chinese characters, but it doesn't seem to work; the result is 0.

var similarity = stringSimilarity.compareTwoStrings('布莱顿', '布赖顿');

plz advise. thx.

CommonJS causing optimization bailouts

Warning: [some/module.ts] depends on 'string-similarity'. CommonJS or AMD dependencies can cause optimization bailouts.

Is it possible to have an es6 version? It would be nice to avoid this kind of thing: https://web.dev/commonjs-larger-bundles/

Is it possible to import it in a ES6 module ? (front-side)

I'm trying to use the package to implement a fuzzy search in a React component.
I don't want to use UMD.

I'm trying to import the module like so:

import stringSimilarity from 'stringSimilarity'

Node throws: Cannot find module 'stringSimilarity'.

Is it possible with this package ?

How to make it work with Angular 4

I need string similarity in Angular. How can I use string-similarity with Angular 4? Thanks.

Error: Bad arguments: First argument should be a string, second should be an array of strings

I am pretty sure I did it correctly: I have the first argument as a string and the second as an array. The code is:

const pokemons = require('../../arrays/pokemons.js')
const poss = stringSimilarity.findBestMatch(args, pokemons)

arrays file

exports.pokemons = [
	pokemon lists
]
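
One possible cause, judging only from the snippet above: with `exports.pokemons = [...]`, `require` returns an object that has a `pokemons` property, not the array itself. The sketch below simulates that exports object with placeholder data; `isStringArray` is a hypothetical check, not the library's own validation.

```javascript
// Simulates what require('../../arrays/pokemons.js') would return when the file
// contains `exports.pokemons = [...]`. The entries are placeholders.
const moduleExports = { pokemons: ['bulbasaur', 'charmander', 'squirtle'] };

// Hypothetical check for "an array of strings".
function isStringArray(value) {
  return Array.isArray(value) && value.every((item) => typeof item === 'string');
}

console.log(isStringArray(moduleExports));          // false: this is the exports object
console.log(isStringArray(moduleExports.pokemons)); // true: this is the actual array

// Destructuring picks the array out of the exports object.
const { pokemons } = moduleExports;
console.log(pokemons.length); // 3
```

It may also be worth checking that `args` is a string rather than an array of words.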

This is not the Dice coefficient

Your algorithm is not the Dice coefficient. It counts all bigram duplicates, whereas the Dice coefficient only counts distinct bigrams (as defined in Wikipedia).

As an example, let's compare two versions of the main file of this repo (https://github.com/aceakash/string-similarity/blob/2718c82bbbf5190ebb8e9c54d4cbae6d1259527a/compare-strings.js and the latest https://github.com/aceakash/string-similarity/blob/eaeec5d74c98a6f6fcb1b06fad44ad7f3d8c2965/src/index.js). They have a Dice coefficient of 0.90, but this lib string-similarity outputs 0.74 when comparing these two files.

Please have a look at the implementations in Talisman, NLTK or in many languages in https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Dice%27s_coefficient
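
To make the distinction concrete, here is a minimal sketch of the set-based variant described in this issue, where each distinct bigram counts once. `distinctBigrams` and `diceOverSets` are illustrative names, and returning 0 when neither string has any bigram is an arbitrary choice made for this sketch.

```javascript
// Illustrative sketch of Dice's coefficient over distinct bigrams (sets).
function distinctBigrams(str) {
  const set = new Set();
  for (let i = 0; i < str.length - 1; i++) {
    set.add(str.slice(i, i + 2));
  }
  return set;
}

function diceOverSets(a, b) {
  const setA = distinctBigrams(a);
  const setB = distinctBigrams(b);
  if (setA.size + setB.size === 0) return 0; // assumption: no bigrams at all scores 0
  let shared = 0;
  for (const pair of setA) {
    if (setB.has(pair)) shared++;
  }
  return (2 * shared) / (setA.size + setB.size);
}

console.log(diceOverSets('aaaa', 'aa')); // 1
```

A variant that counts every occurrence would see three pairs in 'aaaa' and one in 'aa', giving 2 · 1 / (3 + 1) = 0.5 for the same input.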

first.replace is not a function

All in the title.
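
This message is the TypeError JavaScript raises when `.replace` is called on a value that is not a string (a number, an array, and so on). Whether that is what happened here cannot be told from the report. The sketch below reproduces the error class and shows a defensive conversion; `toComparable` is a hypothetical helper name.

```javascript
// Hypothetical guard: turn any input into a string before comparing.
function toComparable(value) {
  if (typeof value === 'string') return value;
  if (value === null || value === undefined) return '';
  return String(value);
}

try {
  (42).replace(/\s+/g, ''); // numbers have no replace method
} catch (err) {
  console.log(err instanceof TypeError); // true
}

console.log(toComparable(42).replace(/\s+/g, '')); // 42
```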

Doesn't work for strings with length 1

This looks like expected behavior, but it could be useful to fall back to a simple algorithm if one of the inputs is a length 1 string.

  if (first.length === 1 || second.length === 1) { // if either is a 1-letter string
    let [smaller, larger] = (first.length === 1)
      ? [first, second]
      : [second, first];
    return larger.includes(smaller) ? 2.0 / (larger.length + 1) : 0;
  }

This came up when I tried to use compareTwoStrings for a search ranking.

string-similarity/compare-strings.js

Line 14 in ccdb537

if (first.length < 2 || second.length < 2) return 0; // if either is a 1-letter string

isEdgeCaseWithOneOrZeroChars?

Hi :)

Just finished modifying this script so that I can use it in mongoDB (SpiderMonkey with some parts of ES6) without the lodash dependency, and noticed this unused method, isEdgeCaseWithOneOrZeroChars.

It was introduced here, but that's over a year ago, and it hasn't been used since then.

So I'm wondering if it's some unfinished work that should be there, or just a stab at some approach that was deemed unnecessary and then accidentally left behind?

Cheers! :)

Daniel