joshaven / string_score

JavaScript string ranking: 0 for no match up to 1 for perfect... "String".score("str"); //=> 0.825

License: MIT License

Makefile 0.80% CoffeeScript 3.58% JavaScript 82.42% HTML 8.18% CSS 5.02%

string_score's Issues

Hello World and jello

"Hello World" and "jello" should score higher than 0 with a fuzziness of 0.5, says your test.

Feature-Request: closest/best match

It would be nice if we could do this: "My string".closest([array of strings]), which would return the index(es) of the best match(es), with multiple results if several strings have the same score but different content.
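
A minimal sketch of what such a helper could look like, written as a standalone function rather than a String method. The name closest and the scoreFn callback are assumptions for illustration and are not part of string_score. scoreFn(query, candidate) is expected to return a number where higher means a better match.

```javascript
// Hypothetical helper (not part of string_score): returns the indexes of the
// best-scoring candidates. Ties are kept only when the tied strings differ.
function closest(query, candidates, scoreFn) {
  let best = -Infinity;
  let indexes = [];
  let seen = new Set(); // contents already recorded at the current best score

  candidates.forEach((candidate, i) => {
    const s = scoreFn(query, candidate);
    if (s > best) {
      // New best score: start a fresh result list.
      best = s;
      indexes = [i];
      seen = new Set([candidate]);
    } else if (s === best && !seen.has(candidate)) {
      // Same score, different content: keep it as an additional result.
      indexes.push(i);
      seen.add(candidate);
    }
  });

  // score is null when there were no candidates to compare.
  return { score: indexes.length ? best : null, indexes };
}
```

With string_score loaded, the callback might be (query, candidate) => candidate.score(query). Which string plays the role of the abbreviation is a choice left to the caller.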

Add LICENSE to repository

Hi! I see that the readme links to the standard MIT License, but would it be possible to include a LICENSE file in the repository itself with the MIT license text and copyright statement in it?

This practice makes it much easier for us to use the code and attribute it properly in our product.

Thanks!

Register as a bower component?

Hiya! I like this plugin, and I was wondering if you'd be willing to register it as a bower component?

score demo

Hello,
I was testing the algorithm, so I tried hello and hellô, and the result was 0. WTF!

Don't patch string prototype

Patching prototypes of global objects is a bad practice. Would it be possible to get a version of this module that just exports the scoring function, with a UMD wrapper so it can be used in a CommonJS environment? That would also make it possible to pass the function directly to array.sort().

Out-of-order search

I had a play with string_score, and while it's excellent for what it does, it doesn't quite fit my use case out of the box. In particular, it lacks recognition for highly similar substrings that are in a different order.

tl;dr: Do you plan to add support for out-of-order substring matching? Or can you at least think of a smart way to do it? If this is totally outside of the scope of string_score, then go ahead and close this. I'm mostly just rubber ducking the problem.

For background, I have two spreadsheets where each row represents a building with some attributes, including ID and name. I'm told that both spreadsheets contain the same 66 buildings, except that one of the spreadsheets has 72 rows, and neither of them use the same IDs or names consistently. One will abbreviate some names, the other will abbreviate others, or the same ones but in a different way. It's a mess, so I'd like an automated, objective mechanism for associating the "matching" rows and ultimately merging the attributes.

For example, when searching for a match for "2G8 Bahagian Pinjaman Perumahan", string_score with 0.5 fuzziness thinks that "PMO" is a better match than "LOT 2G8 (2M10 & 2M11) Bhg. Pinjaman Perumahan, JPM". Or, for a more English example, comparing "university of oxford" with "oxford of university" scores 0.027.

To address this failure mode, I've wrapped it in a pretty gnarly loop:

1. Both the search string and the comparison string are split into words.
2. Each word of the search string is string_score'd against each word of the comparison string (concatenating the whole comparison string as an option for matching abbreviations).
3. For each search word, the score for the most similar comparison word is recorded.
4. The score for the search string against the comparison string is taken as the sum of the (maximum) score of each word, normalised by dividing by the number of search words.

Clearly this is more expensive (something like an order of magnitude, or at least a factor of the average number of words per string), but it's pretty easy to implement given what string_score already does. Can you think of a straightforward way to modify your algorithm to handle this kind of case? Or even just a smarter way to package it than mine?
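
The loop described above can be sketched roughly as follows. The function name wordwiseScore and the scoreFn callback are hypothetical, and treating the whole comparison string as one extra option is my reading of the "concatenating" step. scoreFn(searchWord, option) should return a similarity where higher is better.

```javascript
// Hypothetical wrapper around a caller-supplied scoring function.
function wordwiseScore(search, comparison, scoreFn) {
  const splitWords = (s) => s.trim().split(/\s+/).filter(Boolean);

  const searchWords = splitWords(search);
  if (searchWords.length === 0) return 0;

  // Every comparison word, plus the whole comparison string so that an
  // abbreviation in the search string still has something to match against.
  const options = splitWords(comparison).concat(comparison);

  let total = 0;
  for (const word of searchWords) {
    // Keep only the best score this search word achieves against any option.
    let best = 0;
    for (const option of options) {
      best = Math.max(best, scoreFn(word, option));
    }
    total += best;
  }

  // Normalise by the number of search words.
  return total / searchWords.length;
}
```

The cost is one scoreFn call per (search word, option) pair, which matches the rough estimate above. With string_score loaded, scoreFn could be something like (word, option) => option.score(word, 0.5).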

"romanouskas delicatessen".score("romanouskas delicatessen") //=> 0

The only diff is the second to last character ("e" -> "a"). Bug?

"you;S".score("you10") => 0

"you;S".score("you10") yields a score of 0. Seems like it should produce a score higher than that, no?

why?

Why use this arbitrary string scoring algorithm over the well-known and common Levenshtein distance? I'm not trying to attack this project, but I'd like to know if it has anything more to offer. There are already a few JavaScript/coffee-script Levenshtein implementations.

Support diacritics

I stumbled on this while matching city names in Germany. In particular, I have this:

'Gross-Gerau'.score('Groß-Gerau') => 0

Is there any chance you could support mapping the strings to a diacritic-free form first, as an option in the config? That would be much easier. Usage might look like this (as I see it):

'Gross-Gerau'.score('Groß-Gerau', { diacriticsMap: true }) => 1

or similar.
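
Until something like that exists, one workaround is to fold both strings before scoring. The sketch below is an assumption about how such a mapping could work, not string_score's behaviour. foldDiacritics and SPECIAL_FOLDS are hypothetical names. It relies on the standard String.prototype.normalize to split accented letters into base letter plus combining mark, and on a small hand-written table for letters such as ß that Unicode normalization does not decompose.

```javascript
// Letters with no Unicode decomposition need an explicit mapping.
// This table is illustrative, not exhaustive.
const SPECIAL_FOLDS = {
  "ß": "ss", "ẞ": "SS",
  "æ": "ae", "Æ": "AE",
  "ø": "o", "Ø": "O",
};

function foldDiacritics(text) {
  return text
    .normalize("NFD")                  // "ô" becomes "o" + U+0302
    .replace(/[\u0300-\u036f]/g, "")   // drop the combining marks
    .replace(/[ßẞæÆøØ]/g, (ch) => SPECIAL_FOLDS[ch]);
}

console.log(foldDiacritics("Groß-Gerau")); // "Gross-Gerau"
console.log(foldDiacritics("hellô"));      // "hello"
```

Both strings would then be passed through foldDiacritics before calling score. Note that this folds ö to o rather than the German convention oe, so the table may need adjusting per language.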

Faster string query

Hello,
I think you can use a hash table to speed up the query.

1) Create a hash table for the word, with characters as keys and the positions of each character as values.
Word: Hello
User input: eao

2) Look up the first character entered by the user: 'e'.
Using the hash table, find an index greater than 0 (you can use binary search).
The word index is at position 1.

3) Look up the next character entered by the user: 'a'.
Using the hash table, find an index greater than 1 (you can use binary search).
'a' is not found, so the index does not move.

4) Look up the next character entered by the user: 'o'.
Using the hash table, find an index greater than 1 (you can use binary search).
The word index is at position 4.

I'm not familiar with JS, so I can't implement it in code. I'm really sorry.
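
For reference, the lookup described above could be sketched like this. It only covers the position lookup, not the scoring, and it is not how string_score is implemented. The function names are hypothetical. The sketch lowercases both sides and starts the cursor at -1 so that a match at position 0 is possible; both are assumptions on my part.

```javascript
// Map each character of the word to the ascending list of its positions.
function buildIndex(word) {
  const index = new Map();
  for (let i = 0; i < word.length; i++) {
    const ch = word[i].toLowerCase();
    if (!index.has(ch)) index.set(ch, []);
    index.get(ch).push(i);
  }
  return index;
}

// Binary search: smallest position strictly greater than cursor, or -1.
function firstGreaterThan(positions, cursor) {
  let lo = 0;
  let hi = positions.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (positions[mid] > cursor) hi = mid;
    else lo = mid + 1;
  }
  return lo < positions.length ? positions[lo] : -1;
}

// For each input character, report where it matched (-1 if it did not).
function matchPositions(index, input) {
  let cursor = -1;
  const found = [];
  for (let i = 0; i < input.length; i++) {
    const positions = index.get(input[i].toLowerCase());
    const pos = positions ? firstGreaterThan(positions, cursor) : -1;
    found.push(pos);
    if (pos !== -1) cursor = pos; // a miss leaves the cursor where it was
  }
  return found;
}

console.log(matchPositions(buildIndex("Hello"), "eao")); // [ 1, -1, 4 ]
```

Building the index costs one pass over the word, so it pays off mainly when the same word is queried many times.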

The result isn't a proper distance

"Hello World".score("hello")
is not equal to
"hello".score("Hello World")
and thus the result isn't symmetric (a nice property if one wants to start measuring string edit distances).

This is somewhat in connection to issue #10. (This algorithm offers fast execution while giving up on some mathematical properties that some might find nice to have.)

Note that the particular test case above works "as expected" if one uses a fuzziness of 1. That isn't a solution though since it would yield strange results for things like "foo".score("Hello World", 1).

Also, it would be nice if two equal strings gave the result 0 and two completely different strings gave the result 1. Again, to satisfy the distance property. (If the distance between NY and London is 0 then NY and London are the same location. If the distance between the strings "foo" and "bar" is 0, then they are the same strings.)

Finally, I have not yet tested if it would satisfy the triangle inequality (a.score(b) + b.score(c) >= a.score(c) or, if you go from location a to location b and then from b to c, you must have walked further than or equal to the distance from a to c), but my guess is not. This, too, is needed for a proper measurement of distance.

On the other hand, you never set out to create an algorithm to measure string distances, only to score strings.

PS. I love the project and it fits perfectly for what I'm about to do. Just writing this to explain why people might ask what this solution has to offer over things like the Levenshtein Distance or Hamming Distance and to document to other users that some things will not be possible with this solution. For the interested, this would likely be an algorithm producing an ordinal scale (ranking scale) so you can always compare two results but you can't do basic arithmetic on it. See http://en.wikipedia.org/wiki/Level_of_measurement for examples of what this implies.
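
If someone needs at least the symmetry and zero-for-equal properties, a thin wrapper can provide them. This is only a sketch under assumptions: symmetricDissimilarity is a hypothetical name, scoreFn stands for any scorer returning values in the 0 to 1 range, and taking the larger of the two directions is an arbitrary choice. It does not make the result a proper distance, because the triangle inequality is still not guaranteed.

```javascript
// Hypothetical wrapper: 0 for identical strings, 1 for no match either way,
// and the same value regardless of argument order.
function symmetricDissimilarity(a, b, scoreFn) {
  if (a === b) return 0;
  // Score in both directions and keep the more favourable one.
  const similarity = Math.max(scoreFn(a, b), scoreFn(b, a));
  // Clamp to [0, 1] in case the scorer strays outside that range.
  return 1 - Math.min(1, Math.max(0, similarity));
}
```

With string_score loaded, scoreFn might be (x, y) => x.score(y). The output is still best read as a ranking rather than something to do arithmetic on.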

IE7

When running IE9 in IE7 mode,

"Foo".score("o") // => Unable to get value of the property 'toLowerCase': object is null or undefined, string_score.js, line 23

Not that anyone should still be using IE7, but the README does specify this should work in IE7.

I'd like to use this script on the server side

This script is perfect for me to calculate the relevance of my search strings, but I need to use it on the server side...

Any ideas?