
joshaven / string_score


JavaScript string ranking: 0 for no match up to 1 for perfect... "String".score("str"); //=> 0.825

License: MIT License

Makefile 0.80% CoffeeScript 3.58% JavaScript 82.42% HTML 8.18% CSS 5.02%

string_score's People

Contributors

bltavares, brandoncarl, gorakhargosh, joshaven, lorensr, steelbrain, tarunc, thetron, yesudeep, yichizhang

string_score's Issues

Hello World and jello

"Hello World" and "jello" should score higher than 0 with a fuzziness of 0.5, says your test.

Feature-Request: closest/best match

It would be nice if we could do this: "My string".closest([array of strings]) and get back the index(es) of the best match(es), with multiple results if their scores are the same but the content is not.
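
A minimal sketch of what such a helper might look like. The names `closest` and `standInScore` are hypothetical and not part of string_score; the scorer is passed in as a parameter, and `standInScore` is only a simplified placeholder so the example runs on its own. It does not reproduce string_score's algorithm.

```javascript
// Placeholder scorer (NOT string_score's algorithm): counts query characters
// found in order in the target, case-insensitively, divided by the longer length.
function standInScore(target, query) {
  const t = target.toLowerCase();
  const q = query.toLowerCase();
  const longest = Math.max(t.length, q.length);
  if (longest === 0) return 1;
  let from = 0;
  let matched = 0;
  for (let i = 0; i < q.length; i++) {
    const found = t.indexOf(q[i], from);
    if (found !== -1) {
      matched++;
      from = found + 1;
    }
  }
  return matched / longest;
}

// Returns the best score and the indexes of every candidate that reaches it.
function closest(query, candidates, score) {
  let best = -Infinity;
  let indexes = [];
  candidates.forEach((candidate, i) => {
    const s = score(candidate, query);
    if (s > best) {
      best = s;
      indexes = [i];
    } else if (s === best) {
      indexes.push(i);
    }
  });
  return { score: best, indexes };
}
```

With string_score loaded, the scorer argument could presumably be `(candidate, query) => candidate.score(query)`.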

Add LICENSE to repository

Hi! I see that the readme links to the standard MIT License, but would it be possible to include a LICENSE file in the repository itself with the MIT license text and copyright statement in it?

This practice makes it much easier for us to use the code and attribute it properly in our product.

Thanks!

score demo

Hello,
I was testing the algorithm, so I tried "hello" against "hellô" and the result was 0. WTF!

Don't patch string prototype

Patching prototypes of global objects is a bad practice. Would it be possible to get a version of this module that just exports the scoring function, with a UMD wrapper so it can be used in a CommonJS environment? That would also make it possible to pass the function directly to array.sort()
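
As a rough illustration of the array.sort() use case, assuming a standalone scoring function existed: the sketch below builds a comparator from any `(target, query)` scorer. Both `byScoreDesc` and `standInScore` are hypothetical names, and the scorer is a simplified placeholder rather than string_score's algorithm.

```javascript
// Placeholder scorer (NOT string_score's algorithm): counts query characters
// found in order in the target, case-insensitively, divided by the longer length.
function standInScore(target, query) {
  const t = target.toLowerCase();
  const q = query.toLowerCase();
  const longest = Math.max(t.length, q.length);
  if (longest === 0) return 1;
  let from = 0;
  let matched = 0;
  for (let i = 0; i < q.length; i++) {
    const found = t.indexOf(q[i], from);
    if (found !== -1) {
      matched++;
      from = found + 1;
    }
  }
  return matched / longest;
}

// Builds a comparator that orders strings from best to worst match for `query`.
function byScoreDesc(query, score) {
  return (a, b) => score(b, query) - score(a, query);
}

// Usage: sort a copy of the list so the best matches come first.
const ranked = ["world", "hello", "help"].slice().sort(byScoreDesc("hel", standInScore));
```

Because the scorer is a plain function, nothing on String.prototype needs to be touched.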

Out-of-order search

I had a play with string_score, and while it's excellent for what it does, it doesn't quite fit my use case out of the box. In particular, it lacks recognition for highly similar substrings that are in a different order.

tl;dr: Do you plan to add support for out-of-order substring matching? Or can you at least think of a smart way to do it? If this is totally outside of the scope of string_score, then go ahead and close this. I'm mostly just rubber ducking the problem.

For background, I have two spreadsheets where each row represents a building with some attributes, including ID and name. I'm told that both spreadsheets contain the same 66 buildings, except that one of the spreadsheets has 72 rows, and neither of them use the same IDs or names consistently. One will abbreviate some names, the other will abbreviate others, or the same ones but in a different way. It's a mess, so I'd like an automated, objective mechanism for associating the "matching" rows and ultimately merging the attributes.

For example, when searching for a match for "2G8 Bahagian Pinjaman Perumahan", string_score with 0.5 fuzziness thinks that "PMO" is a better match than "LOT 2G8 (2M10 & 2M11) Bhg. Pinjaman Perumahan, JPM". Or, for a more English example, comparing "university of oxford" with "oxford of university" scores 0.027.

To address this failure mode, I've wrapped it in a pretty gnarly loop:

  1. Both the search string and the comparison string are split into words
  2. Each word of the search string is string_score'd against each word of the comparison string (concatenating the whole comparison string as an option for matching abbreviations)
  3. For each search word, the score for the most similar comparison word is recorded
  4. The score for the search string against the comparison string is taken as the sum of the (maximum) score of each word, normalised by dividing by the number of search words.

Clearly this is more expensive (something like an order of magnitude, or at least a factor of the average number of words per string), but it's pretty easy to implement given what string_score already does. Can you think of a straightforward way to modify your algorithm to handle this kind of case? Or even just a smarter way to package it than mine?
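
The four steps above could be sketched roughly as follows. This is an illustration of the described loop, not the poster's actual code: `wordwiseScore` is a hypothetical name, the scorer is passed in, and `standInScore` is a simplified placeholder that does not reproduce string_score's algorithm.

```javascript
// Placeholder scorer (NOT string_score's algorithm): counts query characters
// found in order in the target, case-insensitively, divided by the longer length.
function standInScore(target, query) {
  const t = target.toLowerCase();
  const q = query.toLowerCase();
  const longest = Math.max(t.length, q.length);
  if (longest === 0) return 1;
  let from = 0;
  let matched = 0;
  for (let i = 0; i < q.length; i++) {
    const found = t.indexOf(q[i], from);
    if (found !== -1) {
      matched++;
      from = found + 1;
    }
  }
  return matched / longest;
}

function wordwiseScore(search, comparison, score) {
  // Step 1: split both strings into words.
  const searchWords = search.split(/\s+/).filter(Boolean);
  const comparisonWords = comparison.split(/\s+/).filter(Boolean);
  if (searchWords.length === 0) return 0;
  // Step 2: the concatenated comparison string is an extra option (abbreviations).
  const options = comparisonWords.concat(comparisonWords.join(""));
  let total = 0;
  for (const word of searchWords) {
    // Step 3: keep the score of the most similar comparison option.
    let best = 0;
    for (const option of options) {
      best = Math.max(best, score(option, word));
    }
    total += best;
  }
  // Step 4: normalise by the number of search words.
  return total / searchWords.length;
}
```

The cost is one scorer call per (search word, comparison option) pair, which matches the overhead estimated above.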

why?

Why use this arbitrary string scoring algorithm over the well-known and common Levenshtein distance? I'm not trying to attack this project, but I'd like to know if it has anything more to offer. There are already a few JavaScript/coffee-script Levenshtein implementations.

Support diacritics

So I stumbled on matching city names in Germany and particularly I have this:

'Gross-Gerau' .score('Groß-Gerau') => 0

I wonder if there's any chance you could support mapping away diacritics first, as an option in the config? That would be much easier. This would be the usage (as I see it):

'Gross-Gerau' .score('Groß-Gerau', { diacriticsMap: true }) => 1

or similar
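
Until such an option exists, one possible workaround is to fold both strings before scoring. The sketch below uses the standard `String.prototype.normalize("NFD")` to split accented letters into a base letter plus combining marks, then strips the marks. Characters such as ß have no Unicode decomposition, so they need an explicit table; `foldDiacritics` and `SPECIAL_FOLDS` are hypothetical names, and the table is a small illustrative sample rather than a complete mapping.

```javascript
// Letters that NFD does not decompose; illustrative, not exhaustive.
const SPECIAL_FOLDS = { "ß": "ss", "æ": "ae", "ø": "o" };

function foldDiacritics(text) {
  return text
    .normalize("NFD")                  // e.g. "ô" becomes "o" + U+0302
    .replace(/[\u0300-\u036f]/g, "")   // drop the combining marks
    .replace(/[ßæø]/g, (ch) => SPECIAL_FOLDS[ch]);
}

// Both inputs would be folded before being handed to the scorer.
const foldedCity = foldDiacritics("Groß-Gerau");
```

Applied to both arguments, this would make 'Gross-Gerau' and 'Groß-Gerau' compare as identical strings before any scoring happens.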

Faster string query

Hello,
I think you can use a hash table to speed up the query.

  1. Create a hash table for the word, with characters as keys and the positions of each character as values.
    Word: Hello
    User Input: eao

  2. Look up the first character entered by the user, 'e'.
    Using the hash table, find an index greater than 0 (you can use binary search).
    The word index is at position 1.

  3. Look up the next character, 'a'.
    Using the hash table, find an index greater than 1 (you can use binary search).
    'a' is not found, so the index does not move.

  4. Look up the next character, 'o'.
    Using the hash table, find an index greater than 1 (you can use binary search).
    The word index is at position 4.

I'm not familiar with JS, so I can't implement it in code. I'm really sorry.
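
A rough JavaScript sketch of this idea, under a few assumptions: the function names are hypothetical, matching is case-sensitive, and the search starts at index 0 rather than strictly after it. It only locates positions; it does not compute a score.

```javascript
// Maps each character of the word to the ascending list of its positions.
function buildIndex(word) {
  const index = new Map();
  for (let i = 0; i < word.length; i++) {
    const ch = word[i];
    if (!index.has(ch)) index.set(ch, []);
    index.get(ch).push(i);
  }
  return index;
}

// Binary search: smallest position >= from, or -1 if there is none.
function firstAtOrAfter(positions, from) {
  let lo = 0;
  let hi = positions.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (positions[mid] < from) lo = mid + 1;
    else hi = mid;
  }
  return lo < positions.length ? positions[lo] : -1;
}

// For each query character, the matched position in the word, or -1.
// The search cursor only advances when a character is found.
function matchPositions(index, query) {
  const found = [];
  let from = 0;
  for (let i = 0; i < query.length; i++) {
    const positions = index.get(query[i]);
    const pos = positions ? firstAtOrAfter(positions, from) : -1;
    found.push(pos);
    if (pos !== -1) from = pos + 1;
  }
  return found;
}

const positionsForEao = matchPositions(buildIndex("Hello"), "eao"); // [1, -1, 4]
```

The index is built once per word, so it pays off mainly when the same word is queried many times.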

The result isn't a proper distance

"Hello World".score("hello")
is not equal to
"hello".score("Hello World")
and thus the result isn't symmetric (a nice property if one wants to start measuring string edit distances).

This is somewhat in connection to issue #10. (This algorithm offers fast execution while giving up on some mathematical properties that some might find nice to have.)

Note that the particular test case above works "as expected" if one uses a fuzziness of 1. That isn't a solution though since it would yield strange results for things like "foo".score("Hello World", 1).

Also, it would be nice if two equal strings gave the result 0 and two completely different strings gave the result 1. Again, to satisfy the distance property. (If the distance between NY and London is 0 then NY and London are the same location. If the distance between the strings "foo" and "bar" is 0, then they are the same strings.)

Finally, I have not yet tested whether it would satisfy the triangle inequality (a.score(b) + b.score(c) >= a.score(c); that is, if you go from location a to location b and then from b to c, you must have walked at least the distance from a to c), but my guess is that it does not. This too is needed for a proper measurement of distance.

On the other hand, you never set out to create an algorithm to measure string distances, only to score strings.

PS. I love the project and it fits perfectly for what I'm about to do. Just writing this to explain why people might ask what this solution has to offer over things like the Levenshtein Distance or Hamming Distance and to document to other users that some things will not be possible with this solution. For the interested, this would likely be an algorithm producing an ordinal scale (ranking scale) so you can always compare two results but you can't do basic arithmetic on it. See http://en.wikipedia.org/wiki/Level_of_measurement for examples of what this implies.
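
For anyone who wants to test these properties empirically, a small brute-force check over sample strings might look like the sketch below. The function names are hypothetical, and nothing here is specific to string_score: any two-argument function can be passed in, for example `(a, b) => 1 - a.score(b)` if one assumes that conversion from score to distance.

```javascript
// Pairs (a, b) for which measure(a, b) differs from measure(b, a).
function asymmetricPairs(strings, measure) {
  const out = [];
  for (let i = 0; i < strings.length; i++) {
    for (let j = i + 1; j < strings.length; j++) {
      const a = strings[i];
      const b = strings[j];
      if (measure(a, b) !== measure(b, a)) out.push([a, b]);
    }
  }
  return out;
}

// Triples (a, b, c) for which distance(a, b) + distance(b, c) < distance(a, c).
function triangleViolations(strings, distance) {
  const out = [];
  for (const a of strings) {
    for (const b of strings) {
      for (const c of strings) {
        if (distance(a, b) + distance(b, c) < distance(a, c)) out.push([a, b, c]);
      }
    }
  }
  return out;
}
```

An empty result on a sample proves nothing in general, but a single returned pair or triple is enough to show that a property fails.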

IE7

When running IE9 in IE7 mode,

"Foo".score("o") // => Unable to get value of the property 'toLowerCase': object is null or undefined, string_score.js, line 23

Not that anyone should still be using IE7, but the README does specify this should work in IE7.
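
One plausible cause, which the report does not confirm, is bracket indexing on strings: IE7 does not support `"Foo"[0]` and yields undefined, so a following `.toLowerCase()` call would fail in this way. If that is what line 23 does, `charAt` would be the portable alternative. The helper name below is hypothetical.

```javascript
// charAt works in old engines and returns "" (not undefined) when out of range,
// so the chained toLowerCase() call cannot throw.
function lowerCharAt(text, index) {
  return text.charAt(index).toLowerCase();
}
```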
