simstring

A Ruby implementation of the SimString approximate string matching algorithm.

References:

Install

gem install simstring_pure

Usage

In IRB (some lines elided):

irb(main):003:0> require 'simstring_pure'

irb(main):004:0> ngram_builder = SimString::NGramBuilder.new(3)

irb(main):005:0> db = SimString::Database.new(ngram_builder)
irb(main):006:0> db.add("foo")
irb(main):007:0> db.add("bar")
irb(main):008:0> db.add("food")
irb(main):009:0> db.add("floor")

irb(main):010:0> matcher = SimString::StringMatcher.new(db, SimString::CosineMeasure.new)

irb(main):011:0> matcher.search("fooo", 0.6)
=> ["foo"]
irb(main):012:0> matcher.search("fooo", 0.5)
=> ["foo", "food"]
irb(main):021:0> matcher.search("fooor", 0.5)
=> ["foo", "floor"]
irb(main):022:0> matcher.search("for", 0.5)
=> ["floor"]
irb(main):023:0> matcher.search("for", 0.3)
=> ["foo", "food", "floor"]

irb(main):011:0> matcher.ranked_search("fooo", 0.6)
=> [#<struct SimString::Match value="foo", score=0.9128709291752769>]
irb(main):017:0> matcher.ranked_search("fooo", 0.5)
=> [#<struct SimString::Match value="foo", score=0.9128709291752769>, #<struct SimString::Match value="food", score=0.5>]
irb(main):020:0> matcher.ranked_search("fooor", 0.5)
=> [#<struct SimString::Match value="floor", score=0.5714285714285714>, #<struct SimString::Match value="foo", score=0.50709255283711>]
irb(main):021:0> matcher.ranked_search("for", 0.5)
=> [#<struct SimString::Match value="floor", score=0.50709255283711>]
irb(main):022:0> matcher.ranked_search("for", 0.3)
=> [#<struct SimString::Match value="floor", score=0.50709255283711>, #<struct SimString::Match value="foo", score=0.4>, #<struct SimString::Match value="food", score=0.3651483716701107>]
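To see where these scores may come from, here is a minimal brute-force sketch, not the gem's actual implementation. It assumes that features are character trigrams of the string padded with two spaces on each side, and that the cosine score is the number of shared trigrams divided by the square root of the product of the two set sizes. The helper names `trigram_features`, `cosine`, and `brute_force_ranked` are hypothetical and are not part of `simstring_pure`.

```ruby
require 'set'

# Assumption: pad with n - 1 spaces on each side, then take every n-character
# window. Duplicate n-grams collapse in the Set, which is fine for the short
# words used here but may differ from how the gem handles repeats.
def trigram_features(string, n = 3)
  pad = ' ' * (n - 1)
  padded = "#{pad}#{string}#{pad}"
  (0..padded.length - n).map { |i| padded[i, n] }.to_set
end

# Set-based cosine similarity: |A ∩ B| / sqrt(|A| * |B|).
def cosine(a, b)
  (a & b).size / Math.sqrt(a.size * b.size)
end

# Compare the query against every word, keep scores at or above the
# threshold, and sort best-first. This scans the whole list; SimString
# itself uses an index to avoid doing that.
def brute_force_ranked(words, query, threshold)
  q = trigram_features(query)
  words.map { |w| [w, cosine(q, trigram_features(w))] }
       .select { |_, score| score >= threshold }
       .sort_by { |_, score| -score }
end

WORDS = %w[foo bar food floor].freeze
brute_force_ranked(WORDS, 'fooo', 0.5).each do |word, score|
  puts "#{word}\t#{score.round(4)}"
end
```

Under these assumptions, "foo" shares 5 of its 5 padded trigrams with the 6 trigrams of "fooo", giving 5 / sqrt(30), which agrees with the 0.9128... score in the session above.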

Supported String Similarity Measures

  • Cosine
  • Dice
  • Exact
  • Jaccard
  • Overlap
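As a rough illustration of how these measures differ, the sketch below writes each one in its usual set-based form over two feature sets. It assumes the gem follows these standard definitions; the `MEASURES` hash and its lambdas are hypothetical and are not the gem's classes.

```ruby
require 'set'

# Each lambda takes two non-empty feature sets (for example, sets of n-grams)
# and returns a score between 0.0 and 1.0. Empty sets are not handled here.
MEASURES = {
  cosine:  ->(x, y) { (x & y).size / Math.sqrt(x.size * y.size) },
  dice:    ->(x, y) { 2.0 * (x & y).size / (x.size + y.size) },
  exact:   ->(x, y) { x == y ? 1.0 : 0.0 },
  jaccard: ->(x, y) { (x & y).size.to_f / (x | y).size },
  overlap: ->(x, y) { (x & y).size.to_f / [x.size, y.size].min }
}.freeze

# Two small example sets that share two of their features.
x = Set['ab', 'bc', 'cd', 'de']
y = Set['bc', 'cd', 'ef']

MEASURES.each do |name, measure|
  puts "#{name}: #{measure.call(x, y).round(3)}"
end
```

For the same pair of sets, overlap is the most permissive because it divides by the smaller set, and jaccard is the strictest of the four graded measures because it divides by the union. A threshold tuned for one measure will therefore not transfer directly to another.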

Performance:

On a 2.7GHz Core i5 MacBook Pro (Retina, 13-inch, Early 2015), here are some sample timings:

davidellis:~/Projects/ruby/simstring (master) $ simstring wordlists/companynames.txt "Inyel Corp" 0.4
["PHH Corp",
 "Viad Corp",
 "Aegion Corp",
 "B2Gold Corp",
 "InfoSonics Corp",
 "GSV Capital Corp",
 "Intel Corporation"]
1.614527 seconds to build database
0.130983 seconds to search

davidellis:~/Projects/ruby/simstring (master) $ simstring wordlists/companynames.txt "Intel Corp" 0.6
["Intel Corporation"]
1.628863 seconds to build database
0.060129 seconds to search

davidellis:~/Projects/ruby/simstring (master) $ simstring wordlists/unabridged_dictionary.txt "zygoat" 0.7
[]
35.177757 seconds to build database
0.206831 seconds to search

davidellis:~/Projects/ruby/simstring (master) $ simstring wordlists/unabridged_dictionary.txt "zygoat" 0.5
["goat", "zygon", "zygoma", "zygose", "zygote", "zygous", "zygodont"]
34.808823 seconds to build database
0.840492 seconds to search

Word Lists

  • wordlists/companyNames.txt is a list of 5,797 company names
  • wordlists/unabridged_dictionary.txt is a list of 235,544 words from an unabridged dictionary

Run Tests:

rake

OR

rake test

Building and Publishing Gem

$ gem build simstring_pure.gemspec
  Successfully built RubyGem
  Name: simstring_pure
  Version: 1.1.2
  File: simstring_pure-1.1.2.gem

$ gem push simstring_pure-1.1.2.gem
Enter your RubyGems.org credentials.
Don't have an account yet? Create one at https://rubygems.org/sign_up
   Email:   <email goes here>
Password:   <password goes here>

Signed in.
Pushing gem to https://rubygems.org...
Successfully registered gem: simstring_pure (1.1.2)
