word2vec-ruby

A minor rewrite of, and Ruby C bindings for, the distance program from word2vec. The same thing could almost certainly be done with e.g. the rb-libsvm gem, but this gem does it directly.

N.B.: This does not currently include bindings for the main word2vec program; model creation must be done with the C program. See the original or a fork on GitHub for how to do so.

Installation

Add this line to your application's Gemfile:

gem 'word2vec-ruby'

And then execute:

bundle install

Or install it yourself as:

gem install word2vec-ruby

Usage

Here we assume you already have a model file generated by word2vec (e.g. vector.bin); if you do not, create one first with the word2vec C program.

Assuming the model file is at data/vector.bin, the following shows some basic usage patterns:

require "word2vec/native_model"

# Load the model file.
model = Word2Vec::NativeModel.parse_file("data/vector.bin")

# Get the index of some word in the model's vocabulary:
model.index("cat")
# => 1980

# Get the nearest neighbors for a word:
model.nearest_neighbors(%w(cat), neighbors_count: 3)
# => { "dog" => 0.7418528199195862, "cats" => 0.711361825466156, "puppy" => 0.6765584349632263 }
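
For intuition about what the similarity scores mean, here is a minimal pure-Ruby sketch of the ranking that word2vec's distance program performs: normalize every vector, sum and renormalize the query vectors, then rank the remaining words by dot product (cosine similarity). This is not the gem's native implementation; TOY_VECTORS, normalize and toy_nearest_neighbors are hypothetical names, and the vectors are made up for illustration.

```ruby
# Made-up 3-dimensional vectors; a real model has hundreds of dimensions.
TOY_VECTORS = {
  "cat"   => [0.9, 0.1, 0.0],
  "dog"   => [0.8, 0.3, 0.1],
  "puppy" => [0.7, 0.4, 0.2],
  "car"   => [0.0, 0.2, 0.9]
}.freeze

# Scale a vector to unit length.
def normalize(vec)
  norm = Math.sqrt(vec.sum { |x| x * x })
  vec.map { |x| x / norm }
end

# Hypothetical stand-in for nearest_neighbors, operating on a plain Hash.
def toy_nearest_neighbors(words, neighbors_count:, vectors: TOY_VECTORS)
  unit = vectors.transform_values { |v| normalize(v) }

  # Sum the unit vectors of the query words component-wise, then renormalize.
  query = normalize(words.map { |w| unit.fetch(w) }.transpose.map(&:sum))

  # Score every other word by its dot product with the query and keep the
  # highest-scoring ones, best first.
  scored = unit.reject { |word, _| words.include?(word) }
               .map { |word, v| [word, v.zip(query).sum { |a, b| a * b }] }
  scored.max_by(neighbors_count) { |_, score| score }.to_h
end

p toy_nearest_neighbors(%w(cat), neighbors_count: 2).keys
```

With these toy vectors the two nearest neighbors of "cat" are "dog" and then "puppy", and "car" scores close to zero.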

Caveats

String encoding

In the native C code, we use rb_utf8_str_new_cstr rather than rb_str_new_cstr to create Ruby strings from C strings. This means that any strings coming out of (and, to some extent, going into) Word2Vec::NativeModel should be marked as having Encoding::UTF_8. We do this, rather than using Encoding::ASCII_8BIT, as it is generally more convenient.

If the underlying word2vec model file contains strings which are not UTF-8 encoded, then you should (hopefully?) be able to use String#force_encoding to mark them as the appropriate encoding when they come out of Word2Vec::NativeModel. If this becomes an issue, then it would be fairly straightforward to add an #encoding attribute to Word2Vec::Model, which would default to Encoding::UTF_8 but could be set to anything else.
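
As a rough sketch of the String#force_encoding workaround, assume a model whose words are really ISO-8859-1. The retag_and_transcode helper below is hypothetical (it is not part of word2vec-ruby), and the raw string only simulates what might come out of the model.

```ruby
# Hypothetical helper: re-tag a string that was marked UTF-8 with its real
# encoding, then transcode it to genuine UTF-8.
def retag_and_transcode(word, actual_encoding)
  retagged = word.dup                      # leave the caller's string untouched
  retagged.force_encoding(actual_encoding) # changes the encoding tag only, not the bytes
  retagged.encode(Encoding::UTF_8)         # converts the bytes themselves to UTF-8
end

# Simulated model output: the Latin-1 bytes for "café", tagged as UTF-8.
raw = "caf\xE9".dup.force_encoding(Encoding::UTF_8)

puts raw.valid_encoding?                             # false
puts retag_and_transcode(raw, Encoding::ISO_8859_1)  # café
```

Note that force_encoding alone never changes bytes; the encode step is only needed if the rest of your application expects real UTF-8.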

word2vec references

Some useful references on word2vec:

  1. The original. Contains links to relevant academic papers.
  2. A well-written high level summary of word vectors.
  3. Wikipedia.
  4. Deeplearning4j's description of word2vec. Java-specific, but still a good reference.
  5. A fork of the original on GitHub.
