tomoto.rb

🍅 tomoto - high performance topic modeling - for Ruby

Installation

Add this line to your application’s Gemfile:

gem "tomoto"

Getting Started

Train a model

model = Tomoto::LDA.new(k: 2)
model.add_doc("text from document one")
model.add_doc("text from document two")
model.add_doc("text from document three")
model.train(100) # iterations

Get the summary

model.summary

Get topic words

model.topic_words

Save the model to a file

model.save("model.bin")

Load the model from a file

model = Tomoto::LDA.load("model.bin")

Get topic probabilities for a document

doc = model.docs[0]
doc.topics

Get the number of words for each topic

model.count_by_topics
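
Raw counts can be easier to compare as proportions. The following is a minimal sketch, assuming count_by_topics returns an array of integer word counts with one entry per topic; topic_shares is a hypothetical helper, not part of the library, and the counts are made up:

```ruby
# Hypothetical helper (not part of tomoto): converts per-topic word counts
# into each topic's share of the total word count.
def topic_shares(counts)
  total = counts.sum
  # Avoid dividing by zero when no words have been assigned yet
  return counts.map { 0.0 } if total.zero?

  counts.map { |count| count.to_f / total }
end

# Made-up counts standing in for model.count_by_topics
p topic_shares([30, 10])
```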

Get the vocab

model.vocabs

Get the log likelihood per word

model.ll_per_word
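
The per-word log likelihood is often reported as perplexity instead. The following is a minimal sketch, assuming ll_per_word is a natural-log likelihood averaged over words; the perplexity helper is hypothetical, not part of the library, and the input value is made up:

```ruby
# Hypothetical helper (not part of tomoto): converts an average per-word
# log likelihood (natural log assumed) into perplexity.
def perplexity(ll_per_word)
  Math.exp(-ll_per_word)
end

# Made-up value; with a trained model you would pass model.ll_per_word
puts perplexity(-7.5).round(2)
```

Lower perplexity corresponds to a higher log likelihood per word.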

Perform inference for unseen documents

doc = model.make_doc("unseen doc")
topic_dist, ll = model.infer(doc)
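
A common next step is to pick the most likely topic for the document. The following is a minimal sketch, assuming topic_dist is an array of probabilities indexed by topic id; dominant_topic is a hypothetical helper, not part of the library, and the distribution is made up:

```ruby
# Hypothetical helper (not part of tomoto): returns [topic_id, probability]
# for the most probable topic in an array of per-topic probabilities.
def dominant_topic(topic_dist)
  raise ArgumentError, "empty distribution" if topic_dist.empty?

  # each_with_index yields [probability, index]; reverse to [index, probability]
  topic_dist.each_with_index.max_by { |prob, _id| prob }.reverse
end

# Made-up distribution standing in for the topic_dist returned by model.infer(doc)
topic_id, prob = dominant_topic([0.12, 0.88])
puts "topic #{topic_id} (p=#{prob})"
```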

Models

Supports:

  • Latent Dirichlet Allocation (LDA)
  • Labeled LDA (LLDA)
  • Partially Labeled LDA (PLDA)
  • Supervised LDA (SLDA)
  • Dirichlet Multinomial Regression (DMR)
  • Generalized Dirichlet Multinomial Regression (GDMR)
  • Hierarchical Dirichlet Process (HDP)
  • Hierarchical LDA (HLDA)
  • Multi Grain LDA (MGLDA)
  • Pachinko Allocation (PA)
  • Hierarchical PA (HPA)
  • Correlated Topic Model (CT)
  • Dynamic Topic Model (DT)

API

This library follows the tomotopy API. There are a few changes to make it more Ruby-like:

  • The get_ prefix has been removed from methods (topic_words instead of get_topic_words)
  • Methods that return booleans use ? instead of is_ (live_topic? instead of is_live_topic)
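
When reading the tomotopy documentation, the two rules above can be applied mechanically. The following is a minimal sketch with a hypothetical helper, ruby_method_name, that is not part of the library. It covers only the two listed rules and assumes every other name carries over unchanged, which may not hold for every method:

```ruby
# Hypothetical helper (not part of tomoto): applies the two renaming rules
# to a tomotopy method name.
def ruby_method_name(tomotopy_name)
  case tomotopy_name
  when /\Aget_(.+)\z/ then Regexp.last_match(1)        # get_topic_words -> topic_words
  when /\Ais_(.+)\z/  then "#{Regexp.last_match(1)}?"  # is_live_topic -> live_topic?
  else tomotopy_name                                   # assumed unchanged
  end
end

puts ruby_method_name("get_topic_words")
puts ruby_method_name("is_live_topic")
puts ruby_method_name("train")
```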

If a method or option you need isn’t supported, feel free to open an issue.

Examples

Tokenization

Documents are tokenized by whitespace by default. To use your own tokenization, pass an array of tokens instead of a string:

model.add_doc(["tokens", "from", "document", "one"])
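
The array of tokens can come from any tokenizer. The following is a minimal sketch of one: tokenize and its default stopword list are hypothetical and not part of the library, and the pattern only handles ASCII letters, digits, and apostrophes:

```ruby
# Hypothetical tokenizer (not part of tomoto): lowercases the text,
# extracts runs of ASCII letters, digits, and apostrophes, and drops stopwords.
# The default stopword list is for illustration only.
def tokenize(text, stopwords: %w[a an and the of to from])
  text.downcase.scan(/[a-z0-9']+/).reject { |token| stopwords.include?(token) }
end

tokens = tokenize("Text from Document One, and more!")
p tokens

# With a model in scope, the tokens would be passed as an array:
# model.add_doc(tokens)
```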

Performance

tomoto uses AVX2, AVX, or SSE2 instructions to increase performance on machines that support them. Check which instruction set architecture it’s using with:

Tomoto.isa

Parallelism

Choose a parallelism algorithm with:

model.train(parallel: :partition)

Supported values are :default, :none, :copy_merge, and :partition.
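
If the value comes from configuration or user input, it can be checked before training. The following is a minimal sketch: validate_parallel is a hypothetical helper, not part of the library, and its default list simply mirrors the four values named above:

```ruby
# Hypothetical helper (not part of tomoto): checks a parallelism option
# against the supported values before it is passed to train.
def validate_parallel(value, supported: %i[default none copy_merge partition])
  scheme = value.to_sym
  return scheme if supported.include?(scheme)

  raise ArgumentError,
        "unsupported parallel option: #{value.inspect} (expected one of #{supported.inspect})"
end

scheme = validate_parallel("copy_merge")
p scheme

# With a model in scope:
# model.train(parallel: scheme)
```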

History

View the changelog

Contributing

Everyone is encouraged to help improve this project.

To get started with development:

git clone --recursive https://github.com/ankane/tomoto-ruby.git
cd tomoto-ruby
bundle install
bundle exec rake compile
bundle exec rake test
