fasttext.py's Introduction

fasttext

fasttext is a Python interface for Facebook fastText.

Requirements

fasttext supports Python 2.6 or newer. It requires Cython to compile the C++ extension.

Installation

pip install fasttext

Example usage

This package has two main use cases: word representation learning and text classification.

These are described in papers [1] and [2].

Word representation learning

To learn word vectors as described in [1], we can use the fasttext.skipgram and fasttext.cbow functions as follows:

import fasttext

# Skipgram model
model = fasttext.skipgram('data.txt', 'model')
print(model.words) # list of words in dictionary

# CBOW model
model = fasttext.cbow('data.txt', 'model')
print(model.words) # list of words in dictionary

where data.txt is a training file containing UTF-8 encoded text. By default, the word vectors take into account character n-grams of 3 to 6 characters.
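To make the n-gram range concrete, here is a minimal sketch of how a word can be split into character n-grams, following the description in [1]. The function name char_ngrams is hypothetical and not part of this package; the real implementation also adds the whole word as its own token and hashes each n-gram into a fixed number of buckets, which this sketch leaves out.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word`, shortest first."""
    # Mark the start and end of the word so that prefixes and suffixes
    # stay distinguishable from n-grams taken from the middle.
    marked = "<" + word + ">"
    ngrams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n across the marked word.
        for start in range(len(marked) - n + 1):
            ngrams.append(marked[start:start + n])
    return ngrams


# Only the 3-grams of "where", as in the example given in [1].
trigrams = char_ngrams("where", minn=3, maxn=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

The minn and maxn arguments mirror the parameters of the same name listed in the API documentation.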

At the end of optimization the program will save two files: model.bin and model.vec.

model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters.
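Because model.vec is plain text, it can be read without this package. The sketch below assumes the layout fastText normally writes: an optional first line holding the vocabulary size and the dimension, then one line per word with space-separated values. The helper name read_vec and the sample numbers are made up for illustration.

```python
import io


def read_vec(stream):
    """Parse word vectors from a text stream into a {word: [floats]} dict."""
    vectors = {}
    for line_no, line in enumerate(stream):
        parts = line.rstrip().split(" ")
        # Assumption: a first line with exactly two integer fields is a
        # "<vocabulary size> <dimension>" header, not a word vector.
        if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny in-memory stand-in for a model.vec file (made-up numbers).
sample = io.StringIO(u"2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
vectors = read_vec(sample)
print(sorted(vectors))   # ['king', 'queen']
print(vectors["king"])   # [0.1, 0.2, 0.3]
```

For a real file, pass a handle opened with io.open('model.vec', encoding='utf-8') instead of the in-memory sample.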

The binary file can be used later to compute word vectors or to restart the optimization.

The following fasttext(1) commands are equivalent:

# Skipgram model
./fasttext skipgram -input data.txt -output model

# CBOW model
./fasttext cbow -input data.txt -output model

Obtaining word vectors for out-of-vocabulary words

The previously trained model can be used to compute word vectors for out-of-vocabulary words.

print(model.get_vector('king')) # get the vector of the word 'king'

The following fasttext(1) command is equivalent:

echo "king" | ./fasttext print-vectors model.bin

This prints the vector for the word king to standard output.
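Out-of-vocabulary words can get a vector because the model stores vectors for character n-grams, not only for whole words [1]. The toy sketch below illustrates the idea by averaging the vectors of the n-grams found in a word. It is not the package's implementation: the names compose_vector and toy_table are hypothetical, the numbers are made up, and the real model looks n-grams up through hashed buckets rather than a dictionary.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of `word` with '<' and '>' boundary markers."""
    marked = "<" + word + ">"
    return [marked[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(marked) - n + 1)]


def compose_vector(word, ngram_vectors, dim, minn=3, maxn=6):
    """Average the vectors of the known n-grams of `word`."""
    total = [0.0] * dim
    hits = 0
    for gram in char_ngrams(word, minn, maxn):
        vec = ngram_vectors.get(gram)
        if vec is None:
            continue  # this toy table has no entry for the n-gram
        total = [a + b for a, b in zip(total, vec)]
        hits += 1
    if hits == 0:
        return total  # no known n-gram: fall back to the zero vector
    return [x / hits for x in total]


# Hypothetical 2-dimensional n-gram vectors.
toy_table = {"<ki": [1.0, 0.0], "ng>": [0.0, 1.0]}
vector = compose_vector("king", toy_table, dim=2, minn=3, maxn=3)
print(vector)  # [0.5, 0.5]
```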

Load pre-trained model

We can use fasttext.load_model to load a pre-trained model:

model = fasttext.load_model('model.bin')
print(model.words) # list of words in dictionary
print(model.get_vector('king')) # get the vector of the word 'king'

Text classification

Work in progress.

API documentation

Word vector model

import fasttext

model = fasttext.skipgram(params)
model.words
model.get_vector(word)

model = fasttext.cbow(params)
model.words
model.get_vector(word)

model = fasttext.load_model('model.bin')
model.words
model.get_vector(word)

List of params and their default values:

input       training file path
output      output file path
lr          learning rate [0.05]
dim         size of word vectors [100]
ws          size of the context window [5]
epoch       number of epochs [5]
min_count   minimal number of word occurrences [1]
neg         number of negatives sampled [5]
word_ngrams max length of word ngram [1]
loss        loss function {ns, hs, softmax} [ns]
bucket      number of buckets [2000000]
minn        min length of char ngram [3]
maxn        max length of char ngram [6]
thread      number of threads [12]
verbose     how often to print to stdout [10000]
t           sampling threshold [0.0001]
silent      suppress the log from the C++ extension [1]
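Most of these params correspond to a flag of the fasttext(1) command-line tool. As a rough sketch, the helper below turns Python-style keyword params into the equivalent command line. The helper name build_command is hypothetical, and the flag spellings (for example -minCount for min_count) are those of the command-line tool as we understand them; check the tool's own help output before relying on them. silent is left out because it only concerns the log of the C++ extension.

```python
# Python keyword -> fasttext(1) flag (assumed spellings, see note above).
CLI_FLAGS = {
    "lr": "-lr", "dim": "-dim", "ws": "-ws", "epoch": "-epoch",
    "min_count": "-minCount", "neg": "-neg", "word_ngrams": "-wordNgrams",
    "loss": "-loss", "bucket": "-bucket", "minn": "-minn", "maxn": "-maxn",
    "thread": "-thread", "verbose": "-verbose", "t": "-t",
}


def build_command(mode, input_path, output_path, **params):
    """Return the argument list for an equivalent fasttext(1) call."""
    argv = ["./fasttext", mode, "-input", input_path, "-output", output_path]
    # Sort the names so the resulting command line is deterministic.
    for name in sorted(params):
        if name not in CLI_FLAGS:
            raise ValueError("unknown parameter: {0}".format(name))
        argv.extend([CLI_FLAGS[name], str(params[name])])
    return argv


command = build_command("skipgram", "data.txt", "model", dim=300, min_count=5)
print(" ".join(command))
# ./fasttext skipgram -input data.txt -output model -dim 300 -minCount 5
```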

References

Enriching Word Vectors with Subword Information

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2016enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.04606},
  year={2016}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

(* These authors contributed equally.)
