Giter Site home page Giter Site logo

fasttext4j's Introduction

FastText4j

fastText4j原是Mynlp一个子模块,现在独立成为一个开源项目。(mynlp是一个高性能、模块化、可扩展的中文NLP工具包 )

Implementing Facebook's FastText with java. Fasttext is a library for text representation and classification by facebookresearch. It implements text classification and word embedding learning.

中文文档

Features:

  • Implementing with java(kotlin)
  • Well-designed API
  • Compatible with original C++ model file (include quantizer compression model)
  • Provides training api (almost the same performance)
  • Support for java file formats( can read file use mmap),read big model file with less memory

Installing

Gradle

compile 'com.mayabot:fastText4j:1.2.2'

Maven

<dependency>
  <groupId>com.mayabot</groupId>
  <artifactId>fastText4j</artifactId>
  <version>1.2.2</version>
</dependency>

Tutorial

1. Train model

GitHub

  • ModelName.sup supervised
  • ModelName.sg skipgram
  • ModelName.cow cbow
//Word representation learning
FastText fastText = FastText.train(new File("train.data"), ModelName.sg);

// Text classification

FastText fastText = FastText.train(new File("train.data"), ModelName.sup);

data.txt is also encoded in utf-8 with one sample each line. And it needs to do word spliting beforehand as well. There is a string starting with __label__ in each line,representing the classifying target, such as __label__正面. Each sample could have multiple label. Through the attribute 'label' in TrainArgs, you can customise the head.

2. save model

save model to java format

fastText.saveModel("path/data.model");

3. load model

public Fasttext loadModel(String modelPath, boolean mmap)

//load from java format 
FastText fastText = FastText.loadModel("path/data.model",true);

//load from c++ format
FastText fastText = FastText.loadFasttextBinModel("path/wiki.bin") 

4. quantizer compression

FastText quantize(FastText fastText , int dsub=2, boolean qnorm=false)

//load from java format 
FastText quantizerFastText = FastText.quantize(fastText,2,false);

5.Predict

//predict the result of a word
List<FloatStringPair> predict = fastText.predict(Arrays.asList("fastText在预测标签时使用了非线性激活函数".split(" ")), 5);

6.Nearest Neighbor Search

List<FloatStringPair> predict = fastText.nearestNeighbor("**",5);

7.Analogies

By giving three words A, B and C, return the nearest words in terms of semantic distance and their similarity list, under the condition of (A - B + C).

List<FloatStringPair> predict = fastText.analogies("国王","皇后","男",5);

Ag News example

test agnews data set, train and predict by fastText4j

Result:

   Read 5M words
   Number of words:  95812
   Number of labels: 4
   Progress: 100.00% words/sec/thread:  5792774 lr: 0.00000 loss: 0.28018 ETA: 0h 0m 0s
   Train use time 5275 ms
   total=7600
   right=6889
   rate 0.9064473684210527

Parameters of TrainArgs

The parameters is consistant with the C++ version :

The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurences [1]
  -minCountLabel      minimal number of label occurences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]

Resource

Official pre-trained model

Recent state-of-the-art English word vectors.
Word vectors for 157 languages trained on Wikipedia and Crawl.
Models for language identification and various supervised tasks.

References

Please cite 1 if using this code for learning word representations or 2 if using for text classification.

Enriching Word Vectors with Subword Information

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2017enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={Transactions of the Association for Computational Linguistics},
  volume={5},
  year={2017},
  issn={2307-387X},
  pages={135--146}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}

FastText.zip: Compressing text classification models

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

(* These authors contributed equally.)

License

fastText is BSD-licensed. facebook provide an additional patent grant.

fasttext4j's People

Contributors

jimichan avatar xswds123 avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.