facebookresearch / fastText

Library for fast text representation and classification.

Home Page: https://fasttext.cc/

License: MIT License


fastText

fastText is a library for efficient learning of word representations and sentence classification.


Table of contents

  • Resources (Models, Supplementary data, FAQ, Cheatsheet)
  • Requirements
  • Building fastText
  • Example use cases
  • Full documentation
  • References
  • Join the fastText community
  • License

FAQ

You can find answers to frequently asked questions on our website.

Cheatsheet

We also provide a cheatsheet full of useful one-liners.

Requirements

We are continuously building and testing our library, CLI, and Python bindings under various Docker images using CircleCI.

Generally, fastText builds on modern Mac OS and Linux distributions. Since it uses some C++11 features, it requires a compiler with good C++11 support. These include:

  • (g++-4.7.2 or newer) or (clang-3.3 or newer)

Compilation is carried out using a Makefile, so you will need to have a working make. If you want to use cmake, you will need at least version 2.8.9.

One of the oldest distributions we successfully built and tested the CLI under is Debian jessie.

For the word-similarity evaluation script you will need:

  • Python 2.6 or newer
  • NumPy & SciPy

For the python bindings (see the subdirectory python) you will need:

  • Python version 2.7 or >=3.4
  • NumPy & SciPy
  • pybind11

One of the oldest distributions we successfully built and tested the Python bindings under is Debian jessie.

If these requirements make it impossible for you to use fastText, please open an issue and we will try to accommodate you.

Building fastText

We discuss building the latest stable version of fastText.

Getting the source code

You can find our latest stable release on the project's GitHub releases page.

There is also the master branch that contains all of our most recent work, but comes along with all the usual caveats of an unstable branch. You might want to use this if you are a developer or power-user.

Building fastText using make (preferred)

$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
$ unzip v0.9.2.zip
$ cd fastText-0.9.2
$ make

This will produce object files for all the classes as well as the main binary fasttext. If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).

Building fastText using cmake

For now this is not part of a release, so you will need to clone the master branch.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install

This will create the fasttext binary and also all relevant libraries (shared, static, PIC).

Building fastText for Python

For now this is not part of a release, so you will need to clone the master branch.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip install .

For further information and an introduction, see python/README.md.
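A quick way to check that the bindings installed correctly is to import the module and confirm the main entry points are available. A minimal sketch, assuming the package installed by the pip command above is importable as fasttext (the exact API is documented in python/README.md, not in this README):

import fasttext

# If this prints True True, the training entry points of the bindings are available.
print(hasattr(fasttext, "train_unsupervised"), hasattr(fasttext, "train_supervised"))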

Example use cases

This library has two main use cases: word representation learning and text classification. These were described in the two papers [1] and [2].

Word representation learning

In order to learn word vectors, as described in [1], do:

$ ./fasttext skipgram -input data.txt -output model

where data.txt is a training file containing UTF-8 encoded text. By default the word vectors will take into account character n-grams from 3 to 6 characters. At the end of optimization the program will save two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters. The binary file can be used later to compute word vectors or to restart the optimization.
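The Python bindings expose the same training mode. Below is a minimal sketch mirroring the command above; the function and parameter names (train_unsupervised, model, minn, maxn, dim) come from the bindings rather than this README, and the values simply spell out the defaults mentioned in the text:

import fasttext

# Equivalent of: ./fasttext skipgram -input data.txt -output model
model = fasttext.train_unsupervised("data.txt", model="skipgram", minn=3, maxn=6, dim=100)

# Counterpart of the model.bin file written by the CLI.
model.save_model("model.bin")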

Obtaining word vectors for out-of-vocabulary words

The previously trained model can be used to compute word vectors for out-of-vocabulary words. Provided you have a text file queries.txt containing words for which you want to compute vectors, use the following command:

$ ./fasttext print-word-vectors model.bin < queries.txt

This will output word vectors to the standard output, one vector per line. This can also be used with pipes:

$ cat queries.txt | ./fasttext print-word-vectors model.bin
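If you work from the Python bindings instead, the same lookup is available programmatically. A minimal sketch, reusing the model.bin and queries.txt names from above (load_model and get_word_vector are functions of the bindings, not of the CLI):

import fasttext

model = fasttext.load_model("model.bin")

# Subword n-grams let the model build a vector even for words it never saw in training.
with open("queries.txt", encoding="utf-8") as f:
    for word in (line.strip() for line in f):
        if word:
            print(word, model.get_word_vector(word))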

See the provided scripts for an example. For instance, running:

$ ./word-vector-example.sh

will compile the code, download data, compute word vectors and evaluate them on the rare words similarity dataset RW [Thang et al. 2013].

Text classification

This library can also be used to train supervised text classifiers, for instance for sentiment analysis. In order to train a text classifier using the method described in [2], use:

$ ./fasttext supervised -input train.txt -output model

where train.txt is a text file containing one training sentence per line along with the labels. By default, we assume that labels are words prefixed by the string __label__. This will output two files: model.bin and model.vec. Once the model is trained, you can evaluate it by computing the precision and recall at k (P@k and R@k) on a test set using:

$ ./fasttext test model.bin test.txt k

The argument k is optional, and is equal to 1 by default.
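The same train-and-evaluate loop can be driven from Python. A minimal sketch, assuming the train.txt and test.txt files described above; in the bindings, model.test returns the number of examples together with P@k and R@k:

import fasttext

# Equivalent of: ./fasttext supervised -input train.txt -output model
model = fasttext.train_supervised(input="train.txt")
model.save_model("model.bin")

# Equivalent of: ./fasttext test model.bin test.txt 1
n_examples, precision_at_1, recall_at_1 = model.test("test.txt", k=1)
print(n_examples, precision_at_1, recall_at_1)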

In order to obtain the k most likely labels for a piece of text, use:

$ ./fasttext predict model.bin test.txt k

or use predict-prob to also get the probability for each label:

$ ./fasttext predict-prob model.bin test.txt k

where test.txt contains a piece of text to classify per line. Doing so will print to the standard output the k most likely labels for each line. The argument k is optional, and equal to 1 by default. See classification-example.sh for an example use case. In order to reproduce the results from the paper [2], run classification-results.sh; this will download all the datasets and reproduce the results from Table 1.
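From the Python bindings, predict returns the labels together with their probabilities in a single call, which covers both predict and predict-prob. A minimal sketch (the input sentence is made up for illustration):

import fasttext

model = fasttext.load_model("model.bin")

# Top-3 labels and their probabilities, like predict-prob with k=3.
labels, probabilities = model.predict("this is a sentence to classify", k=3)
for label, probability in zip(labels, probabilities):
    print(label, probability)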

If you want to compute vector representations of sentences or paragraphs, please use:

$ ./fasttext print-sentence-vectors model.bin < text.txt

This assumes that the text.txt file contains the paragraphs that you want to get vectors for. The program will output one vector representation per line in the file.
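The Python counterpart is get_sentence_vector, applied line by line. A minimal sketch, reusing the text.txt name from above (the input is stripped because the bindings expect a single line without a trailing newline):

import fasttext

model = fasttext.load_model("model.bin")

# One vector per input line, mirroring print-sentence-vectors.
with open("text.txt", encoding="utf-8") as f:
    for line in f:
        print(model.get_sentence_vector(line.strip()))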

You can also quantize a supervised model to reduce its memory usage with the following command:

$ ./fasttext quantize -output model

This will create a .ftz file with a smaller memory footprint. All the standard functionality, like test or predict, works the same way on the quantized models:

$ ./fasttext test model.ftz test.txt

The quantization procedure follows the steps described in [3]. You can run the script quantization-example.sh for an example.
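Quantization can also be done from Python via model.quantize, whose keyword arguments mirror the CLI flags listed further below. A minimal sketch; the cutoff and retrain values are illustrative, not recommendations:

import fasttext

model = fasttext.train_supervised(input="train.txt")

# Compress the model; qnorm, retrain and cutoff correspond to -qnorm, -retrain and -cutoff.
model.quantize(input="train.txt", qnorm=True, retrain=True, cutoff=100000)
model.save_model("model.ftz")

# The quantized model is used exactly like the original one.
print(model.test("test.txt"))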

Full documentation

Invoke a command without arguments to list available arguments and their default values:

$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -verbose            verbosity level [2]

The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurrences [1]
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]

Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)
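In the Python bindings, the same options are passed as keyword arguments. A minimal sketch showing a few of them; the values are examples only, not recommended settings:

import fasttext

# Word bigrams, hierarchical softmax loss and a longer training run,
# mirroring -wordNgrams 2 -loss hs -epoch 25 -lr 0.5 -dim 50 on the CLI.
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.5,
    dim=50,
    epoch=25,
    wordNgrams=2,
    loss="hs",
)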

References

Please cite [1] if using this code for learning word representations, or [2] if using it for text classification.

Enriching Word Vectors with Subword Information

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2017enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={Transactions of the Association for Computational Linguistics},
  volume={5},
  year={2017},
  issn={2307-387X},
  pages={135--146}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}

FastText.zip: Compressing text classification models

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

(* These authors contributed equally.)

Join the fastText community

See the CONTRIBUTING file for information about how to help out.

License

fastText is MIT-licensed.

Contributors

adl1995, ajoulin, alexbeletsky, alexmarchant, bcampbell, brettkoonce, celebio, cfculhane, cpuhrsch, davidalbertonogueira, dmitryvinn, dmm-fb, edizel, edouardgrave, emilstenstrom, facebook-github-bot, gojomo, jaytaylor, joelmarcey, kahne, kenhys, ma2bd, niublibing, ola13, ot, piotr-bojanowski, pyk, stanislavglebik, tomtung, zphang


fasttext's Issues

Question: unit testing framework

I'd like to understand the library a little better and thought to create some tests that can help to mitigate regression issues in the future.

The last time I dealt with C++ was quite some years ago. What is your preferred unit testing framework? I found Catch and it looks good.

What are your thoughts?

Low performance of trained classifier when the input labels are clustered

When training a classifier, if the input contains labels that are clustered together, the precision and recall of the trained model are very low, e.g.:

some text __label__a
other text __label__a
yet another line __label__b
the forth line __label__b
the fifth line __label__c
last line __label__c

If the input lines are shuffled, the precision and recall of the trained model are higher.
Is this expected behavior?

Assertion failed on ./fasttext predict

predict command failed!

./fasttext predict model.bin test.txt

Assertion failed: (counts.size() == osz_), function setTargetCounts, file src/model.cc, line 188.
Abort trap: 6

model train command was:

./fasttext supervised -input train.txt -output model -wordNgrams 4 -bucket 1000000 -thread 16

Read 4223M words
Number of words:  16577869
Number of labels: 25
Progress: 100.0%  words/sec/thread: 375706  lr: 0.000000  loss: 0.169518  eta: 0h0m 

Discard word on predict stage?

When using the model for predicting a label, the prediction sometimes returns n/a, which I traced back to the dictionary.cc file, method Dictionary::getLine:

if (type == entry_type::word && !discard(wid, uniform(rng))) {

that seems to be randomly discarding words from the input line, based on a uniform distribution and a probability that comes from the model training.

Is it correct to have this behavior in the predict phase? Although I see the point of having this for the training phase (to avoid overfitting the input data), it seems that this would lead to non-deterministic behavior when using a trained model afterwards.

Skipgram: Hang while processing small dataset

I'm trying to reproduce issue reported here https://github.com/salestock/fastText.py/issues/53 by @prakhar21

It turns out that fasttext(1) also hangs:

% cat data.txt 
simple
statement
% ./fasttext skipgram -input data.txt -output model
Read 0M words
Number of words:  0
Number of labels: 0

I'm using the fastText version:

% git l | head -n 1
* 27d90b1 - (22 hours ago) Change defaults for supervised setting - Edouard Grave (HEAD, origin/master, origin/HEAD, master)

Classification for small datasets

Hello,
First of all thank you for this awesome contribution to the scientific world.

I performed some tests with binary classification on a corpus of 5,000 samples and the result was not good (0.65). Even with a BoW Naive Bayes classifier I could get higher scores.

I tried to play with some parameters like epoch and minCount, and it improved the results only very slightly.

In the fastText Hacker News thread, a developer seems to be aware of this issue:

Thanks for pointing this out. We design this library on large datasets and some static variables may not be well tuned for smaller ones. For example the learning rate is only updated every 10k words. We are fixing that now, could you please send us on which dataset you were testing? We would like to see if we have solved this.

Is there something inherent to the algorithm that makes it perform well only with big datasets? Are there any variables we can tune for this use case?

Wrong vector size if word is float number

After training word vectors, I checked the vector size for the word [-0.27998] and got a vector of size 99 instead of 100. Do we need to do preprocessing to filter out these number-like words?

-0.27998 [ 0.70508 -0.43374 0.2001 0.077085 -0.056991 -0.27614 0.21743
-0.25468 0.45824 0.20939 0.085696 1.0295 -0.32094 0.23692
-0.14012 0.77944 0.38626 -0.14074 0.070259 0.0063866 -0.21889
-1.0098 0.41252 -0.64827 -0.30834 0.62971 0.39256 -0.14508
0.44633 0.094293 0.33191 -0.011547 0.22663 -0.15396 -0.078965
0.23796 0.34684 -0.2233 0.31443 1.084 -0.17832 -0.11039
0.24845 0.14074 0.038921 0.049496 -0.5893 -0.19393 -0.53372
-0.010605 -0.15091 -0.35736 0.63122 0.096636 -0.14316 -0.46914
1.1883 0.68973 0.45303 0.29499 -0.16392 -0.52919 -0.72379
0.54775 -0.050434 0.44091 0.99511 -0.20806 -0.36489 -0.35815
0.019567 0.39701 -0.55698 -1.4855 0.80146 0.51527 -0.69782
-0.037633 -0.039004 0.16793 -0.3204 -0.35665 0.069757 0.43055
0.63142 0.39043 0.72315 -0.55542 -0.44498 1.0321 -0.26342
0.25087 -0.59114 -0.13786 0.12934 -1.226 -0.17471 -0.82121
0.10297 ]

Precision and recall metrics

Hi,

I've heard that it is now possible to get precision and recall values for the labels of our classifier, but I just can't find the option (maybe I'm dumb). Can you run me through that process?

Thanks

fastText fails on bash on windows

Ran ./word-vector-example.sh on Bash on Windows 10 and received the following:

Read 124M words
Progress: 100.0%  words/sec/thread: 100377  lr: 0.000001  loss: 1.667808  eta: 0h0m
Train time: 311.000000 sec
OMP: Error #100: Fatal system error detected.
OMP: System error #22: Invalid argument
Aborted (core dumped)

batch training

Is it possible to train the model on one set of data and then use different data for additional training, without starting from scratch?

words hypersphere from a seed word

After training with a large corpus, given a word w, would it be possible with fastText to get a list L of all words contained in a hypersphere of radius R around w?

something like:

L = hypershere(w,R)

Representing P most frequent words without n-grams

Quoting from the last paragraph of section 3 in the word representation paper -

To improve the efficiency of our model, we do not use n-grams to represent the P most frequent words in the vocabulary. There is a trade-off in the choice of P, as smaller values imply higher computational cost but better performance.

I couldn't find any further references to it in the implementation details or the released code. I was looking to use this to directly imitate the original skipgram model. Was this dropped from the final implementation?

The best alternative (hacky) way to do this seems to be modifying the computeNgrams method in Dictionary. Would simply making it do nothing be a correct approach?

Sorry for raising this as an issue, wasn't sure where else to ask this.

missing words?

Hi and thanks so much for making this code public!

When I train fastText unsupervised on some chat data, the text file with the vectors only contains about half of the words in my corpus. If I train it with labels, I get a vector for all the words in the corpus. So somehow, it seems, fastText discards half of the words from my corpus in the unsupervised setting. Am I missing something here? Sorry if this is really trivial.

do we support utf-8?

Hello, thank you for your contribution.

I performed some tests with classification on Chinese text, and an error occurred as follows:

  • terminate called after throwing an instance of 'std::bad_array_new_length'
  • what(): std::bad_array_new_length
  • Aborted (core dumped)

After a lot of debugging, I located the error in the save and load functions in dictionary.cc.
In Dictionary::save, 0 is written as the terminator of word.data, as in the following code:

  • ofs.write(e.word.data(), e.word.size() * sizeof(char));
  • ofs.put(0);

It's OK when the text is written in English, but it may be incompatible with other languages.
Will the software support utf-8 in future versions?

How is one line per document parsed for skipgram

I have a file with one sentence per line. There is no direct relation between two lines.

e.g.

Sports line talks about sports and activities related to sports like stamina.
Politics line talks about politics and related concepts like election.

So if the context is say 5 words, will it consider context across sentences? Will "Politics" be considered a context of "stamina"?

Building word-representations independently from classification model

We often have a large unsupervised corpus available to generate word-representations but only a smaller subset of labeled data for supervised training. An option to specify these corpora independently or accomplish this in two distinct steps would allow better usage of all available training data.

Multi-threading on multiple epochs

Line 253 in fasttext.cc receives one line from the stream at a time:

localTokenCount += dict_->getLine(ifs, line, labels, model.rng);

The different parts of the ifs are correctly distributed among the threads in the beginning (line 238 in trainThread). For one epoch I think this is OK, but when we do multiple epochs, the stream is reset to the beginning in line 258 of dictionary.cc when it reaches the end. If my reasoning is correct, this will lead to all the threads reading the same data (the beginning of the file) in the second epoch, instead of each thread reading a different part of the file.

How to use pre-trained word representations for classification?

The docs are very simple and I'm wondering if we can use pre-trained word representations for classification. The reason I need to do so is that I have a large unlabeled dataset but a small labeled dataset. I want to train word vectors using the large unlabeled dataset and train the classification model with the small labeled dataset.

Windows 7 MinGW-W64 build produces a corrupted model file

fastText built on Windows 7 with MinGW-W64 gcc version 4.9.2 (x86_64-posix-seh-rev1) creates a model file that is not loaded correctly in test or predict modes.
This can be fixed by explicitly specifying std::ifstream ifs(filename, std::ifstream::binary); in FastText::loadModel and std::ofstream ofs(args_->output + ".bin", std::ofstream::binary); in FastText::saveModel in fasttext.cc.

word-vector-example fails with std::bad_alloc

Archive:  data/rw.zip
   creating: data/rw/
  inflating: data/rw/README.txt
  inflating: data/rw/rw.txt
make: Nothing to be done for `opt'.
Read 124M words
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
./word-vector-example.sh: line 34:  4482 Aborted                 (core dumped) ./fasttext skipgram -input "${DATADIR}"/text9 -output "${RESULTDIR}"/text9 -lr 0.025 -dim 100 -ws 5 -epoch 1 -minCount 5 -neg 5 -loss ns -bucket 2000000 -minn 3 -maxn 6 -thread 4 -t 1e-4 -lrUpdateRate 100
Model file cannot be opened for loading!
Traceback (most recent call last):
  File "eval.py", line 79, in <module>
    .format(dataset, corr[0] * 100, math.ceil(drop / nwords * 100.0)))
ValueError: Unknown format code 'f' for object of type 'unicode'

I pretty much just git cloned, ran make, and then ran the word-vector example.

Make include guards unique

I find that include guards like "ARGS_H" and "MODEL_H" are too short for the safe reuse of your header files (when they belong to an application programming interface).

Closed

Just ignore this; the deadlock was happening because the training text file didn't have any words above the minCount threshold.

file src/dictionary.cc, line 270: Abort trap: 6

Dear author:
Thank you for your contribution. I downloaded and tested on the ag_news data. My commands are below:

  1. ./fasttext supervised -input ./data/ag_news.train -output model -dim 10 -lr 0.1 -wordNgrams 2 -minCount 1 -bucket 10000000 -epoch 5 -thread 4
  2. ./fasttext predict model.bin ./data/ag_news.test
    However, I get an error:
    Assertion failed:(lid < nlabels_),function getLabel,file src/dictionary.cc, line 270.
    Abort trap: 6

Could you give me some help? Thank you very much!

"Null" Class & Approximate Tokens

Hi,

2 quick questions that perhaps run the risk of being more general than just specific fastText - but relating to the classification capability:

  1. Given a corpus of short messages (something like tweets), how could one approach a problem where the vast majority of documents are of a "null class" while a minority are of a specific labelled class? It's possible to include "null class" examples in the training set; however, the null class has a much wider array of possible tokens (including many unobserved tokens).

Is there some way to get the pseudo-probability (rather than the k most likely classes) from the predict function? Then if a given document did not reach some threshold, it could be considered as not strongly matching any labelled class.

  2. Does fastText support any concept of approximate rather than exact tokens? The use case here is spelling errors (i.e. we may have observed "AMAZON" but not "AMZON", despite them presumably having the same meaning).

Thanks,
Alex

lrUpdateRate does not influence the learning rate

Hi guys,
I believe the purpose of lrUpdateRate was to somehow schedule how the lr_ is updated.
Currently it's not used; is it supposed to be that way?

Update happens here
The condition that uses lrUpdateRate doesn't touch the learning rate here

When installing fastText I got this error

c++ -pthread -std=c++0x -O3 -funroll-loops -c src/dictionary.cc
src/dictionary.cc: In member function ‘void Dictionary::threshold(int64_t)’:
src/dictionary.cc:189: error:expected primary-expression before ‘[’ token
src/dictionary.cc:189: error:expected primary-expression before ‘]’ token
src/dictionary.cc:189: error:expected primary-expression before ‘const’
src/dictionary.cc:189: error:expected primary-expression before ‘const’
src/dictionary.cc:193: error:expected primary-expression before ‘[’ token
src/dictionary.cc:193: error:expected primary-expression before ‘]’ token
src/dictionary.cc:193: error:expected primary-expression before ‘const’

Questions regarding the embeddings produced by the `skipgram` and `supervised` options

Hello!

As far as I understand fastText is implementing two research papers [1, 2] and both papers can be used to learn word embeddings:

  • [1] learns the embeddings by predicting the current word from its surrounding character n-grams
  • [2] learns word embeddings that are specifically geared towards a classification task

A few questions:

  1. Given that both systems have an embedding component, I was wondering whether: (i) you tried to perform the classification task on the skip-gram embeddings from [1]; (ii) you could modify the architecture in [2] to work on character n-grams.
  2. In [2] you are averaging word embeddings to obtain the embedding of a sentence. Does averaging make sense for the skip-gram embeddings from [1] as well? More generally, when is it a good idea to average embeddings in order to obtain the embedding of a larger chunk of text? This question might be related to #26.
  3. The help function suggests that the two parts of the code (skipgram and supervised) use the same arguments. Is this right? Do you use character n-grams for supervised (the minn and maxn options) or n-gram words for skipgram (the wordNgram option)?

Thanks!

[1] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

[unsupervised learning] Export ngram as text file?

Right now, we are supposed to use the fasttext executable to generate vectors for OOV words.
According to the source code, it seems simple to generate them in any programming language if we have access to the n-gram vectors.

Would it be possible to get a feature that exports the content of the bin model (including n-grams) as a text file?

Empty vocabulary. Try a smaller -minCount value.

Hello!
I'm having a bit of trouble, because I get this error each time I try to execute this:
./fasttext supervised -input / -output data/train.txt -lr 0.025 -dim 100 -ws 5 -epoch 1 -neg 5 -loss ns -bucket 2000000 -minn 3 -maxn 6 -thread 4 -t 1e-4 -lrUpdateRate 100
I have formatted my data almost exactly as you have in classification-example.sh:
1 wallet Give more sales to old clients
If this is my fault, please let me know where my mistake is.

Compile error with commit fabb04e

Using gcc 4.9.2

PS 09/01/2016 15:30:03> mingw32-make.exe fasttext
...
c++ -pthread -std=c++0x args.o dictionary.o matrix.o vector.o model.o utils.o src/fasttext.cc -o fasttext
In file included from src/fasttext.cc:10:0:
src/fasttext.h:32:5: error: 'clock_t' does not name a type
clock_t start;
^
src/fasttext.cc: In member function 'void FastText::printInfo(real, real)':
src/fasttext.cc:96:27: error: 'start' was not declared in this scope
real t = real(clock() - start) / CLOCKS_PER_SEC;
^
src/fasttext.cc: In member function 'void FastText::train(std::shared_ptr)':
src/fasttext.cc:273:3: error: 'start' was not declared in this scope
start = clock();
^
Makefile:40: recipe for target 'fasttext' failed
mingw32-make: *** [fasttext] Error 1

The problem is solved by including time.h before fasttext.h in fasttext.cc, or by just moving the include to fasttext.h.

eta time for supervised training

When I started training a supervised model:
Progress: 2.7% words/sec/thread: 6565 lr: 0.097276 loss: 0.616644 eta: 18h4m 4m

and after 58 hours, this is what I am seeing:
Progress: 86.2% words/sec/thread: 5512 lr: 0.013768 loss: 0.175465 eta: 3h6m 4m

Am I missing anything here? I thought training would be complete in 18 hours!

tar file unzip error

Hey guys,

I'm getting the following problem while unzipping the downloaded file dbpedia_csv.tar.gz in classification-example.sh:

[vols]temp $ tar -xzvf dbpedia_csv.tar.gz

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Any comments?

How can we get the vector of a paragraph?

I have tried doc2vec (from gensim, based on word2vec), with which I can extract a fixed-length vector for variable-length paragraphs. Can I do the same with fastText?

Thank you!

make error

When I run "make", some errors happen, like this:

c++ -pthread -std=c++0x -O3 -funroll-loops -c src/dictionary.cc
src/dictionary.cc:113: warning: this decimal constant is unsigned only in ISO C90
src/dictionary.cc: In member function 'void Dictionary::threshold(int64_t)':
src/dictionary.cc:195: error: expected primary-expression before '[' token
src/dictionary.cc:195: error: expected primary-expression before ']' token
src/dictionary.cc:195: error: expected primary-expression before 'const'
src/dictionary.cc:195: error: expected primary-expression before 'const'
src/dictionary.cc:199: error: expected primary-expression before '[' token
src/dictionary.cc:199: error: expected primary-expression before ']' token
src/dictionary.cc:199: error: expected primary-expression before 'const'
src/dictionary.cc:202: error: 'words_' was not declared in this scope
src/dictionary.cc: In member function 'std::vector<long long int, std::allocator<long long int> > Dictionary::getCounts(entry_type)':
src/dictionary.cc:227: error: expected initializer before ':' token
src/dictionary.cc:230: error: expected primary-expression before 'return'
src/dictionary.cc:230: error: expected ';' before 'return'
src/dictionary.cc:230: error: expected primary-expression before 'return'
src/dictionary.cc:230: error: expected ';' before 'return'
src/dictionary.cc: In member function 'int32_t Dictionary::getLine(std::istream&, std::vector<int, std::allocator<int> >&, std::vector<int, std::allocator<int> >&, std::minstd_rand&)':
src/dictionary.cc:257: error: 'token' was not declared in this scope
src/dictionary.cc:263: error: 'uniform' was not declared in this scope

Upper Bound for Vocabulary Size?

Hi,
Is there an upper bound on the vocabulary size? I have a dataset that has 3 million tokens, but word vectors are only produced for 90,791 of them.
Thanks!

Wrong sample count when using the test API with classifier.bin

I have a sanity test file which contains 7 samples/lines, but the result of calling the test API only prints out 5 samples. When I use the predict API instead, the classifier gives 7 results.

Calling test api

LAAMPs-MacBook-Pro:fastText laam$ ./fasttext test ./result/score10en.bin data/sanity.test3
P@1: 1
Number of examples: 5

Calling predict api

LAAMPs-MacBook-Pro:fastText laam$ ./fasttext predict ./result/score10en.bin data/sanity.test3
__label__fin_irrel
__label__fin_rel
__label__fin_rel
__label__fin_irrel
__label__fin_irrel
__label__fin_rel
__label__fin_irrel
