stanfordnlp / glove Goto Github PK

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings

License: Apache License 2.0

Makefile 1.91% C 70.68% Shell 8.23% MATLAB 9.32% Python 9.86%

glove's Introduction

GloVe: Global Vectors for Word Representation

nearest neighbors of frog	Litoria	Leptodactylidae	Rana	Eleutherodactylus
Pictures

Comparisons	man -> woman	city -> zip	comparative -> superlative
GloVe Geometry

We provide an implementation of the GloVe model for learning word representations, and describe how to download web-dataset vectors or train your own. See the project page or the paper for more information on glove vectors.

Download pre-trained word vectors

The links below contain word vectors obtained from the respective corpora. If you want word vectors trained on massive web datasets, you need only download one of these text files! Pre-trained word vectors are made available under the Public Domain Dedication and License.

Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download): glove.42B.300d.zip [mirror]
Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip [mirror]
Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 300d vectors, 822 MB download): glove.6B.zip [mirror]
Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 200d vectors, 1.42 GB download): glove.twitter.27B.zip [mirror]

Train word vectors on a new corpus

If the web datasets above don't match the semantics of your end use case, you can train word vectors on your own corpus.

$ git clone https://github.com/stanfordnlp/glove
$ cd glove && make
$ ./demo.sh

Make sure you have the following prerequisites installed when running the steps above:

GNU Make
GCC (Clang pretending to be GCC is fine)
Python and NumPy

The demo.sh script downloads a small corpus, consisting of the first 100M characters of Wikipedia. It collects unigram counts, constructs and shuffles cooccurrence data, and trains a simple version of the GloVe model. It also runs a word analogy evaluation script in python to verify word vector quality. More details about training on your own corpus can be found by reading demo.sh or the src/README.md

License

All work contained in this package is licensed under the Apache License, Version 2.0. See the include LICENSE file.

glove's People

Contributors

Stargazers

Watchers

Forkers

eyeccc codeaudit sergeydidenko huazhou2 wenxiao anukat2015 ferhtaydn fxcebx edmondztt ahsannabi tkngoutham pombredanne edgenetworks caomw willitpass lvnteren smartinsightsfromdata wdw110 simudream sandy4321 lu839684437 o0neup ml-ai-nlp-ir milesqli to-shimo gaebor ziweimeng poyuwu janchorowski rygbee wen-li zephyrzilla errord stevenlol ashokpant moscow25 rtvt123 orangelpai shubashree kevinningthu junteudjio tglarner hanwei2008 hfxunlp esparza83 zakattacktwitter korterling kg5 wertyman hrishikeshvganu chagge sumansai14 zhmz90 llwire lijian8 snachx kevinwenya ritali davidnemeskey alexhuang888 mokoron netairline rjbashar justin2061 sk388938 joshzyj benjaminhess karcylee luzc08 wabyking tianlongwang lewisget fgaim jinyu0310 imclab sam676 tbarnier liuyang1123 badlogicmanpreet iamgr007 zbxzc35 wesamalnabki ovunck skycache vvw superposition anupama-gupta linxihui appcoreopc zhangweijiqn sashank06 ai2010 jgabriellima everjesse eden57 agile-innovations tanthml kvillaville thomas4g cloudadic

glove's Issues

tokenization not separating punctuation from words

I'm looking at the vocab.txt that is generated from running over the latest itwiki dump with xml tags stripped by a WikiExtractor modified to not insert doc tags. It doesn't seem to correctly separate out punctuation. I'm getting 10 different variations of the word "zia" (aunt). I must be doing something wrong, was I expected to pre-process the corpus?

:~/Downloads/GloVe-1.2$ grep ^zia vocab.txt
zia 4021
zia, 683
zia. 274
zia" 30
zia: 13
zia; 12
zia) 10
zia", 8
zia), 5
zia). 5

Input file format

What is the input file format? I know it should be whitespace separated tokens, but how you define the document boundaries?

Segmentation fault in shuffle

After making the changes I described in the python 2 to 3 issue, I received this seg fault:

[hafields@colossus GloVe-1.2]$ ./demo.sh
mkdir -p build
gcc src/glove.c -o build/glove -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/shuffle.c -o build/shuffle -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/cooccur.c -o build/cooccur -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/vocab_count.c -o build/vocab_count -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
BUILDING VOCABULARY
Processed 17005207 tokens.
Counted 253854 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 71290.

COUNTING COOCCURRENCES
window size: 15
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 71290 words.
Building lookup table...table contains 94990279 elements.
Processed 17005206 tokens.
Writing cooccurrences to disk.........2 files in total.
Merging cooccurrence files: processed 60666466 lines.

SHUFFLING COOCCURRENCES
array size: 255013683
Unable to open file temp_shuffle_0000.bin.
[hafields@colossus GloVe-1.2]$ ./demo.sh
mkdir -p build
gcc src/glove.c -o build/glove -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/shuffle.c -o build/shuffle -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/cooccur.c -o build/cooccur -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/vocab_count.c -o build/vocab_count -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
BUILDING VOCABULARY
Processed 17005207 tokens.
Counted 253854 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 71290.

SHUFFLING COOCCURRENCES
array size: 255013683
Shuffling by chunks: processed 60666466 lines.
Wrote 1 temporary file(s).
Merging temp files: processed 60666466 lines.

TRAINING MODEL
Read 60666466 lines.
Initializing parameters...done.
vector size: 50
vocab size: 71290
x_max: 10.000000
alpha: 0.750000
./demo.sh: line 55: 9417 Segmentation fault (core dumped) $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE

Do you have any idea why this might be the case?

demo.sh fails at shuffling with segmentation fault

The demo.sh fails on a virtual Linux Mint 17.1 (based on Ubuntu 14.04) on VirtualBox (with 4GB ram).

...
Merging cooccurrence files: processed 60666466 lines.

SHUFFLING COOCCURRENCES
array size: 255013683
Shuffling by chunks: processed 0 lines../demo.sh: line 55:  2613 Segmentation fault   
$BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE

Segmentation fault in shuffle

I am getting the following message on my ubuntu 14.04 -
5877 Segmentation fault (core dumped) $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE

ValueError in evaluate.py

bash-4.1$ sh demo.sh
        Binary input file of shuffled cooccurrence data (produced by 'cooccur' and 'shuffle'); default cooccurrence.shuf.bin
    -vocab-file <file>
        File containing vocabulary (truncated unigram counts, produced by 'vocab_count'); default vocab.txt

SHUFFLING COOCCURRENCES
array size: 4080218931
Shuffling by chunks: processed 174462388 lines.
Wrote 1 temporary file(s).
Merging temp files: processed 174462388 lines.

TRAINING MODEL
Read 174462388 lines.
Initializing parameters...done.
vector size: 100
vocab size: 231387
x_max: 10.000000
alpha: 0.750000
iter: 001, cost: 0.138845
iter: 002, cost: 0.093234
iter: 003, cost: 0.082903
iter: 004, cost: 0.077870
iter: 005, cost: 0.074741
iter: 006, cost: 0.072627
iter: 007, cost: 0.070986
iter: 008, cost: 0.069763
iter: 009, cost: 0.068730
iter: 010, cost: 0.067894
iter: 011, cost: 0.067183
iter: 012, cost: 0.066559
iter: 013, cost: 0.066034
iter: 014, cost: 0.065546
iter: 015, cost: 0.065127
Traceback (most recent call last):
  File "eval/python/evaluate.py", line 110, in <module>
    main()
  File "eval/python/evaluate.py", line 33, in main
    evaluate_vectors(W_norm, vocab, ivocab)
  File "eval/python/evaluate.py", line 66, in evaluate_vectors
    ind1, ind2, ind3, ind4 = indices.T
ValueError: need more than 0 values to unpack

In demo.sh I've updated ONLY the following lines:

MEMORY=64.0
VECTOR_SIZE=100
NUM_THREADS=24

Any idea of what may be the problem?

Thanks,
michele.

Asymetric context and model = 1, leads to saving only context instead of word vec

Hi,
I'm a CS M.Sc. student from the Technion, Israel, researching applications of word vectors to programming language. Thanks for your code, it's great.

I think there is a bug, where GloVe saves only context word vectors without the word vectors.
When training with symmetric=0, and model=1, the user expects (and the documentation says) that only left context is used, and finally only the word vectors will be saved (without context word vectors).

However, in cooccur.c, CREC.word1 gets the left word (the context, line 363) and CREC.word2 gets the right word (the target word, line 364). So indeed only the left context is saved in the CREC struct.

However, in glove.c, word1 and word2 fields are interpreted in the opposite way - CREC.word1 is treated as the target word (line 113), and CREC.word2 is treated as the context word (line 114). Finally, since model=1, the code attempts to save only the word vectors without the context (line 233), but the vectors that are actually saved are the original contexts, not target words.

What do you think?
Thanks!

About Document Boundary

When I saw the code of coocurrence.c, I found that it doesn't detect the boundary between documents. So if I train the model with context from both sides, I may use the begginning of the next document. Do you think that is a problem or doesn't matter? One solution to that I think is to set an x_min which means we discard occasional coocurrence of words.

Possible memory leak

https://github.com/stanfordnlp/GloVe/blob/master/src/glove.c#L288

It's looks like memory leak here

\0 inside "big data" causes problem in vocab

When pasting together different data, there will be some garbage. If there are \0 bytes, then this will be included into vocab. Result is that vocab.txt will have one row with single column of data. This will cause the rest of the process to fail.

Maybe add a check for \0 and discard it.

What is the real value returned by cooccur

Thank you for your fast responses.
I want to get the number of times each word cooccurred with other words.
In cooccur, is the "real val" in the struct "cooccur_rec" the count? If yes why is it double and not integer?

expand embedding with a smaller datased

Hello, thanks for the software and the datasets, very interesting!

I'm curious to know whether are there tools (or papers) dealing with the problem of expanding a huge collection of word embeddings like the ones available in this project download page with smaller ones build from a smaller project-specific corpus (like a mailbox or a collection of articles on a specific topic) assuming they have the same dimensionality.

It could be useful to exploit the computational effort to generate such dataset and add ones for a specific jargon or use case.

Dataset Error

It seems there is a integrity error in file Common Crawl -840B tokens(http://nlp.stanford.edu/data/glove.840B.300d.zip) in line number 140649
The word in that line is broken. Could you check this out?

Initialize model with pretrained words vectors

Hi!

I was wondering if it was possible to initialize a GloVe model with one of the pre-trained embeddings before training it on a dataset of my own? I'm not sure to see how it can be done..

Thank you!

Is the Common Crawl data available?

Is either the Common Crawl data or the script to get the data available anywhere?

Thanks.

Strange break

I'm trying to compile a GloVe for Windows(https://github.com/anoidgit/GloVe-win), I have done the job for word2vec(https://github.com/anoidgit/word2vec-win) and it passed all the test. But for GloVe, the vocab_count and cooccur could work correctly, the shuffle can work but not correctly, the GloVe just break after it print "alpha:0.750000", before sizeof and malloc, I have try to print something before the malloc(https://github.com/stanfordnlp/GloVe/blob/master/src/glove.c#L306), but it did not print the test string.
I do not know why and hope you could help me if you can.
Thank you.

Fromat of the cooccur and shuffle output

I want to access the statistics before running the glove for dimension reduction to compare GloVe to other dimensionality reduction methods.

They are being saved in a binary file. How can I read the files and access the co-occurrence values?

Bug in coocurrence calculation?

Hi,

We wanted to see if GloVe's code could be reused for an idea we had at work, and so I started looking at cooccur.c. C code can be pretty low-level, so I tried sanity checking my intuition of what is and should be going on with some fprintf calls and some small examples.

I think that the first w1/w2 pair is being double counted. The problem seems to be within the merge_files call, specifically in merge_write. I'm still thinking through the "priority queue" part of the algorithm, but when I removed one line (https://github.com/stanfordnlp/GloVe/blob/master/src/cooccur.c#L217) I got the counts I was expecting. The example I was using was super small (vocabulary of size 2) and symmetric ("a a b b"), so it didn't make sense to get assymetric results.

Does it make sense that removing this line might fix a defect? My gut feeling is that people just haven't noticed because it seems to only affect on line's counts, and typical corpa have tons of word pairs.

I'm happy to provide more information if it helps.

How to use our own tokenizer?

How to use our own tokenizer? I did not find any parameters that support using our own tokenizer.
Like twitter, it might need different tokenizer. Should I pass tokenized corpus?
Thank you.

NaN after large number of iterations on 200d GloVe training

Thanks for a great codebase!

When training a word-GloVe on a 100M tweets dataset with 30 minimum word-occurrence and resulting in a 1.2M effective vocabulary size, the GloVe trains great, albeit slowly. The error decreases at each iteration... and then after ~25 iterations, it suddenly goes NaN. Will follow up a log.

This happens for me every time, after slightly different iteration count. I don't think it is a memory problem. Rather, perhaps it's some kind of division by a very very small number, that becomes NaN. I'm thinking tiny, non-zero gradients, or taking the inverse of that gradient, resulting in a sneaky divide-by-zero.

If my theory had some merit, I would appreciate a code pointer, where I might look for the NaN in the gradient descent?

Again, it runs for hours, on a large 1.2M vocabulary. Great GloVe... then sudden a NaN. If I choose a much higher word occurrence cutoff or a smaller dimension like 50d, I can not reproduce the error. But with a 1.2M vocabulary and 15-word occurrence window... after running long enough, it happens every time.

Thanks! Love the GloVe.

The Common Crawl models: What data sets are they based on?

Hello!

I am writing my master's thesis on constructing corpora from the web. In it, I use the Common Crawl's WARC files to produce my corpus by extracting text and cleaning it myself.

The GloVe models trained on Common Crawl data is therefore a great point of comparison for me, but I can't seem to find what data it is based on: Is it the already extracted text provided by the Common Crawl (the WET files) or have you extracted text yourself from the WARC files?

Regards,
Kjetil Bugge Kristoffersen
MSc student,
Department of Informatics,
University of Oslo

Gigaword5 license

According to https://catalog.ldc.upenn.edu/LDC2011T07, the Gigaword5 corpus is under a non-commercial license.

Does this 'pollute' the legal status of glove vectors derived from Gigaword5?

a bug in the update loop in glove.c?

perhaps a bug or clever regularization in lines 133 and 134? Are l1 and l2 supposed to be switched?

        /* Adaptive gradient updates */
        fdiff *= eta; // for ease in calculating gradient
        real W_updates1_sum = 0;
        real W_updates2_sum = 0;
        for (b = 0; b < vector_size; b++) {
            // learning rate times gradient for word vectors
            temp1 = fdiff * W[b + l2];
            temp2 = fdiff * W[b + l1];
            // adaptive updates
            W_updates1[b] = temp1 / sqrt(gradsq[b + l1]);
            W_updates2[b] = temp2 / sqrt(gradsq[b + l2]);
            W_updates1_sum += W_updates1[b];
            W_updates2_sum += W_updates2[b];
            gradsq[b + l1] += temp1 * temp1;
            gradsq[b + l2] += temp2 * temp2;
        }

Why is binary the default format?

According to the usage page of the glove program,

-binary <int>
        Save output in binary format (0: text, 1: binary, 2: both); default 0

If 0 is text, and the default is 0, doesn't it mean that the default output should be text? However, I ran the program without specifying anything for binary, and ended up with binary output.

`cooccur` misses out the last token

Small bug in the cooccur package: it misses out the last token.

$ build/vocab_count -min-count 5 -verbose 2 < text8 > vocab.txt
BUILDING VOCABULARY
Processed 17005207 tokens.
Counted 253854 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 71290.

$ build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 -window-size 15 < text8 > cooccurrence.bin
COUNTING COOCCURRENCES
window size: 15
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 71290 words.
Building lookup table...table contains 94990279 elements.
Processed 17005206 tokens.
Writing cooccurrences to disk.........2 files in total.
Merging cooccurrence files: processed 60666466 lines.

vocab_count counts 17005207 tokens, cooccur counts 17005206. The last b token in text8 is ignored.

Is it possible that I want to incrementally train my corpus based on your pre-trained models?

Hello, dear developers,

I was wondering that if glove supports incremental training. I have a small twitter corpus consisting of only 16K sentences, and thus I would like to train word-vectors on it based on your pre-trained twitter word vectors, since there are still some new terms yours didn't cover. Is that possible?

Thank you very much.
Thank you for your great contributions to NLP.

Readme guidelines vs code for preparing corpus

The readme file says to concatenate documents with a single space or dummy words, yet I see there is a code branch in glove.c for processing newlines. After trying both dummy words and newlines I've got almost the same results.

Also, for dummy words, shouldn't we use as many as the window size?

Is vocab ID in cooccurrence file the order of words in vocab.txt?

In co-occurrence file, each line is in the format of (word1_ID word2_ID count). What are these IDs? Are they the order of words in vocab count file (vocab.txt)?

vocab.txt and vector.txt

I am having issue understanding where to get vocab.txt and vector.txt files. I am relatively new to this, please help. Thanks.

vectors in W being updated (python)

Are the vectors in W supposed to be updated depending on inputs passed to the distance function?

For example, if I pass in "car", the cosine similarity with "stroller" is 0.1729. If I then pass in "car stroller", and then just "car" again, then the cosine similarity with "stroller" is now 0.765.

Running the python code through a debugger, it looks like the vectors for "car" and "stroller" in W are updated during the distance function call with the input containing multiple words. Is this supposed to be happening?

from eval/python/distance.py:

    for idx, term in enumerate(input_term.split(' ')):
        if term in vocab:
            print('Word: %s  Position in vocabulary: %i' % (term, vocab[term]))
            if idx == 0:
                vec_result = W[vocab[term], :]
            else:
                vec_result += W[vocab[term], :]

when you initialize the first vec_result, and then add to it in the else: branch, you're updating the first vector in the W ndarray itself since vec_result is a reference (not a copy).

How do I install GloVe in python?

Allocating 8 extra bytes per word?

In glove.c, you increment vector_size at line 67 to account for the bias value. Additionally, at line 70 you allocate space for (vector_size + 1). Is this extra 8-byte allocation redundant?

If so, it would increase the amount of memory allocated for word vector, context word vector and gradsq vectors by an unnecessary ~60MB for a corpus of 2M unique words.

Padding in Creating Word Embedding for Twitter

Hi,

You have mentioned here that padding would be useful for short documents. Have you done this for your twitter word vectors? If yes, how did you do the padding? Can you share the function you designed for this purpose? @Russell91, I would be happy if you provide an answer soon.

Regards,
Nader

Special Tokens in glove.twitter.27B

Hi,

Can you explain meaning of special tokens in glove.twitter.27B?

for example: repeat, allcaps, elong, ...

Double free or corruption while running glove code

Error message is occuring after shuffle file. During iteration is showing nan values
iter: 001, cost: 0.186454 iter: 002, cost: 0.152860 iter: 003, cost: 0.131222 iter: 004, cost: 0.116344 iter: 005, cost: 0.106716 iter: 006, cost: -nan iter: 007, cost: -nan iter: 008, cost: -nan
after that during last glove code it is giving mentioned error

Error in build/glove: double free or corruption (!prev): 0x00000000006244e0 ***

Segfault when running build/glove

$ time build/glove -save-file gigamega.cased.tokenized.glove.600d -threads 32 -input-file /mnt/cooccurrence.shuf.bin -x-max 40 -iter 10 -vector-size 600 -vocab-file /mnt/vocab.txt -verbose 3
TRAINING MODEL
Read 1325708976 lines.
Initializing parameters...done.
vector size: 600
vocab size: 3448322
x_max: 40.000000
alpha: 0.750000
Segmentation fault

Any clues? Anything I can debug? Size of the corpus is 67GB, all scripts prior to glove worked like a charm.
Box has 128GB of memory and 32 cores

Latest version of GloVe downloaded from http://nlp.stanford.edu/projects/glove/

warnings at compile time

I get the following warnings on ubuntu 14.04 LTS

make
mkdir -p build
gcc src/glove.c -o build/glove -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
src/glove.c: In function ‘glove_thread’:
src/glove.c:85:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
     long long id = (long long) vid;
                    ^
src/glove.c: In function ‘train_glove’:
src/glove.c:256:86: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
         for (a = 0; a < num_threads; a++) pthread_create(&pt[a], NULL, glove_thread, (void *)a);
                                                                                      ^
gcc src/shuffle.c -o build/shuffle -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
src/shuffle.c:30:48: warning: integer overflow in expression [-Woverflow]
 static const long LRAND_MAX = ((long) RAND_MAX + 2) * (long)RAND_MAX;
                                                ^
src/shuffle.c: In function ‘rand_long’:
src/shuffle.c:56:31: warning: integer overflow in expression [-Woverflow]
         rnd = ((long)RAND_MAX + 1) * (long)rand() + (long)rand();
                               ^
gcc src/cooccur.c -o build/cooccur -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/vocab_count.c -o build/vocab_count -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result ```

Skipping updateCaught NaN in diff for kdiff for thread.

What does this mean? and what should I do to avoid this?

Adaptive gradient update bug?

Hi,
I'm a CS M.Sc. student from the Technion, Israel, researching applications of word vectors to programming language. Thanks for your code, it's great.

In glove.c, lines 136-144, are the values of temp1 and temp2 a mistake, or should it be like this?

// learning rate times gradient for word vectors
temp1 = fdiff * W[b + l2];
temp2 = fdiff * W[b + l1];
// adaptive updates
W_updates1[b] = temp1 / sqrt(gradsq[b + l1]);
W_updates2[b] = temp2 / sqrt(gradsq[b + l2]);
W_updates1_sum += W_updates1[b];
W_updates2_sum += W_updates2[b];
gradsq[b + l1] += temp1 * temp1;
gradsq[b + l2] += temp2 * temp2;

temp1 gets the value of W in line l2:
temp1 = fdiff * W[b + l2];
but then gradsq in line l1 is increased with temp1*temp1:
gradsq[b + l1] += temp1 * temp1;

So - I think the values of temp1 and temp2 are swapped. What do you think?
Thanks!

raw_input not longer available in Python 3.X.X

Hello guys,

Please change raw_input() to input(), since it's not longer available in Python 3.X.X according to the documentation.

Link: https://docs.python.org/3/whatsnew/3.0.html

PEP 3111: raw_input() was renamed to input(). That is, the new input() function reads a line from sys.stdin and returns it with the trailing newline stripped. It raises EOFError if the input is terminated prematurely. To get the old behavior of input(), use eval(input()).

Predicted vectors not normalized?

I'm not an expert in linear algebra, but, in evaluate.py shouldn't the predicted vectors pred_vec be normalized in order for the cosine similarity to be between -1 and 1? I was surprised to get values greater than 1 in some cases.

Map in evaluate.py

Traceback (most recent call last):
File "eval/python/evaluate.py", line 110, in
main()
File "eval/python/evaluate.py", line 22, in main
vector_dim = len(vectors[ivocab[0]])
TypeError: object of type 'map' has no len()

Had to put list() around line 16 in evaluate.py in order to fix:
vectors[vals[0]] = list(map(float, vals[1:]))

<unk> in input vocab

Lines 225-226 of glove.c go like this:

// input vocab cannot contain special <unk> keyword
if (strcmp(word, "<unk>") == 0) return 1;

However, later:

if (use_unk_vec) {
  ...
}

I think this is wrong. If the input vocabulary contains the <unk> token, after potentially hours of waiting for glove to finish, the user ends up with a broken text file. The binary vectors will be intact IF the user asked for them, but even then, he might prefer text.

I see two ways of solving this problem:

Allow the user to have <unk> in the input vocabulary -- maybe by providing a switch or just

if (strcmp(word, "<unk>") == 0) use_unk_vec = 0;

If <unk> is not allowed, just warn the user after reading the vocabulary (or even in vocab-count)

Is there any pretrained model that can be used for test?

I have some unknow words(unk) in my data, i want to send the context of unk to the pretrained model to get the prediction of embedding for the unk words. But i can't found pretrained model here, is there any such resources?

why sort when calculating cooccur?

i noticed that there are several qsort in the cooccur.c, while next step is shuffling the cooccur.
i wonder what the sort code is used to.

Skipping updateCaught NaN in diff for kdiff for thread.

Any ideas/suggestions on how to overcome it? Latest GloVe from Github, here is the params:
time build/glove -save-file gigamega.cased.tokenized.glove.600d -threads 32 -input-file /mnt/cooccurrence.shuf.bin -x-max 40 -iter 50 -vector-size 600 -vocab-file /mnt/vocab.txt -verbose 3

Is there any way to retrain model with new sentences?

hi, Is there any way to retrain model with new sentences?

Unable to open file temp_shuffle_0000.bin.

Bellow is my terminal output, do you know what I can do to fix this?

[hafields@bluepig fp]$ cd GloVe-1.2
[hafields@bluepig GloVe-1.2]$ ls
build cooccurrence.bin cooccurrence.shuf.bin demo.sh eval LICENSE Makefile README src text8 vectors.bin vectors.txt vocab.txt
[hafields@bluepig GloVe-1.2]$ ./demo.sh
mkdir -p build
gcc src/glove.c -o build/glove -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/shuffle.c -o build/shuffle -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/cooccur.c -o build/cooccur -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/vocab_count.c -o build/vocab_count -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
BUILDING VOCABULARY
Processed 17005207 tokens.
Counted 253854 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 71290.

SHUFFLING COOCCURRENCES
array size: 255013683
Unable to open file temp_shuffle_0000.bin.

nan output: pthreads or corpus size?

I have a small corpus (165,367 words, 33 unique) from bAbi. After successfully creating the ancillary files, the following command produces nan vectors or Segmentation fault: 11.
build/glove -save-file vectors -input-file cooccurrence.shuf.bin -vector-size 2 -vocab-file vocab.txt -binary 0

From vectors.txt:

down nan nan
put nan nan
<unk> nan nan

the system gets confused on long or multiple vocabulary entries

In the code (cooccur.c and glove.c) there are some build-in tests for repeated entries in the vocabulary.
Also long entries are detected.
This however leads to very strange and questionable results.
As an example I provide a toy vocabulary 'voc' and a toy text 'txt'
voc.zip
txt.zip

The commands I ran were:
./build/cooccur -window-size 3 -vocab-file voc < txt > bin
./build/shuffle -verbose 1 < bin > shuf
./build/glove -vocab-file voc -input-file shuf

The resulting vectors are wrong, imho:
There are 3 different vectors for 'mies' and there are 2 vectors for parts of the overly long 'klaasklaas....' entry

aap 0.146921 -0.055305 -0.028311 -0.039464 0.074458
noot -0.085299 0.147773 0.035997 0.080978 -0.051187
mies -0.052002 0.054730 0.092953 0.116784 0.033464
mies -0.084152 0.107761 -0.010578 -0.007232 -0.115695
wim -0.036644 -0.050298 -0.034700 -0.142460 0.091654
klaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaaskl
aasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaa
sklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaask
laasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaaskla
asklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaas
klaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaaskl
aasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaa
sklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaask
laasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaaskla
asklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaas
klaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaas 0.068427 0.
033890 -0.031678 0.037881 -0.002418
klaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaaskl
aasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaa
sklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaasklaask
laasklaasklaasklaasklaas 0.090010 0.112283 -0.010036 0.126007 -0.006739
zus -0.040267 0.141259 -0.006493 0.052686 0.058043
mies 0.001488 0.174758 -0.065266 -0.005968 -0.034094
<unk> 0.000942 0.074095 -0.006457 0.024357 0.005276

The problems are in cooccur.c, where the vocabulary is read,and the words are hashed.

when hashinsert fails, the id is still incremented. (via ++j)
when reading fails due to an overly long entry, 1000 letters are hashed and we try again with the remainder, splitting the long word in several words.
(using while (fscanf(fid, format, str, &id) != EOF) is an anti-pattern anyway, fscanf doesn't return EOF)

SO: the vocabulary reading needs improvement. e.g. by skipping multiple and overly long entries totally.
BUT THEN:
This leads to a smaller vocabulary size then the files size. In glove.c the vocabulary size is first counted as the size of the file. Which would be wrong then, AND the vectors are written using words stemming from the same wrong reading loop as in cooccur.c. Otherwise things might get out-of-sync

So this needs rework. glove.c should use exact the same reading logic as cooccur.c

Storing the true vocabulary size in the .bin file might be a good idea?