amr1tv1rd1 / word2vec
Automatically exported from code.google.com/p/word2vec
License: Apache License 2.0
Just released a Ruby module that builds an index of a binary word2vec vector
file, so your code can seek directly to the right position in the file for a
given word or term. For example, the word "/en/italy" in the English
"freebase-vectors-skipgram1000-en.bin" file is at byte position 116414.
The module also computes a locality-sensitive hash for each vector in a binary
word2vec file, so you can do a nearest-neighbor search (i.e. by cosine
distance) much faster. I get a couple of orders of magnitude better performance
on my machine with a 10-bit random projection LSH.
https://github.com/someben/treebank/blob/master/src/build_word2vec_index.rb
Thanks for the project, Tomas.
Best,
Ben
Original issue reported on code.google.com by [email protected]
on 23 Sep 2013 at 3:48
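An aside for readers: a random-projection LSH like the one Ben describes can be
sketched in a few lines of C. Everything below (the function name, drawing the
hyperplanes from a Gaussian) is illustrative and not taken from his Ruby module:

#define BITS 10  /* a 10-bit hash, as in the report above */

/* Hash a vector of 'size' floats to a BITS-bit bucket id. 'planes' holds
   BITS random hyperplanes of 'size' floats each, drawn once at startup,
   e.g. from a Gaussian. Each bit records which side of one hyperplane the
   vector falls on, so vectors at small cosine distance tend to share a
   bucket, and only that bucket needs an exact nearest-neighbor scan. */
unsigned int lsh_hash(const float *vec, const float *planes, long long size) {
  unsigned int h = 0;
  int b;
  long long a;
  double dot;
  for (b = 0; b < BITS; b++) {
    dot = 0;
    for (a = 0; a < size; a++) dot += vec[a] * planes[b * size + a];
    if (dot >= 0) h |= 1u << b;
  }
  return h;
}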
What steps will reproduce the problem?
1. Download attached text_simple train file
2. Compile word2vec.c as: gcc word2vec.c -o word2vec -lm -pthread
3. Run: ./word2vec -train text_simple -save-vocab vocab.txt
What is the expected output? What do you see instead?
Expect in saved vocab.txt file:
===============
</s> 0
and 12
the 11
four 10
in 8
used 5
war 5
one 5
nine 9
===============
What is really seen in the file
===============
</s> 0
and 12
the 11
four 10
in 8
used 5
war 5
one 5
===============
The last element, "nine", was missing.
What version of the product are you using? On what operating system?
MacOS, gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build
2336.11.00)
Please provide any additional information below.
This is not really a bug report, because I am not sure I understand the format
of train_file and how the vocab is constructed from it.
Based on the source code of word2vec.c, when reading from train_file it will:
1. insert </s> as the first element in vocab
2. scan each word (or </s> for a newline) in train_file, add it to vocab, and
hash it in vocab_hash
So far vocab_size = the number of words in vocab, INCLUDING the leading </s>
3. sort the words in vocab by their counts, but keep </s> as the first element
of vocab
Now vocab_size becomes the number of words in vocab, EXCLUDING the leading
</s>. And if there is no newline character in train_file, </s> won't even be
hashed in vocab_hash.
So there is an inconsistency between vocab_size and the actual size of vocab
(including </s>). It could be a bug, because the vocab is later always iterated
from 0 to vocab_size-1, as in SaveVocab(). As a result, the leading </s> is
saved but the last element in vocab is dropped. At least that's what it looks
like with the simple train file "text_simple" attached here.
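For reference, the pattern I mean looks roughly like this (a paraphrase of
SortVocab() in word2vec.c, not verbatim; details may differ between revisions):

qsort(&vocab[1], vocab_size - 1, sizeof(struct vocab_word), VocabCompare); /* </s> stays at index 0 */
size = vocab_size;
for (a = 0; a < size; a++) {
  /* With no newline in train_file, vocab[0].cn == 0, so this branch also
     fires for </s>: vocab_size shrinks by one and the memory of what is
     now the LAST sorted word is freed... */
  if (vocab[a].cn < min_count) {
    vocab_size--;
    free(vocab[vocab_size].word);
  } else {
    /* ...while that last word is still re-hashed here when a == size - 1 */
    hash = GetWordHash(vocab[a].word);
    while (vocab_hash[hash] != -1) hash = (hash + 1) % vocab_hash_size;
    vocab_hash[hash] = a;
  }
}
/* Any later loop over 0..vocab_size-1, as in SaveVocab(), misses the last word. */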
Original issue reported on code.google.com by [email protected]
on 25 Aug 2013 at 2:38
Attachments:
Sorry to put a plain old question here, but... is there a correct place to put
a plain old question?
In other words, is there a way to engage in a dialog about this project? I have
several points I'd like to discuss.
Original issue reported on code.google.com by [email protected]
on 19 Aug 2013 at 2:16
Hi, I am new to word2vec. I am preparing a corpus of sentences from a Wikipedia
dump. However, the dump is pre-split into paragraphs, which it seems need to be
further processed into sentences.
My question is: is it possible to train directly on paragraphs instead of
sentences? Or must word2vec (the skip-gram model) work with sentences?
Since the algorithm trains on the data through a context window, I don't see
much difference in adding the extra window across sentence boundaries within
the same paragraph.
Original issue reported on code.google.com by [email protected]
on 24 Feb 2015 at 9:35
Some of the global variables are updated in the function TrainModelThread
without the use of locking. Here is one of them:
word_count_actual += word_count - last_word_count;
Can someone please explain how this works?
Also, can someone please help me with the following question: for negative
sampling, why do we need a unigram table to choose the negative samples from?
Why can't we just choose random words from the vocabulary?
Sincerely,
Vishal
Original issue reported on code.google.com by [email protected]
on 21 Jul 2015 at 6:26
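An aside on the two questions above: the unlocked update to word_count_actual
appears to be deliberate, Hogwild-style lock-free training; the counter only
drives the learning-rate decay and the progress printout, so an occasionally
lost update is harmless. The unigram table exists so that negative samples
follow the smoothed unigram distribution, proportional to count^0.75, instead
of the uniform distribution you would get by picking random vocabulary words;
uniform sampling would draw rare words far more often than they occur as real
contexts. The table makes the weighted draw O(1): each word owns a stretch of
slots proportional to its weight. A paraphrase of InitUnigramTable() from
word2vec.c (vocab, vocab_size as in that file):

const int table_size = 1e8;
int a, i = 0;
double train_words_pow = 0, d1, power = 0.75;
int *table = (int *)malloc(table_size * sizeof(int));
/* total weight: sum over the vocab of count^0.75 */
for (a = 0; a < vocab_size; a++) train_words_pow += pow(vocab[a].cn, power);
d1 = pow(vocab[0].cn, power) / train_words_pow;  /* cumulative share of word 0 */
for (a = 0; a < table_size; a++) {
  table[a] = i;                          /* word i owns this slot */
  if (a / (double)table_size > d1) {     /* word i's share is used up */
    i++;
    d1 += pow(vocab[i].cn, power) / train_words_pow;
  }
  if (i >= vocab_size) i = vocab_size - 1;
}
/* drawing a negative sample: table[(next_random >> 16) % table_size] */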
When I run demo-phrases.sh on Linux I get the following error message:
./demo-phrases.sh: line 6: 5492 Segmentation fault ./word2phrase -train
text8 -output text8-phrase -threshold 500 -debug 2
Original issue reported on code.google.com by [email protected]
on 25 Aug 2013 at 11:12
word2vec does not free allocated objects correctly. It also reads freed objects.
Attached patch fixes this issue.
Original issue reported on code.google.com by [email protected]
on 17 Aug 2013 at 6:23
Attachments:
What steps will reproduce the problem?
1. Download the code
2. Try to compile with "make"
What is the expected output? What do you see instead?
Error:
gcc word2vec.c -o word2vec -O2 -Wall -funroll-loops
/usr/bin/as: /usr/bin/as: cannot execute binary file
Makefile:6: recipe for target 'word2vec' failed
make: *** [word2vec] Error 1
What version of the product are you using? On what operating system?
word2vec won't compile.
I have been using it with Cygwin on Windows for 3 months and now it won't
compile anymore.
I've tried reinstalling all the Cygwin components that I need and been through
the trace logs of the make command, but nothing works...
Do you have any idea why it is not working?
Original issue reported on code.google.com by [email protected]
on 30 Jul 2015 at 11:29
Fixed a couple of bugs:
1. name mismatch with the UMBC-webbase corpus
2. Downloading the phrases dataset
Original issue reported on code.google.com by [email protected]
on 15 Sep 2014 at 6:42
Attachments:
Typo
Original issue reported on code.google.com by [email protected]
on 19 May 2015 at 4:05
Attachments:
I'm trying to use word2vec to turn a bunch of text into vectors.
Ultimately, I want to go from text -> vectors -> classification, using a
classification algorithm like a neural network. Nothing I have read about or
tested with word2vec, however, lets me output a vectorized version of my input
text data; rather, I can only do word similarity or some other feature like
that.
Am I completely off base about the potential usage of word2vec, or is this
actually possible?
Original issue reported on code.google.com by [email protected]
on 1 Jun 2015 at 2:37
Run ./word2vec without arguments. You can see the typo in the last line of the
help output:
Use the continuous back of words model; default is 0 (skip-gram model)
("back of words" should read "bag of words".)
Original issue reported on code.google.com by [email protected]
on 24 Sep 2013 at 11:30
The strcmp should be changed to strncmp. This may be causing segmentation
faults in some cases when the input contains words that are too long.
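For example, the vocabulary lookup in distance.c compares the query against
fixed-width max_w-byte slots; if that is the pattern meant here (the report
does not name the file, so this is an assumption), the bounded version would
be:

/* before: strcmp keeps reading until a NUL terminator, which can lie
   beyond a slot that was filled to capacity by an over-long word */
for (b = 0; b < words; b++) if (!strcmp(&vocab[b * max_w], st[a])) break;
/* after: compare at most max_w bytes, the width of one vocab slot */
for (b = 0; b < words; b++) if (!strncmp(&vocab[b * max_w], st[a], max_w)) break;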
Original issue reported on code.google.com by [email protected]
on 17 Jul 2015 at 11:12
I just cannot check out the source code via SVN in Eclipse. When I use
http://word2vec.googlecode.com/svn/trunk/ it returns 文件夹 "" 已不存在, which
means the folder "" no longer exists, or it just stays pending...
Is the word2vec project closed, or is there something wrong on my end?
Original issue reported on code.google.com by [email protected]
on 27 Nov 2014 at 9:40
The perf utility reported a bottleneck in this area: the loop stalls waiting
for the 'target' variable to be finalized. By computing the next value of
target early, overall performance increases by about 3% on a small (128 MB)
training file.
Add next_target next to target in the variable list. Then, in the two negative
sampling blocks:
...
if (d == 0) {
  target = word;
  label = 1;
  next_random = next_random * (unsigned long long)25214903917 + 11;
  next_target = table[(next_random >> 16) % table_size];  /* prefetch next draw */
} else {
  target = next_target;  /* already computed, so no stall on the table load */
  if (target == 0) target = next_random % (vocab_size - 1) + 1;
  next_random = next_random * (unsigned long long)25214903917 + 11;
  next_target = table[(next_random >> 16) % table_size];
  if (target == word) continue;
  label = 0;
}
...
Original issue reported on code.google.com by [email protected]
on 22 Jul 2015 at 6:40
What steps will reproduce the problem?
1. svn checkout http://word2vec.googlecode.com/svn/trunk/
2. cd trunk
3. make
What is the expected output? What do you see instead?
cc1: error: invalid option argument '-Ofast'
make: *** [word2vec] Error 1
What version of the product are you using? On what operating system?
revision 34. Debian Squeeze.
Original issue reported on code.google.com by [email protected]
on 4 Oct 2013 at 11:49
What steps will reproduce the problem?
1. make
What is the expected output? What do you see instead?
clean build
What version of the product are you using? On what operating system?
OSX Yosemite, Apple LLVM version 6.0 (clang-600.0.51) (based on LLVM 3.5svn)
Please provide any additional information below.
Following patch fixes build error (due to malloc.h and warnings due to unused
variables)
Index: compute-accuracy.c
===================================================================
--- compute-accuracy.c (revision 41)
+++ compute-accuracy.c (working copy)
@@ -26,7 +26,7 @@
int main(int argc, char **argv)
{
FILE *f;
- char st1[max_size], st2[max_size], st3[max_size], st4[max_size], bestw[N][max_size], file_name[max_size], ch;
+ char st1[max_size], st2[max_size], st3[max_size], st4[max_size], bestw[N][max_size], file_name[max_size];
float dist, len, bestd[N], vec[max_size];
long long words, size, a, b, c, d, b1, b2, b3, threshold = 0;
float *M;
Index: distance.c
===================================================================
--- distance.c (revision 41)
+++ distance.c (working copy)
@@ -28,7 +28,6 @@
char file_name[max_size], st[100][max_size];
float dist, len, bestd[N], vec[max_size];
long long words, size, a, b, c, d, cn, bi[100];
- char ch;
float *M;
char *vocab;
if (argc < 2) {
Index: makefile
===================================================================
--- makefile (revision 41)
+++ makefile (working copy)
@@ -1,6 +1,6 @@
CC = gcc
#Using -Ofast instead of -O3 might result in faster code, but is supported only by newer GCC versions
-CFLAGS = -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
+CFLAGS = -I. -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result
all: word2vec word2phrase distance word-analogy compute-accuracy
@@ -17,4 +17,4 @@
chmod +x *.sh
clean:
- rm -rf word2vec word2phrase distance word-analogy compute-accuracy
\ No newline at end of file
+ rm -rf word2vec word2phrase distance word-analogy compute-accuracy
Index: word-analogy.c
===================================================================
--- word-analogy.c (revision 41)
+++ word-analogy.c (working copy)
@@ -28,7 +28,6 @@
char file_name[max_size], st[100][max_size];
float dist, len, bestd[N], vec[max_size];
long long words, size, a, b, c, d, cn, bi[100];
- char ch;
float *M;
char *vocab;
if (argc < 2) {
Original issue reported on code.google.com by [email protected]
on 1 Oct 2014 at 5:50
$ ./demo-phrase-accuracy.sh
make: Nothing to be done for `all'.
Starting training using file text8
Words processed: 17000K Vocab size: 4399K
Vocab size (unigrams + bigrams): 2586139
Words in train file: 17005206
Words written: 17000K
real 0m21.130s
user 0m20.062s
sys 0m1.054s
Starting training using file text8-phrase
Vocab size: 123636
Words in train file: 16337523
Alpha: 0.000119 Progress: 99.59% Words/thread/sec: 22.70k
real 1m38.617s
user 12m0.795s
sys 0m1.501s
newspapers:
./demo-phrase-accuracy.sh: line 12: 36538 Segmentation fault: 11
./compute-accuracy vectors-phrase.bin < questions-phrases.txt
I'm on OSX (latest non-beta), and had to swap "#include <malloc.h>" out for
"#include <stdlib.h>" to get it to compile, but made no other changes.
Original issue reported on code.google.com by [email protected]
on 19 Aug 2013 at 7:41
Please give me directions to set up word2vec on the Windows platform.
Original issue reported on code.google.com by [email protected]
on 20 Sep 2013 at 5:42
What steps will reproduce the problem?
Using the python word2vec module in IPython, I loaded the model from
GoogleNews-vectors-negative300.bin using the command:
model = word2vec.load("~/Downloads/GoogleNews-vectors-negative300.bin")
What is the expected output? What do you see instead?
The vocab of the model looks like it is made up of English words that have been
stripped of their first character. As a result, many common words are missing.
Correctly spelled words which are found in the vocab actually represent
collisions created when removing the first character.
For instance:
model.cosine('out')
returns:
{'out': [('outs', 0.8092596703376076),
('eavyweight_bout', 0.65542583911176289),
('ightweight_bout', 0.64856198153561295),
('ndercard_bout', 0.62005739361720136),
('iddleweight_bout', 0.61811559624397572),
('assily_Jirov', 0.61172633394627596),
('atchweight_bout', 0.60739346001729411),
('uper_middleweight_bout', 0.60237084554945242),
('eatherweight_bout', 0.60183827323165029),
("KO'd", 0.60002383627451883)]}
The string 'out' actually represents the English word 'bout', which has been
correctly grouped with other boxing terms. Note that the similar terms are also
missing their first characters.
Another example:
model.cosine('aul')
returns
{'aul': [('ohn', 0.82979825790046702),
('eter', 0.750119256790031),
('ark', 0.71162490811744983),
('ndrew', 0.66359523924163855),
('hris', 0.66228796043431837),
('ichard', 0.66142257169136376),
('hilip', 0.6576444040097873),
('ichael', 0.64312885937086905),
('on', 0.64042190735670823),
('avid', 0.63592487085268301)]}
This group of words is gibberish but represents a cluster of common male names.
The full names like 'john', however, are not present in the vocabulary.
It looks like the vector representation is doing a very good job of capturing
the linguistic structure. However, the absence of the first characters creates
many unfortunate collisions.
Original issue reported on code.google.com by [email protected]
on 15 Dec 2014 at 7:54
This is really nitpicky, but... when populating the vocab array, distance.c
only begins skipping characters after index max_w (i.e. after reading 51
characters), when it should have stopped after index max_w - 1. Consequently,
the string terminator for long strings is written into the space reserved for
the subsequent string, and is overwritten when the next string is read in,
causing the two to be mashed together.
For example, when searching for Cash_Flow in the current (as of 2015-06-15)
GoogleNews-vectors-negative300.bin, two results overflow the printf format
buffer, which is padded for strings up to length 50; indeed these two strings
do not appear in the vocabulary, but are constructed when two vocabulary
entries, a long one followed by a normal one, are mashed together as described
above. After applying the attached patch the printf formatting looks fine, as
only the first 50 characters of the long entries are printed.
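The idea of the fix, sketched (the attached patch is authoritative; this
paraphrases the word-reading loop from a recent distance.c revision):

/* Read one vocabulary word into its fixed-width max_w-byte slot. */
a = 0;
while (1) {
  vocab[b * max_w + a] = fgetc(f);
  if (feof(f) || (vocab[b * max_w + a] == ' ')) break;
  /* stop advancing at index max_w - 1 so the terminator below stays
     inside this word's slot instead of spilling into the next one */
  if ((a < max_w - 1) && (vocab[b * max_w + a] != '\n')) a++;
}
vocab[b * max_w + a] = 0;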
Original issue reported on code.google.com by [email protected]
on 17 Jun 2015 at 12:24
Attachments:
Compiled and ran on Ubuntu 14.04.
The input file is encoded in UTF-8, but the output files (the vocabulary and
the text vectors file) are encoded in ISO-8859-1. All accents are wrong.
Original issue reported on code.google.com by [email protected]
on 10 Sep 2014 at 5:33
Where it says "First billion characters from wikipedia (use the pre-processing
perl script from the bottom of Matt Mahoney's page)" on the homepage:
The link should point to http://cs.fit.edu/~mmahoney/compression/textdata.html,
which is the site that contains the script in question.
Original issue reported on code.google.com by [email protected]
on 4 Oct 2013 at 3:33
replacing:
for (a = 0; a < size; a++) fread(&M[a + b * size], sizeof(float), 1, f);
with:
fread(&M[b * size], sizeof(float), size, f);
Greatly increases loading speed.
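The win presumably comes from making one bulk libc call per word instead of
'size' single-float reads. If you make this change, it is also a natural spot
to check the return value, which the original loop ignores (a sketch):

long long got = (long long)fread(&M[b * size], sizeof(float), size, f);
if (got != size) {
  printf("Unexpected end of file while reading vectors\n");
  return -1;
}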
Original issue reported on code.google.com by [email protected]
on 21 Jul 2015 at 5:01
I need to get all of the word coordinates in an input text.
The word2vec program gives the 40 closest words to a word at the end, but I
want to get the word coordinates from the previous steps. I do not need the
distances between words; I just want the word coordinates. Which step should I
look at, and where and how can I get all the word coordinates?
As you can tell, I am new to the C language.
Thanks for your help.
Original issue reported on code.google.com by [email protected]
on 11 Aug 2014 at 8:13
Hi,
I was able to run demo-words.sh (using text8) and obtained the output binary
file, vectors.bin.
Can someone help me with how to convert it into a readable text/ASCII file? I
don't know the format of the binary file.
tx
Original issue reported on code.google.com by [email protected]
on 26 Aug 2013 at 3:51
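For readers with the same question: the binary format is the one distance.c
parses, namely a text header with the vocabulary size and the vector size,
then, for each word, the word string, one separator byte, and 'size' raw
floats. A minimal standalone converter along those lines (a sketch assuming a
file written with -binary 1; this tool does not ship with the project):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  long long words, size, a, b;
  char word[2000], ch;
  float val;
  FILE *f;
  if (argc < 2) { printf("Usage: ./bin2txt <vectors.bin>\n"); return 0; }
  f = fopen(argv[1], "rb");
  if (f == NULL) { printf("Input file not found\n"); return -1; }
  fscanf(f, "%lld %lld", &words, &size);  /* header: vocab size, vector size */
  printf("%lld %lld\n", words, size);
  for (b = 0; b < words; b++) {
    fscanf(f, "%s%c", word, &ch);         /* word, then its separator byte */
    printf("%s", word);
    for (a = 0; a < size; a++) {          /* 'size' raw floats follow */
      fread(&val, sizeof(float), 1, f);
      printf(" %f", val);
    }
    printf("\n");
  }
  fclose(f);
  return 0;
}

Alternatively, re-running the training with -binary 0 writes the vectors as
text in the first place.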
Original:
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000005 Progress: 100.10% Words/thread/sec: 53.22k
real 51m5.959s
user 78m27.464s
sys 0m17.319s
Optimized:
Starting training using file text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000005 Progress: 100.10% Words/thread/sec: 58.15k
real 45m47.814s
user 71m50.251s
sys 0m13.984s
VM with openSUSE 12.3 AMD64
gcc (SUSE Linux) 4.7.2 20130108 [gcc-4_7-branch revision 195012]
2 cores of an AMD Phenom(tm) II X4 925 processor
Original issue reported on code.google.com by [email protected]
on 30 Nov 2014 at 12:49
Attachments:
What steps will reproduce the problem?
1. make on Mac
What is the expected output? What do you see instead?
distance.c:18:10: fatal error: 'malloc.h' file not found
#include <malloc.h>
^
1 error generated.
make: *** [distance] Error 1
What version of the product are you using? On what operating system?
OSX 10.9.4
Please provide any additional information below.
I fixed it by replacing malloc.h with stdlib.h
Original issue reported on code.google.com by [email protected]
on 17 Jul 2014 at 5:44
What steps will reproduce the problem?
1. Load the freebase .bin file into a word2vec model
2. Attempt the .most_similar function
3. Error returned
What is the expected output? What do you see instead?
See below.
What version of the product are you using? On what operating system?
Mac OSX, Anaconda Python
Please provide any additional information below.
I'm trying to get started by loading the pretrained .bin files from the Google
word2vec site (freebase-vectors-skipgram1000.bin.gz) into the gensim
implementation of word2vec. The model loads fine, using:
model = word2vec.Word2Vec.load_word2vec_format('...../free....-en.bin', binary=True)
and creates a
>>> print model
<gensim.models.word2vec.Word2Vec object at 0x105d87f50>
but when I run the most_similar function, it can't find the words in the
vocabulary. My error output is below.
Any ideas where I'm going wrong?
>>> model.most_similar(['girl', 'father'], ['boy'], topn=3)
2013-10-11 10:22:00,562 : WARNING : word 'girl' not in vocabulary; ignoring it
2013-10-11 10:22:00,562 : WARNING : word 'father' not in vocabulary; ignoring it
2013-10-11 10:22:00,563 : WARNING : word 'boy' not in vocabulary; ignoring it
Traceback (most recent call last):
File "", line 1, in
File "/....../anaconda/python.app/Contents/lib/python2.7/site-packages/gensim-0.8.7/py2.7.egg/gensim/models/word2vec.py", line 312, in most_similar
raise ValueError("cannot compute similarity with no input")
ValueError: cannot compute similarity with no input
any ideas welcome?
Original issue reported on code.google.com by [email protected]
on 11 Oct 2013 at 3:41
What steps will reproduce the problem?
On a Mac:
1. svn checkout http://word2vec.googlecode.com/svn/trunk/
2. make
What is the expected output?
Binary is emitted.
What do you see instead?
pindari:word2vec pmonks$ make
gcc word2vec.c -o word2vec -lm -pthread -Ofast -march=native -Wall
-funroll-loops -Wno-unused-result
cc1: error: invalid option argument '-Ofast'
cc1: error: unrecognized command line option "-Wno-unused-result"
word2vec.c:1: error: bad value (native) for -march= switch
word2vec.c:1: error: bad value (native) for -mtune= switch
make: *** [word2vec] Error 1
pindari:word2vec pmonks$
What version of the product are you using?
SVN r32
On what operating system?
Mac OSX 10.8.4
Original issue reported on code.google.com by [email protected]
on 15 Aug 2013 at 5:45
Dear all,
I want to use your Freebase-naming-based word2vec model. Is it provided
anywhere as plain text?
Our problem is that we want to have a distance vector for each resource from
DBpedia and Freebase, respectively. When preparing and indexing the Wikipedia
model with the provided perl script, the URIs of the pages get lost.
Thank you in advance.
Ricardo
Original issue reported on code.google.com by [email protected]
on 6 Feb 2015 at 1:05
Patch for a bug which caused the last word of the vocab to be discarded after
sorting if there was no newline character in the input file.
If there is no newline in the input file, vocab[0].cn == 0. That entry is
ignored by the sort, but not by the for loop, which decrements vocab_size and
frees the memory of the last word; it still computes the hash for the last
word, however, if its count is greater than min_count. Also, the realloc needs
to allocate only vocab_size * sizeof(struct vocab_word).
Original issue reported on code.google.com by FerroMrkva
on 5 Feb 2014 at 11:24
Attachments:
What steps will reproduce the problem?
1. Train on data with tokens that are numbers: "12345", "321", ...
2. Select cluster output ( -classes option)
What is the expected output?
The expected cluster output is:
<token>, <cluster number>
ex:
quick, 44
357, 45
Instead, what you see is:
quick, 44
357 45,
What version of the product are you using? On what operating system?
0.1b on OSX 10.10.2
Please provide any additional information below.
It is currently 47°F and overcast outside.
Original issue reported on code.google.com by [email protected]
on 8 Feb 2015 at 7:25
At the end of word2vec.c there is a malloc of size EXP_TABLE_SIZE + 1,
but the following for loop is for (i = 0; i < EXP_TABLE_SIZE; i++).
That means expTable[EXP_TABLE_SIZE] is uninitialized.
As a result, we can see different program output on the same input data.
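The fix is to initialize the last slot too, e.g. (paraphrasing the table setup
in word2vec.c, where 'real' is its float typedef):

expTable = (real *)malloc((EXP_TABLE_SIZE + 1) * sizeof(real));
for (i = 0; i <= EXP_TABLE_SIZE; i++) {  /* was: i < EXP_TABLE_SIZE */
  expTable[i] = exp((i / (real)EXP_TABLE_SIZE * 2 - 1) * MAX_EXP); /* e^x over [-MAX_EXP, MAX_EXP] */
  expTable[i] = expTable[i] / (expTable[i] + 1);                   /* sigmoid */
}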
Original issue reported on code.google.com by [email protected]
on 1 Dec 2014 at 7:11
The enclosed patch makes word2vec build on FreeBSD.
Original issue reported on code.google.com by [email protected]
on 1 Oct 2013 at 9:19
Attachments:
If, after making needed corrections, this could be added to the source code, I
think future users would appreciate this. Thanks. --Gregg Williams
--- begin distance.c ---
// Copyright 2013 Google Inc. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
const long long max_size = 2000; // max length of strings
const long long N = 40; // number of closest words that will be shown
const long long max_w = 50; // max length of vocabulary entries
int main(int argc, char **argv) {
FILE *f;
char st1[max_size];
char bestw[N][max_size];
char file_name[max_size], st[100][max_size];
float dist, len, bestd[N], vec[max_size];
long long words, size, a, b, c, d, cn, bi[100];
char ch;
float *M;
char *vocab;
if (argc < 2) {
printf("Usage: ./distance <FILE>\nwhere FILE contains word projections in the BINARY FORMAT\n");
return 0;
}
strcpy(file_name, argv[1]);
f = fopen(file_name, "rb");
if (f == NULL) {
printf("Input file not found\n");
return -1;
}
// words = number of words in file
fscanf(f, "%lld", &words);
// size = number of floating-point values associated with each word in the "dictionary"
fscanf(f, "%lld", &size);
// vocab points to a list of all the words in the "dictionary". Words are stored in fixed-width substrings;
// each word is allotted max_w bytes.
vocab = (char *)malloc((long long)words * max_w * sizeof(char));
// SUMMARY: M contains 'size' (an integer) floats for each word in the "dictionary".
// M points to a vector of (words * size) floats, stored linearly. Floats 0 through (size - 1) correspond
// to word 0, floats size through (2 * size - 1) correspond to word 1, etc.
M = (float *)malloc((long long)words * (long long)size * sizeof(float));
if (M == NULL) {
printf("Cannot allocate memory: %lld MB %lld %lld\n", (long long)words * size * sizeof(float) / 1048576, words, size);
return -1;
}
for (b = 0; b < words; b++) {
// Reads one entry from input file f, which corresponds to one word of the "dictionary" of words contained in f;
// this word is stored in the specific substring of vocab reserved for it.
// The %c consumes the single separator character (a space) that follows each word in the file.
fscanf(f, "%s%c", &vocab[b * max_w], &ch);
// Reads 'size' floats, corresponding to the word with index b, into array M.
for (a = 0; a < size; a++) fread(&M[a + b * size], sizeof(float), 1, f);
// len = sqrt (sum of [each entry in M] ** 2 ) -- a normalizing factor
len = 0;
for (a = 0; a < size; a++) len += M[a + b * size] * M[a + b * size];
len = sqrt(len);
// Each entry in M is normalized by a factor of 'len'.
for (a = 0; a < size; a++) M[a + b * size] /= len;
}
fclose(f);
// **********************************************
// **** beginning of user-interaction loop ****
// **********************************************
while (1) {
for (a = 0; a < N; a++) bestd[a] = 0;
for (a = 0; a < N; a++) bestw[a][0] = 0;
printf("Enter word or sentence (EXIT to break): ");
// st1 receives the input text from stdin (usually the console)
a = 0;
while (1) {
st1[a] = fgetc(stdin);
if ((st1[a] == '\n') || (a >= max_size - 1)) {
st1[a] = 0;
break;
}
a++;
}
// End program loop if input text = "EXIT".
if (!strcmp(st1, "EXIT")) break;
// The loop below splits st1 into individual zero-terminated words st[0], st[1], ...
cn = 0;
b = 0;
c = 0;
while (1) {
st[cn][b] = st1[c];
b++;
c++;
st[cn][b] = 0;
if (st1[c] == 0) break;
if (st1[c] == ' ') {
cn++;
b = 0;
c++;
}
}
cn++;
// cn = number of words (separated by a space) in the input text
// st = an array of strings: st[0][] is the first word of the input text; st[1][] is the second word, etc.
// This loop either finds each word within the input text in the 'vocab' string, or it signals
// b = -1 if at least one word is not found. If a word is found, b is the index to it in 'vocab'.
// bi[0] = the index of the first word in the input text; bi[1] = the index of the second word, etc.;
// bi[k] = -1 signals no more words--i.e., there are k words in the input text.
// For each word in the input text, the word and its position in the "dictionary" is printed.
for (a = 0; a < cn; a++) {
//
for (b = 0; b < words; b++) if (!strcmp(&vocab[b * max_w], st[a])) break;
if (b == words) b = -1;
bi[a] = b;
printf("\nWord: %s Position in vocabulary: %lld\n", st[a], bi[a]);
if (b == -1) {
printf("Out of dictionary word!\n");
break;
}
}
// If the input text is not found, restart the user-interaction loop.
if (b == -1) continue;
// Reminder:
// st points to the words in the input text (there are cn of them)
// bi[k] is the index of word st[k] within the 'vocab' string
// M holds the normalized (unit-length) vector for each word, including st[k][].
// The code below finds and prints the N "closest" words to the input and their
// "similarity" values (which are always < 1.0)--larger values are "closer".
printf("\n Word Cosine distance\n------------------------------------------------------------------------\n");
// vec contains the 'size' floating-point values associated with the input text.
// NOTE: if the input text contains multiple words, the value of each element in vec is
// the SUM of the corresponding float values for each of the words in the input text.
for (a = 0; a < size; a++) vec[a] = 0;
for (b = 0; b < cn; b++) {
if (bi[b] == -1) continue;
// Add the 'size' floats of the bi[b]-th word's vector into vec.
for (a = 0; a < size; a++) vec[a] += M[a + bi[b] * size];
}
// len = sqrt (sum of the squares of each vector element within vec)
// Each element in vec is normalized by dividing it by 'len'.
len = 0;
for (a = 0; a < size; a++) len += vec[a] * vec[a];
len = sqrt(len);
for (a = 0; a < size; a++) vec[a] /= len;
// Arrays bestd and bestw are associated with the list of the N words that are "closest"
// to the word(s) in the input text
// For an index i, bestw[i][] holds the word in that slot,
// and bestd[i] holds that word's "distance" value.
for (a = 0; a < N; a++) bestd[a] = 0;
for (a = 0; a < N; a++) bestw[a][0] = 0;
// For each word in "dictionary".... (in loop, c is the index of the word being tested)
for (c = 0; c < words; c++) {
// a is set to 1 if any of the words in the input text is the word being tested.
a = 0;
for (b = 0; b < cn; b++) if (bi[b] == c) a = 1;
if (a == 1) continue;
// The following executes only if the word being tested is NOT in the input text.
dist = 0;
// dist = dot product of vec and the vector of the word being tested (cosine similarity, since both are unit length)
for (a = 0; a < size; a++) dist += vec[a] * M[a + c * size];
// for each of the N slots that will eventually hold the N "closest" words...
for (a = 0; a < N; a++) {
// if the "distance" of word c is greater than the "distance" of the current slot (slot a),
// move all the bestd and bestw entries one entry closer to the end of the list (losing
// the "worst" entry) and insert the bestd and bestw entries for the current word (c)
// into the current slot (a).
if (dist > bestd[a]) {
for (d = N - 1; d > a; d--) {
bestd[d] = bestd[d - 1];
strcpy(bestw[d], bestw[d - 1]);
}
bestd[a] = dist;
strcpy(bestw[a], &vocab[c * max_w]);
break;
}
}
}
// From "best" to "worst", print each word and its "distance" value.
for (a = 0; a < N; a++) printf("%50s\t\t%f\n", bestw[a], bestd[a]);
}
return 0;
}
--- end distance.c ---
Original issue reported on code.google.com by [email protected]
on 22 Aug 2013 at 6:40
The hierarchical softmax sampling code misses some situations.
Original issue reported on code.google.com by [email protected]
on 26 Mar 2015 at 4:23
Attachments: