maxoodf / word2vec
word2vec++ is a Distributed Representations of Words (word2vec) library and tools implementation, written in C++11 from scratch.
License: Apache License 2.0
Thanks for releasing this.
FYI. I've made an R wrapper at https://github.com/bnosac/word2vec
Will be putting it on CRAN soon.
The link leads to Google Drive, which I don't have permission to access :(
Below is the excerpt:
====
You can download one or more models (833 MB each) trained on an 11.8 GB English text corpus:
CBOW, Hierarchical Softmax, vector size 500, window 10
CBOW, Negative Sampling, vector size 500, window 10
Skip-Gram, Hierarchical Softmax, vector size 500, window 10
Skip-Gram, Negative Sampling, vector size 500, window 10
====
Could you please let me in?
It seems the formats are not actually compatible.
I took text-format vectors generated by gensim: https://drive.google.com/file/d/1fbrVJeVlkrA8r4J-LHyjEMY2vmb3OliE/view?usp=sharing
And I'm getting this:
$ ./king2queen word2vec_format_sm.txt
.124246: 0.763318
000148: 0.728877
.007798: 0.699526
124530: 0.696778
-0.008152: 0.696077
0.070740: 0.693602
40459: 0.67723
03326: 0.672958
38744: 0.669238
Also I took pre-trained vectors from here:
https://code.google.com/archive/p/word2vec/
GoogleNews-vectors-negative300.bin.gz
Gensim works fine with the Google binary format:
In [1]: from gensim.models.keyedvectors import KeyedVectors
In [2]: word_vectors = KeyedVectors.load_word2vec_format('/home/oleksii/Downloads/GoogleNews-vectors-negative300.bin', binary=True)
In [3]: word_vectors.most_similar('king')
Out[3]:
[('kings', 0.7138045430183411),
('queen', 0.6510956287384033),
('monarch', 0.6413194537162781),
('crown_prince', 0.6204219460487366),
('prince', 0.6159993410110474),
('sultan', 0.5864823460578918),
('ruler', 0.5797567367553711),
('princes', 0.5646552443504333),
('Prince_Paras', 0.5432944297790527),
('throne', 0.5422105193138123)]
but this command doesn't work:
$ ./king2queen ~/Downloads/GoogleNews-vectors-negative300.bin
model: wrong model file format
Hello @maxoodf
I'm trying to make sure the R wrapper is ready for CRAN. That includes building and testing it on many platforms.
One of these is Solaris. Unfortunately, building the package fails on Solaris due to the following compilation errors.
Do you know a possible solution to this?
/opt/developerstudio12.6/bin/CC -std=c++11 -library=stdcpp,CrunG3 -I"/opt/R/R-4.0.0/lib/R/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -I./word2vec/include -I./word2vec/lib -I'/export/home/XCxYEwn/R/Rcpp/include' -I'/export/home/XCxYEwn/R/RcppProgress/include' -I/opt/csw/include -KPIC -O -xlibmil -xtarget=generic -xcache=generic -nofstore -c word2vec/lib/huffmanTree.cpp -o word2vec/lib/huffmanTree.o
CC: Warning: Option -pthread passed to ld, if ld is invoked, ignored otherwise
/opt/developerstudio12.6/bin/CC -std=c++11 -library=stdcpp,CrunG3 -I"/opt/R/R-4.0.0/lib/R/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -I./word2vec/include -I./word2vec/lib -I'/export/home/XCxYEwn/R/Rcpp/include' -I'/export/home/XCxYEwn/R/RcppProgress/include' -I/opt/csw/include -KPIC -O -xlibmil -xtarget=generic -xcache=generic -nofstore -c word2vec/lib/mapper.cpp -o word2vec/lib/mapper.o
CC: Warning: Option -pthread passed to ld, if ld is invoked, ignored otherwise
"./word2vec/include/mapper.hpp", line 74: Warning: Function w2v::fileMapper_t::~fileMapper_t() can throw only the exceptions thrown by the function w2v::mapper_t::~mapper_t() it overrides.
"./word2vec/include/mapper.hpp", line 74: Warning: Function w2v::fileMapper_t::~fileMapper_t() can throw only the exceptions thrown by the function w2v::mapper_t::~mapper_t() it overrides.
"word2vec/lib/mapper.cpp", line 71: Error: Formal argument 1 of type char* in call to munmap(char*, unsigned) is being passed void*.
1 Error(s) and 2 Warning(s) detected.
Do you provide any functionality to find the distance between two words? If there is no such functionality and you do not plan to implement it, I can probably put together a simple distance function based on the cosine similarity metric, if you brief me on the format of your .w2v files, and commit it here if you would like.
After running w2v_accuracy, and after all analogies in the test file have been tested, I get this error:
terminate called after throwing an instance of 'std::ios_base::failure[abi:cxx11]'
what(): basic_ios::clear: iostream error
Aborted
Here is the full input/output:
$ ./w2v_accuracy model.w2v /home/user/questions-words.txt
capital-common-countries
section accuracy: 0.405082
capital-world
section accuracy: 0.142876
currency
section accuracy: 0.585325
city-in-state
section accuracy: 0.44021
family
section accuracy: 0.765345
gram1-adjective-to-adverb
section accuracy: 0.59348
gram2-opposite
section accuracy: 0.332675
gram3-comparative
section accuracy: 0.766856
gram4-superlative
section accuracy: 0.569138
gram5-present-participle
section accuracy: 0.73709
gram6-nationality-adjective
section accuracy: 0.379385
gram7-past-tense
section accuracy: 0.841701
gram8-plural
section accuracy: 0.835196
gram9-plural-verbs
terminate called after throwing an instance of 'std::ios_base::failure[abi:cxx11]'
what(): basic_ios::clear: iostream error
Aborted
Can anyone help me?
Do you have any plans to add the distributed bag of words ('PV-DBOW') and the distributed memory ('PV-DM') models?
How can I search for words with accents? Like "balão", "água", etc.
When I enter a word with an accent, this is the message that shows up: doc2vec: can not create vector
EDIT:
I already figured it out. I have to use quotation marks...
Why do you normalize the vectors differently compared to the original word2vec code?
med = sqrt(med / vector_size);
len = sqrt(len);
The name of the header is "mman.h", but it is referenced as "mman.hpp" in the CMakeLists.txt file of the lib directory. Just changing the name in lib/CMakeLists.txt file to "mman.h" seems to work for me. Pretty sure this only applies to Windows users.
Hello, thank you very much for sharing your code.
I have tried the doc2vec example with several models, including the four pre-trained English models available on your GitHub and the one obtained from the original Google code and data, and I could not reproduce the results that are reported in
https://github.com/maxoodf/word2vec/blob/master/examples/doc2vec/main.cpp
4: 0.976313
6: 0.971176
3: 0.943542
7: 0.850593
1: 0.749066
2: 0.724662
5: 0.587743
The order is the same, but the cosine similarity values are much closer to each other (and much less discriminating). For example, this is what I obtain with the sg_hs_500_10.w2v model:
4: 0.995932
6: 0.995018
3: 0.992355
7: 0.981416
1: 0.969636
2: 0.969345
5: 0.953782
Do you know the reason for this difference?
I had another question:
Is there a possibility to merge word2vec models, or to update word2vec models from a new corpus?
Thanks for your help
Using a model trained with w2v_trainer.
time ./w2v_accuracy model.w2v ../../../word2vec/questions-phrases.txt
newspapers
ice_hockey
basketball
airlines
section accuracy: 0.124119
people-companies
terminate called after throwing an instance of 'std::ios_base::failure'
what(): basic_ios::clear
Aborted
Can we have the option of a -read-vocab argument, as present in the original word2vec?
Hi Max,
Your word2vec implementation compiles and works fine, and it is also fast. Nice work!
Would it be possible, instead of pointing to a single training corpus file (-f [file] or --train-file [file]), to also point to a directory containing many text files and use the text files in that directory as the training corpus? (i.e. something like --train-directory ~/Desktop/trainingData)
Thanks!
Hello @maxoodf
I'm checking the package a bit alongside these models available at https://nlp.h-its.org/bpemb
I noticed that when loading the model, you basically standardise the embeddings as in
https://github.com/maxoodf/word2vec/blob/master/lib/word2vec.cpp#L187
But when computing the nearest based on a distance
https://github.com/maxoodf/word2vec/blob/master/include/word2vec.hpp#L178
You do the same thing again, so the normalisation appears to happen twice. Is this expected behaviour?
I have previously used https://github.com/jnr/jnr-ffi to create Java bindings for a C shared library.
Is it possible to create a Java binding for libword2vec?