maxoodf / word2vec
word2vec++ is a Distributed Representations of Words (word2vec) library and tools implementation, written in C++11 from scratch.
License: Apache License 2.0
Thanks for releasing this.
FYI. I've made an R wrapper at https://github.com/bnosac/word2vec
Will be putting it on CRAN soon.
The link leads to Google Drive, which I don't have permission to access :(
Below is the excerpt:
====
You can download one or more models (833 MB each) trained on an 11.8 GB English text corpus:
CBOW, Hierarchical Softmax, vector size 500, window 10
CBOW, Negative Sampling, vector size 500, window 10
Skip-Gram, Hierarchical Softmax, vector size 500, window 10
Skip-Gram, Negative Sampling, vector size 500, window 10
====
Could you please let me in?
It seems the formats are not actually compatible.
I took text-format vectors generated by gensim: https://drive.google.com/file/d/1fbrVJeVlkrA8r4J-LHyjEMY2vmb3OliE/view?usp=sharing
And I'm getting this:
$ ./king2queen word2vec_format_sm.txt
.124246: 0.763318
000148: 0.728877
.007798: 0.699526
124530: 0.696778
-0.008152: 0.696077
0.070740: 0.693602
40459: 0.67723
03326: 0.672958
38744: 0.669238
Also I took pre-trained vectors from here:
https://code.google.com/archive/p/word2vec/
GoogleNews-vectors-negative300.bin.gz
Gensim works fine with the Google binary format:
In [1]: from gensim.models.keyedvectors import KeyedVectors
In [2]: word_vectors = KeyedVectors.load_word2vec_format('/home/oleksii/Downloads/GoogleNews-vectors-negative300.bin', binary=True)
In [3]: word_vectors.most_similar('king')
Out[3]:
[('kings', 0.7138045430183411),
('queen', 0.6510956287384033),
('monarch', 0.6413194537162781),
('crown_prince', 0.6204219460487366),
('prince', 0.6159993410110474),
('sultan', 0.5864823460578918),
('ruler', 0.5797567367553711),
('princes', 0.5646552443504333),
('Prince_Paras', 0.5432944297790527),
('throne', 0.5422105193138123)]
but this command doesn't work:
$ ./king2queen ~/Downloads/GoogleNews-vectors-negative300.bin
model: wrong model file format
Hello @maxoodf
I'm trying to make sure the R wrapper is ready for CRAN. That includes building and testing it on many platforms.
One of these is Solaris. Unfortunately, building the package fails on Solaris due to the following compilation errors.
Do you know a possible solution to this?
/opt/developerstudio12.6/bin/CC -std=c++11 -library=stdcpp,CrunG3 -I"/opt/R/R-4.0.0/lib/R/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -I./word2vec/include -I./word2vec/lib -I'/export/home/XCxYEwn/R/Rcpp/include' -I'/export/home/XCxYEwn/R/RcppProgress/include' -I/opt/csw/include -KPIC -O -xlibmil -xtarget=generic -xcache=generic -nofstore -c word2vec/lib/huffmanTree.cpp -o word2vec/lib/huffmanTree.o
CC: Warning: Option -pthread passed to ld, if ld is invoked, ignored otherwise
/opt/developerstudio12.6/bin/CC -std=c++11 -library=stdcpp,CrunG3 -I"/opt/R/R-4.0.0/lib/R/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -I./word2vec/include -I./word2vec/lib -I'/export/home/XCxYEwn/R/Rcpp/include' -I'/export/home/XCxYEwn/R/RcppProgress/include' -I/opt/csw/include -KPIC -O -xlibmil -xtarget=generic -xcache=generic -nofstore -c word2vec/lib/mapper.cpp -o word2vec/lib/mapper.o
CC: Warning: Option -pthread passed to ld, if ld is invoked, ignored otherwise
"./word2vec/include/mapper.hpp", line 74: Warning: Function w2v::fileMapper_t::~fileMapper_t() can throw only the exceptions thrown by the function w2v::mapper_t::~mapper_t() it overrides.
"./word2vec/include/mapper.hpp", line 74: Warning: Function w2v::fileMapper_t::~fileMapper_t() can throw only the exceptions thrown by the function w2v::mapper_t::~mapper_t() it overrides.
"word2vec/lib/mapper.cpp", line 71: Error: Formal argument 1 of type char* in call to munmap(char*, unsigned) is being passed void*.
1 Error(s) and 2 Warning(s) detected.
Do you provide any functionality to find the distance between two words? If there is no such functionality and you do not plan to implement it, I can probably put together a simple distance function based on the cosine similarity metric, if you brief me on the format of your .w2v files, and commit it here if you would like.
After running w2v_accuracy, and after all analogies in the test file have been tested, I get this error:
terminate called after throwing an instance of 'std::ios_base::failure[abi:cxx11]'
what(): basic_ios::clear: iostream error
Aborted
Here is the full input/output:
$ ./w2v_accuracy model.w2v /home/user/questions-words.txt
capital-common-countries
section accuracy: 0.405082
capital-world
section accuracy: 0.142876
currency
section accuracy: 0.585325
city-in-state
section accuracy: 0.44021
family
section accuracy: 0.765345
gram1-adjective-to-adverb
section accuracy: 0.59348
gram2-opposite
section accuracy: 0.332675
gram3-comparative
section accuracy: 0.766856
gram4-superlative
section accuracy: 0.569138
gram5-present-participle
section accuracy: 0.73709
gram6-nationality-adjective
section accuracy: 0.379385
gram7-past-tense
section accuracy: 0.841701
gram8-plural
section accuracy: 0.835196
gram9-plural-verbs
terminate called after throwing an instance of 'std::ios_base::failure[abi:cxx11]'
what(): basic_ios::clear: iostream error
Aborted
Can anyone help me?
Do you have any plans to add the distributed bag of words ('PV-DBOW') and the distributed memory ('PV-DM') models?
How can I search for words with accents? Like "balão", "água", etc.
When I enter a word with an accent, this is the message that shows up: doc2vec: can not create vector
EDIT:
I already figured it out. I have to use quotation marks...
Why do you normalize the vectors differently compared to the original word2vec code?
med = sqrt(med / vector_size);
len = sqrt(len);
The name of the header is "mman.h", but it is referenced as "mman.hpp" in the CMakeLists.txt file of the lib directory. Just changing the name in lib/CMakeLists.txt file to "mman.h" seems to work for me. Pretty sure this only applies to Windows users.
Hello, thank you very much for sharing your code.
I have tried the doc2vec example with several models, including the four pre-trained English models available on your GitHub and the one obtained from the original Google code and data, and I could not reproduce the results that are reported in
https://github.com/maxoodf/word2vec/blob/master/examples/doc2vec/main.cpp
4: 0.976313
6: 0.971176
3: 0.943542
7: 0.850593
1: 0.749066
2: 0.724662
5: 0.587743
The order is the same, but the cosine similarity values are much closer to each other (and much less discriminating). For example, this is what I obtain with the sg_hs_500_10.w2v model:
4: 0.995932
6: 0.995018
3: 0.992355
7: 0.981416
1: 0.969636
2: 0.969345
5: 0.953782
Do you know the reason for this difference?
I had another question:
Is there a possibility to merge word2vec models, or to update word2vec models from a new corpus?
Thanks for your help
Using a model trained with w2v_trainer.
time ./w2v_accuracy model.w2v ../../../word2vec/questions-phrases.txt
newspapers
ice_hockey
basketball
airlines
section accuracy: 0.124119
people-companies
terminate called after throwing an instance of 'std::ios_base::failure'
what(): basic_ios::clear
Aborted
Can we have the option of a -read-vocab argument, as present in the original word2vec?
Hi Max,
Your word2vec implementation compiles and works fine, and it is also fast. Nice work!
Would it be possible, instead of pointing to a single training corpus file (-f [file] or --train-file [file]), to also point to a directory containing many text files and use the text files in that directory as the training corpus? (i.e. something like --train-directory ~/Desktop/trainingData)
Thanks!
Hello @maxoodf
I'm checking the package a bit alongside these models available at https://nlp.h-its.org/bpemb
I noticed that when loading the model, you basically standardise the embeddings as in
https://github.com/maxoodf/word2vec/blob/master/lib/word2vec.cpp#L187
But when computing the nearest based on a distance
https://github.com/maxoodf/word2vec/blob/master/include/word2vec.hpp#L178
You do the same thing again, so the normalisation appears to happen twice. Is this expected behaviour?
I have previously used https://github.com/jnr/jnr-ffi to create Java bindings for a C shared library.
Is it possible to create a Java binding for libword2vec?