
maxoodf / word2vec


word2vec++ is a Distributed Representations of Words (word2vec) library and tools implementation, written from scratch in C++11.

License: Apache License 2.0

CMake 3.35% C++ 96.09% C 0.56%
nlp nlp-machine-learning word2vec c-plus-plus ml machine-learning machine-learning-algorithms word2vec-algorithm word2vec-model word2vec-en

word2vec's People

Contributors

maxoodf


word2vec's Issues

No Compatibility with standard word2vec formats

It seems the library is not actually compatible.
I took text-format vectors generated by gensim: https://drive.google.com/file/d/1fbrVJeVlkrA8r4J-LHyjEMY2vmb3OliE/view?usp=sharing
and I get this:

$ ./king2queen word2vec_format_sm.txt 
.124246: 0.763318
000148: 0.728877
.007798: 0.699526
124530: 0.696778
-0.008152: 0.696077
0.070740: 0.693602
40459: 0.67723
03326: 0.672958
38744: 0.669238

I also took the pre-trained vectors from here:
https://code.google.com/archive/p/word2vec/
GoogleNews-vectors-negative300.bin.gz

Gensim works fine with the Google binary format:

In [1]: from gensim.models.keyedvectors import KeyedVectors

In [2]: word_vectors = KeyedVectors.load_word2vec_format('/home/oleksii/Downloads/GoogleNews-vectors-negative300.bin', binary=True)

In [3]: word_vectors.most_similar('king')
Out[3]: 
[('kings', 0.7138045430183411),
 ('queen', 0.6510956287384033),
 ('monarch', 0.6413194537162781),
 ('crown_prince', 0.6204219460487366),
 ('prince', 0.6159993410110474),
 ('sultan', 0.5864823460578918),
 ('ruler', 0.5797567367553711),
 ('princes', 0.5646552443504333),
 ('Prince_Paras', 0.5432944297790527),
 ('throne', 0.5422105193138123)]

but this command doesn't work:

$ ./king2queen ~/Downloads/GoogleNews-vectors-negative300.bin
model: wrong model file format

compilation

Hello @maxoodf
I'm trying to make sure the R wrapper is ready for CRAN. That includes running it and testing it on many platforms.
One of these is Solaris. Unfortunately, building the package fails on Solaris due to the following compilation errors.
Do you know a possible solution?

/opt/developerstudio12.6/bin/CC -std=c++11 -library=stdcpp,CrunG3  -I"/opt/R/R-4.0.0/lib/R/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -I./word2vec/include -I./word2vec/lib -I'/export/home/XCxYEwn/R/Rcpp/include' -I'/export/home/XCxYEwn/R/RcppProgress/include' -I/opt/csw/include   -KPIC  -O -xlibmil -xtarget=generic -xcache=generic -nofstore  -c word2vec/lib/huffmanTree.cpp -o word2vec/lib/huffmanTree.o
CC: Warning: Option -pthread passed to ld, if ld is invoked, ignored otherwise
/opt/developerstudio12.6/bin/CC -std=c++11 -library=stdcpp,CrunG3  -I"/opt/R/R-4.0.0/lib/R/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -I./word2vec/include -I./word2vec/lib -I'/export/home/XCxYEwn/R/Rcpp/include' -I'/export/home/XCxYEwn/R/RcppProgress/include' -I/opt/csw/include   -KPIC  -O -xlibmil -xtarget=generic -xcache=generic -nofstore  -c word2vec/lib/mapper.cpp -o word2vec/lib/mapper.o
CC: Warning: Option -pthread passed to ld, if ld is invoked, ignored otherwise
"./word2vec/include/mapper.hpp", line 74: Warning: Function w2v::fileMapper_t::~fileMapper_t() can throw only the exceptions thrown by the function w2v::mapper_t::~mapper_t() it overrides.
"./word2vec/include/mapper.hpp", line 74: Warning: Function w2v::fileMapper_t::~fileMapper_t() can throw only the exceptions thrown by the function w2v::mapper_t::~mapper_t() it overrides.
"word2vec/lib/mapper.cpp", line 71: Error: Formal argument 1 of type char* in call to munmap(char*, unsigned) is being passed void*.
1 Error(s) and 2 Warning(s) detected.

Distance Between Two Words

Do you provide any functionality to find the distance between two words? If not, and you do not plan to implement it, I could probably put together a simple distance function based on the cosine similarity metric and commit it here, if you brief me on the format of your .w2v files.
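The cosine-similarity metric mentioned above is easy to sketch outside the library, assuming the two word vectors can already be extracted (plain Python, not part of this project's API):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """A common distance derived from the similarity."""
    return 1.0 - cosine_similarity(a, b)
```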

iostream error running w2v_accuracy

After running w2v_accuracy, once all analogies in the test file have been tested, I get this error:

terminate called after throwing an instance of 'std::ios_base::failure[abi:cxx11]'
  what():  basic_ios::clear: iostream error
Aborted

Here is the full input/output:

$ ./w2v_accuracy model.w2v /home/user/questions-words.txt
 capital-common-countries
section accuracy: 0.405082
 capital-world
section accuracy: 0.142876
 currency
section accuracy: 0.585325
 city-in-state
section accuracy: 0.44021
 family
section accuracy: 0.765345
 gram1-adjective-to-adverb
section accuracy: 0.59348
 gram2-opposite
section accuracy: 0.332675
 gram3-comparative
section accuracy: 0.766856
 gram4-superlative
section accuracy: 0.569138
 gram5-present-participle
section accuracy: 0.73709
 gram6-nationality-adjective
section accuracy: 0.379385
 gram7-past-tense
section accuracy: 0.841701
 gram8-plural
section accuracy: 0.835196
 gram9-plural-verbs
terminate called after throwing an instance of 'std::ios_base::failure[abi:cxx11]'
  what():  basic_ios::clear: iostream error
Aborted

Can anyone help me?

Paragraph Vector algorithms

Do you have any plans to add the distributed bag-of-words ('PV-DBOW') and distributed memory ('PV-DM') models?

Word distance - words with accents

How can I search for words with accents, like "balão", "água", etc.?

When I enter a word with an accent, this message shows up: doc2vec: can not create vector

EDIT:
I figured it out. I have to use quotation marks.

vector normalization?

Why do you normalize the vectors differently than the original word2vec code does?

med = sqrt(med / vector_size);

len = sqrt(len);
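For comparison, the two lines quoted above scale each vector by a different constant: the original code divides by the plain Euclidean norm (`len = sqrt(len)`), while this implementation divides by the norm scaled by `sqrt(vector_size)` (`med = sqrt(med / vector_size)`). The resulting directions are identical, so cosine similarities are unchanged; only the magnitudes differ by a constant factor. A quick sketch of both, assuming the accumulated value is the sum of squares in each case:

```python
import math

def normalize_original(v):
    # original word2vec: divide by len = sqrt(sum of squares)
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def normalize_scaled(v):
    # this library: divide by med = sqrt(sum of squares / vector_size)
    med = math.sqrt(sum(x * x for x in v) / len(v))
    return [x / med for x in v]
```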

mman.hpp wrong in lib/CMakeLists.txt

The header is named "mman.h", but it is referenced as "mman.hpp" in the lib directory's CMakeLists.txt file. Simply changing the name in lib/CMakeLists.txt to "mman.h" works for me. This most likely only affects Windows users.

[question] doc2vec results + merging/updating models

Hello, thank you very much for sharing your code.

I have tried the doc2vec example with several models, including the four pre-trained English models available on your GitHub and the one obtained from the original Google code and data, and I could not reproduce the results reported in

https://github.com/maxoodf/word2vec/blob/master/examples/doc2vec/main.cpp

4: 0.976313
6: 0.971176
3: 0.943542
7: 0.850593
1: 0.749066
2: 0.724662
5: 0.587743

The order is the same, but the cosine similarity values are much closer to each other (and much less discriminating). For example, this is what I obtain with the sg_hs_500_10.w2v model:

4: 0.995932
6: 0.995018
3: 0.992355
7: 0.981416
1: 0.969636
2: 0.969345
5: 0.953782

Do you know the reason for this difference?

I have another question:
Is it possible to merge word2vec models, or to update a word2vec model from a new corpus?

Thanks for your help

w2v_accuracy failure

Using model trained from w2v_trainer.

time ./w2v_accuracy model.w2v ../../../word2vec/questions-phrases.txt 
 newspapers
 ice_hockey
 basketball
 airlines
section accuracy: 0.124119
 people-companies
terminate called after throwing an instance of 'std::ios_base::failure'
  what():  basic_ios::clear
Aborted

Adding an option to process a folder containing many files for training rather than just one big train text corpus

Hi Max,

Your word2vec implementation compiles and works fine - is also fast. Nice work!

Would it be possible, instead of pointing to a single training corpus file (-f [file] or --train-file [file]), to point to a directory containing many text files and use all of them as the training corpus? (i.e. something like --train-directory ~/Desktop/trainingData)

Thanks!

why normalisation when loading the model

Hello @maxoodf
I'm checking the package a bit alongside these models available at https://nlp.h-its.org/bpemb
I noticed that when loading a model you normalise the embeddings, as in
https://github.com/maxoodf/word2vec/blob/master/lib/word2vec.cpp#L187
but when computing the nearest words based on distance
https://github.com/maxoodf/word2vec/blob/master/include/word2vec.hpp#L178
you do this again. This seems to do the work twice. Is this the expected behaviour?
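For what it's worth, re-normalising a vector that already has unit length is a no-op, so if the second pass is a plain L2 normalisation it should not change the results, only repeat the work. A quick check of that idempotence property:

```python
import math

def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]
```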
