tca19 / near-lossless-binarization Goto Github PK
This repository contains source code to binarize any real-valued word embeddings into binary vectors.
License: GNU General Public License v3.0
Hi,
I find your work interesting and I was reading the source code in the binarize.c file. In the process, I found a potential error in the read_word function.
int getc_unlocked(FILE *);
UTF-8 embedding files are the mainstream, and restricting MAXWORDLEN to 256 is too small when reading byte by byte. I tried some embedding files, such as a fastText skip-gram model, and got a segmentation fault. At first I was confused, because the longest word in my embedding file is only 228 characters; once I thought about its UTF-8 encoding, I understood: multi-byte characters can push the byte length past 256. I changed MAXWORDLEN to 1024 and successfully got the result. During the process, I referenced liaocs2008's issue.
If we do not check the maximum word length in the vector file, we can also end up with duplicated words, because longer words get truncated at MAXWORDLEN.
/* skip white spaces (space or line feed (ascii code 0x0a is \n)) */
while (isspace((tmp[i] = getc_unlocked(fp))))
Yes, when I binarized a fastText skip-gram model using this repository, I got an error. It happens because the file contains whitespace "words" such as \t or a plain space, especially in pre-trained vector files. The current function just skips them, so several entries collapse into the same empty word.
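To make the byte-versus-character mismatch concrete, here is a minimal Python sketch. The buffer size 256 (MAXWORDLEN) and the 228-character word length come from the report above; the specific character is an arbitrary 2-byte UTF-8 example:

```python
# MAXWORDLEN = 256 is the byte buffer size in binarize.c (per this issue).
# "é" encodes to 2 bytes in UTF-8, so a 228-character word made of such
# characters needs 456 bytes -- far more than the 256-byte buffer.
word = "é" * 228
encoded = word.encode("utf-8")

print(len(word))     # 228 characters
print(len(encoded))  # 456 bytes: overflows a char buf[256]
```

So a byte-by-byte reader must bound the loop by the *byte* count, not the character count, or it writes past the buffer exactly as described.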
Hi,
I find your work interesting and I was testing it on glove.6B.300d.txt (after converting to the w2v format). The program works fine and output vectors look normal.
However, when I run it on glove.42B.300d.txt, the program outputs zeros for all words. I am guessing it might be a convergence issue. Could you test it and help me figure out the problem?
Thanks!
Hi, is there a reason why your Sokal-Michener function differs from other sources online? For example, SciPy's sokalmichener returns a different value from your function.
import numpy as np

def sokalmichener(u, v):
    ntt = np.dot(u, v.T)
    ntf = np.dot(u, 1 - v.T)
    nff = np.dot(1.0 - u, 1.0 - v.T)
    nft = np.dot(1.0 - u, v.T)
    print(ntt + ntf + nff + nft)
    return (ntt + nff) / (ntt + ntf + nff + nft)

u = np.array([1, 1, 1, 1, 1, 0, 0, 1])
v = np.array([1, 0, 0, 1, 1, 0, 0, 1])
sokalmichener(u, v)
Returns
8.0
0.75
Scipy's version:
distance.sokalmichener(u, v)
0.4
Is the Sokal Michener dissimilarity function a different thing?
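For what it's worth, the two values are consistent with two different definitions. The function above computes the simple matching similarity (ntt + nff) / n, while SciPy's distance.sokalmichener returns a dissimilarity that double-weights the mismatches: 2R / (S + 2R) with R = ntf + nft and S = ntt + nff (per the SciPy documentation). A quick check on the same vectors, without needing SciPy installed:

```python
import numpy as np

u = np.array([1, 1, 1, 1, 1, 0, 0, 1])
v = np.array([1, 0, 0, 1, 1, 0, 0, 1])

ntt = int(np.dot(u, v))          # both 1
ntf = int(np.dot(u, 1 - v))      # u=1, v=0
nft = int(np.dot(1 - u, v))      # u=0, v=1
nff = int(np.dot(1 - u, 1 - v))  # both 0
n = len(u)

similarity = (ntt + nff) / n     # simple matching similarity
R, S = ntf + nft, ntt + nff
scipy_style = 2 * R / (S + 2 * R)  # SciPy's Sokal-Michener dissimilarity

print(similarity)   # 0.75
print(scipy_style)  # 0.4
```

Note that scipy_style is not 1 - similarity (0.4 vs 0.25), precisely because mismatches are counted twice in both numerator and denominator of SciPy's formula.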
OS : ubuntu 16.04.6
gcc version : 5.4.0-6ubuntu1~16.04.11
First, I installed OpenBLAS:
sudo apt install libopenblas-dev
sudo update-alternatives --config libblas.so.3
Then I ran make
and got the following error:
gcc -ansi -pedantic -lm -pthread -Ofast -funroll-loops -lblas -Wall -Wextra -Wno-unused-result binarize.c -o binarize
/tmp/cc9DvK9o.o: In function `apply_regularizarion_gradient':
binarize.c:(.text+0xea8): undefined reference to `cblas_sgemm'
binarize.c:(.text+0x1087): undefined reference to `cblas_sgemm'
/tmp/cc9DvK9o.o: In function `apply_reconstruction_gradient':
binarize.c:(.text+0x1117): undefined reference to `cblas_sgemm'
binarize.c:(.text+0x149b): undefined reference to `cblas_sgemm'
binarize.c:(.text+0x1c54): undefined reference to `cblas_sgemm'
/tmp/cc9DvK9o.o:binarize.c:(.text+0x22b7): more undefined references to `cblas_sgemm' follow
collect2: error: ld returned 1 exit status
makefile:28: recipe for target 'binarize' failed
make: *** [binarize] Error 1
How can I deal with this problem? Any help would be appreciated.
On Fedora 30 with GCC 9.2.1, I had to add the library '-lcblas' at line 23 of the makefile for make to run properly.
From:
LDLIBS = -lblas -lm
to:
LDLIBS = -lblas -lm -lcblas
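If the system's generic libblas.so does not re-export the CBLAS symbols at all, another option (an assumption about the local setup, not part of the repository's makefile) is to link OpenBLAS directly, since it bundles the CBLAS interface that provides cblas_sgemm:

```makefile
# makefile line 23: link OpenBLAS directly instead of the generic BLAS
LDLIBS = -lopenblas -lm
```

Either variant works only if the libraries appear after binarize.c on the gcc command line, since the linker resolves symbols left to right.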
Hi,
When I binarize the wiki-news-300d-1M.vec file, the program outputs
*** Error in `./binarize': free(): invalid pointer: 0x00007f64c490e010 ***
From the backtrace, I know something goes wrong when freeing memory, but I can't find the exact position.
I am surprised that the program works normally when I compress the glove.6B.300d.txt file.
First of all, thank you for the great paper; I really like the simplicity of the model! Currently I am trying to reproduce the semantic similarity scores, but somehow this doesn't seem to work. What I did:
Use the Wiki 1M 300 vec embeddings from https://fasttext.cc/docs/en/english-vectors.html
Train using the parameters specified in the paper for fastText (10 epochs, all other values default)
Run your semantic evaluation script:
create_vocab(): 0.008999s
load_vectors(): 0.926019s
Filename | Spearman | OOV
==============================
WS353.txt | 0.559 | 0%
MEN.txt | 0.622 | 0%
RW.txt | 0.421 | 3%
SimVerb.txt | 0.238 | 0%
SimLex.txt | 0.334 | 0%
evaluate(): 0.008135s
Maybe I am missing something, but the scores for WS353, RW, SimVerb and SimLex are quite a bit lower than reported. Am I using the correct fastText embeddings? I did not perform any normalization or standardization.
Thank you in advance!
I am having trouble applying make
to build the binaries.
In the first step I installed all requirements and used git clone
to clone the repository to my local directory.
Running cd near-lossless-binarization && make
produces this error message:
gcc binarize.c -o binarize -ansi -pedantic -Wall -Wextra -Wno-unused-result -Ofast -funroll-loops -lblas -lm
/usr/bin/ld: /tmp/ccdDjYDi.o: in function `apply_regularizarion_gradient':
binarize.c:(.text+0xddc): undefined reference to `cblas_sgemm'
/usr/bin/ld: binarize.c:(.text+0xf9e): undefined reference to `cblas_sgemm'
/usr/bin/ld: /tmp/ccdDjYDi.o: in function `apply_reconstruction_gradient':
binarize.c:(.text+0x1059): undefined reference to `cblas_sgemm'
/usr/bin/ld: binarize.c:(.text+0x12fb): undefined reference to `cblas_sgemm'
/usr/bin/ld: binarize.c:(.text+0x1800): undefined reference to `cblas_sgemm'
/usr/bin/ld: /tmp/ccdDjYDi.o:binarize.c:(.text+0x1e71): more undefined references to `cblas_sgemm' follow
The exit status is 1.
I want to use your repo in my bachelor thesis. Any advice on how to fix this?
The requirements are fulfilled as far as I know, so the mistake is probably on my side.
Hey Julien Tissier, Christophe Gravier and Amaury Habrard,
Your paper looks really interesting and I wanted to test the binary vectors. However, when I run the command ./binarize -input ../crawl-300d-2M.vec
on the Facebook vectors, I get the following result on an Ubuntu 16.04 server with openblas-dev installed:
1999995 256
, 12854058564283852332 14495264440446430917 2338217189150912797 3900823073646626759
the 2543537490569830019 7091594523546908885 12132390578482067676 17054061891972847710
. 49085064044372029 13430128919309794722 3676268525978997720 1129446822182950747
and 16350715358351243055 16393012712704504925 969627095091772476 2240764738555920805
to 9099634481703745995 17642094685701064884 6452519320616550255 16120898980517001725
of 236520170687141399 4767709100272182524 12135506450154551421 17634123495325585581
a 11194259637020728784 4228149066443023844 7425191957517212067 12735785767725396608
....
Furthermore, when I then run the command ./similarity_binary binary_vectors.vec
on the result, it fails with the following error:
create_vocab(): 0.012973s
*** stack smashing detected ***: ./similarity_binary terminated
Aborted (core dumped)
I've tried to debug the problem, but have not been able to find the source. Am I doing something wrong?
When I execute this code with small embeddings (e.g., GloVe Twitter 25d), I'd like to use a representation smaller than 64 bits, but it doesn't work: the output file doesn't support anything smaller than that size.
Is there a way to obtain smaller binary embeddings with this method?
Thank you,
Emanuele
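The 64-bit floor is consistent with the output format shown earlier on this page, where each word's binary code is printed as one unsigned 64-bit integer per 64 bits (four integers for 256-bit codes). A Python sketch of that packing, assuming bits are stored in 64-bit machine words (an assumption based on the printed output, not a reading of binarize.c):

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 bits into unsigned 64-bit words.
    A 25-bit code still occupies one full 64-bit word: the
    remaining 39 bits are zero padding."""
    words = []
    for start in range(0, len(bits), 64):
        chunk = bits[start:start + 64]
        word = 0
        for bit in chunk:
            word = (word << 1) | bit
        word <<= 64 - len(chunk)  # zero-pad a partial final chunk
        words.append(word)
    return words

code = [1, 0, 1] + [0] * 22          # a hypothetical 25-bit code
print(len(pack_bits(code)))          # 1 -> still one 64-bit block
print(len(pack_bits([0] * 256)))     # 4 -> four integers per word, as above
```

Under this storage scheme, codes shorter than 64 bits save nothing on disk, which would explain why the tool does not offer them.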