tca19 / near-lossless-binarization Goto Github PK
This repository contains source code to binarize any real-valued word embeddings into binary vectors.
License: GNU General Public License v3.0
Hi,
I find your work interesting and I was reading the source code in the binarize.c file. In the process, I found a potential error in the read_word function.
int getc_unlocked(FILE *);
UTF-8 embedding files are the mainstream, and restricting MAXWORDLEN to 256 is too small when reading byte by byte. I tried some embedding files, such as a fastText skip-gram model, and got a segmentation fault. At first I was confused, because the longest word in my embedding file is only 228 characters; once I thought about its UTF-8 encoding, I understood: multi-byte characters can push the byte length past 256. I changed MAXWORDLEN to 1024 and successfully got the result. During the process, I referenced liaocs2008's issue.
If we do not check the maximum word length in the vector file, we can also end up with duplicated words, because longer words get truncated at MAXWORDLEN.
/* skip white spaces (space or line feed (ascii code 0x0a is \n)) */
while (isspace((tmp[i] = getc_unlocked(fp))))
Yes, when I binarized a fastText skip-gram model using this repository, I got an error. It happens because the file contains whitespace "words" such as \t or a plain space, especially in pre-trained vector files. The current function just skips them, so several entries collapse into the same empty word.
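To make the byte-versus-character mismatch concrete, here is a minimal Python sketch. The buffer size 256 (MAXWORDLEN) and the 228-character word length come from the report above; the specific character is an arbitrary 2-byte UTF-8 example:

```python
# MAXWORDLEN = 256 is the byte buffer size in binarize.c (per this issue).
# "é" encodes to 2 bytes in UTF-8, so a 228-character word made of such
# characters needs 456 bytes -- far more than the 256-byte buffer.
word = "é" * 228
encoded = word.encode("utf-8")

print(len(word))     # 228 characters
print(len(encoded))  # 456 bytes: overflows a char buf[256]
```

So a byte-by-byte reader must bound the loop by the *byte* count, not the character count, or it writes past the buffer exactly as described.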
Hi,
I find your work interesting and I was testing it on glove.6B.300d.txt (after converting to the w2v format). The program works fine and output vectors look normal.
However, when I run it on glove.42B.300d.txt, the program outputs zeros for all words. I am guessing it might be a convergence issue. Could you test it and help me figure out the problem?
Thanks!
Hi, is there a reason why your Sokal-Michener function differs from other sources online? For example, SciPy's sokalmichener returns a different value from your function.
import numpy as np

def sokalmichener(u, v):
    ntt = np.dot(u, v.T)
    ntf = np.dot(u, 1 - v.T)
    nff = np.dot(1.0 - u, 1.0 - v.T)
    nft = np.dot(1.0 - u, v.T)
    print(ntt + ntf + nff + nft)
    return (ntt + nff) / (ntt + ntf + nff + nft)

u = np.array([1, 1, 1, 1, 1, 0, 0, 1])
v = np.array([1, 0, 0, 1, 1, 0, 0, 1])
sokalmichener(u, v)
Returns
8.0
0.75
Scipy's version:
distance.sokalmichener(u, v)
0.4
Is the Sokal Michener dissimilarity function a different thing?
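For what it's worth, the two values are consistent with two different definitions. The function above computes the simple matching similarity (ntt + nff) / n, while SciPy's distance.sokalmichener returns a dissimilarity that double-weights the mismatches: 2R / (S + 2R) with R = ntf + nft and S = ntt + nff (per the SciPy documentation). A quick check on the same vectors, without needing SciPy installed:

```python
import numpy as np

u = np.array([1, 1, 1, 1, 1, 0, 0, 1])
v = np.array([1, 0, 0, 1, 1, 0, 0, 1])

ntt = int(np.dot(u, v))          # both 1
ntf = int(np.dot(u, 1 - v))      # u=1, v=0
nft = int(np.dot(1 - u, v))      # u=0, v=1
nff = int(np.dot(1 - u, 1 - v))  # both 0
n = len(u)

similarity = (ntt + nff) / n     # simple matching similarity
R, S = ntf + nft, ntt + nff
scipy_style = 2 * R / (S + 2 * R)  # SciPy's Sokal-Michener dissimilarity

print(similarity)   # 0.75
print(scipy_style)  # 0.4
```

Note that scipy_style is not 1 - similarity (0.4 vs 0.25), precisely because mismatches are counted twice in both numerator and denominator of SciPy's formula.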
OS : ubuntu 16.04.6
gcc version : 5.4.0-6ubuntu1~16.04.11
First, I installed OpenBLAS:
sudo apt install libopenblas-dev
sudo update-alternatives --config libblas.so.3
Then I ran make
and got the following error:
gcc -ansi -pedantic -lm -pthread -Ofast -funroll-loops -lblas -Wall -Wextra -Wno-unused-result binarize.c -o binarize
/tmp/cc9DvK9o.o: In function `apply_regularizarion_gradient':
binarize.c:(.text+0xea8): undefined reference to `cblas_sgemm'
binarize.c:(.text+0x1087): undefined reference to `cblas_sgemm'
/tmp/cc9DvK9o.o: In function `apply_reconstruction_gradient':
binarize.c:(.text+0x1117): undefined reference to `cblas_sgemm'
binarize.c:(.text+0x149b): undefined reference to `cblas_sgemm'
binarize.c:(.text+0x1c54): undefined reference to `cblas_sgemm'
/tmp/cc9DvK9o.o:binarize.c:(.text+0x22b7): more undefined references to `cblas_sgemm' follow
collect2: error: ld returned 1 exit status
makefile:28: recipe for target 'binarize' failed
make: *** [binarize] Error 1
How can I deal with this problem? Any help would be appreciated.
On Fedora 30 with GCC 9.2.1, I had to add the library '-lcblas' at line 23 of the makefile for make to run properly.
From:
LDLIBS = -lblas -lm
to:
LDLIBS = -lblas -lm -lcblas
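If the system's generic libblas.so does not re-export the CBLAS symbols at all, another option (an assumption about the local setup, not part of the repository's makefile) is to link OpenBLAS directly, since it bundles the CBLAS interface that provides cblas_sgemm:

```makefile
# makefile line 23: link OpenBLAS directly instead of the generic BLAS
LDLIBS = -lopenblas -lm
```

Either variant works only if the libraries appear after binarize.c on the gcc command line, since the linker resolves symbols left to right.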
Hi,
When I binarize the wiki-news-300d-1M.vec file, the program outputs
*** Error in `./binarize': free(): invalid pointer: 0x00007f64c490e010 ***
From the backtrace, I know something goes wrong when freeing memory, but I can't find the exact position.
I am surprised that the program works normally when I compress the glove.6B.300d.txt file.
First of all, thank you for the great paper; I really like the simplicity of the model! Currently I am trying to reproduce the semantic similarity scores, but somehow this doesn't seem to work. What I did:
Use the Wiki 1M 300 vec embeddings from https://fasttext.cc/docs/en/english-vectors.html
Train using the parameters specified in the paper for fastText (10 epochs, all other values default)
Run your semantic evaluation script:
create_vocab(): 0.008999s
load_vectors(): 0.926019s
Filename | Spearman | OOV
==============================
WS353.txt | 0.559 | 0%
MEN.txt | 0.622 | 0%
RW.txt | 0.421 | 3%
SimVerb.txt | 0.238 | 0%
SimLex.txt | 0.334 | 0%
evaluate(): 0.008135s
Maybe I am missing something, but the scores for WS353, RW, SimVerb and SimLex are quite a bit lower than reported. Am I using the correct fastText embeddings? I did not perform any normalization or standardization.
Thank you in advance!
I am having trouble applying make
to build the binaries.
In the first step I installed all requirements and used git clone
to clone the repository to my local directory.
Running cd near-lossless-binarization && make
produces this error message:
gcc binarize.c -o binarize -ansi -pedantic -Wall -Wextra -Wno-unused-result -Ofast -funroll-loops -lblas -lm
/usr/bin/ld: /tmp/ccdDjYDi.o: in function `apply_regularizarion_gradient':
binarize.c:(.text+0xddc): undefined reference to `cblas_sgemm'
/usr/bin/ld: binarize.c:(.text+0xf9e): undefined reference to `cblas_sgemm'
/usr/bin/ld: /tmp/ccdDjYDi.o: in function `apply_reconstruction_gradient':
binarize.c:(.text+0x1059): undefined reference to `cblas_sgemm'
/usr/bin/ld: binarize.c:(.text+0x12fb): undefined reference to `cblas_sgemm'
/usr/bin/ld: binarize.c:(.text+0x1800): undefined reference to `cblas_sgemm'
/usr/bin/ld: /tmp/ccdDjYDi.o:binarize.c:(.text+0x1e71): more undefined references to `cblas_sgemm' follow
The exit status is 1.
I want to use your repo in my bachelor thesis. Any advice on how to fix this?
The requirements are fulfilled as far as I know, so the mistake is probably on my side.
Hey Julien Tissier, Christophe Gravier and Amaury Habrard,
Your paper looks really interesting and I wanted to test the binary vectors. However, when I run the command ./binarize -input ../crawl-300d-2M.vec
on the Facebook vectors, I get the following result on an Ubuntu 16.04 server with openblas-dev installed:
1999995 256
, 12854058564283852332 14495264440446430917 2338217189150912797 3900823073646626759
the 2543537490569830019 7091594523546908885 12132390578482067676 17054061891972847710
. 49085064044372029 13430128919309794722 3676268525978997720 1129446822182950747
and 16350715358351243055 16393012712704504925 969627095091772476 2240764738555920805
to 9099634481703745995 17642094685701064884 6452519320616550255 16120898980517001725
of 236520170687141399 4767709100272182524 12135506450154551421 17634123495325585581
a 11194259637020728784 4228149066443023844 7425191957517212067 12735785767725396608
....
Furthermore, when I then run the command ./similarity_binary binary_vectors.vec
on the result, it fails with the following error:
create_vocab(): 0.012973s
*** stack smashing detected ***: ./similarity_binary terminated
Aborted (core dumped)
I've tried to debug the problem, but have not been able to find the source. Am I doing something wrong?
When I execute this code with small embeddings (e.g., GloVe Twitter 25d), I'd like to use a representation smaller than 64 bits, but it doesn't work: the output file doesn't support anything smaller than that size.
Is there a way to obtain smaller binary embeddings with this method?
Thank you,
Emanuele
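The 64-bit floor is consistent with the output format shown earlier on this page, where each word's binary code is printed as one unsigned 64-bit integer per 64 bits (four integers for 256-bit codes). A Python sketch of that packing, assuming bits are stored in 64-bit machine words (an assumption based on the printed output, not a reading of binarize.c):

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 bits into unsigned 64-bit words.
    A 25-bit code still occupies one full 64-bit word: the
    remaining 39 bits are zero padding."""
    words = []
    for start in range(0, len(bits), 64):
        chunk = bits[start:start + 64]
        word = 0
        for bit in chunk:
            word = (word << 1) | bit
        word <<= 64 - len(chunk)  # zero-pad a partial final chunk
        words.append(word)
    return words

code = [1, 0, 1] + [0] * 22          # a hypothetical 25-bit code
print(len(pack_bits(code)))          # 1 -> still one 64-bit block
print(len(pack_bits([0] * 256)))     # 4 -> four integers per word, as above
```

Under this storage scheme, codes shorter than 64 bits save nothing on disk, which would explain why the tool does not offer them.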