
                 Near-lossless Binarization of Word Embeddings
                 =============================================

PREAMBLE

	This work is one of the contributions of my PhD  thesis,  entitled
	"Improving methods to learn word representations for efficient  semantic
	similarities computations", in which I  propose  new  methods  to  learn
	better word embeddings. My thesis is freely available at
	https://github.com/tca19/phd-thesis.

ABOUT

	This repository contains source code to  binarize  any  real-value  word
	embeddings into binary vectors.  It also contains  scripts  to  evaluate
	the performance of the binary vectors  on  semantic  similarity  tasks
	and top-k queries.  The related paper can be found at
	https://aaai.org/ojs/index.php/AAAI/article/view/4692/4570.

	If you use this repository, please cite:

	@inproceedings{tissier2019near,
	  author    = {Tissier, Julien and Gravier, Christophe and Habrard, Amaury},
	  title     = {Near-Lossless Binarization of Word Embeddings},
	  booktitle = {Proceedings of the Thirty-Third {AAAI} Conference on
	               Artificial Intelligence, Honolulu, Hawaii, USA,
	               January 27 - February 1, 2019.},
	  volume    = {33},
	  pages     = {7104--7111},
	  year      = {2019},
	  url       = {https://aaai.org/ojs/index.php/AAAI/article/view/4692},
	  doi       = {10.1609/aaai.v33i01.33017104}
	}

INSTALLATION

	To compile the source files of this repository, you need to have on your
	system:
	  - OpenBLAS [1]
	  - a C compiler (gcc, clang ...)
	  - make

	Then run the command `make` to build the different  binary  executables.

	[1] https://github.com/xianyi/OpenBLAS/wiki/Precompiled-installation-packages

USAGE

	1. Binarize word vectors
	------------------------
	Run the executable `binarize` to transform  real-value  embeddings  into
	binary vectors.  The only mandatory command line argument  is  `-input`,
	the filename containing the real-value vectors.

	./binarize -input vectors.vec

	Documentation for all the other flags can be found with  `./binarize -h`
	or `./binarize --help`.

	Binary vectors are saved by default into the file  `binary_vectors.vec`.
	The first line of this file indicates the number of binary word  vectors
	and the number of bits in each vector.  Each  following  line  is
	formatted as:

	WORD INTEGER_1 INTEGER_2 [...]

	Binary vectors are not saved as strings of zeros (0) and ones (1) but as
	groups of unsigned long integers.  Each integer represents 64  bits,  so
	a 256-bit binary vector is stored as 4 integers (4 * 64  =  256).   The
	binary vector of a word is the concatenation of the binary
	representations of all the integers on the rest of its line.
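	The packed format above can be decoded with a short sketch  (Python  is
	used here only for illustration; the bit order within each  integer  is
	an assumption, and the sample line is hypothetical, not taken  from  a
	real output file):

```python
def unpack_bits(integers, bits_per_int=64):
    """Concatenate the binary representations of unsigned integers
    into one bit string (most significant bit first, assumed)."""
    return "".join(format(n, "0%db" % bits_per_int) for n in integers)

# Hypothetical line from binary_vectors.vec: "WORD INT_1 INT_2 ..."
line = "king 13 255 0 1"
fields = line.split()
word, integers = fields[0], [int(x) for x in fields[1:]]
bits = unpack_bits(integers)
print(word, len(bits))  # a 4-integer line yields a 256-bit vector
```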

	2. Evaluate semantic similarity
	-------------------------------
	Run  the  executable  `similarity_binary`  to  evaluate   the   semantic
	similarity  correlation  scores  of   the   produced   binary   vectors.

	./similarity_binary binary_vectors.vec

	This repository includes some semantic similarity datasets:
	  - MEN
	  - Rare Word (RW)
	  - SimVerb 3500 (SimVerb)
	  - SimLex 999 (SimLex)
	  - WordSim 353 (WS353)
	To evaluate on other semantic similarity datasets, simply add them  into
	the datasets/ folder and run the `./similarity_binary` executable again.
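	The reported scores are Spearman rank correlations between  the  human
	similarity judgements and the similarities computed  from  the  binary
	vectors.  A minimal sketch of the correlation itself (assuming no tied
	values, which a real evaluation would need to handle):

```python
def ranks(values):
    """1-based rank of each value (assumes all values are distinct)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation using the no-ties formula."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Perfectly concordant rankings give 1.0; reversed ones give -1.0
print(spearman([0.1, 0.5, 0.9], [1.0, 2.0, 3.0]))  # 1.0
```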

	3. Top-K queries
	----------------
	Run the executable `topk_binary` to  compute  the  K  closest  neighbors
	words   and   their   respective   similarity   to   a    QUERY    word.

	./topk_binary binary_vectors.vec K QUERY

	The script will report the closest words and their similarity,  as  well
	as the time needed to compute the K closest neighbors.  You can also run
	multiple top-k queries at once: simply replace the QUERY word  with  a
	list of space-separated words, like:

	./topk_binary binary_vectors.vec 10 queen automobile man moon computer
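	Internally, such a query amounts to ranking words by Hamming  distance
	between bit codes, which an XOR plus a popcount makes  fast.   A  toy
	sketch (the 8-bit codes and words below are made up, not real output):

```python
def hamming(a, b):
    """Number of differing bits between two integer bit codes."""
    return bin(a ^ b).count("1")

def topk(query, codes, k):
    """The k words whose codes are closest to the query word's code."""
    q = codes[query]
    neighbors = [(w, hamming(q, c)) for w, c in codes.items() if w != query]
    return sorted(neighbors, key=lambda wd: wd[1])[:k]

codes = {"queen": 0b10110101, "king": 0b10110100,
         "car": 0b01001010, "moon": 0b11110000}
print(topk("queen", codes, 2))  # [('king', 1), ('moon', 3)]
```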

AUTHOR

	Written  by  Julien  Tissier  <[email protected]>.

COPYRIGHT

	This software is licensed under the GNU GPLv3 license.  See the  LICENSE
	file for more details.

near-lossless-binarization's Issues

Sokal Michener definition

Hi, is there a reason why your Sokal-Michener function differs from other sources online? For example, SciPy's sokalmichener returns a different value from your function.

import numpy as np

def sokalmichener(u, v):
    ntt = np.dot(u, v.T)              # both bits set
    ntf = np.dot(u, 1 - v.T)          # set in u only
    nff = np.dot(1.0 - u, 1.0 - v.T)  # both bits unset
    nft = np.dot(1.0 - u, v.T)        # set in v only
    print(ntt + ntf + nff + nft)
    return (ntt + nff) / (ntt + ntf + nff + nft)

u = np.array([1, 1, 1, 1, 1, 0, 0, 1])
v = np.array([1, 0, 0, 1, 1, 0, 0, 1])
sokalmichener(u, v)

Returns

8.0
0.75

Scipy's version:

distance.sokalmichener(u, v)
0.4

Is the Sokal Michener dissimilarity function a different thing?
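For what it's worth, the two definitions can be compared in plain Python on the vectors above: the function in this issue computes the simple matching (similarity) coefficient, while SciPy's sokalmichener is a dissimilarity that counts each disagreement twice (formulas reproduced here for illustration; scipy itself is not required):

```python
u = [1, 1, 1, 1, 1, 0, 0, 1]
v = [1, 0, 0, 1, 1, 0, 0, 1]

ntt = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)  # both set
nff = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)  # both unset
mismatches = len(u) - ntt - nff                          # ntf + nft

# Simple matching coefficient (the similarity computed in this issue)
smc = (ntt + nff) / len(u)

# SciPy-style Sokal-Michener dissimilarity: R / (S + R), R = 2 * mismatches
r = 2.0 * mismatches
dissimilarity = r / (ntt + nff + r)

print(smc, dissimilarity)  # 0.75 0.4
```

Since 1 - 0.75 = 0.25, not 0.4, the two are genuinely different coefficients, not just a similarity/dissimilarity flip.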

Small Embeddings

When I run this code on small embeddings (e.g., GloVe Twitter 25d), I'd like to use a representation smaller than 64 bits, but it doesn't work: the output file format doesn't support anything smaller than that size.

Is there a way to obtain smaller, binary embeddings with this method?

Thank you,
Emanuele

Could you train on Glove.42B.300d.txt

Hi,

I find your work interesting and I was testing it on glove.6B.300d.txt (after converting it to the w2v format). The program works fine and the output vectors look normal.

However, when I run it on glove.42B.300d.txt, the program outputs zeros for all words. I am guessing it might be a convergence issue. Could you test and help me figure out the problem?

Thanks!

Invalid pointer when binarize wiki-news-300d-1M.vec

Hi,
When I binarize the wiki-news-300d-1M.vec file, the program outputs
*** Error in `./binarize': free(): invalid pointer: 0x00007f64c490e010 ***
From the backtrace, I know something goes wrong when freeing memory, but I can't find the exact position.
I am surprised that the program works normally when I compress the GloVe.6B.300d.txt file.

Problem when testing your solution

Hey Julien Tissier, Christophe Gravier and Amaury Habrard

Your paper looks really interesting and I wanted to test the binary vectors. However, when I run the command ./binarize -input ../crawl-300d-2M.vec on the Facebook vectors, I get the following output on an Ubuntu 16.04 server with openblas-dev installed:

1999995 256

, 12854058564283852332 14495264440446430917 2338217189150912797 3900823073646626759

the 2543537490569830019 7091594523546908885 12132390578482067676 17054061891972847710

. 49085064044372029 13430128919309794722 3676268525978997720 1129446822182950747

and 16350715358351243055 16393012712704504925 969627095091772476 2240764738555920805

to 9099634481703745995 17642094685701064884 6452519320616550255 16120898980517001725

of 236520170687141399 4767709100272182524 12135506450154551421 17634123495325585581

a 11194259637020728784 4228149066443023844 7425191957517212067 12735785767725396608

....

Furthermore, when I then run the command ./similarity_binary binary_vectors.vec on the result, it fails with the following error:

create_vocab(): 0.012973s
*** stack smashing detected ***: ./similarity_binary terminated
Aborted (core dumped)

I've tried to debug the problem, but have not been able to find the source. Am I doing something wrong?

Potential error in read_word function

Hi,
I find your work interesting and I was reading your source code in the binarize.c file. In the process, I found a potential error in the read_word function.

  1. int getc_unlocked(FILE *);
    UTF-8 embedding files are now mainstream, and restricting MAXWORDLEN to 256 is too small when reading byte by byte. I tried some embedding files, such as a fastText skip-gram model, and got a segmentation fault. At first I was confused because the maximum word length in my embedding file is 228 characters; once I considered its UTF-8 encoding, I understood. I changed MAXWORDLEN to 1024 and successfully got the result. During the process, I referenced liaocs2008's issue.
    If we do not check the maximum word length in the vector file, we can get duplicated words because they are truncated at MAXWORDLEN.

  2. /* skip white spaces (space or line feed (ascii code 0x0a is \n)) */ while (isspace((tmp[i] = getc_unlocked(fp))))
    When I binarized a fastText skip-gram model with this repository, I got an error simply because the file contains whitespace "words" such as \t and space, which happens especially with pre-trained vector files. The current function just skips such a word and produces the same empty word.
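The UTF-8 point above is easy to demonstrate: a word's byte length can exceed its character count, so a byte-sized buffer limit such as MAXWORDLEN truncates sooner than the character count suggests (the token below is made up):

```python
word = "ré" * 100               # 200 characters
encoded = word.encode("utf-8")  # 'é' takes 2 bytes in UTF-8
print(len(word), len(encoded))  # 200 300
```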

undefined reference to cblas_sgemm

Having trouble running make to build the binaries.
First I installed all the requirements and used git clone to clone the repository to my local directory.

Running cd near-lossless-binarization && make produces this error message:

gcc binarize.c -o binarize -ansi -pedantic -Wall -Wextra -Wno-unused-result -Ofast -funroll-loops -lblas -lm
/usr/bin/ld: /tmp/ccdDjYDi.o: in function `apply_regularizarion_gradient':
binarize.c:(.text+0xddc): undefined reference to `cblas_sgemm'
/usr/bin/ld: binarize.c:(.text+0xf9e): undefined reference to `cblas_sgemm'
/usr/bin/ld: /tmp/ccdDjYDi.o: in function `apply_reconstruction_gradient':
binarize.c:(.text+0x1059): undefined reference to `cblas_sgemm'
/usr/bin/ld: binarize.c:(.text+0x12fb): undefined reference to `cblas_sgemm'
/usr/bin/ld: binarize.c:(.text+0x1800): undefined reference to `cblas_sgemm'
/usr/bin/ld: /tmp/ccdDjYDi.o:binarize.c:(.text+0x1e71): more undefined references to `cblas_sgemm' follow


exit status is 1.

I want to use your repo in my bachelor thesis. Any advice on how to fix this?
As far as I know, the requirements are fulfilled.

Probably there is a mistake on my side.

Cannot reproduce results using FastText embeddings

First of all, thank you for the great paper; I really like the simplicity of the model! I am currently trying to reproduce the semantic similarity scores, but somehow it doesn't seem to work. What I did:

Use the Wiki 1M 300 vec embeddings from https://fasttext.cc/docs/en/english-vectors.html

Train using the parameters specified in the paper for FastText (10 epochs, all other values default)

Run your semantic similarity evaluation:

create_vocab(): 0.008999s
load_vectors(): 0.926019s
Filename     | Spearman | OOV
==============================
WS353.txt    |    0.559 |   0%
MEN.txt      |    0.622 |   0%
RW.txt       |    0.421 |   3%
SimVerb.txt  |    0.238 |   0%
SimLex.txt   |    0.334 |   0%
evaluate(): 0.008135s

Maybe I am missing something, but the scores for WS353, RW, SimVerb and SimLex are quite a bit lower. Am I using the correct FastText embeddings? I did not perform any normalization or standardization.

Thank you in advance!

undefined reference to `cblas_sgemm' error when make

OS : ubuntu 16.04.6
gcc version : 5.4.0-6ubuntu1~16.04.11

First, I installed the OpenBLAS.

sudo apt install libopenblas-dev
sudo update-alternatives --config libblas.so.3

Then I ran make and got the following error.

gcc -ansi -pedantic -lm -pthread -Ofast -funroll-loops -lblas -Wall -Wextra -Wno-unused-result binarize.c -o binarize
/tmp/cc9DvK9o.o: In function `apply_regularizarion_gradient':
binarize.c:(.text+0xea8): undefined reference to `cblas_sgemm'
binarize.c:(.text+0x1087): undefined reference to `cblas_sgemm'
/tmp/cc9DvK9o.o: In function `apply_reconstruction_gradient':
binarize.c:(.text+0x1117): undefined reference to `cblas_sgemm'
binarize.c:(.text+0x149b): undefined reference to `cblas_sgemm'
binarize.c:(.text+0x1c54): undefined reference to `cblas_sgemm'
/tmp/cc9DvK9o.o:binarize.c:(.text+0x22b7): more undefined references to `cblas_sgemm' follow
collect2: error: ld returned 1 exit status
makefile:28: recipe for target 'binarize' failed
make: *** [binarize] Error 1

How can I deal with this problem? Any help is appreciated.
