
                 Near-lossless Binarization of Word Embeddings
                 =============================================

PREAMBLE

	This work is one of the contributions of my PhD  thesis,  entitled
	"Improving methods to learn word representations for efficient  semantic
	similarities computations", in which I  propose  new  methods  to  learn
	better word embeddings. My thesis is freely available at
	https://github.com/tca19/phd-thesis.

ABOUT

	This repository contains source code to  binarize  any  real-value  word
	embeddings into binary vectors.  It also contains  scripts  to  evaluate
	the performance of the binary vectors  on  semantic  similarity  tasks
	and top-k queries.  The related paper can be found at
	https://aaai.org/ojs/index.php/AAAI/article/view/4692/4570.

	If you use this repository, please cite:

	@inproceedings{tissier2019near,
	  author    = {Tissier, Julien and Gravier, Christophe and Habrard, Amaury},
	  title     = {Near-Lossless Binarization of Word Embeddings},
	  booktitle = {Proceedings of the Thirty-Third {AAAI} Conference on
	               Artificial Intelligence, Honolulu, Hawaii, USA,
	               January 27 - February 1, 2019.},
	  volume    = {33},
	  pages     = {7104--7111},
	  year      = {2019},
	  url       = {https://aaai.org/ojs/index.php/AAAI/article/view/4692},
	  doi       = {10.1609/aaai.v33i01.33017104}
	}

INSTALLATION

	To compile the source files of this repository, you need to have on your
	system:
	  - OpenBLAS [1]
	  - a C compiler (gcc, clang ...)
	  - make

	Then run the command `make` to build the different  binary  executables.

	[1] https://github.com/xianyi/OpenBLAS/wiki/Precompiled-installation-packages

USAGE

	1. Binarize word vectors
	------------------------
	Run the executable `binarize` to transform  real-value  embeddings  into
	binary vectors.  The only mandatory command line argument  is  `-input`,
	the filename containing the real-value vectors.

	./binarize -input vectors.vec

	Documentation for all the other flags can be found with  `./binarize -h`
	or `./binarize --help`.

	Binary vectors are saved by default into the file  `binary_vectors.vec`.
	The first line of this file indicates the number of binary word  vectors
	and the number of bits in each vector.  Each  following  line  is
	formatted as:

	WORD INTEGER_1 INTEGER_2 [...]

	Binary vectors are not saved as strings of zeros (0) and ones (1) but as
	groups of unsigned long integers.  Each integer represents 64  bits,  so
	a 256-bit binary vector is stored as 4 integers (4 * 64  =  256).   The
	binary vector of a word is the concatenation of the binary
	representations of all the integers on the rest of its line.
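	The packed format above can be decoded with a short sketch  (Python  is
	used here only for illustration; the bit order within each  integer  is
	an assumption, and the sample line is hypothetical, not taken  from  a
	real output file):

```python
def unpack_bits(integers, bits_per_int=64):
    """Concatenate the binary representations of unsigned integers
    into one bit string (most significant bit first, assumed)."""
    return "".join(format(n, "0%db" % bits_per_int) for n in integers)

# Hypothetical line from binary_vectors.vec: "WORD INT_1 INT_2 ..."
line = "king 13 255 0 1"
fields = line.split()
word, integers = fields[0], [int(x) for x in fields[1:]]
bits = unpack_bits(integers)
print(word, len(bits))  # a 4-integer line yields a 256-bit vector
```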

	2. Evaluate semantic similarity
	-------------------------------
	Run  the  executable  `similarity_binary`  to  evaluate   the   semantic
	similarity  correlation  scores  of   the   produced   binary   vectors.

	./similarity_binary binary_vectors.vec

	This repository includes some semantic similarity datasets:
	  - MEN
	  - Rare Word (RW)
	  - SimVerb 3500 (SimVerb)
	  - SimLex 999 (SimLex)
	  - WordSim 353 (WS353)
	To evaluate on other semantic similarity datasets, simply add them  into
	the datasets/ folder and run the `./similarity_binary` executable again.
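	The reported scores are Spearman rank correlations between  the  human
	similarity judgements and the similarities computed  from  the  binary
	vectors.  A minimal sketch of the correlation itself (assuming no tied
	values, which a real evaluation would need to handle):

```python
def ranks(values):
    """1-based rank of each value (assumes all values are distinct)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation using the no-ties formula."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Perfectly concordant rankings give 1.0; reversed ones give -1.0
print(spearman([0.1, 0.5, 0.9], [1.0, 2.0, 3.0]))  # 1.0
```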

	3. Top-K queries
	----------------
	Run the executable `topk_binary` to  compute  the  K  closest  neighbors
	words   and   their   respective   similarity   to   a    QUERY    word.

	./topk_binary binary_vectors.vec K QUERY

	The script will report the closest words and their similarity,  as  well
	as the time needed to compute the K closest neighbors.  You can also run
	multiple top-k queries at once: simply replace the QUERY word  with  a
	list of space-separated words, like:

	./topk_binary binary_vectors.vec 10 queen automobile man moon computer
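	Internally, such a query amounts to ranking words by Hamming  distance
	between bit codes, which an XOR plus a popcount makes  fast.   A  toy
	sketch (the 8-bit codes and words below are made up, not real output):

```python
def hamming(a, b):
    """Number of differing bits between two integer bit codes."""
    return bin(a ^ b).count("1")

def topk(query, codes, k):
    """The k words whose codes are closest to the query word's code."""
    q = codes[query]
    neighbors = [(w, hamming(q, c)) for w, c in codes.items() if w != query]
    return sorted(neighbors, key=lambda wd: wd[1])[:k]

codes = {"queen": 0b10110101, "king": 0b10110100,
         "car": 0b01001010, "moon": 0b11110000}
print(topk("queen", codes, 2))  # [('king', 1), ('moon', 3)]
```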

AUTHOR

	Written  by  Julien  Tissier  <[email protected]>.

COPYRIGHT

	This software is licensed under the GNU GPLv3 license.  See the  LICENSE
	file for more details.

near-lossless-binarization's Issues

Sokal Michener definition

Hi, is there a reason why your Sokal-Michener function differs from other sources online? For example, SciPy's sokalmichener returns a different value from your function.

import numpy as np

def sokalmichener(u, v):
    ntt = np.dot(u, v.T)              # both bits set
    ntf = np.dot(u, 1 - v.T)          # set in u only
    nff = np.dot(1.0 - u, 1.0 - v.T)  # both bits unset
    nft = np.dot(1.0 - u, v.T)        # set in v only
    print(ntt + ntf + nff + nft)
    return (ntt + nff) / (ntt + ntf + nff + nft)

u = np.array([1, 1, 1, 1, 1, 0, 0, 1])
v = np.array([1, 0, 0, 1, 1, 0, 0, 1])
sokalmichener(u, v)

Returns

8.0
0.75

Scipy's version:

distance.sokalmichener(u, v)
0.4

Is the Sokal Michener dissimilarity function a different thing?
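For what it's worth, the two definitions can be compared in plain Python on the vectors above: the function in this issue computes the simple matching (similarity) coefficient, while SciPy's sokalmichener is a dissimilarity that counts each disagreement twice (formulas reproduced here for illustration; scipy itself is not required):

```python
u = [1, 1, 1, 1, 1, 0, 0, 1]
v = [1, 0, 0, 1, 1, 0, 0, 1]

ntt = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)  # both set
nff = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)  # both unset
mismatches = len(u) - ntt - nff                          # ntf + nft

# Simple matching coefficient (the similarity computed in this issue)
smc = (ntt + nff) / len(u)

# SciPy-style Sokal-Michener dissimilarity: R / (S + R), R = 2 * mismatches
r = 2.0 * mismatches
dissimilarity = r / (ntt + nff + r)

print(smc, dissimilarity)  # 0.75 0.4
```

Since 1 - 0.75 = 0.25, not 0.4, the two are genuinely different coefficients, not just a similarity/dissimilarity flip.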

Small Embeddings

When I run this code on small embeddings (e.g., GloVe Twitter 25d), I'd like to use a representation smaller than 64 bits, but it doesn't work: the output file format doesn't support anything smaller than that size.

Is there a way to obtain smaller, binary embeddings with this method?

Thank you,
Emanuele

Could you train on Glove.42B.300d.txt

Hi,

I find your work interesting and I was testing it on glove.6B.300d.txt (after converting it to the w2v format). The program works fine and the output vectors look normal.

However, when I run it on glove.42B.300d.txt, the program outputs zeros for all words. I am guessing it might be a convergence issue. Could you test and help me figure out the problem?

Thanks!

Invalid pointer when binarize wiki-news-300d-1M.vec

Hi,
When I binarize the wiki-news-300d-1M.vec file, the program outputs
*** Error in `./binarize': free(): invalid pointer: 0x00007f64c490e010 ***
From the backtrace, I know something goes wrong when freeing memory, but I can't find the exact position.
I am surprised that the program works normally when I compress the GloVe.6B.300d.txt file.

Problem when testing your solution

Hey Julien Tissier, Christophe Gravier and Amaury Habrard

Your paper looks really interesting and I wanted to test the binary vectors. However, when I run the command ./binarize -input ../crawl-300d-2M.vec on the Facebook vectors, I get the following output on an Ubuntu 16.04 server with openblas-dev installed:

1999995 256

, 12854058564283852332 14495264440446430917 2338217189150912797 3900823073646626759

the 2543537490569830019 7091594523546908885 12132390578482067676 17054061891972847710

. 49085064044372029 13430128919309794722 3676268525978997720 1129446822182950747

and 16350715358351243055 16393012712704504925 969627095091772476 2240764738555920805

to 9099634481703745995 17642094685701064884 6452519320616550255 16120898980517001725

of 236520170687141399 4767709100272182524 12135506450154551421 17634123495325585581

a 11194259637020728784 4228149066443023844 7425191957517212067 12735785767725396608

....

Furthermore, when I then run the command ./similarity_binary binary_vectors.vec on the result, it fails with the following error:

create_vocab(): 0.012973s
*** stack smashing detected ***: ./similarity_binary terminated
Aborted (core dumped)

I've tried to debug the problem, but have not been able to find the source. Am I doing something wrong?

Potential error in read_word function

Hi,
I find your work interesting and I was reading your source code in the binarize.c file. In the process, I found a potential error in the read_word function.

  1. int getc_unlocked(FILE *);
    UTF-8 embedding files are now mainstream, and restricting MAXWORDLEN to 256 is too small when reading byte by byte. I tried some embedding files, such as a fastText skip-gram model, and got a segmentation fault. At first I was confused because the maximum word length in my embedding file is 228 characters; once I considered its UTF-8 encoding, I understood. I changed MAXWORDLEN to 1024 and successfully got the result. During the process, I referenced liaocs2008's issue.
    If we do not check the maximum word length in the vector file, we can get duplicated words because they are truncated at MAXWORDLEN.

  2. /* skip white spaces (space or line feed (ascii code 0x0a is \n)) */ while (isspace((tmp[i] = getc_unlocked(fp))))
    When I binarized a fastText skip-gram model with this repository, I got an error simply because the file contains whitespace "words" such as \t and space, which happens especially with pre-trained vector files. The current function just skips such a word and produces the same empty word.
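The UTF-8 point above is easy to demonstrate: a word's byte length can exceed its character count, so a byte-sized buffer limit such as MAXWORDLEN truncates sooner than the character count suggests (the token below is made up):

```python
word = "ré" * 100               # 200 characters
encoded = word.encode("utf-8")  # 'é' takes 2 bytes in UTF-8
print(len(word), len(encoded))  # 200 300
```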

undefined reference to cblas_sgemm

Having trouble running make to build the binaries.
First I installed all the requirements and used git clone to clone the repository to my local directory.

Running cd near-lossless-binarization && make produces this error message:

gcc binarize.c -o binarize -ansi -pedantic -Wall -Wextra -Wno-unused-result -Ofast -funroll-loops -lblas -lm
/usr/bin/ld: /tmp/ccdDjYDi.o: in function `apply_regularizarion_gradient':
binarize.c:(.text+0xddc): undefined reference to `cblas_sgemm'
/usr/bin/ld: binarize.c:(.text+0xf9e): undefined reference to `cblas_sgemm'
/usr/bin/ld: /tmp/ccdDjYDi.o: in function `apply_reconstruction_gradient':
binarize.c:(.text+0x1059): undefined reference to `cblas_sgemm'
/usr/bin/ld: binarize.c:(.text+0x12fb): undefined reference to `cblas_sgemm'
/usr/bin/ld: binarize.c:(.text+0x1800): undefined reference to `cblas_sgemm'
/usr/bin/ld: /tmp/ccdDjYDi.o:binarize.c:(.text+0x1e71): more undefined references to `cblas_sgemm' follow


exit status is 1.

I want to use your repo in my bachelor thesis. Any advice on how to fix this?
As far as I know, the requirements are fulfilled.

Probably there is a mistake on my side.

Cannot reproduce results using FastText embeddings

First of all, thank you for the great paper; I really like the simplicity of the model! I am currently trying to reproduce the semantic similarity scores, but somehow it doesn't seem to work. What I did:

Use the Wiki 1M 300 vec embeddings from https://fasttext.cc/docs/en/english-vectors.html

Train using the parameters specified in the paper for FastText (10 epochs, all other values default)

Run your semantic similarity evaluation:

create_vocab(): 0.008999s
load_vectors(): 0.926019s
Filename     | Spearman | OOV
==============================
WS353.txt    |    0.559 |   0%
MEN.txt      |    0.622 |   0%
RW.txt       |    0.421 |   3%
SimVerb.txt  |    0.238 |   0%
SimLex.txt   |    0.334 |   0%
evaluate(): 0.008135s

Maybe I am missing something, but the scores for WS353, RW, SimVerb and SimLex are quite a bit lower. Am I using the correct FastText embeddings? I did not perform any normalization or standardization.

Thank you in advance!

undefined reference to `cblas_sgemm' error when make

OS : ubuntu 16.04.6
gcc version : 5.4.0-6ubuntu1~16.04.11

First, I installed the OpenBLAS.

sudo apt install libopenblas-dev
sudo update-alternatives --config libblas.so.3

Then I ran make and got the following error.

gcc -ansi -pedantic -lm -pthread -Ofast -funroll-loops -lblas -Wall -Wextra -Wno-unused-result binarize.c -o binarize
/tmp/cc9DvK9o.o: In function `apply_regularizarion_gradient':
binarize.c:(.text+0xea8): undefined reference to `cblas_sgemm'
binarize.c:(.text+0x1087): undefined reference to `cblas_sgemm'
/tmp/cc9DvK9o.o: In function `apply_reconstruction_gradient':
binarize.c:(.text+0x1117): undefined reference to `cblas_sgemm'
binarize.c:(.text+0x149b): undefined reference to `cblas_sgemm'
binarize.c:(.text+0x1c54): undefined reference to `cblas_sgemm'
/tmp/cc9DvK9o.o:binarize.c:(.text+0x22b7): more undefined references to `cblas_sgemm' follow
collect2: error: ld returned 1 exit status
makefile:28: recipe for target 'binarize' failed
make: *** [binarize] Error 1

How can I deal with this problem? Any help is appreciated.
