This project forked from tca19/near-lossless-binarization

                 Near-lossless Binarization of Word Embeddings
                 =============================================

PREAMBLE

	This work is  one  of  the  contributions  of  my  PhD  thesis  entitled
	"Improving methods to learn word representations for efficient  semantic
	similarities computations" in which  I  propose  new  methods  to  learn
	better word embeddings.  The full thesis is freely available to read  at
	https://github.com/tca19/phd-thesis.

ABOUT

	This repository contains source code to binarize  any  real-valued  word
	embeddings into binary  vectors.   It  also  contains  some  scripts  to
	evaluate the performance of the binary vectors  on  semantic  similarity
	tasks  and  top-k  queries.   The  related  paper   can   be  found   at
	https://aaai.org/ojs/index.php/AAAI/article/view/4692/4570.

	If you use this repository, please cite:

	@inproceedings{tissier2019near,
	  author    = {Tissier, Julien and Gravier, Christophe and Habrard, Amaury},
	  title     = {Near-Lossless Binarization of Word Embeddings},
	  booktitle = {Proceedings of the Thirty-Third {AAAI} Conference on
	               Artificial Intelligence, Honolulu, Hawaii, USA,
	               January 27 - February 1, 2019.},
	  volume    = {33},
	  pages     = {7104--7111},
	  year      = {2019},
	  url       = {https://aaai.org/ojs/index.php/AAAI/article/view/4692},
	  doi       = {10.1609/aaai.v33i01.33017104}
	}

INSTALLATION

	To compile the source files of this repository, you need to have on your
	system:
	  - OpenBLAS [1]
	  - a C compiler (gcc, clang ...)
	  - make

	Then run the command `make` to build the different  binary  executables.

	[1] https://github.com/xianyi/OpenBLAS/wiki/Precompiled-installation-packages

USAGE

	1. Binarize word vectors
	------------------------
	Run the executable `binarize` to transform real-valued  embeddings  into
	binary vectors.  The only mandatory command line argument  is  `-input`,
	the name of the file containing the real-valued vectors.

	./binarize -input vectors.vec

	The documentation of all the other flags  can  be  displayed  by running
	`./binarize -h` or `./binarize --help`.

	Binary vectors are saved by default into the file  `binary_vectors.vec`.
	The first line of this file indicates the number of binary word  vectors
	and the number of bits in each vector.  Each following line is formatted
	like:

	WORD INTEGER_1 INTEGER_2 [...]

	Binary vectors are not saved as strings of zeros (0) and ones (1) but as
	groups of unsigned long integers. Each integer represents 64 bits so for
	a binary vector of 256 bits, there are 4 integers (4 * 64 =  256).   The
	binary  vector  of  a  word  is  the   concatenation   of   the   binary
	representations  of  all  the  integers  on  the  rest  of   its   line.
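
	The layout described above can be decoded with a few lines of C.  The
	sketch below is not taken from this repository: the function names
	`bits_of` and `decode_line` are hypothetical, and it assumes that
	each integer is written in decimal and that its most significant bit
	comes first in the binary vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Write the 64 bits of `value` into `out` as '0'/'1' characters, most
 * significant bit first (an assumption about the bit order). `out` must
 * have room for 64 characters; no terminating NUL is written. */
static void bits_of(uint64_t value, char *out)
{
	assert(out != NULL);
	for (int i = 0; i < 64; ++i)
		out[i] = ((value >> (63 - i)) & 1u) ? '1' : '0';
}

/* Parse one "WORD INTEGER_1 INTEGER_2 [...]" line. Copies the word into
 * `word` and the expanded bit string into `bits` (NUL-terminated, so it
 * needs n_bits + 1 characters). Returns the number of bits decoded, or
 * -1 if the line is malformed or n_bits is not a multiple of 64. */
static int decode_line(const char *line, int n_bits,
                       char *word, size_t word_size, char *bits)
{
	const char *p = line;
	size_t len = strcspn(p, " \t\n");

	if (len == 0 || len >= word_size || n_bits <= 0 || n_bits % 64 != 0)
		return -1;
	memcpy(word, p, len);
	word[len] = '\0';
	p += len;

	for (int i = 0; i < n_bits / 64; ++i) {
		char *end;
		uint64_t value = strtoull(p, &end, 10);

		if (end == p)   /* no integer left on the line */
			return -1;
		bits_of(value, bits + 64 * i);
		p = end;
	}
	bits[n_bits] = '\0';
	return n_bits;
}
```

	For example, decoding the line `queen 1 9223372036854775808` with
	n_bits = 128 gives a string of 128 characters in which only the
	characters at positions 63 and 64 (counting from 0) are '1'.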

	2. Evaluate semantic similarity
	-------------------------------
	Run  the  executable  `similarity_binary`  to  evaluate   the   semantic
	similarity  correlation  scores  of   the   produced   binary   vectors.

	./similarity_binary binary_vectors.vec

	This repository includes some semantic similarity datasets:
	  - MEN
	  - Rare Word (RW)
	  - SimVerb 3500 (SimVerb)
	  - SimLex 999 (SimLex)
	  - WordSim 353 (WS353)
	To evaluate on other semantic similarity datasets, simply  add  them  to
	the datasets/ folder and run the `./similarity_binary` executable again.
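
	This README does not say which correlation measure is reported.  The
	sketch below assumes Spearman's rank correlation between the vector
	similarities and the human scores, a common choice for datasets of
	this kind.  It is a simplified illustration, not the code of
	`similarity_binary`: the names `rank_of` and `spearman` are
	hypothetical, and the shortcut formula used is only valid when no
	two values in an array are equal.

```c
#include <assert.h>
#include <stddef.h>

/* Rank of values[i] among `n` values: 1 for the smallest. Assumes all
 * values are distinct (ties are NOT handled in this sketch). */
static size_t rank_of(const double *values, size_t n, size_t i)
{
	size_t rank = 1;

	for (size_t j = 0; j < n; ++j)
		if (values[j] < values[i])
			++rank;
	return rank;
}

/* Spearman rank correlation between two arrays of `n` distinct values,
 * using the shortcut formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where
 * d is the difference between the two ranks of the same pair. */
static double spearman(const double *x, const double *y, size_t n)
{
	double sum_d2 = 0.0;

	assert(n >= 2);
	for (size_t i = 0; i < n; ++i) {
		double d = (double)rank_of(x, n, i) - (double)rank_of(y, n, i);

		sum_d2 += d * d;
	}
	return 1.0 - 6.0 * sum_d2 / ((double)n * ((double)n * (double)n - 1.0));
}
```

	Here `x` would hold the similarities computed from the binary
	vectors and `y` the human scores of the same word pairs.  The result
	is 1 when both arrays order the pairs identically and -1 when the
	orders are exactly reversed.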

	3. Top-K queries
	----------------
	Run the executable `topk_binary` to compute the K closest neighbors of a
	QUERY word and their respective similarity to it.

	./topk_binary binary_vectors.vec K QUERY

	The script will report the closest words and their similarity,  as  well
	as the time needed to compute the K closest neighbors.  You can also run
	multiple top-k queries at the same time  by  replacing  the  QUERY  word
	with a list of space-separated words, like:

	./topk_binary binary_vectors.vec 10 queen automobile man moon computer
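
	Comparing packed binary vectors only needs XOR and bit counting,
	which is what makes top-K queries on them fast.  The sketch below
	assumes that the similarity of two vectors is the fraction of bits
	they have in common; this README does not specify the measure, and
	the names `popcount64` and `binary_similarity` are hypothetical, not
	functions of this repository.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the set bits of `x` (Kernighan's method, portable C). */
static int popcount64(uint64_t x)
{
	int count = 0;

	while (x) {
		x &= x - 1;   /* clear the lowest set bit */
		++count;
	}
	return count;
}

/* Fraction of identical bits between two packed vectors of `n_words`
 * 64-bit words each: 1.0 for equal vectors, 0.0 for complementary ones.
 * The choice of this measure is an assumption of the sketch. */
static double binary_similarity(const uint64_t *a, const uint64_t *b,
                                size_t n_words)
{
	int differing = 0;

	assert(n_words > 0);
	for (size_t i = 0; i < n_words; ++i)
		differing += popcount64(a[i] ^ b[i]);   /* Hamming distance */
	return 1.0 - (double)differing / (double)(64 * n_words);
}
```

	A top-K query then amounts to computing this value between the QUERY
	vector and every other vector, and keeping the K largest results.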

AUTHOR

	Written by Julien Tissier.

COPYRIGHT

	This software is licensed under the GNU GPLv3 license.  See the  LICENSE
	file for more details.
