tca19 / near-lossless-binarization

This repository contains source code to binarize any real-valued word embeddings into binary vectors.

License: GNU General Public License v3.0

C 95.74% Makefile 4.26%
word-embeddings wordembeddings quantized-neural-networks autoencoder binary-word-embeddings binarization binary-word-vectors

near-lossless-binarization's Issues

Small Embeddings

When I run this code on small embeddings (e.g., GloVe Twitter 25d), I would like to use a representation smaller than 64 bits, but this does not work: the output file does not support anything smaller than that size.

Is there a way to obtain smaller, binary embeddings with this method?

Thank you,
Emanuele

Potential error in read_word function

Hi,
I find your work interesting and have been reading the source code in binarize.c. While doing so, I found a potential error in the read_word function.

  1. int getc_unlocked(FILE *);
    UTF-8 embedding files are mainstream, and MAXWORDLEN=256 is too small when reading byte by byte. I tried some embedding files, such as a fastText skip-gram model, and got a segmentation fault. At first I was confused, because the longest word in my embedding file is 228 characters. Once I thought about the UTF-8 format, I understood: a word can take more bytes than it has characters. I changed MAXWORDLEN to 1024 and successfully got the result. Along the way, I referred to liaocs2008's issue.
    If we do not check the maximum word length in the vector file, we can get duplicated words, because words are truncated at MAXWORDLEN.

  2. /* skip white spaces (space or line feed (ascii code 0x0a is \n)) */ while (isspace((tmp[i] = getc_unlocked(fp))))
    Yes, when I binarized a fastText skip-gram model with this repository, I got an error simply because the file contains whitespace tokens as words (e.g. \t, space), especially when I used a pre-trained vector file. The current function just skips such words and returns the same empty word each time.
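
To illustrate the first point above, here is a minimal Python sketch of why a 228-character word can overflow a 256-byte buffer when it is read byte by byte. The sample tokens are hypothetical (they are not taken from any embedding file), and MAXWORDLEN is simply the value quoted in this issue.

```python
def utf8_len(word: str) -> int:
    """Number of bytes the word occupies once encoded as UTF-8."""
    return len(word.encode("utf-8"))


MAXWORDLEN = 256  # buffer size quoted in the issue

# Hypothetical 228-character tokens (the length mentioned above).
samples = {
    "ascii": "a" * 228,          # 1 byte per character
    "accented": "\u00e9" * 228,  # 2 bytes per character
    "cjk": "\u8a9e" * 228,       # 3 bytes per character
}

for name, word in samples.items():
    n_bytes = utf8_len(word)
    # A byte-oriented reader needs n_bytes + 1 bytes (terminating NUL).
    print(name, len(word), n_bytes, n_bytes + 1 > MAXWORDLEN)
# prints:
# ascii 228 228 False
# accented 228 456 True
# cjk 228 684 True
```

Only the pure-ASCII token fits; the other two need more than 256 bytes even though all three have the same number of characters.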

undefined reference to cblas_sgemm

I am having trouble running make to build the binaries.
As a first step, I installed all requirements and used git clone to clone the repository to my local directory.

Running cd near-lossless-binarization && make produces this error message:

gcc binarize.c -o binarize -ansi -pedantic -Wall -Wextra -Wno-unused-result -Ofast -funroll-loops -lblas -lm
/usr/bin/ld: /tmp/ccdDjYDi.o: in function `apply_regularizarion_gradient':
binarize.c:(.text+0xddc): undefined reference to `cblas_sgemm'
/usr/bin/ld: binarize.c:(.text+0xf9e): undefined reference to `cblas_sgemm'
/usr/bin/ld: /tmp/ccdDjYDi.o: in function `apply_reconstruction_gradient':
binarize.c:(.text+0x1059): undefined reference to `cblas_sgemm'
/usr/bin/ld: binarize.c:(.text+0x12fb): undefined reference to `cblas_sgemm'
/usr/bin/ld: binarize.c:(.text+0x1800): undefined reference to `cblas_sgemm'
/usr/bin/ld: /tmp/ccdDjYDi.o:binarize.c:(.text+0x1e71): more undefined references to `cblas_sgemm' follow

The exit status is 1.

I want to use your repo in my bachelor thesis. Any advice on how to fix this?
The requirements are fulfilled as far as I know.

Probably there is a mistake on my side.

Problem when testing your solution

Hey Julien Tissier, Christophe Gravier and Amaury Habrard

Your paper looks really interesting and I wanted to test the binary vectors. However, when I run the command ./binarize -input ../crawl-300d-2M.vec on the Facebook vectors, on an Ubuntu 16.04 server with openblas-dev installed, I get the following result:

1999995 256
, 12854058564283852332 14495264440446430917 2338217189150912797 3900823073646626759
the 2543537490569830019 7091594523546908885 12132390578482067676 17054061891972847710
. 49085064044372029 13430128919309794722 3676268525978997720 1129446822182950747
and 16350715358351243055 16393012712704504925 969627095091772476 2240764738555920805
to 9099634481703745995 17642094685701064884 6452519320616550255 16120898980517001725
of 236520170687141399 4767709100272182524 12135506450154551421 17634123495325585581
a 11194259637020728784 4228149066443023844 7425191957517212067 12735785767725396608
....

Furthermore, when I then run the command ./similarity_binary binary_vectors.vec on the result, I get the following error:

create_vocab(): 0.012973s
*** stack smashing detected ***: ./similarity_binary terminated
Aborted (core dumped)

I've tried to debug the problem, but have not been able to find the source. Am I doing something wrong?
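
For readers who want to inspect such an output file independently of similarity_binary, here is a rough Python sketch. It assumes, based only on the sample output above, that the header gives the vocabulary size and the number of bits (256), and that each following line is a word followed by that many bits packed into 64-bit unsigned integers (4 per word here). The function names are hypothetical and are not part of the repository.

```python
def parse_line(line: str):
    """Split 'word int int ...' into (word, list of unsigned integers)."""
    fields = line.split()
    return fields[0], [int(x) for x in fields[1:]]


def matching_bits_ratio(a, b, bits_per_int=64):
    """Fraction of bit positions on which two packed vectors agree."""
    # XOR sets a bit wherever the two vectors differ; count those bits.
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return 1.0 - differing / (bits_per_int * len(a))


# Two lines copied from the output quoted in this issue.
the_line = ("the 2543537490569830019 7091594523546908885 "
            "12132390578482067676 17054061891972847710")
and_line = ("and 16350715358351243055 16393012712704504925 "
            "969627095091772476 2240764738555920805")

the_word, the_vec = parse_line(the_line)
and_word, and_vec = parse_line(and_line)

print(the_word, and_word, matching_bits_ratio(the_vec, and_vec))
```

Because the comparison only counts differing bits, it does not depend on the order in which bits are stored inside each integer.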

Could you train on Glove.42B.300d.txt

Hi,

I find your work interesting and have been testing it on glove.6B.300d.txt (after converting it to the w2v format). The program works fine and the output vectors look normal.

However, when I run it on glove.42B.300d.txt, the program outputs zeros for all words. I am guessing it might be a convergence issue. Could you test and help me figure out the problem?

Thanks!

undefined reference to `cblas_sgemm' error when make

OS : ubuntu 16.04.6
gcc version : 5.4.0-6ubuntu1~16.04.11

First, I installed OpenBLAS.

sudo apt install libopenblas-dev
sudo update-alternatives --config libblas.so.3

Then I ran make and got the following error.

gcc -ansi -pedantic -lm -pthread -Ofast -funroll-loops -lblas -Wall -Wextra -Wno-unused-result binarize.c -o binarize
/tmp/cc9DvK9o.o: In function `apply_regularizarion_gradient':
binarize.c:(.text+0xea8): undefined reference to `cblas_sgemm'
binarize.c:(.text+0x1087): undefined reference to `cblas_sgemm'
/tmp/cc9DvK9o.o: In function `apply_reconstruction_gradient':
binarize.c:(.text+0x1117): undefined reference to `cblas_sgemm'
binarize.c:(.text+0x149b): undefined reference to `cblas_sgemm'
binarize.c:(.text+0x1c54): undefined reference to `cblas_sgemm'
/tmp/cc9DvK9o.o:binarize.c:(.text+0x22b7): more undefined references to `cblas_sgemm' follow
collect2: error: ld returned 1 exit status
makefile:28: recipe for target 'binarize' failed
make: *** [binarize] Error 1

How can I deal with this problem? Any help?

Cannot reproduce results using FastText embeddings

First of all thank you for the great paper, really like the simplicity in the model! Currently I am trying to reproduce the semantic similarity scores but somehow this doesn't seem to work. What I did:

Use the Wiki 1M 300 vec embeddings from https://fasttext.cc/docs/en/english-vectors.html

Train using the parameters specified in the paper for FastText (10 epochs, all other values default)

Run your semantic evaluation scores:

create_vocab(): 0.008999s
load_vectors(): 0.926019s
Filename     | Spearman | OOV
==============================
WS353.txt    |    0.559 |   0%
MEN.txt      |    0.622 |   0%
RW.txt       |    0.421 |   3%
SimVerb.txt  |    0.238 |   0%
SimLex.txt   |    0.334 |   0%
evaluate(): 0.008135s

Maybe I am missing something, but the scores for WS353, RW, SimVerb, and SimLex are quite a bit lower than the ones I am trying to reproduce. Am I using the correct FastText embeddings? I did not perform any sort of normalization or standardization.

Thank you in advance!

Invalid pointer when binarize wiki-news-300d-1M.vec

Hi,
When I binarize the wiki-news-300d-1M.vec file, the program outputs:
*** Error in `./binarize': free(): invalid pointer: 0x00007f64c490e010 ***
From the backtrace, I can tell that something goes wrong when freeing memory, but I cannot find the exact location.
Surprisingly, the program works normally when I compress the GloVe.6B.300d.txt file.

Sokal Michener definition

Hi, is there a reason why your Sokal Michener function differs from other sources online? For example, SciPy's sokalmichener returns a different value from your function.

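import numpy as np

def sokalmichener(u, v):
    # Contingency counts over the positions of two 0/1 vectors
    ntt = np.dot(u, v.T)              # both 1
    ntf = np.dot(u, 1 - v.T)          # u is 1, v is 0
    nff = np.dot(1.0 - u, 1.0 - v.T)  # both 0
    nft = np.dot(1.0 - u, v.T)        # u is 0, v is 1
    print(ntt + ntf + nff + nft)      # total number of positions
    return (ntt + nff) / (ntt + ntf + nff + nft)

u = np.array([1, 1, 1, 1, 1, 0, 0, 1])
v = np.array([1, 0, 0, 1, 1, 0, 0, 1])
sokalmichener(u, v)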

Returns

8.0
0.75

SciPy's version (with distance imported from scipy.spatial):

distance.sokalmichener(u, v)
0.4

Is the Sokal Michener dissimilarity function a different thing?
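
The two numbers above can be reproduced side by side with the following sketch. It assumes the formula given in the SciPy documentation for sokalmichener, R / (S + R) with R = 2 * (c_TF + c_FT) and S = c_TT + c_FF, and reimplements it with NumPy only rather than calling SciPy. The function names are hypothetical.

```python
import numpy as np


def contingency(u, v):
    """Counts of (True,True), (True,False), (False,True), (False,False)."""
    u = np.asarray(u, dtype=bool)
    v = np.asarray(v, dtype=bool)
    ntt = int(np.sum(u & v))
    ntf = int(np.sum(u & ~v))
    nft = int(np.sum(~u & v))
    nff = int(np.sum(~u & ~v))
    return ntt, ntf, nft, nff


def matching_similarity(u, v):
    """Share of positions where u and v agree (the function in the question)."""
    ntt, ntf, nft, nff = contingency(u, v)
    return (ntt + nff) / (ntt + ntf + nft + nff)


def sokalmichener_dissimilarity(u, v):
    """Dissimilarity R / (S + R), as documented for SciPy (assumption)."""
    ntt, ntf, nft, nff = contingency(u, v)
    r = 2 * (ntf + nft)  # mismatches, counted twice
    s = ntt + nff        # matches
    return r / (s + r)


u = [1, 1, 1, 1, 1, 0, 0, 1]
v = [1, 0, 0, 1, 1, 0, 0, 1]
print(contingency(u, v))                  # (4, 2, 0, 2)
print(matching_similarity(u, v))          # 0.75
print(sokalmichener_dissimilarity(u, v))  # 0.4
```

Under that assumption, one value is a similarity (6 matches out of 8 positions) and the other is a dissimilarity that counts each of the 2 mismatches twice (4 / 10). They are therefore not complements of each other: 1 - 0.75 = 0.25, not 0.4.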
