zomux / neuralcompressor

Embedding Quantization (Compress Word Embeddings)

License: MIT License

neuralcompressor's Introduction

nncompress: Implementations of Embedding Quantization (Compress Word Embeddings)

Thank you for your interest in our paper.

I receive mail almost every day and am happy to know that many of you have implemented the model correctly.

I'm glad to help debug your code or have a discussion with you.

Please do not hesitate to mail me for help.

mail_address = "raph_ael@ua_ca.com".replace("_", "")

Requirements:

NumPy and TensorFlow (I also have a PyTorch implementation, which will be uploaded)

Tutorial of the code

Download the project and prepare the data

> git clone https://github.com/zomux/neuralcompressor
> cd neuralcompressor
> bash scripts/download_glove_data.sh

Convert the GloVe embeddings to NumPy format

> python scripts/convert_glove2numpy.py data/glove.6B.300d.txt
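
For readers who want a mental model of the conversion, here is a minimal sketch. It is not the code of scripts/convert_glove2numpy.py: it assumes the usual GloVe text layout (one word per line followed by space-separated floats), and parse_glove_lines is a hypothetical helper name.

```python
import numpy as np

def parse_glove_lines(lines):
    """Split GloVe-style text lines into a word list and a float32 matrix."""
    words, vectors = [], []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    return words, np.asarray(vectors, dtype=np.float32)

# Tiny made-up sample standing in for data/glove.6B.300d.txt
sample = ["the 0.1 0.2 0.3", "of -0.4 0.5 0.6"]
words, matrix = parse_glove_lines(sample)
print(words, matrix.shape)
```

The later steps read data/glove.6B.300d.npy and data/glove.6B.300d.word, so the matrix and the word list are presumably stored separately in that way.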

Train the embedding quantization model

> python bin/quantize_embed.py -M 32 -K 16 --train

...
[epoch198] train_loss=12.82 train_maxp=0.98 valid_loss=12.50 valid_maxp=0.98 bps=618 *
[epoch199] train_loss=12.80 train_maxp=0.98 valid_loss=12.53 valid_maxp=0.98 bps=605
Training Done
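
To get a feel for what -M 32 -K 16 buys, the storage cost can be estimated with a few lines of arithmetic. This is a rough sketch under stated assumptions: -M is taken to be the number of codebooks and -K the number of vectors per codebook (consistent with the exported codes, which have 32 entries in the range 0 to 15), and all floats are assumed to be stored as float32.

```python
import math

def code_bits_per_word(num_codebooks, num_codes):
    """Bits needed to store one word's codes: M codes of log2(K) bits each."""
    return num_codebooks * math.ceil(math.log2(num_codes))

M, K, dim = 32, 16, 300

bits = code_bits_per_word(M, K)        # 32 codes * 4 bits
dense_bytes = dim * 4                  # one float32 vector of 300 dimensions
codebook_bytes = M * K * dim * 4       # shared once across the whole vocabulary

print("bytes per word (codes):", bits // 8)
print("bytes per word (dense):", dense_bytes)
print("codebook bytes:", codebook_bytes)
```

Under these assumptions each word costs 16 bytes instead of 1200, plus a one-time codebook of about 0.6 MB.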

Evaluate the average Euclidean distance

> python bin/quantize_embed.py -M 32 -K 16 --evaluate

Mean euclidean distance: 4.889592628145218
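
The reported number can be read as the mean, over all words, of the Euclidean distance between an original vector and its reconstruction. The sketch below assumes that definition; mean_euclidean_distance is a hypothetical helper, not a function of this project.

```python
import numpy as np

def mean_euclidean_distance(original, reconstructed):
    """Average L2 distance between corresponding rows of two matrices."""
    diff = np.asarray(original, dtype=np.float64) - np.asarray(reconstructed, dtype=np.float64)
    return float(np.mean(np.linalg.norm(diff, axis=1)))

# Made-up 2-dimensional vectors: row distances are 5.0 and 0.0
original = np.array([[0.0, 0.0], [1.0, 1.0]])
reconstructed = np.array([[3.0, 4.0], [1.0, 1.0]])
result = mean_euclidean_distance(original, reconstructed)
print(result)  # 2.5
```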

Export the word codes and the codebook matrix

> python bin/quantize_embed.py -M 32 -K 16 --export

It will generate two files:

data/mymodel.codes
data/mymodel.codebook.npy

Check the codes

> paste data/glove.6B.300d.word data/mymodel.codes | head -n 100

...
only    15 14 7 10 1 14 14 3 0 9 1 9 3 3 0 0 12 1 3 12 15 3 11 12 12 6 1 5 13 6 2 6
state   7 13 7 3 8 14 10 6 6 4 12 2 9 3 9 0 1 1 3 9 11 10 0 14 14 4 15 5 0 6 2 1
million 5 7 3 15 1 14 4 0 6 11 1 4 8 3 1 0 0 1 3 14 8 6 6 5 2 1 2 12 13 6 6 15
could   3 14 7 0 2 14 5 3 0 9 1 0 2 3 9 0 3 1 3 11 5 15 1 12 12 6 1 6 2 6 2 10
...
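
If you want to load the codes in Python, something like the following may help. It is a sketch that assumes data/mymodel.codes holds one line per word with space-separated integers, as the paste output above suggests; parse_code_lines is a hypothetical helper. The two sample lines are the codes shown above for "only" and "state".

```python
import numpy as np

def parse_code_lines(lines, num_codebooks, num_codes):
    """Parse code lines into an int array and check shape and value range."""
    codes = np.array([[int(tok) for tok in line.split()] for line in lines],
                     dtype=np.int64)
    if codes.ndim != 2 or codes.shape[1] != num_codebooks:
        raise ValueError("expected %d codes per word" % num_codebooks)
    if codes.min() < 0 or codes.max() >= num_codes:
        raise ValueError("codes must lie in [0, %d)" % num_codes)
    return codes

lines = [
    "15 14 7 10 1 14 14 3 0 9 1 9 3 3 0 0 12 1 3 12 15 3 11 12 12 6 1 5 13 6 2 6",
    "7 13 7 3 8 14 10 6 6 4 12 2 9 3 9 0 1 1 3 9 11 10 0 14 14 4 15 5 0 6 2 1",
]
codes = parse_code_lines(lines, num_codebooks=32, num_codes=16)
print(codes.shape)
```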

Use it in Python

import numpy as np
from nncompress import EmbeddingCompressor

# Load my embedding matrix
matrix = np.load("data/glove.6B.300d.npy")

# Initialize the compressor (32 and 16 match -M 32 -K 16 above)
compressor = EmbeddingCompressor(32, 16, "data/mymodel")

# Train the quantization model
compressor.train(matrix)

# Evaluate
distance = compressor.evaluate(matrix)
print("Mean euclidean distance:", distance)

# Export the codes and codebook
compressor.export(matrix, "data/mymodel")

Citation

@inproceedings{shu2018compressing,
title={Compressing Word Embeddings via Deep Compositional Code Learning},
author={Raphael Shu and Hideki Nakayama},
booktitle={International Conference on Learning Representations (ICLR)},
year={2018},
url={https://openreview.net/forum?id=BJRZzFlRb},
}

Arxiv version: https://arxiv.org/abs/1711.01068

neuralcompressor's Issues

Pytorch version

Hi, the README suggests there's a Pytorch version. Could you please upload it? I'd love to play around with the code.
Thanks.

Feature: option to anneal temperature and fine-tune codebook

Assign @zomux

Tensorflow version?

I tried running the code with TensorFlow r1.6; however, the line
from tensorflow.python.ops.rnn_cell_impl import _linear in embed_compress.py failed.

I suspect that this error is due to the incompatibility in tensorflow versions (e.g. r1.4 vs r1.6).

Could you specify the tensorflow version that is required by neuralcompressor?

This library doesn't work for large embeddings

Issue description

I tried to execute a slightly modified version of this script (no significant changes were made) for an embedding with a large vocabulary and 600 dimensions:

import numpy as np
from nncompress import EmbeddingCompressor

# Load my embedding matrix
matrix = np.load("data/glove.6B.300d.npy")

# Initialize the compressor
compressor = EmbeddingCompressor(32, 16, "data/mymodel")

# Train the quantization model
compressor.train(matrix)

# Evaluate
distance = compressor.evaluate(matrix)
print("Mean euclidean distance:", distance)

# Export the codes and codebook
compressor.export(matrix, "data/mymodel")

But then, this is what I got:

Traceback (most recent call last):
  File "compress.py", line 82, in <module>
    pipe\
  File "compress.py", line 70, in train
    compressor.train(matrix)
  File "/home/user/summer/smallnilc/nncompress/embed_compress.py", line 159, in train
    word_ids_var, loss_op, train_op, maxp_op = self.build_training_graph(embed_matrix)
  File "/home/user/summer/smallnilc/nncompress/embed_compress.py", line 114, in build_training_graph
    input_matrix = tf.constant(embed_matrix, name="embed_matrix")
  File "/home/user/summer/smallnilc/small/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 180, in constant_v1
    allow_broadcast=False)
  File "/home/user/summer/smallnilc/small/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 284, in _constant_impl
    allow_broadcast=allow_broadcast))
  File "/home/user/summer/smallnilc/small/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 537, in make_tensor_proto
    "Cannot create a tensor proto whose content is larger than 2GB.")
ValueError: Cannot create a tensor proto whose content is larger than 2GB.

TensorFlow developers have answered similar issues by saying that the only solution is to rewrite the code so that it stays within the 2GB hard limit imposed by protobuf.
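
A quick size estimate shows when an embedding matrix would cross that limit. This is a sketch: the threshold of 2**31 bytes is an approximation of the 2GB limit, the 400,000-word vocabulary is the approximate size of GloVe 6B, and the 1,000,000-word, 600-dimension case is a hypothetical stand-in for a large embedding (the actual vocabulary size is not stated here).

```python
TWO_GB = 2 ** 31  # approximate protobuf limit in bytes (assumption)

def embedding_bytes(vocab_size, dim, itemsize=4):
    """Raw size of a dense embedding matrix; itemsize=4 means float32."""
    return vocab_size * dim * itemsize

def exceeds_proto_limit(vocab_size, dim, itemsize=4):
    """True if the matrix is too large to embed as a single graph constant."""
    return embedding_bytes(vocab_size, dim, itemsize) >= TWO_GB

glove_too_big = exceeds_proto_limit(400_000, 300)     # about 480 MB
large_too_big = exceeds_proto_limit(1_000_000, 600)   # about 2.4 GB (hypothetical)
print(glove_too_big, large_too_big)  # False True
```

Note that it is the product of vocabulary size, dimension, and item size that matters, not the dimension alone; a float64 matrix reaches the limit at half the vocabulary size.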

Steps to reproduce the issue

Simply try to compress an embedding with more than 300 dimensions (either 600 or 1000 dimensions).

convert_glove2numpy - issue while conversion for Fasttext Embeddings

@zomux
fastText embeddings are much larger than glove-6b-300d, so a small change in the code would make it work for both.
Note: I was unable to open a PR.

Is there any official embedding APIs that can utilize the compositional encoding directly?

I can use this project to generate the compositional encoding for the languages. However, I ran into an obstacle: as far as I know, no API in TensorFlow supports the generated encodings. Are there any demos or related APIs that can deal with this situation?
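
Independently of any framework API, the lookup itself is small enough to write by hand: pick one vector from each codebook according to the word's codes and sum them. The NumPy sketch below illustrates this under assumptions: lookup_embeddings is a hypothetical helper, the data is random, and the codebooks are taken to be laid out as an (M, K, D) array. How data/mymodel.codebook.npy is actually laid out should be checked before reshaping it into that form.

```python
import numpy as np

def lookup_embeddings(word_ids, codes, codebooks):
    """Reconstruct embeddings for word_ids by summing one vector per codebook.

    codes:     (V, M) integer array, one row of M codes per word
    codebooks: (M, K, D) float array, K vectors of dimension D per codebook
    returns:   (len(word_ids), D) float array
    """
    selected_codes = codes[np.asarray(word_ids)]               # (B, M)
    book_index = np.arange(codebooks.shape[0])                 # (M,)
    selected_vectors = codebooks[book_index, selected_codes]   # (B, M, D)
    return selected_vectors.sum(axis=1)

# Random stand-in data: 5 words, M=4 codebooks, K=3 codes each, D=2 dimensions
rng = np.random.default_rng(0)
V, M, K, D = 5, 4, 3, 2
codes = rng.integers(0, K, size=(V, M))
codebooks = rng.normal(size=(M, K, D))

vectors = lookup_embeddings([1, 0], codes, codebooks)
print(vectors.shape)
```

The same gather-and-sum pattern could be expressed with the indexing operations of a deep learning framework, so that only the integer codes and the shared codebooks need to be kept in memory.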