mittens's Introduction

Mittens

This package contains fast TensorFlow and NumPy implementations of GloVe and Mittens.

By vectorizing the GloVe objective function, we deliver massive speed gains over other Python implementations (10x on CPU; 60x on GPU). See the Speed section below.

The caveat is that our implementation is only suitable for modest vocabularies (up to ~20k tokens should be fine) since the co-occurrence matrix must be held in memory.
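As a rough illustration of where the ~20k guideline comes from, the sketch below estimates the memory needed for one dense matrix of a given vocabulary size. The helper name `dense_matrix_gib` is ours, and the estimate covers a single float32 copy only; training also holds other arrays of the same shape (for example the weights and the log co-occurrences), so real usage is a multiple of this figure.

```python
def dense_matrix_gib(vocab_size, bytes_per_value=4):
    """Approximate size in GiB of one dense vocab_size x vocab_size matrix.

    bytes_per_value=4 assumes float32; use 8 for float64.
    """
    return vocab_size ** 2 * bytes_per_value / 2 ** 30


for v in (5_000, 10_000, 20_000, 40_000):
    print(f"{v:>6} tokens: {dense_matrix_gib(v):.2f} GiB per float32 matrix")
```

Because the size grows with the square of the vocabulary, doubling the vocabulary quadruples the memory needed.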

Vectorizing the objective also reveals that it is amenable to a retrofitting term that encourages representations to remain close to pretrained embeddings. This is useful for domains that require specialized representations but lack sufficient data to train them from scratch. Mittens starts with the general-purpose pretrained representations and tunes them to a specialized domain.
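A minimal NumPy sketch of what a vectorized objective with a retrofitting term can look like is given below. It is an illustration under stated assumptions, not the package's actual code: the function name `mittens_loss` is hypothetical, `xmax=100` and `alpha=0.75` are the usual GloVe weighting defaults, `mu` is an arbitrary penalty weight, and the penalty is assumed to apply to the sum of the word and context vectors.

```python
import numpy as np


def mittens_loss(W, C, bw, bc, X, originals=None, mu=0.1, xmax=100, alpha=0.75):
    """Vectorized GloVe loss with an optional Mittens-style penalty.

    W, C      : (V, n) word and context embeddings
    bw, bc    : (V, 1) word and context biases
    X         : (V, V) dense co-occurrence counts
    originals : (V, n) pretrained vectors, or None for plain GloVe
    """
    # GloVe weighting f(x) = (min(x, xmax) / xmax) ** alpha; zero counts get weight 0.
    weights = (np.minimum(X, xmax) / xmax) ** alpha
    # log(X) for positive entries only; zero entries stay 0 and are weighted out.
    log_X = np.zeros_like(X, dtype=float)
    mask = X > 0
    log_X[mask] = np.log(X[mask])
    # One matrix product yields every pairwise prediction; the biases broadcast.
    diffs = W @ C.T + bw + bc.T - log_X
    loss = np.sum(weights * diffs ** 2)
    if originals is not None:
        # Assumed penalty: squared distance between (W + C) and the pretrained vectors.
        loss += mu * np.sum(((W + C) - originals) ** 2)
    return loss


rng = np.random.RandomState(0)
V, n = 4, 2
X = rng.randint(0, 20, size=(V, V)).astype(float)
W, C = rng.normal(size=(V, n)), rng.normal(size=(V, n))
bw, bc = rng.normal(size=(V, 1)), rng.normal(size=(V, 1))
print(mittens_loss(W, C, bw, bc, X))
print(mittens_loss(W, C, bw, bc, X, originals=np.zeros((V, n))))
```

With `originals=None` the function reduces to the plain GloVe loss; passing pretrained vectors adds a cost for drifting away from them.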

Installation

Dependencies

Mittens only requires numpy. However, if tensorflow is available, that will be used instead. The two implementations use the same cost function and optimizer, so the only difference is that the tensorflow version shows a small speed improvement on CPU, and a large speed improvement when run on GPU.

User installation

The easiest way to install mittens is with pip:

pip install -U mittens

You can also install it by cloning the repository and adding it to your Python path. Make sure you have at least numpy installed.

Note that neither method automatically installs TensorFlow; see the TensorFlow installation instructions for that.

Examples

For both examples, it is assumed that you have already computed the weighted co-occurrence matrix (cooccurrence for vocabulary vocab).
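If you do not yet have such a matrix, the sketch below shows one possible way to build it from tokenized documents. This is an assumption-laden illustration rather than part of the package: the function name `build_cooccurrence`, the window size, and the inverse-distance weighting (a common choice, also used by the GloVe reference tool) are all ours.

```python
from collections import Counter

import numpy as np


def build_cooccurrence(docs, window=2, min_count=1):
    """Return (vocab, matrix) from a list of tokenized documents.

    Each pair of tokens at distance d <= window adds 1/d to both
    matrix[i, j] and matrix[j, i]. Tokens rarer than min_count are
    dropped before the windows are taken.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    vocab = sorted(tok for tok, c in counts.items() if c >= min_count)
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(vocab)), dtype=np.float32)
    for doc in docs:
        ids = [index[tok] for tok in doc if tok in index]
        for pos, i in enumerate(ids):
            for d in range(1, window + 1):
                if pos + d < len(ids):
                    j = ids[pos + d]
                    matrix[i, j] += 1.0 / d
                    matrix[j, i] += 1.0 / d
    return vocab, matrix


docs = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab, cooccurrence = build_cooccurrence(docs)
print(vocab)
print(cooccurrence)
```

The returned `vocab` list gives the row and column order of the matrix, which is the pairing the examples below rely on.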

GloVe

from mittens import GloVe

# Load `cooccurrence`
# Train GloVe model
glove_model = GloVe(n=25, max_iter=1000)  # 25 is the embedding dimension
embeddings = glove_model.fit(cooccurrence)

embeddings is now an np.array of size (len(vocab), n), where the rows correspond to the tokens in vocab.
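To make the row-to-token correspondence concrete, here is a small sketch that looks up nearest neighbours by cosine similarity. The vocabulary and the matrix are made-up stand-ins for `vocab` and the output of `fit`, and the helper `nearest` is hypothetical.

```python
import numpy as np

vocab = ["cat", "dog", "sat", "the"]      # hypothetical vocabulary
embeddings = np.array([[1.0, 0.1],        # stand-in for the output of fit()
                       [0.9, 0.2],
                       [-0.5, 1.0],
                       [0.0, -1.0]])


def nearest(token, vocab, embeddings, k=2):
    """Return the k tokens most cosine-similar to `token`."""
    # Normalize rows so that a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(token)]
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != token][:k]


print(nearest("cat", vocab, embeddings))
```

Row i of the matrix is the vector for `vocab[i]`, and the number of columns is the embedding dimension n passed to the model.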

A small complete example:

from mittens import GloVe
import numpy as np

cooccurrence = np.array([
    [  4.,   4.,   2.,   0.],
    [  4.,  61.,   8.,  18.],
    [  2.,   8.,  10.,   0.],
    [  0.,  18.,   0.,   5.]])
glove_model = GloVe(n=2, max_iter=100)
embeddings = glove_model.fit(cooccurrence)
embeddings

array([[ 1.13700831, -1.16577291],
       [ 2.52644205,  1.56363213],
       [ 0.2376546 ,  0.96793109],
       [ 0.41685158,  1.32988596]], dtype=float32)

Mittens

To use Mittens, you first need pre-trained embeddings. In our paper, we used the embeddings of Pennington et al., available from the Stanford GloVe website.

These vectors should be stored in a dict, where the key is the token and the value is the vector. For example, the function glove2dict below manipulates a Stanford embedding file into the appropriate format.

import csv
import numpy as np

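def glove2dict(glove_filename):
    """Read a Stanford GloVe text file into a {token: vector} dict."""
    # The Stanford files are UTF-8; QUOTE_NONE keeps quote characters as tokens.
    with open(glove_filename, encoding='utf-8') as f:
        reader = csv.reader(f, delimiter=' ', quoting=csv.QUOTE_NONE)
        embed = {line[0]: np.array(list(map(float, line[1:])))
                 for line in reader}
    return embed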

Now that we have our embeddings (stored as original_embedding), as well as a co-occurrence matrix and associated vocabulary, we're ready to train Mittens:

from mittens import Mittens

# Load `cooccurrence` and `vocab`
# Load `original_embedding`
mittens_model = Mittens(n=50, max_iter=1000)
# Note: n must match the original embedding dimension
new_embeddings = mittens_model.fit(
    cooccurrence,
    vocab=vocab,
    initial_embedding_dict=original_embedding)

Once trained, new_embeddings should be compatible with the existing embeddings in the sense that they will be oriented such that using a mix of the two embeddings is meaningful (e.g. using original embeddings for any test-set tokens that were not in the training set).
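One way to mix the two sets of vectors is sketched below, assuming the original embeddings are a dict as produced by glove2dict and the new ones are a matrix aligned with vocab. The helper `combine_embeddings` and the toy data are illustrative only.

```python
import numpy as np


def combine_embeddings(vocab, new_embeddings, original_embedding):
    """Merge tuned and pretrained vectors into one dict.

    Tokens in `vocab` get their row from `new_embeddings`; all other
    tokens keep the vector from `original_embedding`.
    """
    combined = dict(original_embedding)
    combined.update({tok: new_embeddings[i] for i, tok in enumerate(vocab)})
    return combined


# Toy stand-ins; real inputs would come from glove2dict and Mittens.fit.
original_embedding = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
vocab = ["cat", "vet"]
new_embeddings = np.array([[0.9, 0.1], [0.5, 0.5]])

combined = combine_embeddings(vocab, new_embeddings, original_embedding)
print(sorted(combined))
```

Tokens seen during tuning ("cat", "vet") use the tuned vectors, while a token that appears only in the pretrained set ("dog") falls back to its original vector.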

Speed

We compared the per-epoch speed (measured in seconds) for a variety of vocabulary sizes using randomly-generated co-occurrence matrices that were approximately 90% sparse. As we see here, for matrices that fit into memory, performance is competitive with the official C implementation when run on a GPU.

For denser co-occurrence matrices, Mittens will have an advantage over the C implementation since its speed does not depend on sparsity, while the official release is linear in the number of non-zero entries.

                           5K (CPU)  10K (CPU)  20K (CPU)  5K (GPU)  10K (GPU)  20K (GPU)
Non-vectorized TensorFlow     14.02      63.80     252.65     13.56      55.51     226.41
Vectorized Numpy               1.48       7.35      50.03         -          -          -
Vectorized TensorFlow          1.19       5.00      28.69      0.27       0.95       3.68
Official GloVe                 0.66       1.24       3.50         -          -          -
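For readers who want to run a comparable timing, the sketch below generates a random count matrix that is roughly 90% zeros. It is only one plausible reading of the benchmark setup described above: the function name, the count range, and the seed are assumptions, not the script that produced the numbers in the table.

```python
import numpy as np


def random_cooccurrence(vocab_size, sparsity=0.9, max_count=100, seed=0):
    """Random count matrix in which about `sparsity` of the entries are zero."""
    rng = np.random.RandomState(seed)
    counts = rng.randint(1, max_count + 1, size=(vocab_size, vocab_size))
    # Keep each entry with probability 1 - sparsity; zero out the rest.
    keep = rng.uniform(size=counts.shape) >= sparsity
    return (counts * keep).astype(np.float32)


X = random_cooccurrence(1000)
print("shape:", X.shape)
print("fraction of zeros:", float(np.mean(X == 0)))
```

A matrix like this can be passed to `fit` with a small `max_iter` and timed to get a per-epoch figure on your own hardware.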

References

[1] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.

[2] Nicholas Dingwall and Christopher Potts. 2018. Mittens: An Extension of GloVe for Learning Domain-Specialized Representations. (NAACL 2018)

mittens's Issues

what's meaning of the embeddings from glove_model.fit?

I'm new to GloVe. I built the co-occurrence matrix and trained a model with your code, but I don't know how to use the result. What do the embeddings mean? Can you recommend a tutorial or give me an explanation, and tell me what the next step should be?
My vocabulary has 7180 tokens, so my co-occurrence matrix is 7180 x 7180. The embedding matrix I get is 7180 x 100. What does the 100 mean?
glove_model = GloVe(max_iter=1000)
embeddings = glove_model.fit(cooccurrence)

output:
array([[ 0.5545428 , 0.23376928, -0.07426096, ..., 0.990664 , -0.6490942 , 0.6620429 ],
       [ 0.8841677 , 0.51804036, 0.04785374, ..., 0.68058044, -0.90760165, 0.509221 ],
       [ 0.20097731, -0.14931226, -0.3834525 , ..., 0.46705124, -0.2532921 , 0.036834 ],
       ...,
       [-0.11915646, -0.028824 , -0.05225999, ..., -0.14990021, 0.05760989, -0.12905821],
       [-0.14854796, -0.02987392, 0.02080684, ..., -0.09068809, 0.1080381 , -0.09017138],
       [-0.10357033, -0.08430145, -0.03921192, ..., -0.1640319 , 0.05499419, -0.09780643]], dtype=float32)

typeError with tf_mittens.py, line 168

I am trying to use Mittens to fit embeddings for a target domain, but I get the following error:

new_embed = mittens_model.fit(
... comatrix,
... vocab=id2word_cooc,
... initial_embedding_dict= old_embed)

Traceback (most recent call last):
File "", line 4, in
File "/home/clin/env/local/lib/python2.7/site-packages/mittens/mittens_base.py", line 84, in fit
fixed_initialization=fixed_initialization)
File "/home/clin/env/local/lib/python2.7/site-packages/mittens/tf_mittens.py", line 61, in _fit
self.cost = self._get_cost_function()
File "/home/clin/env/local/lib/python2.7/site-packages/mittens/tf_mittens.py", line 168, in _get_cost_function
if self.mittens > 0:
File "/home/clin/env/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 542, in __nonzero__
raise TypeError("Using a tf.Tensor as a Python bool is not allowed. "
TypeError: Using a tf.Tensor as a Python bool is not allowed. Use if t is not None: instead of if t: to test if a tensor is defined, and use TensorFlow ops such as tf.cond to execute subgraphs conditioned on the value of a tensor.

process finished with exit code 137 (interrupted by signal 9 sigkill)

Hi,
I am trying to run a GloVe model on a Hyper-V virtual machine.
My matrix covers ~17K distinct tokens.
I run the following code:
glove_model = GloVe(n=100, max_iter=10)
embeddings = glove_model.fit(cooccurrence)

and I got this error:
process finished with exit code 137 (interrupted by signal: 9 sigkill)

Can someone explain how I can fix it?

how to use sparse matrix with mittens

I stored the co-occurrence matrix in MatrixMarket format and read it into Python with mmread(). Do I have to convert it to a dense matrix (which is impossible because of memory limits), or can Mittens handle this format directly?

Save mittens object in the tensorflow implementation.

If I try saving the trained model (GloVe object) with pickle, it fails because I used the tensorflow implementation. How should I save it?

glove = GloVe(max_iter=self.max_iter, n=self.embedding_dim, learning_rate=self.eta)
G = glove.fit(data)
trained_model = glove

with open(model_path, "w") as f:
    pickle.dump(self.trained_model, f)

File "glove_vectorizer.py", line 107, in save_model
with open(model_path, "w") as f:
_pickle.PicklingError: Can't pickle <class 'module'>: attribute lookup module on builtins failed
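A possible workaround, offered as a sketch rather than an official recommendation: since fit returns a plain NumPy array, the array and the vocabulary can be saved on their own instead of pickling the model object. The file names and toy data below are arbitrary.

```python
import json
import os
import tempfile

import numpy as np

vocab = ["cat", "dog"]                                  # hypothetical vocabulary
embeddings = np.array([[0.1, 0.2], [0.3, 0.4]],
                      dtype=np.float32)                 # stand-in for fit() output

out_dir = tempfile.mkdtemp()
emb_path = os.path.join(out_dir, "embeddings.npy")
vocab_path = os.path.join(out_dir, "vocab.json")

# Save the learned matrix and the token order that indexes its rows.
np.save(emb_path, embeddings)
with open(vocab_path, "w") as f:
    json.dump(vocab, f)

# Reload later without needing TensorFlow or the model object.
loaded = np.load(emb_path)
with open(vocab_path) as f:
    loaded_vocab = json.load(f)
print(loaded_vocab, loaded.shape)
```

Separately, note that in Python 3 pickle.dump needs a file opened in binary mode ("wb"), not "w".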

Training epochs loss

I fine-tuned Mittens using Stanford GloVe embeddings on my review dataset. After I prepared my co-occurrence matrix, the vocabulary size was 43,933. Given the capacity of my computer, I therefore fine-tuned in two parts:

  1. used the first 22000 vocabulary items in a first pass to fine-tune the embeddings, and
  2. used the remaining vocabulary in a second pass.

The strange thing is that in the first pass the error over 1000 iterations fell from about 91000 to about 30000, but in the second pass the error over 1000 iterations ranged between 95 and about 0.79.

I am confused by this behaviour because both passes had almost the same amount of data. I would like to know why this is happening.

Is this good or bad? If it is bad, how can I fix it?

Make Mittens Deterministic

At present, mittens is not reproducible because of calls to np.random.seed(None) in the function for generating random matrices. This is a nuisance for testing or research reproducibility. I'm still not 100% sure I'm going to use mittens in my current project, but if I do, I will send a pull request to fix this.
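For context, np.random.seed(None) reseeds NumPy's global generator from the operating system, so any seed the caller set beforehand is discarded. A sketch of one possible fix is to draw from a dedicated generator with an optional seed; the helper name, signature, and value range below are hypothetical and not the package's actual function.

```python
import numpy as np


def random_matrix(rows, cols, seed=None, lower=-0.5, upper=0.5):
    """Uniform random matrix; repeatable when `seed` is an integer."""
    # A local RandomState leaves the global np.random state untouched.
    rng = np.random.RandomState(seed)
    return rng.uniform(lower, upper, size=(rows, cols))


a = random_matrix(3, 2, seed=42)
b = random_matrix(3, 2, seed=42)
c = random_matrix(3, 2, seed=43)
print(np.array_equal(a, b), np.array_equal(a, c))
```

With seed=None the behaviour stays non-deterministic, so existing callers would be unaffected.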

TypeError: exponent must be an integer

When trying to build the mittens model, I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-30-814206757e5a> in <module>()
      2     cooccur_matrix,
      3     vocab=vocab,
----> 4     initial_embedding_dict=vector_dict
      5 )

~/anaconda3/envs/mittens/lib/python3.6/site-packages/mittens/mittens_base.py in fit(self, X, vocab, initial_embedding_dict, fixed_initialization)
     78             X, vocab, initial_embedding_dict
     79         )
---> 80         weights, log_coincidence = self._initialize(X)
     81         return self._fit(X, weights, log_coincidence,
     82                          vocab=vocab,

~/anaconda3/envs/mittens/lib/python3.6/site-packages/mittens/mittens_base.py in _initialize(self, coincidence)
    143         self.n_words = coincidence.shape[0]
    144         bounded = np.minimum(coincidence, self.xmax)
--> 145         weights = (bounded / float(self.xmax)) ** self.alpha
    146         log_coincidence = log_of_array_ignoring_zeros(coincidence)
    147         return weights, log_coincidence

~/anaconda3/envs/mittens/lib/python3.6/site-packages/numpy/matrixlib/defmatrix.py in __pow__(self, other)
    320 
    321     def __pow__(self, other):
--> 322         return matrix_power(self, other)
    323 
    324     def __ipow__(self, other):

~/anaconda3/envs/mittens/lib/python3.6/site-packages/numpy/matrixlib/defmatrix.py in matrix_power(M, n)
    139         raise ValueError("input must be a square array")
    140     if not issubdtype(type(n), N.integer):
--> 141         raise TypeError("exponent must be an integer")
    142 
    143     from numpy.linalg import inv

TypeError: exponent must be an integer

I was able to identify that the error is because self.alpha is set to 0.75. If I set that to 1.0 I do not get the error (though I will obviously not be weighting things as intended).

Note: I am using numpy 1.14.5
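The traceback passes through numpy/matrixlib/defmatrix.py, which indicates that the co-occurrence input was an np.matrix (for instance, the result of calling .todense() on a SciPy sparse matrix) rather than a plain array. For np.matrix, ** means matrix power, which only accepts integer exponents. The sketch below reproduces the error on toy data and shows that converting with np.asarray restores the elementwise power.

```python
import numpy as np

alpha = 0.75
counts = np.matrix([[4.0, 2.0], [2.0, 10.0]])   # toy np.matrix input

try:
    counts ** alpha              # np.matrix treats ** as matrix power
    raised = False
except TypeError as err:
    raised = True
    print("np.matrix:", err)

dense = np.asarray(counts)       # plain ndarray holding the same values
weights = dense ** alpha         # elementwise power, as the weighting needs
print(weights)
```

Passing np.asarray(cooccur_matrix) to fit, instead of the matrix object itself, should therefore avoid the error without changing alpha.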

how to initialize vectors for words in corpus *not* in GloVe

Not so much an issue as a question.

Do you have any suggestions on the "best" way to initialize vectors for words in my small corpus that don't appear in GloVe? I'm assuming that I need to do this so that mittens can "retrofit" it. But I suspect that a simple np.random.rand(100) is probably not the best way to go.

Any suggestions would be much appreciated.

Memory Error

I get a memory error while converting corpus.matrix (the co-occurrence matrix) to a NumPy array, because my data is quite large.

Is it necessary to convert the co-occurrence matrix to a NumPy array? Can we not work with a sparse matrix?

What other solutions can you suggest for me?

how to use mittens for more than 20k vocab ?

I am going through the code to understand what would need to change to use Mittens with a vocabulary of more than 20k tokens. Could you describe the approach, or point me to the part of the code that would need patching to lift that limitation? Any other explanation would also be helpful.

Suggestion on how to generate the cooccurrence matrix

Not an issue, but I was wondering: do you have any suggestions for a library to generate the matrix (and eventually the vocab), or for a tutorial?

I have a fairly large corpus of roughly 4k short documents, and I cannot use the script from the original project since I'm on Windows.

Cannot run mittens with tensorflow 2.1

After installing tensorflow 2.1, I can no longer run GloVe: the "fit" function gives the following errors:

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
2020-03-01 17:54:01.126392: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-01 17:54:01.127263: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-03-01 17:54:01.127355: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-03-01 17:54:01.127418: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-03-01 17:54:01.128038: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-03-01 17:54:01.128072: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-03-01 17:54:01.128111: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-03-01 17:54:01.128134: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-03-01 17:54:01.128168: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-03-01 17:54:01.128184: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-03-01 17:54:01.129233: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2020-03-01 17:54:01.129268: W tensorflow/stream_executor/stream.cc:2041] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
File "/home/richard/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1367, in _do_call
return fn(*args)
File "/home/richard/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1352, in _run_fn
target_list, run_metadata)
File "/home/richard/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1445, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(4, 1), b.shape=(1, 4), m=4, n=4, k=1
[[{{node Tensordot_1/MatMul}}]]
[[Sum/_5]]
(1) Internal: Blas GEMM launch failed : a.shape=(4, 1), b.shape=(1, 4), m=4, n=4, k=1
[[{{node Tensordot_1/MatMul}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/richard/Documents/AI/XCS224U/similarity_methods.py", line 28, in
embeddings = glove_model.fit(cooccurrence)
File "/home/richard/anaconda3/lib/python3.7/site-packages/mittens/mittens_base.py", line 240, in fit
X, fixed_initialization=fixed_initialization)
File "/home/richard/anaconda3/lib/python3.7/site-packages/mittens/mittens_base.py", line 84, in fit
fixed_initialization=fixed_initialization)
File "/home/richard/anaconda3/lib/python3.7/site-packages/mittens/tf_mittens.py", line 83, in _fit
self.log_coincidence: log_coincidence})
File "/home/richard/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 960, in run
run_metadata_ptr)
File "/home/richard/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1183, in _run
feed_dict_tensor, options, run_metadata)
File "/home/richard/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1361, in _do_run
run_metadata)
File "/home/richard/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1386, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(4, 1), b.shape=(1, 4), m=4, n=4, k=1
[[node Tensordot_1/MatMul (defined at /anaconda3/lib/python3.7/site-packages/mittens/tf_mittens.py:151) ]]
[[Sum/_5]]
(1) Internal: Blas GEMM launch failed : a.shape=(4, 1), b.shape=(1, 4), m=4, n=4, k=1
[[node Tensordot_1/MatMul (defined at /anaconda3/lib/python3.7/site-packages/mittens/tf_mittens.py:151) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'Tensordot_1/MatMul':
File "/Documents/AI/XCS224U/similarity_methods.py", line 28, in
embeddings = glove_model.fit(cooccurrence)
File "/anaconda3/lib/python3.7/site-packages/mittens/mittens_base.py", line 240, in fit
X, fixed_initialization=fixed_initialization)
File "/anaconda3/lib/python3.7/site-packages/mittens/mittens_base.py", line 84, in fit
fixed_initialization=fixed_initialization)
File "/anaconda3/lib/python3.7/site-packages/mittens/tf_mittens.py", line 56, in _fit
self._build_graph(vocab, initial_embedding_dict)
File "/anaconda3/lib/python3.7/site-packages/mittens/tf_mittens.py", line 151, in _build_graph
tf.tensordot(self.bw, tf.transpose(self.ones), axes=1) +
File "/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/math_ops.py", line 4106, in tensordot
ab_matmul = matmul(a_reshape, b_reshape)
File "/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/math_ops.py", line 2798, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/gen_math_ops.py", line 5626, in mat_mul
name=name)
File "/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/framework/op_def_library.py", line 742, in _apply_op_helper
attrs=attr_protos, op_def=op_def)
File "/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3322, in _create_op_internal
op_def=op_def)
File "/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1756, in __init__
self._traceback = tf_stack.extract_stack()

TypeError: NumPy boolean array indexing assignment requires a 0 or 1-dimensional input, input has 2 dimensions

When trying to run fit() with a mittens model, I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-814206757e5a> in <module>()
      2     cooccur_matrix,
      3     vocab=vocab,
----> 4     initial_embedding_dict=vector_dict
      5 )

~/anaconda3/envs/mittens/lib/python3.6/site-packages/mittens/mittens_base.py in fit(self, X, vocab, initial_embedding_dict, fixed_initialization)
     78             X, vocab, initial_embedding_dict
     79         )
---> 80         weights, log_coincidence = self._initialize(X)
     81         return self._fit(X, weights, log_coincidence,
     82                          vocab=vocab,

~/anaconda3/envs/mittens/lib/python3.6/site-packages/mittens/mittens_base.py in _initialize(self, coincidence)
    144         bounded = np.minimum(coincidence, self.xmax)
    145         weights = (bounded / float(self.xmax)) ** self.alpha
--> 146         log_coincidence = log_of_array_ignoring_zeros(coincidence)
    147         return weights, log_coincidence
    148 

~/anaconda3/envs/mittens/lib/python3.6/site-packages/mittens/mittens_base.py in log_of_array_ignoring_zeros(M)
    258     log_M = M.copy()
    259     mask = log_M > 0
--> 260     log_M[mask] = np.log(log_M[mask])
    261     return log_M
    262 

TypeError: NumPy boolean array indexing assignment requires a 0 or 1-dimensional input, input has 2 dimensions

It appears that the indexing can't occur with a 2-dimensional array, but the input to this method is the cooccurrence matrix which has to be 2D, correct?

Note: I'm using numpy 1.14.5
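A possible cause, assuming the matrix passed to fit was an np.matrix (as the traceback in the "exponent must be an integer" report above suggests for the same setup): boolean indexing on an np.matrix returns a 2-D result, whereas a plain ndarray returns a 1-D one. The sketch below shows the difference on toy data and that the masked log works once the input is a plain array.

```python
import numpy as np

m = np.matrix([[0.0, 2.0], [3.0, 0.0]])   # toy np.matrix input
a = np.asarray(m)                          # plain ndarray with the same values

print("np.matrix selection shape:", m[m > 0].shape)   # stays 2-D
print("ndarray selection shape:  ", a[a > 0].shape)   # flattened to 1-D

# The masked log from the traceback, applied to the plain ndarray.
log_a = a.copy()
mask = log_a > 0
log_a[mask] = np.log(log_a[mask])
print(log_a)
```

So the co-occurrence input does need to be 2-D, but as an ndarray (for example via np.asarray) rather than an np.matrix.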

Inconsistent results - Mittens VS standard Glove

I am using mittens with a pre-built cooccurrence matrix of domains, in the hope that thematically related domains end up close to each other. Using the non-vectorized glove implementation from https://github.com/stanfordnlp/GloVe, I get very strong results. The current initialization is:

glove_model = Glove(no_components=50, learning_rate=0.03)
glove_model.fit(coo_matrix(matrix, dtype=float), epochs=50, no_threads=64, verbose=True)

Finding the nearest domains to nintendo using cosine distance yields good results.

find_nearest(glove_model, "nintendo", 10)

[('game', 0.955347329117499),
('zavvi', 0.9382098190168783),
('eurogamer', 0.9296358002057901),
('playstation', 0.9290108695965159),
('gamespot', 0.9241452666014682),
('gamesradar', 0.9210470827690169),
('365games', 0.9193152241566838),
('ign', 0.9178656620515147),
('ea', 0.912055674280889),
('forbiddenplanet', 0.9118661547211797)]

Given these results, I wanted to use mittens for two reasons: take advantage of the vectorized implementation for speed, and harness the ability to extend glove into a retrofitted model. However, when I used a basic mittens (without retrofitting existing embeddings), the results come out quite poor, even when the same hyperparameters are used.

glove_mittens_50_50 = GloVe(n=50, max_iter=50, learning_rate=0.03)
cooccurance = np.array(matrix.todense()) # was sparse matrix for original glove
glove_mittens_trained_50_50 = glove_mittens_50_50.fit(cooccurance)

I built a pd dataframe with the resulting numpy matrix and incorporated the domains as the index before writing a function that would calculate the cosine distance in the same way that the original glove model does.

find_nearest(mittens_glove_df_50_50, "nintendo", 10)

[('hmrc', 0.9992567),
('anglingdirect', 0.999141),
('axa', 0.99907136),
('greatist', 0.99906415),
('techadvisor', 0.99906313),
('victorianplumbing', 0.99903136),
('dell', 0.9990228),
('imore', 0.99899846),
('carpetright', 0.99899185)]

As you can see, the results are not at all as expected. Furthermore, while the original glove model converges and changes only very slightly when the number of iterations is increased, the vectorized glove in this package keeps changing.

find_nearest(mittens_glove_df_50_100, "github", 10)  # 100 iterations

[('yammer', 0.9993163),
('twitch', 0.9992425),
('axs', 0.9992203),
('rottentomatoes', 0.99920493),
('travelsupermarket', 0.99919695),
('lbc', 0.99919665),
('motors', 0.99918556),
('goodreads', 0.9991843),
('deezer', 0.9991767),
('nationalexpress', 0.99917376)]

Is there a reason why this is the case? Am I doing anything wrong, or is there anything else you'd like me to try?

Thanks.

Is there a way to batch train GloVe models?

I have a gigantic corpus of text to train on, which leads to memory issues.
Is there a way to train GloVe and/or Mittens models in batches, similar to partial_fit in some scikit-learn models?
