philipperemy / keras-attention

Keras Attention Layer (Luong and Bahdanau scores).

License: Apache License 2.0

Python 100.00%
keras keras-neural-networks attention-mechanism attention-model deep-learning

keras-attention's Introduction

Keras Attention Layer


Attention Layer for Keras. Supports the score functions of Luong and Bahdanau.

Tested with TensorFlow 2.8, 2.9, 2.10, 2.11, 2.12, 2.13 and 2.14 (as of Sep 26, 2023).

Installation

PyPI

pip install attention

Attention Layer

Attention(
    units=128,
    score='luong',
    **kwargs
)

Arguments

  • units: Integer. The number of (output) units in the attention vector ($a_t$).

  • score: String. The score function $score(h_t, \bar{h}_s)$. Possible values are luong or bahdanau (both forms are written out below).

    • Luong's multiplicative style. Link to paper.
    • Bahdanau's additive style. Link to paper.
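
For reference, the two score functions as summarized in Luong et al. (2015), where $W_a$ and $v_a$ are learned weights:

  • luong (multiplicative): $score(h_t, \bar{h}_s) = h_t^\top W_a \bar{h}_s$
  • bahdanau (additive): $score(h_t, \bar{h}_s) = v_a^\top \tanh(W_a [h_t; \bar{h}_s])$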

Input shape

3D tensor with shape (batch_size, timesteps, input_dim).

Output shape

  • 2D tensor with shape (batch_size, num_units) ($a_t$).

If you want to visualize the attention weights, refer to the example in examples/add_two_numbers.py.
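
A minimal sketch of the pattern that example uses to pull the weights out with keract, assuming a trained model whose attention layer was given the name 'attention_weight' (as in the repository's examples) and a test batch x_test:

import matplotlib.pyplot as plt
from keract import get_activations

# model: a trained Keras model containing a layer named 'attention_weight' (assumption, see examples).
# x_test: a batch of test sequences of shape (num_samples, time_steps, input_dim).
attention_map = get_activations(model, x_test, layer_names='attention_weight')['attention_weight']
plt.imshow(attention_map, cmap='hot')  # one row per sample, one column per time step
plt.show()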

Example

import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.models import load_model, Model

from attention import Attention


def main():
    # Dummy data. There is nothing to learn in this example.
    num_samples, time_steps, input_dim, output_dim = 100, 10, 1, 1
    data_x = np.random.uniform(size=(num_samples, time_steps, input_dim))
    data_y = np.random.uniform(size=(num_samples, output_dim))

    # Define/compile the model.
    model_input = Input(shape=(time_steps, input_dim))
    x = LSTM(64, return_sequences=True)(model_input)
    x = Attention(units=32)(x)
    x = Dense(1)(x)
    model = Model(model_input, x)
    model.compile(loss='mae', optimizer='adam')
    model.summary()

    # train.
    model.fit(data_x, data_y, epochs=10)

    # test save/reload model.
    pred1 = model.predict(data_x)
    model.save('test_model.h5')
    model_h5 = load_model('test_model.h5', custom_objects={'Attention': Attention})
    pred2 = model_h5.predict(data_x)
    np.testing.assert_almost_equal(pred1, pred2)
    print('Success.')


if __name__ == '__main__':
    main()

Other Examples

Browse examples.

Install the requirements before running the examples: pip install -r examples/examples-requirements.txt.

IMDB Dataset

In this experiment, we demonstrate that using attention yields a higher accuracy on the IMDB dataset. We consider two LSTM networks: one with this attention layer and the other one with a fully connected layer. Both have the same number of parameters for a fair comparison (250K).

Here are the results on 10 runs. For every run, we record the max accuracy on the test set for 10 epochs.

Measure            No Attention (250K params)   Attention (250K params)
MAX Accuracy       88.22                        88.76
AVG Accuracy       87.02                        87.62
STDDEV Accuracy    0.18                         0.14

As expected, the model with attention gets a boost in accuracy. It also reduces the variability across runs, which is a nice property to have.
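
For context, here is a sketch of the kind of model pair being compared. The layer sizes below are illustrative, not the exact 250K-parameter configurations (those are in the repository's examples):

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Embedding, Flatten, LSTM

from attention import Attention


def build_imdb_model(use_attention: bool, vocab_size: int = 20000, maxlen: int = 200):
    i = Input(shape=(maxlen,))
    x = Embedding(vocab_size, 64)(i)
    x = LSTM(64, return_sequences=True)(x)
    if use_attention:
        x = Attention(units=32)(x)  # attention head: (batch, 32)
    else:
        x = Flatten()(x)  # fully connected head of comparable size
        x = Dense(32, activation='relu')(x)
    x = Dense(1, activation='sigmoid')(x)
    model = Model(i, x)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model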

Adding two numbers

Let's consider the task of adding two numbers that come right after some delimiters (0 in this case):

x = [1, 2, 3, 0, 4, 5, 6, 0, 7, 8]. Result is y = 4 + 7 = 11.
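
A rough sketch of how such data can be generated (illustrative only; the repository's own generator in examples/add_two_numbers.py is the reference):

import numpy as np


def add_two_numbers_data(n: int, seq_length: int = 10, delimiter: float = 0.0):
    # Random sequences; two positions are turned into delimiters and the
    # numbers right after each delimiter are summed to form the target.
    x = np.random.uniform(1, 9, size=(n, seq_length))
    y = np.zeros((n, 1))
    for i in range(n):
        d1 = np.random.randint(0, seq_length - 3)
        d2 = np.random.randint(d1 + 2, seq_length - 1)
        x[i, d1] = delimiter
        x[i, d2] = delimiter
        y[i] = x[i, d1 + 1] + x[i, d2 + 1]
    return np.expand_dims(x, axis=-1), y  # shapes: (n, seq_length, 1), (n, 1)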

The attention is expected to be the highest after the delimiters. An overview of the training is shown below, where the top represents the attention map and the bottom the ground truth. As the training progresses, the model learns the task and the attention map converges to the ground truth.

Finding max of a sequence

We consider many 1D sequences of the same length. The task is to find the maximum of each sequence.

We give the full sequence processed by the RNN layer to the attention layer. We expect the attention layer to focus on the maximum of each sequence.

After a few epochs, the attention layer converges perfectly to what we expected.
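
A sketch of this setup (sizes are illustrative, not the repository's exact example):

import numpy as np
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, LSTM

from attention import Attention

seq_length = 10
x = np.random.uniform(size=(1000, seq_length, 1))
y = x.max(axis=1)  # target: the maximum of each sequence

i = Input(shape=(seq_length, 1))
h = LSTM(64, return_sequences=True)(i)  # the full sequence is fed to the attention layer
h = Attention(units=32)(h)
o = Dense(1)(h)
model = Model(i, o)
model.compile(loss='mae', optimizer='adam')
model.fit(x, y, epochs=5)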

References

  • Minh-Thang Luong, Hieu Pham, Christopher D. Manning. Effective Approaches to Attention-based Neural Machine Translation, 2015. https://arxiv.org/abs/1508.04025
  • Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate, 2014. https://arxiv.org/abs/1409.0473

keras-attention's People

Contributors

luux, philipperemy


keras-attention's Issues

get_activations does not work with multi-input data

Here is the error message

    layer_name='attention_vec')[0], axis=2).squeeze()
  File "/Users/yu/proj/cancel_blame/code/src/lib/attention/attention_utils.py", line 16, in get_activations
    layer_outputs = [func([inputs, 1.])[0] for func in funcs]
  File "/Users/yu/proj/cancel_blame/code/src/lib/attention/attention_utils.py", line 16, in <listcomp>
    layer_outputs = [func([inputs, 1.])[0] for func in funcs]
  File "/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2666, in __call__
    return self._call(inputs)
  File "/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2619, in _call
    dtype=tf.as_dtype(tensor.dtype).as_numpy_dtype))
AttributeError: 'list' object has no attribute 'dtype'

Loading model problems

When I try to load a saved model, I get the following error: "A Dot layer should be called on a list of 2 inputs".

attention_lstm.py does not work for Theano backend

When I run the script attention_lstm.py, there is a problem at line 17:
"input_dim=int(inputs.shape[2])"
"TypeError: int() argument must be a string or a number, not 'TensorVariable'"

Attention not working for MLP

I need to add attention to the following model. It works perfectly for an LSTM model, but here I get the error below:

def get_ANN_attention_model(num_hidden_layers, num_neurons_per_layer, dropout_rate, activation_func, train_X):
    with tf.device('/gpu:0'):
        model_input = tf.keras.Input(shape=(train_X.shape[1]))  # input layer.
        for i in range(num_hidden_layers):
            x = layers.Dense(num_neurons_per_layer,activation=activation_func,bias_regularizer=L1L2(l1=0.0, l2=0.0001),activity_regularizer=L1L2(1e-5,1e-4))(model_input)
            x = layers.Dropout(dropout_rate)(x)
            x = Attention(num_hidden_layers)(x)
        outputs = layers.Dense(1, activation='linear')(x)
        model = tf.keras.Model(inputs=model_input, outputs=outputs)
        model.summary()
    return model

ERROR
hidden_size = int(hidden_states.shape[2])
File "C:\Users\bhask\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\framework\tensor_shape.py", line 896, in getitem
return self._dims[key].value
IndexError: list index out of range

Your code is outdated!

Your code doesn't work with newer versions of Keras. To fix it, change these lines in "attention_dense.py":

  1. "from keras.layers import Input, Dense, merge" to "from keras.layers import Input, Dense, multiply";

  2. "attention_mul = merge([inputs, attention_probs], output_shape=32, name='attention_mul', mode='mul')" to "attention_mul = multiply([inputs, attention_probs], name='attention_mul')";

and in "attention_lstm.py":

  1. import multiply as well;

  2. change "output_attention_mul = merge([inputs, a_probs], name='attention_mul', mode='mul')" to "output_attention_mul = multiply([inputs, a_probs], name='attention_mul')".

Many to many sequence generation

Can you give an example of how to use this for many to many sequence generation with different input and output lengths (greater than 1)? For example, if we have input of 10 timesteps say [1,2,3,4,5,6,7,8,9,10] and we want to generate output [1,10].

Some confusions

[image: attention_luong diagram]

Hello, thanks for code that is easy to read. But I have some confusions.

  1. Your attention function takes the hidden states of the input, i.e. the LSTM outputs from the encoder, and then does all the processing on them. But according to what I have read, the score should also involve the hidden state of the target, like in the attached picture. Why haven't you done that? Otherwise you are just building an LSTM function manually.

  2. Why have you used Permute layers before the softmax layer?

  3. Why have you averaged the outputs of the softmax layer?

why Permute before attention dense layer in attention_3d_block?

    a = Permute((2, 1))(inputs)
    a = Dense(TIME_STEPS, activation='softmax')(a)

Why do you permute time_steps and input_dim on this line?
What if I don't permute and instead follow with a dense layer over input_dim? Since the dense layer currently has shape "time_steps * time_steps", what is the difference if I change it to "input_dim * input_dim", i.e.
Dense(input_dim, activation='softmax')(a)?

Visualizing attention weights with input arrays

When predicting on test data with the trained model, how can I visualize the attention weights? I'd like to see which areas the model designates as important.

For reference, my input data is usually of shape (100, 900, 4) with 3 output classification options.

Thanks!

SINGLE_ATTENTION_VECTOR = false

Do you have a reference paper for SINGLE_ATTENTION_VECTOR = False?

As far as I know, most papers set SINGLE_ATTENTION_VECTOR = True.

use attention_3d_block in many to many mapping

Hi, I'm a beginner with Keras and am trying to use attention_3d_block in a translation module.
I have an input of 5 sentences; each sentence is padded to 6 words, and each word is represented in 620 dimensions (the embedding dim).
The output is 5 sentences, each padded to 9 words, with each word one-hot encoded in 30 dimensions (the vocabulary size).
How do I use attention_3d_block in this scenario, where the LSTM is many-to-many?

A question about your code

In your code you want to pay more attention to the 10th step, and your experimental results seem to confirm it. But the code does not seem to focus on the 10th step. Please look at the following code:

score_first_part = Dense(hidden_size, use_bias=False, name='attention_score_vec')(hidden_states)
# score_first_part dot last_hidden_state => attention_weights
# (batch_size, time_steps, hidden_size) dot (batch_size, hidden_size) => (batch_size, time_steps)
h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state')(hidden_states)
score = dot([score_first_part, h_t], [2, 1], name='attention_score')

The way you calculate 'score' is score_first_part dot h_t, and the way you get h_t is h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state'). In my view, 'lambda x: x[:, -1, :]' means you choose the last step of the time sequence; in other words, you pay more attention to the 20th step (in your code you define TIME_STEPS = 20).
So, if my understanding is right, you should change your code to h_t = Lambda(lambda x: x[:, 9, :], output_shape=(hidden_size,), name='last_hidden_state').
Of course, my understanding may be wrong. I am looking forward to your reply.
Thank you.

Interpreting attention weights for more than one input feature

How can we get attention weights for each input feature when our input consists of multiple features?
I am getting only one array of attention weights and I am not sure how to interpret it for multiple inputs.

The shape of the attention weights (attached as a figure) is (300, 6),
where 6 is the sequence length / lookback / time steps.

[image: attention_weight]

Attention when using more than one feature

Hi Philippe,
Your attention example has 1 feature (2000, 20, 1); my dataset has 60 features (200, 1000, 60). In that case, do I have to do something different from what you do in your example?

Thank you!

Attention Visualization

In the final visualization of the attention weights, it says the plot shows the attention over the input dimensions, but the x axis runs over the time steps. So it shows how important each time step is, not each feature. Shouldn't it be the other way around, with each x being a feature?

When I apply this to my own dataset, it just says that the most recent time steps are the most important.

Hidden state parameter: what should really be passed?

Hi, thanks for the implementation!
I have been trying to run this code:

model = Sequential()
model.add(Embedding(300000, 100, input_length=250))
model.add(LSTM(units=250, return_sequences=True, dropout=0.1, recurrent_dropout=0.2))
model.add(attention_3d_block())
model.add(Flatten())
model.add(Dense(200, activation='relu'))
model.add(Dense(3, activation='softmax'))

but I get the error TypeError: attention_3d_block() missing 1 required positional argument: 'hidden_states'.
I tried to explore the documentation, but I couldn't understand what should really be passed there.

One to One keras model with Attention in Keras

Hello,

I have a Keras model with a sequence of inputs and a sequence of outputs, where each input has an associated output (label), e.g. part-of-speech (POS) tagging.

Seq_in[0][0:3]
array([[15],[28], [23]])

Seq_out[0][0:3]
array([[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]],
dtype=float32)

I want to build attention on top of the LSTM layer. I am following "Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification", Zhou et al., 2016.

X_train, X_val, Y_train, Y_val = train_test_split(Seq_in,Seq_out, test_size=0.20)

TIME_STEPS = 500
INPUT_DIM = 1
lstm_units = 256

inputs = Input(shape=(TIME_STEPS, INPUT_DIM))

activations = Bidirectional(LSTM(lstm_units, return_sequences=True))(inputs) # First bidirectional layer
activations = Dropout(0.2)(activations)
activations = Bidirectional(LSTM(lstm_units, return_sequences=True))(activations) # Second bidirectional layer
activations = Dropout(0.2)(activations)
attention = Dense(1, activation='tanh')(activations) # This is equation (9) in the paper: squashing each output state vector to a scalar.
attention = Flatten()(attention)
attention = Activation('softmax')(attention) # This is equation (10) in the paper.
attention = RepeatVector(512)(attention) # Repeating the softmax vector to have the same dimension as the output state vector (512).
attention = Permute([2, 1])(attention) # permute

sent_representation = multiply([activations, attention]) # multiply the attention vector with the output state vectors element-wise.
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-1))(sent_representation) # summation over all output state vectors
sent_representation = RepeatVector(TIME_STEPS)(sent_representation) # repeat the vector to match the number of time steps
sent_representation = concatenate([activations, sent_representation]) # concatenate the sentence representation to the output states

output = Dense(15, activation='softmax')(sent_representation)#(out_attention_mul) # Find the softmax for the current label
model = Model(inputs=inputs, outputs=output)

sgd = optimizers.SGD(lr=.1,momentum=0.9,decay=1e-3,nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
model.fit(X_train,Y_train,epochs=2, validation_data=(X_val, Y_val),verbose=1)


Layer (type) Output Shape Param # Connected to

input_1 (InputLayer) (None, 500, 1) 0


bidirectional_1 (Bidirectional) (None, 500, 512) 528384 input_1[0][0]


dropout_1 (Dropout) (None, 500, 512) 0 bidirectional_1[0][0]


bidirectional_2 (Bidirectional) (None, 500, 512) 1574912 dropout_1[0][0]


dropout_2 (Dropout) (None, 500, 512) 0 bidirectional_2[0][0]


dense_1 (Dense) (None, 500, 1) 513 dropout_2[0][0]


flatten_1 (Flatten) (None, 500) 0 dense_1[0][0]


activation_1 (Activation) (None, 500) 0 flatten_1[0][0]


repeat_vector_1 (RepeatVector) (None, 512, 500) 0 activation_1[0][0]


permute_1 (Permute) (None, 500, 512) 0 repeat_vector_1[0][0]


multiply_1 (Multiply) (None, 500, 512) 0 dropout_2[0][0]
permute_1[0][0]


lambda_1 (Lambda) (None, 500) 0 multiply_1[0][0]


repeat_vector_2 (RepeatVector) (None, 500, 500) 0 lambda_1[0][0]


concatenate_1 (Concatenate) (None, 500, 1012) 0 dropout_2[0][0]
repeat_vector_2[0][0]


dense_2 (Dense) (None, 500, 15) 15195 concatenate_1[0][0]

Total params: 2,119,004
Trainable params: 2,119,004
Non-trainable params: 0


I think this code does what the paper does, except that the concatenate step attaches the same attention-weighted representation to all the output state vectors and does not change it per time step, i.e. per output label.
So I think that, for each time step's output, I have to do something so that the attention weights differ. Am I right?
Any help is appreciated.

Thanks in advance

Questions on implementation details

Update on 2019/2/14, nearly one year later:

The implementation in this repo is definitely bugged. Please refer to my implementation in a reply below for the correction. My version has been working in our product since this thread started, and it outperforms both the vanilla LSTM without attention and the incorrect version in this repo by a significant margin. I am not the only one raising this question.

Both this repo and my version of attention are intended for sequence-to-one networks (although it can be easily tweaked for seq2seq by replacing h_t with current state of the decoder step). If you are looking for a ready-to-use attention for sequence-to-sequence networks, check this out: https://github.com/farizrahman4u/seq2seq.

============Original answer==============

I am currently working on a text generation task and learnt attention from the TensorFlow tutorials. The implementation details seem quite different from your code.

This is how the TensorFlow tutorial describes the process:

[images: the attention equations from the TensorFlow tutorial]

If I am understanding it correctly, all learnable parameters in the attention mechanism are stored in W, which has a shape of (rnn_size, rnn_size) (rnn_size is the size of the hidden state). So first you need to use W to calculate the score of each hidden state based on the values of h_t and h_s, but I am not seeing h_t anywhere in your code. Instead, you applied a dense layer on all h_s, which means pre_act (Edit: h_t should be h_s in this equation) becomes the score in the paper. This seems wrong.

In the next step you element-wise multiply the attention weights with the hidden states as in equation (2), and then somehow skip equation (3).

I noticed the tutorial is about Seq2Seq (Encoder-Decoder) model and your code is an RNN. Maybe that is why your code is different. Do you have any source on how attention is applied to a non Seq2Seq network?
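
For reference, here are the standard global (Luong-style) attention steps that the text above paraphrases (a summary of the textbook formulation, not a quote from the tutorial): the score $score(h_t, \bar{h}_s) = h_t^\top W \bar{h}_s$, the attention weights $\alpha_{ts} = softmax_s(score(h_t, \bar{h}_s))$, the context vector $c_t = \sum_s \alpha_{ts} \bar{h}_s$, and finally the attention vector $a_t = \tanh(W_c [c_t; h_t])$.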

Here is your code:

def attention_3d_block(inputs):
    # inputs.shape = (batch_size, time_steps, input_dim)
    input_dim = int(inputs.shape[2])
    a = Permute((2, 1))(inputs)
    a = Reshape((input_dim, TIME_STEPS))(a) # this line is not useful. It's just to know which dimension is what.
    a = Dense(TIME_STEPS, activation='softmax')(a)
    if SINGLE_ATTENTION_VECTOR:
        a = Lambda(lambda x: K.mean(x, axis=1), name='dim_reduction')(a)
        a = RepeatVector(input_dim)(a)
    a_probs = Permute((2, 1), name='attention_vec')(a)
    output_attention_mul = merge([inputs, a_probs], name='attention_mul', mode='mul')
    return output_attention_mul


def model_attention_applied_after_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    attention_mul = Flatten()(attention_mul)
    output = Dense(1, activation='sigmoid')(attention_mul)
    model = Model(input=[inputs], output=output)
    return model

What is the logic behind the attention layer?

I would like to understand, intuitively or theoretically, how the attention layer reflects what the model attends to for a prediction.
Because it would be easy for the model to give equal weight to each input feature in the attention layer, and that would defeat the purpose of the attention layer.

TypeError: 'module' object is not callable

At this line:

output_attention_mul = merge([inputs, a_probs], name='attention_mul', mode='mul')

the following error happens:

TypeError: 'module' object is not callable

Not sure what is wrong. Could you help to resolve it?

Output with multiple time steps

Hi,

Can this be used for predicting an output with multiple time steps?
If not, how can the code be changed to accommodate this? Thanks.

possible bug in attention_lstm.py

lines 56-59 should be

if APPLY_ATTENTION_BEFORE_LSTM:
  m = model_attention_applied_before_lstm()
else:
  m = model_attention_applied_after_lstm()

IndexError: list index out of range

Dear sir, when I run python attention_dense.py, the following error shows up:

----- activations -----
Traceback (most recent call last):
File "attention_dense.py", line 39, in
attention_vector = get_activations(m, testing_inputs_1, print_shape_only=True)[1].flatten()
IndexError: list index out of range

Would you please help me? Thank you very much!

attention_lstm.py and Tensorflow

In attention_3d_block, I have some questions and (I think) a bug. I am running on TensorFlow.
(1) inputs doesn't have a shape method. So it crashes. I assume you meant to call the shape function on the numpy array on inputs_1.
(2) Is there a reason for calling Permute?
(3) What is the Reshape layer supposed to do? After the call to Permute, isn't the output of the previous permute layer already in shape (Batch Size, input_dim, TIME_STEPS)?
(4) The next call to Dense expects ndim =2, not 3. So the code crashes for me. I assume you meant the previous Reshape layer to map the 3d input to 2d?
(5) I would just like to point out that APPLY_ATTENTION_BEFORE_LSTM is False iff you call model_attention_applied_before_lstm.

Add guidance to README to use Functional API for saving models that use this layer

Hi there!

Thanks so much for implementing this and all of the other work that you do!

I ran into an issue loading a model that uses the Attention layer in a Sequential model. The Attention layer is defined using the Functional API, and Keras does not like it when you try to load a mixed model.

Specifically, my error was

m = keras.models.load_model('saved_mixed_model_path',
            custom_objects = { 'Attention': Attention}
           )

=> ValueError: A merge layer should be called on a list of inputs.

To solve this, I had to convert my model to one that uses the functional API and retrain.

Part of my confusion stems from the examples, where both the Sequential and Functional APIs are used. In this example you successfully save and load a model using only the Functional API, but in the LSTM example the Sequential API is used and no loading/saving is done.

Could a caveat be added to the README.md saying that if you plan to load/save these models, only the Functional API should be used when building the model that uses the Attention layer?

Cheers

How to do Stacked LSTM with attention using this framework ?

Hello,

I have run your code successfully.

I have also included a stacked LSTM in your code:

def model_attention_applied_before_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    attention_mul = attention_3d_block(inputs)
    lstm_units = 32
    attention_mul = LSTM(lstm_units, return_sequences=True)(attention_mul)
    attention_mul = LSTM(lstm_units, return_sequences=False)(attention_mul)
    output = Dense(1, activation='sigmoid')(attention_mul)
    model = Model(input=[inputs], output=output)
    return model

But maybe this is not the correct way to apply a stacked LSTM with attention, right?

My ultimate goal is to include attention in this code (classification of multivariate time series):


class LSTMNet:
    @staticmethod
    def build(timeSteps, variables, classes):
        inputNet = Input(shape=(timeSteps, variables))
        lstm = Bidirectional(GRU(100, recurrent_dropout=0.4, dropout=0.4, return_sequences=True), merge_mode='concat')(inputNet)
        lstm = Bidirectional(GRU(50, recurrent_dropout=0.4, dropout=0.4, return_sequences=True), merge_mode='concat')(lstm)
        lstm = Bidirectional(GRU(20, recurrent_dropout=0.4, dropout=0.4, return_sequences=False), merge_mode='concat')(lstm)
        # a softmax classifier
        classificationLayer = Dense(classes, activation='softmax')(lstm)
        model = Model(inputNet, classificationLayer)
        return model

Thanks in advance for any possible info

get_activations not producing list

Thanks for uploading this to GitHub! It is great for learning more about attention models. When I run attention_dense.py, however, I get this error (after the model finishes training):


IndexError Traceback (most recent call last)
in ()
37 # Attention vector corresponds to the second matrix.
38 # The first one is the Inputs output.
---> 39 attention_vector = get_activations(m, testing_inputs_1, print_shape_only=True)[1].flatten()
40 print('attention =', attention_vector)
41

IndexError: list index out of range

Any idea why the get_activations function isn't working properly?

Weird attention weights when adding a sequence of numbers

I am trying to slightly modify your example of adding numbers so that the target is the sum of all the numbers in the sequence before the delimiter. Below is the modified code:

# Imports reconstructed for completeness (they were not shown in the issue).
# The imports for SelfAttention and get_activations are likewise not shown and are omitted here;
# in the repository's add_two_numbers example they presumably come from the attention package and keract.
import os
import sys

import matplotlib.pyplot as plt
import numpy
import numpy as np
from tensorflow.keras.callbacks import Callback
from tensorflow.keras.layers import Dense, Dropout, LSTM
from tensorflow.keras.models import Sequential


def add_numbers_before_delimiter(n: int, seq_length: int, delimiter: float = 0.0,
                                         index_1: int = None) -> (np.array, np.array):
    """
    Task: Add all the numbers that come before the delimiter.
    x = [1, 2, 3, 0, 4, 5, 6, 7, 8, 9]. Result is y =  6.
    @param n: number of samples in (x, y).
    @param seq_length: length of the sequence of x.
    @param delimiter: value of the delimiter. Default is 0.0
    @param index_1: index of the number that comes after the first 0.
    @return: returns two numpy.array x and y of shape (n, seq_length, 1) and (n, 1).
    """
    x = np.random.uniform(0, 1, (n, seq_length))
    y = np.zeros(shape=(n, 1))
    for i in range(len(x)):
        if index_1 is None:
            a = np.random.choice(range(1, len(x[i])), size=1, replace=False)
        else:
            a = index_1
        y[i] =  np.sum(x[i, 0:a])
        x[i, a] = delimiter

    x = np.expand_dims(x, axis=-1)
    return x, y


def main():
    numpy.random.seed(7)

    # data. definition of the problem.
    seq_length = 20
    x_train, y_train = add_numbers_before_delimiter(20_000, seq_length)
    x_val, y_val = add_numbers_before_delimiter(4_000, seq_length)

    # just arbitrary values. it's for visual purposes. easy to see than random values.
    test_index_1 = 4
    x_test, _ = add_numbers_before_delimiter(10, seq_length, 0, test_index_1)
    # x_test_mask is just a mask that, if applied to x_test, would still contain the information to solve the problem.
    # we expect the attention map to look like this mask.
    x_test_mask = np.zeros_like(x_test[..., 0])
    x_test_mask[:, test_index_1:test_index_1 + 1] = 1

    model = Sequential([
        LSTM(100, input_shape=(seq_length, 1), return_sequences=True),
        SelfAttention(name='attention_weight'),
        Dropout(0.2),
        Dense(1, activation='linear')
    ])

    model.compile(loss='mse', optimizer='adam')
    print(model.summary())

    output_dir = 'task_add_two_numbers'
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    max_epoch = int(sys.argv[1]) if len(sys.argv) > 1 else 200

    class VisualiseAttentionMap(Callback):

        def on_epoch_end(self, epoch, logs=None):
            attention_map = get_activations(model, x_test, layer_names='attention_weight')['attention_weight']

            # top is attention map.
            # bottom is ground truth.
            plt.imshow(np.concatenate([attention_map, x_test_mask]), cmap='hot')

            iteration_no = str(epoch).zfill(3)
            plt.axis('off')
            plt.title(f'Iteration {iteration_no} / {max_epoch}')
            plt.savefig(f'{output_dir}/epoch_{iteration_no}.png')
            plt.close()
            plt.clf()

    model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=max_epoch,
              batch_size=64, callbacks=[VisualiseAttentionMap()])


if __name__ == '__main__':
    main()

I was expecting the model to focus on all the values in the x_test sequence before index 4. However, as you can see in the gif, the model focuses on just one point. Can you please point out where I am making a mistake?

Thanks in advance.

[gif: add_numbers attention maps]

Restricting attention weights to domain

In my application, the attention weights are centering on locations which are indicative of a subset of the classes. Therefore, while the algorithm performs well on this subset, it sometimes misclassifies on the other classes because the attention weights cause the obvious differences to be considered "residual".

Is there a documented way of restricting the attention weights to a certain value or index domain to enforce constraints on its focus? This question makes me think of NLP problems where frameworks commonly pair ML methodologies with a set of predetermined rules (usually defined with spacy).

Any thoughts? Thanks in advance.

2D attention

@philipperemy

Do you know how I can apply the attention module to a 2D-shaped input? I would like to apply attention after the LSTM layer:

Layer (type)                    Output Shape         Param #     Connected to                     
features (InputLayer)           (None, 16, 1816)     0                                            
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 2048)         31662080    features[0][0]                   
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 1024)         2098176     lstm_1[0][0]                     
__________________________________________________________________________________________________
leaky_re_lu_2 (LeakyReLU)       (None, 1024)         0           dense_2[0][0]                    
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 120)          123000      leaky_re_lu_2[0][0]              
__________________________________________________________________________________________________
feature_weights (InputLayer)    (None, 120)          0                                            
__________________________________________________________________________________________________
multiply_1 (Multiply)           (None, 120)          0           dense_3[0][0]                    
                                                                 feature_weights[0][0]            

Total params: 33,883,256
Trainable params: 33,883,256
Non-trainable params: 0
__________________________________________________________________________________________________

I would really appreciate your suggestion on how to modify the attention_3d_block to make it work for a 2D input as well. Thanks.

bucketing problem

My sequences have varying lengths and I’m using bucketing to solve the issue. Therefore I define the LSTM input shape as (None, None, features), i.e. there are no explicit timesteps. I wonder if the code can fit my input? Thanks.

fig

Hi, I am wondering about the figures in your markdown.
What app did you use to create these beautiful hand-drawn figures?
Thanks!

What does h_t mean in the Attention model?

Hi there!
Thanks so much for implementing this and all of the other work that you do!
I want to know the meaning of h_t, i.e. h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state')(hidden_states). In Luong's paper, h_t was used as the input hidden state. But how should it be interpreted in a setting which is not seq2seq?

Attention Mechanism not working

Hi,
I have added an attention layer (following the example) to my simple LSTM network shown below.

timestep = timesteps
features = 11
model = Sequential()
model.add(LSTM(64, input_shape=(timestep,features), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(32, return_sequences=True))
model.add(LSTM(16, return_sequences=True))
model.add(Attention(32))
model.add(Dense(32))
model.add(Dense(16))
model.add(Dense(1))
print(model.summary())
The code worked fine until last week, and I got a model summary with the attention layer details like this:

[image: model summary with attention layer]

However, now running the same code gives me a weird error.
ValueError: tf.function-decorated function tried to create variables on non-first call.

What I noticed is that the model summary has changed too:
[image: new model summary]

I am tight on time due to an upcoming deadline. Any assistance would be highly appreciated.
P.S. This was a fully working model that has stopped working all of a sudden for no apparent reason.

get_config

Hi,
Do you perhaps have another implementation with a get_config function for saving the model in Keras? I have been trying, but I always get this error:
raise ValueError('A Dot layer should be called '

ValueError: A Dot layer should be called on a list of 2 inputs.

Thanks!
