philipperemy / keras-attention

Keras Attention Layer (Luong and Bahdanau scores).

License: Apache License 2.0

Python 100.00%
keras keras-neural-networks attention-mechanism attention-model deep-learning

keras-attention's Introduction

Keras Attention Layer


Attention Layer for Keras. Supports the score functions of Luong and Bahdanau.

Tested with TensorFlow 2.8, 2.9, 2.10, 2.11, 2.12, 2.13 and 2.14 (as of Sep 26, 2023).

Installation

PyPI

pip install attention

Attention Layer

Attention(
    units=128,
    score='luong',
    **kwargs
)

Arguments

  • units: Integer. The number of (output) units in the attention vector ($a_t$).

  • score: String. The score function $score(h_t, \bar{h}_s)$. Possible values are luong or bahdanau (both forms are written out below).

    • Luong's multiplicative style. Link to paper.
    • Bahdanau's additive style. Link to paper.
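
For reference, the two score functions as summarized in Luong et al. (2015), where $W_a$ and $v_a$ are learned weights:

  • luong (multiplicative): $score(h_t, \bar{h}_s) = h_t^\top W_a \bar{h}_s$
  • bahdanau (additive): $score(h_t, \bar{h}_s) = v_a^\top \tanh(W_a [h_t; \bar{h}_s])$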

Input shape

3D tensor with shape (batch_size, timesteps, input_dim).

Output shape

  • 2D tensor with shape (batch_size, num_units) ($a_t$).

If you want to visualize the attention weights, refer to the example in examples/add_two_numbers.py.
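
A minimal sketch of the pattern that example uses to pull the weights out with keract, assuming a trained model whose attention layer was given the name 'attention_weight' (as in the repository's examples) and a test batch x_test:

import matplotlib.pyplot as plt
from keract import get_activations

# model: a trained Keras model containing a layer named 'attention_weight' (assumption, see examples).
# x_test: a batch of test sequences of shape (num_samples, time_steps, input_dim).
attention_map = get_activations(model, x_test, layer_names='attention_weight')['attention_weight']
plt.imshow(attention_map, cmap='hot')  # one row per sample, one column per time step
plt.show()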

Example

import numpy as np
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.models import load_model, Model

from attention import Attention


def main():
    # Dummy data. There is nothing to learn in this example.
    num_samples, time_steps, input_dim, output_dim = 100, 10, 1, 1
    data_x = np.random.uniform(size=(num_samples, time_steps, input_dim))
    data_y = np.random.uniform(size=(num_samples, output_dim))

    # Define/compile the model.
    model_input = Input(shape=(time_steps, input_dim))
    x = LSTM(64, return_sequences=True)(model_input)
    x = Attention(units=32)(x)
    x = Dense(1)(x)
    model = Model(model_input, x)
    model.compile(loss='mae', optimizer='adam')
    model.summary()

    # train.
    model.fit(data_x, data_y, epochs=10)

    # test save/reload model.
    pred1 = model.predict(data_x)
    model.save('test_model.h5')
    model_h5 = load_model('test_model.h5', custom_objects={'Attention': Attention})
    pred2 = model_h5.predict(data_x)
    np.testing.assert_almost_equal(pred1, pred2)
    print('Success.')


if __name__ == '__main__':
    main()

Other Examples

Browse examples.

Install the requirements before running the examples: pip install -r examples/examples-requirements.txt.

IMDB Dataset

In this experiment, we demonstrate that using attention yields a higher accuracy on the IMDB dataset. We consider two LSTM networks: one with this attention layer and the other one with a fully connected layer. Both have the same number of parameters for a fair comparison (250K).

Here are the results on 10 runs. For every run, we record the max accuracy on the test set for 10 epochs.

Measure            No Attention (250K params)   Attention (250K params)
MAX Accuracy       88.22                        88.76
AVG Accuracy       87.02                        87.62
STDDEV Accuracy    0.18                         0.14

As expected, the model with attention gets a boost in accuracy. It also reduces the variability across runs, which is a nice property to have.
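
For context, here is a sketch of the kind of model pair being compared. The layer sizes below are illustrative, not the exact 250K-parameter configurations (those are in the repository's examples):

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Embedding, Flatten, LSTM

from attention import Attention


def build_imdb_model(use_attention: bool, vocab_size: int = 20000, maxlen: int = 200):
    i = Input(shape=(maxlen,))
    x = Embedding(vocab_size, 64)(i)
    x = LSTM(64, return_sequences=True)(x)
    if use_attention:
        x = Attention(units=32)(x)  # attention head: (batch, 32)
    else:
        x = Flatten()(x)  # fully connected head of comparable size
        x = Dense(32, activation='relu')(x)
    x = Dense(1, activation='sigmoid')(x)
    model = Model(i, x)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model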

Adding two numbers

Let's consider the task of adding two numbers that come right after some delimiters (0 in this case):

x = [1, 2, 3, 0, 4, 5, 6, 0, 7, 8]. Result is y = 4 + 7 = 11.
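
A rough sketch of how such data can be generated (illustrative only; the repository's own generator in examples/add_two_numbers.py is the reference):

import numpy as np


def add_two_numbers_data(n: int, seq_length: int = 10, delimiter: float = 0.0):
    # Random sequences; two positions are turned into delimiters and the
    # numbers right after each delimiter are summed to form the target.
    x = np.random.uniform(1, 9, size=(n, seq_length))
    y = np.zeros((n, 1))
    for i in range(n):
        d1 = np.random.randint(0, seq_length - 3)
        d2 = np.random.randint(d1 + 2, seq_length - 1)
        x[i, d1] = delimiter
        x[i, d2] = delimiter
        y[i] = x[i, d1 + 1] + x[i, d2 + 1]
    return np.expand_dims(x, axis=-1), y  # shapes: (n, seq_length, 1), (n, 1)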

The attention is expected to be the highest after the delimiters. An overview of the training is shown below, where the top represents the attention map and the bottom the ground truth. As the training progresses, the model learns the task and the attention map converges to the ground truth.

Finding max of a sequence

We consider many 1D sequences of the same length. The task is to find the maximum of each sequence.

We give the full sequence processed by the RNN layer to the attention layer. We expect the attention layer to focus on the maximum of each sequence.

After a few epochs, the attention layer converges perfectly to what we expected.
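
A sketch of this setup (sizes are illustrative, not the repository's exact example):

import numpy as np
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, LSTM

from attention import Attention

seq_length = 10
x = np.random.uniform(size=(1000, seq_length, 1))
y = x.max(axis=1)  # target: the maximum of each sequence

i = Input(shape=(seq_length, 1))
h = LSTM(64, return_sequences=True)(i)  # the full sequence is fed to the attention layer
h = Attention(units=32)(h)
o = Dense(1)(h)
model = Model(i, o)
model.compile(loss='mae', optimizer='adam')
model.fit(x, y, epochs=5)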

References

  • Minh-Thang Luong, Hieu Pham, Christopher D. Manning. Effective Approaches to Attention-based Neural Machine Translation, 2015. https://arxiv.org/abs/1508.04025
  • Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate, 2014. https://arxiv.org/abs/1409.0473

keras-attention's People

Contributors

luux, philipperemy


keras-attention's Issues

get_activations does not work with multi-input data

Here is the error message

    layer_name='attention_vec')[0], axis=2).squeeze()
  File "/Users/yu/proj/cancel_blame/code/src/lib/attention/attention_utils.py", line 16, in get_activations
    layer_outputs = [func([inputs, 1.])[0] for func in funcs]
  File "/Users/yu/proj/cancel_blame/code/src/lib/attention/attention_utils.py", line 16, in <listcomp>
    layer_outputs = [func([inputs, 1.])[0] for func in funcs]
  File "/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2666, in __call__
    return self._call(inputs)
  File "/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2619, in _call
    dtype=tf.as_dtype(tensor.dtype).as_numpy_dtype))
AttributeError: 'list' object has no attribute 'dtype'

Loading model problems

When I try to load a saved model, I get the following error: "A Dot layer should be called on a list of 2 inputs".

attention_lstm.py does not work for Theano backend

When I run the script attention_lstm.py, there is a problem at line 17:
"input_dim=int(inputs.shape[2])"
"TypeError: int() argument must be a string or a number, not 'TensorVariable'"

Attention not working for MLP

I need to add attention to the following model. It works perfectly for an LSTM model, but here I get the error below:

def get_ANN_attention_model(num_hidden_layers, num_neurons_per_layer, dropout_rate, activation_func, train_X):
    with tf.device('/gpu:0'):
        model_input = tf.keras.Input(shape=(train_X.shape[1]))  # input layer.
        for i in range(num_hidden_layers):
            x = layers.Dense(num_neurons_per_layer,activation=activation_func,bias_regularizer=L1L2(l1=0.0, l2=0.0001),activity_regularizer=L1L2(1e-5,1e-4))(model_input)
            x = layers.Dropout(dropout_rate)(x)
            x = Attention(num_hidden_layers)(x)
        outputs = layers.Dense(1, activation='linear')(x)
        model = tf.keras.Model(inputs=model_input, outputs=outputs)
        model.summary()
    return model

ERROR
hidden_size = int(hidden_states.shape[2])
File "C:\Users\bhask\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\framework\tensor_shape.py", line 896, in getitem
return self._dims[key].value
IndexError: list index out of range

Your code is outdated!

Your code doesn't work with newer versions of Keras. To fix it, change these lines in "attention_dense.py":

  1. "from keras.layers import Input, Dense, merge" to "from keras.layers import Input, Dense, multiply";

  2. "attention_mul = merge([inputs, attention_probs], output_shape=32, name='attention_mul', mode='mul')" to "attention_mul = multiply([inputs, attention_probs], name='attention_mul')";

and in "attention_lstm.py":

  1. import multiply as well;

  2. change "output_attention_mul = merge([inputs, a_probs], name='attention_mul', mode='mul')" to "output_attention_mul = multiply([inputs, a_probs], name='attention_mul')".

Many to many sequence generation

Can you give an example of how to use this for many to many sequence generation with different input and output lengths (greater than 1)? For example, if we have input of 10 timesteps say [1,2,3,4,5,6,7,8,9,10] and we want to generate output [1,10].

Some confusions

[image: attention_luong diagram]

Hello, thanks for code that is easy to read. But I have some confusions.

  1. Your attention function takes the hidden states of the input, i.e. the LSTM outputs from the encoder, and then does all the processing on them. But according to what I have read, the score should also involve the hidden state of the target, like in the attached picture. Why haven't you done that? Otherwise you are just building an LSTM function manually.

  2. Why have you used Permute layers before the softmax layer?

  3. Why have you averaged the outputs of the softmax layer?

why Permute before attention dense layer in attention_3d_block?

    a = Permute((2, 1))(inputs)
    a = Dense(TIME_STEPS, activation='softmax')(a)

Why do you permute time_steps and input_dim on this line?
What if I don't permute and instead follow with a dense layer over input_dim? Since the dense layer currently has shape "time_steps * time_steps", what is the difference if I change it to "input_dim * input_dim", i.e.
Dense(input_dim, activation='softmax')(a)?

Visualizing attention weights with input arrays

When predicting on test data with the trained model, how can I visualize the attention weights? I'd like to see which areas the model designates as important.

For reference, my input data is usually of shape (100, 900, 4) with 3 output classification options.

Thanks!

SINGLE_ATTENTION_VECTOR = false

Do you have a reference paper for SINGLE_ATTENTION_VECTOR = False?

As far as I know, most papers set SINGLE_ATTENTION_VECTOR = True.

use attention_3d_block in many to many mapping

Hi, I'm a beginner with Keras and am trying to use attention_3d_block in a translation module.
I have an input of 5 sentences; each sentence is padded to 6 words, and each word is represented in 620 dimensions (the embedding dim).
The output is 5 sentences, each padded to 9 words, with each word one-hot encoded in 30 dimensions (the vocabulary size).
How do I use attention_3d_block in this scenario, where the LSTM is many-to-many?

A question about your code

In your code you want to pay more attention to the 10th step, and your experimental results seem to confirm it. But the code does not seem to focus on the 10th step. Please look at the following code:

score_first_part = Dense(hidden_size, use_bias=False, name='attention_score_vec')(hidden_states)
# score_first_part dot last_hidden_state => attention_weights
# (batch_size, time_steps, hidden_size) dot (batch_size, hidden_size) => (batch_size, time_steps)
h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state')(hidden_states)
score = dot([score_first_part, h_t], [2, 1], name='attention_score')

The way you calculate 'score' is score_first_part dot h_t, and the way you get h_t is h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state'). In my view, 'lambda x: x[:, -1, :]' means you choose the last step of the time sequence; in other words, you pay more attention to the 20th step (in your code you define TIME_STEPS = 20).
So, if my understanding is right, you should change your code to h_t = Lambda(lambda x: x[:, 9, :], output_shape=(hidden_size,), name='last_hidden_state').
Of course, my understanding may be wrong. I am looking forward to your reply.
Thank you.

Interpreting attention weights for more than one input feature

How can we get attention weights for each input feature when our input consists of multiple features?
I am getting only one array of attention weights and I am not sure how to interpret it for multiple inputs.

The shape of the attention weights (attached as a figure) is (300, 6),
where 6 is the sequence length / lookback / time steps.

[image: attention_weight]

Attention when using more than one feature

Hi Philippe,
Your attention example has 1 feature (2000, 20, 1); my dataset has 60 features (200, 1000, 60). In that case, do I have to do something different from what you do in your example?

Thank you!

Attention Visualization

In the final visualization of the attention weights, it says the plot shows the attention over the input dimensions, but the x axis runs over the time steps. So it shows how important each time step is, not each feature. Shouldn't it be the other way around, with each x being a feature?

When I apply this to my own dataset, it just says that the most recent time steps are the most important.

Hidden state parameter: what should really be passed?

Hi, thanks for the implementation!
I have been trying to run this code:

model = Sequential()
model.add(Embedding(300000, 100, input_length=250))
model.add(LSTM(units=250, return_sequences=True, dropout=0.1, recurrent_dropout=0.2))
model.add(attention_3d_block())
model.add(Flatten())
model.add(Dense(200, activation='relu'))
model.add(Dense(3, activation='softmax'))

but I get the error TypeError: attention_3d_block() missing 1 required positional argument: 'hidden_states'.
I tried to explore the documentation, but I couldn't understand what should really be passed there.

One to One keras model with Attention in Keras

Hello,

I have a Keras model with a sequence of inputs and a sequence of outputs, where each input has an associated output (label), e.g. part-of-speech (POS) tagging.

Seq_in[0][0:3]
array([[15],[28], [23]])

Seq_out[0][0:3]
array([[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]],
dtype=float32)

I want to build attention on top of the LSTM layer. I am following "Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification", Zhou et al., 2016.

X_train, X_val, Y_train, Y_val = train_test_split(Seq_in,Seq_out, test_size=0.20)

TIME_STEPS = 500
INPUT_DIM = 1
lstm_units = 256

inputs = Input(shape=(TIME_STEPS, INPUT_DIM))

activations = Bidirectional(LSTM(lstm_units, return_sequences=True))(inputs) # First bidirectional layer
activations = Dropout(0.2)(activations)
activations = Bidirectional(LSTM(lstm_units, return_sequences=True))(activations) # Second bidirectional layer
activations = Dropout(0.2)(activations)
attention = Dense(1, activation='tanh')(activations) # This is equation (9) in the paper: squashing each output state vector to a scalar.
attention = Flatten()(attention)
attention = Activation('softmax')(attention) # This is equation (10) in the paper.
attention = RepeatVector(512)(attention) # Repeating the softmax vector to have the same dimension as the output state vector (512).
attention = Permute([2, 1])(attention) # permute

sent_representation = multiply([activations, attention]) # multiply the attention vector with the output state vectors element-wise.
sent_representation = Lambda(lambda xin: K.sum(xin, axis=-1))(sent_representation) # summation over all output state vectors
sent_representation = RepeatVector(TIME_STEPS)(sent_representation) # repeat the vector to match the number of time steps
sent_representation = concatenate([activations, sent_representation]) # concatenate the sentence representation to the output states

output = Dense(15, activation='softmax')(sent_representation)#(out_attention_mul) # Find the softmax for the current label
model = Model(inputs=inputs, outputs=output)

sgd = optimizers.SGD(lr=.1,momentum=0.9,decay=1e-3,nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
model.fit(X_train,Y_train,epochs=2, validation_data=(X_val, Y_val),verbose=1)


Layer (type) Output Shape Param # Connected to

input_1 (InputLayer) (None, 500, 1) 0


bidirectional_1 (Bidirectional) (None, 500, 512) 528384 input_1[0][0]


dropout_1 (Dropout) (None, 500, 512) 0 bidirectional_1[0][0]


bidirectional_2 (Bidirectional) (None, 500, 512) 1574912 dropout_1[0][0]


dropout_2 (Dropout) (None, 500, 512) 0 bidirectional_2[0][0]


dense_1 (Dense) (None, 500, 1) 513 dropout_2[0][0]


flatten_1 (Flatten) (None, 500) 0 dense_1[0][0]


activation_1 (Activation) (None, 500) 0 flatten_1[0][0]


repeat_vector_1 (RepeatVector) (None, 512, 500) 0 activation_1[0][0]


permute_1 (Permute) (None, 500, 512) 0 repeat_vector_1[0][0]


multiply_1 (Multiply) (None, 500, 512) 0 dropout_2[0][0]
permute_1[0][0]


lambda_1 (Lambda) (None, 500) 0 multiply_1[0][0]


repeat_vector_2 (RepeatVector) (None, 500, 500) 0 lambda_1[0][0]


concatenate_1 (Concatenate) (None, 500, 1012) 0 dropout_2[0][0]
repeat_vector_2[0][0]


dense_2 (Dense) (None, 500, 15) 15195 concatenate_1[0][0]

Total params: 2,119,004
Trainable params: 2,119,004
Non-trainable params: 0


I think this code does what the paper does, except that the concatenate step attaches the same attention-weighted representation to all the output state vectors and does not change it per time step, i.e. per output label.
So I think that, for each time step's output, I have to do something so that the attention weights differ. Am I right?
Any help is appreciated.

Thanks in advance

Questions on implementation details

Update on 2019/2/14, nearly one year later:

The implementation in this repo is definitely bugged. Please refer to my implementation in a reply below for the correction. My version has been working in our product since this thread started, and it outperforms both the vanilla LSTM without attention and the incorrect version in this repo by a significant margin. I am not the only one raising this question.

Both this repo and my version of attention are intended for sequence-to-one networks (although it can be easily tweaked for seq2seq by replacing h_t with current state of the decoder step). If you are looking for a ready-to-use attention for sequence-to-sequence networks, check this out: https://github.com/farizrahman4u/seq2seq.

============Original answer==============

I am currently working on a text generation task and learnt attention from the TensorFlow tutorials. The implementation details seem quite different from your code.

This is how the TensorFlow tutorial describes the process:

[images: the attention equations from the TensorFlow tutorial]

If I am understanding it correctly, all learnable parameters in the attention mechanism are stored in W, which has a shape of (rnn_size, rnn_size) (rnn_size is the size of the hidden state). So first you need to use W to calculate the score of each hidden state based on the values of h_t and h_s, but I am not seeing h_t anywhere in your code. Instead, you applied a dense layer on all h_s, which means pre_act (Edit: h_t should be h_s in this equation) becomes the score in the paper. This seems wrong.

In the next step you element-wise multiply the attention weights with the hidden states as in equation (2), and then somehow skip equation (3).

I noticed the tutorial is about Seq2Seq (Encoder-Decoder) model and your code is an RNN. Maybe that is why your code is different. Do you have any source on how attention is applied to a non Seq2Seq network?
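
For reference, here are the standard global (Luong-style) attention steps that the text above paraphrases (a summary of the textbook formulation, not a quote from the tutorial): the score $score(h_t, \bar{h}_s) = h_t^\top W \bar{h}_s$, the attention weights $\alpha_{ts} = softmax_s(score(h_t, \bar{h}_s))$, the context vector $c_t = \sum_s \alpha_{ts} \bar{h}_s$, and finally the attention vector $a_t = \tanh(W_c [c_t; h_t])$.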

Here is your code:

def attention_3d_block(inputs):
    # inputs.shape = (batch_size, time_steps, input_dim)
    input_dim = int(inputs.shape[2])
    a = Permute((2, 1))(inputs)
    a = Reshape((input_dim, TIME_STEPS))(a) # this line is not useful. It's just to know which dimension is what.
    a = Dense(TIME_STEPS, activation='softmax')(a)
    if SINGLE_ATTENTION_VECTOR:
        a = Lambda(lambda x: K.mean(x, axis=1), name='dim_reduction')(a)
        a = RepeatVector(input_dim)(a)
    a_probs = Permute((2, 1), name='attention_vec')(a)
    output_attention_mul = merge([inputs, a_probs], name='attention_mul', mode='mul')
    return output_attention_mul


def model_attention_applied_after_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    lstm_units = 32
    lstm_out = LSTM(lstm_units, return_sequences=True)(inputs)
    attention_mul = attention_3d_block(lstm_out)
    attention_mul = Flatten()(attention_mul)
    output = Dense(1, activation='sigmoid')(attention_mul)
    model = Model(input=[inputs], output=output)
    return model

What is the logic behind the attention layer?

I would like to understand, intuitively or theoretically, how the attention layer reflects what the model attends to for a prediction.
Because it would be easy for the model to give equal weight to each input feature in the attention layer, and that would defeat the purpose of the attention layer.

TypeError: 'module' object is not callable

At this line:

output_attention_mul = merge([inputs, a_probs], name='attention_mul', mode='mul')

the following error happens:

TypeError: 'module' object is not callable

Not sure what is wrong. Could you help to resolve it?

Output with multiple time steps

Hi,

Can this be used for predicting an output with multiple time steps?
If not, how can the code be changed to accommodate this? Thanks.

possible bug in attention_lstm.py

lines 56-59 should be

if APPLY_ATTENTION_BEFORE_LSTM:
  m = model_attention_applied_before_lstm()
else:
  m = model_attention_applied_after_lstm()

IndexError: list index out of range

Dear sir, when I run python attention_dense.py, the following error shows up:

----- activations -----
Traceback (most recent call last):
File "attention_dense.py", line 39, in
attention_vector = get_activations(m, testing_inputs_1, print_shape_only=True)[1].flatten()
IndexError: list index out of range

Would you please help me? Thank you very much!

attention_lstm.py and Tensorflow

In attention_3d_block, I have some questions and (I think) a bug. I am running on TensorFlow.
(1) inputs doesn't have a shape method. So it crashes. I assume you meant to call the shape function on the numpy array on inputs_1.
(2) Is there a reason for calling Permute?
(3) What is the Reshape layer supposed to do? After the call to Permute, isn't the output of the previous permute layer already in shape (Batch Size, input_dim, TIME_STEPS)?
(4) The next call to Dense expects ndim =2, not 3. So the code crashes for me. I assume you meant the previous Reshape layer to map the 3d input to 2d?
(5) I would just like to point out that APPLY_ATTENTION_BEFORE_LSTM is False iff you call model_attention_applied_before_lstm.

Add guidance to README to use Functional API for saving models that use this layer

Hi there!

Thanks so much for implementing this and all of the other work that you do!

I ran into an issue loading a model that uses the Attention layer in a Sequential model. The Attention layer is defined using the Functional API, and Keras does not like it when you try to load a mixed model.

Specifically, my error was

m = keras.models.load_model('saved_mixed_model_path',
            custom_objects = { 'Attention': Attention}
           )

=> ValueError: A merge layer should be called on a list of inputs.

To solve this, I had to convert my model to one that uses the functional API and retrain.

Part of my confusion stems from the examples, where both the Sequential and Functional APIs are used. In this example you successfully save and load a model using only the Functional API, but in the LSTM example the Sequential API is used and no loading/saving is done.

Could a caveat be added to the README.md saying that if you plan to load/save these models, only the Functional API should be used when building the model that uses the Attention layer?

Cheers

How to do Stacked LSTM with attention using this framework ?

Hello,

I have run your code successfully.

I have also included a stacked LSTM in your code:

def model_attention_applied_before_lstm():
    inputs = Input(shape=(TIME_STEPS, INPUT_DIM,))
    attention_mul = attention_3d_block(inputs)
    lstm_units = 32
    attention_mul = LSTM(lstm_units, return_sequences=True)(attention_mul)
    attention_mul = LSTM(lstm_units, return_sequences=False)(attention_mul)
    output = Dense(1, activation='sigmoid')(attention_mul)
    model = Model(input=[inputs], output=output)
    return model

But maybe this is not the correct way to apply a stacked LSTM with attention, right?

My ultimate goal is to include attention in this code (classification of multivariate time series):


class LSTMNet:
    @staticmethod
    def build(timeSteps, variables, classes):
        inputNet = Input(shape=(timeSteps, variables))
        lstm = Bidirectional(GRU(100, recurrent_dropout=0.4, dropout=0.4, return_sequences=True), merge_mode='concat')(inputNet)
        lstm = Bidirectional(GRU(50, recurrent_dropout=0.4, dropout=0.4, return_sequences=True), merge_mode='concat')(lstm)
        lstm = Bidirectional(GRU(20, recurrent_dropout=0.4, dropout=0.4, return_sequences=False), merge_mode='concat')(lstm)
        # a softmax classifier
        classificationLayer = Dense(classes, activation='softmax')(lstm)
        model = Model(inputNet, classificationLayer)
        return model

Thanks in advance for any possible info

get_activations not producing list

Thanks for uploading this to GitHub! It is great for learning more about attention models. When I run attention_dense.py, however, I get this error (after the model finishes training):


IndexError Traceback (most recent call last)
in ()
37 # Attention vector corresponds to the second matrix.
38 # The first one is the Inputs output.
---> 39 attention_vector = get_activations(m, testing_inputs_1, print_shape_only=True)[1].flatten()
40 print('attention =', attention_vector)
41

IndexError: list index out of range

Any idea why the get_activations function isn't working properly?

Weird attention weights when adding a sequence of numbers

I am trying to slightly modify your example of adding numbers so that the target is the sum of all the numbers in the sequence before the delimiter. Below is the modified code:

# Imports reconstructed for completeness (they were not shown in the issue).
# The imports for SelfAttention and get_activations are likewise not shown and are omitted here;
# in the repository's add_two_numbers example they presumably come from the attention package and keract.
import os
import sys

import matplotlib.pyplot as plt
import numpy
import numpy as np
from tensorflow.keras.callbacks import Callback
from tensorflow.keras.layers import Dense, Dropout, LSTM
from tensorflow.keras.models import Sequential


def add_numbers_before_delimiter(n: int, seq_length: int, delimiter: float = 0.0,
                                         index_1: int = None) -> (np.array, np.array):
    """
    Task: Add all the numbers that come before the delimiter.
    x = [1, 2, 3, 0, 4, 5, 6, 7, 8, 9]. Result is y =  6.
    @param n: number of samples in (x, y).
    @param seq_length: length of the sequence of x.
    @param delimiter: value of the delimiter. Default is 0.0
    @param index_1: index of the number that comes after the first 0.
    @return: returns two numpy.array x and y of shape (n, seq_length, 1) and (n, 1).
    """
    x = np.random.uniform(0, 1, (n, seq_length))
    y = np.zeros(shape=(n, 1))
    for i in range(len(x)):
        if index_1 is None:
            a = np.random.choice(range(1, len(x[i])), size=1, replace=False)
        else:
            a = index_1
        y[i] =  np.sum(x[i, 0:a])
        x[i, a] = delimiter

    x = np.expand_dims(x, axis=-1)
    return x, y


def main():
    numpy.random.seed(7)

    # data. definition of the problem.
    seq_length = 20
    x_train, y_train = add_numbers_before_delimiter(20_000, seq_length)
    x_val, y_val = add_numbers_before_delimiter(4_000, seq_length)

    # just arbitrary values. it's for visual purposes. easy to see than random values.
    test_index_1 = 4
    x_test, _ = add_numbers_before_delimiter(10, seq_length, 0, test_index_1)
    # x_test_mask is just a mask that, if applied to x_test, would still contain the information to solve the problem.
    # we expect the attention map to look like this mask.
    x_test_mask = np.zeros_like(x_test[..., 0])
    x_test_mask[:, test_index_1:test_index_1 + 1] = 1

    model = Sequential([
        LSTM(100, input_shape=(seq_length, 1), return_sequences=True),
        SelfAttention(name='attention_weight'),
        Dropout(0.2),
        Dense(1, activation='linear')
    ])

    model.compile(loss='mse', optimizer='adam')
    print(model.summary())

    output_dir = 'task_add_two_numbers'
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    max_epoch = int(sys.argv[1]) if len(sys.argv) > 1 else 200

    class VisualiseAttentionMap(Callback):

        def on_epoch_end(self, epoch, logs=None):
            attention_map = get_activations(model, x_test, layer_names='attention_weight')['attention_weight']

            # top is attention map.
            # bottom is ground truth.
            plt.imshow(np.concatenate([attention_map, x_test_mask]), cmap='hot')

            iteration_no = str(epoch).zfill(3)
            plt.axis('off')
            plt.title(f'Iteration {iteration_no} / {max_epoch}')
            plt.savefig(f'{output_dir}/epoch_{iteration_no}.png')
            plt.close()
            plt.clf()

    model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=max_epoch,
              batch_size=64, callbacks=[VisualiseAttentionMap()])


if __name__ == '__main__':
    main()

I was expecting the model to focus on all the values in the x_test sequence before index 4. However, as you can see in the gif, the model focuses on just one point. Can you please point out where I am making a mistake?

Thanks in advance.

[gif: add_numbers attention maps]

Restricting attention weights to domain

In my application, the attention weights are centering on locations which are indicative of a subset of the classes. Therefore, while the algorithm performs well on this subset, it sometimes misclassifies on the other classes because the attention weights cause the obvious differences to be considered "residual".

Is there a documented way of restricting the attention weights to a certain value or index domain to enforce constraints on its focus? This question makes me think of NLP problems where frameworks commonly pair ML methodologies with a set of predetermined rules (usually defined with spacy).

Any thoughts? Thanks in advance.

2D attention

@philipperemy

Do you know how I can apply the attention module to a 2D-shaped input? I would like to apply attention after the LSTM layer:

Layer (type)                    Output Shape         Param #     Connected to                     
features (InputLayer)           (None, 16, 1816)     0                                            
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 2048)         31662080    features[0][0]                   
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 1024)         2098176     lstm_1[0][0]                     
__________________________________________________________________________________________________
leaky_re_lu_2 (LeakyReLU)       (None, 1024)         0           dense_2[0][0]                    
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 120)          123000      leaky_re_lu_2[0][0]              
__________________________________________________________________________________________________
feature_weights (InputLayer)    (None, 120)          0                                            
__________________________________________________________________________________________________
multiply_1 (Multiply)           (None, 120)          0           dense_3[0][0]                    
                                                                 feature_weights[0][0]            

Total params: 33,883,256
Trainable params: 33,883,256
Non-trainable params: 0
__________________________________________________________________________________________________

I would really appreciate your suggestion on how to modify the attention_3d_block to make it work for a 2D input as well. Thanks.

bucketing problem

My sequences have varying lengths and I’m using bucketing to solve the issue. Therefore I define the LSTM input shape as (None, None, features), i.e. there are no explicit timesteps. I wonder if the code can fit my input? Thanks.

fig

Hi, I am wondering about the figures in your markdown.
What app did you use to create these beautiful hand-drawn figures?
Thanks!

What does h_t mean in the Attention model?

Hi there!
Thanks so much for implementing this and all of the other work that you do!
I want to know the meaning of h_t, i.e. h_t = Lambda(lambda x: x[:, -1, :], output_shape=(hidden_size,), name='last_hidden_state')(hidden_states). In Luong's paper, h_t was used as the input hidden state. But how should it be interpreted in a setting which is not seq2seq?

Attention Mechanism not working

Hi,
I have added an attention layer (following the example) to my simple LSTM network shown below.

timestep = timesteps
features = 11
model = Sequential()
model.add(LSTM(64, input_shape=(timestep,features), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(32, return_sequences=True))
model.add(LSTM(16, return_sequences=True))
model.add(Attention(32))
model.add(Dense(32))
model.add(Dense(16))
model.add(Dense(1))
print(model.summary())
The code worked fine until last week, and I got a model summary with the attention layer details like this:

[image: model summary with attention layer]

However, now running the same code gives me a weird error.
ValueError: tf.function-decorated function tried to create variables on non-first call.

What I noticed is that the model summary has changed too:
[image: new model summary]

I am tight on time due to an upcoming deadline. Any assistance would be highly appreciated.
P.S. This was a fully working model that has stopped working all of a sudden for no apparent reason.

get_config

Hi,
Do you perhaps have another implementation with a get_config function for saving the model in Keras? I have been trying, but I always get this error:
raise ValueError('A Dot layer should be called '

ValueError: A Dot layer should be called on a list of 2 inputs.

Thanks!
