cyberzhg / keras-self-attention

Attention mechanism for processing sequential data that considers the context for each timestamp.

Home Page: https://pypi.org/project/keras-self-attention/

License: MIT License

Languages: Python 99.27%, Shell 0.73%
Topics: keras, attention-mechanism

keras-self-attention's Introduction

Keras Self-Attention

[中文|English]

Attention mechanism for processing sequential data that considers the context for each timestamp.

Install

pip install keras-self-attention

Usage

Basic

By default, the attention layer uses additive attention and considers the whole context while calculating the relevance. The following code creates an attention layer that follows the additive equations described in the project README (attention_activation is the activation function of e_{t, t'}):

from tensorflow import keras
from keras_self_attention import SeqSelfAttention


model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=10000,
                                 output_dim=300,
                                 mask_zero=True))
model.add(keras.layers.Bidirectional(keras.layers.LSTM(units=128,
                                                       return_sequences=True)))
model.add(SeqSelfAttention(attention_activation='sigmoid'))
model.add(keras.layers.Dense(units=5))
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['categorical_accuracy'],
)
model.summary()

Local Attention

The global context may be too broad for a single piece of data. The attention_width parameter controls the width of the local context:

from keras_self_attention import SeqSelfAttention

SeqSelfAttention(
    attention_width=15,
    attention_activation='sigmoid',
    name='Attention',
)

Multiplicative Attention

You can use multiplicative attention by setting attention_type:

from keras_self_attention import SeqSelfAttention

SeqSelfAttention(
    attention_width=15,
    attention_type=SeqSelfAttention.ATTENTION_TYPE_MUL,
    attention_activation=None,
    kernel_regularizer=keras.regularizers.l2(1e-6),
    use_attention_bias=False,
    name='Attention',
)

Regularizer

To use the regularizer, set attention_regularizer_weight to a positive number:

from tensorflow import keras
from keras_self_attention import SeqSelfAttention

inputs = keras.layers.Input(shape=(None,))
embd = keras.layers.Embedding(input_dim=32,
                              output_dim=16,
                              mask_zero=True)(inputs)
lstm = keras.layers.Bidirectional(keras.layers.LSTM(units=16,
                                                    return_sequences=True))(embd)
att = SeqSelfAttention(attention_type=SeqSelfAttention.ATTENTION_TYPE_MUL,
                       kernel_regularizer=keras.regularizers.l2(1e-4),
                       bias_regularizer=keras.regularizers.l1(1e-4),
                       attention_regularizer_weight=1e-4,
                       name='Attention')(lstm)
dense = keras.layers.Dense(units=5, name='Dense')(att)
model = keras.models.Model(inputs=inputs, outputs=[dense])
model.compile(
    optimizer='adam',
    loss={'Dense': 'sparse_categorical_crossentropy'},
    metrics={'Dense': 'sparse_categorical_accuracy'},
)
model.summary(line_length=100)

Load the Model

Make sure to add SeqSelfAttention to custom objects:

from tensorflow import keras
from keras_self_attention import SeqSelfAttention

keras.models.load_model(model_path, custom_objects=SeqSelfAttention.get_custom_objects())

History Only

Set history_only to True when only historical data can be attended to:

SeqSelfAttention(
    attention_width=3,
    history_only=True,
    name='Attention',
)

Multi-Head

Please refer to keras-multi-head.

keras-self-attention's People

Contributors

cyberzhg, etcec

keras-self-attention's Issues

Compatibility with `tf.keras`

I have been looking into self-attention using TensorFlow. More specifically, I use the Keras API that is integrated into TensorFlow as the tf.keras module.

I have tried both the Sequential and Functional API to no avail:

text_inputs = tf.keras.layers.Input(shape=(None,))
embd_layer = tf.keras.layers.Embedding(input_dim=VOCAB_SIZE,
                                output_dim=EMBEDDING_DIM,
                                mask_zero=True,
                                weights=None,
                                trainable=None is None,
                                name='Embedding')(text_inputs)
lstm_layer = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units=512,
                                                      recurrent_dropout=0.4,
                                                      return_sequences=True),
                                                      name='Bi-LSTM')(embd_layer)
attention_layer = SeqSelfAttention(attention_activation='sigmoid',
                               attention_width=9,
                               return_attention=False,
                               name='Attention')(lstm_layer)

returns TypeError: The added layer must be an instance of class Layer. Found: <keras_self_attention.seq_self_attention.SeqSelfAttention object at 0x7f87ee16bd30> (I think because TensorFlow expects a tf.keras.Layer object).

And using the Functional API:

text_inputs = tf.keras.layers.Input(shape=(SEQ_LENGTH,))
x = tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, input_length=SEQ_LENGTH)(text_inputs)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(512, return_sequences=True))(x)
x = SeqSelfAttention(attention_activation='sigmoid')(x)

returns ValueError: Layer Attention was called with an input that isn't a symbolic tensor. Received type: <class 'tensorflow.python.keras.engine.base_layer.DeferredTensor'>. Full input: [<DeferredTensor 'None' shape=(?, ?, 1024) dtype=float32>]. All inputs to the layer should be tensors

Any clue? Is it because I am not using Keras but tf.keras instead?
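
Possibly relevant: the README examples at the top of this page build everything with tensorflow.keras, which recent releases of the package import directly. A hedged sketch along those lines; the TF_KERAS flag is an assumption about how older releases selected the tf.keras backend and may be unnecessary on current versions:

import os
os.environ['TF_KERAS'] = '1'  # assumption: older releases pick the tf.keras backend from this flag

from tensorflow import keras
from keras_self_attention import SeqSelfAttention

text_inputs = keras.layers.Input(shape=(None,))
embd = keras.layers.Embedding(input_dim=10000, output_dim=300, mask_zero=True)(text_inputs)
lstm = keras.layers.Bidirectional(keras.layers.LSTM(512, return_sequences=True))(embd)
att = SeqSelfAttention(attention_activation='sigmoid', attention_width=9, name='Attention')(lstm)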

Compatibility with Tensorflow 2.0

I'm trying to create a model using keras-self-attention on Google Colab, and since the default TensorFlow version is now 2.0, this error appears:

model = models.Sequential()
model.add( Embedding(max_features, 32))
model.add(Bidirectional( LSTM(32, return_sequences=True)))
# adding an attention layer
model.add(SeqWeightedAttention())
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in _get_default_graph()
     65     try:
---> 66         return tf.get_default_graph()
     67     except AttributeError:

AttributeError: module 'tensorflow' has no attribute 'get_default_graph'

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
4 frames
<ipython-input-7-9c4e625938a2> in <module>()
      3 model.add(Bidirectional( LSTM(32, return_sequences=True)))
      4 # adding an attention layer
----> 5 model.add(SeqWeightedAttention())

/usr/local/lib/python3.6/dist-packages/keras_self_attention/seq_weighted_attention.py in __init__(self, use_bias, return_attention, **kwargs)
     10 
     11     def __init__(self, use_bias=True, return_attention=False, **kwargs):
---> 12         super(SeqWeightedAttention, self).__init__(**kwargs)
     13         self.supports_masking = True
     14         self.use_bias = use_bias

/usr/local/lib/python3.6/dist-packages/keras/engine/base_layer.py in __init__(self, **kwargs)
    130         if not name:
    131             prefix = self.__class__.__name__
--> 132             name = _to_snake_case(prefix) + '_' + str(K.get_uid(prefix))
    133         self.name = name
    134 

/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in get_uid(prefix)
     84     """
     85     global _GRAPH_UID_DICTS
---> 86     graph = _get_default_graph()
     87     if graph not in _GRAPH_UID_DICTS:
     88         _GRAPH_UID_DICTS[graph] = defaultdict(int)

/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in _get_default_graph()
     67     except AttributeError:
     68         raise RuntimeError(
---> 69             'It looks like you are trying to use '
     70             'a version of multi-backend Keras that '
     71             'does not support TensorFlow 2.0. We recommend '

RuntimeError: It looks like you are trying to use a version of multi-backend Keras that does not support TensorFlow 2.0. We recommend using `tf.keras`, or alternatively, downgrading to TensorFlow 1.14.
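
The traceback points at multi-backend Keras rather than the attention layer itself. A hedged sketch of the workaround the error message suggests, building the same model with tf.keras imports; this assumes the installed keras-self-attention release supports tf.keras, as the README examples above do, and max_features stands in for the vocabulary size from the snippet above:

from tensorflow.keras import models, layers
from keras_self_attention import SeqWeightedAttention

max_features = 10000  # assumed vocabulary size
model = models.Sequential()
model.add(layers.Embedding(max_features, 32))
model.add(layers.Bidirectional(layers.LSTM(32, return_sequences=True)))
model.add(SeqWeightedAttention())  # weighted attention collapses (batch, time, 64) to (batch, 64)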

attention w/ TimeDistributed

Hi,

Is it possible to use the attention layer w/ a TimeDistributed(Dense(...))?

If so, how would one have to modify the example in the README?

Thank you,
Adrian
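
Because SeqSelfAttention returns the full (batch, timesteps, features) sequence, the Dense layer at the end of the README example can simply be wrapped in TimeDistributed, which for Dense is equivalent to applying it at every timestep. A hedged sketch based on the Basic example above:

from tensorflow import keras
from keras_self_attention import SeqSelfAttention

model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=10000, output_dim=300, mask_zero=True))
model.add(keras.layers.Bidirectional(keras.layers.LSTM(units=128, return_sequences=True)))
model.add(SeqSelfAttention(attention_activation='sigmoid'))
# Per-timestep classifier on top of the attended sequence:
model.add(keras.layers.TimeDistributed(keras.layers.Dense(units=5)))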

Scaled Dot Product attention error

When applying scaled dot product attention, it gives the following error:

TypeError: Tensor objects are only iterable when eager execution is enabled. To iterate over this tensor use tf.map_fn.

Any idea?

"Tuple index out of range" when using SeqWeightedAttention

elif keras_mode == "RNN":
            model.add(Reshape((1, list_of_embeddings[1].size), input_shape = Emb_train.shape[1:])) 
            model.add(Bidirectional(GRU(list_of_embeddings[1].size, activation = 'relu'))) ##this works too - seems to be better for smaller datasets too!
            model.add(SeqWeightedAttention())
            model.add(Dense(len(np.unique(Y_val)),activation='softmax',kernel_initializer=kernel_initializer, use_bias = False))
Traceback (most recent call last):
  File "classification.py", line 182, in <module>
    pipe.fit(X_train, Y_train)
  File "/usr/lib/python3.7/site-packages/sklearn/pipeline.py", line 267, in fit
    self._final_estimator.fit(Xt, y, **fit_params)
  File "/usr/lib/python3.7/site-packages/keras/wrappers/scikit_learn.py", line 210, in fit
    return super(KerasClassifier, self).fit(x, y, **kwargs)
  File "/usr/lib/python3.7/site-packages/keras/wrappers/scikit_learn.py", line 141, in fit
    self.model = self.build_fn(**self.filter_sk_params(self.build_fn))
  File "classification.py", line 144, in create_model
    model.add(SeqWeightedAttention())
  File "/usr/lib/python3.7/site-packages/keras/engine/sequential.py", line 181, in add
    output_tensor = layer(self.outputs[0])
  File "/usr/lib/python3.7/site-packages/keras/engine/base_layer.py", line 431, in __call__
    self.build(unpack_singleton(input_shapes))
  File "/usr/lib/python3.7/site-packages/keras_self_attention/seq_weighted_attention.py", line 27, in build
    self.W = self.add_weight(shape=(int(input_shape[2]), 1),
IndexError: tuple index out of range

Dense layer after Self Attention Layer throws an error

If I try the code given in the example with text data:

import keras
from keras_self_attention import SeqSelfAttention

model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=10000,
                                 output_dim=300,
                                 mask_zero=True))
model.add(keras.layers.Bidirectional(keras.layers.LSTM(units=128,
                                                       return_sequences=True)))
model.add(SeqSelfAttention(attention_activation='sigmoid'))
model.add(keras.layers.Dense(units=5))
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['categorical_accuracy'],
)
model.summary()

It throws an error :

ValueError: Error when checking target: expected dense_10 to have 3 dimensions, but got array with shape (28696, 2)

I need to add another LSTM layer with return_sequences=False after the self-attention layer to make this run. Am I missing something? The output from the self-attention layer has shape (maxlen, number of units in the LSTM layer, say units=200); how do we get a vector of size (units=200) out of the layer?
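
One common pattern (standard Keras, not specific to this package) is to collapse the attention layer's (batch, timesteps, features) output with a global pooling layer before the final classifier; SeqWeightedAttention from the same package, which returns a single vector per sample, is another option. A hedged sketch of the pooling variant, where the 2-unit output matches the (28696, 2) targets mentioned above:

from tensorflow import keras
from keras_self_attention import SeqSelfAttention

model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=10000, output_dim=300, mask_zero=True))
model.add(keras.layers.Bidirectional(keras.layers.LSTM(units=128, return_sequences=True)))
model.add(SeqSelfAttention(attention_activation='sigmoid'))
# (batch, maxlen, 256) -> (batch, 256); masked timesteps are ignored in the average
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(units=2, activation='softmax'))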

Examples for a basic NMT model?

I couldn't understand how to use it to build a sequence-to-sequence machine translation model. An example/tutorial would be really helpful.

Masking implementation

Hi @CyberZHG, I'm using self-attention over an RNN for a classification problem, but I'm a bit confused by the masking implementation and its differences among the provided attention types. I apologize in advance for the size of the post.

To test the masking, I created a placeholder tensor to represent the output hidden states from an RNN with T=6 timesteps [t0,...,t5] and D=3 units, where timesteps t2, t4 and t5 are masked:

h_states = tf.convert_to_tensor(np.array([[[0.5,0.2,0.1],[0.4,0.9,0.3],[-1,-1,-1],[0.1,0.2,0.1], [-1,-1,-1], [-1,-1,-1]]]), dtype='float32')
masked_states = Masking(mask_value=-1)(h_states)

SeqSelfAttention

SeqSelfAttention(return_attention=True)(masked_states)
When calling the additive or dot attention, I was surprised to find that only the entries a_{i,j} with i,j in {2,4,5} of the [T x T] attention matrix were masked:

SeqSelfAttention(return_attention=True)(masked_states)
[<tf.Tensor: shape=(1, 6, 3), dtype=float32, numpy=
 array([[[0.13978598, 0.16256869, 0.06499083],
         [0.1412907 , 0.16534805, 0.06593135],
         [0.32337117, 0.3761466 , 0.15032762],
         [0.14026345, 0.163282  , 0.06523413],
         [0.32337117, 0.3761466 , 0.15032762],
         [0.32337117, 0.3761466 , 0.15032762]]], dtype=float32)>,
 <tf.Tensor: shape=(1, 6, 6), dtype=float32, numpy=
 array([[[0.159832  , 0.10862342, 0.18911283, 0.16420609, 0.18911283, 0.18911283],
         [0.16049388, 0.11161783, 0.18797404, 0.16396616, 0.18797404, 0.18797404],
         [0.36969936, 0.25163805, 0.        , 0.37866256, 0.        , 0.        ],
         [0.16022852, 0.10937916, 0.18880568, 0.16397531, 0.18880568, 0.18880568],
         [0.36969936, 0.25163805, 0.        , 0.37866256, 0.        , 0.        ],
         [0.36969936, 0.25163805, 0.        , 0.37866256, 0.        , 0.        ]]], dtype=float32)>]
  • Q1: Shouldn't the [2,4,5] rows and columns be masked entirely instead since the values result from alignments with masked timesteps?

SeqWeightedAttention

SeqWeightedAttention seems to mask the padding timesteps completely:

SeqWeightedAttention(return_attention=True)(masked_states)
[<tf.Tensor: shape=(1, 3), dtype=float32, numpy=array([[0.33272028, 0.43565503, 0.16733001]], dtype=float32)>,
 <tf.Tensor: shape=(1, 6), dtype=float32, numpy=array([[0.32931313, 0.33665004, 0.        , 0.33403683, 0.        , 0.        ]], dtype=float32)>]

ScaledDotProductAttention

ScaledDotProductAttention, as expected, returned values similar to Keras' own tf.keras.layers.Attention(use_scale=True), except at the masked timesteps:

ScaledDotProductAttention(return_attention=True)(masked_states)
[<tf.Tensor: shape=(1, 6, 3), dtype=float32, numpy=
 array([[[0.34341848, 0.4522895 , 0.17208272],
         [0.3484643 , 0.5025628 , 0.18644652],
         [0.33333334, 0.43333334, 0.16666667],
         [0.33703578, 0.4488316 , 0.17109475],
         [0.33333334, 0.43333334, 0.16666667],
         [0.33333334, 0.43333334, 0.16666667]]], dtype=float32)>,
 <tf.Tensor: shape=(1, 6, 6), dtype=float32, numpy=
 array([[[0.33823597, 0.3604136 , 0.        , 0.3013504 , 0.        , 0.        ],
         [0.29698637, 0.43223262, 0.        , 0.27078095, 0.        , 0.        ],
         [0.33333334, 0.33333334, 0.        , 0.33333334, 0.        , 0.        ],
         [0.32598415, 0.3554737 , 0.        , 0.31854212, 0.        , 0.        ],
         [0.33333334, 0.33333334, 0.        , 0.33333334, 0.        , 0.        ],
         [0.33333334, 0.33333334, 0.        , 0.33333334, 0.        , 0.        ]]], dtype=float32)>]

Here the mask propagates over the columns but not the rows.

Keras Dot Attention

Finally, even though this implementation is supposedly not supported for RNNs (as per its code documentation), its result is the most closely aligned with my expected behavior, where the values for the masked timesteps are removed:

Attention(use_scale=True)([masked_states,masked_states])
<tf.Tensor: shape=(1, 6, 3), dtype=float32, numpy=
array([[[0.35038543, 0.46623248, 0.17606643],
        [0.35869   , 0.5558893 , 0.20168266],
        [0.        , 0.        , 0.        ],
        [0.3397184 , 0.46044892, 0.17441398],
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        ]]], dtype=float32)>
  • Q2: Is there a need to multiply the output of SeqSelfAttention or ScaledDotAttention by the initial mask before summing over the timestep dimension to obtain a final vector?

[edit: question wording, example removed]
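
Regarding Q2, a hedged sketch of the workaround described there, continuing from the h_states / masked_states definitions above: multiply the attention output by the sequence mask before summing over timesteps, so masked positions cannot leak into the pooled vector. The mask is rebuilt from the -1 padding value used in the example:

import tensorflow as tf

# (1, 6) float mask: 1 for real timesteps, 0 where every feature equals the padding value -1
mask = tf.cast(tf.reduce_any(tf.not_equal(h_states, -1.0), axis=-1), tf.float32)

att_out, att_weights = SeqSelfAttention(return_attention=True)(masked_states)
masked_out = att_out * mask[:, :, tf.newaxis]                                    # zero the masked rows
pooled = tf.reduce_sum(masked_out, axis=1) / tf.reduce_sum(mask, axis=1, keepdims=True)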

keras-self-attention Paper ?

Hi, Thanks for the amazing implementation.

I just want to know which paper exactly this implementation is based on.

Thanks!

Error when load model

Hello, my name is Raspati. I have a problem when using the keras-self-attention layer:
I can't load my model; it raises ValueError: Unknown layer: SeqSelfAttention.

import keras
from keras_self_attention import SeqSelfAttention

keras.models.load_model("/content/drive/MyDrive/Colab Notebooks/Model/GRU_apnea.h5", custom_objects=SeqSelfAttention.get_custom_objects())

Does it support masking?

Hello CyberZHG

I have a sequence of inputs and a sequence of outputs where each input has an associated output (label), e.g. part-of-speech (POS) tagging.

Seq_in[0][0:3]
array([[15],[28], [23]])

Seq_out[0][0:3]
array([[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]],
dtype=float32)

I am using the following code for training:

X_train, X_val, Y_train, Y_val = train_test_split(Seq_in,Seq_out, test_size=0.20)

model = Sequential()
model.add(Masking(mask_value=5, input_shape= (Seq_in.shape[1],1))) # time steps is 500
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(Dropout(0.2))
model.add(seq_self_attention.SeqSelfAttention())
model.add(Dense(15, activation='softmax'))

sgd = optimizers.SGD(lr=.1,momentum=0.9,decay=1e-3,nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

model.fit(X_train,Y_train,epochs=2, validation_data=(X_val, Y_val),verbose=2)

I have a couple of concerns:
It seems that the implementation supports masking, but is what I am doing in the code the correct way to use masking, or is there another way?

Why do we need the units argument in the constructor? Doesn't the code figure it out itself?

Following the equations posted in the README, the process is to sum each neighbor state h_{t'} with the state of the current timestep h_t, then take the tanh of each unit in each state, which produces the same shape (first equation).

Second, each state h_{t'} is squashed to one value (a scalar) using the sigmoid function (second equation).

Third, we take the softmax between the current timestep's state and the other states h_{t'}.

Finally, we multiply the softmax probability (attention weight) by each unit and then take the weighted sum.

Is my understanding correct? If so, why do we need units in the constructor?

Also, there are two methods, multiplicative and additive; where can I see the difference in terms of the equations?

Sorry, too many questions, I would appreciate your answers...
Thank you
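
On the last two questions, these are the standard forms of the two scoring functions (Bahdanau-style additive and Luong-style multiplicative) that the attention_type option appears to switch between; the exact parametrization inside the package may differ slightly:

\text{Additive:}\qquad e_{t,t'} = W_a\,\tanh(W_t h_t + W_x h_{t'} + b_h) + b_a

\text{Multiplicative:}\qquad e_{t,t'} = h_t^{\top} W_a\, h_{t'} + b_a

\text{Both:}\qquad a_{t,t'} = \frac{\exp(e_{t,t'})}{\sum_{\tau}\exp(e_{t,\tau})}, \qquad l_t = \sum_{t'} a_{t,t'}\, h_{t'}

In the additive form, units is the dimensionality of the hidden projection W_t h_t + W_x h_{t'}, which is presumably why it appears in the constructor even though the feature dimension itself is inferred from the input.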

error in building my bi-lstm with attention, help

Dear author,
Thanks for your keras-self-attention.

    Recently I have been learning to develop a bi-LSTM model with attention, and I ran into an error when using self-attention:

(for imdb dataset)
model3 = Sequential()

model3.add( Embedding(max_features, 32) )
model3.add( layers.Bidirectional( layers.LSTM(32, return_sequences=True) ) )
model3.add(SeqSelfAttention(activation='sigmoid') )
model3.add(Dense(1, activation='sigmoid') )

model3.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model3.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

When I run model.fit, this ValueError comes up:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in ()
----> 1 history = model3.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.2)

~/denglz/venv4re/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
950 sample_weight=sample_weight,
951 class_weight=class_weight,
--> 952 batch_size=batch_size)
953 # Prepare validation data.
954 do_validation = False

~/denglz/venv4re/lib/python3.6/site-packages/keras/engine/training.py in _standardize_user_data(self, x, y, sample_weight, class_weight, check_array_lengths, batch_size)
787 feed_output_shapes,
788 check_batch_axis=False, # Don't enforce the batch size.
--> 789 exception_prefix='target')
790
791 # Generate sample-wise weight values given the sample_weight and

~/denglz/venv4re/lib/python3.6/site-packages/keras/engine/training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
126 ': expected ' + names[i] + ' to have ' +
127 str(len(shape)) + ' dimensions, but got array '
--> 128 'with shape ' + str(data_shape))
129 if not check_batch_axis:
130 data_shape = data_shape[1:]

ValueError: Error when checking target: expected dense_6 to have 3 dimensions, but got array with shape (25000, 1)

Am I using keras-self-attention in the wrong way? I need your help, thanks a lot.

AttributeError: module 'tensorflow' has no attribute 'get_default_graph' while using 'SeqSelfAttention'

Hey CyberZHG,
thank you for your cool packages. I've used keras-self-attention, and at first it worked okay, but the other day 'AttributeError: module 'tensorflow' has no attribute 'get_default_graph'' started to appear every time I try to use SeqSelfAttention. Without your code, the error disappears.
I couldn't figure out what the problem was. I tried to upgrade/downgrade and reinstall TF and Keras (following posts from StackOverflow), but it didn't help.
So maybe you can explain to me what's wrong? The problem seems to be somehow connected with keras-self-attention. I'm new to neural networks and programming in general, so if this question is stupid, I hope you'll be patient and answer in detail (because several days of googling did not help). Thank you in advance!
Here is my code:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Activation
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import GRU
from keras_self_attention import SeqSelfAttention

max_features = 4 #number of words in the dictionary
num_classes = 2
model = Sequential()
model.add(GRU(128, input_shape=(70, max_features), return_sequences=True, activation='tanh'))
model.add(SeqSelfAttention(attention_activation='sigmoid')) 
model.add(Flatten())
model.add(Dense(num_classes, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.summary()

Here is my error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-15-1807f7e55fc9> in <module>
     11 model.add(GRU(128, input_shape=(70, max_features), return_sequences=True, activation='tanh'))
     12 # model.add(LSTM(128, input_shape=(70, max_features), return_sequences=True)) #return_sequences: output for att.layer
---> 13 model.add(SeqSelfAttention(attention_activation='sigmoid'))
     14 # model.add(Dropout(0.5))
     15 model.add(Flatten())

~/anaconda3/lib/python3.7/site-packages/keras_self_attention/seq_self_attention.py in __init__(self, units, attention_width, attention_type, return_attention, history_only, kernel_initializer, bias_initializer, kernel_regularizer, bias_regularizer, kernel_constraint, bias_constraint, use_additive_bias, use_attention_bias, attention_activation, attention_regularizer_weight, **kwargs)
     47         :param kwargs: Parameters for parent class.
     48         """
---> 49         super(SeqSelfAttention, self).__init__(**kwargs)
     50         self.supports_masking = True
     51         self.units = units

~/anaconda3/lib/python3.7/site-packages/keras/engine/base_layer.py in __init__(self, **kwargs)
    130         if not name:
    131             prefix = self.__class__.__name__
--> 132             name = _to_snake_case(prefix) + '_' + str(K.get_uid(prefix))
    133         self.name = name
    134 

~/anaconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in get_uid(prefix)
     72     """
     73     global _GRAPH_UID_DICTS
---> 74     graph = tf.get_default_graph()
     75     if graph not in _GRAPH_UID_DICTS:
     76         _GRAPH_UID_DICTS[graph] = defaultdict(int)

AttributeError: module 'tensorflow' has no attribute 'get_default_graph'

SeqSelfAttention returning tuple (tensor, weights) raise TypeError on Tensor object not iterable

Hi,

I'm running the Regularizer example code from your README, and I'm getting the following error from the SeqSelfAttention layer:

TypeError: Tensor objects are not iterable when eager execution is not enabled. To iterate over this tensor use tf.map_fn.

raised in "lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 436, in iter

I'm using:

  • python 3.6.5
  • tensorflow 1.9.0

Any clue?

ScaledDotProductAttention not returning attention properly.

I have used your MultiheadAttention model on a complex input, outputting attention/self-attention with success.
After that, I tried experimenting with the ScaledDotProductAttention model that Multihead uses internally, with the aim of increasing interpretability and decreasing inference/training times.

While building a simple example such as the one below, I was unable to extract a correct attention vector even after training the model.
The attention vector was always [0.99, 0.99, 0.99, ...., 0.99] for single-dimensional input.

Any clues as to what I am doing that prevents the model from outputting a correct attention vector?

Reference code below:

from __future__ import absolute_import, division, print_function, unicode_literals
import os

import tensorflow as tf

import cProfile
import keras
import keras.backend as K
from keras_self_attention import ScaledDotProductAttention
import tensorflow as tf
import numpy as np
import keras


input_query = keras.layers.Input(
    shape=(9,),
    name='Input-Q',
)
input_key = keras.layers.Input(
    shape=(1, ),
    name='Input-K',
)
input_value = keras.layers.Input(
    shape=(1, ),
    name='Input-V',
)
emb = keras.layers.Embedding(input_dim=20, output_dim=10)

query = emb(input_query)
key = emb(input_key)
value = emb(input_value)

att_layer, a = ScaledDotProductAttention(
    name='ScaledDotProductAttention',
    return_attention= True
)([query, key, value])

att_layer = keras.layers.GlobalAveragePooling1D(name='AttAvg')(att_layer)

query = keras.layers.GlobalAveragePooling1D()(query)

query = keras.layers.Reshape(target_shape=(1, 10))(query)

att_layer = keras.layers.Reshape(target_shape=(1, 10))(att_layer)

y = keras.layers.concatenate([att_layer, query])
y = keras.layers.Reshape(target_shape=(20,))(y)
y = keras.layers.Dense(300, activation='relu')(y)

y = keras.layers.Dense(1, activation='sigmoid', name='output_sigmoid')(y)


model = keras.models.Model(
    inputs=[input_query, input_key, input_value], outputs=y)
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy'],
)
model.summary()
Attention is:  Tensor("ScaledDotProductAttention_2/truediv_1:0", shape=(?, 9, 1), dtype=float32) Tensor("ScaledDotProductAttention_2/MatMul_1:0", shape=(?, 9, 10), dtype=float32) Tensor("embedding_4_2/GatherV2:0", shape=(?, 1, 10), dtype=float32)
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
Input-Q (InputLayer)            (None, 9)            0                                            
__________________________________________________________________________________________________
Input-K (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
Input-V (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
embedding_4 (Embedding)         multiple             200         Input-Q[0][0]                    
                                                                 Input-K[0][0]                    
                                                                 Input-V[0][0]                    
__________________________________________________________________________________________________
ScaledDotProductAttention (Scal [(None, 9, 10), (Non 0           embedding_4[0][0]                
                                                                 embedding_4[1][0]                
                                                                 embedding_4[2][0]                
__________________________________________________________________________________________________
AttAvg (GlobalAveragePooling1D) (None, 10)           0           ScaledDotProductAttention[0][0]  
__________________________________________________________________________________________________
global_average_pooling1d_3 (Glo (None, 10)           0           embedding_4[0][0]                
__________________________________________________________________________________________________
reshape_7 (Reshape)             (None, 1, 10)        0           AttAvg[0][0]                     
__________________________________________________________________________________________________
reshape_6 (Reshape)             (None, 1, 10)        0           global_average_pooling1d_3[0][0] 
__________________________________________________________________________________________________
concatenate_3 (Concatenate)     (None, 1, 20)        0           reshape_7[0][0]                  
                                                                 reshape_6[0][0]                  
__________________________________________________________________________________________________
reshape_8 (Reshape)             (None, 20)           0           concatenate_3[0][0]              
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 300)          6300        reshape_8[0][0]                  
__________________________________________________________________________________________________
output_sigmoid (Dense)          (None, 1)            301         dense_2[0][0]                    
==================================================================================================
Total params: 6,801
Trainable params: 6,801
Non-trainable params: 0
__________________________________________________________________________________________________

Data generation

# Data Generation.

from random import randint
def get_q(m,n):
    
    q = []
    
    for i in range(n):
        a = []
        k = randint(1,10)
        for i in range(m):
            a.append(k)
        q.append(a)
    return q

def get_k(n):
    l = randint(1,10)
    k = []
    for i in range(n):
        k.append((l))
    
    return k



def potato_gun(iterations):
    i_q = np.array([get_q(1,9) for i in range(0,iterations)])
    i_k = np.array([get_q(1,1) for i in range(0,iterations)])


    i_y = []
    for q, v in zip(i_q, i_k):
        y = 0

        for row in q:
            if np.array_equal(v[0],row):
                y = 1
        i_y.append(y)
    
    return [np.reshape(i_q, newshape=(iterations, 9)), 
            np.reshape(i_k, newshape=(iterations,)), 
            np.reshape(i_k, newshape=(iterations,))], i_y


dummy_dataset = potato_gun(240000)
x_train, y_train = (dummy_dataset[0],dummy_dataset[1])
dummy_dataset = potato_gun(80000)
x_test, y_test = (dummy_dataset[0],dummy_dataset[1])

Simple rule:

If input[1] exists in the input[0] , y = 1, else y = 0

idx = 5
print(f'If {x_train[0][idx]} contains {x_train[1][idx]}, Then y = {y_train[idx]}')
If [ 9  8  9  6  4  2  6  7 10] contains [7], Then y = 1

Training

model.fit(x=x_train,
      y=y_train,
      batch_size=32,
      epochs=10,
      verbose=1,
      validation_data=(x_test, y_test),
      shuffle=True,
      class_weight=None,
      sample_weight=None
          )
Train on 240000 samples, validate on 80000 samples
Epoch 1/10
240000/240000 [==============================] - 11s 47us/step - loss: 0.4857 - acc: 0.7541 - val_loss: 0.3153 - val_acc: 0.8487
Epoch 2/10
240000/240000 [==============================] - 11s 45us/step - loss: 0.1943 - acc: 0.9161 - val_loss: 0.0770 - val_acc: 0.9788
Epoch 3/10
240000/240000 [==============================] - 11s 46us/step - loss: 0.0113 - acc: 0.9986 - val_loss: 3.9026e-04 - val_acc: 1.0000
Epoch 4/10
240000/240000 [==============================] - 11s 46us/step - loss: 2.3738e-04 - acc: 1.0000 - val_loss: 9.9840e-05 - val_acc: 1.0000
Epoch 5/10
240000/240000 [==============================] - 12s 50us/step - loss: 6.5733e-05 - acc: 1.0000 - val_loss: 2.9516e-05 - val_acc: 1.0000
Epoch 6/10
240000/240000 [==============================] - 13s 56us/step - loss: 3.2543e-05 - acc: 1.0000 - val_loss: 1.5050e-05 - val_acc: 1.0000
Epoch 7/10
240000/240000 [==============================] - 13s 56us/step - loss: 2.5813e-06 - acc: 1.0000 - val_loss: 1.8419e-06 - val_acc: 1.0000
Epoch 8/10
240000/240000 [==============================] - 12s 50us/step - loss: 5.1268e-05 - acc: 1.0000 - val_loss: 2.1687e-06 - val_acc: 1.0000
Epoch 9/10
240000/240000 [==============================] - 12s 49us/step - loss: 5.6301e-07 - acc: 1.0000 - val_loss: 3.2817e-07 - val_acc: 1.0000
Epoch 10/10
240000/240000 [==============================] - 12s 51us/step - loss: 7.6280e-05 - acc: 1.0000 - val_loss: 9.8148e-07 - val_acc: 1.0000





<keras.callbacks.History at 0x7fa5630a7630>

Extracting the submodel & confirming the output shapes:

target_layer = model.get_layer('ScaledDotProductAttention')
layer_output = target_layer.output
from keras.models import Model

submodel_emb = Model(inputs=model.input, outputs=layer_output)
print(submodel_emb.summary())
print(f'Shapes: \n Values: {submodel_emb.output_shape[0]}, Attention: {submodel_emb.output_shape[1]}')
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
Input-Q (InputLayer)            (None, 9)            0                                            
__________________________________________________________________________________________________
Input-K (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
Input-V (InputLayer)            (None, 1)            0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         multiple             200         Input-Q[0][0]                    
                                                                 Input-K[0][0]                    
                                                                 Input-V[0][0]                    
__________________________________________________________________________________________________
ScaledDotProductAttention (Scal [(None, 9, 10), (Non 0           embedding_3[0][0]                
                                                                 embedding_3[1][0]                
                                                                 embedding_3[2][0]                
==================================================================================================
Total params: 200
Trainable params: 200
Non-trainable params: 0
__________________________________________________________________________________________________
None
Shapes: 
 Values: (None, 9, 10), Attention: (None, 9, 1)

Expected behaviour:

The model should output an attention vector whose weights sum to 1 during .predict().

Observed behaviour:

With the input array [2,2,2,2,2,2,2,2,1], the attention should be directed at the last position of the input.

Contrary to expectations, the attention vector is [0.99, 0.99, 0.99, .... , 0.99]

idx = 5
values, attention = submodel_emb.predict(
    (
        [np.array([[2]*8 + [1]]),
        np.array([1,]),
        np.array([1])]
    )
)
print(attention)
[[[0.9999999]
  [0.9999999]
  [0.9999999]
  [0.9999999]
  [0.9999999]
  [0.9999999]
  [0.9999999]
  [0.9999999]
  [0.9999999]]]
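
One plausible explanation, an observation about the setup rather than the layer: Input-K and Input-V have length 1, so the attention matrix has shape (batch, 9, 1) (as the summary above shows) and the softmax over the key axis is taken over a single position, which is always 1 no matter how well the model trains. A minimal numerical check:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

scores = np.random.randn(1, 9, 1)   # nine queries, a single key, as in the model above
print(softmax(scores, axis=-1))     # every entry is 1.0: there is only one key to attend to

Giving the key/value branch more than one position would make the weights informative again.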

About Keras self-attention

Hi, thank you for your contribution. I want to know whether this is self-attention or a more general attention. I hope to get your reply. Thank you!

Getting Error :Invalid argument: Incompatible shapes

I am getting the error below whenever I introduce the attention layer:

InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  Incompatible shapes: [1,37500] vs. [1,125]
	 [[node LogicalAnd (defined at <command-367511126204188>:2) ]]
	 [[ConstantFolding/assert_greater_equal_4/Assert/AssertGuard/switch_pred/_114_const_false/_187]]
  (1) Invalid argument:  Incompatible shapes: [1,37500] vs. [1,125]
	 [[node LogicalAnd (defined at <command-367511126204188>:2) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_36165]

It seems related to the batch size; however, when I remove the attention layer, the model works fine.

keras-self-attention for time series forecasting

I have a time series case, and the input shape of the network is 2000*214 (2000 samples; every sample is a day, and I have 214 features). Also, X as the input of the fit function is (2000, 1, 214), and the output is (2000, 1).

ipt   = Input(shape = (2000, 214))
x     = LSTM(250, activation='tanh', return_sequences=True)(ipt)
x     = SeqSelfAttention(return_attention=True, name='att')(x)
x     = concatenate(x)
x     = Flatten()(x)
out   = Dense(1, activation='relu')(x)
dl_model = Model(ipt, out)
dl_model.compile(optimizer = 'adam', loss = 'mse')

After training the model, I use this:


outputs   = [layer.output for layer in dl_model.layers if 'att' in layer.name]
layers_fn = K.function([dl_model.input, K.learning_phase()], outputs[0])
ww = layers_fn([X_train, 1])

The problem is that all values in ww[1] are one. I used this package with a sample of random numbers, and it behaved similarly. In this particular case, how can I extract attention to understand the importance of past samples?
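
A possibly relevant observation: with X shaped (2000, 1, 214), each sample contains a single timestep, so the attention weights are a softmax over one position and are always 1 regardless of the data. One way to give the attention something to weigh is to window the series so that every sample holds several past days; a hedged sketch, with placeholder arrays standing in for the real data and an arbitrary window length:

import numpy as np

features = np.random.randn(2000, 214)   # placeholder for the real per-day feature matrix
targets = np.random.randn(2000, 1)      # placeholder for the real per-day target

def make_windows(features, targets, window=30):
    """Stack `window` consecutive days into each sample."""
    xs, ys = [], []
    for i in range(window, len(features)):
        xs.append(features[i - window:i])
        ys.append(targets[i])
    return np.asarray(xs), np.asarray(ys)

X, y = make_windows(features, targets, window=30)   # X: (1970, 30, 214), y: (1970, 1)

The model's Input shape would then be (30, 214), and the returned attention weights describe how much each of the 30 past days contributes.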

Is it ready to work with multivariate data?

Hello,

I have this model:

        variables = 9
        timeSteps = 128
        inputNet = Input(shape=(timeSteps, variables))
        lstm = Bidirectional(LSTM(100, recurrent_dropout=0.4, dropout=0.4, return_sequences=True), merge_mode='concat')(inputNet)  # worse using stateful=True
        lstm = SeqSelfAttention(attention_activation='sigmoid')(lstm)
        lstm = Bidirectional(LSTM(50, recurrent_dropout=0.4, dropout=0.4, return_sequences=False), merge_mode='concat')(lstm)  # worse using stateful=True
        classificationLayer = Dense(classes, activation='softmax')(lstm)
        model = Model(inputNet, classificationLayer)

It does not seem to improve over the same model without the line:
lstm=SeqSelfAttention(attention_activation='sigmoid')(lstm)

Does my code make sense, or does my problem/data simply not need attention?

Thanks !

No attribute like return_sequences = True/False

Hi,
I'm using the attention model with Bidirectional LSTMs for sequence classification.
Here is the code:

    inp = Input(shape=(maxlen, ), name="Input")
    x = Embedding(max_features, embed_size, weights=[embedding_matrix],
                  trainable=False, name="embedding")(inp)
    x = Bidirectional(LSTM(300, return_sequences=True, dropout=0.25,
                           recurrent_dropout=0.25),name="lstm_1")(x)
    x = Bidirectional(LSTM(300, return_sequences=True, dropout=0.25,
                           recurrent_dropout=0.25),name="lstm_2")(x)
    x = SeqSelfAttention(attention_type=SeqSelfAttention.ATTENTION_TYPE_MUL, 
                         return_sequences = False,
                         kernel_regularizer=keras.regularizers.l2(1e-4),
                         bias_regularizer=keras.regularizers.l1(1e-4),
                         attention_regularizer_weight=1e-4,
                         name="Attention")(x)
    
    x = Dense(256, activation="relu")(x)
    x = Dropout(0.25)(x)
    x = Dense(6, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x, name="Model")

and the output I'm getting is:

Layer (type)                 Output Shape              Param #   
=================================================================
Input (InputLayer)           (None, 150)               0         
_________________________________________________________________
embedding (Embedding)        (None, 150, 300)          30000000  
_________________________________________________________________
lstm_1 (Bidirectional)       (None, 150, 600)          1442400   
_________________________________________________________________
lstm_2 (Bidirectional)       (None, 150, 600)          2162400   
_________________________________________________________________
Attention (SeqSelfAttention) (None, 150, 600)          360001    
_________________________________________________________________
dense_8 (Dense)              (None, 150, 256)          153856    
_________________________________________________________________
dropout_4 (Dropout)          (None, 150, 256)          0         
_________________________________________________________________
dense_9 (Dense)              (None, 150, 6)            1542      
=================================================================
Total params: 34,120,199
Trainable params: 4,120,199
Non-trainable params: 30,000,000
_________________________________________________________________

As you can see, the output of the attention layer is (None, 150, 600) instead of (None, 600). Please tell me how I can do that.
Thanks
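
As far as the README shows, SeqSelfAttention always returns the full sequence and has no return_sequences switch. One hedged option for getting a single vector per sample is SeqWeightedAttention from the same package, which collapses the time axis; its no-argument constructor is the one that appears elsewhere on this page, and the rest of the stack below is a sketch with an assumed vocabulary size:

from tensorflow import keras
from keras_self_attention import SeqWeightedAttention

inp = keras.layers.Input(shape=(150,), name='Input')
x = keras.layers.Embedding(input_dim=100000, output_dim=300)(inp)                 # vocabulary size assumed
x = keras.layers.Bidirectional(keras.layers.LSTM(300, return_sequences=True))(x)  # (None, 150, 600)
x = SeqWeightedAttention(name='Attention')(x)                                     # (None, 150, 600) -> (None, 600)
x = keras.layers.Dense(256, activation='relu')(x)
out = keras.layers.Dense(6, activation='sigmoid')(x)
model = keras.models.Model(inputs=inp, outputs=out)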

Tensorflow 2.0 Compatibility

Hi,

When I tried to use this package with TensorFlow 2.0, I got the following error:

RuntimeError: It looks like you are trying to use a version of multi-backend Keras that does not support TensorFlow 2.0. We recommend using tf.keras, or alternatively, downgrading to TensorFlow 1.14.

"tuple index out of range" when using self attention layer in imdb dataset

Hi, I followed the steps to use the self-attention layer just as the README did, but I got an error when I created my own model.
Here is my code:

from keras.datasets import imdb # the dataset I used
# ....
review_input = Input(shape=(MAX_WORDS_PER_REVIEW,), dtype='int32')
embedding_layer = Embedding(MAX_WORDS, EMBEDDING_DIM, input_length=MAX_WORDS_PER_REVIEW)
embedding_review = embedding_layer(review_input)
lstm = LSTM(100)(embedding_review)
att_lstm = SelfAttention(units=100, attention_activation="sigmoid")(lstm)  # I used Attention Layer after LSTM layer
preds = Dense(1, activation='sigmoid')(att_lstm)

model = Model(review_input, preds)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Here is the error:

"self_attention.py", line 106, in _build_additive_attention
    feature_dim = int(input_shape[2])
IndexError: tuple index out of range

It seems that the dimension is not suitable, but I don't know why this happened, since the code worked without the attention layer.

The gradient is missing sometimes

Dear Zhao,
Recently I tried your self-attention layer for my work; I really appreciate it.

However, the gradient sometimes vanishes and training gets stuck from the very beginning; I tried a few things, but it is not stable. Could you please give some suggestions?

Best wishes,
Sunberg

Attention to 2D input

Hi @CyberZHG

Thank you for your work on this repo. I am trying to use your repo for a time series forecasting problem with a 2D tensor input to the attention module. My model:


Layer (type)                    Output Shape         Param #     Connected to                     
features (InputLayer)           (None, 16, 1816)     0                                            
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 2048)         31662080    features[0][0]                   
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 1024)         2098176     lstm_1[0][0]                     
__________________________________________________________________________________________________
leaky_re_lu_2 (LeakyReLU)       (None, 1024)         0           dense_2[0][0]                    
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 120)          123000      leaky_re_lu_2[0][0]              
__________________________________________________________________________________________________
feature_weights (InputLayer)    (None, 120)          0                                            
__________________________________________________________________________________________________
multiply_1 (Multiply)           (None, 120)          0           dense_3[0][0]                    
                                                                 feature_weights[0][0]            

Total params: 33,883,256
Trainable params: 33,883,256
Non-trainable params: 0
__________________________________________________________________________________________________

However, your attention module requires a 3D input. Can you suggest the necessary changes to make it work after the LSTM layer, i.e. with a 2D (None, 2048) input?
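
A hedged sketch of the usual fix: keep the time axis by setting return_sequences=True on the LSTM so the attention layer receives a 3D tensor, then collapse it again for the dense head (layer sizes are copied from the summary above; the pooling choice is an assumption):

from tensorflow import keras
from keras_self_attention import SeqSelfAttention

features = keras.layers.Input(shape=(16, 1816), name='features')
x = keras.layers.LSTM(2048, return_sequences=True)(features)   # (None, 16, 2048) instead of (None, 2048)
x = SeqSelfAttention(attention_activation='sigmoid')(x)
x = keras.layers.GlobalAveragePooling1D()(x)                    # back to (None, 2048) for the dense head
x = keras.layers.Dense(1024)(x)
x = keras.layers.LeakyReLU()(x)
x = keras.layers.Dense(120)(x)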

Attention Weights

How do I get the final attention weights when setting return_attention=True, supposing the word vectors are the input?
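
A hedged sketch of one way to read the weights back: with return_attention=True the layer returns a pair (output, attention), so the attention tensor can be exposed as an extra model output and fetched with predict. This mirrors the sub-model trick used in other issues on this page; word_ids below is a placeholder for an integer-encoded input batch:

from tensorflow import keras
from keras_self_attention import SeqSelfAttention

inputs = keras.layers.Input(shape=(None,))
x = keras.layers.Embedding(input_dim=10000, output_dim=300, mask_zero=True)(inputs)
x = keras.layers.Bidirectional(keras.layers.LSTM(128, return_sequences=True))(x)
att_out, att_weights = SeqSelfAttention(return_attention=True, name='Attention')(x)
probe = keras.models.Model(inputs, [att_out, att_weights])

# att_w: (batch, timesteps, timesteps); row t holds the weights token t places on every token t'
_, att_w = probe.predict(word_ids)  # word_ids: integer-encoded batch (placeholder)

In practice the probe would reuse the layers of the trained model (or select the attention layer's output from it) rather than fresh, untrained ones.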

IndexError, tuple index out of range

def multitask_seq2seq_model(output_size, pos_vocab_size,
                            lex_vocab_size, config_params,
                            visualize=False, plot=False):
    hidden_size = int(config_params['hidden_size'])
    batch_size = int(config_params['batch_size'])
    embedding_size = 768
    max_seq_len = 512
    in_id = Input(shape=(max_seq_len,), name="input_ids")
    in_mask = Input(shape=(max_seq_len,), name="input_masks")
    in_segment = Input(shape=(max_seq_len,), name="segment_ids")
    bert_inputs_ = [in_id, in_mask, in_segment]

    bert_output_ = BertEmbeddingLayer(n_fine_tune_layers=3,
                                      pooling="mean")(bert_inputs_)
    bert_output = Reshape((max_seq_len, embedding_size))(bert_output_)

    input_mask = Input(shape=(None, output_size),
                       batch_size=batch_size, name='Candidate_Synsets_Mask')

    bert_inputs_.append(input_mask)

    bilstm, forward_h, _, backward_h, _ = Bidirectional(LSTM(hidden_size,
                                                             return_sequences=True,
                                                             return_state=True,
                                                             dropout=0.2,
                                                             recurrent_dropout=0.2,
                                                             input_shape=(None, None, embedding_size)
                                                             ),
                                                        merge_mode='sum', name='Encoder_BiLSTM'
                                                        )(bert_output)

    state_h = Concatenate()([forward_h, backward_h])

    context = SeqSelfAttention(units=128)([bilstm, state_h])

    concat = Concatenate()([bilstm, context])

    decoder_fwd_lstm = LSTM(hidden_size, dropout=0.2,
                            recurrent_dropout=0.2,
                            return_sequences=True,
                            input_shape=(None, None, embedding_size),
                            name='Decoder_FWD_LSTM')(concat)

    decoder_bck_lstm = LSTM(hidden_size,
                            dropout=0.2,
                            recurrent_dropout=0.2,
                            return_sequences=True,
                            input_shape=(None, None, embedding_size),
                            go_backwards=True,
                            name='Decoder_BWD_LSTM')(decoder_fwd_lstm)

    decoder_bilstm = Concatenate()([decoder_fwd_lstm, decoder_bck_lstm])

    logits = TimeDistributed(Dense(output_size), name='WSD_logits')(decoder_bilstm)
    logits_mask = Add(name="Masked_logits")([logits, input_mask])

    pos_logits = TimeDistributed(Dense(pos_vocab_size), name='POS_logits')(decoder_bilstm)
    lex_logits = TimeDistributed(Dense(lex_vocab_size), name='LEX_logits')(decoder_bilstm)

    wsd_output = Softmax(name="WSD_output")(logits_mask)
    pos_output = Softmax(name="POS_output")(pos_logits)
    lex_output = Softmax(name="LEX_output")(lex_logits)

    model = Model(inputs=bert_inputs_, outputs=[wsd_output, pos_output, lex_output],
                  name='Bert_Attention_Seq2Seq_MultiTask')

    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=Adadelta(), metrics=['accuracy'])

    visualize_plot_mdl(visualize, plot, model)

    return model

And every time I get

    feature_dim = int(input_shape[2])
IndexError: tuple index out of range

I tried the following; none of them worked:

  1. return_sequences=True in the preceding LSTM
  2. SeqWeightedAttention() instead of the layer currently used

I am using

Python 3.6.4
tensorflow 1.12
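
A plausible cause, going by every example in the README above: SeqSelfAttention is applied to a single (batch, time, features) tensor, not to a [sequence, state] pair, so passing [bilstm, state_h] hands the layer a list of shapes (and a 2D state tensor), and indexing the missing time/feature dimension fails. A hedged sketch of the single-input call, reusing the tensors defined in the snippet above and dropping the concatenated state from the attention input:

context = SeqSelfAttention(units=128, name='Attention')(bilstm)   # attend over the BiLSTM sequence only
concat = Concatenate()([bilstm, context])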

Error (with multiplication?)

I keep getting value errors when working with your attention mechanism, such as:

ValueError: Dimensions must be equal, but are 128 and 32 for 'Attention/MatMul' (op: 'MatMul') with input shapes: [?,128], [32,32].
