
mocha-pytorch's Introduction

PyTorch Implementation of Monotonic Chunkwise Attention

Requirements

  • PyTorch 0.4

TODOs

  • Soft MoChA
  • Hard MoChA
  • Linear Time Decoding
  • Experiment with Real-world dataset

Model figure

[Figure 1: MoChA model diagram]

Linear Time Decoding

It's not clear whether the authors' TensorFlow implementation supports decoding in linear time: it computes energies over the entire encoder output at every decoder step instead of scanning forward from the previously attended encoder position.
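For contrast, a minimal sketch of what a linear-time hard monotonic decoding step could look like, for batch size 1 and assuming a hypothetical energy_at(encoder_outputs, decoder_h, j) helper that scores a single encoder position (nothing like this exists in the repo yet; Hard MoChA and linear-time decoding are still TODOs):

import torch

def hard_monotonic_step(energy_at, encoder_outputs, decoder_h, prev_index):
    """One hard monotonic attention step for batch size 1.

    Scans forward from the previously attended index and stops at the
    first position whose selection probability exceeds 0.5; since the
    attended index only moves forward, the scans over an entire decode
    advance at most T positions in total.
    """
    T = encoder_outputs.size(1)
    for j in range(prev_index, T):
        p_select = torch.sigmoid(energy_at(encoder_outputs, decoder_h, j))
        if p_select > 0.5:
            return j  # attend here; the next step resumes scanning from j
    return None  # never selected; one convention is a zero context vector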

References

  • Chung-Cheng Chiu and Colin Raffel. "Monotonic Chunkwise Attention." ICLR 2018. arXiv:1712.05382
  • Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, and Douglas Eck. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." ICML 2017. arXiv:1704.00784

mocha-pytorch's People

Contributors

j-min

mocha-pytorch's Issues

Something Wrong in Energy

I think

energy = self.tanh(self.W(encoder_outputs) + self.V(decoder_h).repeat(sequence_length, 1) + self.b)

should be written as

energy = self.tanh(self.W(encoder_outputs) + self.V(decoder_h).repeat(1, sequence_length).reshape(batch_size * sequence_length, -1) + self.b)

because repeat(sequence_length, 1) tiles whole batches ([b0, b1, ..., b0, b1, ...]) while the flattened encoder outputs are ordered per batch element ([b0t0, b0t1, ..., b1t0, ...]), so decoder states get added to the wrong batch elements' encoder outputs.
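For comparison, here is a broadcasting version that sidesteps the repeat/reshape bookkeeping entirely. This is a self-contained sketch, not the repo's code: W, V, and b stand in for the module's self.W, self.V, and self.b, and all dimensions are made up:

import torch
import torch.nn as nn

batch_size, sequence_length, enc_dim, dec_dim, att_dim = 2, 5, 10, 10, 8
W = nn.Linear(enc_dim, att_dim, bias=False)  # stands in for self.W
V = nn.Linear(dec_dim, att_dim, bias=False)  # stands in for self.V
b = torch.zeros(att_dim)                     # stands in for self.b

encoder_outputs = torch.randn(batch_size, sequence_length, enc_dim)
decoder_h = torch.randn(batch_size, dec_dim)

# (batch, 1, att_dim) broadcasts against (batch, seq_len, att_dim), pairing
# each decoder state with every timestep of its own batch element -- the
# same pairing the repeat(1, sequence_length).reshape(...) fix produces.
energy = torch.tanh(W(encoder_outputs) + V(decoder_h).unsqueeze(1) + b)
print(energy.shape)  # torch.Size([2, 5, 8])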

Questions about MonotonicAttention.soft

Is the attention returned by MonotonicAttention.soft() a probability distribution?

It seems not to be. The following code:

import torch
from attention import MonotonicAttention

monotonic = MonotonicAttention().cuda()

batch_size = 1
sequence_length = 5
enc_dim, dec_dim = 10, 10
prev_attention = None
for t in range(5):
    encoder_outputs = torch.randn(batch_size, sequence_length, enc_dim).cuda()
    decoder_h = torch.randn(batch_size, dec_dim).cuda()
    attention = monotonic.soft(encoder_outputs, decoder_h, previous_alpha=prev_attention)
    prev_attention = attention
    # probability distribution?
    print(torch.sum(attention, dim=-1).detach().cpu().numpy())

returns:

[1.]
[0.0550258]
[0.00664481]
[0.00043618]
[4.0174375e-05]

If it were a probability distribution like softmax, every row would sum to 1, wouldn't it? The consequence is that my alignments look like this:
[Image: alignment plot]

So my questions are:

  • Is the attention returned by MonotonicAttention.soft() a probability distribution?
  • If not, is it possible to convert it to one? (See the sketch after this list.)
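For what it's worth, the leak appears to be expected behaviour rather than a bug: each row of alpha may sum to less than 1 because probability mass that "falls off the end" of the encoder sequence is simply dropped. A rough sketch of two ad-hoc conversions (neither is part of this repo), assuming attention has shape (batch, seq_len):

import torch

def to_distribution(attention, mode="renorm", eps=1e-8):
    """Turn a leaky monotonic alpha (batch, seq_len) into a distribution."""
    if mode == "renorm":
        # Rescale each row to sum to 1 (changes the absolute scale of the
        # weights across decoder steps).
        return attention / (attention.sum(dim=-1, keepdim=True) + eps)
    # Otherwise, dump the leftover mass onto the last encoder position,
    # leaving the existing weights untouched.
    residual = 1.0 - attention.sum(dim=-1, keepdim=True)
    out = attention.clone()
    out[:, -1:] = out[:, -1:] + residual
    return out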

implementation of `safe_cumprod`

cumprod in the MoChA paper is defined to be exclusive, while safe_cumprod in this repo is not. Shouldn't it be:

def safe_cumprod(self, x, exclusive=False):
    """Numerically stable cumulative product via cumulative sum in log-space."""
    bsz = x.size(0)
    logsum = torch.cumsum(torch.log(torch.clamp(x, min=1e-20, max=1)), dim=1)
    if exclusive:
        # Shift right by one step so position i holds prod(x[:, :i]),
        # with an implicit leading 1 (log 1 = 0).
        logsum = torch.cat([torch.zeros(bsz, 1).to(logsum), logsum], dim=1)[:, :-1]
    return torch.exp(logsum)

And in the function soft() of MonotonicAttention:

cumprod_1_minus_p = self.safe_cumprod(1 - p_select, exclusive=True)
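As a sanity check on the proposed flag: the exclusive cumulative product of [a, b, c] is [1, a, a*b] rather than [a, a*b, a*b*c]. A standalone snippet using plain tensors instead of the repo's method:

import torch

x = torch.tensor([[0.5, 0.4, 0.2]])
print(torch.cumprod(x, dim=1))  # inclusive: [[0.5000, 0.2000, 0.0400]]

logsum = torch.cumsum(torch.log(torch.clamp(x, min=1e-20, max=1)), dim=1)
exclusive = torch.exp(torch.cat([torch.zeros(1, 1), logsum], dim=1)[:, :-1])
print(exclusive)  # exclusive: [[1.0000, 0.5000, 0.2000]]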

safe_cumprod still causes NaN grad

I tried this MonotonicAttention in my seq2seq model, which works well with vanilla attention, but after training for a while it still hit the NaN gradient issue. The parameters with NaN gradients are all upstream of MonotonicAttention's output. When I removed the safe_cumprod operation, training worked fine, so I think there may be a problem there. Has anyone else tried MonotonicAttention, and what was your experience?
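Not a fix, but a debugging sketch that may help localize the problem, assuming a PyTorch recent enough to have anomaly detection (it landed after 0.4): it raises at the first backward op that produces a non-finite value, with a traceback into the forward graph, which should confirm whether safe_cumprod or something upstream (e.g. the energy computation) is the real source:

import torch

# Raise at the first backward op producing NaN/Inf, with a traceback into
# the forward graph, instead of letting the NaN corrupt the parameters.
torch.autograd.set_detect_anomaly(True)

def report_bad_grads(model):
    """After loss.backward(), list parameters with non-finite gradients."""
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print("non-finite grad:", name)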
