j-min / mocha-pytorch
PyTorch Implementation of "Monotonic Chunkwise Attention" (ICLR 2018)
Is the attention returned by MonotonicAttention.soft() a probability distribution?
It doesn't seem to be. The following code:
import torch
from attention import MonotonicAttention

monotonic = MonotonicAttention().cuda()

batch_size = 1
sequence_length = 5
enc_dim, dec_dim = 10, 10

prev_attention = None
for t in range(5):
    encoder_outputs = torch.randn(batch_size, sequence_length, enc_dim).cuda()
    decoder_h = torch.randn(batch_size, dec_dim).cuda()
    attention = monotonic.soft(encoder_outputs, decoder_h, previous_alpha=prev_attention)
    prev_attention = attention
    # probability distribution?
    print(torch.sum(attention, dim=-1).detach().cpu().numpy())
prints:
[1.]
[0.0550258]
[0.00664481]
[0.00043618]
[4.0174375e-05]
If it were a probability distribution like softmax, every row should sum to 1, shouldn't it? As a consequence, my alignments look like this image:
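For reference, my understanding of the soft monotonic attention recurrence from Raffel et al. (2017), which MoChA builds on, is

\alpha_{i,j} = p_{i,j} \sum_{k=1}^{j} \left( \alpha_{i-1,k} \prod_{l=k}^{j-1} (1 - p_{i,l}) \right),

computed in parallel as

\alpha_i = p_i \cdot \mathrm{cumprod}(1 - p_i) \cdot \mathrm{cumsum}\!\left( \frac{\alpha_{i-1}}{\mathrm{cumprod}(1 - p_i)} \right),

with an exclusive cumprod. Under this formulation each row of \alpha sums to at most 1 (the missing mass is the probability of never attending), so a small deficit is expected, but a geometric collapse like the output above looks like a bug.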
So my questions are:
cumprod in the MoChA paper is defined to be exclusive, while the safe_cumprod in this repo is not. Shouldn't it be:
def safe_cumprod(self, x, exclusive=False):
    """Numerically stable cumulative product via cumulative sum in log-space."""
    bsz = x.size(0)
    logsum = torch.cumsum(torch.log(torch.clamp(x, min=1e-20, max=1)), dim=1)
    if exclusive:
        # Shift right by one step: prepend log(1) = 0 and drop the last entry.
        logsum = torch.cat([torch.zeros(bsz, 1).to(logsum), logsum], dim=1)[:, :-1]
    return torch.exp(logsum)
And in the soft() function of MonotonicAttention:
cumprod_1_minus_p = self.safe_cumprod(1 - p_select, exclusive=True)
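and then, if I follow the paper's parallel form, alpha would be computed along these lines (a sketch under my assumptions: p_select and previous_alpha are both (batch, sequence_length) tensors as in the snippet above, and the clamp on the denominator is my addition to guard the division):

alpha = p_select * cumprod_1_minus_p * torch.cumsum(
    previous_alpha / torch.clamp(cumprod_1_minus_p, min=1e-10), dim=1)  # (batch, sequence_length)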
I tried this MonotonicAttention in my seq2seq model, which works well with vanilla attention, but after training for a while it still ran into the NaN-gradient issue. I checked which parameters had NaN gradients: they were all parameters upstream of MonotonicAttention's output. When I removed the safe_cumprod operation, training worked fine, so I think there may be a problem there. Has anyone else tried MonotonicAttention, and what was your experience?
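In case it helps anyone reproduce this, a hypothetical snippet for finding which parameters first get NaN gradients (register_hook is standard PyTorch; model stands in for your seq2seq module):

import torch

def report_nan(name):
    def hook(grad):
        # Print the parameter name whenever its gradient contains NaN.
        if torch.isnan(grad).any():
            print(f"NaN grad in {name}")
    return hook

for name, param in model.named_parameters():
    if param.requires_grad:
        param.register_hook(report_nan(name))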
I think
energy = self.tanh(self.W(encoder_outputs) + self.V(decoder_h).repeat(sequence_length, 1) + self.b)
should be written as
energy = self.tanh(self.W(encoder_outputs) + self.V(decoder_h).repeat(1,sequence_length).reshape(batch_size*sequence_length,-1) + self.b)
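The difference only shows up for batch_size > 1: repeat(sequence_length, 1) tiles the whole batch ([b0, b1, b0, b1, ...]), while the energies need each batch item repeated contiguously ([b0, b0, b1, b1, ...]). A toy check (the concrete shapes are made up for illustration):

import torch

batch_size, sequence_length = 2, 2
v = torch.tensor([[1., 1., 1.],
                  [2., 2., 2.]])  # stand-in for self.V(decoder_h), shape (batch, dec_dim)

a = v.repeat(sequence_length, 1)  # rows ordered [b0, b1, b0, b1]
b = v.repeat(1, sequence_length).reshape(batch_size * sequence_length, -1)  # [b0, b0, b1, b1]
print(torch.equal(a, b))  # False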
Excuse me, are there any trained weights or training code available?