
mocha-pytorch's Introduction

PyTorch Implementation of Monotonic Chunkwise Attention

Requirements

  • PyTorch 0.4

TODOs

  • Soft MoChA
  • Hard MoChA
  • Linear Time Decoding
  • Experiment with Real-world dataset

Model figure

[Figure 1: MoChA model diagram]

Linear Time Decoding

It's not clear whether the authors' TensorFlow implementation supports decoding in linear time: it computes energies over the entire encoder output at every decoder step instead of scanning forward from the previously attended encoder position.
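For contrast, a minimal sketch of what a linear-time hard monotonic decoding step could look like, for batch size 1 and assuming a hypothetical energy_at(encoder_outputs, decoder_h, j) helper that scores a single encoder position (nothing like this exists in the repo yet; Hard MoChA and linear-time decoding are still TODOs):

import torch

def hard_monotonic_step(energy_at, encoder_outputs, decoder_h, prev_index):
    """One hard monotonic attention step for batch size 1.

    Scans forward from the previously attended index and stops at the
    first position whose selection probability exceeds 0.5; since the
    attended index only moves forward, the scans over an entire decode
    advance at most T positions in total.
    """
    T = encoder_outputs.size(1)
    for j in range(prev_index, T):
        p_select = torch.sigmoid(energy_at(encoder_outputs, decoder_h, j))
        if p_select > 0.5:
            return j  # attend here; the next step resumes scanning from j
    return None  # never selected; one convention is a zero context vector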

References

  • Chung-Cheng Chiu and Colin Raffel. "Monotonic Chunkwise Attention." ICLR 2018. arXiv:1712.05382
  • Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, and Douglas Eck. "Online and Linear-Time Attention by Enforcing Monotonic Alignments." ICML 2017. arXiv:1704.00784

mocha-pytorch's People

Contributors

j-min

mocha-pytorch's Issues

Something Wrong in Energy

I think

energy = self.tanh(self.W(encoder_outputs) + self.V(decoder_h).repeat(sequence_length, 1) + self.b)

should be written as

energy = self.tanh(self.W(encoder_outputs) + self.V(decoder_h).repeat(1, sequence_length).reshape(batch_size * sequence_length, -1) + self.b)

because repeat(sequence_length, 1) tiles whole batches ([b0, b1, ..., b0, b1, ...]) while the flattened encoder outputs are ordered per batch element ([b0t0, b0t1, ..., b1t0, ...]), so decoder states get added to the wrong batch elements' encoder outputs.
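For comparison, here is a broadcasting version that sidesteps the repeat/reshape bookkeeping entirely. This is a self-contained sketch, not the repo's code: W, V, and b stand in for the module's self.W, self.V, and self.b, and all dimensions are made up:

import torch
import torch.nn as nn

batch_size, sequence_length, enc_dim, dec_dim, att_dim = 2, 5, 10, 10, 8
W = nn.Linear(enc_dim, att_dim, bias=False)  # stands in for self.W
V = nn.Linear(dec_dim, att_dim, bias=False)  # stands in for self.V
b = torch.zeros(att_dim)                     # stands in for self.b

encoder_outputs = torch.randn(batch_size, sequence_length, enc_dim)
decoder_h = torch.randn(batch_size, dec_dim)

# (batch, 1, att_dim) broadcasts against (batch, seq_len, att_dim), pairing
# each decoder state with every timestep of its own batch element -- the
# same pairing the repeat(1, sequence_length).reshape(...) fix produces.
energy = torch.tanh(W(encoder_outputs) + V(decoder_h).unsqueeze(1) + b)
print(energy.shape)  # torch.Size([2, 5, 8])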

Questions about MonotonicAttention.soft

Is the attention returned by MonotonicAttention.soft() a probability distribution?

It seems not to be. The following code:

import torch
from attention import MonotonicAttention

monotonic = MonotonicAttention().cuda()

batch_size = 1
sequence_length = 5
enc_dim, dec_dim = 10, 10
prev_attention = None
for t in range(5):
    encoder_outputs = torch.randn(batch_size, sequence_length, enc_dim).cuda()
    decoder_h = torch.randn(batch_size, dec_dim).cuda()
    attention = monotonic.soft(encoder_outputs, decoder_h, previous_alpha=prev_attention)
    prev_attention = attention
    # probability distribution?
    print(torch.sum(attention, dim=-1).detach().cpu().numpy())

returns:

[1.]
[0.0550258]
[0.00664481]
[0.00043618]
[4.0174375e-05]

If it were a probability distribution like softmax, every row would sum to 1, wouldn't it? The consequence is that my alignments look like this:
[Image: alignment plot]

So my questions are:

  • Is the attention returned by MonotonicAttention.soft() a probability distribution?
  • If not, is it possible to convert it to one? (See the sketch after this list.)
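For what it's worth, the leak appears to be expected behaviour rather than a bug: each row of alpha may sum to less than 1 because probability mass that "falls off the end" of the encoder sequence is simply dropped. A rough sketch of two ad-hoc conversions (neither is part of this repo), assuming attention has shape (batch, seq_len):

import torch

def to_distribution(attention, mode="renorm", eps=1e-8):
    """Turn a leaky monotonic alpha (batch, seq_len) into a distribution."""
    if mode == "renorm":
        # Rescale each row to sum to 1 (changes the absolute scale of the
        # weights across decoder steps).
        return attention / (attention.sum(dim=-1, keepdim=True) + eps)
    # Otherwise, dump the leftover mass onto the last encoder position,
    # leaving the existing weights untouched.
    residual = 1.0 - attention.sum(dim=-1, keepdim=True)
    out = attention.clone()
    out[:, -1:] = out[:, -1:] + residual
    return out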

implementation of `safe_cumprod`

cumprod in the MoChA paper is defined to be exclusive, while safe_cumprod in this repo is not. Shouldn't it be:

def safe_cumprod(self, x, exclusive=False):
    """Numerically stable cumulative product via cumulative sum in log-space."""
    bsz = x.size(0)
    logsum = torch.cumsum(torch.log(torch.clamp(x, min=1e-20, max=1)), dim=1)
    if exclusive:
        # Shift right by one step so position i holds prod(x[:, :i]),
        # with an implicit leading 1 (log 1 = 0).
        logsum = torch.cat([torch.zeros(bsz, 1).to(logsum), logsum], dim=1)[:, :-1]
    return torch.exp(logsum)

And in the function soft() of MonotonicAttention:

cumprod_1_minus_p = self.safe_cumprod(1 - p_select, exclusive=True)
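As a sanity check on the proposed flag: the exclusive cumulative product of [a, b, c] is [1, a, a*b] rather than [a, a*b, a*b*c]. A standalone snippet using plain tensors instead of the repo's method:

import torch

x = torch.tensor([[0.5, 0.4, 0.2]])
print(torch.cumprod(x, dim=1))  # inclusive: [[0.5000, 0.2000, 0.0400]]

logsum = torch.cumsum(torch.log(torch.clamp(x, min=1e-20, max=1)), dim=1)
exclusive = torch.exp(torch.cat([torch.zeros(1, 1), logsum], dim=1)[:, :-1])
print(exclusive)  # exclusive: [[1.0000, 0.5000, 0.2000]]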

safe_cumprod still causes NaN grad

I tried this MonotonicAttention in my seq2seq model, which works well with vanilla attention, but after training for a while it still hit the NaN gradient issue. The parameters with NaN gradients are all upstream of MonotonicAttention's output. When I removed the safe_cumprod operation, training worked fine, so I think there may be a problem there. Has anyone else tried MonotonicAttention, and what was your experience?
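Not a fix, but a debugging sketch that may help localize the problem, assuming a PyTorch recent enough to have anomaly detection (it landed after 0.4): it raises at the first backward op that produces a non-finite value, with a traceback into the forward graph, which should confirm whether safe_cumprod or something upstream (e.g. the energy computation) is the real source:

import torch

# Raise at the first backward op producing NaN/Inf, with a traceback into
# the forward graph, instead of letting the NaN corrupt the parameters.
torch.autograd.set_detect_anomaly(True)

def report_bad_grads(model):
    """After loss.backward(), list parameters with non-finite gradients."""
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print("non-finite grad:", name)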
