
How to make decoder step? · code2seq · 18 comments · closed

tech-srl commented on May 27, 2024
How to make decoder step?

Comments (18)

urialon commented on May 27, 2024

Hi @SpirinEgor ,
Thank you for your kind words and for your interest in code2seq.

The decoder LSTM cell is of size 320 (defined here: https://github.com/tech-srl/code2seq/blob/master/model.py#L406 )
And it takes inputs of size 128, right?
I think (but am not sure) that the conversion happens in the input layer of the LSTM: an LSTM of size X can take inputs of size Y, because the inputs are multiplied by an input weight matrix.

I am not 100% sure about this; I implemented it a long time ago. I'll take another, deeper look.

Are you trying to re-implement this, and it's unclear what the TensorFlow wrappers here do?
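
A quick shape check in PyTorch illustrates the point about the input matrix (a minimal sketch, not the code2seq code itself): a cell with hidden size 320 can accept inputs of size 128 because the input has its own weight matrix.

import torch
import torch.nn as nn

# An LSTM cell with hidden size 320 can take inputs of a different size (here 128),
# because the input is multiplied by its own weight matrix W_ih before being combined
# with the hidden state.
cell = nn.LSTMCell(input_size=128, hidden_size=320)

print(cell.weight_ih.shape)  # torch.Size([1280, 128])  -> 4 gates * 320, input size 128
print(cell.weight_hh.shape)  # torch.Size([1280, 320])  -> 4 gates * 320, hidden size 320

x = torch.randn(8, 128)  # a batch of 8 inputs of size 128
h, c = cell(x)           # works: h and c both have shape [8, 320]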

SpirinEgor commented on May 27, 2024

Yes, 320 is the size of the hidden state.
In an LSTM we have equations like Wx + Uh + b, and TensorFlow creates only one matrix (a concatenation of W and U).
So when I print the shape of this matrix, I get [768; 1280]. 768 is the concatenation of x and h, and since h has size 320, x must have size 448.

I don't think there is an extra transformation of the input inside the LSTM, because for the LSTM that embeds paths I get shape [256; 512], which corresponds to a hidden state of size 128 and an embedded input of size 128.
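
For reference, the arithmetic on these fused kernels, with one hedged guess about where 448 comes from (the 128 + 320 decomposition is an assumption, not something confirmed in this thread):

# TensorFlow fuses W and U into one kernel of shape [input_size + hidden_size, 4 * hidden_size].

# Decoder LSTM kernel: [768, 1280]
assert 1280 == 4 * 320   # 4 gates, hidden size 320
assert 768 == 448 + 320  # so the per-step decoder input has size 448, not 128

# Path-encoder LSTM kernel: [256, 512]
assert 512 == 4 * 128    # 4 gates, hidden size 128
assert 256 == 128 + 128  # embedded input of size 128, as expected

# One possible explanation for 448 (an assumption): TF's AttentionWrapper by default
# concatenates the previous attention vector with the token embedding before the cell,
# which would give 128 (embedding) + 320 (attention vector) = 448.
assert 448 == 128 + 320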

Yes, I'm trying to re-implement this in PyTorch. I found some existing implementations, but they have issues, so I decided to write my own :)

urialon commented on May 27, 2024

Can you please phrase your question again? I can't see an apparent mistake, but with a concrete question maybe I could think of an answer. When you wrote "And I don't understand, how it happens" - can you please explain what happens?

(on a side note, it's not really important to implement it exactly as it is. The important thing is to use seq2seq+attention where the "memories" are the encoded paths)

SpirinEgor commented on May 27, 2024

My question is about the decoder algorithm: how do I make the decoder step?
As far as I can see, the paper and this repository use different algorithms. I implemented the paper version (I described it in the first message), but I can't achieve the same results, so I'm wondering whether there is another way to compute the next token.

urialon commented on May 27, 2024

Oh, you mean that you already implemented that and the results are lower? How much lower?

Can you use PyTorch's standard LSTM encoder-decoder +attention?

SpirinEgor commented on May 27, 2024

I got an F1 score of 30 on the validation set for java-small (I use your preprocessed data, without changes).
Yes, I use the standard LSTM from PyTorch for both the encoder and the decoder; for attention I implemented Luong general attention. Or what do you mean?
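
For context, this is roughly what a Luong "general" attention step looks like in PyTorch (an illustrative sketch with assumed sizes, not the exact implementation used here):

import torch
import torch.nn as nn

class LuongGeneralAttention(nn.Module):
    """score(h_dec, h_enc) = h_dec^T W h_enc ("general" scoring from Luong et al.)."""

    def __init__(self, decoder_size: int, encoder_size: int):
        super().__init__()
        self.w = nn.Linear(encoder_size, decoder_size, bias=False)

    def forward(self, decoder_state, encoder_outputs, mask=None):
        # decoder_state:   [batch, decoder_size]
        # encoder_outputs: [batch, n_contexts, encoder_size]
        scores = torch.bmm(self.w(encoder_outputs), decoder_state.unsqueeze(2)).squeeze(2)
        if mask is not None:                    # mask out padded contexts
            scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=1)  # [batch, n_contexts]
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
        return context, weights                 # context: [batch, encoder_size]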

I've been trying to find a bug for a few weeks; moreover, I printed the layers with their shapes for both my model and yours. It seems that the encoders are equal, but your decoder has more parameters (320 * 1280). So I think I misunderstand this part...

urialon commented on May 27, 2024

SpirinEgor commented on May 27, 2024

I will tell you the test score later; I will also test your hypothesis about increasing the decoder size, thanks for it.
Yes, I use the same hyperparameters, with an SGD optimizer with momentum. And I use an exponential scheduler to decay the learning rate.

SpirinEgor commented on May 27, 2024

Hello again,
On the test set, my model with the same hyperparameters achieves a 31.9 F1 score.
Also, I ran an experiment with a bigger DECODER_SIZE: I used 450, which gave me 17M parameters (your model contains 16.8M parameters). I got a 32.8 F1 score.

I noticed that for the test data there are ~89 paths per sample (I counted all paths and divided by the number of samples), which is much less than for training (you report 171). Is this similar to your statistics? Also, I use MAX_CONTEXT=200, which amounts to taking all paths for most of the samples. Did you use the same number for java-small?

urialon commented on May 27, 2024

SpirinEgor commented on May 27, 2024

Yes, I implemented random sampling. It works the same for all holdouts: load all paths, shuffle them if RANDOM_CONTEXT is passed, and take min(MAX_CONTEXT, len(paths)) paths to pass to the model.
For the validation and test holdouts I use RANDOM_CONTEXT=False for deterministic results.
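
A minimal sketch of that sampling step (the function name is illustrative; MAX_CONTEXT and RANDOM_CONTEXT are the hyperparameters mentioned above):

import random

def sample_contexts(paths, max_context=200, random_context=True):
    """Take at most max_context paths; shuffle first only when random_context is set."""
    paths = list(paths)
    if random_context:
        random.shuffle(paths)
    return paths[:min(max_context, len(paths))]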

Yes, I use early stopping. It takes 7-8 epochs for the loss to plateau and start to increase.

I will investigate this strange behavior; my main suspicion is the TensorFlow wrappers (their documentation is quite poor, and there is no information about how they work). I will also check my data again.

So, my last question goes back to the first message. Am I correct about the decoder algorithm? Or is there some difference; for example, sometimes attention is calculated before the RNN: the context is first concatenated with the input and only then passed to the RNN?

urialon commented on May 27, 2024

It looks like you are correct.
If I'm not missing anything, your description looks like our description in Section 3 of the paper.
It is also possible that our description in the paper is not 100% identical to what TensorFlow's wrappers do exactly. I agree that the documentation is lacking. When we wrote the paper, I dived into their implementation to understand what exactly they do.

  1. Did you read Section 3 recently, to verify that you didn't miss anything?
  2. Do you initialize the decoder state with the average of the input paths? (See the sketch after this list.)
  3. Do you encode the paths using a bidirectional LSTM?
  4. Do you use the same vocabularies and vocabulary sizes as we do? Are you using the .dict.c2s vocabulary file that our preprocessing creates?
  5. Do you use dropout in the same places, including recurrent dropout on the encoder LSTMs and the decoder LSTM?
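
A minimal sketch of points 2 and 3, assuming a PyTorch implementation with the sizes discussed above (this is an illustration, not the original TensorFlow code):

import torch
import torch.nn as nn

class PathEncoder(nn.Module):
    """Encode each path with a bidirectional LSTM and combine it with the
    terminal-token embeddings into a single context vector of size decoder_size."""

    def __init__(self, embedding_dim=128, hidden_size=128, decoder_size=320):
        super().__init__()
        self.path_lstm = nn.LSTM(embedding_dim, hidden_size,
                                 batch_first=True, bidirectional=True)
        # [start embedding; forward + backward path state; end embedding] -> decoder_size
        self.combine = nn.Linear(2 * embedding_dim + 2 * hidden_size, decoder_size)

    def forward(self, path_node_embeddings, start_token_emb, end_token_emb):
        # path_node_embeddings: [n_contexts, path_len, embedding_dim]
        _, (h_n, _) = self.path_lstm(path_node_embeddings)
        path_repr = torch.cat([h_n[0], h_n[1]], dim=1)  # final forward and backward states
        combined = torch.cat([start_token_emb, path_repr, end_token_emb], dim=1)
        return torch.tanh(self.combine(combined))       # [n_contexts, decoder_size]

def initial_decoder_state(encoded_contexts, context_mask):
    # encoded_contexts: [batch, n_contexts, decoder_size]
    # context_mask:     [batch, n_contexts], 1.0 for real contexts and 0.0 for PAD
    mask = context_mask.float().unsqueeze(2)
    lengths = mask.sum(dim=1).clamp(min=1.0)
    return (encoded_contexts * mask).sum(dim=1) / lengths  # [batch, decoder_size]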

SpirinEgor commented on May 27, 2024

Yes, I came up with the algorithm after reading Section 3. I haven't closed your paper for a few weeks; I'm always looking for some detail :)

Yes, I'm using the average of the input paths to initialize the decoder state. And I'm using a bidirectional LSTM to encode them.

For building the vocabulary I use your .dict.c2s, so all embedding matrices have the same shapes as yours.

There are some differences in LSTM dropout between PyTorch and TensorFlow. In TF, dropout is applied before each layer (so for a 1-layer LSTM it is applied once). But in PyTorch, dropout is applied after each layer except the last (so for a 1-layer LSTM it isn't applied at all). So I add an additional dropout layer before the LSTM, as sketched below.
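
A sketch of that workaround (the dropout rate here is illustrative):

import torch.nn as nn

class LSTMWithInputDropout(nn.Module):
    """Single-layer LSTM with dropout applied to its inputs, mimicking the
    TF placement of dropout before the layer."""

    def __init__(self, input_size, hidden_size, p=0.25, bidirectional=False):
        super().__init__()
        self.dropout = nn.Dropout(p)
        self.lstm = nn.LSTM(input_size, hidden_size,
                            batch_first=True, bidirectional=bidirectional)

    def forward(self, x, state=None):
        return self.lstm(self.dropout(x), state)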

Ok, thank you for all the answers! I will dig into the TensorFlow wrappers. I hope to find something useful. :)

urialon commented on May 27, 2024

SpirinEgor commented on May 27, 2024

Hi again :)

Returning to the average number of contexts per sample: I use your preprocessed data for the training holdout and this script to count the average number of contexts per sample:

n_samples_true = 0
n_paths_true = 0
with open("java-small/java-small.train.c2s") as data_file:
    for line in data_file:
        name, *context = line.strip().split(' ')
        n_samples_true += 1
        n_paths_true += len(context)

After that, n_paths_true / n_samples_true returned 135.21. This is much less than 171, so my question is: am I reading the data incorrectly, or is the wrong archive loaded on S3?

Also, what do you do with samples where the number of paths is less than MAX_CONTEXT?

These questions don't concern the decoder algorithm, so we can close this issue and open a new one about the data.

urialon commented on May 27, 2024

Yes, you're right, I am getting 135 as well.
I am pretty sure that the correct data is loaded on S3.
To verify this, you can train my code on this data and see whether it reaches the results reported in the paper.

Either the 171 value is mistaken, or it is possible that I computed this average before truncating to 1000 paths:

For example, it is possible that originally there were 2000 paths in a given example. Then, during preprocessing, I sampled only 1000 paths to keep in the dataset. Out of those 1000 paths, a different subset of 200 is sampled dynamically during training.

urialon commented on May 27, 2024

When the number of paths is less than MAX_CONTEXT, we simply pad them (append a learned "PAD" vector such that there are 200 paths overall).
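
A minimal sketch of that padding step, assuming the contexts are already encoded into vectors of size DECODER_SIZE (the names here are illustrative):

import torch
import torch.nn as nn

MAX_CONTEXT = 200
DECODER_SIZE = 320

pad_vector = nn.Parameter(torch.zeros(DECODER_SIZE))  # learned "PAD" context vector

def pad_contexts(encoded_contexts):
    # encoded_contexts: [n_real_contexts, DECODER_SIZE], with n_real_contexts <= MAX_CONTEXT
    n_missing = MAX_CONTEXT - encoded_contexts.size(0)
    if n_missing > 0:
        padding = pad_vector.unsqueeze(0).expand(n_missing, -1)
        encoded_contexts = torch.cat([encoded_contexts, padding], dim=0)
    return encoded_contexts  # [MAX_CONTEXT, DECODER_SIZE]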

urialon commented on May 27, 2024

Closing due to inactivity, feel free to re-open.
