
How to make decoder step? · code2seq · 18 comments · closed

tech-srl commented on May 27, 2024
How to make decoder step?

Comments (18)

urialon commented on May 27, 2024

Hi @SpirinEgor ,
Thank you for your kind words and for your interest in code2seq.

The decoder LSTM cell is of size 320 (defined here: https://github.com/tech-srl/code2seq/blob/master/model.py#L406 )
And it takes inputs of size 128, right?
I think (but am not sure) that the conversion happens in the input layer of the LSTM: an LSTM of size X can take inputs of size Y, because the inputs are multiplied by an input weight matrix.

I am not 100% sure about this; I implemented it a long time ago. I'll take another, deeper look.

Are you trying to re-implement this, and it's unclear what the TensorFlow wrappers here do?
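
A quick shape check in PyTorch illustrates the point about the input matrix (a minimal sketch, not the code2seq code itself): a cell with hidden size 320 can accept inputs of size 128 because the input has its own weight matrix.

import torch
import torch.nn as nn

# An LSTM cell with hidden size 320 can take inputs of a different size (here 128),
# because the input is multiplied by its own weight matrix W_ih before being combined
# with the hidden state.
cell = nn.LSTMCell(input_size=128, hidden_size=320)

print(cell.weight_ih.shape)  # torch.Size([1280, 128])  -> 4 gates * 320, input size 128
print(cell.weight_hh.shape)  # torch.Size([1280, 320])  -> 4 gates * 320, hidden size 320

x = torch.randn(8, 128)  # a batch of 8 inputs of size 128
h, c = cell(x)           # works: h and c both have shape [8, 320]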

SpirinEgor commented on May 27, 2024

Yes, 320 is the size of the hidden state.
In an LSTM we have equations like Wx + Uh + b, and TensorFlow creates only one matrix (a concatenation of W and U).
So when I print the shape of this matrix, I get [768; 1280]. 768 is the concatenation of x and h, and since h has size 320, x must have size 448.

I don't think there is an extra transformation of the input inside the LSTM, because for the LSTM that embeds paths I get shape [256; 512], which corresponds to a hidden state of size 128 and an embedded input of size 128.
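
For reference, the arithmetic on these fused kernels, with one hedged guess about where 448 comes from (the 128 + 320 decomposition is an assumption, not something confirmed in this thread):

# TensorFlow fuses W and U into one kernel of shape [input_size + hidden_size, 4 * hidden_size].

# Decoder LSTM kernel: [768, 1280]
assert 1280 == 4 * 320   # 4 gates, hidden size 320
assert 768 == 448 + 320  # so the per-step decoder input has size 448, not 128

# Path-encoder LSTM kernel: [256, 512]
assert 512 == 4 * 128    # 4 gates, hidden size 128
assert 256 == 128 + 128  # embedded input of size 128, as expected

# One possible explanation for 448 (an assumption): TF's AttentionWrapper by default
# concatenates the previous attention vector with the token embedding before the cell,
# which would give 128 (embedding) + 320 (attention vector) = 448.
assert 448 == 128 + 320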

Yes, I'm trying to re-implement this in PyTorch. I found some existing implementations, but they have issues, so I decided to write my own :)

urialon commented on May 27, 2024

Can you please phrase your question again? I can't see an apparent mistake, but with a concrete question maybe I could think of an answer. When you wrote "And I don't understand, how it happens" - can you please explain what happens?

(on a side note, it's not really important to implement it exactly as it is. The important thing is to use seq2seq+attention where the "memories" are the encoded paths)

SpirinEgor commented on May 27, 2024

My question is about the decoder algorithm: how do I make the decoder step?
As far as I can see, the paper and this repository use different algorithms. I implemented the paper version (I described it in the first message), but I can't achieve the same results, so I'm wondering whether there is another way to compute the next token.

urialon commented on May 27, 2024

Oh, you mean that you already implemented that and the results are lower? How much lower?

Can you use PyTorch's standard LSTM encoder-decoder +attention?

SpirinEgor commented on May 27, 2024

I got an F1 score of 30 on the validation set for java-small (I use your preprocessed data, without changes).
Yes, I use the standard LSTM from PyTorch for both the encoder and the decoder; for attention I implemented Luong general attention. Or what do you mean?
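
For context, this is roughly what a Luong "general" attention step looks like in PyTorch (an illustrative sketch with assumed sizes, not the exact implementation used here):

import torch
import torch.nn as nn

class LuongGeneralAttention(nn.Module):
    """score(h_dec, h_enc) = h_dec^T W h_enc ("general" scoring from Luong et al.)."""

    def __init__(self, decoder_size: int, encoder_size: int):
        super().__init__()
        self.w = nn.Linear(encoder_size, decoder_size, bias=False)

    def forward(self, decoder_state, encoder_outputs, mask=None):
        # decoder_state:   [batch, decoder_size]
        # encoder_outputs: [batch, n_contexts, encoder_size]
        scores = torch.bmm(self.w(encoder_outputs), decoder_state.unsqueeze(2)).squeeze(2)
        if mask is not None:                    # mask out padded contexts
            scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=1)  # [batch, n_contexts]
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
        return context, weights                 # context: [batch, encoder_size]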

I've been trying to find a bug for a few weeks; moreover, I printed the layers with their shapes for both my model and yours. It seems that the encoders are equal, but your decoder has more parameters (320 * 1280). So I think I misunderstand this part...

urialon commented on May 27, 2024

SpirinEgor commented on May 27, 2024

I will tell you the test score later; I will also test your hypothesis about increasing the decoder size, thanks for it.
Yes, I use the same hyperparameters, with an SGD optimizer with momentum. And I use an exponential scheduler to decay the learning rate.

SpirinEgor commented on May 27, 2024

Hello again,
On the test set, my model with the same hyperparameters achieves a 31.9 F1 score.
Also, I ran an experiment with a bigger DECODER_SIZE: I used 450, which gave me 17M parameters (your model contains 16.8M parameters). I got a 32.8 F1 score.

I noticed that for the test data there are ~89 paths per sample (I counted all paths and divided by the number of samples), which is much less than for training (you report 171). Is this similar to your statistics? Also, I use MAX_CONTEXT=200, which amounts to taking all paths for most of the samples. Did you use the same number for java-small?

urialon commented on May 27, 2024

SpirinEgor commented on May 27, 2024

Yes, I implemented random sampling. It works the same for all holdouts: load all paths, shuffle them if RANDOM_CONTEXT is passed, and take min(MAX_CONTEXT, len(paths)) paths to pass to the model.
For the validation and test holdouts I use RANDOM_CONTEXT=False for deterministic results.
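
A minimal sketch of that sampling step (the function name is illustrative; MAX_CONTEXT and RANDOM_CONTEXT are the hyperparameters mentioned above):

import random

def sample_contexts(paths, max_context=200, random_context=True):
    """Take at most max_context paths; shuffle first only when random_context is set."""
    paths = list(paths)
    if random_context:
        random.shuffle(paths)
    return paths[:min(max_context, len(paths))]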

Yes, I use early stopping. It takes 7-8 epochs for the loss to plateau and start to increase.

I will investigate this strange behavior; my main suspicion is the TensorFlow wrappers (their documentation is quite poor, and there is no information about how they work). I will also check my data again.

So, my last question goes back to the first message. Am I correct about the decoder algorithm? Or is there some difference; for example, sometimes attention is calculated before the RNN: the context is first concatenated with the input and only then passed to the RNN?

urialon commented on May 27, 2024

It looks like you are correct.
If I'm not missing anything, your description looks like our description in Section 3 of the paper.
It is also possible that our description in the paper is not 100% identical to what TensorFlow's wrappers do exactly. I agree that the documentation is lacking. When we wrote the paper, I dived into their implementation to understand what exactly they do.

  1. Did you read Section 3 recently, to verify that you didn't miss anything?
  2. Do you initialize the decoder state with the average of the input paths? (See the sketch after this list.)
  3. Do you encode the paths using a bidirectional LSTM?
  4. Do you use the same vocabularies and vocabulary sizes as we do? Are you using the .dict.c2s vocabulary file that our preprocessing creates?
  5. Do you use dropout in the same places, including recurrent dropout on the encoder LSTMs and the decoder LSTM?
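
A minimal sketch of points 2 and 3, assuming a PyTorch implementation with the sizes discussed above (this is an illustration, not the original TensorFlow code):

import torch
import torch.nn as nn

class PathEncoder(nn.Module):
    """Encode each path with a bidirectional LSTM and combine it with the
    terminal-token embeddings into a single context vector of size decoder_size."""

    def __init__(self, embedding_dim=128, hidden_size=128, decoder_size=320):
        super().__init__()
        self.path_lstm = nn.LSTM(embedding_dim, hidden_size,
                                 batch_first=True, bidirectional=True)
        # [start embedding; forward + backward path state; end embedding] -> decoder_size
        self.combine = nn.Linear(2 * embedding_dim + 2 * hidden_size, decoder_size)

    def forward(self, path_node_embeddings, start_token_emb, end_token_emb):
        # path_node_embeddings: [n_contexts, path_len, embedding_dim]
        _, (h_n, _) = self.path_lstm(path_node_embeddings)
        path_repr = torch.cat([h_n[0], h_n[1]], dim=1)  # final forward and backward states
        combined = torch.cat([start_token_emb, path_repr, end_token_emb], dim=1)
        return torch.tanh(self.combine(combined))       # [n_contexts, decoder_size]

def initial_decoder_state(encoded_contexts, context_mask):
    # encoded_contexts: [batch, n_contexts, decoder_size]
    # context_mask:     [batch, n_contexts], 1.0 for real contexts and 0.0 for PAD
    mask = context_mask.float().unsqueeze(2)
    lengths = mask.sum(dim=1).clamp(min=1.0)
    return (encoded_contexts * mask).sum(dim=1) / lengths  # [batch, decoder_size]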

SpirinEgor commented on May 27, 2024

Yes, I came up with the algorithm after reading Section 3. I haven't closed your paper for a few weeks; I'm always looking for some detail :)

Yes, I'm using the average of the input paths to initialize the decoder state. And I'm using a bidirectional LSTM to encode them.

For building the vocabulary I use your .dict.c2s, so all embedding matrices have the same shapes as yours.

There are some differences in LSTM dropout between PyTorch and TensorFlow. In TF, dropout is applied before each layer (so for a 1-layer LSTM it is applied once). But in PyTorch, dropout is applied after each layer except the last (so for a 1-layer LSTM it isn't applied at all). So I add an additional dropout layer before the LSTM, as sketched below.
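
A sketch of that workaround (the dropout rate here is illustrative):

import torch.nn as nn

class LSTMWithInputDropout(nn.Module):
    """Single-layer LSTM with dropout applied to its inputs, mimicking the
    TF placement of dropout before the layer."""

    def __init__(self, input_size, hidden_size, p=0.25, bidirectional=False):
        super().__init__()
        self.dropout = nn.Dropout(p)
        self.lstm = nn.LSTM(input_size, hidden_size,
                            batch_first=True, bidirectional=bidirectional)

    def forward(self, x, state=None):
        return self.lstm(self.dropout(x), state)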

Ok, thank you for all the answers! I will dig into the TensorFlow wrappers. I hope to find something useful. :)

urialon commented on May 27, 2024

SpirinEgor commented on May 27, 2024

Hi again :)

Returning to the average number of contexts per sample: I use your preprocessed data for the training holdout and this script to count the average number of contexts per sample:

n_samples_true = 0
n_paths_true = 0
with open("java-small/java-small.train.c2s") as data_file:
    for line in data_file:
        name, *context = line.strip().split(' ')
        n_samples_true += 1
        n_paths_true += len(context)

After that, n_paths_true / n_samples_true returned 135.21. This is much less than 171, so my question is: am I reading the data incorrectly, or is the wrong archive loaded on S3?

Also, what do you do with samples where the number of paths is less than MAX_CONTEXT?

These questions don't concern the decoder algorithm, so we can close this issue and open a new one about the data.

urialon commented on May 27, 2024

Yes, you're right, I am getting 135 as well.
I am pretty sure that the correct data is loaded on S3.
To verify this, you can train my code on this data and see whether it reaches the results reported in the paper.

Either the 171 value is mistaken, or it is possible that I computed this average before truncating to 1000 paths:

For example, it is possible that originally there were 2000 paths in a given example. Then, during preprocessing, I sampled only 1000 paths to keep in the dataset. Out of those 1000 paths, a different subset of 200 is sampled dynamically during training.

urialon commented on May 27, 2024

When the number of paths is less than MAX_CONTEXT, we simply pad them (append a learned "PAD" vector such that there are 200 paths overall).
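
A minimal sketch of that padding step, assuming the contexts are already encoded into vectors of size DECODER_SIZE (the names here are illustrative):

import torch
import torch.nn as nn

MAX_CONTEXT = 200
DECODER_SIZE = 320

pad_vector = nn.Parameter(torch.zeros(DECODER_SIZE))  # learned "PAD" context vector

def pad_contexts(encoded_contexts):
    # encoded_contexts: [n_real_contexts, DECODER_SIZE], with n_real_contexts <= MAX_CONTEXT
    n_missing = MAX_CONTEXT - encoded_contexts.size(0)
    if n_missing > 0:
        padding = pad_vector.unsqueeze(0).expand(n_missing, -1)
        encoded_contexts = torch.cat([encoded_contexts, padding], dim=0)
    return encoded_contexts  # [MAX_CONTEXT, DECODER_SIZE]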

urialon commented on May 27, 2024

Closing due to inactivity, feel free to re-open.
