
evaluation issues about seqgan (closed)

lantaoyu commented on July 30, 2024

from seqgan.

Comments (6)

LantaoYu commented on July 30, 2024

To calculate a BLEU score, you need to provide references (demonstrations), since BLEU is a metric that measures the "similarity" between the evaluated sample and the reference(s). In our case, we use the human demonstrations (i.e. the real poems) as the references for calculating the BLEU score. Rather than putting effort into selecting particular poems as references, we use all the real poems in the test set, since the goal is to generate samples that look as if they come from the real underlying distribution.
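The multi-reference scoring described here can be sketched in a few lines. This is a simplified illustration (clipped n-gram precision up to bigrams, geometric mean, brevity penalty), not the evaluation script used in the repo; `bleu` and `ngrams` are hypothetical helper names:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=2):
    """Simplified BLEU: clipped n-gram precision against a set of
    references, geometric mean over n, times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty against the closest reference length.
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) >= ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A generated sample scored against every test-set poem at once simply passes all of them in `references`, which is the "use the whole test set" setup described above.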


xiaopyyy commented on July 30, 2024

@LantaoYu Thanks for the quick response. But my biggest concern is how to generate the test samples. Specifically, if all the real poems in the test set are used as references, what is the input to the trained generator during the test procedure?


LantaoYu commented on July 30, 2024

In this work, generation is unconditional: we start from an initial state, pick the first token, and then follow the learned policy to sample the rest of the sequence.


xiaopyyy commented on July 30, 2024

@LantaoYu Could you please explain how you set the initial state and how you "arbitrarily" pick the first token? From my point of view, the first token is really important during testing and will strongly influence the evaluation results. Is the BLEU score reported in your paper an average? Over how many samples? Also, did you use word embeddings during training for poem/text generation? Thanks!


LantaoYu commented on July 30, 2024

Some clarification is needed here: "arbitrarily" is not accurate. Note that when training a language model, the first input token is a predefined "start_token", and the label for the "start_token" is the first token of the real sequence. Thus, at test time, the first input token is also the "start_token". As for the initial hidden state, we can set it to zeros, for example, as in the synthetic-data experiment. So after training a language model p(a|s), the learned distribution of the first token is p(a_1 | a_0 = start_token, s_0), and we sample from this distribution. As for BLEU, the reported result is of course an average over a large number of samples, say 100,000. As for word embeddings, we do not use pre-trained embeddings; they are trainable parameters of the model.
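The test-time procedure described above (feed the predefined start_token first, start from an all-zeros hidden state, then sample each token from the learned distribution) can be sketched as follows. Here `step` and `toy_step` are hypothetical stand-ins for one step of the trained generator, not the repo's actual API:

```python
import random

def sample_sequence(step, start_token, seq_len, hidden_size, seed=None):
    """Unconditional sampling: the first input is always start_token, the
    initial hidden state is zeros, and each next token is drawn from the
    learned distribution p(a_t | a_<t, s). `step` is a hypothetical
    callable (token, hidden) -> (probs, hidden) wrapping the generator."""
    rng = random.Random(seed)
    hidden = [0.0] * hidden_size   # zero initial state, as in the synthetic-data experiment
    token = start_token            # first input token is the predefined start_token
    sequence = []
    for _ in range(seq_len):
        probs, hidden = step(token, hidden)
        # Sample the next token from the categorical distribution over the vocabulary.
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        sequence.append(token)
    return sequence

# Toy deterministic "policy" for illustration: all probability mass on token 1.
def toy_step(token, hidden):
    return [0.0, 1.0, 0.0], hidden
```

Averaging a metric such as BLEU over many calls to `sample_sequence` (e.g. 100,000 samples) gives the kind of averaged score described above.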


sh0416 commented on July 30, 2024

@LantaoYu I just read your paper, and it is really exciting. I am also curious about your experimental setting; the ambiguous part is the train-validation-test split. The paper uses "a collection of 11092 paragraphs from Obama's political speeches", but it is unclear how the dataset is split.
Did you use all of the Obama speeches as the training set and introduce an additional corpus as the test set? Or did you split the Obama speech dataset into three parts, e.g. 8000 paragraphs for training, 1000 for validation, and the remainder for test?
Thanks in advance :)
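For illustration only, the 8000/1000/remainder scheme the question proposes might look like this. This is a hypothetical split, not the one actually used in the paper, and `split_dataset` is an invented helper name:

```python
import random

def split_dataset(paragraphs, n_train=8000, n_valid=1000, seed=0):
    """Hypothetical three-way split of the kind the question describes:
    shuffle, then take n_train for training, n_valid for validation,
    and keep the remainder as the test set."""
    items = list(paragraphs)
    random.Random(seed).shuffle(items)
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]
    return train, valid, test
```

With the 11092 paragraphs mentioned above, this scheme would leave 2092 paragraphs for the test set.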

