
Comments (7)

zhengliz commented on July 30, 2024

@tocab I agree with you. We are minimizing self.g_loss, which is equivalent to maximizing the whole expression inside -tf.reduce_sum(***). But *** is a product of a log-likelihood, which is < 0, and a reward in the range [0, 1]. Maximizing *** will simply push the reward to 0. Therefore, I believe that even though the loss is decreasing, the model is actually getting less reward, which is the opposite of what we want.
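In code, the loss in question has roughly this shape (a simplified sketch with illustrative names such as log_probs and rewards, not the repository's exact code):

import tensorflow as tf  # TF1-style, as used in the repository

# log_probs: log p_theta(a_t | s_t) of the sampled tokens, shape [batch, seq_len]; always <= 0.
# rewards:   discriminator-derived rewards in [0, 1], same shape, fed in from outside the graph.
log_probs = tf.placeholder(tf.float32, [None, None])
rewards = tf.placeholder(tf.float32, [None, None])

# Minimizing g_loss is the same as maximizing sum(log_probs * rewards).
g_loss = -tf.reduce_sum(log_probs * rewards)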


LantaoYu commented on July 30, 2024

Hi, the reward should be the likelihood of a generated sample being real.

The intuitive explanation is: in adversarial training, given a fixed (optimal) discriminator, the generator always learns to generate samples that can fool the discriminator. This means that if G generates a good example (i.e. the discriminator classifies it as real with high confidence), then G should adjust its parameters to assign this sequence a high density. In RL language, G needs to adjust its parameters to maximize the received reward, i.e. the probability of the generated sequences being real.

Back to your discussion: "Maximizing *** will simply push the reward to 0." It should be noted that when training the generator, the discriminator serves as a fixed environment, and the reward is simply an external signal from that environment, which is not trainable. To be more specific, see this line: the reward for the generator is a placeholder, which is equivalent to a provided constant. So when you optimize G, the reward is fixed and serves as a signal telling you which action is good or bad. For a good action that successfully fools the discriminator, you need to increase its probability in your distribution. Maximizing E [Q(s,a) * log(p_\theta(a|s))] with respect to \theta does exactly this.
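As a toy sketch of this point (illustrative code, not the repository's): the reward enters as a constant, so the only thing gradient descent on -Q(s,a) * log(p_\theta(a|s)) can change is p_\theta, and it raises the probability of an action in proportion to the reward that action received.

import tensorflow as tf  # TF1-style, as used in the repository

# Toy policy over 4 actions, parameterized by logits (theta).
logits = tf.Variable(tf.zeros([4]))
probs = tf.nn.softmax(logits)

action = tf.placeholder(tf.int32, [])    # the sampled action
reward = tf.placeholder(tf.float32, [])  # D's "probability of real": a fixed signal, not trainable

# Single-sample REINFORCE surrogate: minimizing -Q * log p_theta(a|s)
# maximizes E [Q(s,a) * log(p_\theta(a|s))].
loss = -reward * tf.log(tf.gather(probs, action))
train_op = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op, {action: 0, reward: 1.0})  # high reward: p(action 0) increases noticeably
    sess.run(train_op, {action: 1, reward: 0.1})  # low reward: p(action 1) barely moves
    print(sess.run(probs))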


LantaoYu commented on July 30, 2024

@eduOS Thanks for your comment! Let's discuss your points one by one.

First, about your recommended "this answer" and the image: it's just an explanation of how the original GAN works, and I don't see any contradiction here. The insight is that GAN is a good framework for optimizing the symmetric and smooth JS divergence, but only for continuous random variables. So let's find out how to extend it to discrete sequence modeling.

Second, about your recommended tutorial and this quote

Also, as per the paper’s suggestion, it’s better to maximize tf.reduce_mean(tf.log(D_fake)) instead of minimizing tf.reduce_mean(1 - tf.log(D_fake)) in the algorithm above.

It should be noted that there is an error in this part of the tutorial, and hence also in your quote: maximizing tf.reduce_mean(tf.log(D_fake)) is equivalent to minimizing tf.reduce_mean(1 - tf.log(D_fake)) if you drop the reduce_mean operation and the constant 1. And if you look at the original paper, it says:
[screenshot of the passage from the original GAN paper recommending that G be trained to maximize log D(G(z)) rather than to minimize log(1 - D(G(z)))]
Its meaning is "maximizing the likelihood of a fake sample being real is better than minimizing the likelihood of a fake sample being fake", and the reason is that the latter causes vanishing gradients and blocks optimization, nothing else. But the important point is that we are always "maximizing the likelihood of a fake sample being real".
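A small numerical sketch of why the two forms behave differently (illustrative only): both objectives push D_fake toward 1, but when the discriminator confidently rejects a fake (D_fake close to 0), the gradient of log(1 - D_fake) with respect to D_fake stays around -1, while the gradient of -log(D_fake) becomes very large, which is why the paper recommends the latter early in training.

import numpy as np

d_fake = np.array([0.01, 0.5, 0.99])  # D's estimate that a generated sample is real

# Gradient w.r.t. D_fake of the "minimize log(1 - D_fake)" objective: -1 / (1 - D_fake).
grad_saturating = -1.0 / (1.0 - d_fake)  # approx. [-1.01, -2.0, -100.0]

# Gradient w.r.t. D_fake of the "minimize -log(D_fake)" objective: -1 / D_fake.
grad_non_saturating = -1.0 / d_fake      # approx. [-100.0, -2.0, -1.01]

print(grad_saturating)
print(grad_non_saturating)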

Third, about "the RL language part": I don't quite understand what you mean by "penalizing", as it doesn't seem to be "RL language". I think this part is pretty clear: in RL, the most important thing is to specify what the reward is, i.e. which actions are good and which are bad. As I discussed, in GAN, when training G you always want it to generate samples that D thinks are real, so the reward is just the likelihood of a sample being real. Once we agree on this, the rest is just the standard RL policy gradient derivation; I recommend David Silver's slides on Policy Gradient.

Fourth, about "E [Q(s,a) * log(p_\theta(a|s))]" and "both in the paper and in this implementation the model is trying to minimize this". Please look at the code carefully. In this line, we define the loss of G as "-E [Q(s,a) * log(p_\theta(a|s))]", and we are minimizing this loss. So we are minimizing the negative of the expectation, i.e. maximizing E [Q(s,a) * log(p_\theta(a|s))].

Again, thanks for your interest in my work. I do admit that SeqGAN has some limitations, such as high variance. Since it was done two years ago, I also recommend our latest paper and code, which I believe is the state of the art.


kunrenzhilu commented on July 30, 2024

Why?

loss gets minimized, the rewards will be minimized too


luofuli commented on July 30, 2024

@zhengliz @tocab I agree with you two. So has anyone tried using item[0] instead of item[1] for ypred?
More specifically,
change
ypred = np.array([item[1] for item in ypred_for_auc])
to
ypred = np.array([item[0] for item in ypred_for_auc])
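For context, here is what the two index choices would mean, assuming (as the rest of this thread does) a two-class discriminator whose softmax output is ordered [P(fake), P(real)]; the numbers below are made up:

import numpy as np

# Hypothetical softmax outputs of the discriminator for three generated sequences,
# assuming class 0 = fake and class 1 = real.
ypred_for_auc = np.array([[0.9, 0.1],
                          [0.4, 0.6],
                          [0.2, 0.8]])

reward_p_real = np.array([item[1] for item in ypred_for_auc])  # current code: reward = P(real)
reward_p_fake = np.array([item[0] for item in ypred_for_auc])  # proposed change: reward = P(fake)

print(reward_p_real)  # [0.1 0.6 0.8] -- higher for samples D believes are real
print(reward_p_fake)  # [0.9 0.4 0.2]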


eduOS commented on July 30, 2024

What I learned in computer-vision scenarios includes the following:

I found this answer really helpful:
[image: "gan" diagram from the linked answer, explaining how the original GAN works]

And so is this tutorial, which says:

Also, as per the paper’s suggestion, it’s better to maximize tf.reduce_mean(tf.log(D_fake)) instead of minimizing tf.reduce_mean(1 - tf.log(D_fake)) in the algorithm above.

The way I intuit the above is as follows:
I'd like to paraphrase the quote above as: maximizing tf.log(D_fake), which amounts to maximizing the probability of the sample being real, is better than minimizing 1 - tf.log(D_fake), which amounts to minimizing the probability of the sample being fake. From the generator's perspective, either way lets it adjust its parameters to raise the likelihood of the sample being real. That is, if the discriminator judges the sample to be real, the generator has less loss to reduce in TensorFlow (1 - tf.log(D_fake), as mentioned above) and hence smaller gradients, and vice versa.

In this scenario:

I beg to differ and stick by changing item[1] to item[0] as @zhengliz said, which conflicts with what the author @LantaoYu replied. Let me paraphrase and analyse the author's reply:

in adversarial training, given a fixed (optimal) discriminator, the generator always learns to generate samples that can fool the discriminator. This means that if G generates a good example (i.e. the discriminator classifies it as real with high confidence), then G should adjust its parameters to assign this sequence a high density. In RL language, G needs to adjust its parameters to maximize the received reward, i.e. the probability of the generated sequences being real.

If I've correctly grasped the meaning of "in RL language, G needs to adjust its parameters to maximize the received reward, i.e. the probability of the generated sequences being real", it implies that G needs to optimize its trainable variables to get a higher reward (the probability of being real, in this case). But it is hard to be convinced that scaling the loss (the negative log-likelihood from the generator) down less when the reward is larger (which amounts to penalizing the network with a comparatively larger loss than the same network with the same likelihood but a smaller reward in [0, 1]) won't drive the model in the opposite direction. I think it would be more reasonable if the model did the reverse, that is, if there were a positive correlation between the loss and the reward. In this implementation the correlation is negative, that is, the more proper a word (the one with the larger likelihood), the more likely it is to suffer from a bigger reward.

And subsequently the author wrote:

So when you optimize G, the reward is fixed and serves as a signal telling you which action is good or bad. For a good action that successfully fools the discriminator, you need to increase its probability in your distribution. Maximizing E [Q(s,a) * log(p_\theta(a|s))] with respect to \theta does exactly this.

@LantaoYu explained further that "for a good action that successfully fools the discriminator, you need to increase its probability in your distribution", but didn't articulate why "maximizing E [Q(s,a) * log(p_\theta(a|s))] with respect to \theta does exactly this". IIUC, E [Q(s,a) * log(p_\theta(a|s))] stands for the mean of the product of the probability of the sample being real and the original loss (the probability of the generated word corresponding to the target). But both in the paper and in this implementation the model is trying to minimize this, so how come maximizing E [Q(s,a) * log(p_\theta(a|s))] amounts to minimizing objective (2) in the paper? How can I understand "maximize E [Q(s,a) * log(p_\theta(a|s))]" correctly? Does maximizing E [Q(s,a) * log(p_\theta(a|s))] amount to minimizing it?

Please help me figure out anything that is wrong in my reasoning above. Thanks.


eduOS commented on July 30, 2024

@LantaoYu I've realized what I misunderstood: the larger (negative log-likelihood * reward) is, the larger the gradients, and the better the parameters are optimized. I had misinterpreted the combined loss.

