
Comments (7)

zhengliz commented on July 30, 2024

@tocab I agree with you. We are minimizing self.g_loss, which is equivalent to maximizing the whole expression inside -tf.reduce_sum(***). But *** is a product of a log-likelihood, which is < 0, and a reward in the range [0, 1]. Maximizing *** will simply push the reward to 0. Therefore, I believe that even though the loss is decreasing, the model is actually getting less reward, which is the opposite of what we want.
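In code, the loss in question has roughly this shape (a simplified sketch with illustrative names such as log_probs and rewards, not the repository's exact code):

import tensorflow as tf  # TF1-style, as used in the repository

# log_probs: log p_theta(a_t | s_t) of the sampled tokens, shape [batch, seq_len]; always <= 0.
# rewards:   discriminator-derived rewards in [0, 1], same shape, fed in from outside the graph.
log_probs = tf.placeholder(tf.float32, [None, None])
rewards = tf.placeholder(tf.float32, [None, None])

# Minimizing g_loss is the same as maximizing sum(log_probs * rewards).
g_loss = -tf.reduce_sum(log_probs * rewards)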


LantaoYu commented on July 30, 2024

Hi, the reward should be the likelihood of a generated sample being real.

The intuitive explanation is: in adversarial training, given a fixed (optimal) discriminator, the generator always learns to generate samples that can fool the discriminator. This means that if G generates a good example (i.e. the discriminator classifies it as real with high confidence), then G should adjust its parameters to assign this sequence a high density. In RL language, G needs to adjust its parameters to maximize the received reward, i.e. the probability of the generated sequences being real.

Back to your discussion: "Maximizing *** will simply push the reward to 0." It should be noted that when training the generator, the discriminator serves as a fixed environment, and the reward is simply an external signal from that environment, which is not trainable. To be more specific, see this line: the reward for the generator is a placeholder, which is equivalent to a provided constant. So when you optimize G, the reward is fixed and serves as a signal telling you which action is good or bad. For a good action that successfully fools the discriminator, you need to increase its probability in your distribution. Maximizing E [Q(s,a) * log(p_\theta(a|s))] with respect to \theta does exactly this.
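As a toy sketch of this point (illustrative code, not the repository's): the reward enters as a constant, so the only thing gradient descent on -Q(s,a) * log(p_\theta(a|s)) can change is p_\theta, and it raises the probability of an action in proportion to the reward that action received.

import tensorflow as tf  # TF1-style, as used in the repository

# Toy policy over 4 actions, parameterized by logits (theta).
logits = tf.Variable(tf.zeros([4]))
probs = tf.nn.softmax(logits)

action = tf.placeholder(tf.int32, [])    # the sampled action
reward = tf.placeholder(tf.float32, [])  # D's "probability of real": a fixed signal, not trainable

# Single-sample REINFORCE surrogate: minimizing -Q * log p_theta(a|s)
# maximizes E [Q(s,a) * log(p_\theta(a|s))].
loss = -reward * tf.log(tf.gather(probs, action))
train_op = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op, {action: 0, reward: 1.0})  # high reward: p(action 0) increases noticeably
    sess.run(train_op, {action: 1, reward: 0.1})  # low reward: p(action 1) barely moves
    print(sess.run(probs))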


LantaoYu commented on July 30, 2024

@eduOS Thanks for your comment! Let's discuss your points one by one.

First, about your recommended "this answer" and the image: it's just an explanation of how the original GAN works, and I don't see any contradiction here. The insight is that GAN is a good framework for optimizing the symmetric and smooth JS divergence, but only for continuous random variables. So let's find out how to extend it to discrete sequence modeling.

Second, about your recommended tutorial and this quote

Also, as per the paper’s suggestion, it’s better to maximize tf.reduce_mean(tf.log(D_fake)) instead of minimizing tf.reduce_mean(1 - tf.log(D_fake)) in the algorithm above.

It should be noted that there is an error in this part of the tutorial, and hence also in your quote: maximizing tf.reduce_mean(tf.log(D_fake)) is equivalent to minimizing tf.reduce_mean(1 - tf.log(D_fake)) if you drop the reduce_mean operation and the constant 1. And if you look at the original paper, it says:
[screenshot of the passage from the original GAN paper recommending that G be trained to maximize log D(G(z)) rather than to minimize log(1 - D(G(z)))]
Its meaning is "maximizing the likelihood of a fake sample being real is better than minimizing the likelihood of a fake sample being fake", and the reason is that the latter causes vanishing gradients and blocks optimization, nothing else. But the important point is that we are always "maximizing the likelihood of a fake sample being real".
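A small numerical sketch of why the two forms behave differently (illustrative only): both objectives push D_fake toward 1, but when the discriminator confidently rejects a fake (D_fake close to 0), the gradient of log(1 - D_fake) with respect to D_fake stays around -1, while the gradient of -log(D_fake) becomes very large, which is why the paper recommends the latter early in training.

import numpy as np

d_fake = np.array([0.01, 0.5, 0.99])  # D's estimate that a generated sample is real

# Gradient w.r.t. D_fake of the "minimize log(1 - D_fake)" objective: -1 / (1 - D_fake).
grad_saturating = -1.0 / (1.0 - d_fake)  # approx. [-1.01, -2.0, -100.0]

# Gradient w.r.t. D_fake of the "minimize -log(D_fake)" objective: -1 / D_fake.
grad_non_saturating = -1.0 / d_fake      # approx. [-100.0, -2.0, -1.01]

print(grad_saturating)
print(grad_non_saturating)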

Third, about "the RL language part": I don't quite understand what you mean by "penalizing", as it doesn't seem to be "RL language". I think this part is pretty clear: in RL, the most important thing is to specify what the reward is, i.e. which actions are good and which are bad. As I discussed, in GAN, when training G you always want it to generate samples that D thinks are real, so the reward is just the likelihood of a sample being real. Once we agree on this, the rest is just the standard RL policy gradient derivation; I recommend David Silver's slides on Policy Gradient.

Fourth, about "E [Q(s,a) * log(p_\theta(a|s))]" and "both in the paper and in this implementation the model is trying to minimize this". Please look at the code carefully. In this line, we define the loss of G as "-E [Q(s,a) * log(p_\theta(a|s))]", and we are minimizing this loss. So we are minimizing the negative of the expectation, i.e. maximizing E [Q(s,a) * log(p_\theta(a|s))].

Again, thanks for your interest in my work. I do admit that SeqGAN has some limitations, such as high variance. Since it was done two years ago, I also recommend our latest paper and code, which I believe is the state of the art.


kunrenzhilu commented on July 30, 2024

Why?

loss gets minimized, the rewards will be minimized too


luofuli commented on July 30, 2024

@zhengliz @tocab I agree with you two. So has anyone tried using item[0] instead of item[1] for ypred?
More specifically,
change
ypred = np.array([item[1] for item in ypred_for_auc])
to
ypred = np.array([item[0] for item in ypred_for_auc])
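For context, here is what the two index choices would mean, assuming (as the rest of this thread does) a two-class discriminator whose softmax output is ordered [P(fake), P(real)]; the numbers below are made up:

import numpy as np

# Hypothetical softmax outputs of the discriminator for three generated sequences,
# assuming class 0 = fake and class 1 = real.
ypred_for_auc = np.array([[0.9, 0.1],
                          [0.4, 0.6],
                          [0.2, 0.8]])

reward_p_real = np.array([item[1] for item in ypred_for_auc])  # current code: reward = P(real)
reward_p_fake = np.array([item[0] for item in ypred_for_auc])  # proposed change: reward = P(fake)

print(reward_p_real)  # [0.1 0.6 0.8] -- higher for samples D believes are real
print(reward_p_fake)  # [0.9 0.4 0.2]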


eduOS commented on July 30, 2024

What I learned in computer-vision scenarios includes the following:

I found this answer really helpful:
[image: "gan" diagram from the linked answer, explaining how the original GAN works]

And so is this tutorial, which says:

Also, as per the paper’s suggestion, it’s better to maximize tf.reduce_mean(tf.log(D_fake)) instead of minimizing tf.reduce_mean(1 - tf.log(D_fake)) in the algorithm above.

The way I intuit the above is as follows:
I'd like to paraphrase the quote above as: maximizing tf.log(D_fake), which amounts to maximizing the probability of the sample being real, is better than minimizing 1 - tf.log(D_fake), which amounts to minimizing the probability of the sample being fake. From the generator's perspective, either way lets it adjust its parameters to raise the likelihood of the sample being real. That is, if the discriminator judges the sample to be real, the generator has less loss to reduce in TensorFlow (1 - tf.log(D_fake), as mentioned above) and hence smaller gradients, and vice versa.

In this scenario:

I beg to differ and stick by changing item[1] to item[0] as @zhengliz said, which conflicts with what the author @LantaoYu replied. Let me paraphrase and analyse the author's reply:

in adversarial training, given a fixed (optimal) discriminator, the generator always learns to generate samples that can fool the discriminator. This means that if G generates a good example (i.e. the discriminator classifies it as real with high confidence), then G should adjust its parameters to assign this sequence a high density. In RL language, G needs to adjust its parameters to maximize the received reward, i.e. the probability of the generated sequences being real.

If I've correctly grasped the meaning of "in RL language, G needs to adjust its parameters to maximize the received reward, i.e. the probability of the generated sequences being real", it implies that G needs to optimize its trainable variables to get a higher reward (the probability of being real, in this case). But it is hard to be convinced that scaling the loss (the negative log-likelihood from the generator) down less when the reward is larger (which amounts to penalizing the network with a comparatively larger loss than the same network with the same likelihood but a smaller reward in [0, 1]) won't drive the model in the opposite direction. I think it would be more reasonable if the model did the reverse, that is, if there were a positive correlation between the loss and the reward. In this implementation the correlation is negative, that is, the more proper a word (the one with the larger likelihood), the more likely it is to suffer from a bigger reward.

And subsequently the author wrote:

So when you optimize G, the reward is fixed and serves as a signal telling you which action is good or bad. For a good action that successfully fools the discriminator, you need to increase its probability in your distribution. Maximizing E [Q(s,a) * log(p_\theta(a|s))] with respect to \theta does exactly this.

@LantaoYu explained further that "for a good action that successfully fools the discriminator, you need to increase its probability in your distribution", but didn't articulate why "maximizing E [Q(s,a) * log(p_\theta(a|s))] with respect to \theta does exactly this". IIUC, E [Q(s,a) * log(p_\theta(a|s))] stands for the mean of the product of the probability of the sample being real and the original loss (the probability of the generated word corresponding to the target). But both in the paper and in this implementation the model is trying to minimize this, so how come maximizing E [Q(s,a) * log(p_\theta(a|s))] amounts to minimizing objective (2) in the paper? How can I understand "maximize E [Q(s,a) * log(p_\theta(a|s))]" correctly? Does maximizing E [Q(s,a) * log(p_\theta(a|s))] amount to minimizing it?

Please help me figure out anything that is wrong in my reasoning above. Thanks.


eduOS commented on July 30, 2024

@LantaoYu I've realized what I misunderstood: the larger (negative log-likelihood * reward) is, the larger the gradients, and the better the parameters are optimized. I had misinterpreted the combined loss.

