
Comments (4)

dosssman commented on July 28, 2024

Greetings.

I would say that "the resampled actions from this part (right hand side)" you have mentioned instead correspond to the following:

# Sample num_random fresh actions (and their log-probabilities) from the current policy at obs
curr_actions_tensor, curr_log_pis = self._get_policy_actions(obs, num_actions=self.num_random, network=self.policy)
# skipped lines
# Evaluate both Q-functions on the current observations paired with the freshly sampled actions
q1_curr_actions = self._get_tensor_values(obs, curr_actions_tensor, network=self.qf1)
q2_curr_actions = self._get_tensor_values(obs, curr_actions_tensor, network=self.qf2)

So it should be current state with new actions.

That would correspond to the lines highlighted above, I think.

Also, from the equation in your comment, the log-sum-exp is computed over multiple actions {a_i}, i \in {1, ..., num_actions}, sampled for a specific state s.
Therefore, if we were to rigorously follow that same equation and compute new_curr_actions_tensor using next_obs, the log-sum-exp should also be taken with respect to those next_obs, I think.

Nevertheless, it would still "work", since the goal of the CQL objective is to minimize the Q-values for known states but "out of distribution" actions. Namely, new_curr_actions_tensor would indeed be "out of distribution" with respect to the states next_obs.
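
To make the per-state aspect concrete, here is a minimal PyTorch sketch (hypothetical tensor names; it only assumes that, as in the snippet above, the Q-values of the num_actions sampled actions are stacked along dimension 1 for each state):

import torch

batch_size, num_actions = 256, 10

# Hypothetical stand-in for e.g. q1_curr_actions: Q(s, a_i) for i = 1..num_actions,
# evaluated for every state s in the batch; shape [batch_size, num_actions].
q_values = torch.randn(batch_size, num_actions)

# The log-sum-exp is taken over the actions sampled for each individual state,
# i.e. along dim=1, giving one value per state in the batch.
logsumexp_per_state = torch.logsumexp(q_values, dim=1)  # shape [batch_size]

If the actions (and their log-probabilities) were instead sampled at next_obs, the state paired with them inside Q(s, a_i) and the state used by the sampling distribution would no longer match, which is the mismatch pointed out above.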


olliejday commented on July 28, 2024

I think this is OK actually.

It is perhaps confusingly named "q1_next_actions", but it seems to be the resampled actions from this part (the right-hand side) of the estimate of the log-sum-exp term (from Appendix F in the paper):

[Image: importance-sampling estimate of the log-sum-exp term, from Appendix F of the CQL paper]
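
For reference, the estimator in that image has roughly the following importance-sampled form (a reconstruction of Appendix F, not a verbatim copy of the image): half of the action samples are drawn uniformly and half from the current policy, each divided by its sampling probability:

\log \sum_{a} \exp Q(s, a) \approx \log \left( \frac{1}{2N} \sum_{a_i \sim \mathrm{Unif}(a)} \frac{\exp Q(s, a_i)}{\mathrm{Unif}(a)} + \frac{1}{2N} \sum_{a_i \sim \pi(a|s)} \frac{\exp Q(s, a_i)}{\pi(a_i|s)} \right)

The right-hand sum, over actions resampled from the policy and weighted by their probabilities, is the part being discussed here.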

So it should be current state with new actions.


aviralkumar2907 commented on July 28, 2024

Sorry for the late reply. It is mathematically correct, since it is just a third set of action samples passed in for computing the logsumexp. In this code version, the log-sum-exp is computed using three terms:

  1. Actions from the current policy. This is what I guess is clear.
  2. Uniform actions.
  3. Actions from the policy at the next state. Note that we can still use these next actions with the current state, since they are just action samples given to us. If we know the probabilities under which these actions were sampled, which is \pi(next_action|next_obs), then the Q-function term should be Q(curr_state, next_action) - \log \pi(next_action|next_obs). This is fine, since we sampled the next actions from the policy at the next state, but we are only using those action samples to compute the log-sum-exp of the Q-function. (See the sketch below this list.)
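
A minimal sketch of how these three sets of action samples could be combined into the logsumexp with the importance weights described above (hypothetical tensor names and shapes; the uniform log-density assumes actions in [-1, 1]^action_dim; this is a sketch of the idea, not the exact code from the repository):

import torch

batch_size, num_sampled, action_dim = 256, 10, 6

# Hypothetical Q-values, all evaluated at the CURRENT observations obs, shape [batch_size, num_sampled]:
#   q_rand          Q(obs, a) for a ~ Uniform over the action space
#   q_curr_actions  Q(obs, a) for a ~ pi(.|obs),      with log-probabilities curr_log_pis
#   q_next_actions  Q(obs, a) for a ~ pi(.|next_obs), with log-probabilities next_log_pis
q_rand = torch.randn(batch_size, num_sampled)
q_curr_actions = torch.randn(batch_size, num_sampled)
q_next_actions = torch.randn(batch_size, num_sampled)
curr_log_pis = torch.randn(batch_size, num_sampled)
next_log_pis = torch.randn(batch_size, num_sampled)

# Log-density of the uniform proposal, assuming actions live in [-1, 1]^action_dim.
log_uniform_density = action_dim * torch.log(torch.tensor(0.5))

# Importance-weight each term: Q(obs, a) minus the log-probability of the distribution
# the action was actually sampled from (uniform, pi(.|obs), or pi(.|next_obs)).
cat_q = torch.cat([
    q_rand - log_uniform_density,
    q_curr_actions - curr_log_pis,
    q_next_actions - next_log_pis,
], dim=1)  # shape [batch_size, 3 * num_sampled]

# One log-sum-exp per state, averaged over the batch; the dataset Q-values subtracted
# in the full CQL regularizer are omitted here.
cql_logsumexp = torch.logsumexp(cat_q, dim=1).mean()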


dosssman commented on July 28, 2024

Thanks for the answer.

