
Comments (4)

dosssman commented on July 28, 2024

Greetings.

I would say that "the resampled actions from this part (right hand side)" you have mentioned instead correspond to the following:

# Sample num_random fresh actions (and their log-probabilities) from the current policy at obs
curr_actions_tensor, curr_log_pis = self._get_policy_actions(obs, num_actions=self.num_random, network=self.policy)
# skipped lines
# Evaluate both Q-functions on the current observations paired with the freshly sampled actions
q1_curr_actions = self._get_tensor_values(obs, curr_actions_tensor, network=self.qf1)
q2_curr_actions = self._get_tensor_values(obs, curr_actions_tensor, network=self.qf2)

So it should be current state with new actions.

That would correspond to the lines highlighted above, I think.

Also, from the equation in your comment, the log-sum-exp is computed over multiple actions {a_i}, i \in {1, ..., num_actions}, sampled for a specific state s.
Therefore, if we were to rigorously follow that same equation and compute new_curr_actions_tensor using next_obs, the log-sum-exp should also be taken with respect to those next_obs, I think.

Nevertheless, it would still "work", since the goal of the CQL objective is to minimize the Q-values for known states but "out of distribution" actions. Namely, new_curr_actions_tensor would indeed be "out of distribution" with respect to the states next_obs.
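
To make the per-state aspect concrete, here is a minimal PyTorch sketch (hypothetical tensor names; it only assumes that, as in the snippet above, the Q-values of the num_actions sampled actions are stacked along dimension 1 for each state):

import torch

batch_size, num_actions = 256, 10

# Hypothetical stand-in for e.g. q1_curr_actions: Q(s, a_i) for i = 1..num_actions,
# evaluated for every state s in the batch; shape [batch_size, num_actions].
q_values = torch.randn(batch_size, num_actions)

# The log-sum-exp is taken over the actions sampled for each individual state,
# i.e. along dim=1, giving one value per state in the batch.
logsumexp_per_state = torch.logsumexp(q_values, dim=1)  # shape [batch_size]

If the actions (and their log-probabilities) were instead sampled at next_obs, the state paired with them inside Q(s, a_i) and the state used by the sampling distribution would no longer match, which is the mismatch pointed out above.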


olliejday commented on July 28, 2024

I think this is OK actually.

It is perhaps confusingly named "q1_next_actions", but it seems to be the resampled actions from this part (the right-hand side) of the estimate of the log-sum-exp term (from Appendix F in the paper):

[Image: importance-sampling estimate of the log-sum-exp term, from Appendix F of the CQL paper]
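
For reference, the estimator in that image has roughly the following importance-sampled form (a reconstruction of Appendix F, not a verbatim copy of the image): half of the action samples are drawn uniformly and half from the current policy, each divided by its sampling probability:

\log \sum_{a} \exp Q(s, a) \approx \log \left( \frac{1}{2N} \sum_{a_i \sim \mathrm{Unif}(a)} \frac{\exp Q(s, a_i)}{\mathrm{Unif}(a)} + \frac{1}{2N} \sum_{a_i \sim \pi(a|s)} \frac{\exp Q(s, a_i)}{\pi(a_i|s)} \right)

The right-hand sum, over actions resampled from the policy and weighted by their probabilities, is the part being discussed here.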

So it should be current state with new actions.


aviralkumar2907 commented on July 28, 2024

Sorry for the late reply. It is mathematically correct, since it is just a third set of action samples passed in for computing the logsumexp. In this code version, the log-sum-exp is computed using three terms:

  1. Actions from the current policy. This is what I guess is clear.
  2. Uniform actions.
  3. Actions from the policy at the next state. Note that we can still use these next actions with the current state, since they are just action samples given to us. If we know the probabilities under which these actions were sampled, which is \pi(next_action|next_obs), then the Q-function term should be Q(curr_state, next_action) - \log \pi(next_action|next_obs). This is fine, since we sampled the next actions from the policy at the next state, but we are only using those action samples to compute the log-sum-exp of the Q-function. (See the sketch below this list.)
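
A minimal sketch of how these three sets of action samples could be combined into the logsumexp with the importance weights described above (hypothetical tensor names and shapes; the uniform log-density assumes actions in [-1, 1]^action_dim; this is a sketch of the idea, not the exact code from the repository):

import torch

batch_size, num_sampled, action_dim = 256, 10, 6

# Hypothetical Q-values, all evaluated at the CURRENT observations obs, shape [batch_size, num_sampled]:
#   q_rand          Q(obs, a) for a ~ Uniform over the action space
#   q_curr_actions  Q(obs, a) for a ~ pi(.|obs),      with log-probabilities curr_log_pis
#   q_next_actions  Q(obs, a) for a ~ pi(.|next_obs), with log-probabilities next_log_pis
q_rand = torch.randn(batch_size, num_sampled)
q_curr_actions = torch.randn(batch_size, num_sampled)
q_next_actions = torch.randn(batch_size, num_sampled)
curr_log_pis = torch.randn(batch_size, num_sampled)
next_log_pis = torch.randn(batch_size, num_sampled)

# Log-density of the uniform proposal, assuming actions live in [-1, 1]^action_dim.
log_uniform_density = action_dim * torch.log(torch.tensor(0.5))

# Importance-weight each term: Q(obs, a) minus the log-probability of the distribution
# the action was actually sampled from (uniform, pi(.|obs), or pi(.|next_obs)).
cat_q = torch.cat([
    q_rand - log_uniform_density,
    q_curr_actions - curr_log_pis,
    q_next_actions - next_log_pis,
], dim=1)  # shape [batch_size, 3 * num_sampled]

# One log-sum-exp per state, averaged over the batch; the dataset Q-values subtracted
# in the full CQL regularizer are omitted here.
cql_logsumexp = torch.logsumexp(cat_q, dim=1).mean()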


dosssman commented on July 28, 2024

Thanks for the answer.

