aviralkumar2907 / CQL
Code for conservative Q-learning
Hello @aviralkumar2907, thanks for sharing the CQL code.
For the MuJoCo script, I've noticed that the seed value is not actually used anywhere in the script.
https://github.com/aviralkumar2907/CQL/blob/master/d4rl/examples/cql_mujoco_new.py
Is this the script actually used to produce numbers for the paper?
If not, could you share the script that was used?
Thank you in advance!
I cannot find the path rlkit/torch/sac/cql.py in the rlkit master branch that I pulled.
Can you let me know which branch you are referring to?
Thanks!
I want to create the conda env from CQL/d4rl/environment/linux-gpu-env.yml.
I ran this command:
conda env create -f linux-gpu-env.yml
and got this error:
ResolvePackageNotFound:
This is a question regarding how CQL(rho) works in terms of the code.
In the CQL section (starting from line 235) within CQL/d4rl/rlkit/torch/sac/cql.py, we first compute:
cat_q1 = torch.cat(
[q1_rand, q1_pred.unsqueeze(1), q1_next_actions, q1_curr_actions], 1
)
cat_q2 = torch.cat(
[q2_rand, q2_pred.unsqueeze(1), q2_next_actions, q2_curr_actions], 1
)
and then used them to compute
min_qf1_loss = torch.logsumexp(cat_q1 / self.temp, dim=1,).mean() * self.min_q_weight * self.temp
min_qf2_loss = torch.logsumexp(cat_q2 / self.temp, dim=1,).mean() * self.min_q_weight * self.temp
I'm a bit confused about why the Q values of actions drawn from three distinct distributions can be used to compute this quantity:
- q1_rand: uniform distribution
- q1_pred: dataset distribution
- q1_curr_actions and q1_next_actions: last-iteration policy
Here are my questions:
I'm able to completely understand how CQL(H) works in the codebase though.
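For anyone else reading this issue, here is a toy sketch (made-up shapes and values, not the repo's code) of what the concatenation and logsumexp above compute: every column of cat_q1 is simply Q(s, a) for some sampled action a, regardless of which proposal distribution produced it, and the logsumexp is taken over that pooled set of samples for each state.

```python
# Toy sketch with hypothetical shapes; only the tensor mechanics mirror the
# snippet quoted above, nothing here is taken verbatim from cql.py.
import torch

batch, num_random = 4, 10
q1_rand         = torch.randn(batch, num_random)   # Q(s, a) for a ~ Uniform(-1, 1)
q1_pred         = torch.randn(batch)                # Q(s, a) for a from the dataset batch
q1_curr_actions = torch.randn(batch, num_random)   # Q(s, a) for a ~ pi(.|s)
q1_next_actions = torch.randn(batch, num_random)   # Q(s, a) for a ~ pi(.|s')

cat_q1 = torch.cat(
    [q1_rand, q1_pred.unsqueeze(1), q1_next_actions, q1_curr_actions], 1
)  # shape: (batch, 3 * num_random + 1)

temp, min_q_weight = 1.0, 5.0
min_qf1_loss = torch.logsumexp(cat_q1 / temp, dim=1).mean() * min_q_weight * temp
print(cat_q1.shape, min_qf1_loss.item())
```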
When I tried to run d4rl MuJoCo benchmark, this happens.
It looks like there is a version discrepancy between the rlkit library in the published code and the one actually used for the paper.
If updating the code is too much trouble, could you point me to which rlkit version was used in your experiments, or tell me where I can download the rlkit library you used?
cql_mujoco_new.py:124: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if (gpu_str is not ""):
No personal conf_private.py found.
doodad not detected
Traceback (most recent call last):
  File "cql_mujoco_new.py", line 6, in <module>
    from rlkit.torch.sac.policies import TanhGaussianPolicy, MakeDeterministic
  File "/home/hsinyu/rlkit/rlkit/torch/sac/policies/__init__.py", line 1, in <module>
    from rlkit.torch.sac.policies.base import (
  File "/home/hsinyu/rlkit/rlkit/torch/sac/policies/base.py", line 11, in <module>
    from rlkit.torch.core import torch_ify, elem_or_tuple_to_numpy
ImportError: cannot import name 'elem_or_tuple_to_numpy' from 'rlkit.torch.core' (/home/hsinyu/rlkit/rlkit/torch/core.py)
Hi! I was unable to reproduce the results on kitchen-mixed-v0 using the same hyperparameters as in the D4RL MuJoCo tasks. Could you please provide the configurations for Kitchen?
Hey,
unlike in the paper, the implementation has this part where the log-probabilities of the actions are subtracted from Q:
CQL/d4rl/rlkit/torch/sac/cql.py
Line 253 in d67dbe9
My guess is that the effect would be to put less of the loss's focus on a single high-Q action, should the policy concentrate on one. But then we already have the temperature parameter. I'm not sure the author will answer, so if anybody knows, I'd appreciate your insights :)
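In case it helps other readers: my understanding (an assumption on my part, not confirmed by the author) is that subtracting the log-probabilities turns the sampled Q values into an importance-sampling estimate of the logsumexp over the whole action space, since log E_{a~p}[exp(Q(a))/p(a)] equals the log of the integral of exp(Q(a)). The toy script below checks that numerically in 1D; every name in it is hypothetical.

```python
# Numerical check of the importance-sampling reading of "Q minus log-prob".
# Everything here is a made-up 1-D toy, not code from the repository.
import math
import torch

torch.manual_seed(0)

# Toy Q function: Q(a) = -a^2, so the integral of exp(Q(a)) da over the real
# line is sqrt(pi) and the exact "log-integral-exp" is 0.5 * log(pi).
exact = 0.5 * math.log(math.pi)

# Proposal distribution pi(a) = N(0, 1), playing the role of the policy.
proposal = torch.distributions.Normal(0.0, 1.0)
n = 200_000
a = proposal.sample((n,))
q = -a ** 2
log_pi = proposal.log_prob(a)

# log (1/n) * sum_i exp(Q(a_i) - log pi(a_i)) is the IS estimate of
# log integral exp(Q(a)) da.
estimate = torch.logsumexp(q - log_pi, dim=0) - math.log(n)

print(f"exact: {exact:.4f}, importance-sampled estimate: {estimate.item():.4f}")
```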
In code "cql_mujoco_new.py", define expl_path_collector
https://github.com/aviralkumar2907/CQL/blob/master/d4rl/examples/cql_mujoco_new.py#L65
In "batch_rl_algorithm.py", use the expl_path_collector.collect_new_paths, when use the expl_path_collector.collect_new_paths(), should give the policy_fn args.
https://github.com/aviralkumar2907/CQL/blob/d67dbe9cf5d2b96e3b462b6146f249b3d6569796/d4rl/rlkit/core/batch_rl_algorithm.py#L110
Greetings.
Thank you for your amazing work on Offline RL, as well as for open-sourcing the code.
This issue pertains to the computation of the lower-bounding component of the SAC-based CQL:
## add CQL
random_actions_tensor = torch.FloatTensor(q2_pred.shape[0] * self.num_random, actions.shape[-1]).uniform_(-1, 1) # .cuda()
curr_actions_tensor, curr_log_pis = self._get_policy_actions(obs, num_actions=self.num_random, network=self.policy)
new_curr_actions_tensor, new_log_pis = self._get_policy_actions(next_obs, num_actions=self.num_random, network=self.policy)
q1_rand = self._get_tensor_values(obs, random_actions_tensor, network=self.qf1)
q2_rand = self._get_tensor_values(obs, random_actions_tensor, network=self.qf2)
q1_curr_actions = self._get_tensor_values(obs, curr_actions_tensor, network=self.qf1)
q2_curr_actions = self._get_tensor_values(obs, curr_actions_tensor, network=self.qf2)
q1_next_actions = self._get_tensor_values(obs, new_curr_actions_tensor, network=self.qf1)
q2_next_actions = self._get_tensor_values(obs, new_curr_actions_tensor, network=self.qf2)
Namely, at line 236, the actions new_curr_actions_tensor of the policy for the next states in the batch, next_obs, are computed by feeding the latter to the policy.
CQL/d4rl/rlkit/torch/sac/cql.py
Line 236 in d67dbe9
When computing the corresponding Q value, however, the new_curr_actions_tensor are fed to the Q networks with what seems to be the observations at the current time step, obs:
CQL/d4rl/rlkit/torch/sac/cql.py
Line 241 in d67dbe9
Shouldn't it be next_obs instead of obs at lines 241 and 242?
Or is there a specific reason we might want to use actions sampled for the next states to compute the Q values for the current batch of observations (states)?
(Sampling "incorrect" actions with respect to the current observations (states) on purpose?)
Thank you for your time, and sorry for the inconvenience.
Hi Aviral,
Thanks for sharing, but when I tried to run the Atari experiment, it seems the config file is not in the repository.
Two example files are provided for the D4RL experiments, cql_mujoco_new.py and cql_antmaze_new.py, and I am a little confused about them.
In cql_mujoco_new.py, gym is used to create an environment.
In cql_antmaze_new.py, a HalfCheetah-v2 instance is used. Should we use an antmaze instance there and the MuJoCo environment in cql_mujoco_new.py?
Hi Aviral,
when I tried to run the 'train.py' script for the Atari games, I noticed an error when creating the quantile agent in 'atari/batch_rl/multi_head/quantile_agent.py': in line 117 you are passing the argument "minq_weight" into the init function of the Rainbow agent, but this class has no "minq_weight" argument. After I deleted the argument, the code ran perfectly.
Best,
Timo
In the CQL trainer, the policy_loss is formulated before the QF_Loss, but the QF_Loss backpropagates through the policy network before policy_loss does, which causes a Torch error. Is the intended usage to optimize the policy network on the policy_loss before formulating the QF_Loss (and still optimize the policy using the QF_Loss), or to not reparametrize the policy output when formulating the QF_Loss (e.g., line 201)?
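For whoever hits the same Torch error, here is a minimal sketch of the two orderings described in the question. All module and variable names are hypothetical, and this only illustrates the autograd ordering, not the repo's actual trainer.

```python
# Hypothetical minimal setup: a linear "policy" and a linear "Q-function".
import torch
import torch.nn as nn

obs_dim, act_dim = 3, 1
policy = nn.Linear(obs_dim, act_dim)
qf = nn.Linear(obs_dim + act_dim, 1)
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
qf_opt = torch.optim.Adam(qf.parameters(), lr=1e-3)

obs = torch.randn(8, obs_dim)
target_q = torch.randn(8, 1)

# Option 1: optimize the policy on policy_loss *before* formulating the QF loss.
new_actions = policy(obs)  # reparametrized actions
policy_loss = -qf(torch.cat([obs, new_actions], dim=-1)).mean()
policy_opt.zero_grad()
policy_loss.backward()
policy_opt.step()

# Option 2 (alternative): keep the original ordering, but detach the
# reparametrized policy output inside the QF loss so that qf_loss.backward()
# never touches the policy's graph.
q_pred = qf(torch.cat([obs, new_actions.detach()], dim=-1))
qf_loss = ((q_pred - target_q) ** 2).mean()
qf_opt.zero_grad()
qf_loss.backward()
qf_opt.step()
```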
I chose lagrange_thresh=5, policy_lr=3e-5, min_q_version=2, min_q_weight=1, max_q_backup=False,
but only got -100 on hammer-cloned, 4 on relocate-human, and -18 on relocate-cloned.
The paper reports 730, 14, and -4, respectively. Am I missing other details?
Hi Aviral,
Thanks for sharing your code!
My concern is about the logsumexp calculation in CQL(H) for D4RL. On page 29 of your paper, you describe your technique for computing the logsumexp, which looks fine to me. However, the code seems to differ from it in a number of ways:
Also, a perhaps unrelated question: How did you come up with this way of computing logsumexp? Splitting the sum into two separate expectations (instead of just using one) w.r.t. two different distributions is not so intuitive for me.
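For context, this is how I read the estimator described in the appendix (my paraphrase, so treat it as an assumption rather than a quote): the log-sum-exp over actions is rewritten as a mixture of two importance-sampled expectations, one under a uniform proposal and one under the current policy,

```latex
\log \sum_{a} \exp Q(s, a)
  = \log \left[ \tfrac{1}{2}\,\mathbb{E}_{a \sim \mathrm{Unif}(a)}\!\left[\frac{\exp Q(s,a)}{\mathrm{Unif}(a)}\right]
              + \tfrac{1}{2}\,\mathbb{E}_{a \sim \pi(a\mid s)}\!\left[\frac{\exp Q(s,a)}{\pi(a\mid s)}\right] \right]
  \approx \log \left[ \tfrac{1}{2N} \sum_{a_i \sim \mathrm{Unif}(a)} \frac{\exp Q(s,a_i)}{\mathrm{Unif}(a)}
                    + \tfrac{1}{2N} \sum_{a_i \sim \pi(a\mid s)} \frac{\exp Q(s,a_i)}{\pi(a_i\mid s)} \right].
```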
Hi, would it be possible to release the checkpoints for this implementation? Would be very grateful for this.
As I understand it, if the CQL weight is tuned automatically with the Lagrange mechanism, then in the code "alpha_prime" and "min_q_weight" should be the same thing, right?
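For reference, a rough sketch of how a Lagrange-tuned multiplier usually enters such a penalty (names like log_alpha_prime and target_action_gap are my assumptions, not verified against cql.py): the penalty is scaled by a learned alpha_prime that is itself optimized to keep the penalty near a target gap, whereas min_q_weight is a fixed scale baked into min_qf1_loss in the snippet quoted earlier in this thread; whether the two coincide depends on whether min_q_weight is also applied on the Lagrange path, which I have not verified.

```python
# Hedged sketch of a Lagrange-tuned penalty weight; names like log_alpha_prime
# and target_action_gap are my assumptions, not quoted from cql.py.
import torch

target_action_gap = 10.0
log_alpha_prime = torch.zeros(1, requires_grad=True)
alpha_prime_opt = torch.optim.Adam([log_alpha_prime], lr=1e-2)

# Stand-in for the conservative penalty (in the real code this would come from
# the logsumexp term); a plain tensor here so the toy runs on its own.
min_qf1_loss = torch.tensor(25.0)

for _ in range(100):
    alpha_prime = torch.clamp(log_alpha_prime.exp(), min=0.0, max=1e6)
    # The penalty actually added to the critic loss would be:
    scaled_penalty = alpha_prime * (min_qf1_loss - target_action_gap)
    # Dual step: alpha_prime is trained to *maximize* the scaled penalty,
    # i.e. it grows while the penalty exceeds the target gap.
    alpha_prime_loss = -scaled_penalty
    alpha_prime_opt.zero_grad()
    alpha_prime_loss.backward()
    alpha_prime_opt.step()

print(log_alpha_prime.exp().item())  # grows, since 25.0 > target_action_gap
```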
CQL/d4rl/rlkit/torch/sac/cql.py
Line 241 in d67dbe9
q1_next_actions = self._get_tensor_values(obs, new_curr_actions_tensor, network=self.qf1)
is wrong. It should be:
q1_next_actions = self._get_tensor_values(next_obs, new_curr_actions_tensor, network=self.qf1)
Hi Aviral,
In the paper, you claim CQL can be implemented with fewer than 20 lines of code, but it's really difficult to identify these "20 lines of code" in the current version of your project, which is built on top of other projects. Would you please point out which part of the code exactly corresponds to the core of CQL? I really like the idea of CQL, both the theoretical part and its simplicity, but currently it seems very hard to follow.
Best,
Zhi-Hong
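For what it's worth, as far as I can tell from the snippets quoted earlier in this thread, the conservative addition on top of a standard SAC critic update boils down to something like the sketch below. Variable names follow cql.py loosely, but this is my condensation, not an authoritative extraction of the exact 20 lines.

```python
# A rough condensation (my reading, not verbatim) of the CQL-specific part of
# the critic loss, added on top of an ordinary Bellman error. Shapes hypothetical.
import torch

def cql_critic_penalty(q_rand, q_pred, q_curr_actions, q_next_actions,
                       temp=1.0, min_q_weight=5.0):
    """Conservative penalty: push down a logsumexp over sampled actions,
    push up the Q value of the dataset action.

    q_rand / q_curr_actions / q_next_actions: (batch, num_sampled) Q values for
    actions drawn from a uniform proposal, pi(.|s) and pi(.|s').
    q_pred: (batch,) Q values of the actions that are actually in the dataset.
    """
    cat_q = torch.cat(
        [q_rand, q_pred.unsqueeze(1), q_next_actions, q_curr_actions], dim=1
    )
    push_down = torch.logsumexp(cat_q / temp, dim=1).mean() * min_q_weight * temp
    push_up = q_pred.mean() * min_q_weight
    return push_down - push_up

# Toy usage with random numbers; in a trainer this would be added to the
# Bellman error, e.g. qf_loss = bellman_error + cql_critic_penalty(...).
B, N = 4, 10
penalty = cql_critic_penalty(torch.randn(B, N), torch.randn(B),
                             torch.randn(B, N), torch.randn(B, N))
print(penalty.item())
```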