In some of my experiments I sometimes get NaN parameters when training using TRPO and


Conjugate Gradient Optimization sometimes fails (with NaN parameters) (rllab, closed)

rll commented on September 25, 2024

Conjugate Gradient Optimization sometimes fails (with NaN parameters)

from rllab.

Comments (7)

dementrock commented on September 25, 2024

Hi @alexbeloi, the step size is computed according to the TRPO paper: https://arxiv.org/pdf/1502.05477v4.pdf. You can find the formula in Appendix C.

How negative is the computed value of descent_direction.dot(Hx(descent_direction)), and can you describe your setup in more detail? This can happen if a bug in the code makes the mean KL nonzero (or not sufficiently close to zero) before the step is taken. We've also observed it occasionally with recurrent networks, although adjusting the nonlinearity seems to have solved it.
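For reference, the Appendix C step size is sqrt(2 * delta / (s^T H s)), where s is the descent direction and delta is the KL bound, so a negative s^T H s puts a negative number under the square root. Below is a minimal sketch of that check in plain Python. It is not rllab's code: initial_step_size, dot, and matvec are hypothetical names, the toy matrices stand in for the Hessian-vector product Hx, and the eps term is an assumed numerical safeguard.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix):
    """Return a function v -> matrix @ v, standing in for Hx (assumption)."""
    return lambda v: [dot(row, v) for row in matrix]


def initial_step_size(direction, hvp, max_kl=0.01, eps=1e-8):
    """Step size sqrt(2 * max_kl / (s^T H s)) for descent direction s.

    Raises ValueError instead of silently producing NaN when the
    curvature s^T H s is not a positive finite number.
    """
    curvature = dot(direction, hvp(direction))
    if not math.isfinite(curvature) or curvature <= 0.0:
        raise ValueError(
            "s^T H s = %r is not positive; the quadratic KL model is invalid"
            % curvature
        )
    return math.sqrt(2.0 * max_kl / (curvature + eps))


s = [1.0, 1.0]
# Positive-definite toy Hessian: s^T H s = 3, so the step is well defined.
step = initial_step_size(s, matvec([[2.0, 0.0], [0.0, 1.0]]))
```

Failing loudly at this point makes it easier to trace the problem back to its cause, such as a nonzero KL before the step, than finding NaN parameters several iterations later.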


alexbeloi commented on September 25, 2024

Hi @dementrock, it appears that mean KL is nonzero before taking the step because of something I'm doing. This issue came up when debugging the ISSampler with TRPO.

What I'm doing is taking (off-policy) stored paths, computing the agent_infos for those paths with respect to the current policy using _, agent_infos = policy.get_action(observations), and then those agent_infos get passed to old_dist_info_vars_list in the optimizer.

What I expected was that the on-policy agent_infos that I computed would be identical to the dist_info_vars = policy.dist_info_sym(obs_var, state_info_vars) evaluated by the optimizer before taking the step, so kl = dist.kl_sym(old_dist_info_vars, dist_info_vars) would be zero before the step, but this isn't the case.
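The expectation above can be checked numerically outside the optimizer. The sketch below assumes a diagonal Gaussian policy parameterized by a mean and a log standard deviation. diag_gaussian_kl is a hypothetical helper written for illustration, not rllab's kl_sym.

```python
import math


def diag_gaussian_kl(old_mean, old_log_std, new_mean, new_log_std):
    """KL(old || new) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for m0, ls0, m1, ls1 in zip(old_mean, old_log_std, new_mean, new_log_std):
        var0 = math.exp(2.0 * ls0)
        var1 = math.exp(2.0 * ls1)
        total += ls1 - ls0 + (var0 + (m0 - m1) ** 2) / (2.0 * var1) - 0.5
    return total


mean, log_std = [0.3, -1.2], [0.0, -0.5]

# Identical distribution parameters: the KL is zero before any step is taken.
kl_same = diag_gaussian_kl(mean, log_std, mean, log_std)

# Stale "old" parameters (here the mean is shifted by 1 in one dimension
# with unit variance) give a nonzero KL before the step.
kl_stale = diag_gaussian_kl([0.0], [0.0], [1.0], [0.0])
```

If the value computed from the stored agent_infos and the freshly evaluated distribution parameters is not close to zero, the two are not describing the same policy.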

Is there a difference between agent_info computed from _, agent_infos = policy.get_action(observations) and the evaluation of dist_info_vars = policy.dist_info_sym(obs_var, state_info_vars) for obs_var evaluated at observations?


alexbeloi commented on September 25, 2024

I feel there is some confusion on my part. Where does the NPO algorithm get values for old_dist_info_vars and dist_info_vars from?


alexbeloi commented on September 25, 2024

Oh wow, super silly bug on my part. The last line of is_sampler.py should be return samples, not return paths. This was the root of the issue.


dementrock commented on September 25, 2024

@alexbeloi Re difference between agent_infos and evaluating dist_info_vars: agent_infos may contain more entries than dist_info_vars, but for the common keys their values should be the same. Otherwise there is a bug somewhere.
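One way to test this is to compare only the keys the two dictionaries share. The sketch below assumes each value is a flat list of floats. The helper mismatched_keys, the key names, and the tolerance are hypothetical and chosen for illustration only.

```python
def mismatched_keys(agent_infos, dist_info, tol=1e-6):
    """Return the shared keys whose values differ by more than tol.

    Keys present in only one of the two dictionaries are ignored, since
    agent_infos may carry extra entries.
    """
    bad = []
    for key in sorted(set(agent_infos) & set(dist_info)):
        a, b = agent_infos[key], dist_info[key]
        if len(a) != len(b) or any(abs(x - y) > tol for x, y in zip(a, b)):
            bad.append(key)
    return bad


# Illustrative values only; "prev_action" plays the role of an extra entry.
agent_infos = {"mean": [0.1, -0.2], "log_std": [0.0, 0.0], "prev_action": [1.0, 0.0]}
dist_info = {"mean": [0.1, -0.2], "log_std": [0.0, 0.0]}

consistent = mismatched_keys(agent_infos, dist_info)
inconsistent = mismatched_keys(agent_infos, {"mean": [0.4, -0.2], "log_std": [0.0, 0.0]})
```

An empty result means the shared entries agree. Any key in the returned list points to where the two code paths diverge.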

Does replacing return paths with return samples solve the NaN issue?


alexbeloi commented on September 25, 2024

@dementrock yes, that one-line fix solves the NaN issue. I made a pull request with the patch and a (now working) example of TRPO with ISSampler.


dementrock commented on September 25, 2024

Awesome, thanks!

