Giter Site home page Giter Site logo

Comments (2)

cypypccpy avatar cypypccpy commented on August 30, 2024

Hello,

First, let me thank you for open-sourcing this great framework. However, I am unable to run the training without getting the following error:

Traceback (most recent call last):
  File "train.py", line 95, in <module>
    train()
  File "train.py", line 47, in train
    sarl.run(num_learning_iterations=iterations, log_interval=cfg_train["learn"]["save_interval"])
  File "/media/data/users/erez/repos/bi-dexhands/main/bi-dexhands/algorithms/rl/ppo/ppo.py", line 142, in run
    actions, actions_log_prob, values, mu, sigma = self.actor_critic.act(current_obs, current_states)
  File "/media/data/users/erez/repos/bi-dexhands/main/bi-dexhands/algorithms/rl/ppo/module.py", line 77, in act
    distribution = MultivariateNormal(actions_mean, scale_tril=covariance)
  File "/home/ubuntu/miniconda3/envs/bidexhands/lib/python3.7/site-packages/torch/distributions/multivariate_normal.py", line 146, in __init__
    super(MultivariateNormal, self).__init__(batch_shape, event_shape, validate_args=validate_args)
  File "/home/ubuntu/miniconda3/envs/bidexhands/lib/python3.7/site-packages/torch/distributions/distribution.py", line 53, in __init__
    raise ValueError("The parameter {} has invalid values".format(param))
ValueError: The parameter loc has invalid values

I've been trying to train using task=ShadowHandPushBlock, and no matter what I tried (various versions of python/pytorch/cuda/nvidia drivers), this error always pops-up in an undeterminitic manner (i.e. not the same iteration).

The environment I was using, that resulted with the attached traceback:

  • Python 3.7
  • Pytorch 1.8.1
  • Torchvision 0.9.1
  • CudaToolkit 11.1.1

Hello @sheffier ,

Thanks for your report and appreciation! Yeah it's an isaacgym bug, we encountered it when we were using it. We believe that this is a bug caused by the instability of the isaacgym physics engine when dealing with collisions, so that the observations have a nan value, which leads to an error in the neural network training. Details can be found in isaacgym devtalk forum.

Unfortunately, since this is an isaacgym issue, I don't think there is a way to fully fix this bug right now. However, I think that two methods can be taken to deal with it temporarily: 1) When the nan value appears in the obs and reward in the input of the neural network, ignore it to ensure that the training can be performed normally temporarily. 2) Referring to the nvidia official recommendation, adjust the simulation parameters to reduce the probability of this bug. I feel that the following adjustments can reduce this bug better, you can try to see if it helps:

Open the environment parameter file and see the sim part: (such as https://github.com/PKU-MARL/DexterousHands/blob/main/bi-dexhands/cfg/shadow_hand_push_block.yaml)

sim:  
  substeps: 2 
  physx: 
    num_threads: 4 
    solver_type: 1  # 0: pgs, 1: tgs 
    num_position_iterations: 8 
    num_velocity_iterations: 0 
    contact_offset: 0.002 
    rest_offset: 0.0 
    bounce_threshold_velocity: 0.2 
    max_depenetration_velocity: 1000.0 
    default_buffer_size_multiplier: 5.0 
  flex: 
    num_outer_iterations: 5 
    num_inner_iterations: 20 
    warm_start: 0.8 
    relaxation: 0.75 

I found that lowering the num_position_iterations_ parameter helped with this, so adjusted it to below:

sim:
  substeps: 2
  physx:
    num_threads: 4
    solver_type: 1  # 0: pgs, 1: tgs
    num_position_iterations: 4
    num_velocity_iterations: 0
    contact_offset: 0.002
    rest_offset: 0.0
    bounce_threshold_velocity: 0.2
    max_depenetration_velocity: 1000.0
    default_buffer_size_multiplier: 5.0
  flex:
    num_outer_iterations: 5
    num_inner_iterations: 20
    warm_start: 0.8
    relaxation: 0.75

Hope this can help you.

from dexteroushands.

sheffier avatar sheffier commented on August 30, 2024

Thanks for the detailed answer! Much appreciated

from dexteroushands.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.