twni2016 / pomdp-baselines

Simple (but often Strong) Baselines for POMDPs in PyTorch, ICML 2022

Home Page: https://sites.google.com/view/pomdp-baselines

License: MIT License

Python 86.94% Jupyter Notebook 10.66% Shell 2.40%
pomdp recurrent-neural-networks meta-rl robust-rl generalization deep-reinforcement-learning pytorch td3 sac discrete-sac

pomdp-baselines's Introduction

Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs

Welcome to the POMDP world!

This repository provides simple baselines for POMDPs, specifically recurrent model-free RL, evaluated on benchmarks from several subareas of POMDPs (including meta RL, robust RL, generalization in RL, and temporal credit assignment), for the following paper accepted to ICML 2022:

Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs. By Tianwei Ni (Mila, CMU), Benjamin Eysenbach (CMU) and Ruslan Salakhutdinov (CMU).

[arXiv] [project site] [numeric results]

Interested in Transformer Model-Free RL?

Check out our recent work, When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment (NeurIPS 2023 oral), whose code is based on this repository! Transformers are shown to be especially powerful in long-term memory tasks.

Motivation

RL research mostly studies MDPs, so why POMDPs?

While MDPs prevail in RL research, POMDPs prevail in the real world. In many real problems (robotics, healthcare, finance, human interaction), we inevitably face partial observability, e.g., noisy or missing sensors. Can we really observe "states"? Where do "states" come from?

Moreover, in RL research, many problems can be cast as POMDPs: meta RL, robust RL, generalization in RL, and temporal credit assignment. Within a more suitable framework, we can develop better RL algorithms.

Why use recurrent model-free RL for POMDPs? What about other methods?

Deep RL algorithms for POMDPs remain an open research area. Among them, recurrent model-free RL, which has a long history of development, is simple to implement, easy to understand, and trained end-to-end. Nonetheless, there is a popular belief that it performs poorly in practice. This work revisits it and provides guidelines on the design of its key components to make it stronger.

There are many other (more complicated or specialized) methods for POMDPs and their subareas. We show that recurrent model-free RL, if well designed, can often outperform some of these methods on their own benchmarks. It can serve as a strong baseline to incentivize future work.

CHANGE LOG

  • Jul 2022: Moved the code for the compared methods to a new branch.
  • Jun 2022: Cleaned and refactored the code for the camera-ready version.
  • May 2022: This work was accepted to ICML 2022!
  • Mar 2022: Introduced recurrent SAC-discrete for discrete action spaces; see this PR for instructions. As a baseline, it greatly improves sample efficiency over a specialized method (IMPALA+SR) on their long-term credit assignment benchmark.

A Minimal Example to Run Our Implementation

Here we provide a stand-alone minimal example, with the fewest dependencies, for running our implementation of recurrent model-free RL!

It requires only PyTorch and PyBullet: no need to install MuJoCo or Roboschool, and no external configuration file.

Simply open the Jupyter notebook example.ipynb; it contains the training and evaluation procedure on a toy POMDP environment (Pendulum-V). Running the whole process takes less than 20 minutes on a GPU.
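To give a concrete sense of what such a toy POMDP looks like, below is a minimal, self-contained sketch of an observation-masking wrapper in the spirit of envs/pomdp/wrappers.py (the class, environment version, and masked dimensions here are hypothetical, not the notebook's actual code):

import gym
import numpy as np

class MaskObservation(gym.ObservationWrapper):
    """Expose only a subset of the state dimensions, turning an MDP into a POMDP."""

    def __init__(self, env, visible_dims):
        super().__init__(env)
        self.visible_dims = list(visible_dims)
        low = env.observation_space.low[self.visible_dims]
        high = env.observation_space.high[self.visible_dims]
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def observation(self, state):
        return np.asarray(state, dtype=np.float32)[self.visible_dims]

# Pendulum's full state is (cos theta, sin theta, angular velocity); hiding the
# angular velocity yields a partially observable variant of Pendulum.
env = MaskObservation(gym.make("Pendulum-v1"), visible_dims=[0, 1])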

Installation

First, download this repository to <local_path> (preferably on a cluster or a server). We recommend using a virtual environment to install all the dependencies; we provide a YAML file for installation with Miniconda:

conda env create -f environments.yml
conda activate pomdp

The environments.yml file includes all the dependencies (e.g., MuJoCo, PyTorch, PyBullet) used in our experiments (including the compared methods); we use mujoco-py==2.1 because it no longer requires a license.

However, to run the robust RL and generalization in RL experiments, you have to install Roboschool. Since we found it hard to install Roboschool from scratch, we provide a Singularity container image roboschool.sif on Google Drive that contains Roboschool and the other necessary libraries, adapted from the SunBlaze repo.

  • To download and activate the container image with Singularity (tested with v3.7) on a cluster (a single server should be similar):
    # download roboschool.sif from the google drive to envs/rl-generalization/roboschool.sif
    # then run singularity shell
    singularity shell --nv -H <local_path>:/home envs/rl-generalization/roboschool.sif
  • Then you can test it by running import roboschool in a python3 shell.

Run Our Implementation of Recurrent Model-Free RL and the Compared Methods

Benchmarks / Environments

We support several benchmarks in different subareas of POMDPs (see envs/ for details), including

  • "Standard" POMDPs: occlusion benchmark in PyBullet
  • Meta RL: gridworld and MuJoCo benchmark
  • Robust RL: SunBlaze benchmark in Roboschool
  • Generalization in RL: SunBlaze benchmark in Roboschool
  • Temporal credit assignment: delayed rewards with pixel observation and discrete control

Before running any experiments, we suggest planning a sequence of environments based on difficulty level. Since difficulty is hard to analyze and varies from algorithm to algorithm, we provide some rough estimates:

  1. Extremely Simple, as a Sanity Check: Pendulum-V (also shown in our minimal-example Jupyter notebook) and CartPole-V (for discrete action spaces)
  2. Simple, Fast, yet Non-trivial: Wind (requires precise inference and control) and Semi-Circle (sparse reward). Both are continuous gridworlds and thus very fast.
  3. Medium: Cheetah-Vel (1-dim stationary hidden state), *-Robust (2-dim stationary hidden state), *-P (can be roughly inferred as a 2nd-order MDP)
  4. Hard: *-Dir (relatively complicated dynamics), *-V (long-term inference), *-Generalize (extrapolation)

General Form of Commands

We use the .yml files in the configs/ folder for training; for our implementation, the config file can be overwritten with command-line arguments.

To run our implementation, the Markovian policies, and the oracle, simply run the following in <local_path>:

export PYTHONPATH=${PWD}:$PYTHONPATH
python3 policies/main.py --cfg configs/<subarea>/<env_name>/<algo_name>.yml \
  [--env <env_name>  --oracle
   --algo {td3,sac,sacd} --(no)automatic_entropy_tuning --target_entropy <float> --entropy_alpha <float>
   --debug --seed <int> --cuda <int>
  ]

where algo specifies the algorithm name:

  • mlp corresponds to Markovian policies
  • rnn corresponds to our implementation of recurrent model-free RL

We have merged the compared prior methods into this repository; please see the all-methods branch.

For the compared methods, we use their open-sourced implementation with their default hyperparameters.

"Standard" POMDP

{Ant,Cheetah,Hopper,Walker}-{P,V} in the paper, corresponding to configs/pomdp/<ant|cheetah|hopper|walker>_blt/<p|v>, which requires PyBullet. We also provide Pendulum environments for sanity checks.

Take Ant-P as an example:

# Run our implementation
python policies/main.py --cfg configs/pomdp/ant_blt/p/rnn.yml --algo sac
# Run Markovian
python policies/main.py --cfg configs/pomdp/ant_blt/p/mlp.yml --algo sac
# Oracle: we directly use Table 1 results (SAC w/ unstructured row) in https://arxiv.org/abs/2005.05719 as it is well-tuned

We also support recurrent SAC-discrete for POMDPs with discrete action spaces. Take CartPole-V as an example:

python policies/main.py --cfg configs/pomdp/cartpole/v/rnn.yml --target_entropy 0.7

See this PR for detailed instructions and this PR for results on a long-term credit assignment benchmark.
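The exact meaning of --target_entropy is documented in the linked PR. For illustration only, one common SAC-discrete convention (which may differ from this repository's implementation) sets the entropy target to a fraction of the maximum policy entropy log|A|:

import math

num_actions = 2     # e.g., CartPole has two discrete actions
ratio = 0.7         # the value passed via --target_entropy above
target_entropy = ratio * math.log(num_actions)  # fraction of the maximum entropy
print(round(target_entropy, 3))  # 0.485 nats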

Meta RL

{Semi-Circle, Wind, Cheetah-Vel} in the paper, corresponding to configs/meta/<point_robot|wind|cheetah_vel|ant_dir>. Among them, Cheetah-Vel requires MuJoCo, and Semi-Circle can serve as a sanity check. Wind looks simple but is not very easy to solve.

Take Semi-Circle as an example:

# Run our implementation
python policies/main.py --cfg configs/meta/point_robot/rnn.yml --algo td3
# Run Markovian
python policies/main.py --cfg configs/meta/point_robot/mlp.yml --algo sac
# Run Oracle
python policies/main.py --cfg configs/meta/point_robot/mlp.yml --algo sac --oracle

{Ant, Cheetah, Humanoid}-Dir in the paper, corresponding to configs/meta/<ant_dir|cheetah_dir|humanoid_dir>. They require MuJoCo and are hard to solve. Take Ant-Dir as an example:

# Run our implementation
python policies/main.py --cfg configs/meta/ant_dir/rnn.yml --algo sac
# Run Markovian
python policies/main.py --cfg configs/meta/ant_dir/mlp.yml --algo sac
# Run Oracle
python policies/main.py --cfg configs/meta/ant_dir/mlp.yml --algo sac --oracle

Robust RL

These experiments use Roboschool. {Hopper,Walker,Cheetah}-Robust in the paper, corresponding to configs/rmdp/<hopper|walker|cheetah>. First, activate the Roboschool Singularity environment as described in the installation section.

Take Cheetah-Robust as an example:

## In the docker environment:
# Run our implementation
python3 policies/main.py --cfg configs/rmdp/cheetah/rnn.yml --algo td3
# Run Markovian
python3 policies/main.py --cfg configs/rmdp/cheetah/mlp.yml --algo sac
# Run Oracle
python3 policies/main.py --cfg configs/rmdp/cheetah/mlp.yml --algo sac --oracle

Generalization in RL

These experiments use Roboschool. {Hopper|Cheetah}-Generalize in the paper, corresponding to configs/generalize/Sunblaze<Hopper|HalfCheetah>/<DD-DR-DE|RD-RR-RE>. First, activate the Roboschool Singularity environment as described in the installation section.

To train on the Default environment and test on all environments, use *DD-DR-DE*.yml; to train on the Random environment and test on all environments, use *RD-RR-RE*.yml. Please see the SunBlaze paper for details.

Take running on SunblazeHalfCheetahRandomNormal-v0 as an example:

## In the docker environment:
# Run our implementation
python3 policies/main.py --cfg configs/generalize/SunblazeHalfCheetah/RD-RR-RE/rnn.yml --algo td3
# Run Markovian
python3 policies/main.py --cfg configs/generalize/SunblazeHalfCheetah/RD-RR-RE/mlp.yml --algo sac
# Run Oracle
python3 policies/main.py --cfg configs/generalize/SunblazeHalfCheetah/RD-RR-RE/mlp.yml --algo sac --oracle

Temporal Credit Assignment

{Delayed-Catch, Key-to-Door} in the paper, corresponding to configs/credit/<catch|keytodoor>. Note that this is discrete control on pixel inputs, so the architecture is a bit different from the default one.

To reproduce our results, please run:

python3 policies/main.py --cfg configs/credit/catch/rnn.yml
python3 policies/main.py --cfg configs/credit/keytodoor/rnn.yml

Atari

Although Atari environments are not the focus of this paper, we provide an implementation for training on a single game, following the DreamerV2 setting. The hyperparameters are not well tuned, so the results are not expected to be strong.

# train on Pong (confirmed it can work on Pong)
python3 policies/main.py --cfg configs/atari/rnn.yml --env Pong

Misc

Draw and Download the Learning Curves

Please see plot_curves.md for details on plotting.

Details of Our Implementation of Recurrent Model-Free RL: Decision Factors, Best Variants, Code Features

Please see our_details.md for more information on:

  • How to tune the decision factors discussed in the paper in the configuration files
  • How to tune the other hyperparameters that are also important to training
  • Our best variants in each subarea and numeric results of learning curves

Acknowledgement

Please see acknowledge.md for details.

Citation

If you find our code useful to your work, please consider citing our paper:

@inproceedings{ni2022recurrent,
  title={Recurrent Model-Free {RL} Can Be a Strong Baseline for Many {POMDP}s},
  author={Ni, Tianwei and Eysenbach, Benjamin and Salakhutdinov, Ruslan},
  booktitle={International Conference on Machine Learning},
  pages={16691--16723},
  year={2022},
  organization={PMLR}
}

Contributing

Before submitting a pull request, please reformat your code:

# avoid trailing commas issue after kwargs
black . -t py35

Other Implementations

You may also find other PyTorch implementations of recurrent model-free RL useful.

Contact

If you have any questions, please create an issue in this repository or contact Tianwei Ni ([email protected]).

pomdp-baselines's People

Contributors

twni2016

pomdp-baselines's Issues

Hidden state for subsequence

For the temporal credit assignment problem, I see that you're randomly choosing a start position from sampled episodes. Let's say the subsequence is s4, s5, ..., s10. What do you pass as the hidden state h3 to the LSTM/GRU?

Also, when choosing the end position, you add the context length to the randomly chosen start positions. However, it is possible that the context length is greater than the episode length, right? In that case, the end position overflows into the next episode. How are you masking those steps out?

Thanks,
kb

Handle cases when the episode length < 2

Hello,

I wonder whether you have any thoughts on how to handle this case? Currently, you assume it never happens, and the code raises an error for such cases. However, in some domains, like MiniGrid Lava Crossing, this might happen quite often. Thanks!

MuJoCo: hyperparameters used for PPO_GRU and A2C_GRU

Dear author,

I have read your paper on MuJoCo experiments and I am particularly interested in the hyperparameters used for PPO_GRU and A2C_GRU. I would greatly appreciate it if you could provide me with the code or detailed information regarding these hyperparameters.

In my own implementation, specifically for PPO_GRU, I utilized the code from https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail, and set num-steps to 128, num-processes to 16, and num-mini-batch to 16. However, the results I obtained were significantly worse than the ones reported in your paper.

I kindly request your assistance in understanding if there are any additional parameters or considerations that I might have overlooked. Your expert guidance would be immensely valuable to me.

Here are my complete parameters:

    parser.add_argument(
        '--algo', default='ppo', help='algorithm to use: a2c | ppo | acktr')
    parser.add_argument(
        '--lr', type=float, default=3e-4, help='learning rate (default: 7e-4)')
    parser.add_argument(
        '--eps',
        type=float,
        default=1e-5,
        help='RMSprop optimizer epsilon (default: 1e-5)')
    parser.add_argument(
        '--alpha',
        type=float,
        default=0.99,
        help='RMSprop optimizer apha (default: 0.99)')
    parser.add_argument(
        '--gamma',
        type=float,
        default=0.99,
        help='discount factor for rewards (default: 0.99)')
    parser.add_argument(
        '--use-gae',
        action='store_true',
        default=True,
        help='use generalized advantage estimation')
    parser.add_argument(
        '--gae-lambda',
        type=float,
        default=0.95,
        help='gae lambda parameter (default: 0.95)')
    parser.add_argument(
        '--entropy-coef',
        type=float,
        default=0.00,
        help='entropy term coefficient (default: 0.01)')
    parser.add_argument(
        '--value-loss-coef',
        type=float,
        default=0.5,
        help='value loss coefficient (default: 0.5)')
    parser.add_argument(
        '--max-grad-norm',
        type=float,
        default=0.5,
        help='max norm of gradients (default: 0.5)')
    parser.add_argument(
        '--seed', type=int, default=1, help='random seed (default: 1)')
    parser.add_argument(
        '--cuda-deterministic',
        action='store_true',
        default=False,
        help="sets flags for determinism when using CUDA (potentially slow!)")
    parser.add_argument(
        '--num-processes',
        type=int,
        default=16,
        help='how many training CPU processes to use (default: 16)')
    parser.add_argument(
        '--num-steps',
        type=int,
        default=128,
        help='number of forward steps in A2C (default: 5)')
    parser.add_argument(
        '--ppo-epoch',
        type=int,
        default=10,
        help='number of ppo epochs (default: 4)')
    parser.add_argument(
        '--num-mini-batch',
        type=int,
        default=16,
        help='number of batches for ppo (default: 32)')
    parser.add_argument(
        '--clip-param',
        type=float,
        default=0.2,
        help='ppo clip parameter (default: 0.2)')
    parser.add_argument(
        '--log-interval',
        type=int,
        default=1,
        help='log interval, one log per n updates (default: 10)')
    parser.add_argument(
        '--save-interval',
        type=int,
        default=100,
        help='save interval, one save per n updates (default: 100)')
    parser.add_argument(
        '--eval-interval',
        type=int,
        default=1,
        help='eval interval, one eval per n updates (default: None)')
    parser.add_argument(
        '--num-env-steps',
        type=int,
        default=1.5e6,
        help='number of environment steps to train (default: 10e6)')
    parser.add_argument(
        '--env-name',
        default='AntBLT-P-v0',
        help='environment to train on (default: PongNoFrameskip-v4)')
    parser.add_argument(
        '--log-dir',
        default='./tmp/gym/',
        help='directory to save agent logs (default: /tmp/gym)')
    parser.add_argument(
        '--save-dir',
        default='./trained_models/',
        help='directory to save agent logs (default: ./trained_models/)')
    parser.add_argument(
        '--no-cuda',
        action='store_true',
        default=False,
        help='disables CUDA training')
    parser.add_argument(
        '--use-proper-time-limits',
        action='store_true',
        default=True,
        help='compute returns taking into account time limits')
    parser.add_argument(
        '--recurrent-policy',
        action='store_true',
        default=True,
        help='use a recurrent policy')
    parser.add_argument(
        '--use-linear-lr-decay',
        action='store_true',
        default=True,
        help='use a linear schedule on the learning rate')
    args = parser.parse_args()

Q Overestimation

I'm rerunning velocity baselines in the POMDP directory and I'm observing exploding Q values fairly often. I was wondering if this is something you experienced during training. TD3 seems to avoid overestimation bias but the returns seem low. Any tips to get more stable returns across trials without massive batch sizes?

Double Episode in minibatch

There is a bug in how episodes are saved vs how they are retrieved in the SeqReplayBuffer class. Episodes are stored according to their actual length, but are retrieved based on their maximum length.

For example, let's say my environment has a maximal episode length of 100 and terminates upon success. My agent is doing well and ends subsequent episodes after 20 and 30 steps. These episodes are stored in the buffer directly after each other, and the third episode will start at index 50.

However, when sampling episodes, the random_episodes method does not account for this and assumes in its first for loop that every episode has the maximum episode length (self._sampled_seq_len). When the first episode is sampled, it will contain indices 0-100 and will thus also contain episodes 2 and 3.

I propose this can be fixed by changing the incrementing of self._top and self._size in the add_episode method. In these lines, change + seq_len to + self._sampled_seq_len.
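A standalone toy illustration of the indexing mismatch described above (hypothetical numbers and variable names, not the repository's buffer code):

import numpy as np

max_episode_len = 100           # environment time limit, i.e. the assumed sequence length
episode_lengths = [20, 30, 50]  # actual lengths of consecutively stored episodes

# Episodes are stored back-to-back at their true lengths:
starts = np.cumsum([0] + episode_lengths[:-1])  # -> array([ 0, 20, 50])

# If retrieval assumes every episode spans max_episode_len, the slice for
# "episode 0" covers indices 0..99 and swallows episodes 1 and 2 as well:
assumed_slice = np.arange(starts[0], starts[0] + max_episode_len)
print(assumed_slice.max() >= starts[1])  # True: the slice crosses an episode boundary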

TypeError at POMDPWrapper

Dear,

I encountered the following error when trying to execute your code:

Traceback (most recent call last):
  File "<path>\pomdp-baselines\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "<path>\pomdp-baselines\policies\learner.py", line 420, in collect_rollouts
    obs = ptu.from_numpy(self.train_env.reset(seed=self.seed))  # reset
  File "<path>\pomdp-baselines\venv\lib\site-packages\gym\wrappers\time_limit.py", line 68, in reset
    return self.env.reset(**kwargs)
  File "<path>\pomdp-baselines\venv\lib\site-packages\gym\wrappers\order_enforcing.py", line 42, in reset
    return self.env.reset(**kwargs)
  File "<path>\pomdp-baselines\venv\lib\site-packages\gym\wrappers\env_checker.py", line 45, in reset
    return env_reset_passive_checker(self.env, **kwargs)
  File "<path>\pomdp-baselines\venv\lib\site-packages\gym\utils\passive_env_checker.py", line 192, in env_reset_passive_checker
    result = env.reset(**kwargs)
  File "<path>\pomdp-baselines\envs\pomdp\wrappers.py", line 32, in reset
    return self.get_obs(state)
  File "<path>\pomdp-baselines\envs\pomdp\wrappers.py", line 28, in get_obs
    return state[self.partially_obs_dims].copy()
TypeError: tuple indices must be integers or slices, not list

At the error line, state = ([-0.001 -0.02 -0.023 0.049], {}) is a tuple of an observation array and a dictionary, and self.partially_obs_dims = [1, 3] is a list.
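For illustration, here is a minimal, standalone sketch of the incompatibility, assuming the newer Gym API (gym >= 0.26, as in the requirements below) where reset() returns an (obs, info) tuple; the names here are hypothetical and not the repository's wrapper code:

import gym
import numpy as np

env = gym.make("CartPole-v1")
partially_obs_dims = [1, 3]     # hypothetical masked dimensions, mirroring the wrapper

result = env.reset(seed=0)      # newer Gym API: returns (obs, info), not obs alone
obs, info = result              # indexing the tuple directly raises the TypeError above
masked_obs = np.asarray(obs)[partially_obs_dims].copy()
print(masked_obs.shape)         # (2,)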

I executed the following command: python policies/main.py -cfg configs/pomdp/cartpole/v/rnn.yml --algo sacd

I am running Python 3.9.7, with the following requirements (installing your conda environment did not work for me on Windows):

absl-py==1.4.0
astunparse==1.6.3
box2d-py==2.3.5
cachetools==5.3.1
certifi==2023.5.7
charset-normalizer==3.2.0
cloudpickle==2.2.1
contourpy==1.1.0
cycler==0.11.0
filelock==3.12.2
flatbuffers==23.5.26
fonttools==4.41.0
gast==0.4.0
google-auth==2.22.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.56.0
gym==0.26.2
gym-notices==0.0.8
h5py==3.9.0
idna==3.4
importlib-metadata==6.8.0
importlib-resources==6.0.0
Jinja2==3.1.2
joblib==1.3.1
keras==2.10.0
Keras-Preprocessing==1.1.2
kiwisolver==1.4.4
libclang==16.0.6
Markdown==3.4.3
MarkupSafe==2.1.3
matplotlib==3.7.2
mpmath==1.3.0
networkx==3.1
numpy==1.25.1
oauthlib==3.2.2
opencv-python==4.8.0.74
opt-einsum==3.3.0
packaging==23.1
pandas==2.0.3
Pillow==10.0.0
protobuf==3.19.6
psutil==5.9.5
pyasn1==0.5.0
pyasn1-modules==0.3.0
pybullet==3.2.5
pycolab==1.2
pygame==2.1.0
swig==4.1.1
sympy==1.12
tensorboard==2.10.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.10.1
tensorflow-estimator==2.10.0
tensorflow-io-gcs-filesystem==0.31.0
termcolor==2.3.0
threadpoolctl==3.2.0
torch==2.0.1
typing_extensions==4.7.1
tzdata==2023.3
urllib3==1.26.16
Werkzeug==2.3.6
wrapt==1.15.0
zipp==3.16.2

bug report

I found that line 126 of sacd.py uses action before action has been defined.

Same image encoder for RNN and Actor/Critic shortcut?

According to Figure 7 of the paper, separate modules are used to create observation embeddings for the RNN and shortcut observation embeddings for the actor/critic MLP head, respectively.

However, in the code it appears that the same image encoder is used to create image embeddings for the RNN and MLP (e.g. the same self.image_encoder is used in get_obs_embedding() and get_shortcut_obs_embedding() here).

Was there any reason for this design choice?
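For reference, an illustrative sketch of the two design choices in question (hypothetical modules, not the repository's actual classes): a single shared image encoder feeding both paths versus two separately parameterized encoders, as Figure 7 suggests.

import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """A toy image encoder standing in for the repository's image encoder."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embed_dim),
        )
    def forward(self, x):
        return self.net(x)

obs = torch.randn(8, 3, 64, 64)  # a batch of pixel observations

# Shared encoder (what the issue observes in the code): one module feeds both paths.
shared = TinyCNN()
rnn_embedding = shared(obs)        # embedding fed into the RNN
shortcut_embedding = shared(obs)   # shortcut embedding fed into the actor/critic MLP

# Separate encoders (what Figure 7 depicts): independently parameterized modules.
rnn_encoder, shortcut_encoder = TinyCNN(), TinyCNN()
rnn_embedding = rnn_encoder(obs)
shortcut_embedding = shortcut_encoder(obs)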
