pku-alignment / safe-policy-optimization

NeurIPS 2023: Safe Policy Optimization: A benchmark repository for safe reinforcement learning algorithms

Home Page: https://safe-policy-optimization.readthedocs.io/en/latest/index.html

License: Apache License 2.0

Languages: Python 99.35%, Makefile 0.65%
Topics: safe, benchmarks, constrained-reinforcement-learning, reinforcement-learning-algorithms, safe-reinforcement-learning

safe-policy-optimization's Introduction


Citing Safe Policy Optimization

If you find Safe Policy Optimization useful, please cite it in your publications.

@article{ji2023safety,
  title={Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark},
  author={Ji, Jiaming and Zhang, Borong and Zhou, Jiayi and Pan, Xuehai and Huang, Weidong and Sun, Ruiyang and Geng, Yiran and Zhong, Yifan and Dai, Juntao and Yang, Yaodong},
  journal={arXiv preprint arXiv:2310.12567},
  year={2023}
}

What's New:

  • Feel free to open an issue if you encounter any problem on Mac or Windows.
  • We have released the Documentation.
  • The benchmark results of SafePO can be viewed in the Wandb Report.

Safe Policy Optimization (SafePO) is a comprehensive algorithm benchmark for Safe Reinforcement Learning (Safe RL). It provides the RL research community with a unified platform for implementing and evaluating algorithms in various safe reinforcement learning environments. To better help the community study this problem, SafePO is developed with the following key features:

Correctness. For a benchmark, it is critical to ensure correctness and reliability. To achieve this goal, we examine the implementation of SafePO carefully. First, each algorithm is implemented strictly according to the original paper (e.g., ensuring consistency with the gradient flow of the original paper). Second, for algorithms with a commonly acknowledged open-source code base, we compare our implementation with it line by line to double-check correctness. Finally, we compare SafePO with existing benchmarks (e.g., Safety-Starter-Agents and RL-Safety-Algorithms), and our implementation outperforms other existing implementations.

Extensibility. SafePO enjoys high extensibility thanks to its architecture. New algorithms can be integrated into SafePO by inheriting from the base algorithms and implementing only their unique features. For example, we integrate PPO by inheriting from Policy Gradient, adding only the clip-ratio variable and overriding the function that computes the policy loss. In a similar way, new algorithms can easily be added to SafePO.
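For a rough sketch of what such an extension can look like (the class and method names below are illustrative only, not SafePO's actual interfaces):

import torch


class PolicyGradient:
    # Minimal stand-in for a base algorithm that computes the vanilla policy-gradient loss.
    def compute_policy_loss(self, log_prob, old_log_prob, adv):
        return -(log_prob * adv).mean()


class PPO(PolicyGradient):
    # Adds only a clip-ratio hyperparameter and overrides the policy-loss computation.
    def __init__(self, clip_ratio=0.2):
        self.clip_ratio = clip_ratio

    def compute_policy_loss(self, log_prob, old_log_prob, adv):
        ratio = torch.exp(log_prob - old_log_prob)
        clipped = torch.clamp(ratio, 1.0 - self.clip_ratio, 1.0 + self.clip_ratio)
        return -torch.min(ratio * adv, clipped * adv).mean()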

Logging and Visualization. Another important feature of SafePO is logging and visualization. Supporting both TensorBoard and WandB, we offer code for visualizing more than 40 parameters and intermediate computation results so that the training process can be inspected. Common parameters and metrics, such as the KL-divergence, SPS (steps per second), and the variance of cost, are visualized universally. During training, users can inspect the change of every parameter, collect the log files, and obtain saved checkpoint models. The complete and comprehensive visualization allows easier observation, model selection, and comparison.
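As a minimal sketch of this kind of dual logging (using the standard TensorBoard and WandB APIs; the Metrics/* names mirror those used by SafePO's logger, while Train/KL and Time/SPS are illustrative):

import wandb
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="./runs/example")
wandb.init(project="safepo-example", mode="offline")  # offline mode avoids needing an account

global_step = 100
metrics = {"Metrics/EpRet": 12.3, "Metrics/EpCost": 4.5, "Train/KL": 0.01, "Time/SPS": 2500}

for name, value in metrics.items():
    writer.add_scalar(name, value, global_step)  # TensorBoard scalar
wandb.log(metrics, step=global_step)              # same metrics to Weights & Biases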

Documentation. In addition to its code implementation, SafePO comes with extensive documentation. We include detailed guidance on installation and solutions to common issues. Moreover, we provide instructions on basic usage and advanced customization of SafePO. Official information concerning maintenance and ethical and responsible use is stated clearly for reference.

Overview of Algorithms

Here we provide a table of Safe RL algorithms that the benchmark includes.

note: Four more classic RL algorithms are also included in the benchmark, namely PG, NaturalPG, TRPO, and PPO.

| Algorithm | Proceedings & Cites | Official Code Repo |
|---|---|---|
| PPO-Lag | N/A | TensorFlow 1 |
| TRPO-Lag | N/A | TensorFlow 1 |
| CUP | NeurIPS 2022 (Cite: 6) | PyTorch |
| FOCOPS | NeurIPS 2020 (Cite: 27) | PyTorch |
| CPO | ICML 2017 (Cite: 663) | N/A |
| PCPO | ICLR 2020 (Cite: 67) | Theano |
| RCPO | ICLR 2019 (Cite: 238) | N/A |
| CPPO-PID | NeurIPS 2020 (Cite: 71) | PyTorch |
| MACPO | Preprint (Cite: 4) | PyTorch |
| MAPPO-Lag | Preprint (Cite: 4) | PyTorch |
| HAPPO (purely reward optimisation) | ICLR 2022 (Cite: 10) | PyTorch |
| MAPPO (purely reward optimisation) | Preprint (Cite: 98) | PyTorch |

Supported Environments: Safety-Gymnasium

For more details, please refer to Safety-Gymnasium.

| Category | Task | Agent | Example |
|---|---|---|---|
| Safe Navigation | Goal[012] | Point, Car, Doggo, Racecar, Ant | SafetyPointGoal1-v0 |
| | Button[012] | | |
| | Push[012] | | |
| | Circle[012] | | |
| Safe Velocity | Velocity | HalfCheetah, Hopper, Swimmer, Walker2d, Ant, Humanoid | SafetyAntVelocity-v1 |
| Safe Multi-Agent | MultiGoal[012] | Multi-Point, Multi-Ant | SafetyAntMultiGoal1-v0 |
| | Multi-Agent Velocity | 6x1HalfCheetah, 2x3HalfCheetah, 3x1Hopper, 2x1Swimmer, 2x3Walker2d, 2x4Ant, 4x2Ant, 9\|8Humanoid | Safety2x4AntVelocity-v0 |
| Safe Isaac Gym | FreightFrankaCloseDrawer | FreightFranka | FreightFrankaCloseDrawer |
| | FreightFrankaPickAndPlace | | |
| | ShadowHandCatchOver2Underarm_Safe_finger | ShadowHands | ShadowHandCatchOver2Underarm_Safe_finger |
| | ShadowHandCatchOver2Underarm_Safe_joint | | |
| | ShadowHandOver_Safe_finger | | |
| | ShadowHandOver_Safe_joint | | |

note:

  • Safe Velocity and Safe Isaac Gym tasks support both single-agent and multi-agent algorithms.
  • Safe Navigation tasks support single-agent algorithms.
  • Safe MultiGoal tasks support multi-agent algorithms.
  • Safe Isaac Gym tasks do not yet support evaluation after training.
  • As Isaac Gym is not hosted on PyPI, you should install it manually, then ensure that Isaac Gym works on your system by running one of the examples from the python/examples directory, such as joint_monkey.py.
  • ❗️As the Safe MultiGoal and Safe Isaac Gym tasks are not available on PyPI due to the large package size, please install Safety-Gymnasium manually to run those two task families, using the following commands:
conda create -n safepo python=3.8
conda activate safepo
wget https://github.com/PKU-Alignment/safety-gymnasium/archive/refs/heads/main.zip
unzip main.zip
cd safety-gymnasium-main
pip install -e .
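
After installing, you can sanity-check the environment with a short rollout (a sketch assuming the Gymnasium-style API documented by Safety-Gymnasium, where step() additionally returns a cost signal):

import safety_gymnasium

env = safety_gymnasium.make("SafetyPointGoal1-v0")
obs, info = env.reset(seed=0)
for _ in range(10):
    action = env.action_space.sample()  # random policy, just to exercise the API
    obs, reward, cost, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()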

Selected Tasks

| Base Environment | Description |
|---|---|
| ShadowHandOver | These environments involve two fixed-position hands. The hand that starts with the object must find a way to hand it over to the second hand. |
| ShadowHandCatchOver2Underarm | This environment is made up of half ShadowHandCatchUnderarm and half ShadowHandCatchOverarm; the object needs to be thrown from the vertical hand to the palm-up hand. |

We add several constraints to these base environments, including Safe finger and Safe joint. For more details, please refer to Safety-Gymnasium.

Pre-requisites

To use SafePO-Baselines, you need to install the environments. Please refer to Safety-Gymnasium for more details on installation. Details regarding the installation of Isaac Gym can be found here.

Conda-Environment

conda create -n safepo python=3.8
conda activate safepo
# Because of CUDA version differences, we recommend installing PyTorch manually.
pip install -e .

Getting Started

Efficient Commands

To verify the performance of SafePO, you can run the following:

conda create -n safepo python=3.8
conda activate safepo
make benchmark

We also support simple benchmark commands for single-agent and multi-agent algorithms:

conda create -n safepo python=3.8
conda activate safepo
make simple-benchmark

The above commands will run all algorithms in sampled environments to get a quick overview of the performance of the algorithms.

Please note that these commands will reinstall Safety-Gymnasium from PyPI. To run Safe Isaac Gym and Safe MultiGoal tasks, please reinstall it manually from source:

conda activate safepo
wget https://github.com/PKU-Alignment/safety-gymnasium/archive/refs/heads/main.zip
unzip main.zip
cd safety-gymnasium-main
pip install -e .

Single-Agent

Each algorithm file is a training entry point: running ALGO.py with arguments specifying the algorithm and environment starts training. For example, to run PPO-Lag in SafetyPointGoal1-v0 with seed 0, you can use the following command:

cd safepo/single_agent
python ppo_lag.py --task SafetyPointGoal1-v0 --seed 0

To run a benchmark in parallel, for example PPO-Lag and TRPO-Lag in SafetyAntVelocity-v1 and SafetyHalfCheetahVelocity-v1, you can use the following commands:

cd safepo/single_agent
python benchmark.py --tasks SafetyAntVelocity-v1 SafetyHalfCheetahVelocity-v1 --algo ppo_lag trpo_lag --workers 2

The commands above will run two processes in parallel; each process runs one algorithm in one environment. The results will be saved in ./runs/.

Multi-Agent

We also provide a safe MARL algorithm benchmark for the challenging Safety-Gymnasium Safe Multi-Agent Velocity, Safe Isaac Gym, and Safe MultiGoal tasks. HAPPO, MACPO, MAPPO-Lag, and MAPPO have already been implemented.

To train a multi-agent algorithm:

cd safepo/multi_agent
python macpo.py --task Safety2x4AntVelocity-v0 --experiment benchmark

You can also train on Isaac Gym based environments if you have installed Isaac Gym:

cd safepo/multi_agent
python macpo.py --task ShadowHandOver_Safe_joint --experiment benchmark

Experiment Evaluation

After running the experiment, you can use the following command to plot the results:

cd safepo
python plot.py --logdir ./runs/benchmark

To evaluate the performance of the algorithm, you can use the following command:

cd safepo
python evaluate.py --benchmark-dir ./runs/benchmark

Machine Configuration

We test all algorithms and experiments on a machine with an AMD Ryzen Threadripper PRO 3975WX 32-core CPU and an NVIDIA GeForce RTX 3090 GPU (driver version 495.44). All of our experiments are run on Linux. If you encounter any problem on Mac or Windows, please feel free to open an issue.

Ethical and Responsible Use

SafePO aims to benefit research in the safe RL community and is released under the Apache-2.0 license. Illegal usage or any violation of the license is not allowed.

PKU-Alignment Team

SafePO is a project contributed by the PKU-Alignment team at Peking University. We also thank the contributors of the following open-source repositories: Spinning Up, Bullet-Safety-Gym, Safety-Gym.

safe-policy-optimization's People

Contributors

chauncygu, friedmainfunction, gaiejj, gengyiran, ivan-zhong, muchvo, zmsn-2077


safe-policy-optimization's Issues

Something about IPO

Hello, thanks for your excellent work. For the IPO algorithm, I wonder whether it is suitable for environments with a discrete action space, since the interior-point method is not well suited to optimization problems with discrete decision variables.

Safexp-PointGoal1-v0 vs SafetyPointGoal1-v0

Dear maintainers,

What is the difference between Safexp-PointGoal1-v0 and SafetyPointGoal1-v0?

I find that results from SafetyPointGoal1-v0 are generally smaller than those from Safexp-PointGoal1-v0.

question about env reset

obs, _ = env.reset()
obs = torch.as_tensor(obs, dtype=torch.float32, device=device)
ep_ret, ep_cost, ep_len = (
    np.zeros(args.num_envs),
    np.zeros(args.num_envs),
    np.zeros(args.num_envs),
)
# training loop
for epoch in range(epochs):
    rollout_start_time = time.time()
    # collect samples until we have enough to update
    for steps in range(local_steps_per_epoch):  

Why does your code only call env.reset() at the beginning, rather than at the start of each epoch?

Question about logger value

                if done or time_out:
                    rew_deque.append(ep_ret[idx])
                    cost_deque.append(ep_cost[idx])
                    len_deque.append(ep_len[idx])
                    logger.store(
                        **{
                            "Metrics/EpRet": np.mean(rew_deque),
                            "Metrics/EpCost": np.mean(cost_deque),
                            "Metrics/EpLen": np.mean(len_deque),
                        }
                    )
                    ep_ret[idx] = 0.0
                    ep_cost[idx] = 0.0
                    ep_len[idx] = 0.0

I'm confused about this np.mean(cost_deque): it makes EpCost correlated across epochs, so ep_costs = logger.get_stats("Metrics/EpCost") no longer matches the Jc definition in safe RL papers. Is the purpose to average the newly added data with data from many previous episodes, which improves training stability and makes the drawn training curve look smoother?
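
For what it's worth, a small stand-alone illustration of the difference (hypothetical numbers, not SafePO code):

from collections import deque
import numpy as np

cost_deque = deque(maxlen=50)               # keeps the last 50 finished episodes
epoch_costs = [[30.0, 32.0], [10.0, 12.0]]  # episode costs finished in two consecutive epochs

for epoch, episodes in enumerate(epoch_costs):
    cost_deque.extend(episodes)
    per_epoch_mean = np.mean(episodes)   # the per-epoch Jc estimate
    smoothed_mean = np.mean(cost_deque)  # what np.mean(cost_deque) reports
    print(epoch, per_epoch_mean, smoothed_mean)  # epoch 1 prints 11.0 vs 21.0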

Doubt about the updating method of Lagrange Multipliers

from safepo.common.lagrange import Lagrange

nu = 1.0
nu_lr = 0.1
ep_cost = 35
lagrange = Lagrange(
    cost_limit=25.0,
    lagrangian_multiplier_init=1.0,
    lagrangian_multiplier_lr=0.1,
)

print("Before update:")
print(f"Learning Rate: {lagrange.lambda_optimizer.param_groups[0]['lr']}")

lagrange.update_lagrange_multiplier(ep_cost)

learn_lag = lagrange.lagrangian_multiplier

nu += nu_lr * (ep_cost - 25.0)

print(f"Lagrange multiplier: {learn_lag}")
print(f"Nu: {nu}")

There are two methods: one treats the multiplier as a learnable parameter updated by the Adam optimizer, and the other assigns the value directly. Due to the adaptive learning rate of the Adam optimizer, the multipliers obtained by the two methods are not consistent, and under the same learning rate the multiplier changes much faster with direct assignment.

I understand that the original PPO-Lagrangian code uses the former method, while the original FOCOPS and CUP code seems to use the latter. Should the two be distinguished, or can the former be used uniformly?

why?

Traceback (most recent call last):
File "train.py", line 54, in
if mpi_tools.mpi_fork(args.cores,use_number_of_threads=use_number_of_threads):
File "/content/drive/MyDrive/save/Safe-Policy-Optimization/safepo/common/mpi_tools.py", line 97, in mpi_fork
subprocess.check_call(args, env=env)
File "/usr/lib/python3.7/subprocess.py", line 363, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpirun', '-np', '2', '--use-hwthread-cpus', '/usr/bin/python3', 'train.py', '--env-id', 'Safexp-PointGoal1-v0', '--algo', 'ppo-lag', '--cores', '2', '--seed', '0']' returned non-zero exit status 1.
[5e506354ec7b:06923] *** Process received signal ***
Why do I get this error? Please help me, thank you.

Process conflict caused abnormal termination

  1. Follow the installation instructions to install
  2. An exception occurred while executing this command:
    python safepo/train.py --env_id Safexp-PointGoal1-v0 --algo ppo_lagrangian --cores 4
  3. Exception prompt:

Traceback (most recent call last):
File "safepo/train.py", line 61, in
model = Runner(
File "/home/jiayiguan/opt/paper_carla/Safe-Policy-Optimization/safepo/common/runner.py", line 36, in init
self.default_kwargs = get_defaults_kwargs_yaml(algo=algo,env_id=env_id)
File "/home/jiayiguan/opt/paper_carla/Safe-Policy-Optimization/safepo/common/utils.py", line 15, in get_defaults_kwargs_yaml
return kwargs[kwargs_name]
KeyError: 'defaults'

Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[52236,1],0]
Exit code: 1

Question about the torch.Size of loss_pi in the FOCOPS implementation

I found this in focops.py:

ratio = torch.exp(log_prob - log_prob_b)
temp_kl = torch.distributions.kl_divergence(
    distribution, old_distribution_b
).sum(-1, keepdim=True)
loss_pi = (temp_kl - (1 / FOCOPS_LAM) * ratio * adv_b) * (
    temp_kl.detach() <= dict_args['target_kl']
).type(torch.float32)
loss_pi = loss_pi.mean()
Assuming a minibatch size of 64, temp_kl.shape is (64, 1) because keepdim=True is used when computing temp_kl, but the other terms in loss_pi have shape (64,), which makes loss_pi.shape (64, 64) instead of (64, 1) or (64,). So why not keep the dimensions consistent?
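
A small check of the broadcasting behaviour described above (illustrative shapes only):

import torch

temp_kl = torch.zeros(64, 1)   # keepdim=True keeps the trailing dimension
ratio_adv = torch.zeros(64)    # ratio * adv_b has shape (64,)
loss_pi = temp_kl - ratio_adv  # (64, 1) broadcasts against (64,) -> (64, 64)
print(loss_pi.shape)           # torch.Size([64, 64])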

Question about rescale in CPO

    b_flat = get_flat_gradients_from(self.ac.pi.net)

    ep_costs = self.logger.get_stats('EpCosts')[0]
    c = ep_costs - self.cost_limit
    c /= (self.logger.get_stats('EpLen')[0] + eps)  # rescale
    self.logger.log(f'c = {c}')
    self.logger.log(f'b^T b = {b_flat.dot(b_flat).item()}')

Why do we need to rescale here?
