
sac's Introduction

This repository is no longer maintained. Please use our new Softlearning package instead.

Soft Actor-Critic

Soft actor-critic is a deep reinforcement learning framework for training maximum entropy policies in continuous domains. The algorithm is based on the paper Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor presented at ICML 2018.

This implementation uses TensorFlow. For a PyTorch implementation of soft actor-critic, take a look at rlkit by Vitchyr Pong.

See the DIAYN documentation for using SAC for learning diverse skills.

Getting Started

Soft Actor-Critic can be run either locally or through Docker.

Prerequisites

You will need to have Docker and Docker Compose installed unless you want to run the environment locally.

Most of the models require a Mujoco license.

Docker installation

If you want to run the Mujoco environments, the Docker environment needs to know where to find your Mujoco license key (mjkey.txt). You can either copy your key into <PATH_TO_THIS_REPOSITORY>/.mujoco/mjkey.txt, or you can specify the path to the key in your environment variables:

export MUJOCO_LICENSE_PATH=<path_to_mujoco>/mjkey.txt

Once that's done, you can run the Docker container with

docker-compose up

Docker Compose creates a Docker container named soft-actor-critic and automatically sets the needed environment variables and volumes.

You can access the container with the typical Docker exec command, i.e.

docker exec -it soft-actor-critic bash

See the examples section for examples of how to train and simulate the agents.

To clean up the setup:

docker-compose down

Local installation

To get the environment installed correctly, you will first need to clone rllab, and have its path added to your PYTHONPATH environment variable.

  1. Clone rllab
cd <installation_path_of_your_choice>
git clone https://github.com/rll/rllab.git
cd rllab
git checkout b3a28992eca103cab3cb58363dd7a4bb07f250a0
export PYTHONPATH=$(pwd):${PYTHONPATH}
  2. Download and copy Mujoco files to the rllab path: If you're running on OSX, download https://www.roboti.us/download/mjpro131_osx.zip instead, and copy the .dylib files instead of the .so files.
mkdir -p /tmp/mujoco_tmp && cd /tmp/mujoco_tmp
wget -P . https://www.roboti.us/download/mjpro131_linux.zip
unzip mjpro131_linux.zip
mkdir <installation_path_of_your_choice>/rllab/vendor/mujoco
cp ./mjpro131/bin/libmujoco131.so <installation_path_of_your_choice>/rllab/vendor/mujoco
cp ./mjpro131/bin/libglfw.so.3 <installation_path_of_your_choice>/rllab/vendor/mujoco
cd ..
rm -rf /tmp/mujoco_tmp
  3. Copy your Mujoco license key (mjkey.txt) to the rllab path:
cp <mujoco_key_folder>/mjkey.txt <installation_path_of_your_choice>/rllab/vendor/mujoco
  4. Clone sac
cd <installation_path_of_your_choice>
git clone https://github.com/haarnoja/sac.git
cd sac
  5. Create and activate the conda environment
cd sac
conda env create -f environment.yml
source activate sac

The environment should be ready to run. See the examples section for examples of how to train and simulate the agents.

Finally, to deactivate and remove the conda environment:

source deactivate
conda remove --name sac --all

Examples

Training and simulating an agent

  1. To train the agent
python ./examples/mujoco_all_sac.py --env=swimmer --log_dir="/root/sac/data/swimmer-experiment"
  2. To simulate the agent (NOTE: This step currently fails with the Docker installation, due to missing display.)
python ./scripts/sim_policy.py /root/sac/data/swimmer-experiment/itr_<iteration>.pkl

mujoco_all_sac.py contains several different environments, and there are more example scripts available in the /examples folder. For more information about the agents and configurations, run the scripts with the --help flag. For example:

python ./examples/mujoco_all_sac.py --help
usage: mujoco_all_sac.py [-h]
                         [--env {ant,walker,swimmer,half-cheetah,humanoid,hopper}]
                         [--exp_name EXP_NAME] [--mode MODE]
                         [--log_dir LOG_DIR]

Benchmark Results

Benchmark results for some of the OpenAI Gym v2 environments can be found here.

Credits

The soft actor-critic algorithm was developed by Tuomas Haarnoja under the supervision of Prof. Sergey Levine and Prof. Pieter Abbeel at UC Berkeley. Special thanks to Vitchyr Pong, who wrote some parts of the code, and Kristian Hartikainen, who helped test, document, and polish the code and streamline the installation process. The work was supported by Berkeley Deep Drive.

Reference

@inproceedings{haarnoja2017soft,
  title={Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor},
  author={Haarnoja, Tuomas and Zhou, Aurick and Abbeel, Pieter and Levine, Sergey},
  booktitle={Deep Reinforcement Learning Symposium},
  year={2017}
}

sac's People

Contributors

azhou42, ben-eysenbach, flirtguru, haarnoja, hartikainen


sac's Issues

what is "sandbox"

Traceback (most recent call last):
  File "/home/xtq/sac/examples/mujoco_all_sac.py", line 15, in <module>
    from sac.algos import SAC
  File "/home/xtq/sac/sac/algos/__init__.py", line 2, in <module>
    from .diayn import DIAYN
  File "/home/xtq/sac/sac/algos/diayn.py", line 10, in <module>
    from sac.policies.hierarchical_policy import FixedOptionPolicy
  File "/home/xtq/sac/sac/policies/__init__.py", line 1, in <module>
    from .nn_policy import NNPolicy
  File "/home/xtq/sac/sac/policies/nn_policy.py", line 6, in <module>
    from sandbox.rocky.tf.policies.base import Policy
ImportError: No module named 'sandbox'

a mathematical problem ..

I derived Equation 12, but my result does not match Equation 13 in your paper. In my derivation I do not get the first term of Equation 13, and I am not sure where I went wrong.
Can you help me?
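
For reference, a sketch of the quantities in question, assuming Equations 12 and 13 refer to the reparameterized policy objective and its gradient estimator in the ICML 2018 paper, with a_t = f_\phi(\epsilon_t; s_t):

\begin{align}
J_\pi(\phi) &= \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}} \Big[ \log \pi_\phi\big(f_\phi(\epsilon_t; s_t) \mid s_t\big) - Q_\theta\big(s_t, f_\phi(\epsilon_t; s_t)\big) \Big], \\
\hat{\nabla}_\phi J_\pi(\phi) &= \nabla_\phi \log \pi_\phi(a_t \mid s_t) + \big( \nabla_{a_t} \log \pi_\phi(a_t \mid s_t) - \nabla_{a_t} Q_\theta(s_t, a_t) \big)\, \nabla_\phi f_\phi(\epsilon_t; s_t).
\end{align}

The first term arises because \phi enters \log \pi_\phi(a_t \mid s_t) explicitly (with a_t held fixed) in addition to implicitly through a_t = f_\phi(\epsilon_t; s_t); differentiating only through the implicit dependence drops it.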

SAC and cross-maze ant

Hi,

Can SAC run with the cross-maze variation for ant? With the default parameters, the command:

"python ./examples/mujoco_all_sac.py --env=ant --domain=ant --task=cross-maze --policy=gmm --log_dir=data/ant_cross-experiment"

does not throw any errors but "Launches 0 experiments." I was able to track the problem down to rllab's _ivariants_sorted function in class VariantGenerator in misc/instrument.py, which returns a generator over an empty list when the cross-maze task is specified (it seems to work for Multidirection), as opposed to a list with a dictionary containing run parameters.

Am I doing something wrong or not including a run flag?

Other than that, thanks for sharing the code!

About Markovian environments

Hi,
thanks for the thorough implementation and for making this code available; it really helps to understand the internal mechanisms of the SAC algorithm.

I have a question regarding the code in sac/sac/envs/gym_env.py -
In the file's header you comment: "Rllab implementation with a HACK. See comment in GymEnv.__init__().", and then in the __init__() method, you write:

# HACK: Gets rid of the TimeLimit wrapper that sets 'done = True' when
# the time limit specified for each environment has been passed and
# therefore the environment is not Markovian (terminal condition depends
# on time rather than state).

I understand the point here, but I'm not sure I followed the implementation, as it seems to be an internal Gym code and is not found in the SAC code found in this repository.

Can you explain exactly what are you doing with the TimeLimit wrapper?
If you omit the done flag, do you still terminate the episode?

Specifically - in Gym's registration.py file the env class is wrapped with:

if env.spec.max_episode_steps is not None:
    from gym.wrappers.time_limit import TimeLimit
    env = TimeLimit(env, max_episode_steps=env.spec.max_episode_steps)

Furthermore, in the time_limit.py file -

def step(self, action):
    assert self._elapsed_steps is not None, "Cannot call env.step() before calling reset()"
    observation, reward, done, info = self.env.step(action)
    self._elapsed_steps += 1
    if self._elapsed_steps >= self._max_episode_steps:
        info['TimeLimit.truncated'] = not done
        done = True
    return observation, reward, done, info

If you omit these lines of code, how does the environment reset itself when max_episode_steps is reached?

Thanks!

Lior

The comprehension of the policy limitations in SAC

I greatly admire the SAC algorithm you created.
I have one guess about SAC's policy, and I would like your confirmation:

  • We assume that the Q-function follows a Boltzmann distribution, but it seems difficult to implement a policy that follows a Boltzmann distribution. Therefore, in practice the policy is coded with the most commonly used Gaussian distribution.
    However, when there is more than one good action in the same state, the Q-function is multimodal, while the Gaussian distribution tends to flatten out and thus becomes weak, as described in https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/

Is my understanding correct that "it is difficult to implement a policy that follows a Boltzmann distribution"? Which distributions have performed better than the Gaussian? I would like to ask your opinion.

Looking forward to your reply!

maximization bias

Hello, I'm not sure whether this is an issue or not, but I've been looking at your implementation for half an hour, and I think there might be a maximization bias in it. Specifically, you use the same batch of experience to update both Q-functions, while the paper says that two independent Q-functions benefit training.
I've tested this thought on a similar code base and the owner agreed with my view so far. I've opened a Stack Overflow question here. Could you say something about this? I think I'll test the implementation as well.
Thanks in advance.
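
For context, a minimal sketch of the clipped double-Q idea as popularized by TD3 (and used in later SAC variants), not this repository's exact update: both Q-functions are trained on the same sampled batch, and overestimation is countered by taking the minimum of the two estimates when forming the target, rather than by feeding the two networks independent data.

import numpy as np

def soft_q_targets(rewards, dones, target_q1, target_q2, next_log_pi,
                   gamma=0.99, alpha=1.0):
    # target_q1 / target_q2: target-network Q-values at the sampled next action
    # next_log_pi: log-probability of that action under the current policy
    min_next_q = np.minimum(target_q1, target_q2)        # clipped double-Q
    soft_value = min_next_q - alpha * next_log_pi        # soft state value
    return rewards + gamma * (1.0 - dones) * soft_value  # Bellman targets

Both Q-networks then regress toward this same target computed from the shared batch.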

TypeError: __init__() got an unexpected keyword argument 'event_ndims'

I followed the installation instructions, ran the example command below, and got a TypeError.

python ./examples/mujoco_all_sac.py --env=swimmer

I had to change import .variants to import variants in mujoco_all_sac.py. I think this is fine because I still get variants.__file__ = '[mypath]/sac/examples/variants.py'.

Then, I got this type error:

2018-07-05 17:46:10.885203 PDT | Setting seed to 5
using seed 5
WARNING:tensorflow:Variable += will be deprecated. Use variable.assign_add if you want assignment to the variable value or 'x = x + y' if you want a new python Tensor object.
[2018-07-05 17:46:14,736] Variable += will be deprecated. Use variable.assign_add if you want assignment to the variable value or 'x = x + y' if you want a new python Tensor object.
Traceback (most recent call last):
  File "/home/coline/Research2018/affordances/rllab/scripts/run_experiment_lite.py", line 137, in <module>
    run_experiment(sys.argv)
  File "/home/coline/Research2018/affordances/rllab/scripts/run_experiment_lite.py", line 121, in run_experiment
    method_call(variant_data)
  File "./examples/mujoco_all_sac.py", line 137, in run_experiment
    observations_preprocessor=observations_preprocessor)
  File "/home/coline/Research2018/affordances/sac/sac/policies/latent_space_policy.py", line 58, in __init__
    self.build()
  File "/home/coline/Research2018/affordances/sac/sac/policies/latent_space_policy.py", line 122, in build
    event_ndims=self._Da)
  File "/home/coline/Research2018/affordances/sac/sac/distributions/real_nvp_bijector.py", line 280, in __init__
    self.build()
  File "/home/coline/Research2018/affordances/sac/sac/distributions/real_nvp_bijector.py", line 311, in build
    for i in range(1, num_coupling_layers + 1)
  File "/home/coline/Research2018/affordances/sac/sac/distributions/real_nvp_bijector.py", line 311, in <listcomp>
    for i in range(1, num_coupling_layers + 1)
  File "/home/coline/Research2018/affordances/sac/sac/distributions/real_nvp_bijector.py", line 96, in __init__
    name=name)
TypeError: __init__() got an unexpected keyword argument 'event_ndims'

Dependency issue while building docker

I am having trouble creating the Docker container due to a numpy dependency issue.
When I run the docker-compose command, I get the following error:

Step 23/32 : RUN conda env create -f /root/sac/environment.yml     && conda env update
 ---> Running in c746b229bafa
Using Anaconda Cloud api site https://api.anaconda.org
Using Anaconda Cloud api site https://api.anaconda.org
Fetching package metadata: ................
Solving package specifications: ..........................................
Error: Could not find some dependencies for numpy ==1.13.0: blas * mkl, blas * openblas

Thanks

Why sac is off policy?

Hi, and thank you for such an ingenious algorithm.
I wonder why you consider SAC an off-policy algorithm. As far as I can tell, both in the code and in the paper, all moves are taken by the current policy, which is exactly the definition of an on-policy algorithm.
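
One way to see the distinction, as a minimal sketch rather than this repository's code: SAC's gradient updates are computed from a replay buffer of transitions collected by earlier versions of the policy, not only from fresh rollouts of the current policy.

import random

replay_buffer = []  # (state, action, reward, next_state, done) tuples

def store(transition):
    # Transitions remain in the buffer long after the policy that
    # produced them has been updated.
    replay_buffer.append(transition)

def sample_batch(batch_size=256):
    # Updates use this mixture of data from old behavior policies,
    # which is what makes the algorithm off-policy.
    return random.sample(replay_buffer, min(batch_size, len(replay_buffer)))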

Reward scale

Some reward-scale factors can generate instabilities, as described in #9.

To alleviate this issue, wouldn't it be a good idea to divide log_prob by reward_scale instead of multiplying the reward by it? Algorithmically speaking, I think this would have the same effect.
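
As a quick sanity check of that claim (a sketch, assuming the maximum-entropy objective with unit temperature and reward scale c > 0):

\sum_t \mathbb{E}\Big[ c\, r(s_t, a_t) + \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big] \;=\; c \sum_t \mathbb{E}\Big[ r(s_t, a_t) + \tfrac{1}{c}\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big].

Multiplying an objective by a positive constant does not change its maximizer, so a reward scale of c corresponds to a temperature of 1/c: the optimal policy is the same, although the learned Q and V values are scaled by c.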

Hyperparameter Advice

Hi Tuomas. I'm trying out your SAC implementation on some of the continuous gym environments and I'm curious if you have any recommendations for how to best tune the hyperparameters. Using the defaults and a temperature of 1, for instance, leads to some wildly oscillating policy performance on LunarLanderContinuous or InvertedPendulum. The policy may generate very good returns, then suddenly in the next entry in progress.csv terrible returns, and oscillates up and down without stabilizing. Does that suggest the temperature parameter needs to be tuned, or are some of the other default hyperparameters not ideal for these sorts of tasks?

An example of the episode return for lunar lander against samples:
[attached figure: lunarlandersac]

Thanks!

NNDiscriminatorFunction error

Hi,

I was able to install and run the sample SAC code. However, while executing python examples/mujoco_all_diayn.py --env=half-cheetah --log_dir=data/demo, I got the following errors:

value_function.py", line 50, in __init__ Parameterized.__init__(self) NameError: name 'Parameterized' is not defined

This was resolved by adding this import to value_function.py: from sandbox.rocky.tf.core.parameterized import Parameterized. However, I'm getting another error at this point:

  File "/private/home/sramakri/Projects/diayn/rllab/scripts/run_experiment_lite.py", line 137, in <module>
    run_experiment(sys.argv)
  File "/private/home/sramakri/Projects/diayn/rllab/scripts/run_experiment_lite.py", line 121, in run_experiment
    method_call(variant_data)
  File "examples/mujoco_all_diayn.py", line 221, in run_experiment
    num_skills=variant['num_skills'],
  File "/private/home/sramakri/Projects/diayn/sac/sac/value_functions/value_function.py", line 69, in __init__
    self._output_t = self.get_output_for(*self._input_pls)
  File "/private/home/sramakri/Projects/diayn/sac/sac/misc/mlp.py", line 179, in get_output_for
    output_nonlinearity=self._output_nonlinearity,
AttributeError: 'NNDiscriminatorFunction' object has no attribute '_output_nonlinearity'

I'm not sure how to resolve this error because self._output_nonlinearity is defined for the parent class MLPFunction but not the child class NNDiscriminatorFunction, where get_output_for is called.

Replication of Paper

Are the current hyperparameters the same ones that the paper reports? It seems like a lot of the trials on Half-Cheetah break 6000 within 200 episodes and 10000 by 1000 episodes, but I'm having a hard time recreating these results.

But, thanks for your code, it's very helpful!

Inquiries about the benchmark results

I checked the benchmark results provided on GitHub and tried to plot them. However, I noticed that they are quite different from the results in the paper. Why is that?

For example, the paper reports final performance of almost 6000 on the Ant environment, whereas the raw benchmark data shows only around 4000.

Thank you in advance for your time.

DIAYN result reproduction & additional charts

Hi!

I am currently trying to verify my DIAYN implementation, and I was wondering if there are any additional results available that are not provided in the original paper or on the website the paper links to. More specifically, I was wondering if there are equivalents of the Figure 2(c) (page 5, DIAYN) training-dynamics plots for HalfCheetah, Hopper, Ant, and other environments besides InvertedPendulum and MountainCar?

I know that verifying DIAYN goes way beyond just looking at training-dynamics metrics, as one must also determine whether the learned skills are actually diverse, but I think having the previously mentioned charts would be a great first step when testing for reproducibility.

Double Q for DIAYN

Hi,
Forgive me if this is already explained/implemented (part-time grad student, pretty new to this):

On reading through the DIAYN code and an initial reading of the paper, it seems not to use the double Q that is present in SAC. What is the reason for this?

I was also surprised that DIAYN seems to completely override the actor/critic training functions of SAC rather than extending them.

Docker Image won't create due to conda's environment.yml inconsistencies

When executing docker-compose up from my MacBook Pro, the Dockerfile would fail on step 26/35 due to unsatisfied dependencies for numpy=1.13.0 (blas, mkl, etc.).

so I made some modifications to make it work:
environment.yml

  • under the channels section I added - https://conda.anaconda.org/conda-forge/ as the last URL.

  • under dependencies I added - blas on top of the others, and removed - matplotlib since for some reason it froze the process.

Dockerfile

  • change the Anaconda2*.sh file to the latest Anaconda2*.sh

  • change:
    RUN conda env create -f /root/sac/environment.yml \ && conda env update
    to:
    RUN conda env create -n sac
    RUN conda env update
    RUN conda export -f /root/sac/environment.yml

I believe this will save someone time in the future, and I strongly suggest testing and committing these changes @haarnoja

for discrete env

I just read the DIAYN paper, and I can't understand how to train DIAYN in an environment with discrete actions, because SAC is for continuous environments. Yet in the paper, some experiments are based on MountainCar and the inverted pendulum. Thank you.

Network weights becoming NaN

When running with the default parameter settings (except 16 GMM components and 4 gradient steps per iteration) on the Ant domain with the multi-direction task, the network weights become NaN. Is there a learning rate or reward scale I should be using instead of the defaults?

could there be a problem with the initialization of GaussianPolicy?

GaussianPolicy inherits from NNPolicy, and at the end of the GaussianPolicy constructor (__init__) there is a call to the parent of NNPolicy...
Shouldn't it be:
super(GaussianPolicy, self).__init__(env_spec)

I observed the same pattern in latent_space_policy and gmm_policy as well (although there is a comment about it in gmm_policy).

So I'm not sure it's a mistake; it's probably my misunderstanding, but I'm sending it anyway.
Thanks for sharing the code :)
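
For what it's worth, a minimal illustration of the Python semantics involved (hypothetical class bodies, not the repository's actual ones): super(GaussianPolicy, self) starts the method-resolution-order walk just after GaussianPolicy, so NNPolicy.__init__ still runs, whereas naming NNPolicy inside super(...) skips it.

class Serializable(object):  # stand-in for rllab's Serializable base class
    def __init__(self):
        print("Serializable.__init__")

class NNPolicy(Serializable):
    def __init__(self, env_spec):
        print("NNPolicy.__init__")
        super(NNPolicy, self).__init__()

class GaussianPolicy(NNPolicy):
    def __init__(self, env_spec):
        # Runs NNPolicy.__init__ first, then Serializable.__init__:
        super(GaussianPolicy, self).__init__(env_spec)
        # By contrast, super(NNPolicy, self).__init__() here would jump
        # straight to Serializable and skip NNPolicy.__init__ entirely.

GaussianPolicy(env_spec=None)  # prints both __init__ messages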

Sparse Reward Environments

Did you happen to see SAC's performance on sparse-reward environments?

I know the DIAYN paper trained on sparse rewards, but I was wondering if vanilla SAC (in your experiments) had any luck solving things like continuous MountainCar.

TD3 vs SAC

Hi,
First, thanks for sharing the repo.
I am really confused by the performance comparison between SAC and TD3.
In TD3's results, TD3 beats SAC in every evaluated environment in terms of max average return after 1M timesteps (Table 1). However, in your SAC paper (Fig. 1), TD3 beats SAC in almost no environment.
Is this because of the different noise added in your experiments and theirs? Could you kindly provide some insight into this observation?

is chainer dep necessary?

With my cudnn7 install, the chainer dependency breaks the local install steps. Removing it from environment.yml doesn't seem to break anything.

Stupid issue

Hi

I implemented my own version of SAC, and the log probability of the policy sometimes went above 0 when using the version given in the paper.

According to what I read here (pg. 6), I think the squashing correction should be added, not subtracted, since the determinant of the Jacobian is multiplied when calculating the pdf.
But then this incentivizes the agent to just set actions to 1 to get a low log pi.

I am pretty sure I am missing something here. Can you please explain how you arrived at the squashing correction given in the paper?
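
For reference, a sketch of the standard change-of-variables argument for the tanh squashing (consistent with the paper's appendix, though written here from first principles): with u ~ mu(.|s) the pre-squash Gaussian sample and a = tanh(u),

\pi(a \mid s) = \mu(u \mid s) \left| \det \frac{\mathrm{d}a}{\mathrm{d}u} \right|^{-1}, \qquad \log \pi(a \mid s) = \log \mu(u \mid s) - \sum_{i=1}^{D} \log\big(1 - \tanh^2(u_i)\big).

The Jacobian determinant divides the density of u (it multiplies only when going the other way, from the density of a back to that of u), which is why the correction is subtracted in log space. Each factor 1 - \tanh^2(u_i) lies in (0, 1], so the subtracted term is non-negative and log pi(a|s) can legitimately exceed 0: a density on a bounded set is not a probability and may be larger than 1.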
