
rll / rllab


rllab is a framework for developing and evaluating reinforcement learning algorithms, fully compatible with OpenAI Gym.

License: Other

Python 88.13% Ruby 0.64% Mako 0.20% Shell 0.20% CSS 0.44% JavaScript 1.48% HTML 0.78% Jupyter Notebook 8.00% Dockerfile 0.12%

rllab's Introduction

rllab is no longer under active development, but an alliance of researchers from several universities has adopted it, and now maintains it under the name garage.

We recommend you develop new projects, and rebase old ones, onto the actively-maintained garage codebase, to promote reproducibility and code-sharing in RL research. The new codebase shares almost all of its code with rllab, so most conversions only need to edit package import paths and perhaps update some renamed functions.

garage is always looking for new users and contributors, so please consider contributing your rllab-based projects and improvements to the new codebase! Recent improvements include first-class support for TensorFlow, TensorBoard integration, new algorithms including PPO and DDPG, updated Docker images, new environment wrappers, many updated dependencies, and stability improvements.

Docs | Circle CI | License | Join the chat at https://gitter.im/rllab/rllab

rllab

rllab is a framework for developing and evaluating reinforcement learning algorithms. It includes a wide range of continuous control tasks plus implementations of a number of algorithms, including VPG (REINFORCE), TNPG, TRPO, and DDPG.

rllab is fully compatible with OpenAI Gym. See here for instructions and examples.

rllab only officially supports Python 3.5+. For an older snapshot of rllab based on Python 2, please use the py2 branch.

rllab comes with support for running reinforcement learning experiments on an EC2 cluster, and tools for visualizing the results. See the documentation for details.

The main modules use Theano as the underlying framework, and we have support for TensorFlow under sandbox/rocky/tf.

Documentation

Documentation is available online: https://rllab.readthedocs.org/en/latest/.

Citing rllab

If you use rllab for academic research, you are highly encouraged to cite the following paper:

Yan Duan, Xi Chen, Rein Houthooft, John Schulman, Pieter Abbeel. "Benchmarking Deep Reinforcement Learning for Continuous Control." Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.

Credits

rllab was originally developed by Rocky Duan (UC Berkeley / OpenAI), Peter Chen (UC Berkeley), Rein Houthooft (UC Berkeley / OpenAI), John Schulman (UC Berkeley / OpenAI), and Pieter Abbeel (UC Berkeley / OpenAI). The library continues to be jointly developed by people at OpenAI and UC Berkeley.

Slides

Slides presented at ICML 2016: https://www.dropbox.com/s/rqtpp1jv2jtzxeg/ICML2016_benchmarking_slides.pdf?dl=0

rllab's People

Contributors

alexbeloi, amansoni, bichengcao, breakend, coorsbenjamin, daniellsm, deams51, dementrock, florensacc, gunjanbaid, haarnoja, inksci, jbn, jcoreyes, kashif, katanachan, lchenat, neocxi, openai-sys-okta-integration, paulhendricks, rksltnl, singulaire, sytham, tigerneil, viktorm, vitchyr, yang-song, zhongwen


rllab's Issues

How to use uniform control policy?

I want to run my new task with random actions using uniform_control_policy to get a reference baseline, but I could not figure out which algo I should use. I tried to rewrite batch_plot but got many errors. Is there an elegant way to run my task with random actions?
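
One pattern that might work, as a rough sketch rather than an official recipe: skip the BatchPolopt algorithms entirely and call rllab's rollout() helper directly with the uniform control policy (the class and helper paths below are assumed from recent versions of the codebase):

from rllab.envs.box2d.cartpole_env import CartpoleEnv
from rllab.envs.normalized_env import normalize
from rllab.policies.uniform_control_policy import UniformControlPolicy
from rllab.sampler.utils import rollout

env = normalize(CartpoleEnv())
policy = UniformControlPolicy(env_spec=env.spec)  # samples actions uniformly at random

# Collect a batch of random rollouts and report the average return as a reference.
returns = []
for _ in range(100):
    path = rollout(env, policy, max_path_length=100)
    returns.append(path["rewards"].sum())
print("average return of the uniform policy:", sum(returns) / len(returns))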

Run "scripts/submit_gym.py" fails

When I ran examples/trpo_gym.py, I got the results. Then I ran scripts/submit_gym.py, but I got the following error:
raise error.Error("[%s] You didn't have any recorded training data in {}. Once you've used 'env.monitor.start(training_dir)' to start recording, you need to actually run some rollouts. Please join the community chat on https://gym.openai.com if you have any issues.".format(env_id, training_dir))
gym.error.Error: [%s] You didn't have any recorded training data in Pendulum-v0. Once you've used 'env.monitor.start(training_dir)' to start recording, you need to actually run some rollouts. Please join the community chat on https://gym.openai.com if you have any issues.

Why does this happen?

install without Anaconda

I don't use Anaconda. Is it possible to install rllab without it? I see the package dependencies in the environment.yml file, and wonder if a similar requirements file is available for a pip install.

UPDATE: never mind. I installed Anaconda and rllab works just fine. I guess this is a low-priority request.

Great project.

fail to install on Centos

It seems rllab only supports macOS and Ubuntu. How can I install it on a CentOS server?

./scripts/setup_linux.sh

#!/bin/bash
# Make sure that conda is available

hash conda 2>/dev/null || {
    echo "Please install anaconda before continuing. You can download it at https://www.continuum.io/downloads. Please use the Python 2.7 installer."
    exit 0
}

echo "Installing system dependencies"
echo "You will probably be asked for your sudo password."
sudo apt-get update
sudo apt-get install -y python-pip python-dev swig cmake build-essential
sudo apt-get build-dep -y python-pygame
sudo apt-get build-dep -y python-scipy

# Make sure that we're under the directory of the project
cd "$(dirname "$0")/.."

echo "Creating conda environment..."
conda env create -f environment.yml
conda env update

echo "Conda environment created! Make sure to run \`source activate rllab3\` whenever you open a new terminal and want to run programs under rllab."

trouble replicating trpo results

I tried to quickly reproduce the cartpole balancing results reported in the paper.

I took the examples/trpo_cartpole.py script and adjusted basic parameters to those given in the paper, keeping to default rllab objects otherwise (see below). Running this, I get a lifetime average summed reward of 3200, quite a bit outside the reported 4869.8 ± 37.6. Looks as if learning is somewhat unstable with the provided learning rate (see plot below).

Can you take a look whether I'm missing something?

from rllab.algos.trpo import TRPO
from rllab.baselines.linear_feature_baseline import LinearFeatureBaseline
from rllab.envs.box2d.cartpole_env import CartpoleEnv
from rllab.envs.normalized_env import normalize
from rllab.policies.gaussian_mlp_policy import GaussianMLPPolicy

env = normalize(CartpoleEnv())

policy = GaussianMLPPolicy(
    env_spec=env.spec,
    hidden_sizes=(100, 50, 25)  # main text, section 5
)

baseline = LinearFeatureBaseline(env_spec=env.spec)  # suppl. section 2

algo = TRPO(
    env=env,
    policy=policy,
    baseline=baseline,
    batch_size=50000,  # suppl. section 2, Table 2
    max_path_length=500, # ""
    n_itr=500, # ""
    discount=0.99, # ""
    step_size=0.05,  # suppl. section 2, table 4 (?)
)

algo.train()

[Plot omitted: reward per episode for the trpo_cartpole run]

Installer pyprind

Hello, I installed rllab following the instructions here.
I am on a MacBook Pro running macOS Sierra version 10.12

Unfortunately I still had issues running rllab, with some necessary packages not being installed.


Other colleagues have experienced similar problems, which can be resolved by manually installing the necessary packages, e.g. pip install pyprind theano lasagne.
However, we thought it best to raise this issue in case there are any bugs in the installer that should be fixed.

Question about results in the paper

Hi, I recently tried to reproduce the experimental results in your paper, and I found that some results differ somewhat from those reported. Did you use the default parameters for all algorithms when you ran the experiments?

CategoricalGRUPolicy InputLayer dimensionality

Hi,

I guess I'm a bit confused by this:

l_input = L.InputLayer(
                shape=(None, None, input_dim),
                name="input")

CategoricalGRUPolicy is a vectorized policy, so a VectorizedSampler is created; it creates 12 duplicate environments and samples from them. So assume the observation_space is (10); then what vec_env observes is shaped (12, 10): (n_env, observation_space.n).

But the InputLayer for CategoricalGRUPolicy is shaped (None, None, input_dim).
Where is the extra dimension coming from?


The observation_space is of Box type, and flatten_n in get_actions() -- flat_obs = self.observation_space.flatten_n(observations) -- doesn't add a dimension to it.
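
For context, my current reading (which may be wrong) is that the extra axis is a time dimension: recurrent policies consume whole trajectories shaped (batch, time, features) at training time, while per-step sampling data can be seen as length-1 sequences. A small numpy illustration of the shapes only (the numbers are hypothetical):

import numpy as np

n_env, obs_dim, horizon = 12, 10, 100

# Per-step observations from the vectorized sampler: one row per environment.
step_obs = np.zeros((n_env, obs_dim))            # (12, 10)

# What a (None, None, input_dim) InputLayer consumes at training time:
# whole trajectories, with an explicit time axis in the middle.
traj_obs = np.zeros((n_env, horizon, obs_dim))   # (12, 100, 10)

# A single step can still be fed through a 3-D input as a length-1 sequence.
step_as_seq = step_obs[:, np.newaxis, :]
print(step_as_seq.shape)                         # (12, 1, 10)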

run trpo_cartpole fails on Mac

rllab works fine on the CentOS server, but it does not work on my Mac. When I ran trpo_cartpole.py in examples, it got stuck here and made no progress:

python trpo_cartpole.py

/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/tensor/signal/downsample.py:6: UserWarning: downsample module has been moved to the theano.tensor.signal.pool module.
"downsample module has been moved to the theano.tensor.signal.pool module.")

I press Ctrl + C and it shows:

^CTraceback (most recent call last):
File "trpo_cartpole.py", line 12, in
hidden_sizes=(32, 32)
File "/Users/lchenat/Desktop/FYT/rllab/rllab/policies/gaussian_mlp_policy.py", line 115, in init
outputs=[mean_var, log_std_var],
File "/Users/lchenat/Desktop/FYT/rllab/rllab/misc/ext.py", line 135, in compile_function
**kwargs
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/compile/function.py", line 320, in function
output_keys=output_keys)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/compile/pfunc.py", line 479, in pfunc
output_keys=output_keys)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/compile/function_module.py", line 1776, in orig_function
output_keys=output_keys).create(
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/compile/function_module.py", line 1456, in init
optimizer_profile = optimizer(fgraph)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/gof/opt.py", line 101, in call
return self.optimize(fgraph)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/gof/opt.py", line 89, in optimize
ret = self.apply(fgraph, *args, **kwargs)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/gof/opt.py", line 230, in apply
sub_prof = optimizer.optimize(fgraph)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/gof/opt.py", line 89, in optimize
ret = self.apply(fgraph, *args, **kwargs)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/gof/opt.py", line 2223, in apply
sub_prof = gopt.apply(fgraph)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/gof/opt.py", line 1879, in apply
nb += self.process_node(fgraph, node)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/gof/opt.py", line 1772, in process_node
replacements = lopt.transform(node)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/tensor/opt.py", line 5825, in constant_folding
no_recycling=[])
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/gof/op.py", line 970, in make_thunk
no_recycling)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/gof/op.py", line 879, in make_c_thunk
output_storage=node_output_storage)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/gof/cc.py", line 1200, in make_thunk
keep_lock=keep_lock)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/gof/cc.py", line 1143, in compile
keep_lock=keep_lock)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/gof/cc.py", line 1595, in cthunk_factory
key=key, lnk=self, keep_lock=keep_lock)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/gof/cmodule.py", line 1142, in module_from_key
module = lnk.compile_cmodule(location)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/gof/cc.py", line 1506, in compile_cmodule
preargs=preargs)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/gof/cmodule.py", line 2182, in compile_str
p_out = output_subprocess_Popen(cmd)
File "/Users/lchenat/anaconda2/lib/python2.7/site-packages/theano/misc/windows.py", line 78, in output_subprocess_Popen
out = p.communicate()
File "/Users/lchenat/anaconda2/lib/python2.7/subprocess.py", line 800, in communicate
return self._communicate(input)
File "/Users/lchenat/anaconda2/lib/python2.7/subprocess.py", line 1417, in _communicate
stdout, stderr = self._communicate_with_poll(input)
File "/Users/lchenat/anaconda2/lib/python2.7/subprocess.py", line 1471, in _communicate_with_poll
ready = poller.poll()
KeyboardInterrupt

I have already upgraded my rllab to the latest version.

Tried to reset environment which is not done. While the monitor is active ...

I'm using the latest GitHub version of gym, and also the latest rllab, and I run the following code:

from rllab.algos.trpo import TRPO
from rllab.baselines.linear_feature_baseline import LinearFeatureBaseline
from rllab.envs.gym_env import GymEnv
from rllab.envs.normalized_env import normalize
from rllab.misc.instrument import stub, run_experiment_lite
from rllab.policies.categorical_mlp_policy import CategoricalMLPPolicy

stub(globals())

# env = normalize(GymEnv("Point-v0"))
env = normalize(GymEnv("CartPole-v0"))

policy = CategoricalMLPPolicy(
    env_spec=env.spec,
)

baseline = LinearFeatureBaseline(env_spec=env.spec)

algo = TRPO(
    env=env,
    policy=policy,
    baseline=baseline,
    batch_size=4000,
    whole_paths=True,
    max_path_length=100,
    n_itr=1,
    discount=0.99,
    step_size=0.01,
)

run_experiment_lite(
    algo.train(),
    # Number of parallel workers for sampling
    n_parallel=1,
    # Only keep the snapshot parameters for the last iteration
    snapshot_mode="last",
    seed=1,
    # plot=True,
)

Then I sometimes see these errors:

In [35]: algo = rl_prac.train_point()
python /home/weiliu/packages/rllab/scripts/run_experiment_lite.py  --log_dir '/home/weiliu/packages/rllab/data/local/experiment/experiment_2016_11_11_12_21_32_0019'  --use_cloudpickle 'False'  --snapshot_mode 'last'  --args_data 'gANjcmxsYWIubWlzYy5pbnN0cnVtZW50ClN0dWJNZXRob2RDYWxsCnEAKYFxAX1xAihYBgAAAF9fYXJnc3EDKGNybGxhYi5taXNjLmluc3RydW1lbnQKU3R1Yk9iamVjdApxBCmBcQV9cQYoWAYAAABrd2FyZ3NxB31xCChYBQAAAG5faXRycQlLAVgJAAAAc3RlcF9zaXplcQpHP4R64UeuFHtYCAAAAGJhc2VsaW5lcQtoBCmBcQx9cQ0oaAd9cQ5YCAAAAGVudl9zcGVjcQ9jcmxsYWIubWlzYy5pbnN0cnVtZW50ClN0dWJBdHRyCnEQKYFxEX1xEihYBAAAAF9vYmpxE2gEKYFxFH1xFShoB31xFlgDAAAAZW52cRdoBCmBcRh9cRkoaAd9cRpYCAAAAGVudl9uYW1lcRtYCwAAAENhcnRQb2xlLXYwcRxzWAsAAABwcm94eV9jbGFzc3EdY3JsbGFiLmVudnMuZ3ltX2VudgpHeW1FbnYKcR5YBAAAAGFyZ3NxHyl1YnNoHWNybGxhYi5lbnZzLm5vcm1hbGl6ZWRfZW52Ck5vcm1hbGl6ZWRFbnYKcSBoHyl1YlgKAAAAX2F0dHJfbmFtZXEhWAQAAABzcGVjcSJ1YnNoHWNybGxhYi5iYXNlbGluZXMubGluZWFyX2ZlYXR1cmVfYmFzZWxpbmUKTGluZWFyRmVhdHVyZUJhc2VsaW5lCnEjaB8pdWJYCgAAAGJhdGNoX3NpemVxJE2gD1gIAAAAZGlzY291bnRxJUc/764UeuFHrlgGAAAAcG9saWN5cSZoBCmBcSd9cSgoaAd9cSloD2gQKYFxKn1xKyhoE2gUaCFoInVic2gdY3JsbGFiLnBvbGljaWVzLmNhdGVnb3JpY2FsX21scF9wb2xpY3kKQ2F0ZWdvcmljYWxNTFBQb2xpY3kKcSxoHyl1YlgPAAAAbWF4X3BhdGhfbGVuZ3RocS1LZFgLAAAAd2hvbGVfcGF0aHNxLohoF2gUdWgdY3JsbGFiLmFsZ29zLnRycG8KVFJQTwpxL2gfKXViWAUAAAB0cmFpbnEwKX1xMXRxMlgIAAAAX19rd2FyZ3NxM31xNHViLg=='  --seed '1'  --n_parallel '1'  --exp_name 'experiment_2016_11_11_12_21_32_0019'
/home/weiliu/anaconda3/envs/rllab3/lib/python3.5/site-packages/theano/tensor/signal/downsample.py:6: UserWarning: downsample module has been moved to the theano.tensor.signal.pool module.
  "downsample module has been moved to the theano.tensor.signal.pool module.")
using seed 1
2016-11-11 12:47:49.070577 EST | Setting seed to 1
using seed 1
[2016-11-11 12:47:49,122] Making new env: CartPole-v0
2016-11-11 12:47:49.800481 EST | [experiment_2016_11_11_12_21_32_0019] Populating workers...
2016-11-11 12:47:49.800802 EST | [experiment_2016_11_11_12_21_32_0019] Populated
0%                          100%
[                              ][2016-11-11 12:47:49,853] Starting new video recorder writing to /home/weiliu/packages/rllab/data/local/experiment/experiment_2016_11_11_12_21_32_0019/gym_log/openaigym.video.0.29157.video000000.mp4
[2016-11-11 12:47:50,883] Starting new video recorder writing to /home/weiliu/packages/rllab/data/local/experiment/experiment_2016_11_11_12_21_32_0019/gym_log/openaigym.video.0.29157.video000001.mp4
[#                             ] | ETA: 00:01:05[2016-11-11 12:47:52,234] Starting new video recorder writing to /home/weiliu/packages/rllab/data/local/experiment/experiment_2016_11_11_12_21_32_0019/gym_log/openaigym.video.0.29157.video000008.mp4
[#####                         ] | ETA: 00:00:15[2016-11-11 12:47:52,944] Starting new video recorder writing to /home/weiliu/packages/rllab/data/local/experiment/experiment_2016_11_11_12_21_32_0019/gym_log/openaigym.video.0.29157.video000027.mp4
[######                        ] | ETA: 00:00:18Traceback (most recent call last):
  File "/home/weiliu/packages/rllab/scripts/run_experiment_lite.py", line 137, in <module>
    run_experiment(sys.argv)
  File "/home/weiliu/packages/rllab/scripts/run_experiment_lite.py", line 124, in run_experiment
    maybe_iter = concretize(data)
  File "/home/weiliu/packages/rllab/rllab/misc/instrument.py", line 1213, in concretize
    return method(*args, **kwargs)
  File "/home/weiliu/packages/rllab/rllab/algos/batch_polopt.py", line 120, in train
    paths = self.sampler.obtain_samples(itr)
  File "/home/weiliu/packages/rllab/rllab/algos/batch_polopt.py", line 28, in obtain_samples
    scope=self.algo.scope,
  File "/home/weiliu/packages/rllab/rllab/sampler/parallel_sampler.py", line 125, in sample_paths
    show_prog_bar=True
  File "/home/weiliu/packages/rllab/rllab/sampler/stateful_pool.py", line 150, in run_collect
    result, inc = collect_once(self.G, *args)
  File "/home/weiliu/packages/rllab/rllab/sampler/parallel_sampler.py", line 94, in _worker_collect_one_path
    path = rollout(G.env, G.policy, max_path_length)
  File "/home/weiliu/packages/rllab/rllab/sampler/utils.py", line 12, in rollout
    o = env.reset()
  File "/home/weiliu/packages/rllab/rllab/envs/normalized_env.py", line 52, in reset
    ret = self._wrapped_env.reset()
  File "/home/weiliu/packages/rllab/rllab/envs/gym_env.py", line 92, in reset
    return self.env.reset()
  File "/home/weiliu/packages/gym/gym/core.py", line 139, in reset
    self.monitor._before_reset()
  File "/home/weiliu/packages/gym/gym/monitoring/monitor.py", line 271, in _before_reset
    self.stats_recorder.before_reset()
  File "/home/weiliu/packages/gym/gym/monitoring/stats_recorder.py", line 65, in before_reset
    raise error.Error("Tried to reset environment which is not done. While the monitor is active for {}, you cannot call reset() unless the episode is over.".format(self.env_id))
gym.error.Error: Tried to reset environment which is not done. While the monitor is active for CartPole-v0, you cannot call reset() unless the episode is over.

If you suspect this is an IPython bug, please report it at:
    https://github.com/ipython/ipython/issues
or send an email to the mailing list at [email protected]

You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.

Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
    %config Application.verbose_crash=True

I don't recall seeing this error before, back when I used 'pip install gym'...

Writing an Environment

Hi,

Background: I am a rising freshman undergraduate, and I am doing a research internship this summer on reinforcement learning and its application to crystallography. Specifically, I hope to write an environment compatible with DDPG for conducting experiments, so that after training on thousands of episodes the machine can accurately calculate a "UB matrix" using the fewest measurements.

Just one question:

  1. I noticed that your algorithms include fixed action (control) types. For training, actions are usually just selecting a reflection -- possible reflections are calculated and stored in an array, and the algorithm simply chooses one. However, if a peak is not found (if the observation is below a "background" threshold), then the action becomes a scan across two dimensions: the program selects the coordinates at which to measure, and the coordinates are chosen from a continuum. Is such an action selection scheme possible in your framework, and how should I write it? (A rough environment sketch follows below.)

Thanks!!
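
Not an official answer, but a minimal sketch of a custom environment following rllab's Env interface (rllab.envs.base.Env / Step and rllab.spaces.Box), assuming the continuous "scan" action is modeled as a 2-D Box; the dynamics and reward below are placeholders, not crystallography code:

import numpy as np
from rllab.envs.base import Env, Step
from rllab.spaces import Box


class ScanEnv(Env):
    # Toy environment whose action is a continuous 2-D scan coordinate.

    @property
    def observation_space(self):
        return Box(low=-np.inf, high=np.inf, shape=(4,))

    @property
    def action_space(self):
        # Coordinates to measure at, chosen from a continuum in [-1, 1]^2.
        return Box(low=-1.0, high=1.0, shape=(2,))

    def reset(self):
        self._state = np.zeros(4)
        return np.copy(self._state)

    def step(self, action):
        # Placeholder: pretend measuring at `action` updates the state and
        # yields a reward; replace with the real measurement model.
        self._state[:2] = np.clip(action, -1.0, 1.0)
        reward = -float(np.sum(np.square(action)))
        done = False
        return Step(observation=np.copy(self._state), reward=reward, done=done)

The discrete "select a reflection" case could instead use rllab.spaces.Discrete; mixing discrete and continuous parts in one action space is not directly supported, so one workaround is to encode the choice continuously or to split the task across two policies.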

Minor documentation issue with vanilla policy gradients

There seems to be a minor bias/variance typo. In the docs on vanilla policy gradients, it says:

When viewing the discount factor as a variance reduction factor for the undiscounted objective, this alternative gradient estimator has less bias, at the expense of having a larger variance

It seems like it should be the reverse: reducing variance but at the expense of larger bias. For instance in the OpenAI docs it says:

If the trajectories are very long (i.e., T is high), then the preceding formula will have excessive variance. Thus people generally use a discount factor, which reduces variance at the cost of some bias. The biased policy gradient formula is [...]

Though to be honest, I have very little intuition for how to tell which estimators have lower variance. I interpret the smaller variance compared to the undiscounted objective as coming from how the discounted version shrinks the advantage values (where "advantage" is taken to mean anything that gets multiplied with the grad-log probability of the policy). Intuitively, we would want advantage values that are smaller in magnitude...

The other thing that may not be totally clear is why the gradient in the vanilla policy gradient docs has that extra 1/T term, since we want the expectation over the sum of T terms, right? The 1/N is understandable because we have N trajectories and take the average. I guess the 1/T gets absorbed into the constant when doing gradient updates?
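
For reference, the estimator under discussion is roughly the following (N trajectories of length T, baseline b, discount gamma); this is my own transcription, not the docs' exact notation:

\hat{g} \;=\; \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta\!\left(a^i_t \mid s^i_t\right) \left( \sum_{t'=t}^{T-1} \gamma^{\,t'-t}\, r^i_{t'} \;-\; b\!\left(s^i_t\right) \right)

Viewed as an estimator of the undiscounted objective's gradient, introducing gamma < 1 shrinks the long-horizon return terms, which lowers variance but introduces bias (consistent with the OpenAI wording quoted above); and the 1/T factor is just a constant rescaling of the gradient, typically absorbed into the step size.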

Unable to import mujoco envs

Hi. I tried to import a MuJoCo environment from the master branch, and I get an OSError. I have MuJoCo set up already and can create the environment from the OpenAI Gym bundle of mujoco-py.

The following two sets of environment creation work.

import gym
gym_env = gym.make('Hopper-v1')
from rllab.envs.gym_env import GymEnv
rllab_env1 = GymEnv("Hopper-v1")

However, this doesn't work:

import rllab.mujoco_py
from rllab.envs.mujoco.hopper_env import HopperEnv

Error message:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-5-88221ef8d105> in <module>()
----> 1 import rllab.mujoco_py
      2 from rllab.envs.mujoco.hopper_env import HopperEnv

/home/aravind/Programs/rllab/rllab/mujoco_py/__init__.py in <module>()
----> 1 from .mjviewer import MjViewer
      2 from .mjcore import MjModel
      3 from .mjcore import register_license
      4 import os
      5 from mjconstants import *

/home/aravind/Programs/rllab/rllab/mujoco_py/mjviewer.py in <module>()
----> 1 import glfw
      2 from mjlib import mjlib
      3 from ctypes import pointer, byref
      4 import ctypes
      5 import mjcore

/home/aravind/Programs/rllab/rllab/mujoco_py/glfw.py in <module>()
    134 
    135 
--> 136 _glfw = _load_library()
    137 if _glfw is None:
    138     raise ImportError("Failed to load GLFW3 shared library.")

/home/aravind/Programs/rllab/rllab/mujoco_py/glfw.py in _load_library()
     76     else:
     77         raise RuntimeError("unrecognized platform %s"%sys.platform)
---> 78     return ctypes.CDLL(libfile)
     79 
     80 

/home/aravind/anaconda2/envs/rllab/lib/python2.7/ctypes/__init__.pyc in __init__(self, name, mode, handle, use_errno, use_last_error)
    363 
    364         if handle is None:
--> 365             self._handle = _dlopen(self._name, mode)
    366         else:
    367             self._handle = handle

OSError: /home/aravind/Programs/rllab/vendor/mujoco/libglfw.so.3: cannot open shared object file: No such file or directory

It's not an issue with IPython, since I get the same error outside of it as well. Thanks :)

Difference in performance between normal and stubbed modes

Hi,

I'm noticing a considerable difference in performance between the standard mode of training using algo.train() and using run_experiment_lite along with the stubbed methods. What are possible reasons, and how can performance be improved when using the vanilla algo.train() method?

By performance, I mean both the speed and the learning curve (rate of learning). All hyperparameters are the same between the two experiments I tried.

As I understand it, stubbed mode doesn't offer as much flexibility for doing auxiliary tasks beyond just training a policy. Hence, the normal mode seems more user-friendly and general. Is this understanding accurate, or am I missing something? Is there more detailed documentation than what's available at https://rllab.readthedocs.io/en/latest/?

Thanks!

Conjugate Gradient Optimization sometimes fails (with NaN parameters)

In some of my experiments I sometimes get NaN parameters when training with the TRPO and TNPG algorithms. This led me to the file containing ConjugateGradientOptimizer, where it appears that under some circumstances the value passed to np.sqrt in lines 168-170 is negative (specifically, descent_direction.dot(Hx(descent_direction)) is negative); this value defines initial_step_size, which is then set to NaN.

Is there any citation available for this initial step size?

The variable naming in descent_direction.dot(Hx(descent_direction)) suggests that this is an inner product with respect to a Hessian (which would be positive semi-definite), but I'm not sure that's the case.
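
For what it's worth, the usual derivation of that initial step size (as in the TRPO paper) comes from scaling the CG descent direction d so that the quadratic KL constraint with radius delta is met with equality:

\tfrac{1}{2}\,\beta^{2}\, d^{\top} H d \;=\; \delta \quad\Longrightarrow\quad \beta \;=\; \sqrt{\frac{2\delta}{\,d^{\top} H d\,}}

Here H is the Fisher-information (KL Hessian) approximation evaluated through Hessian-vector products, so d^T H d should be nonnegative in exact arithmetic; negative values presumably come from sampling noise or conjugate-gradient/sub-sampling error rather than from the formula itself.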

DDPG on-policy?

Hi,

In ddpg.py, I assume you're following this paper by Silver et al. If so, your algorithm doesn't seem to mirror theirs. Here, you are going on-policy for the actor. But DDPG is an off-policy approach, due to the exploration noise being added (which I can't seem to find in your code). If it's completely deterministic, there is no scope for exploration. Unless you're following a stochastic policy, in which case, doesn't that defeat the purpose of DDPG?

Am I missing something?

Running trpo_cartpole.py with adaptive_std=True gives a ValueError

There is an adaptive_std parameter in https://github.com/rllab/rllab/blob/master/rllab/policies/gaussian_mlp_policy.py. Its goal is to create a neural network architecture for learning the std. I defined a policy with the adaptive_std parameter set to True, as in the following code, and ran it.

from rllab.algos.trpo import TRPO
from rllab.baselines.linear_feature_baseline import LinearFeatureBaseline
from rllab.envs.box2d.cartpole_env import CartpoleEnv
from rllab.envs.normalized_env import normalize
from rllab.policies.gaussian_mlp_policy import GaussianMLPPolicy
env = normalize(CartpoleEnv())
policy = GaussianMLPPolicy(
    env_spec=env.spec,
    # The neural network policy should have two hidden layers, each with 32 hidden units.
    hidden_sizes=(32, 32),
    adaptive_std=True
)
baseline = LinearFeatureBaseline(env_spec=env.spec)
algo = TRPO(
    env=env,
    policy=policy,
    baseline=baseline,
    batch_size=4000,
    max_path_length=100,
    n_itr=40,
    discount=0.99,
    step_size=0.01,
)
algo.train()

I got the following error.

2016-05-09 16:28:47.563510 PDT | Populating workers...
2016-05-09 16:28:47.675703 PDT | Populated
Traceback (most recent call last):
File "/home/drl/rllab/examples/trpo_cartpole.py", line 28, in
algo.train()
File "/home/drl/rllab/rllab/algos/batch_polopt.py", line 83, in train
self.init_opt()
File "/home/drl/rllab/rllab/algos/npo.py", line 69, in init_opt
dist_info_vars = self.policy.dist_info_sym(obs_var, state_info_vars)
File "/home/drl/rllab/rllab/policies/gaussian_mlp_policy.py", line 113, in dist_info_sym
mean_var, log_std_var = L.get_output([self._l_mean, self._l_log_std], obs_var)
File "/home/drl/anaconda2/envs/rllab/lib/python2.7/site-packages/lasagne/layers/helper.py", line 169, in get_output
raise ValueError("get_output() was called with a single input "
ValueError: get_output() was called with a single input expression on a network with multiple input layers. Please call it with a dictionary of input expressions instead.

It would be really helpful if you could explain the reason behind this error.

Can CategoricalGRUPolicy take an initial hidden state?

Hi,

I'm working with the sandbox/rocky/tf/policies/ directory

and as I recall, the GRU policies must have an initial hidden state (normally initialized randomly). However, is it possible for me to pass in the initial hidden state?

The algorithm I'm trying to run is VPG.

NameError: name 'instrument2' is not defined

rllab/misc/logger.py:256 has the following line:

    from rllab.misc import instrument2

I'm getting a NameError error at this statement. It looks like this was introduced in this recent commit. It doesn't look like any of the branches have a rllab/misc/instrument2.py module. Was this file left out of this commit, or am I missing something?

More complex neural nets

Hello,

I did not find any "non-hackish" way to implement custom complex Neural Net architectures in RLLab.
For example, it could be interesting to be able to do the following:

  • Use convolutional neural nets to process images as input (there is already a class for that in rllab, but GaussianMLPPolicy does not offer this option, for example)
  • These CNNs may need pre-training (as auto-encoders, for example), and should not be completely encapsulated inside the policy

I started implementing a draft of this for the Gaussian MLP Policy here:
TheBirdie@9d623ee
It does not require deep changes in the policy code, and yet allows more flexibility.

A second step could be to have more flexibility regarding the policy input (observation) shape.

For example, one may want to observe several kinds of data, with different processing for each:

  • An image, processed with a CNN
  • Other data, processed with classical fully connected neural nets

However, I did not find any way to split the inputs nicely. Currently, the policy receives a flattened array with all the information concatenated.
Is that something that sounds interesting?

Thank you

problem of vpg

I only encountered this problem recently. When I tried to run a script that I had run several times before, I suddenly dropped into the debugger:

2016-10-09 16:05:12.691857 HKT | [col_vpg_1] itr #286 | fitting baseline...
2016-10-09 16:05:25.779144 HKT | [col_vpg_1] itr #286 | fitted
2016-10-09 16:05:25.789398 HKT | [col_vpg_1] itr #286 | optimizing policy
2016-10-09 16:05:26.128741 HKT | [col_vpg_1] itr #286 | saving snapshot...
2016-10-09 16:05:26.134122 HKT | [col_vpg_1] itr #286 | saved
2016-10-09 16:05:26.135219 HKT | ----------------------- ----------------
2016-10-09 16:05:26.135341 HKT | Iteration 286
2016-10-09 16:05:26.135428 HKT | AverageDiscountedReturn -0.169555864807
2016-10-09 16:05:26.135505 HKT | AverageReturn -4.70588235294
2016-10-09 16:05:26.135584 HKT | ExplainedVariance [-1.09076318]
2016-10-09 16:05:26.135662 HKT | NumTrajs 17
2016-10-09 16:05:26.135740 HKT | Entropy 1.39721952261
2016-10-09 16:05:26.135813 HKT | Perplexity 4.04394023599
2016-10-09 16:05:26.135893 HKT | StdReturn 23.3584968943
2016-10-09 16:05:26.135972 HKT | MaxReturn 4.0
2016-10-09 16:05:26.136051 HKT | MinReturn -98.0
2016-10-09 16:05:26.136132 HKT | AveragePolicyStd 0.978515148814
2016-10-09 16:05:26.136230 HKT | LossBefore 0.00863600655567
2016-10-09 16:05:26.136317 HKT | LossAfter 0.00624103735871
2016-10-09 16:05:26.136399 HKT | MeanKL 0.00021358694111
2016-10-09 16:05:26.136477 HKT | MaxKL 0.0413081072869
2016-10-09 16:05:26.136556 HKT | ----------------------- ----------------
0% 100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:00

/home/data/lchenat/rllab-master/rllab/misc/special.py(53)explained_variance_1d()
51 if abs(1 - np.var(y - ypred) / (vary + 1e-8)) > 1e5:
52 import ipdb; ipdb.set_trace()
---> 53 return 1 - np.var(y - ypred) / (vary + 1e-8)
54
55

ipdb>

And I found that these lines of code have been deleted in the GitHub version. Does this mean something is wrong with the computation?

Example in documentation appears incorrect

Hello. I've just started using rllab and went through the documentation today. The example of REINFORCE given here appears to be incorrect, or I don't fully understand the implementation.
https://rllab.readthedocs.io/en/latest/user/implement_algo_basic.html

Under the heading Constructing the computational graph, a surrogate function is defined and its gradient taken. However, isn't the surrogate function itself the gradient? That is what the pseudocode given in the Preliminary section suggests. Why do we need to compute the gradient again?

Kindly update the documentation or clarify. Thanks!
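
My reading of why the docs build a surrogate and then differentiate it (take this as a guess at the intent, not an official answer): the surrogate is a scalar loss whose gradient, with the advantages treated as constants, equals the REINFORCE gradient, which lets Theano's automatic differentiation and generic optimizers be reused:

L(\theta) \;=\; \frac{1}{N}\sum_{i}\sum_{t} \log \pi_\theta\!\left(a^i_t \mid s^i_t\right) A^i_t \qquad\Longrightarrow\qquad \nabla_\theta L(\theta) \;=\; \frac{1}{N}\sum_{i}\sum_{t} \nabla_\theta \log \pi_\theta\!\left(a^i_t \mid s^i_t\right) A^i_t

So the pseudocode's "gradient" and the docs' "gradient of the surrogate" coincide; the surrogate itself is not the gradient, it is an objective constructed so that autodiff produces the desired gradient.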

New BatchPolopt won't run

I tried to run a BatchPolopt-based algorithm, and it hung on this line. This seems to be a typo or something, but this sampler class is new enough that I'm not sure exactly what is supposed to happen here.

run trpo_cartpole_stub fails


When I ran trpo_cartpole_stub.py, I got nothing except "using seed 1, using seed 1". My environment is Ubuntu 14.04, GPU: GTX 970, CUDA 7.5.

specifying initial weights to mlp policy

Is there a way to expose the weight initialization for the MLP in the policies that use it (e.g. GaussianMLPPolicy)?

I have weights that have been trained in a supervised setting with which I would like to initialize a policy.
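
A rough sketch of one possible workaround, assuming the policy exposes get_param_values()/set_param_values() on a flat parameter vector (as rllab's Parameterized objects do) and that your pretrained weights can be flattened in the same layer order; the pretrained vector below is hypothetical:

import numpy as np
from rllab.envs.box2d.cartpole_env import CartpoleEnv
from rllab.envs.normalized_env import normalize
from rllab.policies.gaussian_mlp_policy import GaussianMLPPolicy

env = normalize(CartpoleEnv())
policy = GaussianMLPPolicy(env_spec=env.spec, hidden_sizes=(32, 32))

# Flat vector of the policy's current (randomly initialized) parameters.
flat_params = policy.get_param_values()
print(flat_params.shape)

# Hypothetical pretrained weights, flattened to the same length and ordering.
pretrained = np.zeros_like(flat_params)
policy.set_param_values(pretrained)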

Port "Ant-Maze" environment to OpenAI Gym

I'm quite interested in the "Ant-Maze" environments; however, I didn't find them in OpenAI Gym. I'd like to use the environment as a gym.Env. Are there any suggestions?

trpo_gym.py fails when trying CartPole-v0 on mac

I am trying to run the OpenAI gym example on a different environment than Pendulum-v0. However, I keep getting this error:

  raise AttributeError('Cannot get attribute %s from %s' % (item, self.proxy_class))
AttributeError: Cannot get attribute __stub_cache from <class 'rllab.policies.gaussian_mlp_policy.GaussianMLPPolicy'>

Any idea how I can fix this?

Problem running OpenAI Gym environments with image observations

When trying to run experiments with an OpenAI Gym environment that returns image observations (e.g. CarRacing-v0), using the same configuration as the ddpg example and only changing the environment, there are a number of places where execution will break in either the policy, algorithm, or q function networks. The code seems to assume observations will be a flattened vector, whereas the CarRacing environment returns image arrays of the shape (96, 96, 3).

This does not seem to happen with e.g. the trpo_gym example, which builds its network using the MLP class.

anaconda

Any plans to make this work without Anaconda?

using the first_order_optimizer with TRPO gives error

I tried to run the cart-pole experiment with the Adam update. The code is as follows:

from rllab.algos.trpo import TRPO
from rllab.baselines.linear_feature_baseline import LinearFeatureBaseline
from rllab.envs.box2d.cartpole_env import CartpoleEnv
from rllab.envs.normalized_env import normalize
from rllab.policies.gaussian_gru_policy import GaussianGRUPolicy
from rllab.policies.gaussian_mlp_policy import GaussianMLPPolicy
from rllab.optimizers.first_order_optimizer import FirstOrderOptimizer
env = normalize(CartpoleEnv())
policy = GaussianMLPPolicy(
    env_spec=env.spec,
    adaptive_std=True,
    # The neural network policy should have two hidden layers, each with 32 hidden units.
)
baseline = LinearFeatureBaseline(env_spec=env.spec)
algo = TRPO(
    env=env,
    policy=policy,
    baseline=baseline,
    batch_size=4000,
    max_path_length=100,
    n_itr=200,
    discount=0.99,
    step_size=0.01,
    optimizer=FirstOrderOptimizer(),
)
algo.train()

However, I was not able to run the code and I got the following error:

Traceback (most recent call last):
File "/home/drl/rllab/examples/trpo_cartpole.py", line 30, in
algo.train()
File "/home/drl/rllab/rllab/algos/batch_polopt.py", line 95, in train
self.optimize_policy(itr, samples_data)
File "/home/drl/rllab/rllab/algos/npo.py", line 110, in optimize_policy
mean_kl = self.optimizer.constraint_val(all_input_values)
AttributeError: 'FirstOrderOptimizer' object has no attribute 'constraint_val'.

I found out that ConjugateGradientOptimizer has this attribute, but I got a different error when I put the constraint_val function onto the FirstOrderOptimizer class.

I would appreciate it if you could tell me what the purpose of this constraint_val function is in the optimizer call.

Thank you

Error while running trpo_gym.py

Hi,

I am trying to run examples/trpo_gym.py on a remote server. I get the following error log:


Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN Version is too old. Update to v5, was 4004.)
python /home/iitm/sahil/rllab/scripts/run_experiment_lite.py  --n_parallel '1'  --snapshot_mode 'last'  --exp_name 'experiment_2016_09_29_02_46_48_0001'  --seed '1'  --log_dir '/home/iitm/sahil/rllab/data/local/experiment/experiment_2016_09_29_02_46_48_0001'  --args_data 'Y2NvcHlfcmVnCl9yZWNvbnN0cnVjdG9yCnAxCihjcmxsYWIubWlzYy5pbnN0cnVtZW50ClN0dWJNZXRob2RDYWxsCnAyCmNfX2J1aWx0aW5fXwpvYmplY3QKcDMKTnRScDQKKGRwNQpTJ19fYXJncycKcDYKKGcxCihjcmxsYWIubWlzYy5pbnN0cnVtZW50ClN0dWJPYmplY3QKcDcKZzMKTnRScDgKKGRwOQpTJ2FyZ3MnCnAxMAoodHNTJ3Byb3h5X2NsYXNzJwpwMTEKY3JsbGFiLmFsZ29zLnRycG8KVFJQTwpwMTIKc1Mna3dhcmdzJwpwMTMKKGRwMTQKUydiYXNlbGluZScKcDE1CmcxCihnNwpnMwpOdFJwMTYKKGRwMTcKZzEwCih0c2cxMQpjcmxsYWIuYmFzZWxpbmVzLmxpbmVhcl9mZWF0dXJlX2Jhc2VsaW5lCkxpbmVhckZlYXR1cmVCYXNlbGluZQpwMTgKc2cxMwooZHAxOQpTJ2Vudl9zcGVjJwpwMjAKZzEKKGNybGxhYi5taXNjLmluc3RydW1lbnQKU3R1YkF0dHIKcDIxCmczCk50UnAyMgooZHAyMwpTJ19vYmonCnAyNApnMQooZzcKZzMKTnRScDI1CihkcDI2CmcxMAoodHNnMTEKY3JsbGFiLmVudnMubm9ybWFsaXplZF9lbnYKTm9ybWFsaXplZEVudgpwMjcKc2cxMwooZHAyOApTJ2VudicKcDI5CmcxCihnNwpnMwpOdFJwMzAKKGRwMzEKZzEwCih0c2cxMQpjcmxsYWIuZW52cy5neW1fZW52Ckd5bUVudgpwMzIKc2cxMwooZHAzMwpTJ2Vudl9uYW1lJwpwMzQKUydJbnZlcnRlZFBlbmR1bHVtLXYxJwpwMzUKc3Nic3Nic1MnX2F0dHJfbmFtZScKcDM2ClMnc3BlYycKcDM3CnNic3Nic1MnYmF0Y2hfc2l6ZScKcDM4Ckk0MDAwCnNTJ2Rpc2NvdW50JwpwMzkKRjAuOTg5OTk5OTk5OTk5OTk5OTkKc1Mnc3RlcF9zaXplJwpwNDAKRjAuMDEKc1Mnbl9pdHInCnA0MQpJNTAKc2cyOQpnMjUKc1MncG9saWN5JwpwNDIKZzEKKGc3CmczCk50UnA0MwooZHA0NApnMTAKKHRzZzExCmNybGxhYi5wb2xpY2llcy5nYXVzc2lhbl9tbHBfcG9saWN5CkdhdXNzaWFuTUxQUG9saWN5CnA0NQpzZzEzCihkcDQ2CmcyMApnMQooZzIxCmczCk50UnA0NwooZHA0OApnMjQKZzI1CnNnMzYKZzM3CnNic1MnaGlkZGVuX3NpemVzJwpwNDkKKEk4Ckk4CnRwNTAKc3Nic1MnbWF4X3BhdGhfbGVuZ3RoJwpwNTEKZzEKKGcyMQpnMwpOdFJwNTIKKGRwNTMKZzI0CmcyNQpzZzM2ClMnaG9yaXpvbicKcDU0CnNic3NiUyd0cmFpbicKcDU1Cih0KGRwNTYKdHA1NwpzUydfX2t3YXJncycKcDU4CihkcDU5CnNiLg=='
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN Version is too old. Update to v5, was 4004.)
using seed 1
using seed 1
[2016-09-29 02:46:52,213] Making new env: InvertedPendulum-v1
2016-09-29 02:46:52.789599 IST | [experiment_2016_09_29_02_46_48_0001] Populating workers...
2016-09-29 02:46:52.789729 IST | [experiment_2016_09_29_02_46_48_0001] Populated
0%                          100%
[                              ][2016-09-29 02:46:52,993] Starting new video recorder writing to /home/iitm/sahil/rllab/data/local/experiment/experiment_2016_09_29_02_46_48_0001/gym_log/openaigym.video.0.24658.video000000.mp4
Xlib:  extension "GLX" missing on display ":99".
[2016-09-29 02:46:52,997] GLFW error: 65542, desc: GLX: GLX extension not found
Xlib:  extension "GLX" missing on display ":99".
Traceback (most recent call last):
  File "/home/iitm/sahil/rllab/scripts/run_experiment_lite.py", line 115, in <module>
    run_experiment(sys.argv)
  File "/home/iitm/sahil/rllab/scripts/run_experiment_lite.py", line 102, in run_experiment
    maybe_iter = concretize(data)
  File "/home/iitm/sahil/rllab/rllab/misc/instrument.py", line 1018, in concretize
    return method(*args, **kwargs)
  File "/home/iitm/sahil/rllab/rllab/algos/batch_polopt.py", line 250, in train
    paths = self.sampler.obtain_samples(itr)
  File "/home/iitm/sahil/rllab/rllab/algos/batch_polopt.py", line 32, in obtain_samples
    scope=self.algo.scope,
  File "/home/iitm/sahil/rllab/rllab/sampler/parallel_sampler.py", line 114, in sample_paths
    show_prog_bar=True
  File "/home/iitm/sahil/rllab/rllab/sampler/stateful_pool.py", line 142, in run_collect
    result, inc = collect_once(self.G, *args)
  File "/home/iitm/sahil/rllab/rllab/sampler/parallel_sampler.py", line 89, in _worker_collect_one_path
    path = rollout(G.env, G.policy, max_path_length)
  File "/home/iitm/sahil/rllab/rllab/sampler/utils.py", line 11, in rollout
    o = env.reset()
  File "/home/iitm/sahil/rllab/rllab/envs/normalized_env.py", line 54, in reset
    ret = self._wrapped_env.reset()
  File "/home/iitm/sahil/rllab/rllab/envs/gym_env.py", line 90, in reset
    return self.env.reset()
  File "/usr/local/lib/python2.7/dist-packages/gym/core.py", line 134, in reset
    self.monitor._after_reset(observation)
  File "/usr/local/lib/python2.7/dist-packages/gym/monitoring/monitor.py", line 267, in _after_reset
    self.video_recorder.capture_frame()
  File "/usr/local/lib/python2.7/dist-packages/gym/monitoring/video_recorder.py", line 105, in capture_frame
    frame = self.env.render(mode=render_mode)
  File "/usr/local/lib/python2.7/dist-packages/gym/core.py", line 185, in render
    return self._render(mode=mode, close=close)
  File "/usr/local/lib/python2.7/dist-packages/gym/envs/mujoco/mujoco_env.py", line 112, in _render
    self._get_viewer().render()
  File "/usr/local/lib/python2.7/dist-packages/gym/envs/mujoco/mujoco_env.py", line 121, in _get_viewer
    self.viewer.start()
  File "/usr/local/lib/python2.7/dist-packages/mujoco_py/mjviewer.py", line 168, in start
    raise Exception('glfw failed to initialize')
Exception: glfw failed to initialize
[2016-09-29 02:46:52,999] GLFW error: 65537, desc: The GLFW library is not initialized
[2016-09-29 02:46:52,999] GLFW error: 65537, desc: The GLFW library is not initialized
[2016-09-29 02:46:52,999] Could not close renderer for InvertedPendulum-v1: _type_ must have storage info
[2016-09-29 02:46:52,999] GLFW error: 65537, desc: The GLFW library is not initialized
[2016-09-29 02:46:52,999] GLFW error: 65537, desc: The GLFW library is not initialized
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/local/lib/python2.7/dist-packages/gym/utils/closer.py", line 67, in close
    closeable.close()
  File "/usr/local/lib/python2.7/dist-packages/gym/core.py", line 200, in close
    self.render(close=True)
  File "/usr/local/lib/python2.7/dist-packages/gym/core.py", line 176, in render
    return self._render(close=close)
  File "/usr/local/lib/python2.7/dist-packages/gym/envs/mujoco/mujoco_env.py", line 106, in _render
    self._get_viewer().finish()
  File "/usr/local/lib/python2.7/dist-packages/mujoco_py/mjviewer.py", line 324, in finish
    glfw.destroy_window(self.window)
  File "/usr/local/lib/python2.7/dist-packages/mujoco_py/glfw.py", line 809, in destroy_window
    window_addr = ctypes.cast(ctypes.pointer(window),
TypeError: _type_ must have storage info
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/local/lib/python2.7/dist-packages/gym/utils/closer.py", line 67, in close
    closeable.close()
  File "/usr/local/lib/python2.7/dist-packages/gym/core.py", line 200, in close
    self.render(close=True)
  File "/usr/local/lib/python2.7/dist-packages/gym/core.py", line 176, in render
    return self._render(close=close)
  File "/usr/local/lib/python2.7/dist-packages/gym/envs/mujoco/mujoco_env.py", line 106, in _render
    self._get_viewer().finish()
  File "/usr/local/lib/python2.7/dist-packages/mujoco_py/mjviewer.py", line 324, in finish
    glfw.destroy_window(self.window)
  File "/usr/local/lib/python2.7/dist-packages/mujoco_py/glfw.py", line 809, in destroy_window
    window_addr = ctypes.cast(ctypes.pointer(window),
TypeError: _type_ must have storage info

Any idea what is happening? As far as I understand it, it is not able to render the game front end. However, since I'm running this on a remote server, I do not mind going without rendering. How would I turn off rendering?

Thanks,
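
A possible workaround sketch, assuming your checkout's GymEnv accepts a record_video flag (recent versions do); with video recording disabled the monitor should not need to render frames, so the missing GLX extension stops mattering -- but do verify the GymEnv signature in your version:

from rllab.envs.gym_env import GymEnv
from rllab.envs.normalized_env import normalize

# record_video=False is assumed to exist; adjust to your GymEnv's signature.
env = normalize(GymEnv("InvertedPendulum-v1", record_video=False))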

Problem running an OpenAI Gym environment other than the Pendulum

Hi,

I tested the example provided for using OpenAI Gym environments. The default example works fine. But when I test it on another environment like "CartPole-v0", I get the following error message:

env = normalize(GymEnv('CartPole-v0'))
File "/home/rllab/rllab/rllab/envs/normalized_env.py", line 23, in init
assert isinstance(env.action_space, Box)
AssertionError

It is worth noting that I can run the scenario using OpenAI Gym directly with no problem.

Thanks in advanced,
Ali

Does rllab support gpu?

One more question: does the rllab framework support GPUs? If so, how can I run it in GPU mode?
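
rllab's main modules run on Theano, so GPU use is governed by Theano's configuration rather than by rllab itself. A rough sketch (the flag must be set before Theano is first imported; newer Theano versions spell the device "cuda" instead of "gpu"):

import os

# Set Theano flags before any rllab/Theano import happens.
os.environ.setdefault("THEANO_FLAGS", "device=gpu,floatX=float32")

from rllab.algos.trpo import TRPO  # imported after the flag is set
# ... build env, policy, baseline and call algo.train() as usual.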

trpo_cartpole_stub.py fails

When I try to run trpo_cartpole_stub.py, it runs for an hour or so and then runs out of memory. Some investigation shows that it never returns from the call
data = pickle.loads(base64.b64decode(args.args_data)) in run_experiment_lite.py.
top shows that one of the processes (python run_experiment_lite) runs at only about 15% CPU during the long wait until it crashes; the other python run_experiment_lite processes are at 0%.

Python 2.7 on a dual Xeon with 12 GB RAM, Ubuntu 14.04, Anaconda Python.

Question about orientation in gather_env

I am a bit confused about how the orientation is calculated in gather_env.py

On line 289 of gather_env.py:

ori = self.inner_env.model.data.qpos[self.__class__.ORI_IND]

For example, the Ant class has ORI_IND 6, and in ant.xml the first joint is the free 'root' joint, so ori should be the value on the k axis, since for a free joint qpos should be (x, y, z, a, i, j, k). But from the code, ori seems to be treated as the heading of the agent. Am I misunderstanding something?
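
In case it helps, if the free joint comes first then qpos[3:7] should be the (w, x, y, z) quaternion, and the heading (yaw) can be recovered with the standard conversion rather than by reading a single component; a small numpy sketch of my own, not rllab code:

import numpy as np

def quat_to_yaw(w, x, y, z):
    # Yaw (rotation about the vertical axis) of a unit quaternion (w, x, y, z).
    return np.arctan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))

print(quat_to_yaw(1.0, 0.0, 0.0, 0.0))  # identity orientation -> yaw 0.0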

How to contribute to rllab ?

Hi, I would like to contribute to rllab. Could you please tell me where to start? Could you give me a task or a list of tasks to begin contributing with?

Background: I am a final-year B.Tech student with basic knowledge of deep RL. I have implemented some basic algorithms before. Last summer vacation I did a research internship in this field.

Thanks in advance.

Some numpy functions produce fatal error with rllab

Hi, when using some functions from rllab, I'm unable to call a few numpy functions. See below for an illustration.

from rllab.algos.trpo import TRPO
import numpy as np

Mu = np.array([0.0, 0.0])
Sigma = np.diag(np.array([1.0, 1.0]))
np.random.multivariate_normal(Mu, Sigma)
Intel MKL FATAL ERROR: Cannot load libmkl_avx.so or libmkl_def.so.

Of course, the numpy functions work correctly if I am not importing rllab function. Any idea why this might be the case, and any hints on how to fix this. Unfortunately, what I'm doing requires me to both draw multivariate Gaussians and also use rllab functions at the same time. Thanks!

Run trpo_swimmer in stub mode

python examples/trpo_swimmer.py works well. In the default setting, after 40 iterations it produces a 55.72 average reward.

When I try to run trpo_swimmer.py in "stub" mode (I simply add stub(globals()) at the beginning and replace algo.train() with run_experiment_lite(...), following ddpg_cartpole and ddpg_cartpole_stub), it still works. However, with the same default settings, it produces a 49.59 average reward. I tried different random seeds and the difference remained.

Why does this difference exist?

TRPO implementation

I am perplexed by the TRPO implementation. It's just a wrapper around the NPO code. The TRPO paper does acknowledge that

TRPO is related to prior methods (e.g. natural policy gradient) but makes several changes, most notably by using a fixed KL divergence rather than a fixed penalty coefficient.

but I don't see that; you use the KL divergence in the NPO code. Is there a description of the difference between your NPO and TRPO somewhere?
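
For the sake of discussion, my understanding (paraphrased from memory, so treat it as a sketch rather than the exact source) is that TRPO is NPO configured with the constrained conjugate-gradient optimizer, and that optimizer is where the fixed-KL trust region and backtracking line search live, whereas plain NPO with a penalty-based optimizer would correspond to the penalized natural-gradient variant:

from rllab.algos.npo import NPO
from rllab.optimizers.conjugate_gradient_optimizer import ConjugateGradientOptimizer


class TRPO(NPO):
    # NPO whose policy update is solved under a hard KL constraint via CG
    # plus a backtracking line search, instead of a fixed penalty coefficient.
    def __init__(self, optimizer=None, optimizer_args=None, **kwargs):
        if optimizer is None:
            optimizer = ConjugateGradientOptimizer(**(optimizer_args or dict()))
        super(TRPO, self).__init__(optimizer=optimizer, **kwargs)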

ValueError: I/O operation on closed file

I'm not sure what I'm doing wrong here. I haven't changed anything from the master branch and the following simple implementation gives me this error. Would appreciate some help. Thanks!

from rllab.baselines.linear_feature_baseline import LinearFeatureBaseline
from rllab.envs.gym_env import GymEnv
from rllab.policies.gaussian_mlp_policy import GaussianMLPPolicy
from rllab.envs.normalized_env import normalize
from rllab.algos.vpg import VPG
import numpy as np

env = normalize(GymEnv("Reacher-v1"))
policy = GaussianMLPPolicy(env_spec=env.spec, hidden_sizes=(16,16))
baseline = LinearFeatureBaseline(env_spec=env.spec)

# build the VPG algo around the env, policy and baseline defined above
algo = VPG(env=env, policy=policy, baseline=baseline)

# set properties
algo.batch_size = 50  # N
algo.max_path_length=50   # T
algo.discount=0.99  # gamma
algo.n_itr=250  # n_iter

algo.train()

I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-f78c549ec9f8> in <module>()
----> 1 algo.train()

/home/aravind/Programs/rllab/rllab/algos/batch_polopt.pyc in train(self)
    249             with logger.prefix('itr #%d | ' % itr):
    250                 paths = self.sampler.obtain_samples(itr)
--> 251                 samples_data = self.sampler.process_samples(itr, paths)
    252                 self.log_diagnostics(paths)
    253                 self.optimize_policy(itr, samples_data)

/home/aravind/Programs/rllab/rllab/algos/batch_polopt.pyc in process_samples(self, itr, paths)
    146             )
    147 
--> 148         logger.log("fitting baseline...")
    149         self.algo.baseline.fit(paths)
    150         logger.log("fitted")

/home/aravind/Programs/rllab/rllab/misc/logger.pyc in log(s, with_prefix, with_timestamp)
    112     if not _log_tabular_only:
    113         # Also log to stdout
--> 114         print(out)
    115         for fd in _text_fds.values():
    116             fd.write(out + '\n')

/home/aravind/anaconda2/envs/rllab/lib/python2.7/site-packages/ipykernel/iostream.pyc in write(self, string)
    315 
    316             is_child = (not self._is_master_process())
--> 317             self._buffer.write(string)
    318             if is_child:
    319                 # newlines imply flush in subprocesses

ValueError: I/O operation on closed file

A bug in TfEnv

I'm creating my own custom environment:

class MyEnv(Env):
    def __init__(self):
        super(MyEnv, self).__init__()

    @property
    def vectorized(self):
        return True

    def candy(self):
        return "there's a candy!"

env = TfEnv(normalize(MyEnv()))
print(getattr(env, 'vectorized', False))
print(env.wrapped_env.candy())

The result is, weirdly, False.

I really couldn't figure out why, even after reading the source code of normalize(), TfEnv(), and ProxyEnv(). It seems like I should be able to access the underlying wrapped class, so I added a candy() method and called it, but it tells me that the "NormalizedEnv" object has no attribute candy.

This is weird because the first line of NormalizedEnv is:

class NormalizedEnv(ProxyEnv, Serializable):
    def __init__(
            self,
            env):
        ProxyEnv.__init__(self, env)

and for ProxyEnv it's:

class ProxyEnv(Env):
    def __init__(self, wrapped_env):
        self._wrapped_env = wrapped_env

    @property
    def wrapped_env(self):
        return self._wrapped_env

Why is this happening?
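
A guess based only on the ProxyEnv code quoted above: TfEnv and NormalizedEnv each add one layer of wrapping and neither forwards arbitrary attributes, so vectorized and candy() have to be looked up on the right layer explicitly. Something like the following ought to work (unverified):

# Continuing from the snippet above, where env = TfEnv(normalize(MyEnv())).
inner = env.wrapped_env          # the NormalizedEnv
innermost = inner.wrapped_env    # the original MyEnv

print(innermost.candy())                          # "there's a candy!"
print(getattr(innermost, 'vectorized', False))    # True on MyEnv itself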

Request: save/resume training

How difficult would it be to implement the following? (A rough load/test sketch follows the list.)

  • save training progress
  • load training progress and continue training
  • load training progress and test (no training)
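
A partial-answer sketch for the first and third points, assuming the standard run_experiment_lite setup, where snapshots are written with joblib (params.pkl or itr_N.pkl) and contain at least the policy and env saved by the batch policy optimization code; the path and keys below are hypothetical, so check your own snapshot's contents first:

import joblib

# Hypothetical path to a snapshot produced by an earlier run.
snapshot = joblib.load("data/local/experiment/my_experiment/params.pkl")
print(sorted(snapshot.keys()))   # inspect what was actually saved

policy = snapshot["policy"]      # policy object with trained parameters
env = snapshot["env"]

# Testing without training: roll out the restored policy directly.
from rllab.sampler.utils import rollout
path = rollout(env, policy, max_path_length=500)
print(path["rewards"].sum())

Resuming training (the second point) is less direct, since the optimizer state is not in the snapshot; one pragmatic option is to construct a fresh algorithm around the restored policy and env and keep training from the saved parameters.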
