
tensorflow-rl's Introduction

Tensorflow-RL

Join the chat at https://gitter.im/tensorflow-rl/Lobby

TensorFlow-based implementations of A3C, PGQ, TRPO, DQN+CTS, and CEM, originally based on the A3C implementation from https://github.com/traai/async-deep-rl. I extensively refactored most of the code and, beyond the new algorithms, added several additional options, including the a3c-lstm architecture, a fully-connected architecture to allow training on non-image-based gym environments, and support for continuous action spaces.

The code also includes some experimental ideas I'm toying with, and I'm planning on adding the following implementations in the near future:

*currently in progress

Notes

  • You can find a number of my evaluations for the A3C, TRPO, and DQN+CTS algorithms at https://gym.openai.com/users/steveKapturowski. As I'm working on lots of refactoring at the moment it's possible I could break things. Please open an issue if you discover any bugs.
  • I'm in the process of swapping out most of the multiprocessing code in favour of distributed TensorFlow, which should simplify a lot of the training code and allow actor-learner processes to be distributed across multiple machines.
  • There's also an implementation of the A3C+ model from Unifying Count-Based Exploration and Intrinsic Motivation, but I've been focusing on improvements to the DQN variant so this hasn't gotten much love.

Running the code

First you'll need to install the Cython extensions needed for the hogwild updates and the CTS density model:

./setup.py install build_ext --inplace

To train an a3c agent on Pong run:

python main.py Pong-v0 --alg_type a3c -n 8

To evaluate a trained agent simply add the --test flag:

python main.py Pong-v0 --alg_type a3c -n 1 --test --restore_checkpoint

DQN+CTS after 80M agent steps using 16 actor-learner threads (GIF: Montezuma's Revenge)

A3C run on Pong-v0 with default parameters and frameskip sampled uniformly over 3-4 (GIF)

Requirements

  • python 2.7
  • tensorflow 1.2
  • scikit-image
  • Cython
  • pyaml
  • gym

tensorflow-rl's People

Contributors

arjunchandra, gitter-badger, jimmcmahon, stevekapturowski


tensorflow-rl's Issues

segmentation fault with fast_cts.pyx

Hi Steve, I feel like asking you another newbie question.

I am having a hard time running the fast_cts module. I think I successfully compiled the Cython module, but when using it (fast_cts), I get a Segmentation Fault error. I tracked down which line causes it in fast_cts.pyx:

def update(self, obs):
    # the line below
    obs = resize(obs, (self.height, self.width), preserve_range=True)
    obs = np.floor((obs*self.num_bins)).astype(np.int32)
    
    log_prob, log_recoding_prob = self._update(obs)
    return self.exploration_bonus(log_prob, log_recoding_prob)

It's the line marked with "# the line below", and the error occurs precisely when the resized output gets assigned to the obs variable. This is my first time dealing with Cython, so I am not sure whether the error has to do with:

a) the Cython compilation (Ubuntu 16.04 LTS)
or
b) the Cython code itself (since it's a segmentation fault, it must have to do with memory)

Let me know if you need more specific details to share any advice.

Plus, could I ask how much of a speedup you achieved with fast_cts compared to the original cts? With the original cts, I get 64 timesteps/second...

Thanks.
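A quick way to narrow this down (a hypothetical diagnostic, not part of the repo) is to run the same scikit-image call outside Cython; if this standalone snippet also crashes, the problem is in the scikit-image/NumPy installation (e.g. a binary mismatch) rather than in fast_cts.pyx:

import numpy as np
from skimage.transform import resize

# Stand-in for a raw frame; the shape and dtype here are assumptions.
frame = np.random.randint(0, 256, size=(210, 160)).astype(np.uint8)

# Same call pattern as the line that crashes in update().
out = resize(frame, (42, 42), preserve_range=True)
print(out.shape, out.dtype, out.flags['C_CONTIGUOUS'])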

clang error when running setup.py

➜  tensorflow-rl git:(master) ✗ python3 setup.py install build_ext --inplace
running install
running build
running build_ext
building 'utils.fast_cts' extension
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/include/python3.5m -c utils/fast_cts.c -o build/temp.macosx-10.11-x86_64-3.5/utils/fast_cts.o
utils/fast_cts.c:444:10: fatal error: 'numpy/arrayobject.h' file not found
#include "numpy/arrayobject.h"
         ^
1 error generated.
error: command 'clang' failed with exit status 1
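This usually means the NumPy C headers aren't on the include path when the extension is compiled. A common fix (a sketch of a setup.py fragment, not necessarily how this repo's setup.py is written; only the module and file names are taken from the log above) is to pass numpy.get_include() to the Extension:

import numpy
from setuptools import setup, Extension
from Cython.Build import cythonize

extensions = [
    Extension(
        'utils.fast_cts',
        ['utils/fast_cts.pyx'],
        include_dirs=[numpy.get_include()],  # makes numpy/arrayobject.h visible
    ),
]

setup(ext_modules=cythonize(extensions))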

Code hardwired to Atari and emulator environments

In many places in the code, there's a hardwired expectation that the environment will either be from ALE, or else it will be from some other type of video game. This limits the usefulness of the RL algorithms, even though there's nothing in their definition which stops them being useful in other contexts. In particular, there's no reason why an environment has to be something that can be rendered to a screen, e.g. the new GuessingGame-v0 environment provided with OpenAI Gym.
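To make the point concrete, here is a small sketch using the gym API of that era (GuessingGame-v0 is the non-visual environment mentioned above; nothing about it is assumed beyond what gym registers): an agent only needs the environment's observation and action spaces, not a renderable screen.

import gym

env = gym.make('GuessingGame-v0')
print(env.observation_space)  # not an 84x84x4 image tensor
print(env.action_space)

# Standard interaction loop; no rendering or emulator assumptions involved.
obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())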

about fast_cts.pyx

In fast_cts.pyx, there is the following code:

        for i in range(self.height):
            for j in range(self.width):
                context[0] = obs[i, j-1] if j > 0 else 0
                context[1] = obs[i-1, j] if i > 0 else 0
                context[2] = obs[i-1, j-1] if i > 0 and j > 0 else 0
                context[3] = obs[i-1, j+1] if i > 0 and j < self.width-1 else 0

which confuses me, as it's inconsistent with the code from cts_density_model.py, where the same logic appears as:

		for i in range(self.factors.shape[0]):
			for j in range(self.factors.shape[1]):
				context[3] = obs[i, j-1] if j > 0 else 0
				context[2] = obs[i-1, j] if i > 0 else 0
				context[1] = obs[i-1, j-1] if i > 0 and j > 0 else 0
				context[0] = obs[i-1, j+1] if i > 0 and j < self.factors.shape[1]-1 else 0

Please note that the ordering differs: context[3 ... 0] versus context[0 ... 3].
Thanks!
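For readers following along: both snippets gather the same four previously-seen neighbours of pixel (i, j) (left, above, above-left, above-right), just in a different order. A small illustration of that neighbourhood with an arbitrary ordering (a sketch, not the repo's code) is below; what presumably matters for each model is that the ordering it uses for updates matches the one it uses for queries.

import numpy as np

def context_of(obs, i, j):
    """Return the 4-pixel, L-shaped context of obs[i, j], zero-padded at borders."""
    w = obs.shape[1]
    return [
        obs[i, j - 1] if j > 0 else 0,                    # left
        obs[i - 1, j] if i > 0 else 0,                    # above
        obs[i - 1, j - 1] if i > 0 and j > 0 else 0,      # above-left
        obs[i - 1, j + 1] if i > 0 and j < w - 1 else 0,  # above-right
    ]

print(context_of(np.arange(9).reshape(3, 3), 1, 1))  # [3, 1, 0, 2]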

Tensorboard summaries not appearing

No scalar data was found.

How did you get the code to work with TensorBoard? The summary part doesn't seem to be working during training:

python main.py Pong-v0 --alg_type a3c -n 8
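For reference, scalar summaries only show up in TensorBoard if they are written through a summary writer and TensorBoard is pointed at the same directory with --logdir. A minimal TF 1.x pattern (a generic sketch, not this repo's summary code; the log directory name is made up) looks like:

import tensorflow as tf

loss = tf.placeholder(tf.float32, name='loss')
tf.summary.scalar('loss', loss)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter('/tmp/tf-rl-demo', sess.graph)
    for step in range(100):
        summary = sess.run(merged, feed_dict={loss: 1.0 / (step + 1)})
        writer.add_summary(summary, step)
    writer.flush()

# Then run: tensorboard --logdir /tmp/tf-rl-demo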

Can't use MountainCarContinuous-v0 with trpo-continuous

Hi, thanks for your quick response to the previous issue I submitted. I've been trying out training with the MountainCarContinuous-v0 environment, and have been able to run it with all of the continuous algorithms other than trpo-continuous, which gives me the following error.

[2017-05-24 17:04:45] INFO [MainThread:222] Error reported to Coordinator: <type 'exceptions.ValueError'>, all the input array dimensions except for the concatenation axis must match exactly
[2017-05-24 17:04:45,735] Error reported to Coordinator: <type 'exceptions.ValueError'>, all the input array dimensions except for the concatenation axis must match exactly
Process TRPOLearner-1:
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/Users/abiolalapite/Documents/Code/ThirdParty/tensorflow-rl/algorithms/actor_learner.py", line 254, in run
    self.train()
  File "/Users/abiolalapite/Documents/Code/ThirdParty/tensorflow-rl/algorithms/trpo_actor_learner.py", line 358, in train
    self._run_master()
  File "/Users/abiolalapite/Documents/Code/ThirdParty/tensorflow-rl/algorithms/trpo_actor_learner.py", line 337, in _run_master
    values = self.predict_values(worker_data)
  File "/Users/abiolalapite/Documents/Code/ThirdParty/tensorflow-rl/algorithms/trpo_actor_learner.py", line 229, in predict_values
    'timestep': np.array(data['timestep'])})
  File "/Users/abiolalapite/Documents/Code/ThirdParty/tensorflow-rl/algorithms/trpo_actor_learner.py", line 221, in preprocess_value_state
    return np.hstack([data['state'], data['timestep'].reshape(-1, 1, 1)])
  File "/Users/abiolalapite/intellij-tf/lib/python2.7/site-packages/numpy/core/shape_base.py", line 288, in hstack
    return _nx.concatenate(arrs, 1)
ValueError: all the input array dimensions except for the concatenation axis must match exactly
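The traceback boils down to np.hstack (which concatenates along axis 1) being given arrays whose other dimensions don't line up. A tiny reproduction of that NumPy behaviour, with made-up shapes purely for illustration:

import numpy as np

state = np.zeros((5, 1, 2))                # e.g. (batch, history, state_dim) -- assumed shape
timestep = np.arange(5).reshape(-1, 1, 1)  # (batch, 1, 1), as in preprocess_value_state

try:
    np.hstack([state, timestep])  # axis-1 concat: axes 0 and 2 must match, but 2 != 1
except ValueError as e:
    print(e)  # raises the same kind of error as above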

no 'target_network' for A3CLearner when testing

When running the test function of the A3C algorithm (using the --test option), the error "AttributeError: 'A3CLearner' object has no attribute 'target_network'" occurs. I have checked the code and found that 'target_network' is only available for value_based_actor_learner.
So, is this a bug, or am I making a mistake?
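A possible workaround, sketched below with a hypothetical method name (this is not the repo's actual test() code), is to guard the target-network access so that policy-based learners can share the same test path:

class Learner(object):
    def test(self):
        # Only value-based learners define a target network; skip the sync
        # for policy-based learners such as A3CLearner.
        if hasattr(self, 'target_network'):
            self.sync_target_network()  # hypothetical method name
        print('running evaluation episodes...')

Learner().test()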

Explanation on DQN needed

Hello! Can you please clarify what you meant in the README by "DQN+CTS after 80M agent steps using 16 actor-learner threads"? DQN isn't a distributed algorithm; it uses a single thread. Did you mean to write A3C instead of DQN? Thank you very much!

reproducing your stellar result on Montezuma's Revenge

Hi Steve, I am trying to reproduce the ~3600 score you achieved on Montezuma's Revenge with your dqn-cts model (as per the GIF in the README).

30M steps in, the model does not seem to be learning. It very occasionally gets the key (+100 points) and that's all. I ran your code as is and did not modify a single line.

  1. Could I ask if you can reproduce 3600 "on average" with your dqn-cts?

  2. Also would you say I should try some other hyperparameter settings other than the ones you set as default?

I look forward to your advice.

Best wishes,

Cannot install (swig.exe error)

Pip failed. SWIG error

Collecting tensorflow-rl
  Using cached tensorflow_rl-0.2.2-py3-none-any.whl
Requirement already satisfied: gym>=0.10.5 in e:\python310\lib\site-packages (from tensorflow-rl) (0.26.2)
Collecting tensorboardX>=1.4
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/60/9f/d532d37f10ac7af136d4c2ba71e1fe7af0f3cc0cc076dfc05826171e9737/tensorboardX-2.6-py2.py3-none-any.whl (114 kB)
Collecting Box2D>=2.3.2
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/cc/7b/ddb96fea1fa5b24f8929714ef483f64c33e9649e7aae066e5f5023ea426a/Box2D-2.3.2.tar.gz (427 kB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: numpy>=1.13 in e:\python310\lib\site-packages (from tensorflow-rl) (1.24.2)
Requirement already satisfied: cloudpickle>=1.2.0 in e:\python310\lib\site-packages (from gym>=0.10.5->tensorflow-rl) (2.2.1)
Requirement already satisfied: gym-notices>=0.0.4 in e:\python310\lib\site-packages (from gym>=0.10.5->tensorflow-rl) (0.0.8)
Requirement already satisfied: protobuf<4,>=3.8.0 in e:\python310\lib\site-packages (from tensorboardX>=1.4->tensorflow-rl) (3.19.6)
Requirement already satisfied: packaging in e:\python310\lib\site-packages (from tensorboardX>=1.4->tensorflow-rl) (23.0)
Building wheels for collected packages: Box2D
  Building wheel for Box2D (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [17 lines of output]
      Using setuptools (version 67.6.0).
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-cpython-310
      creating build\lib.win-amd64-cpython-310\Box2D
      copying library\Box2D\Box2D.py -> build\lib.win-amd64-cpython-310\Box2D
      copying library\Box2D\__init__.py -> build\lib.win-amd64-cpython-310\Box2D
      creating build\lib.win-amd64-cpython-310\Box2D\b2
      copying library\Box2D\b2\__init__.py -> build\lib.win-amd64-cpython-310\Box2D\b2
      running build_ext
      building 'Box2D._Box2D' extension
      swigging Box2D\Box2D.i to Box2D\Box2D_wrap.cpp
      swig.exe -python -c++ -IBox2D -small -O -includeall -ignoremissing -w201 -globals b2Globals -outdir library\Box2D -keyword -w511 -D_SWIG_KWARGS -o Box2D\Box2D_wrap.cpp Box2D\Box2D.i
      Box2D\Box2D.i(44) : Error: Unknown directive '%exception'.
      error: command 'C:\\path\\swig.exe' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for Box2D
  Running setup.py clean for Box2D
Failed to build Box2D
Installing collected packages: Box2D, tensorboardX, tensorflow-rl
  Running setup.py install for Box2D ... error
  error: subprocess-exited-with-error

  × Running setup.py install for Box2D did not run successfully.
  │ exit code: 1
  ╰─> [19 lines of output]
      Using setuptools (version 67.6.0).
      running install
      E:\Python310\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
        warnings.warn(
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-cpython-310
      creating build\lib.win-amd64-cpython-310\Box2D
      copying library\Box2D\Box2D.py -> build\lib.win-amd64-cpython-310\Box2D
      copying library\Box2D\__init__.py -> build\lib.win-amd64-cpython-310\Box2D
      creating build\lib.win-amd64-cpython-310\Box2D\b2
      copying library\Box2D\b2\__init__.py -> build\lib.win-amd64-cpython-310\Box2D\b2
      running build_ext
      building 'Box2D._Box2D' extension
      swigging Box2D\Box2D.i to Box2D\Box2D_wrap.cpp
      swig.exe -python -c++ -IBox2D -small -O -includeall -ignoremissing -w201 -globals b2Globals -outdir library\Box2D -keyword -w511 -D_SWIG_KWARGS -o Box2D\Box2D_wrap.cpp Box2D\Box2D.i
      Box2D\Box2D.i(44) : Error: Unknown directive '%exception'.
      error: command 'C:\\path\\swig.exe' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> Box2D

note: This is an issue with the package mentioned above, not pip.
hint: See above for output from the failure.

a question on the implementation of exploration bonus

Hi, I have a hard time understanding line 64 of intrinsic_motivation.py, where the pseudocount is defined:

pseudocount = (1 - recoding_prob) / np.maximum(prob_ratio - 1, 1e-10)

According to the paper, shouldn't it be:

pseudocount = 1 / np.maximum(prob_ratio - 1, 1e-10)

or

pseudocount = prob * (1 - recoding_prob) / (recoding_prob - prob)
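For what it's worth, assuming prob_ratio is recoding_prob / prob (an assumption about the code, not verified here), the expression in intrinsic_motivation.py is algebraically the same as the last form above, since (1 - p') / (p'/p - 1) = p * (1 - p') / (p' - p). A quick numerical check:

import numpy as np

prob, recoding_prob = 0.01, 0.012  # made-up values with recoding_prob > prob
prob_ratio = recoding_prob / prob  # assumed definition of prob_ratio

lhs = (1 - recoding_prob) / np.maximum(prob_ratio - 1, 1e-10)
rhs = prob * (1 - recoding_prob) / (recoding_prob - prob)
print(lhs, rhs)  # identical up to floating-point error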

Thank you so much for writing up the reference implementations of latest RL papers. They are purely awesome!

Wrong check for 'reward_threshold' property in cem_actor_learner.py

In the train method of the CEMLearner, there's the following check on lines 86-89:

if elite_mean_reward > self.emulator.env.spec.reward_threshold:
    consecutive_successes += 1
else:
    consecutive_successes = 0

Unfortunately, the reward_threshold often evaluates to None (e.g. with Pendulum-v0) and consequently the inequality check succeeds, leading to premature halting of the CEM training.
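In Python 2 a number compares as greater than None, which is why the check silently succeeds. A guard along these lines (a sketch, not the repo's code) would avoid comparing against a missing threshold:

# Only count a success when the environment actually defines a threshold.
reward_threshold = self.emulator.env.spec.reward_threshold
if reward_threshold is not None and elite_mean_reward > reward_threshold:
    consecutive_successes += 1
else:
    consecutive_successes = 0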

Training slowing down dramatically

Did anyone face the issue of the training process slowing down? For example, training one DQN-CTS worker on Montezuma's Revenge runs at about 220 iter/sec after 100,000 steps and 35 iter/sec after 400,000. Any thoughts? Thank you.

About actor_learner.py

Look at the following code:

        with self.monitored_environment(), session_context as self.session:
            self.synchronize_workers()

            if self.is_train:
                self.train()
            else:
                self.test()

After trying this several times, I get the impression that the "with ... as" block exits even while self.train() is still running.
Here self.train() refers to PseudoCountQLearner's train() method.
I tried to catch tf.errors.OutOfRangeError, which TensorFlow does not re-raise, but that does not seem to be the answer. I have a feeling that TensorFlow exits the "with ... as" block once no new training data arrive in its queue. This might be because PseudoCountQLearner has to compute the MC mixed return, so it waits for the end of the episode, and TensorFlow cannot wait until then.

All in all, I can find no reason why the "with ... as" block exits earlier than self.train().
Thanks!
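As a side note on the mechanics (a generic Python sketch, unrelated to the repo's actual session handling): a with block ends as soon as any exception escapes its body, so if something like an OutOfRangeError propagates out of self.train(), the session context is torn down even though training is logically unfinished.

import contextlib

@contextlib.contextmanager
def session_context():
    print('session opened')
    try:
        yield 'session'
    finally:
        print('session closed')   # runs as soon as the body raises

def train():
    raise RuntimeError('queue is empty')  # stand-in for an OutOfRangeError

try:
    with session_context() as sess:
        train()                   # the exception escapes the with block here...
except RuntimeError:
    pass                          # ...after the context has already been exited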

Missing logger in value_based_actor_learner.py

In the train method of the NStepQLearner class, there's a call to logger.debug, but there's no import utils.logger at the start of the file, and no call to utils.logger.getLogger to actually create the logger.
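A fix along the lines the report implies (assuming utils.logger exposes a getLogger helper as described; the name passed to it is a guess) would be to add near the top of value_based_actor_learner.py:

import utils.logger
logger = utils.logger.getLogger('value_based_actor_learner')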

Can't test CartPole-v0 model trained with TRPO

I've tried to use TRPO to create a model for CartPole-v0 by following the instructions on your OpenAI Gym page, changing the command to the following to reflect the API changes since the score was submitted:

python main.py CartPole-v0 --alg_type trpo --td_lambda 1.0 --cg_damping .05 --episodes_per_batch 25 -n 2 -v 0 --arch FC --trpo_max_rollout 1000 --max_kl .05 --history_length 1 --frame_skip 1 --activation tanh --num_epochs 40

This seems to work, with training proceeding as expected and concluding successfully. However, when I try to evaluate the trained model by running

python main.py CartPole-v0 --alg_type trpo -n 1 --test --restore_checkpoint

I get the following error.

[2017-05-25 16:16:16,587] Error reported to Coordinator: <type 'exceptions.ValueError'>, Cannot feed value of shape (1, 4, 4) for Tensor u'policy_network_0/input:0', which has shape '(?, 84, 84, 4)'
Process TRPOLearner-1:
Traceback (most recent call last):
  File "/home/abiolalapite/.pyenv/versions/2.7.13/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/abiolalapite/Code/ThirdParty/tensorflow-rl/algorithms/actor_learner.py", line 256, in run
    self.test()
  File "/home/abiolalapite/Code/ThirdParty/tensorflow-rl/algorithms/actor_learner.py", line 181, in test
    a = self.choose_next_action(s)[0]
  File "/home/abiolalapite/Code/ThirdParty/tensorflow-rl/algorithms/trpo_actor_learner.py", line 148, in choose_next_action
    return self.policy_network.get_action(self.session, state)
  File "/home/abiolalapite/Code/ThirdParty/tensorflow-rl/networks/policy_v_network.py", line 78, in get_action
    self.logits], feed_dict=feed_dict)
  File "/home/abiolalapite/.pyenv/versions/py2713/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 778, in run
    run_metadata_ptr)
  File "/home/abiolalapite/.pyenv/versions/py2713/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 961, in _run
    % (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1, 4, 4) for Tensor u'policy_network_0/input:0', which has shape '(?, 84, 84, 4)'
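The shape mismatch suggests the test run rebuilt the default convolutional network (expecting 84x84x4 image inputs) rather than the fully-connected one used for training. A plausible, untested workaround is to pass the same architecture-related flags at test time that were used for training, e.g.:

python main.py CartPole-v0 --alg_type trpo -n 1 --test --restore_checkpoint --arch FC --history_length 1 --frame_skip 1 --activation tanh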
