Giter Site home page Giter Site logo

alfredvc / paac Goto Github PK

View Code? Open in Web Editor NEW
205.0 21.0 60.0 46.36 MB

Open source implementation of the PAAC algorithm presented in Efficient Parallel Methods for Deep Reinforcement Learning

Home Page: https://arxiv.org/abs/1705.04862

License: Other

Python 100.00%
reinforcement learning reinforcement-learning paac machine-learning atari tensorflow open-source

paac's Introduction

Efficient Parallel Methods for Deep Reinforcement Learning

This repository contains an open source implementation of the PAAC algorithm presented in Efficient Parallel Methods for Deep Reinforcement Learning.

PAAC is a conceptually simple advantage actor-critic algorithm designed to run efficiently on a GPU, offering A3C like performance in under 12 hours of training.

breakout gif pong gif qbert gif space invaders gif

Runing via docker (recommended)

  1. Follow the instructions to install nvidia-docker
  2. Clone this repository
  3. Run the container with nvidia-docker run -it -v <absolute-path>/paac:/root/paac -p 6006:6006 alfredvc/tf1-ale.

A CPU version of the docker container is also provided and can be run with docker run -it -v <absolute-path>/paac:/root/paac -p 6006:6006 alfredvc/tf1-ale:cpu. When running on the CPU pass the device flag -d '/cpu:0' to the training script.

Runing locally

Requirements

Training the agent

To train an agent to play, for example, pong run

  • python3 train.py -g pong -df logs/

For pong, the agent will begin to learn after about 5 million frames, and will learn an optimal policy after about 15 million frames.

Training can be stopped, for example by using Ctrl+c, and then resumed again by running python3 train.py -g pong -df logs/.

On a setup with an Intel i7-4790k CPU and an Nvidia GTX 980 Ti GPU with default settings, you can expect around 3000 timesteps (global steps) per second. Training for 80 million timesteps requires under 8 hours.

Qbert

qbert learning graph

Visualizing training

  1. Open a new terminal
  2. Attach to the running docker container with docker exec -it CONTAINER_NAME bash
  3. Run tensorboard --logdir=<absolute-path>/paac/logs/tf.
  4. In your browser navigate to localhost:6006/

If running locally, skip step 2.

Testing the agent

To test the performance of a trained agent run python3 test.py -f logs/ -tc 5 Output:

Performed 5 tests for seaquest.
Mean: 1704.00
Min: 1680.00
Max: 1720.00
Std: 14.97

Generating gifs

Gifs can be generated from stored network weights, for example a gif of the agent playing breakout can be generated with

python3 test.py -f pretrained/breakout/ -gn breakout

This may take a few minutes.

Pretrained models

Pretrained models for some games can be found here. These models can be used as starting points for training on the same game, other games, or to generate gifs.

Adapting the code

This codebase was designed to be easily modified to new environments and new neural network architectures.

Adapting to a new environment

The codebase currently contains a single environment, namely atari_emulator.py. To train on a new environment, simply create a new class that inherits from BaseEnvironment and modify environment_creator.py to create an instance of your new environment.

Adapting to new neural network architectures

The codebase contains currently two neural network architectures, the architecture used in Playing Atari with Deep Reinforcement Learning, and the architecture from Human-level control through deep reinforcement learning. Both adapted to an actor-critic algorithm. To create a new architecture follow the pattern demonstrated in NatureNetwork and NIPSNetwork. Then create a new class that inherits from both the PolicyVNetwork andYourNetwork. For example: NewArchitecturePolicyVNetwork(PolicyVNetwork, YourNetwork). Then use this class in train.py.

Citing PAAC

If you use PAAC in your research, we ask that you please cite our paper:

@ARTICLE{2017arXiv170504862C,
   author = {{Clemente}, A.~V. and {Castej{\'o}n}, H.~N. and {Chandra}, A.
	},
    title = "{Efficient Parallel Methods for Deep Reinforcement Learning}",
  journal = {ArXiv e-prints},
archivePrefix = "arXiv",
   eprint = {1705.04862},
 primaryClass = "cs.LG",
 keywords = {Computer Science - Learning},
     year = 2017,
    month = may,
   adsurl = {http://adsabs.harvard.edu/abs/2017arXiv170504862C},
  adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}

The paper has been accepted as a poster to The Multi-disciplinary Conference on Reinforcement Learning and Decision Making (citation to come).

Disclaimer

The code in this repository is not the code used to generate the results from the paper, but should give similar results. Some changes have been made:

  • Gradient clipping default value changed from 40.0 to 3.0.
  • Entropy regularization constant default changed from 0.01 to 0.02.
  • Using OpenAI Gym results in an increase in training time of 33%. This is because converting the image from RGB to Grayscale in python is slow.

paac's People

Contributors

alfredvc avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

paac's Issues

Low Seaquest avg score compared to A3C

Looking at a handful A3C implementations and results on Seaquest, they appear to score around 50K:

PAAC however, reaches a plateau around 2K according to our tests (similar to your paper). Visual inspection of the policy shows that the submarine does not resurface. While a common difficulty of the game, A3C appears to be able to overcome it (maybe this could be due to a modification in OpenAI Gym since their Atari setup has some differences with ALE).

We've looked at various explorations (e-greedy, boltzmann, bayesian dropout), with no improvement at the moment.

Do you seen any particular reason PAAC would underperform in this case ? LSTM might help, but from the two OpenAI Gym pointers above, it seems it should not be critical for Seaquest.

add LSTM layer

Hello,

May I ask a naive question, did you try to implement LSTM on this architecture? Or you already did it and find it is not efficient (maybe time consuming?) as people think.

In any case thanks for not such harware-demanding idea/architecture.

Best,
Chih-Chieh

Adapting paac for CartPole

Hi,
Thanks for the great implementation, I am currently learning RL and I am trying to adapt paac for a simple use case of CartPole. I made modifications to paac code to include a new environment for CartPole and also modified network for NIPS network to a simple Linear Network. In essence, I am trying to reproduce the A3C implementation of CartPole from https://github.com/jaara/AI-blog/blob/master/CartPole-A3C.py. Running paac on CartPole never seems to be converging to higher rewards, the Maximum reward I get is around 30. I understand that every environment needs tuning of the hyperparameters but I don't know what else can I try to make it work for a Simple use case of CartPole. The reference implementation of A3C at https://github.com/jaara/AI-blog/blob/master/CartPole-A3C.py coverages to successful rewards after few thousand steps but the paac implementation never moves beyond 30. Can you recommend anything else I can do to make it work or am I missing any fundamental settings? The changes I have already tried are

  1. Change learning rate, lower learning rates seem to do better
  2. Change network model to multiple layers 128->64->16 (with relu) and other configuration 512->256->128->64->16
  3. Run it for longer duration ( more than 30 mins)
  4. Change entropy to higher value (this one actually causes NaN's in gradients)

The paac model is capable of solving much more complicated environments and I am suprised that it is struggling with the classic and simplest Cartpole problem. I expected paac to solve the CartPole problem much faster than CPU based A3C

Thanks in advance

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.