procgen

Competition Starter Code: https://github.com/AIcrowd/neurips2020-procgen-starter-kit

Competition Rules: https://www.aicrowd.com/challenges/neurips-2020-procgen-competition

Overview

I focused on making large changes to the learning algorithm to improve generalization and sample efficiency on the ProcGen benchmark, instead of tweaking the CNN architecture or tuning hyperparameters. I thought this was the fastest way for me to get better at reinforcement learning, even though it meant ignoring simple and small optimizations that could've increased my score.

For reference, here's how standard PPO performs on the BigFish environment. For this PPO agent, the only changes I made from the starter code were switching from TensorFlow to PyTorch and changing the epsilon hyperparameter in the Adam optimizer (RLlib doesn't use the same value as OpenAI Baselines, and the agent performed better with the OpenAI Baselines value).
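Concretely, OpenAI Baselines sets Adam's epsilon to 1e-5, while PyTorch's default is 1e-8. Matching the Baselines value looks roughly like this sketch (the model and learning rate here are hypothetical placeholders, not the repo's actual values):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the agent network; lr is a placeholder value.
model = nn.Linear(4, 2)

# OpenAI Baselines uses eps=1e-5 for Adam; PyTorch's default is 1e-8.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, eps=1e-5)
```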

(Figure: PPO training curve on BigFish)

Approach 1: Automatic Data Augmentation

Code is in the ucbfinal branch.

Key changes implemented in: algorithms/ucb_drac_agent/ucb_ppo_policy.py

My first idea was to add data augmentation on top of the RLlib PPO implementation, since the competition restricts the agent to 8M frames of training on each game. The results in the paper Automatic Data Augmentation for Generalization in Deep Reinforcement Learning suggest that different ProcGen games work best with different data augmentations (although crop usually works best), so I followed the authors' approach and implemented the Upper Confidence Bound (UCB) algorithm to automatically choose one of eight data augmentations (crop, grayscale, random convolution, color jitter, cutout color, cutout, rotate, and flip) during training. I also changed the loss function according to the authors' approach: it adds two regularization terms (one for the policy function, one for the value function) so that the policy and value functions produce similar outputs with and without the data augmentation.
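The UCB selection step can be sketched as a multi-armed bandit over augmentations (a minimal illustration with hypothetical names; the actual implementation is in algorithms/ucb_drac_agent/ucb_ppo_policy.py and may differ):

```python
import math
import random

AUGMENTATIONS = ["crop", "grayscale", "random_conv", "color_jitter",
                 "cutout_color", "cutout", "rotate", "flip"]

class UCBAugmentationSelector:
    """Treat each augmentation as a bandit arm and pick the one with the
    highest upper confidence bound on its observed mean return."""

    def __init__(self, c=0.1):
        self.c = c  # exploration coefficient (assumed value)
        self.counts = {a: 0 for a in AUGMENTATIONS}
        self.mean_return = {a: 0.0 for a in AUGMENTATIONS}
        self.total = 0

    def select(self):
        self.total += 1
        # Try every arm once before applying the UCB formula.
        untried = [a for a in AUGMENTATIONS if self.counts[a] == 0]
        if untried:
            return random.choice(untried)
        return max(
            AUGMENTATIONS,
            key=lambda a: self.mean_return[a]
            + self.c * math.sqrt(math.log(self.total) / self.counts[a]),
        )

    def update(self, aug, episode_return):
        # Incremental mean of returns observed under this augmentation.
        self.counts[aug] += 1
        n = self.counts[aug]
        self.mean_return[aug] += (episode_return - self.mean_return[aug]) / n
```

In use, the trainer would call `select()` at the start of each training iteration, run PPO with the chosen augmentation, and feed the resulting mean episode return back via `update()`.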

My implementation initially performed slightly better than PPO on BigFish.

(Figure: UCB data augmentation training curve on BigFish)

It performed even better after reducing the minibatch size from 1024 to 256 and the number of SGD iterations from 3 to 2. I think the smaller minibatches helped because they gave the UCB algorithm more iterations to meta-learn the optimal data augmentation. However, on the official test servers, the smaller minibatches make training hit the 2-hour time quota after about 6M frames, even with the reduced number of SGD iterations.
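These settings correspond to standard RLlib PPO config keys; in the experiment YAML the change would look roughly like this sketch (the exact file layout may differ):

```yaml
config:
  sgd_minibatch_size: 256   # reduced from 1024
  num_sgd_iter: 2           # reduced from 3
```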

(Figure: UCB with smaller minibatches, training curve on BigFish)

Approach 2: Phasic Policy Gradient

Code is in the default ppg branch.

Key changes implemented in:
algorithms/ppg_torch/custom_postprocessing.py
algorithms/ppg_torch/custom_sgd.py
algorithms/ppg_torch/custom_torch_policy.py
algorithms/ppg_torch/ppg_torch_policy.py
models/impala_cnn_torch.py

After reading OpenAI's new Phasic Policy Gradient paper, I wanted to implement it myself to understand it better and see if I could make further improvements to the sample efficiency. Their agents were trained for 100M timesteps on each game, and judging from their results, most games showed only minor improvements in the first 8M frames. I didn't use data augmentation because I thought the replay buffer in PPG might be a good substitute.

The main problem I encountered was sudden drops in the mean reward during training, so I suspect a bug in my auxiliary-phase implementation: the auxiliary phase is the biggest difference from PPO, and OpenAI's paper shows stable training. It might also be because I implemented the detached variant of PPG for simplicity, instead of the dual variant.
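As a debugging reference, the auxiliary phase in PPG jointly minimizes an auxiliary value loss and a behavioral-cloning KL term that keeps the policy close to the one frozen before the phase. A sketch of that objective with hypothetical tensor names (not the exact code in algorithms/ppg_torch/):

```python
import torch
import torch.nn.functional as F

def auxiliary_phase_loss(aux_value_pred, value_targets,
                         new_logits, old_logits, beta_clone=1.0):
    """PPG auxiliary objective (sketch):
    L = 0.5 * E[(V_aux(s) - V_targ)^2] + beta_clone * E[KL(pi_old || pi_new)]
    old_logits are recorded before the auxiliary phase starts and must
    stay fixed (detached) while the policy is updated against them."""
    value_loss = 0.5 * F.mse_loss(aux_value_pred, value_targets)
    old_logp = F.log_softmax(old_logits.detach(), dim=-1)
    new_logp = F.log_softmax(new_logits, dim=-1)
    # KL(pi_old || pi_new), averaged over the batch.
    kl = torch.sum(old_logp.exp() * (old_logp - new_logp), dim=-1).mean()
    return value_loss + beta_clone * kl
```

A useful sanity check when hunting instability of this kind: with identical old and new logits the KL term should be exactly zero, and the loss should never go negative.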

(Figure: PPG training curve on BigFish)

This was as far as I got before the competition deadline. The next step would be to continue debugging the auxiliary phase. After that, I was going to try combining data augmentation with PPG. A third approach I would've liked to try is using an optical flow network on the last two frames to produce an additional input to the CNN at every frame, in order to teach the agent about the velocities of game entities. Intuitively, the agent can't learn velocities right now because each observation is a single frame (this might be enough in specific situations, e.g. if an entity always moves at a constant speed and its orientation precisely indicates its direction). Velocity is important to the gameplay in most of the ProcGen games, and an optical flow network is better suited to velocity detection than basic framestacking.

Setup

The Dockerfile is used for my official competition submission but isn't necessary.

I tested my code on an AWS p3.2xlarge instance, using this AMI:

Deep Learning AMI (Ubuntu 18.04) Version 36.0 - ami-063585f0e06d22308

First, activate the default Conda environment on the image (I actually use PyTorch, but it gets installed with pip):

source activate tensorflow2_latest_p37

Then install the Python packages:

pip install -r requirements.txt --no-cache-dir -f https://download.pytorch.org/whl/torch_stable.html

Training

Change these environment variables in run.sh depending on your available resources:

  export RAY_MEMORY_LIMIT=60129542144
  export RAY_CPUS=8
  export RAY_STORE_MEMORY=30000000000

To change the ProcGen game environment, change this line in experiments/ppg.yaml:

env_name: bigfish

To start training:

./run.sh --train

Your model and checkpoints will be saved in ~/ray_results/procgen-ppg

Rollout

Uncomment and modify these environment variables in run.sh:

  # export EPISODES=5
  # replace with your own checkpoint path
  # export CHECKPOINT=~/ray_results/procgen-ppg/PPG_procgen_env_wrapper_0_2020-11-10_18-16-19qlw86nzo/checkpoint_447/checkpoint-447

To rollout the agent:

./run.sh --rollout

