A2C-SIL-TF2

This repository is a TensorFlow2 implementation of Self-Imitation Learning (SIL) with Synchronous Advantage Actor-Critic (A2C) as the backbone action-selection policy.

In short, SIL builds on the intuition that good past experiences should be exploited more, which in turn can drive deep exploration. SIL is therefore responsible for improving the exploitation of good past decisions, and it is an auxiliary module that can be added to any action-selection policy.

The paper "Self-Imitation Learning" by Oh et al. combines SIL with the synchronous version of the Advantage Actor-Critic method and shows that SIL is also applicable to other Actor-Critic methods such as Proximal Policy Optimization (PPO).

This repository follows the same procedure as the paper, and the results shown here were obtained by combining SIL with A2C.
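
For reference, SIL augments the on-policy actor-critic update with the following off-policy losses from the paper, where R is the discounted return stored in the replay buffer, V is the critic, π is the policy, and max(x, 0) keeps only positive advantages:

L_sil_policy = E[ -log π(a|s) * max(R - V(s), 0) ]
L_sil_value  = E[ 0.5 * max(R - V(s), 0)^2 ]
L_sil        = L_sil_policy + β_sil * L_sil_value

Here β_sil corresponds to the "critic coefficient for SIL" in the hyperparameter table below.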

Demo

Results

The x-axis corresponds to episode numbers. The blue curve uses 0.1 as the value of the Bias Correction (beta) for Prioritized Experience Replay.

The rest of the training plots are at the end of this README.

Important Notes About This Code!

While the current code tries to put the ideas of SIL into practice, there are some differences between it and the original implementation:

  • The SIL paper claims that it stores past experiences in a replay buffer and uses only the useful, good memories, without any constraints or heuristics tied to domain-specific configurations.
    However, in the original implementation, the transitions of an episode are added to the replay buffer only if that episode contains at least one transition with a positive reward. To be specific, this part of the original code:
def update_buffer(self, trajectory):
    # Scan the finished episode and store it only if it contains
    # at least one transition with a positive reward.
    positive_reward = False
    for (ob, a, r) in trajectory:
        if r > 0:
            positive_reward = True
            break
    if positive_reward:
        self.add_episode(trajectory)

This heuristic is only valid for the Atari domain with the standard DeepMind preprocessing (which clips rewards to [-1, 1]), and it reduces the generality of SIL.

A simple example is enough to show that the above code is merely an aid for faster training of Atari games under a specific reward preprocessing, at the cost of generality: suppose we hand-engineer the rewards of an Atari game like Pong by subtracting 2 from the reward of every timestep. This intervention shifts the reward range from [-1, 1] to [-3, -1] without losing any of the expressiveness of the optimal behavior that the rewards encode. Yet the above heuristic fails in this new environment and would have to be tuned again (a loss of generality)!

As a result, the current code avoids relying on such domain knowledge about the sign of the rewards.
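
As a minimal sketch of the alternative followed here (the function and buffer names are illustrative assumptions rather than the repository's actual API, and the 0.99 discount factor is only an example), every finished episode is stored together with its discounted returns regardless of the sign of its rewards; the prioritized replay, with priorities based on max(R - V, 0), then decides which transitions are actually useful:

def compute_returns(trajectory, gamma=0.99):
    # trajectory: list of (obs, action, reward) tuples of one finished episode.
    returns = []
    running_return = 0.0
    for (_, _, reward) in reversed(trajectory):
        running_return = reward + gamma * running_return
        returns.append(running_return)
    returns.reverse()
    return returns

def update_buffer(buffer, trajectory, gamma=0.99):
    # Store every finished episode, with no positive-reward heuristic; usefulness
    # is decided later by the prioritized replay via the SIL advantage (R - V)+.
    for (obs, action, _), ret in zip(trajectory, compute_returns(trajectory, gamma)):
        buffer.add(obs, action, ret)  # buffer.add(obs, action, return) is assumed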

  • To speed up training, the current code benefits from using an LSTMCell and the lighter network architecture introduced in the A3C paper instead of the larger DQN network.
  • The current code uses Adam as its optimizer, and no major instability was observed during training; thus, the clipping of the Advantage function, Log Probabilities, and Critic Loss applied by the original (RMSProp-based) code, shown below, was not required (a sketch of the unclipped SIL losses follows this list):
# Clipping from the original (RMSProp-based) implementation:
clipped_nlogp = tf.stop_gradient(tf.minimum(nlogp, self.max_nlogp) - nlogp) + nlogp  # cap the negative log-probability
adv = tf.stop_gradient(tf.clip_by_value(self.R - tf.squeeze(model_vf), 0.0, clip))  # clip the SIL advantage (R - V)+
delta = tf.clip_by_value(v_estimate - v_target, -self.clip, 0) * mask  # clip the critic error


  • The learning rate differs from the one used in the paper.
  • One caveat of the current code is that it cannot be executed on Colab: due to an unknown issue, all of the available RAM gets consumed despite lowering the replay buffer size as much as possible. The whole training was therefore done on paperspace.com without any RAM-related problems.
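
For illustration, here is a minimal sketch of the unclipped SIL losses as they could be written in TF2 with Adam; the tensor names, shapes, and helper function are assumptions rather than the repository's exact code. It mirrors L_sil_policy + β_sil * L_sil_value above, with β_sil = 0.01 taken from the hyperparameter table:

import tensorflow as tf

sil_optimizer = tf.keras.optimizers.Adam(learning_rate=2.5e-4)

def sil_loss(log_probs, values, returns, critic_coeff=0.01):
    # log_probs: log π(a|s) of the stored actions, shape (batch,)
    # values:    V(s) predicted by the critic, shape (batch,)
    # returns:   discounted returns R from the replay buffer, shape (batch,)
    advantage = tf.stop_gradient(tf.maximum(returns - values, 0.0))  # (R - V)+, no upper clip
    policy_loss = -tf.reduce_mean(log_probs * advantage)
    value_error = tf.maximum(returns - values, 0.0)  # critic error, no clipping either
    value_loss = 0.5 * tf.reduce_mean(tf.square(value_error))
    return policy_loss + critic_coeff * value_loss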

Table of Hyperparameters

Hyperparameters used for Pong and Freeway environments.

Parameter                        Value
lr                               2.5e-4
alpha (Exponent for PER)         0.6
Beta (Bias Correction for PER)   0.4
entropy coefficient for A2C      0.01
critic coefficient for A2C       0.5
max grad norm                    0.5
num of SIL updates               4
SIL batch size                   512
critic coefficient for SIL       0.01
entropy coefficient for SIL      0
num parallel workers             8
rollout length                   80 ÷ num parallel workers (= 10 steps per worker)
memory size                      1e+5

Dependencies

  • comet_ml == 3.15.3
  • gym == 0.17.3
  • numpy == 1.19.2
  • opencv_contrib_python == 4.4.0.44
  • psutil == 5.5.1
  • tensorflow == 2.6.0
  • tensorflow_probability == 0.13.0
  • tqdm == 4.50.0

Usage

How To Run

usage: main.py [-h] [--env_name ENV_NAME]
               [--total_iterations TOTAL_ITERATIONS] [--mem_size MEM_SIZE]
               [--interval INTERVAL] [--do_test] [--render]
               [--train_from_scratch]

Variable parameters based on the configuration of the machine or user's choice

optional arguments:
  -h, --help            show this help message and exit
  --env_name ENV_NAME   Name of the environment.
  --total_iterations TOTAL_ITERATIONS The total number of iterations.
  --mem_size MEM_SIZE   The SIL's memory size.
  --interval INTERVAL   The interval specifies how often different parameters should be saved and printed, counted by iterations.
  --do_test             The flag determines whether to train the agent or play with it.
  --render              The flag determines whether to render each agent or not.
  --train_from_scratch  The flag determines whether to train from scratch or continue previous tries.
  • In order to train the agent with default arguments, execute the following command:
python3 main.py --train_from_scratch
  • If you want to continue training from a previous run, execute the following (i.e., the previous command without the --train_from_scratch flag):
python3 main.py

Pre-Trained Weights

  • Pre-trained weights of the agents shown playing in the Results section are available. If you want to test them yourself, please do the following:
  1. Choose your desired environments' weights from Results/Weights/env_name/*
  2. Create a folder named Models in the root directory of the project and make sure it is empty.
  3. Create another folder with an arbitrary name inside the Models folder. For example:
mkdir Models/ Models/temp_folder
  4. Put your weights.h5 and stats.json files in your temp_folder.
  5. Run the above command with the --do_test flag:
python3 main.py --do_test

Tested Environments

  • PongNoFrameskip-v4
  • FreewayNoFrameskip-v4
  • MontezumaRevengeNoFrameskip-v4

Structure

.
├── Brain
│   ├── brain.py
│   ├── experience_replay.py
│   ├── __init__.py
│   ├── model.py
│   └── segment_tree.py
├── Common
│   ├── config.py
│   ├── __init__.py
│   ├── logger.py
│   ├── play.py
│   ├── runner.py
│   └── utils.py
├── LICENSE
├── main.py
├── README.md
├── requirements.txt
└── Results
    ├── Gifs
    │   └── Freeway
    │       └── Freeway.gif
    ├── Plots
    │   └── Freeway
    │       ├── Max_Episode_Reward.svg
    │       ├── Running_Entropy.svg
    │       ├── Running_Explained_Variance.svg
    │       ├── Running_Grad_Norm.svg
    │       ├── Running_last_10_Reward.svg
    │       ├── Running_PG_Loss.svg
    │       ├── Running_Total_Episode_Reward.svg
    │       └── Running_Value_Loss.svg
    └── Weights
        └── Freeway
            ├── stats.json
            └── weights.h5
  1. Brain package includes the neural network structure, the core of the agent's decision-making paradigm, and the agent's experience replay.
  2. Common includes utility code that is shared by most RL projects and handles auxiliary tasks such as logging, wrapping Atari environments, and so on.
  3. Results is the directory where the GIFs, the plot images used in this README, and the pre-trained weights are stored.

References

  1. Self-Imitation Learning, Oh et al., 2018 (https://arxiv.org/abs/1806.05635)
  2. Asynchronous Methods for Deep Reinforcement Learning, Mnih et al., 2016 (https://arxiv.org/abs/1602.01783)

Acknowledgement

  1. self-imitation-learning by @junhyukoh
  2. self-imitation-learning-pytorch by @TianhongDai
  3. a2c by @OpenAI

Results (Cont'd)

[Training plots] In all of the following plots, the blue curve uses 0.1 as the value of the Bias Correction (beta) for Prioritized Experience Replay. The x-axis of the first two plots corresponds to episode numbers; the x-axis of the remaining plots corresponds to iteration numbers.
