germain-hug / deep-rl-keras

Keras Implementation of popular Deep RL Algorithms (A3C, DDQN, DDPG, Dueling DDQN)

Python 100.00%

a3c reinforcement-learning keras gym openai policy-gradient ddpg keras-rl dqn ddqn

deep-rl-keras's Issues

Adanet?

Hi all, it seems that TensorFlow-derived projects could increasingly benefit from each other. For example, tensorflow/AdaNet AutoML could be used to find the best TF model, and training could be sped up on Colab TPUs by using the built-in function that converts tf.keras models to TPU-optimized ones. Would it be possible to convert the project to tf.keras so that it can be used with AdaNet and reap the above benefits?

Save/Load/resume training?

Hello and thanks for sharing your code!
Can you please let me know if there is a way to save model state/weights in order to test later or resume training?

TF 2.0?

Hi, any plans to update to Tensorflow 2 (and preferably tf.keras while at it)?

Quick question about DDPG

Thanks for your very clear code. I was reading through it, but couldn't understand one key step regarding training the agent:

Here is the code:

        # Apply Bellman Equation on batch samples to train our DDQN
        q = self.agent.predict(s)
        next_q = self.agent.predict(new_s)
        q_targ = self.agent.target_predict(new_s)

        for i in range(s.shape[0]):
            old_q = q[i, a[i]]
            if d[i]:
                q[i, a[i]] = r[i]
            else:
                next_best_action = np.argmax(next_q[i,:])
                q[i, a[i]] = r[i] + self.gamma * q_targ[i, next_best_action]
            if(self.with_per):
                # Update PER Sum Tree
                self.buffer.update(idx[i], abs(old_q - q[i, a[i]]))
        # Train on batch
        self.agent.fit(s, q)
        # Decay epsilon
        self.epsilon *= self.epsilon_decay

From my understanding, the Q function maps (state, action) pairs to rewards. The code above assumes that the Agent network returns Q values. However, a quick inspection of the Agent model reveals that it actually returns actions, which makes sense since that's what the agent has to learn. Rewards are calculated within env.step(action). So then, in the Bellman equation you add r[i] + self.gamma * q_targ[i, next_best_action].
Isn't q_targ[i, next_best_action] an action? If so, how can you add it to a reward?

I am sure I am not seeing some detail in the code that makes this all works. Would you mind clarifying it for me? Thanks.
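
One way to check the reading of the snippet is to run the same update on made-up numbers. The sketch below assumes the network outputs one Q value per discrete action, with shape `(batch, n_actions)`, which is what the indexing `q[i, a[i]]` implies. Under that reading, `q_targ[i, next_best_action]` is a scalar value estimate rather than an action. The arrays and the helper name `double_dqn_targets` are invented for illustration and are not part of the repository.

```python
import numpy as np

def double_dqn_targets(q, next_q, q_targ, a, r, d, gamma=0.99):
    """Return a copy of q where the taken action's entry holds its training target."""
    targets = q.copy()
    for i in range(q.shape[0]):
        if d[i]:
            # Terminal transition: the target is just the reward
            targets[i, a[i]] = r[i]
        else:
            # Online network picks the action, target network evaluates it
            next_best_action = np.argmax(next_q[i, :])
            targets[i, a[i]] = r[i] + gamma * q_targ[i, next_best_action]
    return targets

# Made-up batch of two transitions with two actions each
q = np.array([[1.0, 2.0], [0.5, 0.1]])        # online network on s
next_q = np.array([[0.2, 0.9], [0.4, 0.3]])   # online network on new_s
q_targ = np.array([[3.0, 4.0], [5.0, 6.0]])   # target network on new_s
a = np.array([0, 1])
r = np.array([1.0, -1.0])
d = np.array([False, True])

targets = double_dqn_targets(q, next_q, q_targ, a, r, d, gamma=0.5)
print(targets)
```

Only the entry for the action that was taken changes in each row; the other entries keep the network's own prediction, so they contribute no error when the network is fitted on `targets`.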

A3C issues

Hi, I noticed that A3C is having 2 issues:

CTRL-C won't stop the script. Had to kill my CMD process.
At the end, while rendering the test results it spits out:

(36env) c:\Users\user\py\Deep-RL-Keras>main.py --type A3C --env CartPole-v0 --nb_episodes 10000 --n_threads 2
Using TensorFlow backend.
Score: 40.0: : 5050 episodes [00:04, 1259.48 episodes/s]
Traceback (most recent call last):
File "C:\Users\user\py\Deep-RL-Keras\main.py", line 114, in <module>
main()
File "C:\Users\user\py\Deep-RL-Keras\main.py", line 107, in main
a = algo.policy_action(old_state)
File "C:\Users\user\py\Deep-RL-Keras\A3C\a3c.py", line 61, in policy_action
return np.random.choice(np.arange(self.act_dim), 1, p=self.actor.predict(s).ravel())[0]
File "C:\Users\user\py\Deep-RL-Keras\A3C\agent.py", line 22, in predict
return self.model.predict(self.reshape(inp))
File "C:\Users\user\py\36env\lib\site-packages\keras\engine\training.py", line 1149, in predict
x, _, _ = self._standardize_user_data(x)
File "C:\Users\user\py\36env\lib\site-packages\keras\engine\training.py", line 751, in _standardize_user_data
exception_prefix='input')
File "C:\Users\user\py\36env\lib\site-packages\keras\engine\training_utils.py", line 128, in standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape (4, 4)
Exception ignored in: <bound method Viewer.__del__ of <gym.envs.classic_control.rendering.Viewer object at 0x0000022E46D94BA8>>
Traceback (most recent call last):
File "c:\users\user\py\gym\gym\envs\classic_control\rendering.py", line 143, in __del__
File "c:\users\user\py\gym\gym\envs\classic_control\rendering.py", line 62, in close
File "C:\Users\user\py\36env\lib\site-packages\pyglet\window\win32\__init__.py", line 305, in close
File "C:\Users\user\py\36env\lib\site-packages\pyglet\window\__init__.py", line 770, in close
ImportError: sys.meta_path is None, Python is likely shutting down

DQN: batch_shape = (None,) + tuple(shape) TypeError: 'int' object is not iterable

Hi everyone,
I modified the DQN algorithm in this repository to a multi-agent DQN approach for a wireless network environment. Actually, I wrote this code inspired by a repository on GitHub. Although the original code works well, when I change the environment, the following error occurs.
Traceback (most recent call last):
File "D:/main -DQN.py", line 452, in <module>
main()
File "D:/main -DQN.py", line 432, in main
algo = DQN( args) # n_clusters is the action dimension in DQN
File "D:/main -DQN.py", line 158, in __init__
self.agent = Agent( args, self.tau)
File "D:/main -DQN.py", line 246, in __init__
self.model = self.network()
File "D:/main -DQN.py", line 254, in network
inp = Input((self.state_dim))
File "C:\Users\AppData\Roaming\Python\Python37\site-packages\keras\engine\topology.py", line 1451, in Input
batch_shape = (None,) + tuple(shape)
TypeError: 'int' object is not iterable
The complete code is as follows:
`
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
import pandas as pd
import numpy as np
import sys
import os
import copy, json, argparse
from numpy import pi
from random import random, uniform, choices, randint, sample, randrange
import random
import math
from tqdm import tqdm
import keras.backend as K
from keras.optimizers import Adam
from keras.models import Model
from keras.layers import Dense, Flatten, Input
from collections import deque

class Environ:

def __init__(self, args):
    self.args=args
    self.state_dim= (self.args.A, )
    self.action_dim=args.C
    self.bs = complex((500 / 2), (500/ 2))
    self.S=(np.zeros(self.args.A)).reshape(-1)

def Location(self):
    rx = uniform(0, 500)
    ry = uniform(0, 500)
    Loc = complex(rx, ry)
    return Loc

def PathGain(self,Loc):
    d = abs(Loc- self.bs)
    d=d  **(-3)
    u = np.random.rand(1, 1)
    sigma = 1
    x = sigma * np.sqrt(-2 * np.log(u))
    h=  d* x
    return h

def reset(self):  # Reset the states
    s=np.zeros(self.args.A)
    return s.reshape(-1)

def RecievePower(self,UsersLoc):
    H=self.PathGain(UsersLoc)
    UsersRecievePower=self.args.P*H
    return UsersRecievePower

def TotalRate(self, actionRB_i,actionRB):
    interference = self.args.Noise
    Loc_i=self.Location()
    for j in range(self.args.A):
        if actionRB_i ==actionRB[j] :
            Loc_j = self.Location()
            RecievePower_j = self.RecievePower(Loc_j)
            interference = interference + RecievePower_j
        else:
            interference= interference
    RecievePower_i = self.RecievePower(Loc_i)
    SINR = interference / (interference-RecievePower_i)
    Rate =self.args.BW*( np.log2( SINR))
    return Rate

def computeQoS(self,actionRB,actionRB_i):
    TotalRate=self.TotalRate(actionRB,actionRB_i)
    if TotalRate >=self.args.Rmin:
        QoS=1.0
    else:
        QoS=0.0
    return QoS

def ComputeState(self,actionRB):
    QoS=np.zeros(self.args.A)
    for i in range(self.args.A):
        actionRB_i=actionRB[i]
        QoS[i] = self.computeQoS(actionRB,actionRB_i)
    S = np.zeros( self.args.A)
    for i in range(self.args.A):
        S[i]=QoS[i]
    self.S=S
    return self.S.reshape(-1)

def Reward(self,actionRB,actionRB_i):
    Rate = np.zeros(self.args.A)
    Satisfied_Users = 0
    for i in range(self.args.A):
        Rate[i] = self.TotalRate(actionRB, actionRB_i)
        Satisfied_Users = Satisfied_Users + self.computeQoS(actionRB)
    TotalRate = 0.0
    TotalPower = self.args.circuitPower
    for i in range(self.args.A):
        TotalRate = TotalRate + Rate[i]
        TotalPower = TotalPower + self.args.P
    if Satisfied_Users == self.args.A:
        reward = TotalRate / TotalPower
    else:
        reward = self.args.negative_cost
    return reward

def step(self,actionRB):
    next_s = self.ComputeState(actionRB)
    r = self.Reward(actionRB)
    done = False
    info = None
    return next_s, r, done, info

class Environment(object):

def __init__(self, gym_env, action_repeat):
    self.env = gym_env
    self.timespan = action_repeat
    self.gym_actions = 2  # range(gym_env.action_space.n)
    self.state_buffer = deque()

def get_action_size(self):
    return self.env.action_dim

def get_state_size(self):
    return self.env.state_dim

def reset(self):
    # Clear the state buffer
    self.state_buffer = deque()
    x_t = self.env.reset()
    s_t = np.stack([x_t for i in range(self.timespan)], axis=0)
    for i in range(self.timespan - 1):
        self.state_buffer.append(x_t)
    return s_t

def step(self, action):
    x_t1, r_t, terminal, info = self.env.step(action)
    previous_states = np.array(self.state_buffer)
    s_t1 = np.empty((self.timespan, *self.env.state_dim))
    s_t1[:self.timespan - 1, :] = previous_states
    s_t1[self.timespan - 1] = x_t1
    # Pop the oldest frame, add the current frame to the queue
    self.state_buffer.popleft()
    self.state_buffer.append(x_t1)
    return s_t1, r_t, terminal, info

def render(self):
    return self.env.render()

class DQN:
def __init__(self, args):
    # Environment and DQN parameters
    self.args=args
    self.action_dim = self.args.C
    self.state_dim = self.args.A
    self.buffer_size = self.args.capacity
    # Memory Buffer for Experience Replay
    self.buffer = MemoryBuffer(self.buffer_size)
    self.epsilon=self.args.eps
    self.tau = 1.0
    self.agent = Agent( args, self.tau)

def policy_action(self, s):
    if random() <= self.epsilon:
        return randrange(self.action_dim)
    else:
        return np.argmax(self.agent.predict(s)[0])

def train_agent(self):
    # Sample experience from memory buffer
    s, a, r, d, new_s, idx = self.buffer.sample_batch(self.batch_size)
    # Apply Bellman Equation on batch samples to train our DQN
    q  = self.agent.predict(s)
    next_q  = self.agent.predict(new_s)
    q_targ  = self.agent.target_predict(new_s)
    for i in range(s.shape[0]):
        if d[i]:
            q[i, a[i]] = r[i]
        else:
            next_best_action = np.argmax(next_q[i, :])
            q[i, a[i]] = r[i] + self.args.gamma * q_targ[i, next_best_action]
    # Train on batch
    self.agent.fit(s, q)
    # Decay epsilon
    self.epsilon *= self.args.eps_decay

def train(self, env, args, summary_writer):
    results = []
    tqdm_e = tqdm(range(self.args.nepisodes), desc='Score', leave=True, unit=" episodes")
    for e in tqdm_e:
        # Reset episode
        time, cumul_reward, done = 0, 0, False
        old_state = env.reset()

        while not done:
            # if args.render:
            #     env.render()
            # Actor picks an action (following the policy)
            a=[]
            for i in range(self.args.A):
                a[i]= self.policy_action(old_state)

            # Retrieve new state, reward, and whether the state is terminal
            new_state, r, done, _ = env.step(a)
            # Memorize for experience replay
            self.memorize(old_state, a, r, done, new_state)
            # Update current state
            old_state = new_state
            cumul_reward += r
            time += 1
            # Train DDQN and transfer weights to target network
            if(self.buffer.size() > args.batch_size):
                self.train_agent(self.args.batch_size)
                self.agent.transfer_weights()
       # Gather stats every episode for plotting
        if(args.gather_stats):
            mean, stdev = gather_stats(self, env)
            results.append([e, mean, stdev])

        # Export results for Tensorboard
        score = tfSummary('score', cumul_reward)
        summary_writer.add_summary(score, global_step=e)
        summary_writer.flush()

        # Display score
        tqdm_e.set_description("Score: " + str(cumul_reward))
        tqdm_e.refresh()

    return results

def memorize(self, state, action, reward, done, new_state):
    self.buffer.memorize(state, action, reward, done, new_state)

def save_weights(self, path):
    path += '_LR_{}'.format(self.args.learningrate)
    self.agent.save(path)

def load_weights(self, path):
    self.agent.load_weights(path)

class Agent:
def __init__(self, args, tau):
    self.args=args
    self.state_dim = self.args.A
    self.action_dim = self.args.C
    self.tau = tau
    self.lr=self.args.learningrate
    # Initialize Deep Q-Network
    self.model = self.network()
    self.model.compile(Adam(self.lr), 'mse')
    # Build target Q-Network
    self.target_model = self.network()
    self.target_model.compile(Adam(self.lr), 'mse')
    self.target_model.set_weights(self.model.get_weights())

def network(self):
    inp = Input((self.state_dim))

    if(len(self.state_dim) > 2):
        inp = Input((self.state_dim[1:]))
        x = conv_block(inp, 32, (2, 2), 8)
        x = conv_block(x, 64, (2, 2), 4)
        x = conv_block(x, 64, (2, 2), 3)
        x = Flatten()(x)
        x = Dense(256, activation='relu')(x)
    else:
        x = Flatten()(inp)
        x = Dense(64, activation='relu')(x)
        x = Dense(64, activation='relu')(x)

    x = Dense(self.action_dim, activation='linear')(x)
    return Model(inp, x)

def transfer_weights(self):
    W = self.model.get_weights()
    tgt_W = self.target_model.get_weights()
    for i in range(len(W)):
    #  updated based on Polyak averaging method
        tgt_W[i] = self.tau * W[i] + (1 - self.tau) * tgt_W[i]
    self.target_model.set_weights(tgt_W)

def fit(self, inp, targ):
    self.model.fit(self.reshape(inp), targ, epochs=1, verbose=0)

def predict(self, inp):
    return self.model.predict(self.reshape(inp))

def target_predict(self, inp):
    return self.target_model.predict(self.reshape(inp))

def reshape(self, x):
    if len(x.shape) < 4 and len(self.state_dim) > 2:
        return np.expand_dims(x, axis=-1)
    elif len(x.shape) < 3:
        return np.expand_dims(x, axis=-1)
    else:
        return x

def save(self, path):
    self.model.save_weights(path + '.h5')

def load_weights(self, path):
    self.model.load_weights(path)

class MemoryBuffer(object):
def __init__(self, buffer_size):
    # Standard Buffer
    self.buffer = deque()
    self.count = 0
    self.buffer_size = buffer_size

def memorize(self, state, action, reward, done, new_state):
    experience = (state, action, reward, done, new_state)
    # Check if buffer is already full
    if self.count < self.buffer_size:
        self.buffer.append(experience)
        self.count += 1
    else:
        self.buffer.popleft()
        self.buffer.append(experience)

def size(self):
    return self.count

def sample_batch(self, batch_size):
    batch = []
    if self.count < batch_size:
        idx = None
        batch = random.sample(self.buffer, self.count)
    else:
        idx = None
        batch = random.sample(self.buffer, batch_size)

    # Return a batch of experience
    s_batch = np.array([i[0] for i in batch])
    a_batch = np.array([i[1] for i in batch])
    r_batch = np.array([i[2] for i in batch])
    d_batch = np.array([i[3] for i in batch])
    new_s_batch = np.array([i[4] for i in batch])
    return s_batch, a_batch, r_batch, d_batch, new_s_batch, idx

def update(self, idx):
    self.buffer.update(idx)

def clear(self):
    self.buffer = deque()
    self.count = 0

def get_session():
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    return tf.Session(config=config)

def tfSummary(tag, val):
    return tf.Summary(value=[tf.Summary.Value(tag=tag, simple_value=val)])

def gather_stats(agent, env):
    score = []
    for k in range(10):
        old_state = env.reset()
        cumul_r, done = 0, False
        while not done:
            a = agent.policy_action(old_state)
            old_state, r, done, _ = env.step(a)
            cumul_r += r
        score.append(cumul_r)
    return np.mean(np.array(score)), np.std(np.array(score))

def conv_block(inp, d=3, pool_size=(2, 2), k=3):
    conv = conv_layer(d, k)(inp)
    return MaxPooling2D(pool_size=pool_size)(conv)

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

def parse_args(args):
    parser = argparse.ArgumentParser(description='Training parameters')
    #
    parser.add_argument('--out_dir', type=str, default='experiments', help="Name of the output directory")
    parser.add_argument('--consecutive_frames', type=int, default=2, help="Number of consecutive frames (action repeat)")
    parser.add_argument('--gather_stats', dest='gather_stats', action='store_true', help="Compute Average reward per episode (slower)")
    parser.add_argument('--A', type=int, default='10', help="The number of agents")
    parser.add_argument('--C', type=int, default='30', help="The number of Resources")
    parser.add_argument('--Noise', type=float, default='0.00000000000001', help="The background noise")
    parser.add_argument('--BW', type=int, default='180000', help="The bandwidth")
    parser.add_argument('--Rmin', type=int, default='1000000', help="Agents' QoS")
    parser.add_argument('--P', type=float, default='0.01', help="Agents' transmit power")
    parser.add_argument('--circuitPower', type=float, default='0.05', help="The circuit Power")
    parser.add_argument('--negative_cost', type=float, default='-1.0', help="The negative cost")
    parser.add_argument('--capacity', type=int, default='500', help="Capacity of Replay Buffer")
    parser.add_argument('--learningrate', type=float, default='0.01', help="The learning rate")
    parser.add_argument('--eps', type=float, default='0.8', help="The epsilon")
    parser.add_argument('--eps_decay', type=float, default='0.99', help="The epsilon decay")
    parser.add_argument('--eps_increment', type=float, default='0.003', help="The epsilon increment")
    parser.add_argument('--batch_size', type=int, default='8', help="The batch size")
    parser.add_argument('--gamma', type=float, default='0.99', help="The discount factor")
    parser.add_argument('--nepisodes', type=int, default='500', help="The number of episodes")
    parser.add_argument('--nsteps', type=int, default='500', help="The number of steps")
    parser.add_argument('--env', type=str, default='Environ', help="Wireless environment")
    parser.add_argument('--gpu', type=str, default="", help='GPU ID')

    args=parser.parse_args(args)

    parser.set_defaults(render=False)
    return args

def main(args=None):
    # Parse arguments
    if args is None:
        args = sys.argv[1:]
    args = parse_args(args)
    # Check if a GPU ID was set
    if args.gpu:
        os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu
    set_session(get_session())

    summary_writer = tf.summary.FileWriter("/tensorboard_" + args.env)
    # Initialize the wireless environment
    users_env = Environ(args)
    # print(users_env)

    # Wrap the environment to use consecutive frames
    env = Environment(users_env, args.consecutive_frames)
    env.reset()

    # Define parameters for the DDQN and DDPG algorithms
    state_dim = env.get_state_size()
    action_dim = users_env.action_dim
    # The maximum and minimum values for precoding vectors
    # act_range = 1
    # act_min = 0

    # Initialize the DQN algorithm for the clustering optimization
    algo = DQN( args)  # n_clusters is the action dimension in DQN
    # if args.step == "train":
    # Train
    stats = algo.train(env, args, summary_writer)
    # Export results to CSV
    if(args.gather_stats):
        df = pd.DataFrame(np.array(stats))
        df.to_csv(args.out_dir + "/logs.csv", header=['Episode', 'Mean', 'Stddev'], float_format='%10.5f')
        # df.to_csv(args.type + "/logs.csv", header=['Episode', 'Mean', 'Stddev'], float_format='%10.5f')

    # Save weights and close environments
    exp_dir = '{}/models_A_{}_C_{}_Rmin_{}/'.format(args.out_dir, args.A, args.C, args.Rmin)
    # exp_dir = '{}/models/'.format(args.type)
    if not os.path.exists(exp_dir):
        os.makedirs(exp_dir)
    # Save DDQN
    export_path = '{}_{}_NB_EP_{}_BS_{}'.format(exp_dir, "DQN", args.nepisodes, args.batch_size)
    algo.save_weights(export_path)

if __name__ == "__main__":
    main()
`
Thanks in advance for your help.
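
The last frame of the traceback can be reproduced without Keras. This is a minimal sketch, assuming the cause is that `state_dim` is a plain `int`: the pasted `Agent` sets `self.state_dim = self.args.A`, and `(self.state_dim)` without a trailing comma is not a tuple. The function `batch_shape_for` is a hypothetical stand-in that only mirrors the failing line shown in the traceback.

```python
def batch_shape_for(shape):
    # Mirrors the line from the traceback: batch_shape = (None,) + tuple(shape)
    return (None,) + tuple(shape)

A = 10  # number of agents, matching the --A default in the pasted parser

# (A) is just the int 10; parentheses alone do not create a tuple
try:
    batch_shape_for((A))
    int_shape_error = None
except TypeError as exc:
    int_shape_error = str(exc)

# (A,) is a one-element tuple, which tuple(shape) can iterate over
ok_shape = batch_shape_for((A,))

print(int_shape_error)
print(ok_shape)
```

If that is the cause, `len(self.state_dim)` in `network()` would fail for the same reason, so passing a tuple such as the `(self.args.A, )` that `Environ` already builds is one thing to try.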

Visualizing learning of DDPG with tensorboard

Has anyone come up with a way to visualize the training of a DDPG agent with TensorBoard?
It's a bit tricky because DDPG does not make use of model.fit, which is what Keras' TensorBoard callback relies on.

Why we still need to define the action space

Hi, I don't understand why we still need to define the action space, as it is supposed to be infinite when using DDPG.
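
One common reason, sketched below with invented numbers: a continuous action space has infinitely many actions, but the environment still only accepts values inside a bounded range. The actor's squashed output is therefore typically rescaled to that range, and exploration noise is clipped back into it. The helper `scale_and_clip` and the bound `act_range` are illustrative assumptions, not code taken from the repository.

```python
import math

def scale_and_clip(raw_output, noise, act_range):
    """Map an unbounded actor output into [-act_range, act_range] after adding noise."""
    squashed = math.tanh(raw_output)        # squash to (-1, 1)
    action = squashed * act_range + noise   # rescale to the environment's range
    return max(-act_range, min(act_range, action))

a1 = scale_and_clip(raw_output=0.0, noise=0.0, act_range=2.0)
a2 = scale_and_clip(raw_output=10.0, noise=0.5, act_range=2.0)
print(a1, a2)
```

Without knowing `act_range`, neither the rescaling nor the clipping could be done, which is why the bounds are still needed even though the actions are not enumerated.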

requirements

Hi. I want to learn more about RL and I think the examples in this repo are great, but I am having some trouble with the versions of the requirements. Can anyone provide a requirements.txt file?
Thank you

DDQN.py function memorize: incorrect Q values ?

Comparing with the [PER paper](https://arxiv.org/pdf/1511.05952.pdf):

Algorithm 1: line 11
TD error
delta(j) = Reward(j) + gamma(j) * Q_target(S_j, arg max_a Q(S_j, a)) - Q(S_j-1, A_j-1)

If I am not mistaken, the j-1 subscript refers to the current state in the implementation, i.e. state, action, reward, and done all refer to j-1, and new_state refers to j.

Then line 125 in ddqn.py takes the arg max over state rather than over new_state:
q_val = self.agent.predict(state)
next_best_action = np.argmax(q_val)

should be

q_val = self.agent.predict(new_state)
next_best_action = np.argmax(q_val)

Can not run DDQN with PER

Hi,

Thanks for this great implementation.
I encountered an error when I tried to run "python3 main.py --type DDQN --env CartPole-v1 --batch_size 64 --with_PER"

The error is:
/utils/memory_buffer.py", line 69, in sample_batch
batch.append((*data, idx))
TypeError: 'int' object is not iterable

Has anyone fixed this issue?
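
The failing expression can be exercised in isolation. The sketch below assumes, as a guess that is not confirmed from the repository, that `data` is sometimes an `int` (for example an uninitialised slot in the sum tree) instead of a stored transition tuple. The helper `append_with_index` and the placeholder transition are invented for illustration. The exact wording of the `TypeError` differs between Python versions, so only the exception type is checked.

```python
def append_with_index(batch, data, idx):
    # Same expression as the failing line: unpack the transition, then add its index
    batch.append((*data, idx))

batch = []
# Placeholder transition: (state, action, reward, done, new_state)
transition = ("s", 1, 0.5, False, "s2")
append_with_index(batch, transition, 7)

# An int in place of a transition cannot be unpacked with *
raised = False
try:
    append_with_index(batch, 0, 8)
except TypeError:
    raised = True

print(batch, raised)
```

If this is what happens, checking the type of `data` before unpacking, or sampling only from slots that have been filled, would be places to start looking.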

DDPG - LunarLanderContinuous

Great work on the implementation - very clean code and easy to follow

I have been running the LunarLanderContinuous Environment

python main.py --type DDPG --env LunarLanderContinuous-v2 --render

I have not been able to get it to converge. I have been running for >4000 episodes, but I have not seen any improvement. The score bounces around between -320 and -480, and the lander is clearly not making progress.

I am using a newer version of Keras (2.2.4) - I had errors initially, but was able to resolve them using the comments from @zynk13 (#2)
All parameters (lr, network structure) are the same as the original code

Has anyone been able to achieve good results with the lunar lander?

Thanks

A2C optimizer get_updates

I am confused. Does get_updates apply the gradient to the trainable variables directly, or does it just return the gradient?

Can’t actor-critic method solve mountain car environment?

I tried your code on the mountain car environment with A2C. It showed no progress (the score was always -200) no matter how long it was trained. However, this problem can be easily solved by the DQN or DDQN algorithm.

Actually, I used my own program on mountain car and encountered the same problem, which is why I started to study your code. Can't an actor-critic method solve the mountain car environment? If not, do you know the reason?

A3C issue when using environments with <= 2 dims.

    def buildNetwork(self):
        """ Assemble shared layers
        """
        inp = Input((self.env_dim))
        # If we have an image, apply convolutional layers
        if(len(self.env_dim) > 2):
            x = Reshape((self.env_dim[1], self.env_dim[2], -1))(inp)
            x = conv_block(x, 32, (2, 2))
            x = conv_block(x, 32, (2, 2))
            x = Flatten()(x)
        else:
            x = Flatten()(inp)
            x = Dense(64, activation='relu')(x)
            x = Dense(128, activation='relu')(x)
        return Model(inp, x)

This won't work if len(self.env_dim) <= 2.
A similar error is raised: ValueError: Input 0 is incompatible with layer flatten_5: expected min_ndim=3, found ndim=2.

I am using an environment in which each state is a 1D array with 28 elements. The Input of the network does not take the number of training samples into account, so I have to define self.env_dim = (28,), which does not work when followed by a Flatten layer. My solution was to remove the Flatten layer.

Axes don't match array

Thank you so much for the great effort of creating this... I had this issue when I try to load the model using load_and_run.py script

The command
python3 load_and_run.py --type A3C --actor_path 'A3C/models/A3C_ENV_CartPole-v1_NB_EP_10000_BS_64_LR_0.0001_actor.h5' --critic_path 'A3C/models/A3C_ENV_CartPole-v1_NB_EP_10000_BS_64_LR_0.0001_critic.h5'

Error :
File "/Deep-RL-Keras-master/venv/lib/python3.6/site-packages/keras/engine/topology.py", line 3152, in preprocess_weights_for_loading
weights[0] = np.transpose(weights[0], (3, 2, 0, 1))
File "<array_function internals>", line 6, in transpose
File "/Deep-RL-Keras-master/venv/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 653, in transpose
return _wrapfunc(a, 'transpose', axes)
File "/Deep-RL-Keras-master/venv/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 58, in _wrapfunc
return bound(*args, **kwds)
ValueError: axes don't match array

Trouble with A3C and Breakout

Hi, I was trying to use A3C with the game 'breakout', but an error came up:

my command : python3 main.py --type A3C --env BreakoutNoFrameskip-v4 --is_atari --nb_episodes 10000 --n_threads 4
the result:

Score: 0%| | 0/10000 [00:00<?, ? episodes/s]Exception in thread Thread-3:
Traceback (most recent call last):
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/thread.py", line 24, in training_thread
a = agent.policy_action(np.expand_dims(old_state, axis=0))
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/a3c.py", line 58, in policy_action
return np.random.choice(np.arange(self.act_dim), 1, p=self.actor.predict(s).ravel())[0]
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/agent.py", line 22, in predict
return self.model.predict(self.reshape(inp))
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 1817, in predict
check_batch_axis=False)
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 113, in _standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking : expected input_1 to have 5 dimensions, but got array with shape (1, 84, 84, 4)

Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/thread.py", line 24, in training_thread
a = agent.policy_action(np.expand_dims(old_state, axis=0))
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/a3c.py", line 58, in policy_action
return np.random.choice(np.arange(self.act_dim), 1, p=self.actor.predict(s).ravel())[0]
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/agent.py", line 22, in predict
return self.model.predict(self.reshape(inp))
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 1817, in predict
check_batch_axis=False)
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 113, in _standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking : expected input_1 to have 5 dimensions, but got array with shape (1, 84, 84, 4)

Exception in thread Thread-5:
Traceback (most recent call last):
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/thread.py", line 24, in training_thread
a = agent.policy_action(np.expand_dims(old_state, axis=0))
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/a3c.py", line 58, in policy_action
return np.random.choice(np.arange(self.act_dim), 1, p=self.actor.predict(s).ravel())[0]
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/agent.py", line 22, in predict
return self.model.predict(self.reshape(inp))
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 1817, in predict
check_batch_axis=False)
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 113, in _standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking : expected input_1 to have 5 dimensions, but got array with shape (1, 84, 84, 4)

Exception in thread Thread-6:
Traceback (most recent call last):
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/thread.py", line 24, in training_thread
a = agent.policy_action(np.expand_dims(old_state, axis=0))
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/a3c.py", line 58, in policy_action
return np.random.choice(np.arange(self.act_dim), 1, p=self.actor.predict(s).ravel())[0]
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/agent.py", line 22, in predict
return self.model.predict(self.reshape(inp))
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 1817, in predict
check_batch_axis=False)
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 113, in _standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking : expected input_1 to have 5 dimensions, but got array with shape (1, 84, 84, 4)

Traceback (most recent call last):
File "main.py", line 115, in <module>
main()
File "main.py", line 108, in main
a = algo.policy_action(old_state)
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/a3c.py", line 58, in policy_action
return np.random.choice(np.arange(self.act_dim), 1, p=self.actor.predict(s).ravel())[0]
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/agent.py", line 22, in predict
return self.model.predict(self.reshape(inp))
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 1817, in predict
check_batch_axis=False)
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 113, in _standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking : expected input_1 to have 5 dimensions, but got array with shape (1, 84, 84, 4)

Thank you!

Testing A3C

I tried testing a model trained with A3C on the CartPole-v1 environment. However, I get the following error:
"ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape (4, 4)"
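A possible cause, sketched under assumptions: the message suggests the network input is 3-D, presumably (batch, frames, state_dim), while a single stacked CartPole state is 2-D with shape (4, 4). The helper `add_batch_axis` below is hypothetical and not part of the repository. It only shows how a leading batch axis could be added with NumPy before calling `predict`.

```python
import numpy as np

def add_batch_axis(state, expected_ndim=3):
    """Prepend axes until the array has `expected_ndim` dimensions.

    Hypothetical helper for illustration; expected_ndim=3 mirrors the
    "expected input_1 to have 3 dimensions" message quoted above.
    """
    state = np.asarray(state)
    while state.ndim < expected_ndim:
        state = np.expand_dims(state, axis=0)
    return state

# Assumed layout: 4 stacked frames, each a 4-dimensional CartPole observation.
stacked_state = np.zeros((4, 4), dtype=np.float32)
batched = add_batch_axis(stacked_state)
print(batched.shape)  # (1, 4, 4)
```

An input that already has the batch axis passes through unchanged, so the helper can be called unconditionally.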

Actor update equation in DDPG

Hi,

When you use the gradient of the critic to update the actor here, why do you pass "-action_gdts" rather than "action_gdts" as the third parameter of tf.gradients()? Where does the minus sign come from?

I double-checked the formula and I still don't see why this is the case in your code.

Thanks!
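One common reading of such a minus sign, offered as a sketch and not as the author's confirmed reasoning: the third argument of tf.gradients() (grad_ys) weights the output gradients, so passing -dQ/da yields the gradient of -Q with respect to the actor weights. An optimizer that minimizes then performs gradient ascent on Q. The toy below reproduces this with NumPy only; the linear actor, the quadratic critic, and all names are hypothetical.

```python
import numpy as np

# Toy setup (hypothetical): scalar linear actor mu(s) = theta * s and a known
# critic Q(s, a) = -(a - target)**2, which is maximised at a == target.
def q_value(action, target=2.0):
    return -(action - target) ** 2

def dq_da(action, target=2.0):
    return -2.0 * (action - target)

theta, state, lr = 0.0, 1.0, 0.1
q_before = q_value(theta * state)

for _ in range(50):
    action = theta * state
    dmu_dtheta = state
    # Analogue of tf.gradients(mu, theta, grad_ys=-dQ/da): gradient of -Q.
    grad_of_neg_q = -dq_da(action) * dmu_dtheta
    # A minimising optimizer steps against the gradient it is handed,
    # which here means stepping uphill on Q.
    theta -= lr * grad_of_neg_q

q_after = q_value(theta * state)
print(q_before, q_after)
```

Dropping the minus sign in `grad_of_neg_q` makes the same loop move theta away from the target, which is the behaviour the sign is there to prevent.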

Mistake in prioritised replay?

Hello again,

FYI: I think you might have defined the TD error incorrectly in "Deep-RL-Keras/DDQN/ddqn.py". On line 125 you have

"""
q_val = self.agent.predict(new_state) ## I think the argument should be 'state' here
q_val_t = self.agent.target_predict(new_state)
next_best_action = np.argmax(q_val)
new_val = reward + self.gamma * q_val_t[0, next_best_action]
td_error = abs(new_val - q_val)[0]
"""

But I think the correct definition is

td_error = abs( Q(s,a) - yi )
with yi = ri + gamma*max( Q(s', a') )
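As a minimal sketch of the definition proposed above, the function below computes |Q(s,a) - yi| for a single transition. It keeps the Double DQN target from the quoted code (action chosen by the online network, value taken from the target network) instead of a plain max, and it takes Q(s, .) from the current state. The function name and argument layout are assumptions for illustration.

```python
import numpy as np

def td_error(q_s, q_next_online, q_next_target, action, reward, done, gamma=0.99):
    """Return |Q(s, a) - y| for one transition (hypothetical helper).

    q_s           : online Q-values for the current state s
    q_next_online : online Q-values for the next state s'
    q_next_target : target-network Q-values for s'
    """
    if done:
        target = reward
    else:
        # Double DQN: the online network picks the action, the target network scores it.
        best_next = int(np.argmax(q_next_online))
        target = reward + gamma * q_next_target[best_next]
    return abs(q_s[action] - target)

error = td_error(
    q_s=np.array([1.0, 2.0]),
    q_next_online=np.array([0.5, 1.5]),
    q_next_target=np.array([3.0, 4.0]),
    action=0, reward=1.0, done=False, gamma=0.5,
)
print(error)  # 2.0
```

The result is a scalar per transition, which is the form a priority value needs.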

[Help] Can this DDQN run the 'Breakout' environment?

MsPacman-v0?

python main.py --type A2C --env MsPacman-v0

"ValueError: Error when checking input: expected input_1 to have 5 dimensions, but got array with shape (4, 210, 160, 3)"

A2C can't run NoFrameskip-v4?

Using TensorFlow backend.
WARNING:tensorflow:From /root/Deep-RL-Keras/utils/networks.py:8: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /root/Deep-RL-Keras/utils/networks.py:10: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From main.py:62: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

WARNING:tensorflow:From /root/miniconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:68: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /root/miniconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:508: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /root/miniconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:3837: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

Score: 0%| | 0/5000 [00:00<?, ? episodes/s]
Traceback (most recent call last):
File "main.py", line 118, in <module>
main()
File "main.py", line 96, in main
stats = algo.train(env, args, summary_writer)
File "/root/Deep-RL-Keras/A2C/a2c.py", line 87, in train
a = self.policy_action(old_state)
File "/root/Deep-RL-Keras/A2C/a2c.py", line 47, in policy_action
return np.random.choice(np.arange(self.act_dim), 1, p=self.actor.predict(s).ravel())[0]
File "/root/Deep-RL-Keras/A2C/agent.py", line 21, in predict
return self.model.predict(self.reshape(inp))
File "/root/miniconda3/lib/python3.7/site-packages/keras/engine/training.py", line 1817, in predict
check_batch_axis=False)
File "/root/miniconda3/lib/python3.7/site-packages/keras/engine/training.py", line 113, in _standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking : expected input_1 to have 5 dimensions, but got array with shape (4, 210, 160, 3)
Score: 0%|

ResourceExhaustedError

Is there a workaround for the ResourceExhaustedError?

This is what happens when I run main.py with a custom env:

Traceback (most recent call last):
  File "main.py", line 125, in <module>
    main()
  File "main.py", line 103, in main
    stats = algo.train(env, args, summary_writer)
  File "[...]\Deep-RL-Keras\A2C\a2c.py", line 100, in train
    self.train_models(states, actions, rewards, done)
  File "[...]\Deep-RL-Keras\A2C\a2c.py", line 67, in train_models
    self.c_opt([states, discounted_rewards])
  File "[...]\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "[...]\lib\site-packages\keras\backend\tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "[...]\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
    run_metadata_ptr)
  File "[...]\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[177581,177581] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
         [[{{node sub_17}} = Sub[T=DT_FLOAT, _class=["loc:@gradients_1/sub_17_grad/Reshape_1"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_Placeholder_2_0_1, dense_6/BiasAdd)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
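One possible cause, offered as a guess and not a diagnosis: the failing node is a subtraction between a placeholder and `dense_6/BiasAdd`, and the tensor is square with side 177581. That is the pattern broadcasting produces when a vector of shape (N,) is subtracted from a column of shape (N, 1). NumPy follows the same broadcasting rules, so it can illustrate the effect. The array names are hypothetical.

```python
import numpy as np

n = 5  # stands in for the 177581 samples reported in the log above

discounted_rewards = np.arange(n, dtype=np.float32)   # shape (n,)
critic_output = np.zeros((n, 1), dtype=np.float32)    # shape (n, 1)

# (n,) minus (n, 1) broadcasts to an (n, n) matrix instead of n differences.
broadcast_diff = discounted_rewards - critic_output

# Flattening one side (or reshaping the other to (n, 1)) keeps n elements.
aligned_diff = discounted_rewards - critic_output.ravel()

print(broadcast_diff.shape, aligned_diff.shape)  # (5, 5) (5,)
```

With N = 177581, an N x N float32 tensor needs roughly 126 GB, which would be consistent with the OOM. Checking the shapes of both operands in `train_models` is a cheap first step.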

Necessity of a Custom optimizer for the Critic (A2C).

Hello Germain / Everyone,

I am currently trying to implement the A2C algorithm as part of a simulation for my PhD. Given that I have very limited time to do so, your source code is a great help, since the algorithm and operations are clearly outlined and not hidden away, as is the case in the OpenAI Baselines implementation.
Still, after having a look at the code in critic.py, I was wondering why you defined a custom optimizer for the critic as well (it is clearly justified for the actor), when simply compiling the critic network and passing MSE as the loss seems to have the same effect. Is there something I am missing here?
Anyway, that was just a thought, nothing game-changing. Thanks a lot for sharing these implementations.

DDPG

Hey, great work on the implementations! I tried using your DDPG implementation on another environment (BeerGame) and am getting the following error:

Traceback (most recent call last):
File "ddpg.py", line 191, in <module>
main()
File "ddpg.py", line 183, in main
stats = distributor.train(env, args, summary_writer)
File "ddpg.py", line 130, in train
self.update_models(states, actions, critic_target)
File "ddpg.py", line 73, in update_models
self.actor.train(states, actions, np.array(grads).reshape((-1, self.act_dim)))
File "/Users/aravind/Desktop/DDPG/ddpg_actor.py", line 72, in train
self.adam_optimizer([states, grads])
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2666, in __call__
return self._call(inputs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2635, in _call
session)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2587, in _make_callable
callable_fn = session._make_callable_from_options(callable_opts)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1414, in _make_callable_from_options
return BaseSession._Callable(self, callable_options)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1368, in __init__
session._session, options_ptr, status)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.OutOfRangeError: Node 'Adam' (type: 'NoOp', num of outputs: 0) does not have output 0
Exception ignored in: <bound method BaseSession._Callable.__del__ of <tensorflow.python.client.session.BaseSession._Callable object at 0x119e7f630>>
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1398, in __del__
self._session._session, self._handle, status)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: No such callable handle: 140324910323680

I only made a few modifications to your code to suit my environment, such as the state and action dimensions. The error occurs during training, in the Adam optimizer, when the action gradients from the critic are propagated to the actor network. Have you encountered any similar errors in your implementation?

Q-value updating problem in DDQN

Hello,

I think there is a mistake in the Q-value update of your DDQN code (ddqn.py). Should np.argmax(next_q[0,:]) be np.argmax(next_q[i,:]) on line 64? As written, the next-best action is always chosen from the first sample's st+1, so every <st, at, st+1, rt> transition in the minibatch is updated with the same next state's action, which doesn't make sense.

        # Apply Bellman Equation on batch samples to train our DDQN
        q = self.agent.predict(s)
        next_q = self.agent.predict(new_s)
        q_targ = self.agent.target_predict(new_s)

        for i in range(s.shape[0]):
            old_q = q[i, a[i]]
            if d[i]:
                q[i, a[i]] = r[i]
            else:
                next_best_action = np.argmax(next_q[0,:]) # problematic, might be np.argmax(next_q[i,:])
                q[i, a[i]] = r[i] + self.gamma * q_targ[i, next_best_action]
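For comparison, here is a minimal vectorized sketch of the per-sample version suggested above, in which each transition uses the argmax over its own row of next_q. It assumes q, next_q and q_targ are NumPy arrays of shape (batch, n_actions). The function name `ddqn_targets` is hypothetical and not part of the repository.

```python
import numpy as np

def ddqn_targets(q, next_q, q_targ, a, r, d, gamma=0.99):
    """Return a copy of q with Double DQN targets written at the taken actions."""
    q = q.copy()
    idx = np.arange(q.shape[0])
    # Per-sample action selection by the online network (row i, not row 0).
    next_best_action = np.argmax(next_q, axis=1)
    # Per-sample evaluation of that action by the target network.
    bootstrap = q_targ[idx, next_best_action]
    # Terminal transitions keep only the reward.
    not_done = 1.0 - np.asarray(d, dtype=np.float64)
    q[idx, a] = r + gamma * bootstrap * not_done
    return q

q = np.zeros((2, 2))
next_q = np.array([[1.0, 0.0], [0.0, 1.0]])
q_targ = np.array([[10.0, 20.0], [30.0, 40.0]])
targets = ddqn_targets(q, next_q, q_targ, a=np.array([0, 1]),
                       r=np.array([1.0, 1.0]), d=np.array([False, False]), gamma=0.5)
print(targets)  # [[ 6.  0.] [ 0. 21.]]
```

With the index fixed at 0, the second row would bootstrap from action 0 (value 30) and give 16 where the per-sample version gives 21, which shows how the two differ.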

Question about A2C

Hi there, thanks for sharing your code -- it's been very helpful!

One question: is your implementation of A2C a 'genuine' actor-critic method? My (limited) understanding was that to qualify as an actor-critic method, there needs to be temporal difference learning: you learn from each (S,A,R,S') transition, as opposed to executing a full episode and then learning. I'm following the logic in Sutton's book, the relevant part of which I'm quoting below.

Anyway -- I'm just curious, and thanks again!

The textbook can be found at http://incompleteideas.net/book/the-book-2nd.html
The quote is from page 331.

"Although the REINFORCE-with-baseline method learns both a policy and a state-value
function, we do not consider it to be an actor–critic method because its state-value function
is used only as a baseline, not as a critic. That is, it is not used for bootstrapping (updating
the value estimate for a state from the estimated values of subsequent states), but only
as a baseline for the state whose estimate is being updated. This is a useful distinction,
for only through bootstrapping do we introduce bias and an asymptotic dependence
on the quality of the function approximation. As we have seen, the bias introduced
through bootstrapping and reliance on the state representation is often beneficial because
it reduces variance and accelerates learning. REINFORCE with baseline is unbiased
and will converge asymptotically to a local minimum, but like all Monte Carlo methods
it tends to learn slowly (produce estimates of high variance) and to be inconvenient
to implement online or for continuing problems. As we have seen earlier in this book,
with temporal-difference methods we can eliminate these inconveniences, and through
multi-step methods we can flexibly choose the degree of bootstrapping. In order to gain
these advantages in the case of policy gradient methods we use actor–critic methods with
a bootstrapping critic."
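The distinction drawn in the quote can be made concrete with a small sketch. The first function computes full-episode Monte Carlo returns, which need the episode to finish. The second computes one-step bootstrapped targets r + gamma * V(s'), which need only a single transition and a value estimate. Both function names and the sample numbers are hypothetical.

```python
import numpy as np

def monte_carlo_returns(rewards, gamma):
    """Discounted return from each step to the end of the episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def td0_targets(rewards, next_values, dones, gamma):
    """One-step bootstrapped targets r + gamma * V(s'), with V(s') = 0 at episode end."""
    rewards = np.asarray(rewards, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    not_done = 1.0 - np.asarray(dones, dtype=float)
    return rewards + gamma * next_values * not_done

rewards = [1.0, 1.0, 1.0]
mc = monte_carlo_returns(rewards, gamma=0.5)
td = td0_targets(rewards, next_values=[2.0, 4.0, 0.0],
                 dones=[False, False, True], gamma=0.5)
print(mc)  # [1.75 1.5  1.  ]
print(td)  # [2. 3. 1.]
```

Using `mc - V(s)` as the advantage corresponds to the baseline usage described in the quote, while `td - V(s)` is the bootstrapped form the quote associates with a critic.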

[Error] Fetch data for Adam

Hi,

When I try to run the ddpg code as follows:

python3 main.py --type DDPG --env LunarLanderContinuous-v2

an error happens in Deep-RL-Keras/DDPG/actor.py at line 63, in the train function
self.adam_optimizer([states, grads]):

File "./.venv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1441, in __init__
session._session, options_ptr, status)
File "./.venv/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Tried to fetch data for 'Adam', which produces no output. To run to a node but not fetch any data, pass 'Adam' as an argument to the 'target_node_names' argument of the Session::Run API.
Exception ignored in: <bound method BaseSession._Callable.__del__ of <tensorflow.python.client.session.BaseSession._Callable object at 0x7f830c0528d0>>
Traceback (most recent call last):
File "./.venv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1467, in __del__
self._session._session, self._handle, status)
File "./.venv/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: No such callable handle: 140198369660512

Have you encountered this issue before?

A3C: Issue with keras.function in optimizer

When the optimizer function of the actor is called, I get the following error:

Traceback (most recent call last):
File "/home/alcon/Desktop/DRL-Python/A3CMain.py", line 41, in <module>
A3C(action_dim, env_dim, args.consecutive_frames)
File "/home/alcon/Desktop/DRL-Python/A3C/a3cNetwork.py", line 33, in __init__
self.a_opt = self.actor.optimizer()
File "/home/alcon/Desktop/DRL-Python/A3C/actor.py", line 39, in optimizer
return K.function(inputs=[self.model.input, self.action_pl, self.advantages_pl], outputs=[], updates=updates)
File "/home/alcon/Desktop/DRL-Python/venv/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3479, in function
return GraphExecutionFunction(inputs, outputs, updates=updates, **kwargs)
File "/home/alcon/Desktop/DRL-Python/venv/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3142, in __init__
with ops.control_dependencies([self.outputs[0]]):
IndexError: list index out of range

I found code on the internet that uses it, but for some reason it doesn't work for me.

I tried it with TensorFlow versions from 1.15 to 2.0.0, and tried importing each of the following:
tensorflow.keras.backend
tensorflow.compat.v1.keras.backend
tensorflow.compat.v2.keras.backend
but none of them worked.

Unable to run examples

E.g., when running python3 main.py --type A2C --env CartPole-v1:

Many libraries are missing (opencv, pandas, tensorflow). I installed some versions, but they seem incompatible; Python 3.7 also seems incompatible (see below).
with TF 2.3.0: AttributeError: module 'tensorflow' has no attribute 'ConfigProto'
after downgrade to TF 2.0.0: RuntimeError: dictionary changed size during iteration

You put so much effort into developing this project, but something as simple as a setup.py with exact library versions is missing.