germain-hug / deep-rl-keras
Keras Implementation of popular Deep RL Algorithms (A3C, DDQN, DDPG, Dueling DDQN)
Hello and thanks for sharing your code!
Can you please let me know if there is a way to save model state/weights in order to test later or resume training?
Hi, any plans to update to Tensorflow 2 (and preferably tf.keras while at it)?
Thanks for your very clear code. I was reading through it, but couldn't understand one key step regarding training the agent:
Here is the code:
# Apply Bellman Equation on batch samples to train our DDQN
q = self.agent.predict(s)
next_q = self.agent.predict(new_s)
q_targ = self.agent.target_predict(new_s)
for i in range(s.shape[0]):
    old_q = q[i, a[i]]
    if d[i]:
        q[i, a[i]] = r[i]
    else:
        next_best_action = np.argmax(next_q[i,:])
        q[i, a[i]] = r[i] + self.gamma * q_targ[i, next_best_action]
    if(self.with_per):
        # Update PER Sum Tree
        self.buffer.update(idx[i], abs(old_q - q[i, a[i]]))
# Train on batch
self.agent.fit(s, q)
# Decay epsilon
self.epsilon *= self.epsilon_decay
From my understanding, the Q function maps (state, action) pairs to rewards. The code above assumes that the Agent networks return Q-values. However, a quick inspection of the Agent model suggests that it actually returns actions, which makes sense since that is what the agent has to learn. Rewards are calculated within env.step(action). Then, in the Bellman equation, you add r[i] + self.gamma * q_targ[i, next_best_action].
Isn't q_targ[i, next_best_action] an action? If so, how can you add it to a reward?
I am sure I am not seeing some detail in the code that makes this all works. Would you mind clarifying it for me? Thanks.
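One way to read the snippet, offered as a sketch rather than as the author's explanation: if the network ends in a layer with one linear output per action, then `predict` returns a row of Q-values, and the action only appears as the index that `np.argmax` picks out of that row. The function and the sample numbers below are hypothetical, and plain Python lists stand in for NumPy arrays to keep the example dependency-free.

```python
def argmax(values):
    """Index of the largest entry in a list of numbers."""
    return max(range(len(values)), key=lambda i: values[i])


def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Bellman target for one transition (hypothetical helper).

    next_q_online and next_q_target each hold one Q-value per action for
    the next state, as predicted by the online and target networks.
    """
    if done:
        # Terminal transition: there is no next state to bootstrap from.
        return reward
    # The online network selects the action (an integer index)...
    best_action = argmax(next_q_online)
    # ...and the target network supplies that action's value (a float).
    return reward + gamma * next_q_target[best_action]


# Made-up network outputs for a 2-action task.
next_q_online = [0.5, 1.5]   # online net prefers action 1
next_q_target = [2.0, 1.0]   # target net's value estimate per action
target = double_dqn_target(1.0, False, next_q_online, next_q_target, gamma=0.5)
terminal_target = double_dqn_target(1.0, True, next_q_online, next_q_target, gamma=0.5)
print(target, terminal_target)
```

Under this reading, `q_targ[i, next_best_action]` is a value estimate, so adding it to a reward is dimensionally consistent.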
Hi, I noticed that A3C is having 2 issues:
(36env) c:\Users\user\py\Deep-RL-Keras>main.py --type A3C --env CartPole-v0 --nb_episodes 10000 --n_threads 2
Using TensorFlow backend.
Score: 40.0: : 5050 episodes [00:04, 1259.48 episodes/s]
Traceback (most recent call last):
  File "C:\Users\user\py\Deep-RL-Keras\main.py", line 114, in <module>
    main()
  File "C:\Users\user\py\Deep-RL-Keras\main.py", line 107, in main
    a = algo.policy_action(old_state)
  File "C:\Users\user\py\Deep-RL-Keras\A3C\a3c.py", line 61, in policy_action
    return np.random.choice(np.arange(self.act_dim), 1, p=self.actor.predict(s).ravel())[0]
  File "C:\Users\user\py\Deep-RL-Keras\A3C\agent.py", line 22, in predict
    return self.model.predict(self.reshape(inp))
  File "C:\Users\user\py\36env\lib\site-packages\keras\engine\training.py", line 1149, in predict
    x, _, _ = self._standardize_user_data(x)
  File "C:\Users\user\py\36env\lib\site-packages\keras\engine\training.py", line 751, in _standardize_user_data
    exception_prefix='input')
  File "C:\Users\user\py\36env\lib\site-packages\keras\engine\training_utils.py", line 128, in standardize_input_data
    'with shape ' + str(data_shape))
ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape (4, 4)
Exception ignored in: <bound method Viewer.__del__ of <gym.envs.classic_control.rendering.Viewer object at 0x0000022E46D94BA8>>
Traceback (most recent call last):
  File "c:\users\user\py\gym\gym\envs\classic_control\rendering.py", line 143, in __del__
  File "c:\users\user\py\gym\gym\envs\classic_control\rendering.py", line 62, in close
  File "C:\Users\user\py\36env\lib\site-packages\pyglet\window\win32\__init__.py", line 305, in close
  File "C:\Users\user\py\36env\lib\site-packages\pyglet\window\__init__.py", line 770, in close
ImportError: sys.meta_path is None, Python is likely shutting down
Hi everyone,
I modified the DQN algorithm in this repository to a multi-agent DQN approach for a wireless network environment. Actually, I wrote this code inspired by a repository on GitHub. Although the original code works well, when I change the environment, the following error occurs.
Traceback (most recent call last):
  File "D:/main -DQN.py", line 452, in <module>
    main()
  File "D:/main -DQN.py", line 432, in main
    algo = DQN( args) # n_clusters is the action dimension in DQN
  File "D:/main -DQN.py", line 158, in __init__
    self.agent = Agent( args, self.tau)
  File "D:/main -DQN.py", line 246, in __init__
    self.model = self.network()
  File "D:/main -DQN.py", line 254, in network
    inp = Input((self.state_dim))
  File "C:\Users\AppData\Roaming\Python\Python37\site-packages\keras\engine\topology.py", line 1451, in Input
    batch_shape = (None,) + tuple(shape)
TypeError: 'int' object is not iterable
The complete code is as follows:
`
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
import pandas as pd
import numpy as np
import sys
import os
import copy, json, argparse
from numpy import pi
from random import random, uniform, choices, randint, sample, randrange
import random
import math
from tqdm import tqdm
import keras.backend as K
from keras.optimizers import Adam
from keras.models import Model
from keras.layers import Dense, Flatten, Input
from collections import deque
class Environ:
    def __init__(self, args):
        self.args=args
        self.state_dim= (self.args.A, )
        self.action_dim=args.C
        self.bs = complex((500 / 2), (500/ 2))
        self.S=(np.zeros(self.args.A)).reshape(-1)

    def Location(self):
        rx = uniform(0, 500)
        ry = uniform(0, 500)
        Loc = complex(rx, ry)
        return Loc

    def PathGain(self,Loc):
        d = abs(Loc- self.bs)
        d=d **(-3)
        u = np.random.rand(1, 1)
        sigma = 1
        x = sigma * np.sqrt(-2 * np.log(u))
        h= d* x
        return h

    def reset(self): # Reset the states
        s=np.zeros(self.args.A)
        return s.reshape(-1)

    def RecievePower(self,UsersLoc):
        H=self.PathGain(UsersLoc)
        UsersRecievePower=self.args.P*H
        return UsersRecievePower

    def TotalRate(self, actionRB_i,actionRB):
        interference = self.args.Noise
        Loc_i=self.Location()
        for j in range(self.args.A):
            if actionRB_i ==actionRB[j] :
                Loc_j = self.Location()
                RecievePower_j = self.RecievePower(Loc_j)
                interference = interference + RecievePower_j
            else:
                interference= interference
        RecievePower_i = self.RecievePower(Loc_i)
        SINR = interference / (interference-RecievePower_i)
        Rate =self.args.BW*( np.log2( SINR))
        return Rate

    def computeQoS(self,actionRB,actionRB_i):
        TotalRate=self.TotalRate(actionRB,actionRB_i)
        if TotalRate >=self.args.Rmin:
            QoS=1.0
        else:
            QoS=0.0
        return QoS

    def ComputeState(self,actionRB):
        QoS=np.zeros(self.args.A)
        for i in range(self.args.A):
            actionRB_i=actionRB[i]
            QoS[i] = self.computeQoS(actionRB,actionRB_i)
        S = np.zeros( self.args.A)
        for i in range(self.args.A):
            S[i]=QoS[i]
        self.S=S
        return self.S.reshape(-1)

    def Reward(self,actionRB,actionRB_i):
        Rate = np.zeros(self.args.A)
        Satisfied_Users = 0
        for i in range(self.args.A):
            Rate[i] = self.TotalRate(actionRB, actionRB_i)
            Satisfied_Users = Satisfied_Users + self.computeQoS(actionRB)
        TotalRate = 0.0
        TotalPower = self.args.circuitPower
        for i in range(self.args.A):
            TotalRate = TotalRate + Rate[i]
            TotalPower = TotalPower + self.args.P
        if Satisfied_Users == self.args.A:
            reward = TotalRate / TotalPower
        else:
            reward = self.args.negative_cost
        return reward

    def step(self,actionRB):
        next_s = self.ComputeState(actionRB)
        r = self.Reward(actionRB)
        done = False
        info = None
        return next_s, r, done, info

class Environment(object):
    def __init__(self, gym_env, action_repeat):
        self.env = gym_env
        self.timespan = action_repeat
        self.gym_actions = 2 # range(gym_env.action_space.n)
        self.state_buffer = deque()

    def get_action_size(self):
        return self.env.action_dim

    def get_state_size(self):
        return self.env.state_dim

    def reset(self):
        # Clear the state buffer
        self.state_buffer = deque()
        x_t = self.env.reset()
        s_t = np.stack([x_t for i in range(self.timespan)], axis=0)
        for i in range(self.timespan - 1):
            self.state_buffer.append(x_t)
        return s_t

    def step(self, action):
        x_t1, r_t, terminal, info = self.env.step(action)
        previous_states = np.array(self.state_buffer)
        s_t1 = np.empty((self.timespan, *self.env.state_dim))
        s_t1[:self.timespan - 1, :] = previous_states
        s_t1[self.timespan - 1] = x_t1
        # Pop the oldest frame, add the current frame to the queue
        self.state_buffer.popleft()
        self.state_buffer.append(x_t1)
        return s_t1, r_t, terminal, info

    def render(self):
        return self.env.render()

class DQN:
    def __init__(self, args):
        # Environment and DQN parameters
        self.args=args
        self.action_dim = self.args.C
        self.state_dim = self.args.A
        self.buffer_size = self.args.capacity
        # Memory Buffer for Experience Replay
        self.buffer = MemoryBuffer(self.buffer_size)
        self.epsilon=self.args.eps
        self.tau = 1.0
        self.agent = Agent( args, self.tau)

    def policy_action(self, s):
        if random() <= self.epsilon:
            return randrange(self.action_dim)
        else:
            return np.argmax(self.agent.predict(s)[0])

    def train_agent(self):
        # Sample experience from memory buffer
        s, a, r, d, new_s, idx = self.buffer.sample_batch(self.batch_size)
        # Apply Bellman Equation on batch samples to train our DQN
        q = self.agent.predict(s)
        next_q = self.agent.predict(new_s)
        q_targ = self.agent.target_predict(new_s)
        for i in range(s.shape[0]):
            if d[i]:
                q[i, a[i]] = r[i]
            else:
                next_best_action = np.argmax(next_q[i, :])
                q[i, a[i]] = r[i] + self.args.gamma * q_targ[i, next_best_action]
        # Train on batch
        self.agent.fit(s, q)
        # Decay epsilon
        self.epsilon *= self.args.eps_decay

    def train(self, env, args, summary_writer):
        results = []
        tqdm_e = tqdm(range(self.args.nepisodes), desc='Score', leave=True, unit=" episodes")
        for e in tqdm_e:
            # Reset episode
            time, cumul_reward, done = 0, 0, False
            old_state = env.reset()
            while not done:
                # if args.render:
                #     env.render()
                # Actor picks an action (following the policy)
                a=[]
                for i in range(self.args.A):
                    a[i]= self.policy_action(old_state)
                # Retrieve new state, reward, and whether the state is terminal
                new_state, r, done, _ = env.step(a)
                # Memorize for experience replay
                self.memorize(old_state, a, r, done, new_state)
                # Update current state
                old_state = new_state
                cumul_reward += r
                time += 1
                # Train DDQN and transfer weights to target network
                if(self.buffer.size() > args.batch_size):
                    self.train_agent(self.args.batch_size)
                    self.agent.transfer_weights()
            # Gather stats every episode for plotting
            if(args.gather_stats):
                mean, stdev = gather_stats(self, env)
                results.append([e, mean, stdev])
            # Export results for Tensorboard
            score = tfSummary('score', cumul_reward)
            summary_writer.add_summary(score, global_step=e)
            summary_writer.flush()
            # Display score
            tqdm_e.set_description("Score: " + str(cumul_reward))
            tqdm_e.refresh()
        return results

    def memorize(self, state, action, reward, done, new_state):
        self.buffer.memorize(state, action, reward, done, new_state)

    def save_weights(self, path):
        path += '_LR_{}'.format(self.args.learningrate)
        self.agent.save(path)

    def load_weights(self, path):
        self.agent.load_weights(path)

class Agent:
    def __init__(self, args, tau):
        self.args=args
        self.state_dim = self.args.A
        self.action_dim = self.args.C
        self.tau = tau
        self.lr=self.args.learningrate
        # Initialize Deep Q-Network
        self.model = self.network()
        self.model.compile(Adam(self.lr), 'mse')
        # Build target Q-Network
        self.target_model = self.network()
        self.target_model.compile(Adam(self.lr), 'mse')
        self.target_model.set_weights(self.model.get_weights())

    def network(self):
        inp = Input((self.state_dim))
        if(len(self.state_dim) > 2):
            inp = Input((self.state_dim[1:]))
            x = conv_block(inp, 32, (2, 2), 8)
            x = conv_block(x, 64, (2, 2), 4)
            x = conv_block(x, 64, (2, 2), 3)
            x = Flatten()(x)
            x = Dense(256, activation='relu')(x)
        else:
            x = Flatten()(inp)
            x = Dense(64, activation='relu')(x)
            x = Dense(64, activation='relu')(x)
        x = Dense(self.action_dim, activation='linear')(x)
        return Model(inp, x)

    def transfer_weights(self):
        W = self.model.get_weights()
        tgt_W = self.target_model.get_weights()
        for i in range(len(W)):
            # updated based on Polyak averaging method
            tgt_W[i] = self.tau * W[i] + (1 - self.tau) * tgt_W[i]
        self.target_model.set_weights(tgt_W)

    def fit(self, inp, targ):
        self.model.fit(self.reshape(inp), targ, epochs=1, verbose=0)

    def predict(self, inp):
        return self.model.predict(self.reshape(inp))

    def target_predict(self, inp):
        return self.target_model.predict(self.reshape(inp))

    def reshape(self, x):
        if len(x.shape) < 4 and len(self.state_dim) > 2:
            return np.expand_dims(x, axis=-1)
        elif len(x.shape) < 3:
            return np.expand_dims(x, axis=-1)
        else:
            return x

    def save(self, path):
        self.model.save_weights(path + '.h5')

    def load_weights(self, path):
        self.model.load_weights(path)

class MemoryBuffer(object):
    def __init__(self, buffer_size):
        # Standard Buffer
        self.buffer = deque()
        self.count = 0
        self.buffer_size = buffer_size

    def memorize(self, state, action, reward, done, new_state):
        experience = (state, action, reward, done, new_state)
        # Check if buffer is already full
        if self.count < self.buffer_size:
            self.buffer.append(experience)
            self.count += 1
        else:
            self.buffer.popleft()
            self.buffer.append(experience)

    def size(self):
        return self.count

    def sample_batch(self, batch_size):
        batch = []
        if self.count < batch_size:
            idx = None
            batch = random.sample(self.buffer, self.count)
        else:
            idx = None
            batch = random.sample(self.buffer, batch_size)
        # Return a batch of experience
        s_batch = np.array([i[0] for i in batch])
        a_batch = np.array([i[1] for i in batch])
        r_batch = np.array([i[2] for i in batch])
        d_batch = np.array([i[3] for i in batch])
        new_s_batch = np.array([i[4] for i in batch])
        return s_batch, a_batch, r_batch, d_batch, new_s_batch, idx

    def update(self, idx):
        self.buffer.update(idx)

    def clear(self):
        self.buffer = deque()
        self.count = 0

def get_session():
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    return tf.Session(config=config)

def tfSummary(tag, val):
    return tf.Summary(value=[tf.Summary.Value(tag=tag, simple_value=val)])

def gather_stats(agent, env):
    score = []
    for k in range(10):
        old_state = env.reset()
        cumul_r, done = 0, False
        while not done:
            a = agent.policy_action(old_state)
            old_state, r, done, _ = env.step(a)
            cumul_r += r
        score.append(cumul_r)
    return np.mean(np.array(score)), np.std(np.array(score))

def conv_block(inp, d=3, pool_size=(2, 2), k=3):
    conv = conv_layer(d, k)(inp)
    return MaxPooling2D(pool_size=pool_size)(conv)

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

def parse_args(args):
    parser = argparse.ArgumentParser(description='Training parameters')
    #
    parser.add_argument('--out_dir', type=str, default='experiments', help="Name of the output directory")
    parser.add_argument('--consecutive_frames', type=int, default=2, help="Number of consecutive frames (action repeat)")
    parser.add_argument('--gather_stats', dest='gather_stats', action='store_true', help="Compute Average reward per episode (slower)")
    parser.add_argument('--A', type=int, default='10', help="The number of agents")
    parser.add_argument('--C', type=int, default='30', help="The number of Resources")
    parser.add_argument('--Noise', type=float, default='0.00000000000001', help="The background noise")
    parser.add_argument('--BW', type=int, default='180000', help="The bandwidth")
    parser.add_argument('--Rmin', type=int, default='1000000', help="Agents' QoS")
    parser.add_argument('--P', type=float, default='0.01', help="Agents' transmit power")
    parser.add_argument('--circuitPower', type=float, default='0.05', help="The circuit Power")
    parser.add_argument('--negative_cost', type=float, default='-1.0', help="The negative cost")
    parser.add_argument('--capacity', type=int, default='500', help="Capacity of Replay Buffer")
    parser.add_argument('--learningrate', type=float, default='0.01', help="The learning rate")
    parser.add_argument('--eps', type=float, default='0.8', help="The epsilon")
    parser.add_argument('--eps_decay', type=float, default='0.99', help="The epsilon decay")
    parser.add_argument('--eps_increment', type=float, default='0.003', help="The epsilon increment")
    parser.add_argument('--batch_size', type=int, default='8', help="The batch size")
    parser.add_argument('--gamma', type=float, default='0.99', help="The discount factor")
    parser.add_argument('--nepisodes', type=int, default='500', help="The number of episodes")
    parser.add_argument('--nsteps', type=int, default='500', help="The number of steps")
    parser.add_argument('--env', type=str, default='Environ', help="Wireless environment")
    parser.add_argument('--gpu', type=str, default="", help='GPU ID')
    args=parser.parse_args(args)
    parser.set_defaults(render=False)
    return args

def main(args=None):
    # Parse arguments
    if args is None:
        args = sys.argv[1:]
    args = parse_args(args)
    # Check if a GPU ID was set
    if args.gpu:
        os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu
    set_session(get_session())
    summary_writer = tf.summary.FileWriter("/tensorboard_" + args.env)
    # Initialize the wireless environment
    users_env = Environ(args)
    # print(users_env)
    # Wrap the environment to use consecutive frames
    env = Environment(users_env, args.consecutive_frames)
    env.reset()
    # Define parameters for the DDQN and DDPG algorithms
    state_dim = env.get_state_size()
    action_dim = users_env.action_dim
    # The maximum and minimum values for precoding vectors
    # act_range = 1
    # act_min = 0
    # Initialize the DQN algorithm for the clustering optimization
    algo = DQN( args) # n_clusters is the action dimension in DQN
    # if args.step == "train":
    # Train
    stats = algo.train(env, args, summary_writer)
    # Export results to CSV
    if(args.gather_stats):
        df = pd.DataFrame(np.array(stats))
        df.to_csv(args.out_dir + "/logs.csv", header=['Episode', 'Mean', 'Stddev'], float_format='%10.5f')
        # df.to_csv(args.type + "/logs.csv", header=['Episode', 'Mean', 'Stddev'], float_format='%10.5f')
    # Save weights and close environments
    exp_dir = '{}/models_A_{}_C_{}_Rmin_{}/'.format(args.out_dir, args.A, args.C, args.Rmin)
    # exp_dir = '{}/models/'.format(args.type)
    if not os.path.exists(exp_dir):
        os.makedirs(exp_dir)
    # Save DDQN
    export_path = '{}_{}_NB_EP_{}_BS_{}'.format(exp_dir, "DQN", args.nepisodes, args.batch_size)
    algo.save_weights(export_path)

if __name__ == "__main__":
    main()
`
Thanks in advance for your help.
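A possible reading of the traceback, offered as a sketch and not as a verified fix: the failing line is `tuple(shape)` inside Keras' `Input`, and in the posted code `Agent` sets `self.state_dim = self.args.A`, which is a bare integer, whereas `Environ` uses the tuple `(self.args.A, )`. Writing `Input((self.state_dim))` does not help, because parentheses without a trailing comma do not create a tuple. The helper `as_shape` below is hypothetical and only illustrates the difference.

```python
def as_shape(state_dim):
    """Return state_dim as a tuple, wrapping a bare int as a 1-D shape.

    Hypothetical helper, not part of Keras or of the repository.
    """
    if isinstance(state_dim, int):
        return (state_dim,)
    return tuple(state_dim)


n_agents = 10                      # plays the role of args.A

# Parentheses alone leave the value an int...
still_an_int = (n_agents)
# ...while a trailing comma builds a one-element tuple.
one_d_shape = (n_agents,)

# Reproduce the error message from the traceback without needing Keras.
try:
    tuple(n_agents)
    message = None
except TypeError as exc:
    message = str(exc)

print(type(still_an_int).__name__, one_d_shape, message)
print(as_shape(n_agents), as_shape([2, 10]))
```

If this is indeed the cause, the later `len(self.state_dim)` checks in `network` and `reshape` would fail on an integer for the same reason. Since the `Environment` wrapper stacks consecutive frames, the shape the network actually receives may also need a leading frames axis; that part is an assumption about the intended design.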
Has anyone come up with a way to visualize the training of a DDPG agent with TensorBoard?
It's a bit tricky because the implementation does not make use of model.fit, which is what allows Keras' TensorBoard callback to be used.
Hi, I don't understand why we still need to define the action space, since it is supposed to be infinite when using DDPG.
Hi. I want to learn more about RL and I think the examples in this repo are great, but I am having some trouble with the versions of the requirements. Can anyone provide a requirements.txt file?
Thank you
Comparing the implementation with the [PER paper](https://arxiv.org/pdf/1511.05952.pdf), Algorithm 1, line 11, the TD error is:
delta(j) = Reward(j) + gamma(j) * Q_target(S_j, arg max_a Q(S_j, a)) - Q(S_j-1, A_j-1)
If I am not mistaken, the j-1 subscript refers to the current state in the implementation, i.e. state, action, reward and done all refer to j-1, and new_state refers to j.
So line 125 in ddqn.py takes the arg max over state instead of new_state:
q_val = self.agent.predict(state)
next_best_action = np.argmax(q_val)
should be
q_val = self.agent.predict(new_state)
next_best_action = np.argmax(q_val)
Hi,
Thanks for this great implementation.
I encountered an error when tried to run "python3 main.py --type DDQN --env CartPole-v1 --batch_size 64 --with_PER"
The error is:
/utils/memory_buffer.py", line 69, in sample_batch
batch.append((*data, idx))
TypeError: 'int' object is not iterable
Has anyone fixed this issue?
Great work on the implementation - very clean code and easy to follow
I have been running the LunarLanderContinuous environment:
python main.py --type DDPG --env LunarLanderContinuous-v2 --render
I have not been able to get it to converge. I have been running for more than 4000 episodes without seeing any improvement: the score bounces around between -320 and -480, and the lander is clearly not making progress.
I am using a newer version of Keras (2.2.4) - I had errors initially, but was able to resolve them using the comments from @zynk13 (#2)
All parameters (lr, network structure) are the same as the original code
Anyone able to achieve good results with the lunar lander?
Thanks
I am confused. Does get_updates apply the gradient to the trainable variables directly, or does it just return the gradient?
I tried your A2C code on the mountain car environment. It showed no progress (the score was always -200) no matter how long it was trained. However, this problem can be easily solved by the DQN or DDQN algorithm.
Actually I used my own program in mountain car and encountered the same problem. That is why I started to study your code. Can’t actor-critic method solve mountain car environment? If no, do you know the reason?
```
    def buildNetwork(self):
        """ Assemble shared layers
        """
        inp = Input((self.env_dim))
        # If we have an image, apply convolutional layers
        if(len(self.env_dim) > 2):
            x = Reshape((self.env_dim[1], self.env_dim[2], -1))(inp)
            x = conv_block(x, 32, (2, 2))
            x = conv_block(x, 32, (2, 2))
            x = Flatten()(x)
        else:
            x = Flatten()(inp)
            x = Dense(64, activation='relu')(x)
            x = Dense(128, activation='relu')(x)
        return Model(inp, x)
```
This won't work if len(self.env_dim) <= 2.
An error similar to this one is raised: ValueError: Input 0 is incompatible with layer flatten_5: expected min_ndim=3, found ndim=2.
I am using an environment in which each state is a 1-D array with 28 elements. The Input of the network does not take the number of training samples (the batch dimension) into account, so I have to define self.env_dim = (28,), which does not work when followed by a Flatten layer. My solution was to remove the Flatten layer.
Thank you so much for the great effort of creating this. I hit this issue when I tried to load the model using the load_and_run.py script.
The command
python3 load_and_run.py --type A3C --actor_path 'A3C/models/A3C_ENV_CartPole-v1_NB_EP_10000_BS_64_LR_0.0001_actor.h5' --critic_path 'A3C/models/A3C_ENV_CartPole-v1_NB_EP_10000_BS_64_LR_0.0001_critic.h5'
Error :
File "/Deep-RL-Keras-master/venv/lib/python3.6/site-packages/keras/engine/topology.py", line 3152, in preprocess_weights_for_loading
weights[0] = np.transpose(weights[0], (3, 2, 0, 1))
File "<array_function internals>", line 6, in transpose
File "/Deep-RL-Keras-master/venv/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 653, in transpose
return _wrapfunc(a, 'transpose', axes)
File "/Deep-RL-Keras-master/venv/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 58, in _wrapfunc
return bound(*args, **kwds)
ValueError: axes don't match array
Hi, I was trying to use A3C with the game Breakout, but an error popped up:
my command : python3 main.py --type A3C --env BreakoutNoFrameskip-v4 --is_atari --nb_episodes 10000 --n_threads 4
the result:
Score: 0%| | 0/10000 [00:00<?, ? episodes/s]Exception in thread Thread-3:
Traceback (most recent call last):
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/thread.py", line 24, in training_thread
a = agent.policy_action(np.expand_dims(old_state, axis=0))
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/a3c.py", line 58, in policy_action
return np.random.choice(np.arange(self.act_dim), 1, p=self.actor.predict(s).ravel())[0]
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/agent.py", line 22, in predict
return self.model.predict(self.reshape(inp))
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 1817, in predict
check_batch_axis=False)
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 113, in _standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking : expected input_1 to have 5 dimensions, but got array with shape (1, 84, 84, 4)
Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/thread.py", line 24, in training_thread
a = agent.policy_action(np.expand_dims(old_state, axis=0))
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/a3c.py", line 58, in policy_action
return np.random.choice(np.arange(self.act_dim), 1, p=self.actor.predict(s).ravel())[0]
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/agent.py", line 22, in predict
return self.model.predict(self.reshape(inp))
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 1817, in predict
check_batch_axis=False)
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 113, in _standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking : expected input_1 to have 5 dimensions, but got array with shape (1, 84, 84, 4)
Exception in thread Thread-5:
Traceback (most recent call last):
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/thread.py", line 24, in training_thread
a = agent.policy_action(np.expand_dims(old_state, axis=0))
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/a3c.py", line 58, in policy_action
return np.random.choice(np.arange(self.act_dim), 1, p=self.actor.predict(s).ravel())[0]
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/agent.py", line 22, in predict
return self.model.predict(self.reshape(inp))
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 1817, in predict
check_batch_axis=False)
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 113, in _standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking : expected input_1 to have 5 dimensions, but got array with shape (1, 84, 84, 4)
Exception in thread Thread-6:
Traceback (most recent call last):
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/thread.py", line 24, in training_thread
a = agent.policy_action(np.expand_dims(old_state, axis=0))
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/a3c.py", line 58, in policy_action
return np.random.choice(np.arange(self.act_dim), 1, p=self.actor.predict(s).ravel())[0]
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/agent.py", line 22, in predict
return self.model.predict(self.reshape(inp))
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 1817, in predict
check_batch_axis=False)
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 113, in _standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking : expected input_1 to have 5 dimensions, but got array with shape (1, 84, 84, 4)
Traceback (most recent call last):
File "main.py", line 115, in <module>
main()
File "main.py", line 108, in main
a = algo.policy_action(old_state)
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/a3c.py", line 58, in policy_action
return np.random.choice(np.arange(self.act_dim), 1, p=self.actor.predict(s).ravel())[0]
File "/home/kexin/Desktop/CHRISTOPHE/Deep-RL-Keras-master/A3C/agent.py", line 22, in predict
return self.model.predict(self.reshape(inp))
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 1817, in predict
check_batch_axis=False)
File "/home/kexin/anaconda3/envs/tensorflow_gpuenv/lib/python3.6/site-packages/keras/engine/training.py", line 113, in _standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking : expected input_1 to have 5 dimensions, but got array with shape (1, 84, 84, 4)
Thank you!
I tried testing the trained model on A3C on CartPole-v1 environment. However, I get the following error:
"ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape (4, 4)"
Hi,
When you use the gradient of the critic to update the actor here, why do you pass "-action_gdts" instead of "action_gdts" as the third parameter of tf.gradients()? Where does the minus sign come from?
I double checked the formula and I still don't see why it is the case in your code.
Thanks!
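One common explanation, stated here as an assumption and not as the author's confirmed reasoning: an optimizer that minimizes steps against whatever gradient it is handed, while the deterministic policy gradient asks the actor to move in the direction that increases Q. Negating the critic's action gradient before passing it on turns the descent step into an ascent step. The toy below is entirely hypothetical: a one-parameter actor a = theta and a critic Q(a) = -(a - 3)^2, whose maximum is at a = 3.

```python
def dq_da(action):
    """Gradient of the toy critic Q(a) = -(a - 3)**2 with respect to a."""
    return -2.0 * (action - 3.0)


def descent_step(theta, grad, lr=0.1):
    """What a minimizing optimizer does: step against the gradient it receives."""
    return theta - lr * grad


def train(theta, negate, steps=100):
    """Run the toy actor update, optionally negating the critic gradient."""
    for _ in range(steps):
        grad = dq_da(theta)   # da/dtheta = 1 for the toy actor a = theta
        theta = descent_step(theta, -grad if negate else grad)
    return theta


with_minus = train(0.0, negate=True)      # moves towards the maximiser a = 3
without_minus = train(0.0, negate=False)  # moves away from it
print(with_minus, without_minus)
```

With the minus sign the parameter settles near 3, where Q is largest; without it the same optimizer pushes the parameter ever further from the maximum.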
Hello again,
FYI: I think you might have defined the TD error incorrectly in "Deep-RL-Keras/DDQN/ddqn.py". On line 125 you have
"""
q_val = self.agent.predict(new_state) ## I think the argument should be 'state' here
q_val_t = self.agent.target_predict(new_state)
next_best_action = np.argmax(q_val)
new_val = reward + self.gamma * q_val_t[0, next_best_action]
td_error = abs(new_val - q_val)[0]
"""
But I think the correct definition is
td_error = abs( Q(s,a) - yi )
with yi = ri + gamma*max( Q(s', a') )
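To make the definition above concrete, here is a minimal sketch of a per-transition TD error that could serve as a replay priority. It is not the repository's code: the function name and the sample numbers are hypothetical, it uses the Double DQN form of the target (online network picks the next action, target network evaluates it) instead of the plain max written above, and it uses Python lists in place of NumPy arrays. The point it illustrates is that the error is a single scalar, the difference between the target and Q(s, a) for the action actually taken.

```python
def td_error(reward, done, q_state, action, next_q_online, next_q_target, gamma=0.99):
    """Absolute TD error |y - Q(s, a)| for one transition (hypothetical helper).

    q_state:        Q-values of the current state s, one per action
    next_q_online:  online-network Q-values of the next state s'
    next_q_target:  target-network Q-values of the next state s'
    """
    if done:
        target = reward
    else:
        # Action selection uses the NEXT state's online Q-values.
        best_next = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
        target = reward + gamma * next_q_target[best_next]
    # Compare against the value of the action that was actually taken in s.
    return abs(target - q_state[action])


# Made-up values: target = 1.0 + 0.5 * 3.0 = 2.5, and Q(s, a=0) = 0.25.
error = td_error(
    reward=1.0, done=False,
    q_state=[0.25, 2.0], action=0,
    next_q_online=[0.0, 1.0], next_q_target=[4.0, 3.0],
    gamma=0.5,
)
print(error)
```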
python main.py --type A2C --env MsPacman-v0
"ValueError: Error when checking input: expected input_1 to have 5 dimensions, but got array with shape (4, 210, 160, 3)"
Using TensorFlow backend.
WARNING:tensorflow:From /root/Deep-RL-Keras/utils/networks.py:8: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
WARNING:tensorflow:From /root/Deep-RL-Keras/utils/networks.py:10: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
WARNING:tensorflow:From main.py:62: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.
WARNING:tensorflow:From /root/miniconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:68: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
WARNING:tensorflow:From /root/miniconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:508: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From /root/miniconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:3837: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
Score: 0%| | 0/5000 [00:00<?, ? episodes/s]Traceback (most recent call last):
File "main.py", line 96, in main
main()
File "main.py", line 96, in main
stats = algo.train(env, args, summary_writer)
File "/root/Deep-RL-Keras/A2C/a2c.py", line 87, in train
a = self.policy_action(old_state)
File "/root/Deep-RL-Keras/A2C/a2c.py", line 47, in policy_action
return np.random.choice(np.arange(self.act_dim), 1, p=self.actor.predict(s).ravel())[0]
File "/root/Deep-RL-Keras/A2C/agent.py", line 21, in predict
return self.model.predict(self.reshape(inp))
File "/root/miniconda3/lib/python3.7/site-packages/keras/engine/training.py", line 1817, in predict
check_batch_axis=False)
File "/root/miniconda3/lib/python3.7/site-packages/keras/engine/training.py", line 113, in _standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking : expected input_1 to have 5 dimensions, but got array with shape (4, 210, 160, 3)
Score: 0%|
Is there a workaround for the ResourceExhaustedError?
This is what happens when I run main.py with a custom env:
Traceback (most recent call last):
File "main.py", line 125, in <module>
main()
File "main.py", line 103, in main
stats = algo.train(env, args, summary_writer)
File "[...]\Deep-RL-Keras\A2C\a2c.py", line 100, in train
self.train_models(states, actions, rewards, done)
File "[...]\Deep-RL-Keras\A2C\a2c.py", line 67, in train_models
self.c_opt([states, discounted_rewards])
File "[...]\lib\site-packages\keras\backend\tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "[...]\lib\site-packages\keras\backend\tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "[...]\lib\site-packages\tensorflow\python\client\session.py", line 1439, in __call__
run_metadata_ptr)
File "[...]\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[177581,177581] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[{{node sub_17}} = Sub[T=DT_FLOAT, _class=["loc:@gradients_1/sub_17_grad/Reshape_1"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_Placeholder_2_0_1, dense_6/BiasAdd)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
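One possible reading of this traceback, offered as an assumption rather than a confirmed diagnosis: the failing `Sub` node subtracts `dense_6/BiasAdd` from a placeholder, and if those two operands have shapes `(N,)` and `(N, 1)`, broadcasting produces an `(N, N)` tensor instead of an elementwise difference. That would match the square `[177581, 177581]` shape in the error. The sketch below uses only plain Python; `broadcast_shape` and `float32_bytes` are hypothetical helpers written for illustration, not functions from the repository or from TensorFlow.

```python
def broadcast_shape(a, b):
    """Return the NumPy-style broadcast shape of two shapes (tuples of ints)."""
    result = []
    # Walk both shapes from the trailing dimension, padding the shorter with 1s.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        result.append(max(da, db))
    return tuple(reversed(result))


def float32_bytes(shape):
    """Memory needed to hold a float32 tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return 4 * count


n = 177581  # batch size taken from the error message

# Assumed shapes: a flat target vector against a column of critic outputs.
mismatched = broadcast_shape((n,), (n, 1))
# Same data with the target reshaped to a column first.
matched = broadcast_shape((n, 1), (n, 1))

print(mismatched, round(float32_bytes(mismatched) / 1e9, 1), "GB")
print(matched, round(float32_bytes(matched) / 1e6, 2), "MB")
```

If this is indeed the cause, reshaping the targets to match the critic output (or training on smaller batches) would avoid the quadratic allocation; whether it applies here depends on the custom environment and how `discounted_rewards` is built.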
Hello Germain / Everyone,
I am currently trying to implement the A2C algorithm as part of a simulation for my PhD. Given that I have very limited time to do so, your source code is a great help, since the algorithm and operations are clearly laid out rather than hidden away, as they are in the OpenAI Baselines implementation.
Still, after having a look at the code in critic.py, I was wondering why you defined a custom optimizer for the critic as well (it is clearly justified for the actor), when simply compiling the critic network with MSE as the loss seems to have the same effect. Is there something I am missing here?
Anyway, that was just a thought, nothing game-changing. Thanks a lot for sharing these implementations.
Hey, great work on the implementations! I tried using your DDPG implementation on another environment (BeerGame) and am getting the following error:
Traceback (most recent call last):
File "ddpg.py", line 191, in <module>
main()
File "ddpg.py", line 183, in main
stats = distributor.train(env, args, summary_writer)
File "ddpg.py", line 130, in train
self.update_models(states, actions, critic_target)
File "ddpg.py", line 73, in update_models
self.actor.train(states, actions, np.array(grads).reshape((-1, self.act_dim)))
File "/Users/aravind/Desktop/DDPG/ddpg_actor.py", line 72, in train
self.adam_optimizer([states, grads])
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2666, in __call__
return self._call(inputs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2635, in _call
session)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2587, in _make_callable
callable_fn = session._make_callable_from_options(callable_opts)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1414, in _make_callable_from_options
return BaseSession._Callable(self, callable_options)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1368, in __init__
session._session, options_ptr, status)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.OutOfRangeError: Node 'Adam' (type: 'NoOp', num of outputs: 0) does not have output 0
Exception ignored in: <bound method BaseSession._Callable.__del__ of <tensorflow.python.client.session.BaseSession._Callable object at 0x119e7f630>>
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1398, in __del__
self._session._session, self._handle, status)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: No such callable handle: 140324910323680
I only made a few modifications to your code to suit my environment, such as the state and action dimensions. The error occurs during training, in the Adam optimizer step, when the action gradients from the critic are propagated to the actor network. Did you encounter any similar errors during your implementation?
Hello,
I think there is a mistake in the Q-value update in your DDQN code (ddqn.py). Should np.argmax(next_q[0,:]) be np.argmax(next_q[i,:]) on line 64? It doesn't make sense to use the same st+1 to pick the next action for every <st, at, st+1, rt> transition in a minibatch.
# Apply Bellman Equation on batch samples to train our DDQN
q = self.agent.predict(s)
next_q = self.agent.predict(new_s)
q_targ = self.agent.target_predict(new_s)
for i in range(s.shape[0]):
    old_q = q[i, a[i]]
    if d[i]:
        q[i, a[i]] = r[i]
    else:
        next_best_action = np.argmax(next_q[0,:]) # problematic, might be np.argmax(next_q[i,:])
        q[i, a[i]] = r[i] + self.gamma * q_targ[i, next_best_action]
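To make the proposed fix concrete, here is a minimal sketch of the Double DQN target computation with the per-sample index `i`, so each transition selects its next action from its own next state. It is not the repository's code: `double_dqn_targets` and its arguments are hypothetical names, and plain Python lists stand in for the NumPy arrays returned by `predict` so the example runs without dependencies.

```python
def double_dqn_targets(q, next_q_online, next_q_target,
                       actions, rewards, dones, gamma):
    """Return (targets, td_errors) for a batch of transitions.

    q, next_q_online and next_q_target are lists of per-action value rows.
    """
    targets = [row[:] for row in q]  # copy so the caller's q is untouched
    td_errors = []
    for i, (a, r, done) in enumerate(zip(actions, rewards, dones)):
        old_q = targets[i][a]
        if done:
            # Terminal transition: no bootstrapping.
            targets[i][a] = r
        else:
            # Online network selects the action for sample i (row i, not row 0)...
            row = next_q_online[i]
            best = max(range(len(row)), key=row.__getitem__)
            # ...and the target network evaluates that action.
            targets[i][a] = r + gamma * next_q_target[i][best]
        td_errors.append(abs(old_q - targets[i][a]))
    return targets, td_errors


q = [[0.0, 0.0], [0.0, 0.0]]
next_q_online = [[1.0, 2.0], [3.0, 0.0]]
next_q_target = [[10.0, 20.0], [30.0, 40.0]]
targets, td_errors = double_dqn_targets(
    q, next_q_online, next_q_target,
    actions=[0, 1], rewards=[1.0, 1.0], dones=[False, False], gamma=0.5)
print(targets)
print(td_errors)
```

In this toy batch the second sample picks action 0 from its own row; indexing row 0 for every sample would instead pick action 1 and bootstrap from 40.0 rather than 30.0.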
Hi there, thanks for sharing your code -- it's been very helpful!
One question: is your implementation of A2C a 'genuine' actor-critic method? My (limited) understanding is that to qualify as an actor-critic method, there needs to be temporal-difference learning: you learn from each (S,A,R,S') transition, as opposed to executing a full episode and then learning. I'm following the logic in Sutton's book, the relevant part of which I quote below.
Anyway -- I'm just curious, and thanks again!
The textbook is available at http://incompleteideas.net/book/the-book-2nd.html
The quote is from page 331:
"Although the REINFORCE-with-baseline method learns both a policy and a state-value
function, we do not consider it to be an actor–critic method because its state-value function
is used only as a baseline, not as a critic. That is, it is not used for bootstrapping (updating
the value estimate for a state from the estimated values of subsequent states), but only
as a baseline for the state whose estimate is being updated. This is a useful distinction,
for only through bootstrapping do we introduce bias and an asymptotic dependence
on the quality of the function approximation. As we have seen, the bias introduced
through bootstrapping and reliance on the state representation is often beneficial because
it reduces variance and accelerates learning. REINFORCE with baseline is unbiased
and will converge asymptotically to a local minimum, but like all Monte Carlo methods
it tends to learn slowly (produce estimates of high variance) and to be inconvenient
to implement online or for continuing problems. As we have seen earlier in this book,
with temporal-difference methods we can eliminate these inconveniences, and through
multi-step methods we can flexibly choose the degree of bootstrapping. In order to gain
these advantages in the case of policy gradient methods we use actor–critic methods with
a bootstrapping critic."
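The distinction drawn in the quote can be sketched in a few lines: with a baseline only, the advantage uses the full Monte Carlo return of the episode; with a bootstrapping critic, it uses a one-step target built from the critic's estimate of the next state. This is an illustrative sketch, not the repository's code, and the function names are hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Monte Carlo returns G_t = r_t + gamma * G_{t+1}, computed backwards."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


def baseline_advantages(rewards, values, gamma):
    """REINFORCE-with-baseline style: G_t - V(s_t), needs the full episode."""
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


def td_advantages(rewards, values, next_values, dones, gamma):
    """One-step actor-critic style: r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return [
        r + (0.0 if done else gamma * nv) - v
        for r, v, nv, done in zip(rewards, values, next_values, dones)
    ]


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]       # critic estimates V(s_t)
next_values = [0.5, 0.5, 0.0]  # critic estimates V(s_{t+1})
dones = [False, False, True]

print(baseline_advantages(rewards, values, gamma=0.5))
print(td_advantages(rewards, values, next_values, dones, gamma=0.5))
```

The first function can only run once the episode has finished, while the second can be evaluated after every single transition, which is the online property the quote attributes to bootstrapping.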
Hi,
When I try to run the DDPG code as follows:
python3 main.py --type DDPG --env LunarLanderContinuous-v2
an error occurs in Deep-RL-Keras/DDPG/actor.py at line 63, in the train function, at the call self.adam_optimizer([states, grads]):
File "./.venv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1441, in __init__
session._session, options_ptr, status)
File "./.venv/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Tried to fetch data for 'Adam', which produces no output. To run to a node but not fetch any data, pass 'Adam' as an argument to the 'target_node_names' argument of the Session::Run API.
Exception ignored in: <bound method BaseSession._Callable.__del__ of <tensorflow.python.client.session.BaseSession._Callable object at 0x7f830c0528d0>>
Traceback (most recent call last):
File "./.venv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1467, in __del__
self._session._session, self._handle, status)
File "./.venv/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: No such callable handle: 140198369660512
Have you run into this issue before?
When the actor's optimizer() function is called, I get the following error:
Traceback (most recent call last):
File "/home/alcon/Desktop/DRL-Python/A3CMain.py", line 41, in <module>
A3C(action_dim, env_dim, args.consecutive_frames)
File "/home/alcon/Desktop/DRL-Python/A3C/a3cNetwork.py", line 33, in __init__
self.a_opt = self.actor.optimizer()
File "/home/alcon/Desktop/DRL-Python/A3C/actor.py", line 39, in optimizer
return K.function(inputs=[self.model.input, self.action_pl, self.advantages_pl], outputs=[], updates=updates)
File "/home/alcon/Desktop/DRL-Python/venv/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3479, in function
return GraphExecutionFunction(inputs, outputs, updates=updates, **kwargs)
File "/home/alcon/Desktop/DRL-Python/venv/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3142, in __init__
with ops.control_dependencies([self.outputs[0]]):
IndexError: list index out of range
I found code online that uses K.function this way, but for some reason it doesn't work for me.
I tried TensorFlow versions 1.15 through 2.0.0
and tried importing each of:
tensorflow.keras.backend
tensorflow.compat.v1.keras.backend
tensorflow.compat.v2.keras.backend
but none of them worked.
E.g., when running python3 main.py --type A2C --env CartPole-v1:
AttributeError: module 'tensorflow' has no attribute 'ConfigProto'
RuntimeError: dictionary changed size during iteration
You put so much effort into developing this project, but something as simple as a setup.py with exact library versions is missing.