First of all, thank you very much @takuseno for this super nice and cleanly written RL library! 🔥
**Describe the bug**

**1. Truncated episodes:**

In my RL project, I run complete episodes and add them to the replay buffer afterwards by creating `Episode` objects and calling `buffer.append_episode(episode)`.
I saw that one cannot pass `next_observations` and `terminals` when creating episodes. Looking at the internals of how you create transitions, this makes sense, because `next_observations` do not have corresponding `next_actions` and `next_rewards` available.

Since you create `next_observations` from `observations`, the last observation of the episode (which is `next_observations[-1]`) is always missing. Furthermore, if `next_observations[-1]` is a real terminal state, where `terminals[-1]=True`, the agent would never see it in training. The sketch below illustrates the off-by-one.

The question is: are you using `next_actions` and `next_rewards` explicitly anywhere in the algorithms?
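To make the off-by-one concrete, here is a minimal sketch of the indexing as I understand it; `split_into_transitions` is a hypothetical helper for illustration only, not d3rlpy's actual internals:

```python
import numpy as np

# Hypothetical helper illustrating the indexing, not d3rlpy's actual code:
# transitions pair observations[t] with observations[t + 1], so T logged
# observations yield only T - 1 transitions.
def split_into_transitions(observations, actions, rewards):
    return list(zip(observations[:-1], actions[:-1], rewards[:-1],
                    observations[1:]))  # derived next_observations

T = 200
states = np.arange(T)    # stand-ins for s_0 ... s_199 passed as `observations`
actions = np.zeros(T)
rewards = np.ones(T)

transitions = split_into_transitions(states, actions, rewards)
print(len(transitions))  # 199 -> matches len(episode) in the repro below
# The state reached by the last action (next_states[-1] in my logging)
# cannot be passed at all, so it never appears in any transition.
```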
**2. Implicit terminals:**

I also noticed that you implicitly set `terminal=True` for the last observation. There are use cases where there are no real terminal states, e.g. environments with an infinite time horizon (control environments like `Pendulum-v0`). Setting `terminal=True` at the end of an episode for those environments would result in wrong TD targets, as the sketch below shows.
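For illustration, a minimal numeric sketch using the standard masked TD target `y = r + gamma * (1 - terminal) * Q(s', a')`; the numbers are made up:

```python
gamma = 0.99
reward = -0.1   # made-up reward of the episode's final transition
q_next = -50.0  # made-up Q-value of the successor state

# Standard masked TD target: y = r + gamma * (1 - terminal) * Q(s', a')
y_truncated = reward + gamma * (1 - 0) * q_next  # terminal=False (time limit)
y_terminal = reward + gamma * (1 - 1) * q_next   # terminal=True (implicit)

print(y_truncated)  # -49.6 -> bootstraps from the successor state, which is
                    #          correct for an infinite-horizon task
print(y_terminal)   # -0.1  -> discards all future value, wrong for a
                    #          time-truncated episode
```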
**To Reproduce**

In the `Pendulum-v0` environment, the episode is truncated after 200 time steps. The `info` dict that is returned at each time step indicates the truncation with `info={'TimeLimit.truncated': True}`.
```python
import gym
import numpy as np

from d3rlpy.dataset import Episode

if __name__ == '__main__':
    env = gym.make('Pendulum-v0')

    states = []
    actions = []
    rewards = []
    next_states = []
    dones = []

    done = False
    state = env.reset()
    while not done:
        action = env.action_space.sample()
        next_state, reward, done, info = env.step(action)
        states.append(state)
        rewards.append(reward)
        actions.append(action)
        next_states.append(next_state)
        dones.append(done)
        state = next_state

    # info of the last env step indicates a time-truncated episode -> no terminal state
    print(info)
    if info.get('TimeLimit.truncated', False):
        dones[-1] = False

    states = np.stack(states, axis=0)
    actions = np.stack(actions, axis=0)
    rewards = np.stack(rewards, axis=0)

    episode = Episode(env.observation_space.shape,
                      env.action_space.low.size,
                      states,
                      actions,
                      rewards)

    # len(episode) == 199 since next_states are not used to create it
    print(len(episode))
```
**Expected behavior**

I would expect to be able to control whether `terminals=True` is set at the end of the episode.

To get an episode length of 200 in the `Pendulum-v0` example, one can extend the action and reward vectors with dummy values at the last time step, but at the risk that these dummy values are used somewhere in the library (see the padding sketch below).
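A sketch of that padding, continuing from the repro script above; the zero padding values are arbitrary dummies:

```python
# Continuing from the repro script above: append the true final state and pad
# actions/rewards with arbitrary dummy values so all arrays have length 201.
states_padded = np.concatenate([states, next_states[-1][None]], axis=0)
actions_padded = np.concatenate([actions, np.zeros_like(actions[-1:])], axis=0)
rewards_padded = np.concatenate([rewards, np.zeros_like(rewards[-1:])], axis=0)

episode = Episode(env.observation_space.shape,
                  env.action_space.low.size,
                  states_padded,
                  actions_padded,
                  rewards_padded)

print(len(episode))  # 200 -- but the dummy action/reward of the padded step
                     # might be used somewhere inside the library
```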
A workaround is to use `buffer.append` in a for loop to add the episode to the buffer. But one must not forget to set `buffer.prev_observation`, `buffer.prev_actions` etc. to `None` afterwards, roughly as sketched below.
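Roughly like this; I am assuming a step-wise `buffer.append(observation, action, reward, terminal)` signature and the attribute names mentioned above, so please double-check against the actual API of your d3rlpy version:

```python
# Assumed step-wise signature buffer.append(observation, action, reward,
# terminal); verify against the actual d3rlpy API before relying on this.
for state, action, reward, done in zip(states, actions, rewards, dones):
    buffer.append(state, action, reward, done)

# Reset the buffer's episode-tracking attributes so the next episode is not
# chained to this one (attribute names as mentioned above).
buffer.prev_observation = None
buffer.prev_actions = None
```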
**Edit:**

I dived a bit deeper into your algo update functions and found that you are using `rew_tp1` to calculate the TD target: `y = rew_tp1 + gamma * q_tp1`.

If I add transitions to the buffer as mentioned above, the reward of the current time step should be used instead of `rew_tp1` (as far as I know). The sketch below contrasts the two indexing conventions.
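A small sketch of the two reward-indexing conventions with made-up numbers; which one is correct depends on how the transition tuples are stored:

```python
# Two reward-indexing conventions for the TD target of (s_t, a_t, ., s_{t+1});
# made-up numbers for illustration.
gamma = 0.99
r_t = 1.0     # reward received for taking a_t in s_t ("current time step")
r_tp1 = 2.0   # reward received one step later
q_tp1 = 10.0  # Q(s_{t+1}, a_{t+1})

# Convention A: the reward is stored with (s_t, a_t) -> use the current reward
y_a = r_t + gamma * q_tp1

# Convention B: the reward is stored with s_{t+1} -> use rew_tp1 as above
y_b = r_tp1 + gamma * q_tp1

print(y_a, y_b)  # 10.9 11.9 -- mixing the two conventions shifts every
                 # TD target by one step
```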
Furthermore, I see that you are not taking the entropy bonus (e.g. in SAC) into account in the TD target calculation. Is this intended? In other libraries it is added there; see the sketch below.
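For reference, a minimal sketch of the SAC target with the entropy bonus as in the original SAC paper; the values are made up and this is not d3rlpy's actual code:

```python
# SAC TD target with the entropy bonus (Haarnoja et al., 2018):
#   y = r + gamma * (1 - done) * (min_i Q_i(s', a') - alpha * log_pi(a'|s'))
# Made-up scalar values for illustration; not d3rlpy's implementation.
def sac_td_target(reward, done, q1_next, q2_next, log_prob_next,
                  gamma=0.99, alpha=0.2):
    q_next = min(q1_next, q2_next)           # clipped double-Q estimate
    soft_q = q_next - alpha * log_prob_next  # entropy-regularized value
    return reward + gamma * (1.0 - done) * soft_q

y = sac_td_target(reward=1.0, done=0.0,
                  q1_next=10.0, q2_next=9.5, log_prob_next=-1.2)
print(y)  # ~10.6426 -- without the entropy term it would be
          # 1 + 0.99 * 9.5 = 10.405
```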
Sorry for mixing up so many topics. Should I create separate issues for discussion?