
rainbow's Introduction

Rainbow

MIT License

Rainbow: Combining Improvements in Deep Reinforcement Learning [1].

Results and pretrained models can be found in the releases.

  • DQN [2]
  • Double DQN [3]
  • Prioritised Experience Replay [4]
  • Dueling Network Architecture [5]
  • Multi-step Returns [6]
  • Distributional RL [7]
  • Noisy Nets [8]

Run the original Rainbow with the default arguments:

python main.py

Data-efficient Rainbow [9] can be run using the following options (note that the "unbounded" memory is implemented here in practice by manually setting the memory capacity to be the same as the maximum number of timesteps):

python main.py --target-update 2000 \
               --T-max 100000 \
               --learn-start 1600 \
               --memory-capacity 100000 \
               --replay-frequency 1 \
               --multi-step 20 \
               --architecture data-efficient \
               --hidden-size 256 \
               --learning-rate 0.0001 \
               --evaluation-interval 10000

Note that pretrained models from the 1.3 release used a (slightly) incorrect network architecture. To use these, change the padding in the first convolutional layer from 0 to 1 (DeepMind uses "valid" (no) padding).

Requirements

To install all dependencies with Anaconda run conda env create -f environment.yml and use source activate rainbow to activate the environment.

Available Atari games can be found in the atari-py ROMs folder.

Acknowledgements

References

[1] Rainbow: Combining Improvements in Deep Reinforcement Learning
[2] Playing Atari with Deep Reinforcement Learning
[3] Deep Reinforcement Learning with Double Q-learning
[4] Prioritized Experience Replay
[5] Dueling Network Architectures for Deep Reinforcement Learning
[6] Reinforcement Learning: An Introduction
[7] A Distributional Perspective on Reinforcement Learning
[8] Noisy Networks for Exploration
[9] When to Use Parametric Models in Reinforcement Learning?

rainbow's People

Contributors

ahundt, aladoro, baballev, benblack769, deepbrain, foersterrobert, guydav, kaixhin, lwneal, thisisisaac


rainbow's Issues

Quick questions on the Quantile loss function

Rainbow/agent.py

Lines 62 to 67 in cf4c315

pns = self.online_net(next_states) # Probabilities p(s_t+n, ·; θonline)
dns = ((1 / self.atoms) if self.quantile else self.support.expand_as(pns)) * pns # Distribution d_t+n = (z, p(s_t+n, ·; θonline))
argmax_indices_ns = dns.sum(2).argmax(1) # Perform argmax action selection using online network: argmax_a[(z, p(s_t+n, a; θonline))]
self.target_net.reset_noise() # Sample new target net noise
pns = self.target_net(next_states) # Probabilities p(s_t+n, ·; θtarget)
pns_a = pns[range(self.batch_size), argmax_indices_ns] # Double-Q probabilities p(s_t+n, argmax_a[(z, p(s_t+n, a; θonline))]; θtarget)

Great code. Thanks.

Regarding action selection: at least judging from the 'ShangtongZhang DeepRL' repository, the quantile loss (and maybe also the categorical loss) seems to select the action with the target network, rather than with the online_net as done in your code. Is this a deliberate difference from the typical way of implementing the quantile loss?

Also, I am wondering which is better in general, quantile or categorical, according to your experiments? Thanks.

Typo in readme

In your readme, you have "Data-efficient Rainbow [9] can be run using the following options:", but there's no [9] citation. There are, however, two [8] citations one after the other.

Future improvements

First, hands down, amazing work. Serving as a baseline, I see a possible improvement, if someone wants to implement it:

Memory capacity for example data-efficient Rainbow?

Hi folks,

I'm running the data-efficient Rainbow as a baseline for a project I'm starting, and one thing isn't making sense in my head. The original Rainbow paper uses a 1M transition buffer, and comparatively, the data-efficient paper (Appendix E) claims to use an unbounded memory.

Do you have any sense of what an unbounded memory even means in practice? Is there any particular reason you chose to make it smaller than the default Rainbow's memory buffer, rather than larger?

Thank you!

Prioritised Experience Replay

I am interested in implementing Rainbow too. I haven't gone deep into the code yet, but I just saw in the README.md that Prioritised Experience Replay is not checked off.
Will this feature be implemented, or is it perhaps already working?
In their paper, DeepMind actually show that Prioritised Experience Replay is the most important component, i.e. the "no priority" ablation has the biggest performance gap relative to the full Rainbow.

Performance with QR prioritization on Space Invaders

Hello,

I wanted to run a sanity check of your code with QR prioritisation (commit cf4c315) on Space Invaders.
I only ran 20 million steps, but the performance is much lower than expected.
My torch version is '0.4.0' and my atari_py version is '0.1.1'.
Here are the reward and Q-value plots for this training (I barely reach 3000 after 25 million iterations).

[Plots: reward and Q-values for Rainbow with QR prioritisation]

I will now launch a sanity check on Space Invaders with the versions of PyTorch and atari_py from your release v1.0 (i.e. commit 952fcb4).
I am doing this because I have a multi-agent version of Rainbow, but it only reaches a score of around 4000 on Space Invaders after 50 million iterations with 4 agents. Most likely, though, I still have bugs in my multi-agent version of Rainbow...

Please create documentation for --render

Title says it all: I'd very much like to be able to watch the trained agent play, and I haven't been able to figure out how to make it do that after spending half an hour looking through the code.

Edit:
Can --render actually do anything? You use atari-py not gym, and only gym supports rendering: openai/atari-py#14

Asynchronous Multi-agent Rainbow

Hi,

I am currently using your code with the autonomous car simulator Carla, and for the moment it's showing decent results (at least it can learn something).
I will now implement an asynchronous multi-agent version of Rainbow, and before starting, I just wanted to know whether you have already worked on it.
I will submit a pull request when it is ready, if that's OK with you.

Non-ASCII characters used without declaring encoding

When you use some of the flags to run the code (e.g. for the data-efficient Rainbow, like I was), the Greek Unicode characters you use for some of the metavariables cause an error:

justinkterry@prophet:~/Dropbox/rainbow/Rainbow$ python main.py --target-update 2000                --T-max 100000                --learn-start 1600                --memory-capacity 100000                --replay-frequency 1                --multi-step 20                --architecture data-efficient                --hidden-size 256                --learning-rate 0.0001                --evaluation-interval 10000                --enable-cudnn
  File "main.py", line 27
SyntaxError: Non-ASCII character '\xcf' in file main.py on line 27, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

I read the PEP mentioned and was going to make a PR with the fix myself, but adding # coding: utf-8 at the top, as you'd expect, gave the error:

  File "main.py", line 5, in <module>
    from math import inf
ImportError: cannot import name inf

Can you please fix this so that the data-efficient Rainbow is usable?

Taking the max over step frame buffer

Hey Kai,

I'm digging into the implementation details and have another question about a particular detail. In Env.step(), you store the frames after the 3rd and 4th repetitions of the action, and then take the pixel-wise max of the two as the observation. Why do you do that? Is there a particular paper this comes from?
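For reference, the operation being described is just an element-wise maximum over the two stored frames, e.g. (a minimal sketch with dummy data, assuming two grayscale uint8 frames of the same shape):

import torch

frame_3 = torch.randint(0, 256, (84, 84), dtype=torch.uint8)  # Frame after the 3rd action repeat (dummy data)
frame_4 = torch.randint(0, 256, (84, 84), dtype=torch.uint8)  # Frame after the 4th action repeat (dummy data)
observation = torch.max(frame_3, frame_4)  # Element-wise maximum of the two frames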

Thank you!

Breakout

Need to work out why Breakout fails to learn. After checking with Charles Beattie, DeepMind does not use OpenAI's FireResetEnv environment wrapper.

Unit test Prioritised Experience Replay Memory

PER was reported to cause issues (decreasing the performance of a DQN) when ported to another codebase. Although PER can cause performance to decrease, it is still likely that there exists a bug within it.

Hyperparameters

Can I check with you on the difference between the canonical hyperparameters and the data-efficient hyperparameters? Would you say that the data-efficient hyperparameters are the recommended ones?

Zero-filling tensors to reset state buffer

In env.py, you clear state_buffer by enqueuing empty zero-filled tensors:

for _ in range(self.window):
    self.state_buffer.append(torch.zeros(84, 84, device=self.device))

I see how this would work if we are dealing with an environment where the state is a screen and each value in the tensor is an RGB value. However, can zero-filling a tensor to indicate the non-existence of a state be generalised to other environments? For example, if each value in the tensor means "waiting time of a customer", would this approach also work?

Interrupted history transitions in ReplayMemory?

As the segment tree uses a cyclic buffer, the beginning of some transitions (the history part) will be overwritten by new transitions. The 'history' frames will then not be the actual previous transitions. I do not see any code that handles this. Is it a bug?

Same action in multi-agent environment

Hello, thank you for your contribution!
I am a student, and recently I have been running a multi-agent program with your code and am running into a problem.
I'm using Unity3D to simulate multi-robot experiments and send the observations (a camera image and several sensors' readings) to the Python script. When I feed the states to the network, the output actions are all the same.
E.g. we have 9 agents, and each agent can choose among 8 different actions, {0,1,2,3,4,5,6,7}; when we feed the states to the network, the outputs are always the same action for every agent at every time step, such as {1,1,1,1,1,1,1,1,1} or {2,2,2,2,2,2,2,2,2}, etc.
Do you have any idea what could cause this kind of problem?

Performance of release v1.0 on Space Invaders

I just ran release v1.0 (commit 952fcb4) on Space Invaders for the whole weekend (around 25M steps). I used the exact same code with the exact same random seed.
I got much lower performance than the results you are showing.
Here are the plots of rewards and Q-values:
[Plots: Q-values and rewards for release v1.0]

Could you explain exactly how you got your results for this release? Did you run multiple experiments with different random seeds and average them, or just take the best one?
Or maybe it's a PyTorch, atari_py or other library issue? Could you list all of your library versions?

Port alewrap

alewrap is the wrapper used by DeepMind's Torch DQN code, and hence should contain the basic wrapper for the ALE. Atari contains the options that should be passed to the wrapper.

About Episodic Life at Test Phase

In quite a lot of implementations and papers, the scores reported are actually taken at game over rather than at loss of life (episodic life is only used during training).

You may want to consider removing episodic life from the testing environment to match the reported scores.

Pre-trained models param mismatch

Are the pretrained model files correct/linked to the correct commit? I tried running the evaluation with the pretrained models from v1.3 and v1.4, but I receive a runtime error: "Error(s) in loading state_dict for DQN".

Traceback (most recent call last):
File "main.py", line 82, in
dqn = Agent(args, env)
File "/home/akanksha/Documents/Rainbow/agent.py", line 24, in init
self.online_net.load_state_dict(torch.load(args.model, map_location='cpu'))
File "/home/akanksha/anaconda3/envs/rainbow/lib/python3.7/site-packages/torch/nn/modules/module.py", line 845, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DQN:
Missing key(s) in state_dict: "convs.4.weight", "convs.4.bias".
size mismatch for convs.0.weight: copying a param with shape torch.Size([32, 4, 5, 5]) from checkpoint, the shape in current model is torch.Size([32, 4, 8, 8]).
size mismatch for convs.2.weight: copying a param with shape torch.Size([64, 32, 5, 5]) from checkpoint, the shape in current model is torch.Size([64, 32, 4, 4]).
size mismatch for fc_h_v.weight_mu: copying a param with shape torch.Size([256, 576]) from checkpoint, the shape in current model is torch.Size([512, 3136]).
size mismatch for fc_h_v.weight_sigma: copying a param with shape torch.Size([256, 576]) from checkpoint, the shape in current model is torch.Size([512, 3136]).
size mismatch for fc_h_v.bias_mu: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for fc_h_v.bias_sigma: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for fc_h_v.weight_epsilon: copying a param with shape torch.Size([256, 576]) from checkpoint, the shape in current model is torch.Size([512, 3136]).
size mismatch for fc_h_v.bias_epsilon: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for fc_h_a.weight_mu: copying a param with shape torch.Size([256, 576]) from checkpoint, the shape in current model is torch.Size([512, 3136]).
size mismatch for fc_h_a.weight_sigma: copying a param with shape torch.Size([256, 576]) from checkpoint, the shape in current model is torch.Size([512, 3136]).
size mismatch for fc_h_a.bias_mu: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for fc_h_a.bias_sigma: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for fc_h_a.weight_epsilon: copying a param with shape torch.Size([256, 576]) from checkpoint, the shape in current model is torch.Size([512, 3136]).
size mismatch for fc_h_a.bias_epsilon: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
size mismatch for fc_z_v.weight_mu: copying a param with shape torch.Size([51, 256]) from checkpoint, the shape in current model is torch.Size([51, 512]).
size mismatch for fc_z_v.weight_sigma: copying a param with shape torch.Size([51, 256]) from checkpoint, the shape in current model is torch.Size([51, 512]).
size mismatch for fc_z_v.weight_epsilon: copying a param with shape torch.Size([51, 256]) from checkpoint, the shape in current model is torch.Size([51, 512]).
size mismatch for fc_z_a.weight_mu: copying a param with shape torch.Size([918, 256]) from checkpoint, the shape in current model is torch.Size([306, 512]).
size mismatch for fc_z_a.weight_sigma: copying a param with shape torch.Size([918, 256]) from checkpoint, the shape in current model is torch.Size([306, 512]).
size mismatch for fc_z_a.bias_mu: copying a param with shape torch.Size([918]) from checkpoint, the shape in current model is torch.Size([306]).
size mismatch for fc_z_a.bias_sigma: copying a param with shape torch.Size([918]) from checkpoint, the shape in current model is torch.Size([306]).
size mismatch for fc_z_a.weight_epsilon: copying a param with shape torch.Size([918, 256]) from checkpoint, the shape in current model is torch.Size([306, 512]).
size mismatch for fc_z_a.bias_epsilon: copying a param with shape torch.Size([918]) from checkpoint, the shape in current model is torch.Size([306]).

TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

......
self.actions.get(action): 4
self.actions.get(action): 4
self.actions.get(action): 4
self.actions.get(action): 4
self.actions.get(action): 1
self.actions.get(action): 1
self.actions.get(action): 1
self.actions.get(action): 1
self.actions.get(action): None

Traceback (most recent call last):
File "main.py", line 103, in
next_state, reward, done = env.step(action) # Step
File "C:\Users\simon\Desktop\DQN\RL-AlphaGO\Rainbow-master\env.py", line 63, in step
reward += self.ale.act(self.actions.get(action))
File "C:\Program Files\Python35\lib\site-packages\atari_py\ale_python_interface.py", line 159, in act
return ale_lib.act(self.obj, int(action))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Suggestions for improving training speed (especially when input data is large)

Hi, in memory.py I suggest changing your code a little, which should help improve training speed (especially when the input image data is 3D):

  • your original code:
state = torch.stack([trans.state for trans in transition[:self.history]]).to(dtype=torch.float32, device=self.device).div_(255)
next_state = torch.stack([trans.state for trans in transition[self.n:self.n + self.history]]).to(dtype=torch.float32, device=self.device).div_(255)
  • suggested code:
state = torch.stack([trans.state for trans in transition[:self.history]]).to(device=self.device).to(dtype=torch.float32).div_(255)
next_state = torch.stack([trans.state for trans in transition[self.n:self.n + self.history]]).to(device=self.device).to(dtype=torch.float32).div_(255)

Here is the code for testing:

import timeit
import numpy as np
import torch

T, T1 = [], []

device = torch.device('cuda')
for i in range(0, 4):
    A = np.zeros((100, 100, 100), dtype=np.int64)
    B = torch.tensor(A, dtype=torch.int)
    T.append(B)
for i in range(0, 4):
    A = np.zeros((100, 100, 100), dtype=np.int64)
    B = torch.tensor(A, dtype=torch.int)
    T1.append(B)

# This line is used for initialisation
M = torch.stack(T).to(dtype=torch.float32, device=device).div_(255)

# Comparison
timea = timeit.default_timer()
M = torch.stack(T).to(dtype=torch.float32, device=device).div_(255)
timeb = timeit.default_timer()
N = torch.stack(T1).to(device=device).to(dtype=torch.float32).div_(255)
timec = timeit.default_timer()

print("time1 is: {}\ntime2 is: {}".format(timeb - timea, timec - timeb))

Async queue for testing

For games where evaluation takes a lot of time (particularly those that will just time out), a lot of time is wasted waiting for the evaluation to finish before the agent can resume training. It would be better to have a queue that takes a copy of the model at the evaluation timepoint and then evaluates it in parallel with training. Care needs to be taken to make sure that the queue doesn't fill up with models (causing memory to run out) and that many models aren't evaluated at once (probably best to limit to one evaluation at a time).
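A minimal sketch of this idea, using a bounded queue and a single evaluation worker (hypothetical names; not an implementation of this repo's evaluation code):

import copy
import queue
import torch
import torch.multiprocessing as mp


def evaluation_worker(snapshot_queue):
    # Evaluate one model snapshot at a time, in a separate process
    while True:
        snapshot = snapshot_queue.get()
        if snapshot is None:  # Sentinel: shut down the worker
            break
        step, state_dict = snapshot
        print('Evaluating snapshot from step', step)  # Placeholder for the real evaluation loop


if __name__ == '__main__':
    online_net = torch.nn.Linear(4, 2)  # Stand-in for the real network
    snapshot_queue = mp.Queue(maxsize=1)  # Bounded, so snapshots cannot pile up and exhaust memory
    worker = mp.Process(target=evaluation_worker, args=(snapshot_queue,))
    worker.start()

    for T in range(1, 5):  # Stand-in for the training loop's evaluation points
        try:  # Skip this evaluation point if the previous evaluation is still running
            snapshot_queue.put_nowait((T, copy.deepcopy(online_net.state_dict())))
        except queue.Full:
            pass

    snapshot_queue.put(None)  # Ask the worker to finish, then wait for it
    worker.join()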

Testing should not be deterministic

There is a parameter --evaluation-episodes, but in the current implementation, since we always act greedily, all the episodes are going to be exactly the same. I think that to get a better test evaluation, you should add a deterministic=False option when testing (i.e. instead of taking the action with the highest Q-value, you can sample over all actions with probabilities derived from their Q-values), as sketched below.
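For illustration, a minimal sketch of such non-greedy action selection (hypothetical, using a softmax over the Q-values since raw Q-values may be negative; not necessarily what the linked branch does):

import torch
import torch.nn.functional as F

q_values = torch.tensor([1.0, 2.5, 0.3, 1.7])        # Dummy Q-values for a single state
greedy_action = q_values.argmax().item()             # Current deterministic behaviour
probs = F.softmax(q_values, dim=0)                   # Turn Q-values into a valid distribution
sampled_action = torch.multinomial(probs, 1).item()  # Stochastic evaluation action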

I implemented that on my branch on the last commit marintoro/Rainbow@d061caf (it's really straightforward)

By the way, I launched a training run last night and everything worked properly. But I don't have access to a powerful computer yet, so the agent was still performing pretty poorly (it was in the early stages of training). I just wanted to know whether you have already launched a big training run, on which game, and whether you compared it to a standard DRL algorithm (like plain DQN, for example)?
There may still be some non-breaking errors in the implementation, which can be sneaky to spot and debug (I mean, if the agent learns worse than plain DQN, there must be something wrong, for example).

Use preallocated tensor in replay memory

Using a preallocated tensor (whose size is known in advance) in the replay memory, instead of doing CUDA casts and hence allocating new memory, should (maybe) provide a speed boost.
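A minimal sketch of the idea (hypothetical names and shapes, assuming 84x84 uint8 frames):

import torch

capacity = 100000
states = torch.zeros((capacity, 84, 84), dtype=torch.uint8)  # Allocated once up front
actions = torch.zeros(capacity, dtype=torch.int64)
rewards = torch.zeros(capacity, dtype=torch.float32)

def append(index, state, action, reward):
    # Write transitions in place instead of allocating new tensors every step
    states[index % capacity].copy_(state)
    actions[index % capacity] = action
    rewards[index % capacity] = reward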

Replicating DeepMind results

As of 5c252ea, this repo has been checked over several times for discrepancies, but is still unable to replicate DeepMind's results. This issue is to discuss any further points that may need fixing.

  • Should the loss be averaged or summed over the minibatch?
  • Should noisy network updating use independent noise per transition in the batch [v1] or the same noise but another noise sample for action selection [v2]?
  • Is the max priority taken over all time, or just over the current buffer (may be the former)? Results and the paper indicate the former.
  • Are priorities added as δ, or δ + ε (ε may not be needed with a KL loss)? A single ablation run indicates that adding ε causes performance to drop more at the end of training; δ + ε shouldn't be needed with a KL loss.
  • Most people implement PER by adding priorities already multiplied by α, but the maths indicates that the raw values should be stored and sampling should be done with respect to everything raised to the power of α (see the sketch below). α isn't changed here, so this is not an issue.
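A sketch of the distinction (hypothetical values; the repo's actual memory uses a sum tree rather than a flat tensor):

import torch

alpha = 0.5
priorities = torch.tensor([0.1, 2.0, 0.5, 1.2])                # Raw |δ| values, stored unmodified
probs = priorities.pow(alpha)                                  # Exponent applied only at sampling time
probs /= probs.sum()
batch_indices = torch.multinomial(probs, 2, replacement=True)  # Sample proportionally to p^α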

[Plot: Space Invaders results with averaged losses]

[Plot: Space Invaders results with summed losses]

Building the environment with DeepMind wrappers

The environment in this repo is implemented directly on top of atari_py. However, in some cases, it might be useful to have the option to build it using OpenAI Gym's syntax. Is there a direct equivalent of the current environment in terms of OpenAI baselines.common.atari_wrapper? I was thinking of something along the lines of wrap_deepmind(gym.make('QbertNoFrameskip-v4'), episode_life=False, frame_stack=True, scale=True, clip_rewards=False), but this doesn't seem to achieve quite the same rewards as your implementation.

Dimension problem in forward function in model.py

I get an error when trying to launch main.py (pulled from master):
"Number of dimensions of repeat dims can not be smaller than number of dimensions of tensor" at line 68.

x = v.repeat(1, self.action_space) + a - a_mean.repeat(1, self.action_space)  # Combine streams

Actually, the dimensions of a_mean are [batch_size, 1, 51] while v is [batch_size, 51].
I am using Torch version '0.1.12_1'.

I quickly tried reshaping a_mean to match v, but there is a dimension mismatch later in the code, so I wanted to check whether I am doing something wrong before going deeper (I am pretty new to PyTorch).
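For reference, the intended combination of the two streams, with consistent shapes, looks roughly like this (a sketch with dummy shapes, not the repo's exact code; broadcasting like this needs a newer PyTorch than 0.1.12):

import torch

batch_size, action_space, atoms = 32, 6, 51
v = torch.randn(batch_size, atoms)                # Value stream, [batch_size, 51]
a = torch.randn(batch_size, action_space, atoms)  # Advantage stream, [batch_size, action_space, 51]

a_mean = a.mean(1, keepdim=True)                  # [batch_size, 1, 51]
q = v.unsqueeze(1) + a - a_mean                   # Broadcasts to [batch_size, action_space, 51]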

Question regarding the commit: Fix stddev to match NoisyNet paper

Hello,
I see one commit (6c8b281) that tries to fix the default value of the stddev in the noisy layers, but I think this is overridden anyway by the default value in args.py, which is 0.1.

Moreover, there is a small note in the original Rainbow paper (just above Table 1) stating:
"The noise was generated on the GPU. Tensorflow noise generation can be unreliable on GPU. If generating the noise on the CPU, lowering σ0 to 0.1 may be helpful"

So I think that in all your experiments you actually used 0.1 and not 0.5 (nor 0.4), and this may be the right thing to do?

Handling of terminal state

I don't really understand what you did to the calculation of the nonterminal states in the last commit (line 128 in memory.py).

nonterminals = self.dtype_float([transition.nonterminal for transition in full_transitions[self.history + self.n - 1]]).unsqueeze(1)

What if the current state is just one step before a terminal state?
full_transitions[self.history] will be terminal, but not all of the following ones (because you only postappend one frame as terminal in memory); in particular, full_transitions[self.history + self.n - 1] will not be terminal...

In fact, I think the safest way to handle terminal states is to postappend self.n frames (and set them as terminal), not only one (exactly the same way as in preappend, where you add self.history frames and not only one).

Indeed, if self.n >> self.history, the current computation of the returns will be wrong because you don't check whether a terminal state is reached, so the return could include rewards from the next episode (when self.n < self.history this bug is hidden by the fact that you preappend self.history frames with 0 reward at the beginning of each episode).
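For concreteness, the n-step return should stop accumulating at the first terminal step, e.g. (a minimal sketch with dummy values, not the repo's code):

gamma, n = 0.99, 5
rewards = [1.0, 0.0, 2.0, 5.0, 1.0]       # r_t, r_{t+1}, ... (dummy values)
nonterminals = [1.0, 1.0, 0.0, 1.0, 1.0]  # The state after the third step is terminal

R, alive = 0.0, 1.0
for k in range(n):
    R += alive * (gamma ** k) * rewards[k]  # Accumulate discounted rewards while still in the episode
    alive *= nonterminals[k]                # Mask out everything after the first terminal step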

Fix max in prioritised experience replay

Currently the "current" max, which should be used to initialise the priorities of new transitions, is set to the all-time max. Although this seems a small bug, it is still a bug. The best solution would be to combine the current sum tree with a max tree.

float state?

Hi Kaixhin,
I am trying to modify the code to use unscaled float inputs. Do I only have to make modifications in memory.py? Basically, using float32 instead of uint8 and removing the multiplication and division by 255?
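For illustration, the change being asked about amounts to something like this (a rough sketch, not the repo's exact code):

import torch

# Current scheme: states are stored as uint8 and scaled to [0, 1] when sampled
stored_uint8 = (torch.rand(84, 84) * 255).to(torch.uint8)
state = stored_uint8.to(torch.float32).div_(255)

# Unscaled float scheme: store float32 directly and skip the scaling (4x the memory per state)
stored_float = torch.rand(84, 84)
state = stored_float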
Thanks!

Human-expert normalized scores

The Rainbow DQN paper uses human-expert normalized scores, so I am not sure how to evaluate the training results against the original paper. Do you know what values were used for human expert scores?

I found snippets of the values used in various papers here and there, but I am not sure whether we can use the same numbers, nor how to compute a single normalized value across all Atari games:
[Image: snippets of human and random score values collected from papers]
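For reference, the human-normalised score used in the DQN/Rainbow papers is computed per game from the agent's score and the per-game random and human scores:

def human_normalised_score(agent_score, random_score, human_score):
    # Per-game human-normalised score (in percent): 100 * (agent - random) / (human - random)
    return 100 * (agent_score - random_score) / (human_score - random_score)

The single summary value reported in the Rainbow paper is then the median of these per-game scores.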

Valid or Same padding in Conv2D?

Hello,

I just realised, by looking at the code of the open-source Rainbow agent from Google (here), that the network they use has "same" padding (the default in TensorFlow), while in Torch the default is "valid" padding.

This results in a really different network: in the TensorFlow case, the first fully connected layer takes a 7744-dimensional vector (after flattening; it's [11, 11, 64] before flattening), while in your PyTorch case the same fully connected layer takes a 3136-dimensional vector ([7, 7, 64] before flattening).

I don't know if it's really a problem, but I thought it was interesting to point out...
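The sizes quoted above can be checked with the standard convolution output-size formula (a quick sketch, assuming an 84x84 input and the usual DQN conv stack of 8x8/stride 4, 4x4/stride 2, 3x3/stride 1):

import math

def out_size(size, kernel, stride, same_padding):
    # "Same" padding (TensorFlow default) vs "valid"/no padding (PyTorch default, padding=0)
    return math.ceil(size / stride) if same_padding else (size - kernel) // stride + 1

size_valid = size_same = 84
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size_valid = out_size(size_valid, kernel, stride, same_padding=False)
    size_same = out_size(size_same, kernel, stride, same_padding=True)

print(size_valid * size_valid * 64)  # 7 * 7 * 64 = 3136
print(size_same * size_same * 64)    # 11 * 11 * 64 = 7744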

Is the evaluation procedure different?

Hi Kai,

In the Rainbow paper, the evaluation procedure is described as

The average scores of the agent are evaluated during training, every 1M steps in the environment, by suspending learning and evaluating the latest agent for 500K frames. Episodes are truncated at 108K frames (or 30 minutes of simulated play).

However, the code as written tests for a fixed number of episodes. Am I missing anything? Or is this the procedure from the data-efficient Rainbow paper (I couldn't find a detailed description there).

Thanks!

Pinned memory experience replay

A more efficient implementation would allocate a giant tensor in advance for each item (e.g. state, action) in a transition tuple, pin it (as long as the machine has enough spare RAM; it should be at least 6GB?), and use asynchronous copies to the GPU.
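A minimal sketch of the pinning and asynchronous-copy pattern in PyTorch (hypothetical sizes and names; assumes a CUDA device is available, and uses a small pinned staging buffer so the host-to-device copy can actually overlap with compute):

import torch

capacity, batch_size = 100000, 32
states = torch.zeros((capacity, 84, 84), dtype=torch.uint8).pin_memory()     # Giant preallocated, pinned buffer

idx = torch.randint(0, capacity, (batch_size,))                              # Sampled transition indices
staging = torch.empty((batch_size, 84, 84), dtype=torch.uint8).pin_memory()  # Pinned staging buffer
torch.index_select(states, 0, idx, out=staging)                              # Gather the sampled states on the CPU
batch = staging.to('cuda', non_blocking=True)                                # Asynchronous host-to-device copy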

Properly notify & raise errors on loading pretrained models

When I accidentally gave an incorrect model path (python main.py --model wrong/path), I wasn't able to catch this until too late. To avoid such mistakes, I've added a check that properly throws an error if the provided file is missing. Also, I thought it'd be nice to be notified whether training is starting from scratch or from a model (since there is really no way to tell whether the model loaded properly without a print statement).
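A minimal sketch of such a check (hypothetical helper; not necessarily what the linked pull request does):

import os
import torch

def load_pretrained(model, path):
    if path is None:
        print('Starting training from scratch')
        return
    if not os.path.isfile(path):
        raise FileNotFoundError('Pretrained model not found: ' + path)
    model.load_state_dict(torch.load(path, map_location='cpu'))
    print('Loaded pretrained model from ' + path)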

Here is the link to the pull request
