
microsoft / fqf

FQF (Fully parameterized Quantile Function for distributional reinforcement learning) is a general reinforcement learning framework for Atari games. It learns to play Atari games automatically by predicting the return distribution in the form of a fully parameterized quantile function.

License: Other



Fully parameterized Quantile Function (FQF)

TensorFlow implementation of the paper

Fully Parameterized Quantile Function for Distributional Reinforcement Learning

Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, Tie-yan Liu

If you use this code in your research, please cite

@inproceedings{yang2019fully,
  title={Fully Parameterized Quantile Function for Distributional Reinforcement Learning},
  author={Yang, Derek and Zhao, Li and Lin, Zichuan and Qin, Tao and Bian, Jiang and Liu, Tie-Yan},
  booktitle={Advances in Neural Information Processing Systems},
  pages={6190--6199},
  year={2019}
}

Requirements

  • python==3.6
  • tensorflow
  • gym
  • absl-py
  • atari-py
  • gin-config
  • opencv-python

Installation on Ubuntu

sudo apt-get update && sudo apt-get install cmake zlib1g-dev
pip install absl-py atari-py gin-config==0.1.4 gym opencv-python tensorflow-gpu==1.12.0
cd FQF
pip install -e .

Experiments

  • Our experiments and hyper-parameter search can be run as follows:
cd FQF/dopamine/discrete_domains
bash run-fqf.sh

Bug Fix

  • It is recommended to use an L2 loss on the gradient of the probability proposal network, or to clip the largest proposed probability to 0.98 (a sketch of the clipping follows). The reason is as follows: in a quantile function, as the probability approaches 1 the quantile value goes to infinity (or a very large number). Although a very large quantile value is reasonable for a probability such as 0.9999999, the limited approximation ability of the neural network means the quantile values for the other probabilities are dragged up as well, leading to a performance drop.
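
A minimal sketch of the clipping workaround, assuming tau is the tensor of cumulative probabilities produced by the fraction proposal network (the function name, tensor name, and shapes are illustrative, not the repo's actual code):

import tensorflow as tf  # TF 1.x, matching the pinned tensorflow-gpu==1.12.0

def clip_proposed_fractions(tau, max_prob=0.98):
    # Cap the largest proposed probability so the quantile network never has to
    # represent quantile values at probabilities arbitrarily close to 1.
    return tf.clip_by_value(tau, 0.0, max_prob)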

Acknowledgement

  • Our implementation is built on top of dopamine.

Code of Conduct

Contributors

linzichuan, microsoft-github-operations[bot], microsoftopensource, waterblue13

Issues

Some Instructions Required

Hi, I'm trying to build some new work on top of yours. Could you please give me some instructions or suggestions on how to use the code here?

tf.gather_nd error

Hi!

I'm trying to run FQF using the script run-fqf.sh, but I'm getting an error that I couldn't resolve. It only happens when the agent starts training.

I'm running the code on CPU rather than GPU. Could that be the problem?

Thanks for your attention!

File "train.py", line 65, in <module>
    app.run(main)
[elided 14 identical lines from previous traceback]
File "../../dopamine/agents/dqn/dqn_agent.py", line 205, in __init__
    self._train_op = self._build_train_op()
File "../../dopamine/agents/fqf/fqf_agent.py", line 377, in _build_train_op
    chosen_action_L_tau = tf.gather_nd(self._replay_net_outputs.L_tau, reshaped_actions)
File "/home/julio/.local/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3647, in gather_nd
    "GatherNd", params=params, indices=indices, name=name)
File "/home/julio/.local/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
File "/home/julio/.local/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
File "/home/julio/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
File "/home/julio/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): indices[31] = [31, 1] does not index into shape [31,32,9]
	 [[node gradients_2/GatherNd_3_grad/ScatterNd (defined at ../../dopamine/agents/fqf/fqf_agent.py:410) ### ]]
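
For reference, a minimal sketch of the tf.gather_nd contract that this error describes: every index must fall inside the leading dimensions of params, so with a first dimension of 31 the index [31, 1] is out of range. The shapes below are copied from the traceback; the dimension names are only a guess.

import numpy as np
import tensorflow as tf  # TF 1.x graph mode, matching the pinned tensorflow-gpu==1.12.0

params = tf.placeholder(tf.float32, [31, 32, 9])  # e.g. [batch, quantiles, actions]
indices = tf.placeholder(tf.int32, [None, 2])     # each row indexes the first two dims
gathered = tf.gather_nd(params, indices)          # each valid row yields a length-9 slice

with tf.Session() as sess:
    values = np.zeros([31, 32, 9], np.float32)
    sess.run(gathered, {params: values, indices: [[30, 1]]})    # ok: 30 < 31
    # sess.run(gathered, {params: values, indices: [[31, 1]]})  # InvalidArgumentError, as above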

stale gradients problem

If I didn't get it wrong, there might be a subtle problem in applying gradients to FPN's trainable variables.

When optimizing FPN, the application of gradients w.r.t. FPN's trainable variables is separated into 2 stages: first dW1 (from the 1-Wasserstein loss) and then the entropy.
After the first optimization, the trainables would have changed.
What I mean is: entropy is calculated based on the old trainables but applied to the new trainables.
I'm not sure, but is this the so-called stale gradients problem?

Hope to respond
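
For what it is worth, here is a minimal sketch (toy tensors and hypothetical names, not the repo's actual code) of the single-step alternative this report hints at: build both loss terms from the same forward pass and apply them in one optimizer step, so the entropy gradient is taken with respect to the same, pre-update FPN variables.

import tensorflow as tf  # TF 1.x graph mode

fqf_params = [tf.Variable(tf.random_normal([4, 4]), name='fpn_w')]  # stand-in FPN weights
tau = tf.nn.softmax(tf.matmul(tf.ones([1, 4]), fqf_params[0]))      # stand-in proposed fractions

w1_loss = tf.reduce_mean(tau)                          # stand-in for the 1-Wasserstein term
q_entropy = tf.reduce_sum(-tau * tf.log(tau), axis=1)  # stand-in for the proposal entropy
ent_coef = 0.001

# One minimize() over the combined objective instead of two sequential calls.
train_op = tf.train.RMSPropOptimizer(5e-5).minimize(
    w1_loss + ent_coef * tf.reduce_mean(-q_entropy), var_list=fqf_params)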

Reproducing paper results

Hi,
I am trying to evaluate FQF, to use it as a baseline on some discrete environments. However, I encountered an issue: the script run-iqn.sh [EDIT: run-fqf.sh] does not seem to evaluate FQF, but actually IQN. I think the problem is that the function create_agent in dopamine/discrete_domains/run_experiment.py can only create Rainbow, DQN and IQN agents (and not FQF). It is possible I missed something; could you explain how I can use this code to evaluate FQF?
Thanks,
Nino
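
A hypothetical sketch of the missing branch: the module path comes from the traceback in the tf.gather_nd issue above, but the FQFAgent class name and constructor arguments are assumptions, mirroring how dopamine's create_agent builds its other agents, and are not verified against the repo.

from dopamine.agents.fqf import fqf_agent  # module path as seen in the traceback above

def create_fqf_agent(sess, environment, summary_writer=None):
    # Return an FQF agent the same way create_agent returns DQN/Rainbow/IQN agents.
    return fqf_agent.FQFAgent(                # class name is an assumption
        sess,
        num_actions=environment.action_space.n,
        summary_writer=summary_writer)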

entropy coefficient problem

If I didn't get it wrong, there might be a subtle problem in applying gradients to FPN's trainable variables.

The entropy coefficient (0.001, i.e. fqf_ent or self.ent in the code) appears to be applied twice.

First, at fqf_agent.py line 399, via the magic number 0.001:

q_entropy = tf.reduce_sum(-quantile_tau * tf.log(quantile_tau), axis=1) * 0.001

Then at line 419 of the same file it is applied a second time, via self.ent:

self.optimizer1.minimize(self.ent * tf.reduce_mean(-q_entropy), var_list=fqf_params), \
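
If both scalings are indeed active, the effective entropy weight is the product of the two factors rather than 0.001 itself. A quick check, assuming self.ent also holds 0.001 as the wording above suggests:

hard_coded = 0.001            # the magic number at fqf_agent.py line 399
self_ent = 0.001              # self.ent / fqf_ent applied again at line 419 (assumed value)
print(hard_coded * self_ent)  # 1e-06, far smaller than either coefficient alone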

fraction proposal network of FQF

Hi,
I have some questions about the fraction proposal network (FPN) of FQF:
1. Why set fraction_lr = 5e-5 * fqf_factor (0.000001) = 5e-11, which is very small? I also found that the tau_hats distribution barely changed during training.
2. Why apply initialize_weights_xavier(x, gain=0.01)? When I trained without this initialization, gradient explosion sometimes happened.
3. Why use RMSprop with alpha=0.95 and eps=0.00001, when the default values are 0.99 and 1e-8 respectively?
4. I also found that the tau_hats distribution barely changed while training on Qbert. Is this the key to the algorithm?
Thanks!
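
The names in these questions (alpha, eps, xavier gain) follow PyTorch conventions, so here is a sketch of the configuration they describe in torch syntax. The layer size is made up, fpn is a hypothetical stand-in for the fraction proposal network, and all hyper-parameter values are taken from the questions, not verified against the repo.

import torch

fpn = torch.nn.Linear(7 * 7 * 64, 32)                 # hypothetical FPN layer
torch.nn.init.xavier_uniform_(fpn.weight, gain=0.01)  # the gain=0.01 initialization in question 2

fqf_factor = 0.000001
fpn_optimizer = torch.optim.RMSprop(
    fpn.parameters(),
    lr=5e-5 * fqf_factor,  # = 5e-11, the very small fraction_lr in question 1
    alpha=0.95,            # question 3: vs. the default 0.99
    eps=0.00001)           # question 3: vs. the default 1e-8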
