
trfl's Introduction

TRFL

TRFL (pronounced "truffle") is a library built on top of TensorFlow that exposes several useful building blocks for implementing Reinforcement Learning agents.

Installation

TRFL can be installed from pip with the following command: pip install trfl

TRFL works with both the CPU and GPU versions of TensorFlow, but to allow for that it does not list TensorFlow as a requirement, so you need to install TensorFlow and TensorFlow Probability separately if you haven't already done so.
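
For example, a CPU-only setup might be as simple as pip install tensorflow tensorflow-probability followed by pip install trfl; the exact packages and versions you need depend on whether you want CPU or GPU support and on which TensorFlow release you are targeting.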

Usage Example

import tensorflow as tf
import trfl

# Q-values for the previous and next timesteps, shape [batch_size, num_actions].
q_tm1 = tf.get_variable(
    "q_tm1", initializer=[[1., 1., 0.], [1., 2., 0.]], dtype=tf.float32)
q_t = tf.get_variable(
    "q_t", initializer=[[0., 1., 0.], [1., 2., 0.]], dtype=tf.float32)

# Action indices, discounts and rewards, shape [batch_size].
a_tm1 = tf.constant([0, 1], dtype=tf.int32)
r_t = tf.constant([1, 1], dtype=tf.float32)
pcont_t = tf.constant([0, 1], dtype=tf.float32)  # the discount factor

# Q-learning loss, and auxiliary data.
loss, q_learning = trfl.qlearning(q_tm1, a_tm1, r_t, pcont_t, q_t)

loss is the tensor representing the loss. For Q-learning, it is half the squared difference between the predicted Q-values and the TD targets, shape [batch_size]. Extra information is in the q_learning namedtuple, including q_learning.td_error and q_learning.target.
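
As a sanity check, for the inputs above the Q-learning quantities work out by hand (arithmetic only, not TRFL output) to:

target   = r_t + pcont_t * max_a q_t  = [1 + 0 * 1, 1 + 1 * 2] = [1., 3.]
qa_tm1   = q_tm1[a_tm1]               = [1., 2.]
td_error = target - qa_tm1            = [0., 1.]
loss     = 0.5 * td_error^2           = [0., 0.5]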

The loss tensor can be differentiated to derive the corresponding RL update.

reduced_loss = tf.reduce_mean(loss)
optimizer = tf.train.AdamOptimizer(learning_rate=0.1)
train_op = optimizer.minimize(reduced_loss)
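
In TF1 graph mode, the resulting op can then be run in a session; a minimal sketch (the number of steps and the printing are arbitrary):

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  for _ in range(100):
    sess.run(train_op)
  print(sess.run(reduced_loss))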

All loss functions in the package return both a loss tensor and a namedtuple with extra information, using the above convention, but different functions may have different extra fields. Check the documentation of each function below for more information.

Documentation

Check out the full documentation page here.

trfl's People

Contributors

abdel, aslanides, dhruva6, diegolascasas, dwf, hartikainen, kaue, liusiqi43, miljanm, mtthss, n-kats, superbobry, xiaoschannel

trfl's Issues

Legal actions mask bug

Found a bug in epsilon_greedy() in policy_ops.py when applying legal_actions_mask. It fails when masking the action with the highest action value.

For example:

action_values = [2.0, 1.0, 1.0]
legal_actions_mask = [0., 1., 1.]
epsilon = 0.1
result = policy_ops.epsilon_greedy(action_values, epsilon, legal_actions_mask).probs

Outputs:
[0.9 0.05 0.05]
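
For reference, a minimal sketch (plain Python/NumPy, not the trfl implementation) of the behaviour one would expect: illegal actions are pushed to -inf before the argmax, and the epsilon mass is spread over legal actions only.

import numpy as np

def masked_epsilon_greedy_probs(action_values, legal_actions_mask, epsilon):
  # Hypothetical helper, for illustration only.
  values = np.asarray(action_values, dtype=np.float64)
  mask = np.asarray(legal_actions_mask, dtype=np.float64)
  masked_values = np.where(mask > 0, values, -np.inf)  # rule out illegal actions
  greedy = np.zeros_like(values)
  greedy[np.argmax(masked_values)] = 1.0                # greedy over legal actions
  uniform_legal = mask / mask.sum()                     # uniform over legal actions
  return (1.0 - epsilon) * greedy + epsilon * uniform_legal

print(masked_epsilon_greedy_probs([2.0, 1.0, 1.0], [0., 1., 1.], 0.1))
# Expected for this example: [0.   0.95 0.05]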

import trfl not working

I am using Spyder (Python 3.6) on Ubuntu 18.04.
import tensorflow

import trfl

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:

Traceback (most recent call last):
  File "", line 1, in <module>
    import trfl
  File "/home/dd/.local/lib/python3.6/site-packages/trfl/__init__.py", line 31, in <module>
    from trfl.dist_value_ops import categorical_dist_double_qlearning
  File "/home/dd/.local/lib/python3.6/site-packages/trfl/dist_value_ops.py", line 33, in <module>
    from trfl import distribution_ops
  File "/home/dd/.local/lib/python3.6/site-packages/trfl/distribution_ops.py", line 30, in <module>
    from trfl import gen_distribution_ops
  File "/home/dd/.local/lib/python3.6/site-packages/trfl/gen_distribution_ops.py", line 2, in <module>
    _op_lib = tf.load_op_library(tf.resource_loader.get_path_to_datafile("_gen_distribution_ops.so"))
  File "/home/dd/.local/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
NotFoundError: /home/dd/.local/lib/python3.6/site-packages/trfl/_gen_distribution_ops.so: undefined symbol: _ZN10tensorflow14kernel_factory17OpKernelRegistrar12InitInternalEPKNS_9KernelDefEN4absl11string_viewEPFPNS_8OpKernelEPNS_20OpKernelConstructionEE

Removing tf.contrib

Would you be open to accepting a PR to remove code that uses tf.contrib, as it won't be available in TF 2?

How is deterministic policy gradient being evaluated?

I cannot grasp the steps in lines 87 to 92 of trfl/blob/master/trfl/dpg_ops.py. Why is a target_a being created? The subsequent stop_gradient is understandable, since we don't want to update the Q-network's trainable variables, but then what does the loss on the next line represent?
DPG, to me, is an application of the chain rule. How does optimising this loss help update the network?

I don't know if there is a better way to ask this question, as I could not contact the authors of dpg_ops.py (mainly Matteo Hessel and Miljan Martic) by any other means.
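
For context, the pattern described in the question (building a target_a, stopping gradients through it, and taking half a squared error) is a common surrogate-loss trick for DPG. A minimal sketch of that trick, not necessarily the exact trfl code:

# actions: output of the actor network, shape [batch_size, action_dims].
# q_values: critic evaluated at (state, actions), shape [batch_size].
dqda = tf.gradients(q_values, actions)[0]    # dQ/da
target_a = tf.stop_gradient(dqda + actions)  # treated as a constant target
loss = 0.5 * tf.reduce_sum(tf.square(target_a - actions), axis=-1)
# d(loss)/d(actions) = -(target_a - actions) = -dqda, so minimising this loss
# moves the actions (and, via the chain rule, the actor parameters) in the
# direction that increases Q, which is exactly the DPG update.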

Add/alias dpg critic update

Hi, the DPG critic update (see Algorithm 1 of Lillicrap et al. 2016, https://arxiv.org/abs/1509.02971) is substantively the same as your td_learning function; however, this is currently obscured. I would suggest adding a dpg_qlearning function that aliases td_learning in dpg_ops.py:

from trfl.value_ops import td_learning
...
dpg_qlearning = td_learning

Alternatively, one could add a comment in the DPG actor update function referencing the td_learning function.

Retrace Ops: documented return shapes

Hi, it seems like the documented return shapes for the following functions might be off:

  1. retrace_ops.retrace(...)
  2. retrace_ops.retrace_core(...)
  3. retrace_ops._general_off_policy_corrected_multistep_target(...)

The first two are documented to return shape [B] and the third shape [T, B, num_actions], while they all appear to return [T, B].

Some test code to check.

import numpy as np
import tensorflow as tf

from trfl import retrace_ops, indexing_ops


### Example input data: 
# https://github.com/deepmind/trfl/blob/08ccb293edb929d6002786f1c0c177ef291f2956/trfl/retrace_ops_test.py#L41

lambda_ = 0.9
qs = [
    [[2.2, 3.2, 4.2],
     [5.2, 6.2, 7.2]],
    [[7.2, 6.2, 5.2],
     [4.2, 3.2, 2.2]],
    [[3.2, 5.2, 7.2],
     [4.2, 6.2, 9.2]],
    [[2.2, 8.2, 4.2],
     [9.2, 1.2, 8.2]]
     ]
targnet_qs = [
    [[2., 3., 4.],
     [5., 6., 7.]],
    [[7., 6., 5.],
     [4., 3., 2.]],
    [[3., 5., 7.],
     [4., 6., 9.]],
    [[2., 8., 4.],
     [9., 1., 8.]]
     ]
actions = [
    [2, 0], 
    [1, 2], 
    [0, 1], 
    [2, 0]
    ]
rewards = [
    [1.9, 2.9], 
    [3.9, 4.9], 
    [5.9, 6.9], 
    [np.nan, np.nan]  # nan marks entries we should never use.
    ]
pcontinues = [
    [0.8, 0.9], 
    [0.7, 0.8], 
    [0.6, 0.5], 
    [np.nan, np.nan]
    ]
target_policy_probs = [
    [[np.nan] * 3,
     [np.nan] * 3],
    [[0.41, 0.28, 0.31],
     [0.19, 0.77, 0.04]],
    [[0.22, 0.44, 0.34],
     [0.14, 0.25, 0.61]],
    [[0.16, 0.72, 0.12],
     [0.33, 0.30, 0.37]]
     ]
behaviour_policy_probs = [
    [np.nan, np.nan], 
    [0.85, 0.86], 
    [0.87, 0.88], 
    [0.89, 0.84]
    ]

### Retrace Test: ###
retrace = retrace_ops.retrace(
        lambda_, qs, targnet_qs, actions, rewards,
        pcontinues, target_policy_probs, behaviour_policy_probs)

# qs: shape [(T+1), B, num_actions] 
# https://github.com/deepmind/trfl/blob/08ccb293edb929d6002786f1c0c177ef291f2956/trfl/retrace_ops.py#L85
T = len(qs) - 1  # sequence length
B = len(qs[0])  # batch dimension
N = len(qs[0][0])  # number of actions

# loss: documented shape [B] 
# https://github.com/deepmind/trfl/blob/08ccb293edb929d6002786f1c0c177ef291f2956/trfl/retrace_ops.py#L121
tf.debugging.assert_equal(retrace.loss.shape, [T, B])  # succeeds

### Multi-step target Test: ###
timesteps = tf.shape(qs)[0] # Batch size is qs_shape[1].
timestep_indices_tm1 = tf.range(0, timesteps - 1)
timestep_indices_t = tf.range(1, timesteps)

target_policy_t = tf.gather(target_policy_probs, timestep_indices_t)
behaviour_policy_t = tf.gather(behaviour_policy_probs, timestep_indices_t)
a_t = tf.gather(actions, timestep_indices_t)
r_t = tf.gather(rewards, timestep_indices_tm1)
pcont_t = tf.gather(pcontinues, timestep_indices_tm1)
targnet_q_t = tf.gather(targnet_qs, timestep_indices_t)

c_t = retrace_ops._retrace_weights(
        indexing_ops.batched_index(target_policy_t, a_t),
        behaviour_policy_t) * lambda_

target = retrace_ops._general_off_policy_corrected_multistep_target(
  r_t, pcont_t, target_policy_t, c_t, targnet_q_t, a_t
)

# target: documented shape [T, B, N] 
# https://github.com/deepmind/trfl/blob/08ccb293edb929d6002786f1c0c177ef291f2956/trfl/retrace_ops.py#L241
tf.debugging.assert_equal(target.shape, [T, B])  # succeeds

Trouble Installing TRFL 1.0.1 in Colab

I tried installing trfl version 1.0.1 in Colab and am getting an error:
import trfl

---------------------------------------------------------------------------
NotFoundError                             Traceback (most recent call last)
<ipython-input-2-dd69192d7d7c> in <module>()
----> 1 import trfl

/usr/local/lib/python3.6/dist-packages/trfl/__init__.py in <module>()
     29 from trfl.discrete_policy_gradient_ops import discrete_policy_gradient_loss
     30 from trfl.discrete_policy_gradient_ops import sequence_advantage_actor_critic_loss
---> 31 from trfl.dist_value_ops import categorical_dist_double_qlearning
     32 from trfl.dist_value_ops import categorical_dist_qlearning
     33 from trfl.dist_value_ops import categorical_dist_td_learning

/usr/local/lib/python3.6/dist-packages/trfl/dist_value_ops.py in <module>()
     31 import tensorflow as tf
     32 from trfl import base_ops
---> 33 from trfl import distribution_ops
     34 
     35 Extra = collections.namedtuple("dist_value_extra", ["target"])

/usr/local/lib/python3.6/dist-packages/trfl/distribution_ops.py in <module>()
     28 import tensorflow as tf
     29 import tensorflow_probability as tfp
---> 30 from trfl import gen_distribution_ops
     31 
     32 

/usr/local/lib/python3.6/dist-packages/trfl/gen_distribution_ops.py in <module>()
      1 import tensorflow as tf
----> 2 _op_lib = tf.load_op_library(tf.resource_loader.get_path_to_datafile("_gen_distribution_ops.so"))
      3 project_distribution = _op_lib.project_distribution
      4 del _op_lib, tf

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/load_library.py in load_op_library(library_filename)
     58     RuntimeError: when unable to load the library or get the python wrappers.
     59   """
---> 60   lib_handle = py_tf.TF_LoadLibrary(library_filename)
     61 
     62   op_list_str = py_tf.TF_GetOpList(lib_handle)

NotFoundError: /usr/local/lib/python3.6/dist-packages/trfl/_gen_distribution_ops.so: undefined symbol: _ZN10tensorflow14kernel_factory17OpKernelRegistrar12InitInternalEPKNS_9KernelDefEN4absl11string_viewESt10unique_ptrINS0_15OpKernelFactoryESt14default_deleteIS8_EE

I was able to install TRFL previously with Colab. As discussed in earlier issues I installed TF 1.12, reset the runtime, installed TF prob 0.5, and installed TRFL. This was working until recently (past week or so?):
https://colab.research.google.com/drive/1h5QdpZZ-Vz2KdTiiidS4O28b-pU0ihgn

If I specify the TRFL version as 1.0, I am still able to install and run TRFL as I used to:
https://colab.research.google.com/drive/1YoITxCmP-3v-WWKqQxJMR1w3Kyc5nKjw

Clarification of some abbreviations?

Dear Deepminder:

During a group meeting, while I was introducing TRFL to my lab members, a question was raised about the meaning of some abbreviations in the TRFL demo code, so I have to ask it here.

It reads:

q_tm1: the action value in the source state of a transition.
a_tm1: the action that was selected in the source state.

What does "m1" mean here? I know "q" stands for the action value and "t" stands for the time step, but I could not figure out what "m1" stands for; it is not so intuitive.

Could you please help me on that? Thanks a lot.

Unable to install trfl on Windows 10 via Anaconda Prompt

Neither of the two installation options seems to work for me.

The command pip install trfl throws the following error:

Collecting trfl
ERROR: Could not find a version that satisfies the requirement trfl (from versions: none)
ERROR: No matching distribution found for trfl

And the command pip install git+git://github.com/deepmind/trfl.git throws this error:

...
Building wheels for collected packages: trfl
Building wheel for trfl (setup.py) ... error
ERROR: Complete output from command 'c:\users\luis\anaconda3\python.exe' -u -c 'import setuptools, tokenize;__file__='"'"'C:\Users\Luis\AppData\Local\Temp\pip-req-build-5488atsp\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\Luis\AppData\Local\Temp\pip-wheel-2xiqoy_g' --python-tag cp36:
ERROR: running bdist_wheel
running build
running build_py
creating build
error: could not create 'build': file exists


ERROR: Failed building wheel for trfl
Running setup.py clean for trfl
Failed to build trfl
Installing collected packages: trfl
Running setup.py install for trfl ... error
ERROR: Complete output from command 'c:\users\luis\anaconda3\python.exe' -u -c 'import setuptools, tokenize;__file__='"'"'C:\Users\Luis\AppData\Local\Temp\pip-req-build-5488atsp\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\Luis\AppData\Local\Temp\pip-record-82clib4d\install-record.txt' --single-version-externally-managed --compile:
ERROR: running install
running build
running build_py
creating build
error: could not create 'build': file exists
----------------------------------------
ERROR: Command "'c:\users\luis\anaconda3\python.exe' -u -c 'import setuptools, tokenize;__file__='"'"'C:\Users\Luis\AppData\Local\Temp\pip-req-build-5488atsp\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\Luis\AppData\Local\Temp\pip-record-82clib4d\install-record.txt' --single-version-externally-managed --compile" failed with error code 1 in C:\Users\Luis\AppData\Local\Temp\pip-req-build-5488atsp

policy_gradient_loss batch_shape requirements

Why does policy_gradient_ops.policy_gradient_loss require batch_shape to be rank 2? Doesn't this limit the policy_gradient_loss operation to single univariate distributions that implement log_prob?

For instance, consider the problem where the actions are multivariate and follow a normal distribution:

>>> import tensorflow as tf; tf.enable_eager_execution()
>>> import tensorflow.contrib.eager as tfe
>>> import tensorflow_probability as tfp
>>> import trfl
>>> loc = tfe.Variable(tf.zeros([5, 5, 2]))
>>> policy = tfp.distributions.Normal(loc=loc, scale=1.)
>>> policy
<tfp.distributions.Normal 'Normal/' batch_shape=(5, 5, 2) event_shape=() dtype=float32>
>>> trfl.policy_gradient_loss(policy, tf.zeros([5, 5, 2]), tf.ones([5, 5]), [loc])
Traceback (most recent call last):
  File "/trfl/policy_gradient_ops.py", line 119, in policy_gradient_loss
    policies_.batch_shape.assert_has_rank(2)
  File "/tensorflow/python/framework/tensor_shape.py", line 728, in assert_has_rank
    raise ValueError("Shape %s must have rank %d" % (self, rank))
ValueError: Shape (5, 5, 2) must have rank 2

I could understand how this is a requirement for a discrete distribution. But for the sake of supporting other distributions, it may be more structured to allow log_prob to be rank 3 and then perform a summation over the trailing (event) dimension:

>>> policy.log_prob(tf.zeros([5, 5, 2])).shape
TensorShape([Dimension(5), Dimension(5), Dimension(2)])
>>> tf.reduce_sum(policy.log_prob(tf.zeros([5, 5, 2])), axis=-1).shape
TensorShape([Dimension(5), Dimension(5)])
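
As a possible workaround under the current rank-2 requirement, the event dimension can be folded into the distribution itself; a sketch assuming tfp.distributions.Independent, which reinterprets trailing batch dimensions as event dimensions so that log_prob already sums over the action dimension:

>>> policy = tfp.distributions.Independent(
...     tfp.distributions.Normal(loc=loc, scale=1.), reinterpreted_batch_ndims=1)
>>> policy.batch_shape
TensorShape([Dimension(5), Dimension(5)])
>>> policy.log_prob(tf.zeros([5, 5, 2])).shape
TensorShape([Dimension(5), Dimension(5)])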

Thank you for your support and time.

tensorflow.python.framework.errors_impl.NotFoundError: _gen_distribution_ops.so

I cannot run the example file from a basic installation.

repository here: https://github.com/LuisSaybe/trfl-gridworld

output:

WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.

Traceback (most recent call last):
  File "index.py", line 2, in <module>
    import trfl
  File "/usr/local/lib/python3.6/site-packages/trfl/__init__.py", line 31, in <module>
    from trfl.dist_value_ops import categorical_dist_double_qlearning
  File "/usr/local/lib/python3.6/site-packages/trfl/dist_value_ops.py", line 33, in <module>
    from trfl import distribution_ops
  File "/usr/local/lib/python3.6/site-packages/trfl/distribution_ops.py", line 30, in <module>
    from trfl import gen_distribution_ops
  File "/usr/local/lib/python3.6/site-packages/trfl/gen_distribution_ops.py", line 2, in <module>
    _op_lib = tf.load_op_library(tf.resource_loader.get_path_to_datafile("_gen_distribution_ops.so"))
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python3.6/site-packages/trfl/_gen_distribution_ops.so: undefined symbol: _ZN10tensorflow14kernel_factory17OpKernelRegistrar12InitInternalEPKNS_9KernelDefEN4absl11string_viewEPFPNS_8OpKernelEPNS_20OpKernelConstructionEE

I installed trfl with the following dockerfile

FROM centos:latest

ENV TERM xterm
ENV SOURCE_DIRECTORY /root/source
ENV PYTHON_VERSION 3.6.8

RUN yum -y update && \
    yum install -y gcc g++ openssl-devel zlib-devel libffi-devel man-pages man nano wget curl git-all unzip && \
    yum clean all && \

    wget --directory-prefix=/opt https://www.python.org/ftp/python/$PYTHON_VERSION/Python-$PYTHON_VERSION.tgz && \
    tar -xzf /opt/Python-$PYTHON_VERSION.tgz --directory /opt && \
    rm /opt/Python-$PYTHON_VERSION.tgz && \
    cd /opt/Python-$PYTHON_VERSION && \

    ./configure && \
    make && \
    make install && \

    pip3 install --upgrade pip && \
    pip3 install numpy tensorflow tensorflow_probability trfl && \

    mkdir -p $SOURCE_DIRECTORY

WORKDIR $SOURCE_DIRECTORY

Then I run

docker run -it --rm -v $(pwd)/src:/root/source gridworld-trfl python3 index.py

index.py here

import tensorflow as tf
import trfl

# Q-values for the previous and next timesteps, shape [batch_size, num_actions].
q_tm1 = tf.get_variable(
    "q_tm1", initializer=[[1., 1., 0.], [1., 2., 0.]], dtype=tf.float32)
q_t = tf.get_variable(
    "q_t", initializer=[[0., 1., 0.], [1., 2., 0.]], dtype=tf.float32)

# Action indices, discounts and rewards, shape [batch_size].
a_tm1 = tf.constant([0, 1], dtype=tf.int32)
r_t = tf.constant([1, 1], dtype=tf.float32)
pcont_t = tf.constant([0, 1], dtype=tf.float32)  # the discount factor

# Q-learning loss, and auxiliary data.
loss, q_learning = trfl.qlearning(q_tm1, a_tm1, r_t, pcont_t, q_t)

print('loss', loss)

Questions about retrace implementation

Hey,

I was looking at the retrace ops provided by trfl and there are a couple of implementation details that seem a bit confusing to me.

  1. It seems like trfl retrace drops the discount terms from the 𝔼_π Q(x_t, ·) term. This is in line with the retrace formulation in Equation 13 of the MPO paper [1], but differs from Equation 4 in the original retrace paper [2]. I have included a small test case below that shows this. Is this a bug or a conscious choice? Edit: actually, it seems like at least one of the terms is included in the continuation probs.

  2. In the retrace_ops._general_off_policy_corrected_multistep_target comments, it's mentioned that exp_q_t = 𝔼_π Q(x_{t+1}, ·) and qa_t = Q(x_t, a_t), indicating that exp_q_t should be one timestep ahead of qa_t: https://github.com/deepmind/trfl/blob/e633edbd9d326b8bebc7c7c7d53f37118b48a440/trfl/retrace_ops.py#L252-L253
    However, if I understand this correctly, when those values are actually assigned, they come from the same time indices: https://github.com/deepmind/trfl/blob/e633edbd9d326b8bebc7c7c7d53f37118b48a440/trfl/retrace_ops.py#L263-L264
    It's possible that the target_policy_t values that are used to index for exp_q_t somehow account for this, but I can't wrap my head around how that would work. Am I misunderstanding something here, or is it possible that these indices are actually off?

[1] Abdolmaleki, A., Springenberg, J.T., Tassa, Y., Munos, R., Heess, N. and Riedmiller, M., 2018. Maximum a posteriori policy optimisation. arXiv preprint arXiv:1806.06920.
[2] Munos, R., Stepleton, T., Harutyunyan, A. and Bellemare, M., 2016. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems (pp. 1054-1062).

Code related to question 1:

The test case is simplified (e.g. just one action) and I have used a slightly modified version of trfl to make it compatible with TF2, but all the logic should be correct.

import numpy as np
import tensorflow as tf

from trfl import retrace_ops


lambda_ = 0.99
discount = 0.9
Q_values = np.array([
    [[2.2], [5.2]],
    [[7.2], [4.2]],
    [[3.2], [4.2]],
    [[2.2], [9.2]]], dtype=np.float32)
target_Q_values = np.array([
    [[2.], [5.]],
    [[7.], [4.]],
    [[3.], [4.]],
    [[2.], [9.]]], dtype=np.float32)
actions = np.array([
    [0, 0],
    [0, 0],
    [0, 0],
    [0, 0]])
rewards = np.array([
    [1.9, 2.9],
    [3.9, 4.9],
    [5.9, 6.9],
    [np.nan, np.nan],  # nan marks entries we should never use.
], dtype=np.float32)
pcontinues = np.array([
    [0.8, 0.9],
    [0.7, 0.8],
    [0.6, 0.5],
    [np.nan, np.nan]], dtype=np.float32)
target_policy_probs = np.array([
    [[np.nan] * 1, [np.nan] * 1],
    [[1.0], [1.0]],
    [[1.0], [1.0]],
    [[1.0], [1.0]]], dtype=np.float32)
behavior_policy_probs = np.array([
    [np.nan, np.nan],
    [1.0, 1.0],
    [1.0, 1.0],
    [1.0, 1.0]], dtype=np.float32)


def retrace_original_v1(
        lambda_,
        discount,
        target_Q_values,
        actions,
        rewards,
        target_policy_probs,
        behavior_policy_probs):
    actions = actions[1:, ...]
    rewards = rewards[:-1, ...]

    target_policy_probs = target_policy_probs[1:, ...]
    behavior_policy_probs = behavior_policy_probs[1:, ...]

    traces = lambda_ * np.minimum(
        1.0, target_policy_probs / behavior_policy_probs[..., None])

    deltas = (
        rewards[..., None]
        + discount * target_Q_values[1:]
        - target_Q_values[:-1])
    retraces = []
    for i in range(tf.shape(traces)[0]):
        sum_terms = []
        for t in range(i, tf.shape(traces)[0]):
            trace = tf.reduce_prod([
                traces[k]
                for k in range(i + 1, t + 1)
            ], axis=0)
            sum_term = discount ** (t - i) * trace * deltas[t]
            sum_terms.append(sum_term)

        result = tf.reduce_sum(sum_terms, axis=0)
        retraces.append(result)

    retraces = tf.stack(retraces) + target_Q_values[:-1]
    return retraces


output_original_v1 = retrace_original_v1(
    lambda_,
    1.0,
    target_Q_values,
    actions,
    rewards,
    target_policy_probs,
    behavior_policy_probs)
print(f"output_original_v1:\n{output_original_v1.numpy().round(3)}\n")

output_original_discounted_v1 = retrace_original_v1(
    lambda_,
    discount,
    target_Q_values,
    actions,
    rewards,
    target_policy_probs,
    behavior_policy_probs)
print(f"output_original_discounted_v1:\n{output_original_discounted_v1.numpy().round(3)}\n")


output_trfl_v1 = retrace_ops.retrace(
    lambda_,
    Q_values,
    target_Q_values,
    actions,
    rewards,
    tf.ones_like(rewards),
    target_policy_probs,
    behavior_policy_probs,
).extra.target[..., None]


tf.debugging.assert_near(output_original_v1, output_trfl_v1)  # succeeds
tf.debugging.assert_near(output_original_discounted_v1, output_trfl_v1)  # fails

Issue with pip install trfl on MacOs

Hello,

I get the following error when using pip install trfl

Could not find a version that satisfies the requirement trfl (from versions: )
No matching distribution found for trfl

I have tensorflow 1.13.1 & tensorflow-probability 0.60

Do you have an idea what the issue could be?
Thanks in advance for your help

Raise "error: could not create 'build': File exists" while installing

When I first installed trfl, it raised an error almost at the end of the installation:

Failed building wheel for trfl
Running setup.py clean for trfl
Failed to build trfl
Installing collected packages: trfl
Running setup.py install for trfl ... error

The further output is:

running install
running build
running build_py
creating build
error: could not create 'build': File exists

ImportError: cannot import name gen_distribution_ops

When I try to import trfl, similarly to this public trfl colab notebook online, I get

(Note: I tried this in both Python 2 and Python 3 notebooks, with the same results.)

<ipython-input-3-dd69192d7d7c> in <module>()
----> 1 import trfl

/usr/local/lib/python2.7/dist-packages/trfl/__init__.py in <module>()
     29 from trfl.discrete_policy_gradient_ops import discrete_policy_gradient_loss
     30 from trfl.discrete_policy_gradient_ops import sequence_advantage_actor_critic_loss
---> 31 from trfl.dist_value_ops import categorical_dist_double_qlearning
     32 from trfl.dist_value_ops import categorical_dist_qlearning
     33 from trfl.dist_value_ops import categorical_dist_td_learning

/usr/local/lib/python2.7/dist-packages/trfl/dist_value_ops.py in <module>()
     31 import tensorflow as tf
     32 from trfl import base_ops
---> 33 from trfl import distribution_ops
     34 
     35 Extra = collections.namedtuple("dist_value_extra", ["target"])

/usr/local/lib/python2.7/dist-packages/trfl/distribution_ops.py in <module>()
     28 import tensorflow as tf
     29 import tensorflow_probability as tfp
---> 30 from trfl import gen_distribution_ops
     31 
     32 

ImportError: cannot import name gen_distribution_ops

(Also, if I install trfl via pip instead of cloning from git, the error messages look similar, with this added at the end:)


/usr/local/lib/python2.7/dist-packages/trfl/gen_distribution_ops.py in <module>()
      1 import tensorflow as tf
----> 2 _op_lib = tf.load_op_library(tf.resource_loader.get_path_to_datafile("_gen_distribution_ops.so"))
      3 project_distribution = _op_lib.project_distribution
      4 del _op_lib, tf

/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/load_library.pyc in load_op_library(library_filename)
     59     RuntimeError: when unable to load the library or get the python wrappers.
     60   """
---> 61   lib_handle = py_tf.TF_LoadLibrary(library_filename)
     62 
     63   op_list_str = py_tf.TF_GetOpList(lib_handle)

Help installing on Windows 10 via Anaconda Environment

I ran the recommended install command and got this:

>pip install git+git://github.com/deepmind/trfl.git
Collecting git+git://github.com/deepmind/trfl.git
  Cloning git://github.com/deepmind/trfl.git to c:\users\julius\appdata\local\temp\pip-req-build-8py9u2uh
  Error [WinError 2] The system cannot find the file specified while executing command git clone -q git://github.com/deepmind/trfl.git C:\Users\Julius\AppData\Local\Temp\pip-req-build-8py9u2uh
Cannot find command 'git' - do you have 'git' installed and in your PATH?

I am running Windows 10 and using an Anaconda environment.
