First of all, thank you very much @takuseno for this super nice and cleanly written RL library! 🔥
**Describe the bug**

**1. Truncated episodes:**

In my RL project, I run complete episodes and add them to the replay buffer afterwards by creating `Episode` objects and calling `buffer.append_episode(episode)`.
I saw that one cannot pass `next_observations` and `terminals` when creating episodes. Looking at the internals of how you create transitions, this makes sense, because `next_observations` do not have corresponding `next_actions` and `next_rewards` available.

Since you create `next_observations` from `observations`, the last observation of the episode (which is `next_observations[-1]`) is always missing. Furthermore, if `next_observations[-1]` is a real terminal state, where `terminals[-1]=True`, the agent would never see it in training. The sketch below illustrates the off-by-one.

The question is: are you using `next_actions` and `next_rewards` explicitly anywhere in the algorithms?
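To make the off-by-one concrete, here is a minimal sketch of the indexing as I understand it; `split_into_transitions` is a hypothetical helper for illustration only, not d3rlpy's actual internals:

```python
import numpy as np

# Hypothetical helper illustrating the indexing, not d3rlpy's actual code:
# transitions pair observations[t] with observations[t + 1], so T logged
# observations yield only T - 1 transitions.
def split_into_transitions(observations, actions, rewards):
    return list(zip(observations[:-1], actions[:-1], rewards[:-1],
                    observations[1:]))  # derived next_observations

T = 200
states = np.arange(T)    # stand-ins for s_0 ... s_199 passed as `observations`
actions = np.zeros(T)
rewards = np.ones(T)

transitions = split_into_transitions(states, actions, rewards)
print(len(transitions))  # 199 -> matches len(episode) in the repro below
# The state reached by the last action (next_states[-1] in my logging)
# cannot be passed at all, so it never appears in any transition.
```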
**2. Implicit terminals:**

I also noticed that you implicitly set `terminal=True` for the last observation. There are use cases where there are no real terminal states, e.g. environments with an infinite time horizon (control environments like `Pendulum-v0`). Setting `terminal=True` at the end of an episode for those environments would result in wrong TD targets, as the sketch below shows.
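For illustration, a minimal numeric sketch using the standard masked TD target `y = r + gamma * (1 - terminal) * Q(s', a')`; the numbers are made up:

```python
gamma = 0.99
reward = -0.1   # made-up reward of the episode's final transition
q_next = -50.0  # made-up Q-value of the successor state

# Standard masked TD target: y = r + gamma * (1 - terminal) * Q(s', a')
y_truncated = reward + gamma * (1 - 0) * q_next  # terminal=False (time limit)
y_terminal = reward + gamma * (1 - 1) * q_next   # terminal=True (implicit)

print(y_truncated)  # -49.6 -> bootstraps from the successor state, which is
                    #          correct for an infinite-horizon task
print(y_terminal)   # -0.1  -> discards all future value, wrong for a
                    #          time-truncated episode
```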
**To Reproduce**

In the `Pendulum-v0` environment, the episode is truncated after 200 time steps. The `info` dict that is returned at each time step indicates the truncation with `info={'TimeLimit.truncated': True}`.
```python
import gym
import numpy as np

from d3rlpy.dataset import Episode

if __name__ == '__main__':
    env = gym.make('Pendulum-v0')

    states = []
    actions = []
    rewards = []
    next_states = []
    dones = []

    done = False
    state = env.reset()
    while not done:
        action = env.action_space.sample()
        next_state, reward, done, info = env.step(action)
        states.append(state)
        rewards.append(reward)
        actions.append(action)
        next_states.append(next_state)
        dones.append(done)
        state = next_state

    # info of the last env step indicates a time-truncated episode -> no terminal state
    print(info)
    if info.get('TimeLimit.truncated', False):
        dones[-1] = False

    states = np.stack(states, axis=0)
    actions = np.stack(actions, axis=0)
    rewards = np.stack(rewards, axis=0)

    episode = Episode(env.observation_space.shape,
                      env.action_space.low.size,
                      states,
                      actions,
                      rewards)

    # len(episode) == 199 since next_states are not used to create it
    print(len(episode))
```
**Expected behavior**

I would expect to be able to control whether `terminals=True` is set at the end of the episode.

To get an episode length of 200 in the `Pendulum-v0` example, one can extend the action and reward vectors with dummy values at the last time step, but at the risk that these dummy values are used somewhere in the library (see the padding sketch below).
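A sketch of that padding, continuing from the repro script above; the zero padding values are arbitrary dummies:

```python
# Continuing from the repro script above: append the true final state and pad
# actions/rewards with arbitrary dummy values so all arrays have length 201.
states_padded = np.concatenate([states, next_states[-1][None]], axis=0)
actions_padded = np.concatenate([actions, np.zeros_like(actions[-1:])], axis=0)
rewards_padded = np.concatenate([rewards, np.zeros_like(rewards[-1:])], axis=0)

episode = Episode(env.observation_space.shape,
                  env.action_space.low.size,
                  states_padded,
                  actions_padded,
                  rewards_padded)

print(len(episode))  # 200 -- but the dummy action/reward of the padded step
                     # might be used somewhere inside the library
```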
A workaround is to use `buffer.append` in a for loop to add the episode to the buffer. But one must not forget to set `buffer.prev_observation`, `buffer.prev_actions` etc. to `None` afterwards, roughly as sketched below.
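Roughly like this; I am assuming a step-wise `buffer.append(observation, action, reward, terminal)` signature and the attribute names mentioned above, so please double-check against the actual API of your d3rlpy version:

```python
# Assumed step-wise signature buffer.append(observation, action, reward,
# terminal); verify against the actual d3rlpy API before relying on this.
for state, action, reward, done in zip(states, actions, rewards, dones):
    buffer.append(state, action, reward, done)

# Reset the buffer's episode-tracking attributes so the next episode is not
# chained to this one (attribute names as mentioned above).
buffer.prev_observation = None
buffer.prev_actions = None
```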
**Edit:**

I dived a bit deeper into your algo update functions and found that you are using `rew_tp1` to calculate the TD target: `y = rew_tp1 + gamma * q_tp1`.

If I add transitions to the buffer as mentioned above, the reward of the current time step should be used instead of `rew_tp1` (as far as I know). The sketch below contrasts the two indexing conventions.
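A small sketch of the two reward-indexing conventions with made-up numbers; which one is correct depends on how the transition tuples are stored:

```python
# Two reward-indexing conventions for the TD target of (s_t, a_t, ., s_{t+1});
# made-up numbers for illustration.
gamma = 0.99
r_t = 1.0     # reward received for taking a_t in s_t ("current time step")
r_tp1 = 2.0   # reward received one step later
q_tp1 = 10.0  # Q(s_{t+1}, a_{t+1})

# Convention A: the reward is stored with (s_t, a_t) -> use the current reward
y_a = r_t + gamma * q_tp1

# Convention B: the reward is stored with s_{t+1} -> use rew_tp1 as above
y_b = r_tp1 + gamma * q_tp1

print(y_a, y_b)  # 10.9 11.9 -- mixing the two conventions shifts every
                 # TD target by one step
```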
Furthermore, I see that you are not taking the entropy bonus (e.g. in SAC) into account in the TD target calculation. Is this intended? In other libraries it is added there; see the sketch below.
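For reference, a minimal sketch of the SAC target with the entropy bonus as in the original SAC paper; the values are made up and this is not d3rlpy's actual code:

```python
# SAC TD target with the entropy bonus (Haarnoja et al., 2018):
#   y = r + gamma * (1 - done) * (min_i Q_i(s', a') - alpha * log_pi(a'|s'))
# Made-up scalar values for illustration; not d3rlpy's implementation.
def sac_td_target(reward, done, q1_next, q2_next, log_prob_next,
                  gamma=0.99, alpha=0.2):
    q_next = min(q1_next, q2_next)           # clipped double-Q estimate
    soft_q = q_next - alpha * log_prob_next  # entropy-regularized value
    return reward + gamma * (1.0 - done) * soft_q

y = sac_td_target(reward=1.0, done=0.0,
                  q1_next=10.0, q2_next=9.5, log_prob_next=-1.2)
print(y)  # ~10.6426 -- without the entropy term it would be
          # 1 + 0.99 * 9.5 = 10.405
```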
Sorry for mixing up so many topics. Should I create separate issues for discussion?