Comments (7)
I finally solved the mystery. The training that seemingly stops at 25000 steps is actually pausing to run an evaluation. In fact, it's very easy to change this number with the "--eval-freq" argument.
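For example, something like this (the frequency value here is just illustrative):

python train.py --algo ppo --env ContinuousWorld-v0 --eval-freq 50000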
Both of the problems that I mentioned here as bugs were of my own creation: I didn't read the train.py source code and didn't properly check my own code before posting an issue.
So I wanted to publicly apologize to @qgallouedec and especially @araffin for wasting their time on this.
I'll try to be more careful when asking for help in the future.
By the way, I noticed this thanks to the incredible speed of the sbx implementation: it let me wait out my impatience for the few seconds the evaluation needed to finish, until the counter went past 25000. So yeah, thanks for making that awesome library too.
Sorry, I didn't provide any code initially because it's way too long (assuming this is the checklist point I was missing). Here is my custom environment:
from enum import Enum
import sys

import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3.common.vec_env import DummyVecEnv


class Actions(Enum):
    RIGHT = 0
    UP = 1
    LEFT = 2
    DOWN = 3


class ContinuousWorldEnv(gym.Env):
    metadata = {"render_modes": [None]}

    def __init__(self, size=50, render_mode=None):
        self.least_distance = np.inf
        self.last_action = 0
        self.size = size  # The size of the square grid
        self.window_size = 512  # The size of the PyGame window

        # Observations are dictionaries with the agent's and the target's
        # location, normalized to [0, 1].
        self.observation_space = spaces.Dict(
            {
                "agent": spaces.Box(low=0, high=1, shape=(2,), dtype=np.float64),
                "target": spaces.Box(low=0, high=1, shape=(2,), dtype=np.float64),
            }
        )

        self._agent_location = np.array([-1, -1], dtype=int)
        self._target_location = np.array([-1, -1], dtype=int)

        # We have 4 actions, corresponding to "right", "up", "left", "down"
        self.action_space = spaces.Discrete(4)

        """
        The following dictionary maps abstract actions from `self.action_space` to
        the direction we will walk in if that action is taken.
        i.e. 0 corresponds to "right", 1 to "up" etc.
        """
        self._action_to_direction = {
            Actions.RIGHT.value: np.array([1, 0], dtype=int),
            Actions.UP.value: np.array([0, 1], dtype=int),
            Actions.LEFT.value: np.array([-1, 0], dtype=int),
            Actions.DOWN.value: np.array([0, -1], dtype=int),
        }

    def _normalize_observation(self, observation, size):
        # Scale grid coordinates in [0, size - 1] down to [0, 1]
        normalized_observation = {}
        for key, value in observation.items():
            normalized_observation[key] = value / (size - 1)
        return normalized_observation

    def _get_distance(self):
        # Manhattan (L1) distance between agent and target
        return np.linalg.norm(self._agent_location - self._target_location, ord=1)

    def _get_obs(self):
        return {"agent": self._agent_location, "target": self._target_location}

    def _get_info(self):
        return {"distance": self._get_distance()}

    def reset(self, seed=None, options=None):
        # We need the following line to seed self.np_random
        super().reset(seed=seed)

        # Choose the agent's location uniformly at random
        self._agent_location = self.np_random.integers(0, self.size, size=2, dtype=int)

        # Sample the target's location randomly until it does not coincide
        # with the agent's location
        self._target_location = self._agent_location
        while np.array_equal(self._target_location, self._agent_location):
            self._target_location = self.np_random.integers(0, self.size, size=2, dtype=int)

        self.last_action = 0
        self.least_distance = self._get_distance()

        # observation = self._get_obs()
        observation = self._normalize_observation(self._get_obs(), self.size)
        info = self._get_info()
        return observation, info

    def _get_reward(self, action):
        distance = self._get_distance()
        reward = 0.0
        # If the distance to the target is less than 1, the agent has reached the target
        if distance < 1:
            print("Agent has reached the target")
            reward += 1.0
        # If the distance is below the least distance seen so far, the agent
        # is getting closer to the target
        elif distance < self.least_distance:
            self.least_distance = distance
            # print("Agent is getting closer to the target")
            reward += 0.01
        # Otherwise the agent is not getting any closer to the target
        else:
            reward -= 0.01
        # Small bonus for varying the action, small penalty for repeating it
        if action != self.last_action:
            reward += 0.01
        else:
            reward -= 0.01
        return reward

    def step(self, action):
        # Map the action (element of {0,1,2,3}) to the direction we walk in
        direction = self._action_to_direction[action]
        # We use `np.clip` to make sure we don't leave the grid
        self._agent_location = np.clip(self._agent_location + direction, 0, self.size - 1)
        # An episode is done if the agent has reached the target
        terminated = np.array_equal(self._agent_location, self._target_location)
        reward = self._get_reward(action)
        self.last_action = action
        # observation = self._get_obs()
        observation = self._normalize_observation(self._get_obs(), self.size)
        info = self._get_info()
        return observation, reward, terminated, False, info

    def close(self):
        sys.exit("\nExiting...\n")
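For completeness: the environment is registered with Gymnasium under the ContinuousWorld-v0 id that appears in the logs below. A minimal sketch (the entry_point module path is illustrative and should match wherever ContinuousWorldEnv actually lives):

from gymnasium.envs.registration import register

# Make the env available as "ContinuousWorld-v0"; the module path below is
# a placeholder for the file that defines ContinuousWorldEnv.
register(
    id="ContinuousWorld-v0",
    entry_point="continuous_world_env:ContinuousWorldEnv",
)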
    def close(self):
        sys.exit("\nExiting...\n")

This is your issue. I'm not sure why you use sys.exit(), as it kills the Python script.
Wow, that's a little embarrassing... Thanks for pointing it out, though.
After changing that line to:
print("Closing the environment")
And adding a very simple truncation mechanism in the step function:
truncated = False
if self.timestep_counter >= 30000:
    truncated = True
    print("Episode truncated")
else:
    self.timestep_counter += 1
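For this to work, the counter also has to be (re)initialized, and step() now returns truncated instead of the hard-coded False. A minimal sketch of the reset() side, assuming the counter lives on the environment:

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # Restart the per-episode step counter used by the truncation check
        self.timestep_counter = 0
        # ... rest of reset() unchanged ...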
The Optuna optimization starts normally, and it does seem to work alright. For good measure, here is a partial log of the optimization process:
========== ContinuousWorld-v0 ==========
Seed: 1257939581
Loading hyperparameters from: C:\Users\silve\AppData\Local\Programs\Python\Python311\Lib\site-packages\rl_zoo3\hyperparams\ppo.yml
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('batch_size', 256),
             ('clip_range', 'lin_0.2'),
             ('ent_coef', 0.0),
             ('gae_lambda', 0.8),
             ('gamma', 0.98),
             ('learning_rate', 'lin_0.001'),
             ('n_envs', 8),
             ('n_epochs', 20),
             ('n_steps', 32),
             ('n_timesteps', 100000.0),
             ('policy', 'MultiInputPolicy')])
Using 8 environments
Overwriting n_timesteps with n=10000
Doing 1 intermediate evaluations for pruning based on the number of timesteps. (1 evaluation every 100k timesteps)
Closing the environment
Optimizing hyperparameters
C:\Users\silve\AppData\Local\Programs\Python\Python311\Lib\site-packages\optuna\samplers\_tpe\sampler.py:319: ExperimentalWarning: ``multivariate`` option is an experimental feature. The interface can change in the future.
warnings.warn(
Sampler: tpe - Pruner: median
[I 2024-04-04 11:50:17,023] A new study created in memory with name: no-name-bcadd434-e207-47bc-9e99-3d0a25e03ea7
Agent has reached the target
Agent has reached the target
Episode truncated
Episode truncated
Episode truncated
Episode truncated
Episode truncated
Closing the environment
Closing the environment
Closing the environment
Closing the environment
Closing the environment
Closing the environment
Closing the environment
Closing the environment
Closing the environment
[I 2024-04-04 11:54:18,366] Trial 1 finished with value: -239.58799999999997 and parameters: {'batch_size': 16, 'n_steps': 64, 'gamma': 0.9999, 'learning_rate': 0.0003767443448677662, 'ent_coef': 0.0006887248484492611, 'clip_range': 0.4, 'n_epochs': 5, 'gae_lambda': 0.95, 'max_grad_norm': 0.6, 'vf_coef': 0.22057943399371072, 'net_arch': 'medium', 'activation_fn': 'relu'}. Best is trial 1 with value: -239.58799999999997.
Agent has reached the target
Episode truncated
... etc ...
As expected, the results are quite bad, so I'll be tweaking the timesteps and a few other things.
On the first topic, though (the training hanging at 25000 timesteps): it is still happening. What do you think might be the problem? Take into account that this does not happen when training with custom code like this:
from stable_baselines3 import PPO

# Function that creates a new environment (world_size and seed_value are defined elsewhere)
def make_env():
    def _init():
        return ContinuousWorldEnv(world_size)
    return _init

num_envs = 8

# List of environment creation functions
envs = [make_env() for _ in range(num_envs)]
# Vectorized environment
env = DummyVecEnv(envs)

# Train the agent
model = PPO(
    "MultiInputPolicy", env, verbose=1, ent_coef=0.02, clip_range=0.4,
    learning_rate=0.0002, n_steps=1028, seed=seed_value,
)
model.learn(total_timesteps=100000, progress_bar=True)
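As an aside, SB3's make_vec_env helper can build the same 8-env DummyVecEnv in one call; a sketch, reusing the world_size from above:

from stable_baselines3.common.env_util import make_vec_env

# Builds a DummyVecEnv with 8 copies of the env (each wrapped in a Monitor)
env = make_vec_env(lambda: ContinuousWorldEnv(world_size), n_envs=8)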
And one last thing: I don't know if this is intended behavior, but when I want to stop the Optuna optimization process, it does not stop with the typical Ctrl + C on the terminal. To truly stop it I have to kill the Python process from the Windows Task Manager.
I am still dealing with the training problem. Can anybody point me in the right direction?
Hey, have you checked your environment with the env checker?
Hey, thanks for the reply, @qgallouedec. Yes, I have checked the environment with both the Gymnasium and the SB3 check_env() functions (I know one of them is a superset of the other, but I wanted to be sure). Neither of them found any issues:
from gymnasium.utils.env_checker import check_env
from stable_baselines3.common.env_checker import check_env as sb3_check_env
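They were run roughly like this:

env = ContinuousWorldEnv()
check_env(env)  # Gymnasium API compliance check
sb3_check_env(env, warn=True)  # SB3-specific check (spaces, dtypes, etc.)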