Comments (7)
I finally solved the mystery. The training that seemingly stops at 25000 steps is actually pausing to run an evaluation. In fact, it's very easy to change this number with the "--eval-freq" argument.
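For example, something like this (the frequency value here is just illustrative):

python train.py --algo ppo --env ContinuousWorld-v0 --eval-freq 50000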
Both of the problems that I mentioned here as bugs were of my own creation: I didn't read the train.py source code and didn't properly check my own code before posting an issue.
So I wanted to publicly apologize to @qgallouedec and especially @araffin for wasting their time on this.
I'll try to be more careful when asking for help in the future.
By the way, I noticed this thanks to the incredible speed of the sbx implementation: it let me wait out my impatience for the few seconds the evaluation needed to finish, until the counter went past 25000. So yeah, thanks for making that awesome library too.
Sorry, I didn't provide any code initially because it's way too long (assuming this is the checklist point I was missing). Here is my custom environment:
from enum import Enum
import sys

import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3.common.vec_env import DummyVecEnv


class Actions(Enum):
    RIGHT = 0
    UP = 1
    LEFT = 2
    DOWN = 3


class ContinuousWorldEnv(gym.Env):
    metadata = {"render_modes": [None]}

    def __init__(self, size=50, render_mode=None):
        self.least_distance = np.inf
        self.last_action = 0
        self.size = size  # The size of the square grid
        self.window_size = 512  # The size of the PyGame window

        # Observations are dictionaries with the agent's and the target's
        # location, normalized to [0, 1].
        self.observation_space = spaces.Dict(
            {
                "agent": spaces.Box(low=0, high=1, shape=(2,), dtype=np.float64),
                "target": spaces.Box(low=0, high=1, shape=(2,), dtype=np.float64),
            }
        )

        self._agent_location = np.array([-1, -1], dtype=int)
        self._target_location = np.array([-1, -1], dtype=int)

        # We have 4 actions, corresponding to "right", "up", "left", "down"
        self.action_space = spaces.Discrete(4)

        """
        The following dictionary maps abstract actions from `self.action_space` to
        the direction we will walk in if that action is taken.
        i.e. 0 corresponds to "right", 1 to "up" etc.
        """
        self._action_to_direction = {
            Actions.RIGHT.value: np.array([1, 0], dtype=int),
            Actions.UP.value: np.array([0, 1], dtype=int),
            Actions.LEFT.value: np.array([-1, 0], dtype=int),
            Actions.DOWN.value: np.array([0, -1], dtype=int),
        }

    def _normalize_observation(self, observation, size):
        # Scale grid coordinates in [0, size - 1] down to [0, 1]
        normalized_observation = {}
        for key, value in observation.items():
            normalized_observation[key] = value / (size - 1)
        return normalized_observation

    def _get_distance(self):
        # Manhattan (L1) distance between agent and target
        return np.linalg.norm(self._agent_location - self._target_location, ord=1)

    def _get_obs(self):
        return {"agent": self._agent_location, "target": self._target_location}

    def _get_info(self):
        return {"distance": self._get_distance()}

    def reset(self, seed=None, options=None):
        # We need the following line to seed self.np_random
        super().reset(seed=seed)

        # Choose the agent's location uniformly at random
        self._agent_location = self.np_random.integers(0, self.size, size=2, dtype=int)

        # Sample the target's location randomly until it does not coincide
        # with the agent's location
        self._target_location = self._agent_location
        while np.array_equal(self._target_location, self._agent_location):
            self._target_location = self.np_random.integers(0, self.size, size=2, dtype=int)

        self.last_action = 0
        self.least_distance = self._get_distance()

        # observation = self._get_obs()
        observation = self._normalize_observation(self._get_obs(), self.size)
        info = self._get_info()
        return observation, info

    def _get_reward(self, action):
        distance = self._get_distance()
        reward = 0.0
        # If the distance to the target is less than 1, the agent has reached the target
        if distance < 1:
            print("Agent has reached the target")
            reward += 1.0
        # If the distance is below the least distance seen so far, the agent
        # is getting closer to the target
        elif distance < self.least_distance:
            self.least_distance = distance
            # print("Agent is getting closer to the target")
            reward += 0.01
        # Otherwise the agent is not getting any closer to the target
        else:
            reward -= 0.01
        # Small bonus for varying the action, small penalty for repeating it
        if action != self.last_action:
            reward += 0.01
        else:
            reward -= 0.01
        return reward

    def step(self, action):
        # Map the action (element of {0,1,2,3}) to the direction we walk in
        direction = self._action_to_direction[action]
        # We use `np.clip` to make sure we don't leave the grid
        self._agent_location = np.clip(self._agent_location + direction, 0, self.size - 1)
        # An episode is done if the agent has reached the target
        terminated = np.array_equal(self._agent_location, self._target_location)
        reward = self._get_reward(action)
        self.last_action = action
        # observation = self._get_obs()
        observation = self._normalize_observation(self._get_obs(), self.size)
        info = self._get_info()
        return observation, reward, terminated, False, info

    def close(self):
        sys.exit("\nExiting...\n")
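For completeness: the environment is registered with Gymnasium under the ContinuousWorld-v0 id that appears in the logs below. A minimal sketch (the entry_point module path is illustrative and should match wherever ContinuousWorldEnv actually lives):

from gymnasium.envs.registration import register

# Make the env available as "ContinuousWorld-v0"; the module path below is
# a placeholder for the file that defines ContinuousWorldEnv.
register(
    id="ContinuousWorld-v0",
    entry_point="continuous_world_env:ContinuousWorldEnv",
)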
    def close(self):
        sys.exit("\nExiting...\n")

This is your issue. I'm not sure why you use sys.exit(), as it kills the Python script.
Wow, that's a little embarrassing... Thanks for pointing it out, though.
After changing that line to:
print("Closing the environment")
And adding a very simple truncation mechanism in the step function:
truncated = False
if self.timestep_counter >= 30000:
    truncated = True
    print("Episode truncated")
else:
    self.timestep_counter += 1
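For this to work, the counter also has to be (re)initialized, and step() now returns truncated instead of the hard-coded False. A minimal sketch of the reset() side, assuming the counter lives on the environment:

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # Restart the per-episode step counter used by the truncation check
        self.timestep_counter = 0
        # ... rest of reset() unchanged ...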
The Optuna optimization starts normally, and it does seem to work alright. For good measure, here is a partial log of the optimization process:
========== ContinuousWorld-v0 ==========
Seed: 1257939581
Loading hyperparameters from: C:\Users\silve\AppData\Local\Programs\Python\Python311\Lib\site-packages\rl_zoo3\hyperparams\ppo.yml
Default hyperparameters for environment (ones being tuned will be overridden):
OrderedDict([('batch_size', 256),
             ('clip_range', 'lin_0.2'),
             ('ent_coef', 0.0),
             ('gae_lambda', 0.8),
             ('gamma', 0.98),
             ('learning_rate', 'lin_0.001'),
             ('n_envs', 8),
             ('n_epochs', 20),
             ('n_steps', 32),
             ('n_timesteps', 100000.0),
             ('policy', 'MultiInputPolicy')])
Using 8 environments
Overwriting n_timesteps with n=10000
Doing 1 intermediate evaluations for pruning based on the number of timesteps. (1 evaluation every 100k timesteps)
Closing the environment
Optimizing hyperparameters
C:\Users\silve\AppData\Local\Programs\Python\Python311\Lib\site-packages\optuna\samplers\_tpe\sampler.py:319: ExperimentalWarning: ``multivariate`` option is an experimental feature. The interface can change in the future.
warnings.warn(
Sampler: tpe - Pruner: median
[I 2024-04-04 11:50:17,023] A new study created in memory with name: no-name-bcadd434-e207-47bc-9e99-3d0a25e03ea7
Agent has reached the target
Agent has reached the target
Episode truncated
Episode truncated
Episode truncated
Episode truncated
Episode truncated
Closing the environment
Closing the environment
Closing the environment
Closing the environment
Closing the environment
Closing the environment
Closing the environment
Closing the environment
Closing the environment
[I 2024-04-04 11:54:18,366] Trial 1 finished with value: -239.58799999999997 and parameters: {'batch_size': 16, 'n_steps': 64, 'gamma': 0.9999, 'learning_rate': 0.0003767443448677662, 'ent_coef': 0.0006887248484492611, 'clip_range': 0.4, 'n_epochs': 5, 'gae_lambda': 0.95, 'max_grad_norm': 0.6, 'vf_coef': 0.22057943399371072, 'net_arch': 'medium', 'activation_fn': 'relu'}. Best is trial 1 with value: -239.58799999999997.
Agent has reached the target
Episode truncated
... etc ...
As expected, the results are quite bad, so I'll be tweaking the timesteps and a few other things.
On the first topic, though (the training hanging at 25000 timesteps): it is still happening. What do you think might be the problem? Take into account that this does not happen when training with custom code like this:
from stable_baselines3 import PPO

# Function that creates a new environment (world_size and seed_value are defined elsewhere)
def make_env():
    def _init():
        return ContinuousWorldEnv(world_size)
    return _init

num_envs = 8

# List of environment creation functions
envs = [make_env() for _ in range(num_envs)]
# Vectorized environment
env = DummyVecEnv(envs)

# Train the agent
model = PPO(
    "MultiInputPolicy", env, verbose=1, ent_coef=0.02, clip_range=0.4,
    learning_rate=0.0002, n_steps=1028, seed=seed_value,
)
model.learn(total_timesteps=100000, progress_bar=True)
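As an aside, SB3's make_vec_env helper can build the same 8-env DummyVecEnv in one call; a sketch, reusing the world_size from above:

from stable_baselines3.common.env_util import make_vec_env

# Builds a DummyVecEnv with 8 copies of the env (each wrapped in a Monitor)
env = make_vec_env(lambda: ContinuousWorldEnv(world_size), n_envs=8)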
And one last thing: I don't know if this is intended behavior, but when I want to stop the Optuna optimization process, it does not stop with the typical Ctrl + C on the terminal. To truly stop it I have to kill the Python process from the Windows Task Manager.
I am still dealing with the training problem. Can anybody point me in the right direction?
Hey, have you checked your environment with the env checker?
Hey, thanks for the reply, @qgallouedec. Yes, I have checked the environment with both the Gymnasium and the SB3 check_env() functions (I know one of them is a superset of the other, but I wanted to be sure). Neither of them found any issues:
from gymnasium.utils.env_checker import check_env
from stable_baselines3.common.env_checker import check_env as sb3_check_env
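They were run roughly like this:

env = ContinuousWorldEnv()
check_env(env)  # Gymnasium API compliance check
sb3_check_env(env, warn=True)  # SB3-specific check (spaces, dtypes, etc.)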