Dockerized Container Architecture for Parallel Training of CARLA Gym Environments

License: MIT License

Dockerfile 10.70% Python 82.22% Shell 7.08%

Training Architecture for CARLA-based Reinforcement Learning Environments

Containerized deep reinforcement learning (DRL) training architecture for Gymnasium-based CARLA Simulator environments. It is designed primarily for the intersection carla gym repository.

Getting Started

DRL Algorithms Supported

System Requirements

The following are the requirements for running this repository using the provided Docker files:

Operating System: Linux (tested on Ubuntu 20.04/22.04)
NVIDIA GPU with CUDA support (tested on NVIDIA GeForce RTX 3060/3080/3090/4080/4090)

Setup

Clone the repository

```
git clone https://github.com/faizansana/intersection-driving.git
```

Run the dev_config.sh script to set the environment variables used by Docker.
```
bash dev_config.sh
```

From within the working directory, open the .env file to change settings such as the CARLA version or the CUDA version. The default configuration is:

| Variable | Description | Default Value |
| --- | --- | --- |
| FIXUID | UID of current user | (UID of your current user) |
| FIXGID | GID of current user | (GID of your current user) |
| CARLA_VERSION | Version of CARLA | 0.9.10.1 |
| CARLA_QUALITY | Quality setting for CARLA | Low |
| GPU_ID_CARLA_MAIN | GPU ID for CARLA main | 0 |
| GPU_ID_CARLA_DEBUG | GPU ID for CARLA debug | 0 |
| GPU_ID_MAIN_CONTAINER | GPU ID for main container | 0 |
| CARLA_SERVER_REPLICAS | Number of CARLA server replicas | 5 |
| CARLA_DEBUG_SERVER_REPLICAS | Number of CARLA debug server replicas | 0 |
| CUDA_VERSION | Version of CUDA | 12.0.0 |

Note: The GPU IDs are automatically set by checking the least used GPUs on the system.
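The README does not say how "least used" is measured. As a minimal sketch, assuming it means the GPU with the least memory currently in use, the selection could look like the following. The function names are hypothetical and are not taken from dev_config.sh; the `nvidia-smi` query flags are standard.

```python
import subprocess


def least_used_gpu(csv_text: str) -> int:
    """Return the index of the GPU with the least memory in use.

    Expects one "index, memory_used" line per GPU, as printed by
    nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits
    """
    usage = []
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage.append((int(used), int(index)))
    # min() on (used, index) tuples picks the lowest usage; ties go to the lowest index.
    return min(usage)[1]


def query_gpu_usage() -> str:
    """Ask nvidia-smi for per-GPU memory usage (requires an NVIDIA driver)."""
    return subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
```

On a machine with an NVIDIA driver, `least_used_gpu(query_gpu_usage())` would give a candidate value for variables such as GPU_ID_CARLA_MAIN.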

Pull the already built containers from Docker Hub if available.
```
docker compose pull
```
After the containers have been pulled, start them using the following command.
```
docker compose up -d
```
Attach VS Code to the main_container using the Remote Explorer extension.

Scripts Usage (from within main container)

The following are the scripts developed for use (found within src folder):

multi_retrain.py: Retrain multiple DRL models using a yaml file with their locations.

Example usage:
```
python multi_retrain.py -f file_with_model_paths.yaml -t number_of_timesteps_to_train
```
multi_testmodel.py: Test multiple models based on the performance metrics defined in test_model.py.

Example usage:
```
python multi_testmodel.py
```
Note: Modify the model_paths list in the script to select the models to test.

multi_train.py: Train multiple DRL algorithms in parallel in different CARLA instances.

Example usage:
```
python multi_train.py -t number_of_timesteps_to_train
```

test_model.py: Test a single DRL model.

Example usage:

```
python test_model.py -m path_to_model -v verbosity_level -c carla_host --episodes number_of_episodes -d display_or_not --config-file path_to_environment_config
```

train.py: Train a single DRL model or retrain a model.

Example usage:

```
python train.py -m name_of_model -v verbosity_level -c carla_host --episodes number_of_episodes -d display_or_not --config-file path_to_environment_config -p carla_port
```

intersection-driving's Issues

During training, if training is restarted, get best mean reward from old model

Currently, if training is restarted after a seg fault, the best mean reward is reset to 0. The saved best model is then overwritten, even if the new model's mean reward is lower than the previous best.

Look for a way to read the best mean reward of the model and then use that for best mean reward calculation.
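One possible approach, sketched here under assumptions, is to keep the best mean reward in a small JSON file next to the model so it survives a restart. The file name `best_mean_reward.json` and both function names are hypothetical. The sketch also starts from negative infinity rather than 0, so a model with a negative reward can still be recorded as the first best.

```python
import json
import math
from pathlib import Path


def load_best_mean_reward(model_dir, filename="best_mean_reward.json"):
    """Read the stored best mean reward, or -inf if none is available."""
    path = Path(model_dir) / filename
    try:
        return float(json.loads(path.read_text())["best_mean_reward"])
    except (FileNotFoundError, KeyError, TypeError, ValueError):
        # Missing, empty, or malformed file: behave as if nothing was recorded.
        return -math.inf


def save_if_better(model_dir, mean_reward, filename="best_mean_reward.json"):
    """Record mean_reward only if it beats the stored best; return True if it did."""
    if mean_reward <= load_best_mean_reward(model_dir, filename):
        return False
    payload = json.dumps({"best_mean_reward": mean_reward})
    (Path(model_dir) / filename).write_text(payload)
    return True
```

The training script would then overwrite the best model only when `save_if_better` returns True.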

Add parallel testing script

Similar to how we do parallel training, implement a script for parallelizing testing.
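A rough sketch of what such a script might do is given here: hand each model to whichever CARLA server is currently free, so that no server runs two tests at once. The function names and the `runner` callable are hypothetical; only the `-m`, `-c`, and `--episodes` flags come from the test_model.py usage shown earlier.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def build_test_command(model_path, carla_host, episodes):
    """Build the test_model.py command line for one model."""
    return ["python", "test_model.py", "-m", str(model_path),
            "-c", carla_host, "--episodes", str(episodes)]


def run_parallel_tests(model_paths, carla_hosts, episodes, runner):
    """Run one test per model, with at most one test per CARLA host at a time.

    runner is any callable that takes a command list and returns a result.
    Results are returned in the same order as model_paths.
    """
    free_hosts = queue.Queue()
    for host in carla_hosts:
        free_hosts.put(host)

    def worker(model_path):
        host = free_hosts.get()  # take a free server
        try:
            return runner(build_test_command(model_path, host, episodes))
        finally:
            free_hosts.put(host)  # hand the server back, even on failure

    # One worker thread per host, so get() never waits on a busy server for long.
    with ThreadPoolExecutor(max_workers=len(carla_hosts)) as pool:
        return list(pool.map(worker, model_paths))
```

In practice `runner` could be something like `lambda cmd: subprocess.run(cmd).returncode`, which is also an assumption rather than the repository's approach.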

Update docker-publish workflow to build for multiple cuda and carla versions

Currently, the workflow builds the image only with the default args (CARLA 0.9.13 and CUDA 11.8.0). Use a matrix strategy so that it also builds for at least CUDA 11.4.0 and CARLA 0.9.10.1.

Add recommended VS Code extensions

Specify recommended VS Code extensions so every time a new container is started, it is easy to install all the extensions.

Copy carla egg file based on specified version

Update the Dockerfile to copy the CARLA egg file from the CARLA container into a specific path in the main container.
Tag the image based on the CARLA version.

Update launch.json for args based on file

Currently, if a launch configuration targets a specific file, for example train.py, the debugger only runs that file. Check the VS Code documentation for a way to set up the profile so that the debugger works for any file, while the provided arguments are still applied when a specific file such as train.py is run.

CARLA ROS Bridge Container only works for versions <0.9.12

For some reason, it does not install the rviz dependency for CARLA versions >=0.9.12. So during catkin build, since carla-rviz depends on rviz, it fails.

It works when doing it within an interactive docker container. This could likely be due to incorrect environment variable setting.

Unify the `--model` argument in train.py

Currently, there are two arguments to pass in a model.

The --model is the name of the model while --model-path is for selecting a path to a model to be retrained. To unify, remove the model-path argument and test if the model argument is a path or a name of a model and handle it within script directly.
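A minimal sketch of the check, assuming saved models are `.zip` files (as in latest_model.zip) and algorithm names never end in `.zip`. The function name `resolve_model_argument` is hypothetical.

```python
from pathlib import Path


def resolve_model_argument(value: str):
    """Classify a --model value as a saved-model path or an algorithm name.

    Returns ("path", value) when the value ends in .zip or points at an
    existing file, and ("name", value) otherwise.
    """
    candidate = Path(value)
    if candidate.suffix == ".zip" or candidate.is_file():
        return "path", value
    return "name", value
```

train.py could then start retraining in the "path" case and create a fresh model in the "name" case, without a separate `--model-path` argument.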

Improve Training logging and naming of models

Currently the models are saved based on time.

Do the following:

Save the model based on an experiment name set in potentially either config file or in args
For logging, save the config parameters used for training of that model.
Each run started by multi_train.py should write to its own uniquely named log file.
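The points above could be combined roughly as follows: create a run directory named after the experiment, make it unique even when several runs start in the same second, and store the config next to the model. This is only a sketch; `create_run_dir` and `config.json` are hypothetical names, and the timestamp format mirrors the existing `2024-02-23_22-52-16` style directories.

```python
import json
from datetime import datetime
from pathlib import Path


def create_run_dir(base_dir, experiment_name, config):
    """Create a unique directory for one training run and save its config."""
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    suffix = 0
    while True:
        name = f"{experiment_name}_{timestamp}" + (f"_{suffix}" if suffix else "")
        run_dir = Path(base_dir) / name
        try:
            # exist_ok=False makes creation fail if another run took this name,
            # which avoids a race between parallel training processes.
            run_dir.mkdir(parents=True, exist_ok=False)
            break
        except FileExistsError:
            suffix += 1
    (run_dir / "config.json").write_text(json.dumps(config, indent=2, sort_keys=True))
    return run_dir
```

Models and log files written inside the returned directory then share the experiment name and cannot collide with another run.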

Fix Recurrent PPO model testing

In the case of Recurrent PPO, since we are using LSTMs, the states need to be stored and passed into the model args for the next prediction.
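A sketch of such a test loop is given here, assuming the model exposes the `predict(obs, state=..., episode_start=..., deterministic=...)` call documented for sb3-contrib's RecurrentPPO and the environment follows the Gymnasium `reset`/`step` API. The function name `run_recurrent_episode` is hypothetical.

```python
import numpy as np


def run_recurrent_episode(model, env, deterministic=True):
    """Run one episode, carrying the LSTM state from step to step."""
    obs, _info = env.reset()
    lstm_states = None  # None lets the policy start from its initial state
    episode_starts = np.ones((1,), dtype=bool)  # first step of the episode
    total_reward, done = 0.0, False
    while not done:
        action, lstm_states = model.predict(
            obs,
            state=lstm_states,
            episode_start=episode_starts,
            deterministic=deterministic,
        )
        obs, reward, terminated, truncated, _info = env.step(action)
        done = terminated or truncated
        total_reward += float(reward)
        # Tell the policy whether the next call begins a new episode.
        episode_starts = np.array([done])
    return total_reward
```

The key difference from a non-recurrent test loop is that the second return value of `predict` is fed back in on the next call instead of being discarded.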

Upgrade test_model.py for metric calculation

The following metrics need to be incorporated:

Success rate
Collision Rate
Average episode length
Average reward per episode
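The four metrics can be computed from per-episode records, as in the sketch below. `EpisodeResult` and `summarize` are hypothetical names, and how success and collision are detected in the environment is left out because the issue does not specify it.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class EpisodeResult:
    reward: float    # total reward collected in the episode
    length: int      # number of environment steps
    success: bool    # whether the episode ended successfully
    collision: bool  # whether the episode ended in a collision


def summarize(episodes: Iterable[EpisodeResult]) -> Dict[str, float]:
    """Aggregate per-episode results into the four requested metrics."""
    episodes = list(episodes)
    n = len(episodes)
    if n == 0:
        raise ValueError("at least one episode is required")
    return {
        "success_rate": sum(e.success for e in episodes) / n,
        "collision_rate": sum(e.collision for e in episodes) / n,
        "avg_episode_length": sum(e.length for e in episodes) / n,
        "avg_reward": sum(e.reward for e in episodes) / n,
    }
```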

Add the gym env with python 3.7 to docker environment

Currently only the gymnasium env with python 3.8 is the environment available within the container. If the issue with seg fault is not fixed then add this env to the container too.

Upgrade training script to continue training model

This would be helpful when a model needs further training, especially when the environment fails unexpectedly partway through a run.

Update GitHub Action workflows to cache conda env

Currently no caching is performed. Using the cache would allow subsequent runs to be faster.

Could run one workflow, cache the environment and then use it for the rest.

Update docker compose file to run carla containers on different gpus when using scaling

Currently, if scale is provided, it runs the carla containers on the same gpu. This defeats the purpose of the multi-gpu setup since this means that the entire load is on a single gpu, slowing down training.
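The desired placement can be stated as a simple round-robin mapping from replica number to GPU ID. The sketch below only illustrates that mapping; `assign_gpus` is a hypothetical helper, and how the result would be fed into the compose file is not covered.

```python
def assign_gpus(num_replicas, gpu_ids):
    """Spread CARLA replicas over the available GPUs in round-robin order.

    Replicas are numbered from 1, matching container names such as
    intersection-driving-carla_server-1.
    """
    if not gpu_ids:
        raise ValueError("at least one GPU ID is required")
    return {
        replica: gpu_ids[(replica - 1) % len(gpu_ids)]
        for replica in range(1, num_replicas + 1)
    }
```

With the default of 5 replicas and two GPUs, `assign_gpus(5, [0, 1])` places replicas 1, 3, and 5 on GPU 0 and replicas 2 and 4 on GPU 1.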

Automatically get CUDA Version from bare metal machine

Currently, when running dev_config.sh, it automatically sets the CUDA version to be 11.8.0.

Update this to automatically query nvidia-smi and get the CUDA version.
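As a sketch of the parsing step, the CUDA version can be read from the header that `nvidia-smi` prints. The function names are hypothetical. Note that `nvidia-smi` reports a two-part version such as 12.0, so mapping it to a three-part value like the 12.0.0 used in .env is a further step this sketch does not attempt.

```python
import re
import subprocess
from typing import Optional


def parse_cuda_version(smi_output: str) -> Optional[str]:
    """Extract the version after "CUDA Version:" from nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*([0-9]+(?:\.[0-9]+)*)", smi_output)
    return match.group(1) if match else None


def detect_cuda_version() -> Optional[str]:
    """Run nvidia-smi and return the reported CUDA version, if any."""
    try:
        output = subprocess.check_output(["nvidia-smi"], text=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None  # no driver installed, or nvidia-smi failed
    return parse_cuda_version(output)
```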

When retraining, error thrown "No data found in the saved file"

When training fails due to seg fault or similar, and retraining is started, sometimes the latest_model.zip does not contain any information.

```
Env running on server intersection-driving-carla_server-1
connecting to Carla server...
Carla server port 2000 connected!
Loading model from /home/docker/src/src/Training/Models/DDPG/2024-02-23_22-52-16/latest_model.zip
------ custom_carla_gym ------
------ 1,500,000 ------
No data found in the saved file
```

Possible solutions:

Use best_model.zip if that happens
Enhance saving of latest_model.zip such that it always saves a copy even if error occurs. Enhance exception handling.
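Both ideas can be sketched together: try latest_model.zip first and fall back to best_model.zip, and write checkpoints to a temporary file that replaces the real one only after the save has finished. `load_with_fallback`, `atomic_save`, `loader`, and `save_fn` are hypothetical names; `loader` and `save_fn` stand in for whatever load and save calls the training script uses, and `save_fn` is assumed to write to exactly the path it is given.

```python
import os
import tempfile
from pathlib import Path


def load_with_fallback(model_dir, loader,
                       candidates=("latest_model.zip", "best_model.zip")):
    """Return (model, path) from the first checkpoint that loads."""
    errors = []
    for name in candidates:
        path = Path(model_dir) / name
        try:
            return loader(path), path
        except Exception as exc:  # e.g. "No data found in the saved file"
            errors.append(f"{path}: {exc}")
    raise RuntimeError("No loadable checkpoint found:\n" + "\n".join(errors))


def atomic_save(save_fn, target):
    """Save to a temporary file, then swap it in place of target.

    If save_fn fails partway, the previous checkpoint stays untouched.
    """
    target = Path(target)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp_", suffix=".zip")
    os.close(fd)
    try:
        save_fn(tmp_name)
        os.replace(tmp_name, target)  # replaces target in a single step
    finally:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)  # clean up after a failed save
```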