Comments (11)

Emerald01 commented on September 28, 2024

If you want to spin up multiple GPUs, please refer to the example trainer script and the WarpDrive-managed multi-GPU training pipeline, which is a single function call here. The script itself supports multiple GPUs, so you can try it first:
https://github.com/salesforce/warp-drive/blob/master/warp_drive/training/example_training_script_numba.py#L213

Sorry about the misleading advice. In fact, PyTorch Lightning can only manage the multi-GPU part for the PyTorch model, not the environment-step part, and the latter is the key piece of WarpDrive. We have multi-GPU managers for both the environment steps and the PyTorch models. In particular, the models are synced up with PyTorch DDP, and a GPU context manager runs the step-function executables on multiple GPUs. The problem you see is simply because, by default, the CPU host always works with the default GPU; some extra setup is required to split the host into multiple processes, each controlling one GPU. From our benchmarks on V100 and A100 GPUs, the speed scales almost linearly with the number of GPUs, at least up to 4 GPUs.
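
For illustration only, here is the generic pattern being described, not WarpDrive's actual implementation: one worker process per GPU, each pinned to its own device so that both the environment steps and the model run there, with the model replicas synced through PyTorch DDP. All names below are placeholders.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Each process gets its own GPU; the rendezvous uses a local TCP port.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(16, 4).cuda(rank)  # stand-in for the policy model
    ddp_model = DDP(model, device_ids=[rank])
    # ... environment stepping and training on this GPU would go here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)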

By the way, since PyTorch Lightning updates its backend very frequently and sometimes breaks existing pipelines, I do not really suggest using Lightning for now.

Finebouche commented on September 28, 2024

Hi, thanks for the explanation.

I actually tried reusing that distributed function, but without success.
Firstly, the scripts

python warp_drive/training/example_training_script_pycuda.py --env tag_continuous

and

python warp_drive/training/example_training_script_numba.py --env tag_continuous

were giving me this error:

RuntimeError: CUDA out of memory. Tried to allocate 2.38 GiB (GPU 0; 10.92 GiB total capacity; 6.73 GiB already allocated; 1.33 GiB free; 8.34 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This is weird to me, because I would have expected it to work the same way as calling the environment wrapper and Trainer from the notebook.

I also encountered some other problems when using perform_distributed_training from the notebook. I will try to reproduce the error I had and see if I can fix it, but I am still puzzled by the error I get from the script.
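
(As an aside, the last lines of that OOM message point at PyTorch's caching-allocator option; a minimal, purely illustrative way to try it, assuming it is set before the first CUDA allocation:)

import os
# Must be set before the first CUDA allocation, e.g. at the very top of the
# training script; the value below is only an example.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
import torch  # imported afterwards so the allocator picks the setting up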

Emerald01 commented on September 28, 2024

The error you saw is simply an OOM. What you can do is reduce the number of agents set up in the yaml config: https://github.com/salesforce/warp-drive/blob/master/warp_drive/training/run_configs/tag_continuous.yaml
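
For example, one quick way to scale the agent counts down is to load the run config and shrink them before training. The key names below (num_taggers, num_runners) are assumptions about tag_continuous.yaml and should be checked against the actual file:

import yaml

with open("warp_drive/training/run_configs/tag_continuous.yaml") as f:
    run_config = yaml.safe_load(f)

print(run_config["env"])                 # inspect the current environment settings
run_config["env"]["num_runners"] //= 4   # assumed key: fewer agents -> less GPU memory
run_config["env"]["num_taggers"] //= 4   # assumed key
with open("tag_continuous_small.yaml", "w") as f:
    yaml.safe_dump(run_config, f)

You can then point the training setup at the reduced config, or simply edit the yaml in place.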

I just tested the tag_continuous demo script and it runs well on 2 A100 GPUs; you can see both GPUs carry an equal share.
(Screenshot, 2023-04-12: both GPUs showing an equal share of usage.)
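
If you want to check the GPU share programmatically as well, something like the following (run from a separate shell while training is in progress) reports per-GPU memory usage; roughly equal numbers on both devices mean both workers are active:

import torch

for i in range(torch.cuda.device_count()):
    free_b, total_b = torch.cuda.mem_get_info(i)
    used_gib = (total_b - free_b) / 1024**3
    print(f"GPU {i}: {used_gib:.1f} GiB in use of {total_b / 1024**3:.1f} GiB")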

Finebouche commented on September 28, 2024

Hi,
Sorry for the late reply, and thank you for your previous help!
I successfully ran the script on a single GPU using your advice to reduce the memory usage, but I still get a different error when trying to use two GPUs.
I get the following stack trace:

We have successfully found 2 GPUs!
Training with 2 GPU(s).
Starting worker process: 1 
Starting worker process: 0 
ERROR:root:Address already in use
Process NumbaDeviceContextProcessWrapper-1:
Traceback (most recent call last):
  File "/project/MARL_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 81, in run
    self._clear_torch_process_group()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 56, in _clear_torch_process_group
    process_group_torch.clear_torch_process_group()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/process_group_torch.py", line 20, in clear_torch_process_group
    dist.destroy_process_group()
  File "/project/MARL_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 775, in destroy_process_group
    assert pg is not None
AssertionError
Address already in use

PyCUDA gives an extra:

ERROR:root:Address already in use
Process PyCUDADeviceContextProcessWrapper-1:
Traceback (most recent call last):
  File "/project/MARL_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 81, in run
    self._clear_torch_process_group()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 56, in _clear_torch_process_group
    process_group_torch.clear_torch_process_group()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/process_group_torch.py", line 20, in clear_torch_process_group
    dist.destroy_process_group()
  File "/project/MARL_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 775, in destroy_process_group
    assert pg is not None
AssertionError
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------

No other process is being launched at the same time, so I am not sure where this comes from.

I am trying to investigate the Context.pop() advice by looking at child_process_base.py, but I am not sure that is the right idea.
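
For what it's worth, the PyCUDA message in isolation just means that a context was still on the stack at process exit; a minimal sketch of the push/pop discipline it is asking for (not WarpDrive's actual code):

import pycuda.driver as cuda

cuda.init()
ctx = cuda.Device(0).make_context()  # pushes a new context onto this thread's stack
try:
    pass  # kernel launches / memory transfers would go here
finally:
    ctx.pop()  # the stack must be empty again before the process exits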

Emerald01 commented on September 28, 2024

I am not quite sure about this error, as it says "ERROR:root:Address already in use". It sounds to me like the previous GPU context has not been cleared; you may check whether the previous task left any zombie processes still holding resources. Since I cannot reproduce your error on my end, I cannot give clearer advice here. What does your multi-GPU setup look like?
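
The traceback goes through process_group_torch.py, so presumably the two worker processes rendezvous via torch.distributed over a local TCP port, and "Address already in use" usually means that port is still held, e.g. by a zombie worker from an earlier run. A generic way to check (the port number 29500 is only torch.distributed's usual default and is an assumption here):

import os
import socket

def port_is_free(port, host="127.0.0.1"):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

print("port 29500 free:", port_is_free(29500))

# If a stale process holds the default port, torch.distributed's env:// init
# honours MASTER_PORT, so pointing it at a free port may help (illustrative):
os.environ["MASTER_PORT"] = "29512"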

Finebouche commented on September 28, 2024

Hi,
It took me some time. I am using two NVIDIA GeForce GTX 1080 Ti GPUs:
(Screenshot, 2023-04-19: the two GeForce GTX 1080 Ti GPUs.)

I am running my code inside a Docker image instantiated from https://gitlab.ilabt.imec.be/ilabt/gpu-docker-stacks/-/tree/master/pytorch-notebook, which is a fork of the public Jupyter Docker Stacks with GPU-processing capability, as explained here.

Emerald01 commented on September 28, 2024

I simply use Docker with the NVIDIA-maintained base image; this is the most robust image I know of so far. It handles every tricky dependency between drivers and CUDA libraries, and you may downgrade the base image based on your hardware.

FROM nvcr.io/nvidia/pytorch:21.10-py3

RUN pip install pycuda==2022.1
RUN conda install numba==0.54.0
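
Once the container is built, a quick sanity check (illustrative) is to confirm that both the PyTorch and the Numba/PyCUDA stacks can actually see the GPUs:

import torch
print("torch sees", torch.cuda.device_count(), "GPU(s); CUDA", torch.version.cuda)

from numba import cuda as numba_cuda
print("numba CUDA available:", numba_cuda.is_available())

import pycuda.driver as pycuda_drv
pycuda_drv.init()
print("pycuda sees", pycuda_drv.Device.count(), "GPU(s)")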

Finebouche commented on September 28, 2024

Trying to solve this by the end of the week

Finebouche commented on September 28, 2024

Hi, I haven't been able to solve my issue and am not sure what is going wrong. I did try your recommended Docker image but got the same error.

Emerald01 commented on September 28, 2024

Can you show your error messages and environment setup?

Finebouche commented on September 28, 2024

Hi,
Unfortunately, I haven't changed much in the configuration, and the errors are exactly the same as last time 👎
So I am not sure what to try from here.

I am using this Docker image https://gitlab.ilabt.imec.be/ilabt/gpu-docker-stacks/-/blob/master/pytorch-notebook/Dockerfile
and accessing my cluster instance through JupyterHub/JupyterLab. I have tried the code on GeForce GTX 1080 Ti GPUs.

I also tried using the image from the Dockerfile you provided, ran the test example, and got the same result.
