Comments (11)

Emerald01 commented on September 28, 2024

If you want to spin up multiple GPUs, please refer to the example trainer script and the WarpDrive-managed multi-GPU training pipeline, which is a single function call here. The script itself supports multiple GPUs, so you can try it first:
https://github.com/salesforce/warp-drive/blob/master/warp_drive/training/example_training_script_numba.py#L213

Sorry about the misleading advice. In fact, PyTorch Lightning can only manage the multi-GPU part for the PyTorch model, not the environment-step part, and the latter is the key piece of WarpDrive. We have multi-GPU managers for both the environment steps and the PyTorch models. In particular, the models are synced up with PyTorch DDP, and a GPU context manager runs the step-function executables on multiple GPUs. The problem you see is simply because, by default, the CPU host always works with the default GPU; some extra setup is required to split the host into multiple processes, each controlling one GPU. From our benchmarks on V100 and A100 GPUs, the speed scales almost linearly with the number of GPUs, at least up to 4 GPUs.
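
For illustration only, here is the generic pattern being described, not WarpDrive's actual implementation: one worker process per GPU, each pinned to its own device so that both the environment steps and the model run there, with the model replicas synced through PyTorch DDP. All names below are placeholders.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Each process gets its own GPU; the rendezvous uses a local TCP port.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(16, 4).cuda(rank)  # stand-in for the policy model
    ddp_model = DDP(model, device_ids=[rank])
    # ... environment stepping and training on this GPU would go here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)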

By the way, since PyTorch Lightning updates its backend very frequently and sometimes breaks existing pipelines, I do not really suggest using Lightning for now.

Finebouche commented on September 28, 2024

Hi, thanks for the explanation.

I actually tried reusing that distributed function, but without success.
Firstly, the scripts

python warp_drive/training/example_training_script_pycuda.py --env tag_continuous

and

python warp_drive/training/example_training_script_numba.py --env tag_continuous

were giving me this error:

RuntimeError: CUDA out of memory. Tried to allocate 2.38 GiB (GPU 0; 10.92 GiB total capacity; 6.73 GiB already allocated; 1.33 GiB free; 8.34 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This is weird to me, because I would have expected it to work the same way as calling the environment wrapper and Trainer from the notebook.

I also encountered some other problems when using perform_distributed_training from the notebook. I will try to reproduce the error I had and see if I can fix it, but I am still puzzled by the error I get from the script.
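
(As an aside, the last lines of that OOM message point at PyTorch's caching-allocator option; a minimal, purely illustrative way to try it, assuming it is set before the first CUDA allocation:)

import os
# Must be set before the first CUDA allocation, e.g. at the very top of the
# training script; the value below is only an example.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
import torch  # imported afterwards so the allocator picks the setting up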

Emerald01 commented on September 28, 2024

The error you saw is simply an OOM. What you can do is reduce the number of agents set up in the yaml config: https://github.com/salesforce/warp-drive/blob/master/warp_drive/training/run_configs/tag_continuous.yaml
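
For example, one quick way to scale the agent counts down is to load the run config and shrink them before training. The key names below (num_taggers, num_runners) are assumptions about tag_continuous.yaml and should be checked against the actual file:

import yaml

with open("warp_drive/training/run_configs/tag_continuous.yaml") as f:
    run_config = yaml.safe_load(f)

print(run_config["env"])                 # inspect the current environment settings
run_config["env"]["num_runners"] //= 4   # assumed key: fewer agents -> less GPU memory
run_config["env"]["num_taggers"] //= 4   # assumed key
with open("tag_continuous_small.yaml", "w") as f:
    yaml.safe_dump(run_config, f)

You can then point the training setup at the reduced config, or simply edit the yaml in place.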

I just tested the tag_continuous demo script and it runs well on 2 A100 GPUs; you can see both GPUs carry an equal share.
(Screenshot, 2023-04-12: both GPUs showing an equal share of usage.)
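
If you want to check the GPU share programmatically as well, something like the following (run from a separate shell while training is in progress) reports per-GPU memory usage; roughly equal numbers on both devices mean both workers are active:

import torch

for i in range(torch.cuda.device_count()):
    free_b, total_b = torch.cuda.mem_get_info(i)
    used_gib = (total_b - free_b) / 1024**3
    print(f"GPU {i}: {used_gib:.1f} GiB in use of {total_b / 1024**3:.1f} GiB")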

Finebouche commented on September 28, 2024

Hi,
Sorry for the late reply, and thank you for your previous help!
I successfully ran the script on a single GPU using your advice to reduce the memory usage, but I still get a different error when trying to use two GPUs.
I get the following stack trace:

We have successfully found 2 GPUs!
Training with 2 GPU(s).
Starting worker process: 1 
Starting worker process: 0 
ERROR:root:Address already in use
Process NumbaDeviceContextProcessWrapper-1:
Traceback (most recent call last):
  File "/project/MARL_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 81, in run
    self._clear_torch_process_group()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 56, in _clear_torch_process_group
    process_group_torch.clear_torch_process_group()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/process_group_torch.py", line 20, in clear_torch_process_group
    dist.destroy_process_group()
  File "/project/MARL_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 775, in destroy_process_group
    assert pg is not None
AssertionError
Address already in use

PyCUDA gives an extra:

ERROR:root:Address already in use
Process PyCUDADeviceContextProcessWrapper-1:
Traceback (most recent call last):
  File "/project/MARL_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 81, in run
    self._clear_torch_process_group()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 56, in _clear_torch_process_group
    process_group_torch.clear_torch_process_group()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/process_group_torch.py", line 20, in clear_torch_process_group
    dist.destroy_process_group()
  File "/project/MARL_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 775, in destroy_process_group
    assert pg is not None
AssertionError
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------

No other process is being launched at the same time, so I am not sure where this comes from.

I am trying to investigate the Context.pop() advice by looking at child_process_base.py, but I am not sure that is the right idea.
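
For what it's worth, the PyCUDA message in isolation just means that a context was still on the stack at process exit; a minimal sketch of the push/pop discipline it is asking for (not WarpDrive's actual code):

import pycuda.driver as cuda

cuda.init()
ctx = cuda.Device(0).make_context()  # pushes a new context onto this thread's stack
try:
    pass  # kernel launches / memory transfers would go here
finally:
    ctx.pop()  # the stack must be empty again before the process exits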

Emerald01 commented on September 28, 2024

I am not quite sure about this error, as it says "ERROR:root:Address already in use". It sounds to me like the previous GPU context has not been cleared; you may check whether the previous task left any zombie processes still holding resources. Since I cannot reproduce your error on my end, I cannot give clearer advice here. What does your multi-GPU setup look like?
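
The traceback goes through process_group_torch.py, so presumably the two worker processes rendezvous via torch.distributed over a local TCP port, and "Address already in use" usually means that port is still held, e.g. by a zombie worker from an earlier run. A generic way to check (the port number 29500 is only torch.distributed's usual default and is an assumption here):

import os
import socket

def port_is_free(port, host="127.0.0.1"):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

print("port 29500 free:", port_is_free(29500))

# If a stale process holds the default port, torch.distributed's env:// init
# honours MASTER_PORT, so pointing it at a free port may help (illustrative):
os.environ["MASTER_PORT"] = "29512"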

Finebouche commented on September 28, 2024

Hi,
It took me some time. I am using two NVIDIA GeForce GTX 1080 Ti GPUs:
(Screenshot, 2023-04-19: the two GeForce GTX 1080 Ti GPUs.)

I am running my code inside a Docker image instantiated from https://gitlab.ilabt.imec.be/ilabt/gpu-docker-stacks/-/tree/master/pytorch-notebook, which is a fork of the public Jupyter Docker Stacks with GPU-processing capability, as explained here.

Emerald01 commented on September 28, 2024

I simply use Docker with the NVIDIA-maintained base image; this is the most robust image I know of so far. It handles every tricky dependency between drivers and CUDA libraries, and you may downgrade the base image based on your hardware.

FROM nvcr.io/nvidia/pytorch:21.10-py3

RUN pip install pycuda==2022.1
RUN conda install numba==0.54.0
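
Once the container is built, a quick sanity check (illustrative) is to confirm that both the PyTorch and the Numba/PyCUDA stacks can actually see the GPUs:

import torch
print("torch sees", torch.cuda.device_count(), "GPU(s); CUDA", torch.version.cuda)

from numba import cuda as numba_cuda
print("numba CUDA available:", numba_cuda.is_available())

import pycuda.driver as pycuda_drv
pycuda_drv.init()
print("pycuda sees", pycuda_drv.Device.count(), "GPU(s)")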

Finebouche commented on September 28, 2024

Trying to solve this by the end of the week

Finebouche commented on September 28, 2024

Hi, I haven't been able to solve my issue and am not sure what is going wrong. I did try your recommended Docker image but got the same error.

Emerald01 commented on September 28, 2024

Can you show your error messages and environment setup?

Finebouche commented on September 28, 2024

Hi,
Unfortunately, I haven't changed much in the configuration, and the errors are exactly the same as last time 👎
So I am not sure what to try from here.

I am using this Docker image https://gitlab.ilabt.imec.be/ilabt/gpu-docker-stacks/-/blob/master/pytorch-notebook/Dockerfile
and accessing my cluster instance through JupyterHub/JupyterLab. I have tried the code on GeForce GTX 1080 Ti GPUs.

I also tried using the image from the Dockerfile you provided, ran the test example, and got the same result.
