Comments (11)
If you want to use multiple GPUs, please refer to the example trainer script: WarpDrive manages the multi-GPU training pipeline for you with a single function call. The script itself supports multiple GPUs, so you can try it first.
https://github.com/salesforce/warp-drive/blob/master/warp_drive/training/example_training_script_numba.py#L213
Sorry for the confusion. PyTorch Lightning can only manage multiple GPUs for the PyTorch model part, not for the environment-step part, and the latter is the key to WarpDrive. We provide multi-GPU managers for both the environment steps and the PyTorch models: model weights are synchronized with PyTorch DDP, and a GPU context manager runs the step-function executables on multiple GPUs. The problem you see arises because, by default, the CPU host always works with the default GPU; some advanced setup is required to distribute the host across multiple processes, with each process controlling one GPU. In our benchmarks on V100 and A100 GPUs, throughput scales almost linearly with the number of GPUs, at least up to 4 GPUs.
By the way, since PyTorch Lightning updates its backend very frequently and sometimes breaks existing pipelines, I would not recommend using Lightning for now.
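The "distribute the host across multiple processes, each controlling one GPU" idea can be sketched roughly as below. This is a generic pattern, not WarpDrive's actual helper (whose managed pipeline is the single function call in the linked script); the function names here are mine.

```python
# Rough sketch: one host process per GPU. Each child pins itself to a
# single device before any CUDA context is created, illustrating why a
# single-process host would otherwise always talk to the default GPU.
import multiprocessing as mp
import os


def visible_device_for_rank(rank: int) -> str:
    """Device string a worker should expose via CUDA_VISIBLE_DEVICES."""
    return str(rank)


def worker(rank: int) -> None:
    # Pin this process to one GPU; all later CUDA calls target it.
    os.environ["CUDA_VISIBLE_DEVICES"] = visible_device_for_rank(rank)
    # ... build the WarpDrive env wrapper, trainer, and DDP model here ...


def launch(num_gpus: int) -> None:
    procs = [mp.Process(target=worker, args=(r,)) for r in range(num_gpus)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

WarpDrive's own device context manager plus DDP handles this for you; the sketch only illustrates the process-per-GPU layout.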
from warp-drive.
Hi, thanks for the explanation.
I actually tried reusing that distributed function, but without success.
First, the scripts
python warp_drive/training/example_training_script_pycuda.py --env tag_continuous
and
python warp_drive/training/example_training_script_numba.py --env tag_continuous
were giving me this error:
RuntimeError: CUDA out of memory. Tried to allocate 2.38 GiB (GPU 0; 10.92 GiB total capacity; 6.73 GiB already allocated; 1.33 GiB free; 8.34 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
This is strange to me, because I would have expected it to work the same way as calling the EnvWrapper and Trainer from the notebook.
When using perform_distributed_training from the notebook I ran into some other problems as well. I will try to reproduce the error I had and see if I can fix it, but I am still puzzled by the error I get from the script.
The error you saw is simply an out-of-memory (OOM) error. What you can do is reduce the number of agents set in the YAML config: https://github.com/salesforce/warp-drive/blob/master/warp_drive/training/run_configs/tag_continuous.yaml
I just tested the continuous demo script; it runs well on 2 A100 GPUs, and you can see that both GPUs get an equal share of the load.
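For example, the agent counts live in that config file; the key names below are assumptions about the tag_continuous setup, so check them against the actual file before editing:

```yaml
# Hypothetical excerpt of tag_continuous.yaml: smaller agent counts
# (and fewer parallel environment replicas, if the config exposes them)
# shrink the GPU memory footprint.
env:
    num_taggers: 2       # reduced from the default
    num_runners: 50      # reduced from the default
```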
Hi,
Sorry for the late reply, and thank you for your previous help!
I successfully ran the script on a single GPU using your advice to reduce memory usage, but I still get a different error when trying to use two GPUs.
I get the following stack trace:
We have successfully found 2 GPUs!
Training with 2 GPU(s).
Starting worker process: 1
Starting worker process: 0
ERROR:root:Address already in use
Process NumbaDeviceContextProcessWrapper-1:
Traceback (most recent call last):
File "/project/MARL_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 81, in run
self._clear_torch_process_group()
File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 56, in _clear_torch_process_group
process_group_torch.clear_torch_process_group()
File "/project/warp-drive/warp_drive/training/utils/device_child_process/process_group_torch.py", line 20, in clear_torch_process_group
dist.destroy_process_group()
File "/project/MARL_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 775, in destroy_process_group
assert pg is not None
AssertionError
Address already in use
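For what it's worth, the AssertionError comes from dist.destroy_process_group() being called when no default process group exists (pg is None). A defensive guard like the following (the helper name is mine, not a WarpDrive API) avoids that particular crash, though not the underlying "Address already in use" failure:

```python
def safe_clear_process_group() -> bool:
    """Destroy torch's default process group only if one exists.

    Returns True if a group was torn down. Safe to call when
    torch.distributed was never initialized (or torch is absent).
    """
    try:
        import torch.distributed as dist
    except ImportError:
        return False
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
        return True
    return False
```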
PyCUDA gives an additional error:
ERROR:root:Address already in use
Process PyCUDADeviceContextProcessWrapper-1:
Traceback (most recent call last):
File "/project/MARL_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 81, in run
self._clear_torch_process_group()
File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 56, in _clear_torch_process_group
process_group_torch.clear_torch_process_group()
File "/project/warp-drive/warp_drive/training/utils/device_child_process/process_group_torch.py", line 20, in clear_torch_process_group
dist.destroy_process_group()
File "/project/MARL_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 775, in destroy_process_group
assert pg is not None
AssertionError
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------
No other process is being launched at the same time, so I am not sure what causes this.
I am trying to investigate the Context.pop() advice by looking at child_process_base.py,
but I am not sure that is the right idea.
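Following the Context.pop() hint, the usual pattern is to pop the context in a finally block so the context stack is empty at module cleanup. A hedged sketch (run_with_context is a hypothetical helper, not part of WarpDrive, and it falls back to a plain call when pycuda is not importable):

```python
def run_with_context(fn):
    """Run fn with a fresh CUDA context, always popping it afterwards."""
    try:
        import pycuda.driver as cuda
    except ImportError:
        return fn()  # no pycuda available; nothing to manage
    cuda.init()
    ctx = cuda.Device(0).make_context()  # pushes a context onto the stack
    try:
        return fn()
    finally:
        ctx.pop()  # leave the context stack empty, per the error message
```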
I am not quite sure about this error, as it says "ERROR:root:Address already in use". It sounds like the previous GPU context has not been cleared; you may check whether a previous task left any zombie processes occupying the GPUs. Since I cannot reproduce the error on my end, I cannot give more specific advice. What does your multi-GPU setup look like?
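One common source of "Address already in use" in torch.distributed setups is the TCP rendezvous port still being held by a previous (possibly zombie) run. A hedged sketch of picking a fresh port before launching workers (MASTER_ADDR/MASTER_PORT are PyTorch's standard rendezvous convention; whether WarpDrive reads them depends on how it sets up its process group):

```python
import socket


def find_free_port() -> int:
    """Ask the OS for an unused TCP port, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


# Before spawning the worker processes:
# os.environ["MASTER_ADDR"] = "127.0.0.1"
# os.environ["MASTER_PORT"] = str(find_free_port())
```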
Hi,
It took me some time. I am using two NVIDIA GeForce GTX 1080 Ti GPUs.
I am running my code inside a Docker image instantiated from https://gitlab.ilabt.imec.be/ilabt/gpu-docker-stacks/-/tree/master/pytorch-notebook, which is a fork of the public Jupyter Docker Stacks with GPU-processing capability, as explained here.
I simply use Docker with the NVIDIA-maintained base image; it is the most robust image I know of so far. It handles every tricky dependency between the drivers and the CUDA libraries, and you can downgrade the base image to match your hardware.
FROM nvcr.io/nvidia/pytorch:21.10-py3
RUN pip install pycuda==2022.1
RUN conda install numba==0.54.0
Trying to solve this by the end of the week
Hi, I haven't been able to solve my issue and am not sure what is going wrong. I did try your recommended Docker image but got the same error.
Can you show your error messages and environment setup?
Hi,
Unfortunately, I haven't changed much in the configuration, and the errors are exactly the same as last time.
So I am not sure what to try from here.
I am using this docker image https://gitlab.ilabt.imec.be/ilabt/gpu-docker-stacks/-/blob/master/pytorch-notebook/Dockerfile
and accessing my cluster instance through JupyterHub/JupyterLab. I have tried the code on GeForce GTX 1080 Ti GPUs.
I also tried the image from the Dockerfile you provided, ran the test example, and got the same result.