Comments (17)
Thank you both for helping!
Yes, @omlins, I was able to get a segfault just using MPI and CUDA with a broadcast/receive pattern. The OOM error I am getting above sometimes turns into a segfault, which makes me believe there is undefined behavior somewhere, as you say, with the allocation of memory on the GPUs.
We have factorial combinations of CUDA drivers, compilers, and MPI implementations, plus various little things that are making it difficult to build a reproducible example. Also, the system is new, and in my experience these things get ironed out in the first few weeks (as the admins correct them). When I manage to build one, I will file upstream or discuss on Discourse as you advise.
Feel free to close here for now; we can re-open if I determine that the issue pertains to ParallelStencil or ImplicitGlobalGrid, which are great, regardless :)
from parallelstencil.jl.
I simplified further and am getting segfaults from the CUDA and MPI interaction alone. I will investigate further, and come back here if the error is not coming from upstream.
Hi @raminammour, do I understand right that you could reproduce the issue without ImplicitGlobalGrid and ParallelStencil? Then you would best post in the GPU section of the Julia Discourse in order to get help. I would guess that something is wrong with the GPU memory allocation to your processes, independent of what code you run...
OK, I will close the issue then. Don't hesitate to open a topic on the Julia Discourse - it could well be that somebody will immediately know what's going on...
PS: thanks for your nice words about ParallelStencil and ImplicitGlobalGrid!
Here is a smaller reproducer (which again, works with 1 process) if it helps:
$JULIA_DEPOT_PATH/bin/mpiexecjl -n 2 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);select_device();@show sum(@zeros(siz,siz,siz))"
Global grid: 60x31x31 (nprocs: 2, dims: 2x1x1)
ERROR: Out of GPU memory trying to allocate 232.742 KiB
Effective GPU memory usage: 1.03% (416.000 MiB/39.586 GiB)
CUDA allocator usage: 0 bytes
Memory pool usage: 0 bytes (0 bytes allocated, 0 bytes cached)
Hi @raminammour,
Thanks for reporting. I just tried your MVE and it works fine for me, both with -n 1 and -n 2. Before we dive further into debugging, are you sure that when running on more than one MPI process, each MPI process has access to its own dedicated GPU (and thus that the MPI processes are not trying to initialize on the same GPU)? This could be a reason for the error you get.
Also, what type of multi-GPU system are you running on, and how many GPUs per node?
Here is my output:
[lraess@node]$ $mpirun_ -n 1 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);select_device();@show sum(@zeros(siz,siz,siz))"
Global grid: 31x31x31 (nprocs: 1, dims: 1x1x1)
sum(#= none:1 =# @zeros(siz, siz, siz)) = 0.0
[lraess@node]$ $mpirun_ -n 2 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);select_device();@show sum(@zeros(siz,siz,siz))"
Global grid: 60x31x31 (nprocs: 2, dims: 2x1x1)
sum(#= none:1 =# @zeros(siz, siz, siz)) = sum(#= none:1 =# @zeros(siz, siz, siz)) = 0.0
0.0
Hey, thanks for the reply.
These are nodes with 4 GPUs. I reserve them with SLURM, and just tried to be more explicit: srun -n 4 --gres=gpu:4 --cpus-per-gpu=1 --pty bash -l. Is there a way to make sure that each MPI process has a dedicated GPU? (It seems that the SLURM installation we have does not recognize the ntasks-per-gpu or bind-gpu options.)
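One common workaround on systems where SLURM's GPU-binding options are unavailable is a small wrapper script that restricts each rank to a single GPU via CUDA_VISIBLE_DEVICES before the application starts. This is a sketch, not from the thread: the script name gpu_bind.sh is hypothetical, and it assumes the launcher exports a node-local rank variable (OpenMPI's OMPI_COMM_WORLD_LOCAL_RANK or SLURM's SLURM_LOCALID).

```shell
#!/usr/bin/env bash
# gpu_bind.sh (hypothetical wrapper): pin each MPI rank to one GPU by
# exposing only that rank's node-local device via CUDA_VISIBLE_DEVICES.
# Assumes OpenMPI's OMPI_COMM_WORLD_LOCAL_RANK or SLURM's SLURM_LOCALID.
NGPUS_PER_NODE=${NGPUS_PER_NODE:-4}
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}
export CUDA_VISIBLE_DEVICES=$(( LOCAL_RANK % NGPUS_PER_NODE ))
exec "$@"   # run the actual command, e.g. julia --project=@. script.jl
```

Launched as, e.g., mpiexec -n 4 ./gpu_bind.sh julia --project=@. script.jl, each rank then sees exactly one GPU (appearing as device 0 from its own point of view), so two ranks cannot accidentally initialize on the same device.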
You're welcome. Maybe you could try running with -n 2 on 2 different nodes with 1 GPU per node, to see if this fixes the error and confirms that there is an issue with multiple MPI processes overriding each other's GPU on multi-GPU nodes?
Another thing to try could be to add --ntasks-per-node=4 instead of --cpus-per-gpu=1 (but I don't have much experience with SLURM).
Is there a way to make sure that each mpi process has a dedicated gpu?
On the application side, it should be handled by select_device() from here, which gets the node-local MPI rank and assigns it as the GPU ID. You could check it by printing the output, as such (@show me=select_device()):
$JULIA_DEPOT_PATH/bin/mpiexecjl -n 2 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);@show me=select_device();@show sum(@zeros(siz,siz,siz))"
Seems to be correctly selecting different devices
--------------------------------------------------------------------------
Global grid: 60x31x31 (nprocs: 2, dims: 2x1x1)
me = select_device() = 1
me = select_device() = 0
I simplified further and am getting segfaults from the CUDA and MPI interaction. I will investigate further, and come back here if the error is not coming from upstream.
For now, feel free to close the issue if you see fit, or keep it open and I will notify you of any resolution.
Cheers!
Indeed, it seems to be OK regarding device selection. Could you try running on different nodes?
I simplified further and getting segfaults with CUDA and MPI interaction
That's not nice. Do you use CUDA-aware MPI?
You certainly did already, but these two resources may give further hints on how to set up the environment to run multi-GPU:
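For reference, one way to check whether an OpenMPI build is CUDA-aware is to query its MCA parameters with ompi_info, and to ask the Julia MPI wrapper the same question. This is a sketch; MPI.has_cuda() is the check mentioned later in this thread, and the exact parsable output format depends on the OpenMPI version.

```shell
# Check the system OpenMPI build for CUDA support; the matching line ends
# in ":value:true" when the library was built with CUDA awareness.
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

# Ask MPI.jl the same question from the Julia side.
julia --project=@. -e 'using MPI; println(MPI.has_cuda())'
```

Note that a CUDA-aware library build does not guarantee that the transport in use (e.g. UCX vs. openib) actually supports GPU buffers on a given system.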
Yeah, MPI reports that it has_cuda(), but it still segfaults :(
Could you maybe try export IGG_CUDAAWARE_MPI=0 to disable CUDA-awareness within ImplicitGlobalGrid.jl, and see if this fixes the segfault?
Is the machine you are running on using MPICH as its MPI?
This machine is using OpenMPI.
export IGG_CUDAAWARE_MPI=0 does not fix the OOM error above.
Does the MPI install use UCX instead of openib? I had issues getting MPI to work properly when OpenMPI was configured with UCX.
I am on a managed system, so cannot play with any installations unfortunately.
If you have multiple nodes, it would be interesting to see if the error occurs when running with one GPU per node on 2 nodes.
Also, do you get any errors (segfaults) running on 2 MPI processes with the Threads backend?
It works with one GPU per node with two MPI processes, but not with 2 GPUs on one node!
Interesting.
- What type of node and GPUs are you running on?
- Which versions of MPI and CUDA are installed (both the Julia and the system ones)?
- Could you check with your sysadmin whether the SLURM launch parameters are the correct ones for the type of multi-GPU-per-node job you are trying to run?
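To collect the version information asked for above, something like the following could be run on a compute node. This is a sketch: the --version flags are the standard ones for these tools, and the Julia queries in the comments assume CUDA.jl and MPI.jl are in the active project.

```shell
# Print one version line per tool; fall back gracefully if a tool is missing.
for tool in nvidia-smi mpirun julia; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: $("$tool" --version 2>&1 | head -n 1)"
    else
        echo "$tool: not found"
    fi
done
# The Julia-side package versions can then be shown with, e.g.:
#   julia --project=@. -e 'using CUDA; CUDA.versioninfo()'
#   julia --project=@. -e 'using MPI; println(MPI.Get_library_version())'
```

Pasting this output into the issue would pin down exactly which driver/runtime/MPI combination triggers the failure.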