Comments (17)
Thank you both for helping!
Yes, @omlins, I was able to get a segfault just using MPI and CUDA with a broadcast/receive pattern. The OOM error I am getting above sometimes turns into a segfault, which makes me believe there is undefined behavior somewhere, as you say, with the allocation of memory on the GPUs.
We have factorial combinations of CUDA drivers, compilers, and MPI implementations, plus various little things that are making it difficult to build a reproducible example. Also, the system is new, and in my experience these things get ironed out in the first few weeks (as the admins correct them). When I manage to build one, I will file upstream or discuss on Discourse as you advise.
Feel free to close here for now; we can re-open if I determine that the issue pertains to ParallelStencil or ImplicitGlobalGrid, which are great, regardless :)
from parallelstencil.jl.
I simplified further and am getting segfaults from the CUDA and MPI interaction alone. I will investigate further, and come back here if the error is not coming from upstream.
Hi @raminammour, do I understand right that you could reproduce the issue without ImplicitGlobalGrid and ParallelStencil? Then you would best post in the GPU section of the Julia Discourse in order to get help. I would guess that something is wrong with the GPU memory allocation to your processes, independent of what code you run...
OK, I will close the issue then. Don't hesitate to open a topic on the Julia Discourse - it could well be that somebody will immediately know what's going on...
PS: thanks for your nice words about ParallelStencil and ImplicitGlobalGrid!
Here is a smaller reproducer (which again, works with 1 process) if it helps:
$JULIA_DEPOT_PATH/bin/mpiexecjl -n 2 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);select_device();@show sum(@zeros(siz,siz,siz))"
Global grid: 60x31x31 (nprocs: 2, dims: 2x1x1)
ERROR: Out of GPU memory trying to allocate 232.742 KiB
Effective GPU memory usage: 1.03% (416.000 MiB/39.586 GiB)
CUDA allocator usage: 0 bytes
Memory pool usage: 0 bytes (0 bytes allocated, 0 bytes cached)
Hi @raminammour,
Thanks for reporting. I just tried your MVE and it works fine for me, both with -n 1 and -n 2. Before we dive further into debugging, are you sure that when running on more than one MPI process, each MPI process has access to its own dedicated GPU (and thus that the MPI processes are not trying to initialize on the same GPU)? This could be a reason for the error you get.
Also, what type of multi-GPU system are you running on, and how many GPUs per node?
Here is my output:
[lraess@node]$ $mpirun_ -n 1 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);select_device();@show sum(@zeros(siz,siz,siz))"
Global grid: 31x31x31 (nprocs: 1, dims: 1x1x1)
sum(#= none:1 =# @zeros(siz, siz, siz)) = 0.0
[lraess@node]$ $mpirun_ -n 2 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);select_device();@show sum(@zeros(siz,siz,siz))"
Global grid: 60x31x31 (nprocs: 2, dims: 2x1x1)
sum(#= none:1 =# @zeros(siz, siz, siz)) = sum(#= none:1 =# @zeros(siz, siz, siz)) = 0.0
0.0
Hey, thanks for the reply.
These are nodes with 4 GPUs. I reserve them with SLURM, and just tried to be more explicit: srun -n 4 --gres=gpu:4 --cpus-per-gpu=1 --pty bash -l. Is there a way to make sure that each MPI process has a dedicated GPU? (It seems that the SLURM installation we have does not recognize the ntasks-per-gpu or bind-gpu options.)
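One common workaround on systems where SLURM's GPU-binding options are unavailable is a small wrapper script that restricts each rank to a single GPU via CUDA_VISIBLE_DEVICES before the application starts. This is a sketch, not from the thread: the script name gpu_bind.sh is hypothetical, and it assumes the launcher exports a node-local rank variable (OpenMPI's OMPI_COMM_WORLD_LOCAL_RANK or SLURM's SLURM_LOCALID).

```shell
#!/usr/bin/env bash
# gpu_bind.sh (hypothetical wrapper): pin each MPI rank to one GPU by
# exposing only that rank's node-local device via CUDA_VISIBLE_DEVICES.
# Assumes OpenMPI's OMPI_COMM_WORLD_LOCAL_RANK or SLURM's SLURM_LOCALID.
NGPUS_PER_NODE=${NGPUS_PER_NODE:-4}
LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}
export CUDA_VISIBLE_DEVICES=$(( LOCAL_RANK % NGPUS_PER_NODE ))
exec "$@"   # run the actual command, e.g. julia --project=@. script.jl
```

Launched as, e.g., mpiexec -n 4 ./gpu_bind.sh julia --project=@. script.jl, each rank then sees exactly one GPU (appearing as device 0 from its own point of view), so two ranks cannot accidentally initialize on the same device.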
You're welcome. Maybe you could try running with -n 2 on 2 different nodes with 1 GPU per node, to see if this fixes the error and confirms that there is an issue with multiple MPI processes overriding each other's GPU on multi-GPU nodes?
Another thing to try could be to add --ntasks-per-node=4 instead of --cpus-per-gpu=1 (but I don't have much experience with SLURM).
Is there a way to make sure that each mpi process has a dedicated gpu?
On the application side, it should be handled by select_device() from here, which gets the node-local MPI rank and assigns it as the GPU ID. You could check it by printing the output, as such (@show me=select_device()):
$JULIA_DEPOT_PATH/bin/mpiexecjl -n 2 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);@show me=select_device();@show sum(@zeros(siz,siz,siz))"
Seems to be correctly selecting different devices
--------------------------------------------------------------------------
Global grid: 60x31x31 (nprocs: 2, dims: 2x1x1)
me = select_device() = 1
me = select_device() = 0
I simplified further and am getting segfaults from the CUDA and MPI interaction. I will investigate further, and come back here if the error is not coming from upstream.
For now, feel free to close the issue if you see fit, or keep it open and I will notify you of any resolution.
Cheers!
Indeed, it seems to be OK regarding device selection. Could you try running on different nodes?
I simplified further and getting segfaults with CUDA and MPI interaction
That's not nice. Do you use CUDA-aware MPI?
You certainly did already, but these two resources may give further hints on how to set up the environment to run multi-GPU:
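For reference, one way to check whether an OpenMPI build is CUDA-aware is to query its MCA parameters with ompi_info, and to ask the Julia MPI wrapper the same question. This is a sketch; MPI.has_cuda() is the check mentioned later in this thread, and the exact parsable output format depends on the OpenMPI version.

```shell
# Check the system OpenMPI build for CUDA support; the matching line ends
# in ":value:true" when the library was built with CUDA awareness.
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value

# Ask MPI.jl the same question from the Julia side.
julia --project=@. -e 'using MPI; println(MPI.has_cuda())'
```

Note that a CUDA-aware library build does not guarantee that the transport in use (e.g. UCX vs. openib) actually supports GPU buffers on a given system.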
Yeah, MPI reports that it has_cuda(), but it still segfaults :(
Could you maybe try export IGG_CUDAAWARE_MPI=0 to disable CUDA-awareness within ImplicitGlobalGrid.jl, and see if this fixes the segfault?
Is the machine you are running on using MPICH as its MPI?
This machine is using OpenMPI.
export IGG_CUDAAWARE_MPI=0 does not fix the OOM error above.
Does the MPI install use UCX instead of openib? I had issues getting MPI to work properly when OpenMPI was configured with UCX.
I am on a managed system, so cannot play with any installations unfortunately.
If you have multiple nodes, it would be interesting to see if the error occurs when running with one GPU per node on 2 nodes.
Also, do you get any errors (segfaults) running on 2 MPI processes with the Threads backend?
It works with one GPU per node with two MPI processes, but not with 2 GPUs on one node!
Interesting.
- What type of node and GPUs are you running on?
- Which versions of MPI and CUDA are installed (both the Julia and the system ones)?
- Could you check with your sysadmin whether the SLURM launch parameters are the correct ones for the type of multi-GPU-per-node job you are trying to run?
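To collect the version information asked for above, something like the following could be run on a compute node. This is a sketch: the --version flags are the standard ones for these tools, and the Julia queries in the comments assume CUDA.jl and MPI.jl are in the active project.

```shell
# Print one version line per tool; fall back gracefully if a tool is missing.
for tool in nvidia-smi mpirun julia; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: $("$tool" --version 2>&1 | head -n 1)"
    else
        echo "$tool: not found"
    fi
done
# The Julia-side package versions can then be shown with, e.g.:
#   julia --project=@. -e 'using CUDA; CUDA.versioninfo()'
#   julia --project=@. -e 'using MPI; println(MPI.Get_library_version())'
```

Pasting this output into the issue would pin down exactly which driver/runtime/MPI combination triggers the failure.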