Comments (17)

raminammour commented on May 30, 2024

Thank you both for helping!

Yes, @omlins, I was able to get a segfault just using MPI and CUDA with a broadcast/receive pattern. The OOM I am getting above sometimes leads to a segfault, which leads me to believe that there is undefined behavior somewhere, as you say, possibly in the allocation of memory on the GPUs.
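
For illustration, a minimal sketch of that kind of broadcast pattern (a hypothetical reproducer assuming CUDA-aware MPI and a single node, not the exact code used):

using MPI, CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Naive device selection: one GPU per rank on a single node.
CUDA.device!(rank % length(CUDA.devices()))

# Broadcast a device array from rank 0; with CUDA-aware MPI the device
# pointer is handed to MPI directly.
A = rank == 0 ? CUDA.ones(Float64, 1024) : CUDA.zeros(Float64, 1024)
MPI.Bcast!(A, 0, comm)
@show rank, sum(A)

MPI.Finalize()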

We have combinatorially many combinations of CUDA drivers, compilers, and MPIs, and various little things that are making it difficult to build a reproducible example. Also, the system is new, and in my experience these things get ironed out in the first few weeks (as admins correct things). When I manage to build one, I will file an issue upstream or discuss it on Discourse, as you advise.

Feel free to close here for now; we can re-open if I determine that the issue pertains to ParallelStencil or ImplicitGlobalGrid, which are great regardless :)

omlins commented on May 30, 2024

I simplified further and am getting segfaults with CUDA and MPI interaction. I will investigate further, and come back here if the error is not coming from upstream.

Hi @raminammour, do I understand correctly that you could reproduce the issue without ImplicitGlobalGrid and ParallelStencil? Then you would best post in the GPU section of the Julia Discourse in order to get help. I would guess that something is wrong with the GPU memory allocation for your processes, independent of what code you run...
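
One way to check this (a hypothetical diagnostic using plain MPI and CUDA.jl, in the style of the MVE above) is to print, per rank, which device is active and how much free memory it reports:

$JULIA_DEPOT_PATH/bin/mpiexecjl -n 2 julia --project=@. -e "using MPI,CUDA; MPI.Init(); @show MPI.Comm_rank(MPI.COMM_WORLD), CUDA.device(), CUDA.available_memory()"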

omlins commented on May 30, 2024

OK, I will close the issue then. Don't hesitate to open a topic on the Julia Discourse - it could well be that somebody will immediately know what's going on...

PS: thanks for your nice words about ParallelStencil and ImplicitGlobalGrid!

raminammour commented on May 30, 2024

Here is a smaller reproducer (which, again, works with 1 process), if it helps:

$JULIA_DEPOT_PATH/bin/mpiexecjl -n 2 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);select_device();@show sum(@zeros(siz,siz,siz))"
Global grid: 60x31x31 (nprocs: 2, dims: 2x1x1)
ERROR: Out of GPU memory trying to allocate 232.742 KiB
Effective GPU memory usage: 1.03% (416.000 MiB/39.586 GiB)
CUDA allocator usage: 0 bytes
Memory pool usage: 0 bytes (0 bytes allocated, 0 bytes cached)
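
The report itself suggests the failing ~233 KiB allocation is not real memory exhaustion (usage is at 1.03%, with an empty pool). As a hedged experiment, CUDA.jl's memory pool can be disabled via the JULIA_CUDA_MEMORY_POOL environment variable (whether this changes anything here is an assumption):

JULIA_CUDA_MEMORY_POOL=none $JULIA_DEPOT_PATH/bin/mpiexecjl -n 2 julia --project=@. -e "..."   # same command as above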

luraess commented on May 30, 2024

Hi @raminammour,
Thanks for reporting. I just tried your MVE and it works fine for my both with -n 1 and -n 2. Before we further dive into debugging, are you sure that running on more than one MPI process, each MPI process has access to its dedicated GPU (and thus be sure that both MPI process are not trying to init on the same GPU) ? This could be a reason for the error you get.
Also, what type of multi-GPU system are you running on, how many GPU per node ?

Here is my output:

[lraess@node]$ $mpirun_ -n 1 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);select_device();@show sum(@zeros(siz,siz,siz))"
Global grid: 31x31x31 (nprocs: 1, dims: 1x1x1)
sum(#= none:1 =# @zeros(siz, siz, siz)) = 0.0

[lraess@node]$ $mpirun_ -n 2 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);select_device();@show sum(@zeros(siz,siz,siz))"
Global grid: 60x31x31 (nprocs: 2, dims: 2x1x1)
sum(#= none:1 =# @zeros(siz, siz, siz)) = sum(#= none:1 =# @zeros(siz, siz, siz)) = 0.0
0.0

raminammour commented on May 30, 2024

Hey, thanks for the reply.

These are nodes with 4 GPUs. I reserve them with Slurm and just tried to be more explicit: srun -n 4 --gres=gpu:4 --cpus-per-gpu=1 --pty bash -l. Is there a way to make sure that each MPI process has a dedicated GPU? (It seems that the Slurm installation we have does not recognize the --ntasks-per-gpu or --gpu-bind options.)

luraess commented on May 30, 2024

You're welcome. Maybe you could try running -n 2 on 2 different nodes with 1 GPU per node, to see if this avoids the error and confirms there is an issue with multiple MPI processes grabbing the same GPU on multi-GPU nodes.

Another thing to try could be adding --ntasks-per-node=4 instead of --cpus-per-gpu=1 (but I don't have much experience with SLURM); a hypothetical allocation for the two-node test is sketched below.
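
For the two-node test, the allocation could look like this (the exact flags depend on the site's Slurm configuration, so treat this as an assumption):

srun -N 2 --ntasks-per-node=1 --gres=gpu:1 --pty bash -l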

Is there a way to make sure that each mpi process has a dedicated gpu?

On the application side, it should be handled by select_device() from here, which gets the node-local MPI rank and assigns it as the GPU ID. You could check it by printing the output, as such (@show me = select_device()):

$JULIA_DEPOT_PATH/bin/mpiexecjl -n 2 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(CUDA, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);@show me=select_device();@show sum(@zeros(siz,siz,siz))"
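
For reference, node-local device selection can be sketched roughly like this (an approximation using plain MPI.jl and CUDA.jl, not ImplicitGlobalGrid's exact implementation):

using MPI, CUDA

MPI.Init()
comm = MPI.COMM_WORLD
# Split the communicator by shared-memory domain (i.e., by node) to get
# the node-local rank, then bind that rank to the matching GPU.
comm_node = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, MPI.Comm_rank(comm))
rank_node = MPI.Comm_rank(comm_node)
CUDA.device!(rank_node)
@show rank_node, CUDA.device()

MPI.Finalize()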

raminammour commented on May 30, 2024

Seems to be correctly selecting different devices:

--------------------------------------------------------------------------
Global grid: 60x31x31 (nprocs: 2, dims: 2x1x1)
me = select_device() = 1
me = select_device() = 0

I simplified further and am getting segfaults with CUDA and MPI interaction. I will investigate further, and come back here if the error is not coming from upstream.

For now, feel free to close the issue if you see fit, or keep it open and I will notify you of any resolution.

Cheers!

luraess commented on May 30, 2024

Indeed, seems to be OK regarding device selection. Could you try running on different nodes?

I simplified further and am getting segfaults with CUDA and MPI interaction

That's not nice. Do you use CUDA-aware MPI?

You certainly did already, but these two resources may give further hints on how to set up the environment to run multi-GPU:

raminammour commented on May 30, 2024

Yeah, MPI reports that it has_cuda(), but it still segfaults :(

luraess commented on May 30, 2024

Could you maybe try export IGG_CUDAAWARE_MPI=0 to disable CUDA-awareness within ImplicitGlobalGrid.jl, and see if this fixes the segfault?
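
For context, disabling CUDA-awareness means communication is staged through host memory, conceptually like this (a simplified sketch of the idea, not ImplicitGlobalGrid's actual halo-update code; bcast_via_host! is a hypothetical helper):

using MPI, CUDA

# With CUDA-aware MPI, the device buffer is passed to MPI directly:
#   MPI.Bcast!(A_gpu, 0, comm)
# Without it, data is staged through a host buffer:
function bcast_via_host!(A_gpu::CuArray, root::Integer, comm::MPI.Comm)
    A_host = Array(A_gpu)          # device -> host copy
    MPI.Bcast!(A_host, root, comm) # MPI only touches host memory
    copyto!(A_gpu, A_host)         # host -> device copy
    return A_gpu
end

MPI.Init()
comm = MPI.COMM_WORLD
A = MPI.Comm_rank(comm) == 0 ? CUDA.ones(8) : CUDA.zeros(8)
bcast_via_host!(A, 0, comm)
@show sum(A)
MPI.Finalize()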

Is the machine you are running on using MPICH as its MPI?

raminammour commented on May 30, 2024

This machine is using OpenMPI.
export IGG_CUDAAWARE_MPI=0 does not fix the OOM error above.

luraess commented on May 30, 2024

Does the MPI installation use UCX instead of openib? I had issues getting MPI to work properly when OpenMPI was configured with UCX.

raminammour commented on May 30, 2024

I am on a managed system, so I cannot play with any installations, unfortunately.

luraess commented on May 30, 2024

If you have multiple nodes, it would be interesting to see if the error occurs when running with one GPU per node on 2 nodes.

Also, do you get any errors (segfaults) running on 2 MPI processes with the Threads backend?
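
For example, the earlier MVE can be switched to the Threads backend like this (same command as above with only the backend argument changed; select_device() is dropped since no GPU is used):

$JULIA_DEPOT_PATH/bin/mpiexecjl -n 2 julia --project=@. -e "using ParallelStencil,MPI,ImplicitGlobalGrid;@init_parallel_stencil(Threads, Float64, 3); siz=31; me, dims, nprocs, coords, comm = init_global_grid(siz,siz,siz);@show sum(@zeros(siz,siz,siz))"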

raminammour commented on May 30, 2024

It works with one GPU per node with two MPI processes, but not with 2 GPUs on one node!

luraess commented on May 30, 2024

Interesting.

  • What type of node and GPUs are you running on?
  • Which versions of MPI and CUDA are installed (both the Julia and the system ones)?
  • Could you check with your sysadmin whether the Slurm launch parameters are the correct ones for the type of job you are trying to run with multiple GPUs per node?
