Comments (26)
In the mpirun command, you can call a wrapper script that assigns a different GPU device to each of the 2 ranks before pxgemm_miniapp is invoked. I believe NCCL does not allow multiple ranks to use the same GPU device.
For instance, if you were using OpenMPI, the wrapper script might look like this:
#!/bin/bash
# Give each local MPI rank its own GPU before launching the application.
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
exec "$@"
and you would call the miniapp as follows:
mpirun -np 2 ./wrapper.sh ./pxgemm_miniapp <args>
from cosma.
This is interesting. I do not see the same thing, so this tells me that it could have something to do with your MPI configuration. I am not an expert here, so I will let others chime in.
Since there are 10 repetitions, the output contains the timings of all runs, sorted from fastest to slowest, in ms.
This means the fastest run took 1068 ms and the slowest 10368 ms. It is indeed quite unusual that there is so much overhead; usually the difference is much less substantial.
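To make the extremes easy to pull out of such a line, a throwaway shell snippet like the following can help (the `COSMA TIMES` line is copied verbatim from the miniapp output discussed here; the parsing itself is just a sketch):

```shell
# Parse a "COSMA TIMES" line (sorted fastest to slowest, in ms)
# and report the two extremes.
line="COSMA TIMES [ms] = 1068 1083 1170 1236 1295 3184 9896 10132 10139 10368"
times=${line#*= }                       # strip everything up to "= "
fastest=$(echo "$times" | awk '{print $1}')
slowest=$(echo "$times" | awk '{print $NF}')
echo "fastest=${fastest}ms slowest=${slowest}ms"
# prints: fastest=1068ms slowest=10368ms
```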
That's true; I'm now trying to build on another computer.
For the moment, this is what I get on the second machine:
mpirun -np 1 ./wrapper.sh ./cosma_miniapp -m 32768 -n 32768 -k 32768 --type=double -r 4
COSMA TIMES [ms] = 5811 5814 5814 6523
mpirun -np 2 ./wrapper.sh ./cosma_miniapp -m 32768 -n 32768 -k 32768 --type=double -r 4
COSMA TIMES [ms] = 4278 4305 4306 10292
Shows some speedup, right?
But I'll rebuild COSMA with the instructions you sent. I'm waiting for the reservation to be ready, and then I'll send you feedback.
Many thanks!
Thank you for your reply. I'm going to test it and I'll send you feedback.
Hello,
using the wrapper I can see one process per GPU, thanks!
However, I have no idea how to interpret the outputs.
For instance, my system has 2 GPUs and I launched the miniapp as follows:
mpirun -np 2 -npernode 2 -machinefile $PBS_NODEFILE ./wrapper.sh ./pxgemm_miniapp -m 18000 -n 18000 -k 18000 --p_grid=1,1 --block_a=1024,1024 --block_b=1024,1024 --block_c=1024,1024 --type=double --algorithm=cosma -r 1
What I see is the same output, twice:
Running PDGEMM on the following problem
(2x)
GLOBAL MAT. SIZES
A = 18000 x 18000
B = 18000 x 18000
C = 18000 x 18000
...
...
COSMA TIMES [ms] = 8758
COSMA TIMES [ms] = 10471
....
So it looks like two single-GPU instances of the miniapp are executed, and the load is not divided among the GPU devices.
Is this correct? Are there additional MPI flags or pxgemm parameters to set in order to use multiple GPUs?
Best regards
Can you try setting --p_grid=2,1?
Sure!
mpirun -np 2 ./wrapper.sh ./pxgemm_miniapp -m 18000 -n 18000 -k 18000 --p_grid=2,1 --block_a=1024,1024 --block_b=1024,1024 --block_c=1024,1024 --type=double --algorithm=cosma -r 1
..........................
COSMA(pxgemm_miniapp.cpp): warning: number of processors in the grid must be equal to P, setting grid to 1xP instead.
COSMA(pxgemm_miniapp.cpp): warning: number of processors in the grid must be equal to P, setting grid to 1xP instead.
Running PDGEMM on the following problem:
...
...
...
COSMA TIMES [ms] = 3562
COSMA TIMES [ms] = 3522
..............................
Are the multi-GPU build instructions enough, or should I also build with GPU-aware MPI?
Thanks!
OK, let me try to rebuild everything and see what I get.
On your side, do you see just one output?
Is your OpenMPI the official NVIDIA build of OpenMPI?
I run with an OpenMPI that I built from source. I also run on AMD GPUs, but on your system you should try the OpenMPI that is distributed with NVIDIA's HPC toolkit. You don't need GPU-aware MPI support to use NCCL/RCCL, but having it shouldn't hurt.
You can also check whether Intel MPI has some parameter that results in such behavior (duplication of tasks instead of one parallel task with multiple ranks).
OK, so there is an MPI in the source? Maybe it would be interesting to use it instead of the modules from the system.
At the moment, I'm using the OpenMPI from a module; the Intel one does not have that local-rank variable to help with the GPU affinity.
Btw, I'm using -DCOSMA_SCALAPACK=MKL instead of CUSTOM. Is that OK?
A bit more info...
This behavior can also be seen with the cosma_miniapp.
It looks like things are executed on the first GPU and then on the next one.
$ mpirun -np 1 ./wrapper.sh ./cosma_miniapp -m 16384 -n 16384 -k 16384 --type=double -r 1
Strategy = Matrix dimensions (m, n, k) = (16384, 16384, 16384)
Number of processors: 1
Overlap of communication and computation: OFF.
Divisions strategy:
Required memory per rank (in #elements): 805306368
Available memory per rank (in #elements): not specified (assumed: infinite)
COSMA TIMES [ms] = 1056
mpirun -np 2 ./wrapper.sh ./cosma_miniapp -m 16384 -n 16384 -k 16384 --type=double -r 1
Strategy = Matrix dimensions (m, n, k) = (16384, 16384, 16384)
Number of processors: 2
Overlap of communication and computation: ON.
Communication-thread policy (for overlap): busy-waiting (using blocking one-sided MPI).
Divisions strategy:
parallel (k / 2)
Required memory per rank (in #elements): 671088640
Available memory per rank (in #elements): not specified (assumed: infinite)
COSMA TIMES [ms] = 3532
@tcarneirop I see that in the first example the overlap of communication and computation is off, while in the other one it's on. Can you run both examples with the overlap turned off, by setting export COSMA_OVERLAP_COMM_AND_COMP=OFF, just so that we have consistent configurations?
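For reference, the two runs with matching settings might be scripted like this (commands copied from the earlier messages; this is only an illustrative sketch, since COSMA_OVERLAP_COMM_AND_COMP is read at runtime and needs no rebuild):

```shell
# Disable the communication/computation overlap for both runs,
# so that the two configurations are directly comparable.
export COSMA_OVERLAP_COMM_AND_COMP=OFF

# Single-rank (single-GPU) baseline:
mpirun -np 1 ./wrapper.sh ./cosma_miniapp -m 16384 -n 16384 -k 16384 --type=double -r 1

# Two ranks, one GPU each (GPU affinity set by wrapper.sh):
mpirun -np 2 ./wrapper.sh ./cosma_miniapp -m 16384 -n 16384 -k 16384 --type=double -r 1
```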
Hello, thanks for your reply.
I tried it and things are the same: it launches 4 MPI processes, each one on a different GPU, but it looks like each one is solving a 16k x 16k matrix multiplication instead of sharing the load:
mpirun -np 4 ./wrapper.sh ./cosma_miniapp -m 16384 -n 16384 -k 16384 --type=double -r 1
Strategy = Matrix dimensions (m, n, k) = (16384, 16384, 16384)
Number of processors: 4
Overlap of communication and computation: OFF.
Divisions strategy:
parallel (n / 2)
parallel (k / 2)
Required memory per rank (in #elements): 469762048
Available memory per rank (in #elements): not specified (assumed: infinite)
COSMA TIMES [ms] = 3729
vs
mpirun -np 1 ./wrapper.sh ./cosma_miniapp -m 16384 -n 16384 -k 16384 --type=double -r 1
Strategy = Matrix dimensions (m, n, k) = (16384, 16384, 16384)
Number of processors: 1
Overlap of communication and computation: OFF.
Divisions strategy:
Required memory per rank (in #elements): 805306368
Available memory per rank (in #elements): not specified (assumed: infinite)
COSMA TIMES [ms] = 1100
My cfgs:
ml CMake/3.24.3-GCCcore-11.3.0
ml GCC/11.3.0
ml imkl
ml OpenMPI/4.1.4-GCC-11.3.0
ml CUDA/11.7.0
export NCCL_ROOT=/home/user/nccl/
export NCCL_LIB_DIR=/home/user/nccl/lib/
export NCCL_INCLUDE_DIR=/home/user/nccl/include/
export COSMA_OVERLAP_COMM_AND_COMP=OFF
cmake -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=MKL -DCOSMA_WITH_NCCL=ON -DCMAKE_INSTALL_PREFIX=~/cosmaAGAIN ..
Btw, the wrapper.sh is the one provided a few answers ago.
Thanks!!
Nothing? =(
Hi @tcarneirop! I think I know what the problem might be.
However, to be sure, can you please also try running the same thing on the GPU, but without NCCL and without CUDA-aware MPI?
So, the cmake command would contain: -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=MKL -DCOSMA_WITH_NCCL=OFF.
Thanks for your reply!
Sure, just one question: which CUDA-aware MPI should I load?
Thanks!
You should just use the standard MPI, without the CUDA-aware part. You can also add -DCOSMA_WITH_GPU_AWARE_MPI=OFF to cmake, but this is the default value anyway.
We simply want to see the results of COSMA with the GPU backend, but without NCCL and without CUDA-aware MPI.
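A minimal reconfigure sketch under these assumptions (the build directory name is a placeholder, and the module environment is assumed to be the one already loaded):

```shell
# Reconfigure COSMA with the GPU backend, but with NCCL and
# GPU-aware MPI both disabled (the latter is the default anyway).
mkdir -p build && cd build
cmake -DCOSMA_BLAS=CUDA \
      -DCOSMA_SCALAPACK=MKL \
      -DCOSMA_WITH_NCCL=OFF \
      -DCOSMA_WITH_GPU_AWARE_MPI=OFF \
      ..
make -j
```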
Hello @kabicm
Thanks for your reply!
Ok, with -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=MKL -DCOSMA_WITH_NCCL=OFF -DCOSMA_WITH_GPU_AWARE_MPI=OFF
mpirun -np 1 ./cosma_miniapp -m 16384 -n 16384 -k 16384 --type=double -r 1
Strategy = Matrix dimensions (m, n, k) = (16384, 16384, 16384)
Number of processors: 1
Overlap of communication and computation: OFF.
Divisions strategy:
Required memory per rank (in #elements): 805306368
Available memory per rank (in #elements): not specified (assumed: infinite)
COSMA TIMES [ms] = 1081
mpirun -np 4 ./wrapper.sh ./cosma_miniapp -m 16384 -n 16384 -k 16384 --type=double -r 1
Strategy = Matrix dimensions (m, n, k) = (16384, 16384, 16384)
Number of processors: 4
Overlap of communication and computation: OFF.
Divisions strategy:
parallel (n / 2)
parallel (k / 2)
Required memory per rank (in #elements): 469762048
Available memory per rank (in #elements): not specified (assumed: infinite)
COSMA TIMES [ms] = 3020
My modules:
ml CMake/3.24.3-GCCcore-11.3.0
ml GCC/11.3.0
ml imkl
ml OpenMPI/4.1.4-GCC-11.3.0
ml CUDA/11.7.0
Thanks!!
Thanks a lot for the detailed benchmarking!
I see two possibilities here:
- I just realized that the number of repetitions is set to 1. The initial run always allocates the memory pools (both CPU and GPU), so its timing includes all the one-time overheads. For this reason, it would be better to set -r 3 or higher: the subsequent runs reuse the pre-allocated memory, so we can see the actual compute time without the overheads.
- This might still be too small a matrix size to fully utilize the 4 GPUs, so you should try larger sizes as well. On the P100, I was running m=n=k=30k (double precision) on a single GPU. If there is not enough work, a single GPU might be faster.
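One rough way to summarize such a timings array is to drop the slowest entry, which usually belongs to the allocation-heavy first run, and average the rest. A small sketch with illustrative numbers:

```shell
# Given a sorted COSMA TIMES array, drop the slowest entry (typically the
# first, allocation-heavy run) and average the remaining steady-state runs.
times="1068 1083 1170 1236"
avg=$(echo "$times" | awk '{ for (i = 1; i < NF; i++) s += $i; print s / (NF - 1) }')
echo "steady-state average: ${avg} ms"
# prints: steady-state average: 1107 ms
```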
I am looking forward to seeing how it went!
Thanks!!
So, should I continue with -DCOSMA_BLAS=CUDA -DCOSMA_SCALAPACK=MKL -DCOSMA_WITH_NCCL=OFF -DCOSMA_WITH_GPU_AWARE_MPI=OFF?
Trying right away!
Exactly, I would keep NCCL and GPU-aware MPI turned off for the time being. Great, thanks!
Now trying
mpirun -np 4 ./wrapper.sh ./cosma_miniapp -m 16384 -n 16384 -k 16384 --type=double -r 10
Strategy = Matrix dimensions (m, n, k) = (16384, 16384, 16384)
(sorry for using 16k again; I'm now running with something bigger)
But I just don't get the output. How should I read these numbers?
...
...
Strategy = Matrix dimensions (m, n, k) = (16384, 16384, 16384)
Number of processors: 4
Overlap of communication and computation: OFF.
Divisions strategy:
parallel (n / 2)
parallel (k / 2)
Required memory per rank (in #elements): 469762048
Available memory per rank (in #elements): not specified (assumed: infinite)
COSMA TIMES [ms] = 1068 1083 1170 1236 1295 3184 9896 10132 10139 10368
?
This already looks much better!
For fine-tuning, you can also try playing with the tile sizes for GPUs (see:
https://github.com/eth-cscs/COSMA#tunable-parameters). Once the matrix dimensions are split among the ranks, each rank further splits its local matrices into tiles and pipelines these tiles to the GPU. You can choose the size of these tiles, in number of elements, by setting:
export COSMA_GPU_MAX_TILE_M=4000
export COSMA_GPU_MAX_TILE_N=4000
export COSMA_GPU_MAX_TILE_K=4000
This means all the tiles would be 4k x 4k elements.
You can also try larger sizes, e.g. by setting all the tiles to 8k. The optimal tile size depends on the GPUs you are using; for the P100, the optimal sizes were 4-5k.
You can just set these environment variables before running; you don't have to recompile the code, as they are read at runtime.
Just in case you want to experiment with it!
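If you want to sweep a few tile sizes in one go, something along these lines would work (the tile values and matrix size are just examples; the miniapp path and wrapper are the ones used above):

```shell
# Try a few GPU tile sizes; the variables are read at runtime,
# so no recompilation is needed between runs.
for tile in 2000 4000 8000; do
    export COSMA_GPU_MAX_TILE_M=$tile
    export COSMA_GPU_MAX_TILE_N=$tile
    export COSMA_GPU_MAX_TILE_K=$tile
    echo "=== tile = ${tile} ==="
    mpirun -np 4 ./wrapper.sh ./cosma_miniapp \
        -m 36000 -n 36000 -k 36000 --type=double -r 4
done
```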
Hello @kabicm
many thanks!!
It looks like we've got some speedup:
mpirun -np 1 ./wrapper.sh ./cosma_miniapp -m 36000 -n 36000 -k 36000 --type=double -r 4
COSMA TIMES [ms] = 16278 16278 16279 16766
mpirun -np 4 ./wrapper.sh ./cosma_miniapp -m 36000 -n 36000 -k 36000 --type=double -r 4
COSMA TIMES [ms] = 9173 9184 9282 12963
Question --
32k on one gpu uses -- 1719MiB / 40960MiB
on 26769MiB / 40960MiB
Is it ok or is there something strange?
As I'm in a hurry, I could not test it yet with the settings you suggested.
As soon as I get another time slot on the cluster, I'll re-run the experiments.
Thanks again,
We have to buy you a coffee =)