Comments (44)
The other script I sent before is just another variant. In this case, WarpX is already installed within the container, and all warpx_1d, warpx_2d, and warpx_3d executables can then be executed directly.
from warpx.
@ax3l BTW, I created different Singularity definition files which could eventually allow any user to run WarpX on any HPC system that provides Singularity/Apptainer as container technology.
Definition files for CPUs and GPUs (AMD+ROCm) are provided.
Might this be interesting for WarpX users?
Yes, this is actually exactly the way I do it.
This is now set; waiting for the next release to solve that eventually.
Yes, I have some thoughts on how to handle this.
Thanks for your interest in the code.
Could you share the content of your run-file.sh script? In particular, are you calling an MPI runner, such as mpirun, srun, or jsrun, from inside run-file.sh?
For example, submitting with the above slurm command gives at initialization:
MPI initialized with 4 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
HIP initialized with 1 device.
AMReX (23.05) initialized
PICSAR (1903ecfff51a)
WarpX (23.05)
 __        __             __  __
 \ \      / /_ _ _ __ _ __\ \/ /
  \ \ /\ / / _` | '__| '_ \\  /
   \ V  V / (_| | |  | |_) /  \
    \_/\_/ \__,_|_|  | .__/_/\_\
                     |_|
Level 0: dt = 1.530214125e-17 ; dx = 5.580357143e-09 ; dz = 8.081896552e-09
Grids Summary:
Level 0 29 grids 9977856 cells 100 % of domain
smallest grid: 2688 x 128 biggest grid: 2688 x 128
Should it be HIP initialized with 4 devices, with 1 GPU device per MPI rank?
run-file.sh:
#!/bin/bash
#export CONT=/lustre/rz/dbertini/containers/prod/gpu/rlx8_rocm-5.4.3_warpx.sif
export CONT=/cvmfs/phelix.gsi.de/sifs/gpu/rlx8_rocm-5.4.3_warpx.sif
export WDIR=/lustre/rz/dbertini/gpu/warpx
export OMPI_MCA_io=romio321
export APPTAINER_BINDPATH=/lustre/rz/dbertini/,/cvmfs
srun --export=ALL -- $CONT warpx_2d $WDIR/scripts/inputs/warpx_opmd_deck
OK, thanks.
Based on the output that you sent, I think that WarpX is actually using 4 GPUs. I think that the message
HIP initialized with 1 device.
is to be understood per MPI rank.
@WeiqunZhang Could you confirm that this is the case?
Unfortunately, on the node where the job is running, I can see that only one GPU is used, as shown by the utility program rocm-smi.
This, for example, is a snapshot of the rocm-smi output while the job is running:
GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 71.0c 198.0W 1502Mhz 1200Mhz 0% auto 290.0W 84% 99%
1 38.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 2% 0%
2 38.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 2% 0%
3 39.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 2% 0%
4 59.0c 40.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
5 38.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
6 40.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
7 40.0c 35.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
One can see that only the GPU at index 0 is used; the other requested GPUs are idle... any idea?
HIP initialized with 1 device.
means only one device is being used by all the processes. This is most likely a job script issue.
Yes, but it is still not clear to me what could possibly be wrong in my job script...
It depends on how Slurm is configured on that system. Maybe try to change --cpus-per-task 1. Figure out how many CPUs you have on a node and divide that by 4. The issue might be that all your processes were using CPUs that were close to one GPU, and that GPU was mapped to all 4 processes. There is nothing we can do in the C++ code if the GPUs are not visible to us.
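One quick way to see which GPUs each task actually gets is to print the per-task device list under srun; this is a site-specific sketch, assuming Slurm/ROCm set SLURM_PROCID and ROCR_VISIBLE_DEVICES on your cluster:

```shell
# Diagnostic sketch: show, per task, which GPUs Slurm made visible.
# SLURM_PROCID and ROCR_VISIBLE_DEVICES are assumptions about the
# site configuration; adjust the resource flags to match your job.
srun --ntasks-per-node=4 --gres=gpu:4 \
  bash -c 'echo "rank $SLURM_PROCID sees GPUs: $ROCR_VISIBLE_DEVICES"'
```

If every rank reports the same single device, the problem is in the binding, not in WarpX.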
I have 96 processors on one machine, so I changed to --cpus-per-task 24; I still use only one GPU, so that does not help.
Maybe instead of --gres=gpu:4, you can try --gpus-per-task=1 and --gpu-bind=verbose,single:1.
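Combined with the run-file.sh above, the submission line might then look like this (an untested sketch; whether these flags are supported depends on the Slurm version and site configuration):

```shell
# Sketch: one GPU per task, each task bound to its own device.
# $CONT and $WDIR are the variables exported in run-file.sh.
srun --ntasks-per-node=4 --gpus-per-task=1 --gpu-bind=verbose,single:1 \
  --export=ALL -- $CONT warpx_2d $WDIR/scripts/inputs/warpx_opmd_deck
```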
With your change I get the following error:
gpu-bind: usable_gres=0x8; bit_alloc=0xF; local_inx=4; global_list=3; local_list=3
gpu-bind: usable_gres=0x1; bit_alloc=0xF; local_inx=4; global_list=0; local_list=0
gpu-bind: usable_gres=0x4; bit_alloc=0xF; local_inx=4; global_list=2; local_list=2
gpu-bind: usable_gres=0x2; bit_alloc=0xF; local_inx=4; global_list=1; local_list=1
amrex::Abort::1::HIP error in file /tmp/warp/WarpX/build_2d/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 163 no ROCm-capable device is detected !!!
SIGABRT
amrex::Abort::2::HIP error in file /tmp/warp/WarpX/build_2d/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 163 no ROCm-capable device is detected !!!
SIGABRT
amrex::Abort::3::HIP error in file /tmp/warp/WarpX/build_2d/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 163 no ROCm-capable device is detected !!!
So it seems that from the warpx or amrex perspective, there is only one GPU device on the node?
Resubmitting with --ntasks-per-node 1 works fine; only one GPU is visible to warpx.
The error message means processes 1, 2 and 3 see zero GPUs as reported by hipGetDeviceCount. Only process 0 sees a GPU.
You can also run rocm-smi instead of warpx. I suspect you will see the same behavior, namely that only one GPU in total is available.
That is a good idea!
So rocm-smi sees all 8 devices:
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 45.0c 35.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
1 43.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
2 43.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
3 44.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
4 41.0c 39.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
5 40.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
6 40.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
7 39.0c 38.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
And it has been launched using the same slurm command as for warpx.
Did you run it under srun?
yes same command
Additionally, I ran another GPU-based PIC code, PIConGPU, and it seems to see all devices and use all GPUs on the machine without problems.
srun --export=ALL -- $CONT rocm-smi $WDIR/scripts/inputs/warpx_opmd_deck
?
No, without the input deck file, just:
srun --export=ALL -- $CONT rocm-smi
What is /cvmfs/phelix.gsi.de/sifs/gpu/rlx8_rocm-5.4.3_warpx.sif?
This is a Singularity container which contains the whole software stack needed to run warpx.
Including rocm.
Could you add the following lines after line 256 (device_id = my_rank % gpu_device_count;) of build/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp and recompile?
amrex::AllPrint() << "Proc. " << ParallelDescriptor::MyProc()
                  << ": nprocspernode = " << ParallelDescriptor::NProcsPerNode()
                  << ", my_rank = " << my_rank << ", device count = "
                  << gpu_device_count << "\n";
Hopefully this can give us more information.
Reinstalling with v23.06 and the modifications to the AMReX code you asked for gives:
MPI initialized with 4 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
Proc. 0: nprocspernode = 1, my_rank = 0, device count = 4
Proc. 1: nprocspernode = 1, my_rank = 0, device count = 4
Proc. 2: nprocspernode = 1, my_rank = 0, device count = 4
Proc. 3: nprocspernode = 1, my_rank = 0, device count = 4
HIP initialized with 1 device.
AMReX (23.06) initialized
PICSAR (1903ecfff51a)
WarpX (Unknown)
Strange... my_rank is always 0?
That my_rank is the rank in a subcommunicator of type MPI_COMM_TYPE_SHARED. The issue is that there is only one process per "node", probably because of the container. That is, in this configuration the CPUs and their memory are not shared, whereas the GPUs are shared.
You can try to map only one GPU to each MPI task in the slurm job script, maybe with more explicit GPU mapping. Or you can modify your AMReX source code; the following change should work for your specific case.
diff --git a/Src/Base/AMReX_GpuDevice.cpp b/Src/Base/AMReX_GpuDevice.cpp
index d709531440..cfd7a39e5c 100644
--- a/Src/Base/AMReX_GpuDevice.cpp
+++ b/Src/Base/AMReX_GpuDevice.cpp
@@ -253,6 +253,7 @@ Device::Initialize ()
     // ranks to GPUs, assuming that socket awareness has already
     // been handled.
 
+    my_rank = ParallelDescriptor::MyProc();
     device_id = my_rank % gpu_device_count;
 
     // If we detect more ranks than visible GPUs, warn the user
We will try to fix this in the next release.
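The rank-to-device logic under discussion can be sketched in plain Python (illustrative only; device_for_rank is not an AMReX function):

```python
# Round-robin assignment of MPI ranks to GPUs, mirroring
# "device_id = my_rank % gpu_device_count" in AMReX_GpuDevice.cpp.

def device_for_rank(rank: int, gpu_device_count: int) -> int:
    """Pick a GPU for an MPI rank in round-robin fashion."""
    return rank % gpu_device_count

# Container case before the patch: my_rank came from an
# MPI_COMM_TYPE_SHARED subcommunicator and was 0 on every rank,
# so all four processes landed on GPU 0.
before_patch = [device_for_rank(0, 4) for _ in range(4)]

# After the patch, the global rank is used instead, and the
# four ranks spread over the four visible devices.
after_patch = [device_for_rank(r, 4) for r in range(4)]

print(before_patch)  # [0, 0, 0, 0]
print(after_patch)   # [0, 1, 2, 3]
```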
But gpu_device_count is also not correct; it should be 8 and not 4 in my case...
I think that's because of --ntasks-per-node 4 --cpus-per-task 1 --gres=gpu:4.
Ah, that is correct! Thanks!
Your patch seems to work:
MPI initialized with 4 MPI processes
MPI initialized with thread support level 3
Initializing HIP...
Proc. 1: nprocspernode = 1, my_rank = 1, device count = 4
Proc. 3: nprocspernode = 1, my_rank = 3, device count = 4
Proc. 0: nprocspernode = 1, my_rank = 0, device count = 4
Proc. 2: nprocspernode = 1, my_rank = 2, device count = 4
HIP initialized with 4 devices.
AMReX (23.06) initialized
PICSAR (1903ecfff51a)
WarpX (Unknown)
But it is quite a change in the AMReX logic!
And checking with rocm-smi indeed shows the proper GPU usage:
======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 63.0c 100.0W 1502Mhz 1200Mhz 0% auto 290.0W 79% 92%
1 61.0c 326.0W 1502Mhz 1200Mhz 0% auto 290.0W 79% 99%
2 72.0c 249.0W 1502Mhz 1200Mhz 0% auto 290.0W 79% 99%
3 62.0c 252.0W 1502Mhz 1200Mhz 0% auto 290.0W 79% 99%
4 43.0c 39.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
5 42.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
6 42.0c 34.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
7 40.0c 37.0W 300Mhz 1200Mhz 0% auto 290.0W 0% 0%
================================================================================
============================= End of ROCm SMI Log ==============================
But I see that the GPU usage varies between 0-99%. Is this correct?
Is there a way with WarpX to measure the usage efficiency when running on GPUs?
ROCm has a profiling tool. You can try that.
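For example, one could wrap the binary in ROCm's profiler; this is a sketch only, assuming rocprof is installed inside the container and mirroring the srun invocation used above:

```shell
# Sketch: profile kernel times of a WarpX run with ROCm's rocprof.
# --stats prints a per-kernel timing summary after the run; check
# the rocprof documentation for the flags available in your version.
srun --export=ALL -- $CONT \
  rocprof --stats warpx_2d $WDIR/scripts/inputs/warpx_opmd_deck
```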
@denisbertini just curious about the Singularity container, which I used before.
Without the patch above that @WeiqunZhang suggests, doesn't one usually start them as:
$ srun -n <NUMBER_OF_RANKS> singularity exec <PATH/TO/MY/IMAGE.sif> </PATH/TO/BINARY/WITHIN/CONTAINER>
https://docs.sylabs.io/guides/3.3/user-guide/mpi.html
So in your case
export CONT=/cvmfs/phelix.gsi.de/sifs/gpu/rlx8_rocm-5.4.3_warpx.sif
export WDIR=/lustre/rz/dbertini/gpu/warpx
export OMPI_MCA_io=romio321
export APPTAINER_BINDPATH=/lustre/rz/dbertini/,/cvmfs
srun --export=ALL singularity exec $CONT warpx_2d $WDIR/scripts/inputs/warpx_opmd_deck
?
Awesome. One hint for parallel sims: you can also try out our dynamic load balancing capabilities.
With the AMReX block size, you can aim to create 4-12 blocks per GPU so the algorithm can move them around based on the cost function you pick. (Of course, your problem needs to be large enough not to underutilize the GPUs with too little work.) Generally, the Knapsack distribution works well, and you can use CPU and GPU timers or heuristics for cost estimates.
- https://warpx.readthedocs.io/en/latest/usage/domain_decomposition.html
- https://warpx.readthedocs.io/en/latest/usage/parameters.html#distribution-across-mpi-ranks-and-parallelization
- Rowan ME, Gott KN, Deslippe J, Huebl A, Thevenet M, Lehe R, Vay JL. In-situ assessment of device-side compute work for dynamic load balancing in a GPU-accelerated PIC code. PASC ‘21: Proceedings of the Platform for Advanced Scientific Computing Conference. 2021 July, 10, pages 1-11. DOI:10.1145/3468267.3470614
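As an illustration, the relevant settings in a WarpX inputs file might look like the fragment below; the parameter names and values here are assumptions and should be checked against the parameters documentation linked above:

```
# Hypothetical WarpX inputs fragment for dynamic load balancing.
# Smaller blocks give the balancer more pieces to redistribute.
amr.max_grid_size = 128                     # several blocks per GPU
algo.load_balance_intervals = 100           # rebalance every 100 steps
algo.load_balance_costs_update = heuristic  # or measured timers
```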
Is your original issue addressed? We could continue in new issues if this is all set now :)
@WeiqunZhang just checking, are you working on a related AMReX PR that we should link & track? :)
@denisbertini awesome 🤩 moved to #3994 to keep things organized :)
@denisbertini Could you please give AMReX-Codes/amrex#3382 a try and let us know if it works for you?