Comments (14)

hfp commented on August 20, 2024

BTW, @hfp any libxsmm for ARM to be included in CP2K?

I will work on it. I have a few PRs pending for LIBXSMM; ideally, this should happen ASAP.

oschuett commented on August 20, 2024

I have a branch where I experimented with offloading DBCSR calls to DBM (see cp_dbcsr_multiplication.F). ... From my tests, this seems to be fairly affordable, but certainly not ideal.

That's super interesting! I didn't think an incremental migration would be feasible. I'll look into this.

Note also that DBCSR still has more features than DBM, so complex matrices, or multiplications involving sub-matrices, are still done in DBCSR.

Sub-matrix support should be fairly easy to add, and complex matrices are only used by CP2K in ~3 places, which can be refactored.

hfp commented on August 20, 2024

Hi Augustin,

I am interested to see if the OpenCL-based acceleration in DBCSR can be of use. For some access/dev-time on Alps, perhaps you can help me get this permitted (via private message/email). In the past (Daint), OpenCL was not well supported because the GPU mode was set to "exclusive" (nvidia-smi), and the ominous environment variable CRAY_CUDA_MPS did not cover toggling the mode. I think it would be good to have this set up better for the upcoming Alps. Regarding OpenCL, it's worth a shot: I can tune kernels, although the OpenCL backend also permits untuned usage (reasonable default kernel parameters). I would also try/tune the new OpenCL support in DBM and bring up the recipe in CP2K to make this more accessible.
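For anyone who wants to try this, here is a minimal build sketch, assuming DBCSR's CMake option USE_ACCEL accepts opencl on the target system and that an OpenCL ICD loader is discoverable (paths and the LIBXSMM note are assumptions to verify):

    # Minimal sketch: build DBCSR with its OpenCL backend.
    # Assumes a working MPI toolchain, LIBXSMM available for the backend,
    # and a discoverable OpenCL ICD loader (libOpenCL.so).
    git clone --recursive https://github.com/cp2k/dbcsr.git
    cd dbcsr && mkdir build && cd build
    cmake -DUSE_ACCEL=opencl ..
    make -j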

Pretty much all keywords in the &GLOBAL%DBCSR input section of CP2K: no noticeable difference

Same experience. Bumping the number of MMs per stack can help a bit, but it can also induce imbalance due to unfavorable remainder work.

Mapping all DBCSR calls to DBM: it helps for this benchmark, but it is still slower than DBCSR on CPUs. Additionally, it slows down the benchmarks/QS/H2O-XXX.inp tests.

Can you elaborate on how to achieve this (other than for work going through TAS/DBM directly)? Perhaps this could become a more regular choice rather than requiring code changes.

Mapping all DBCSR calls to DBM: it helps for this benchmark, but it is still slower than DBCSR on CPUs.

This is entirely possible with contemporary higher-end CPUs. My experience is that if the system contains multiple GPUs anyway, one can harvest them "for free" and get beyond a contemporary high-end CPU in the same system. If the CPU was deliberately chosen to be weaker (due to an emphasis on the GPU), the picture can turn in favor of the GPU(s). This is of course more pronounced if the workload has a high portion of DBT/DBM; otherwise it's an uphill battle against Amdahl's law.

Tuned new DBCSR kernels for the H100 GPU architecture. I am currently using kernels for A100. There was no noticeable difference.

ACK. You can at least compile the A100 kernels with the compute capability corresponding to H100. In any case, I would not expect a big impact. Also, consider contributing your tuned parameters.
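For illustration, a hedged sketch of that suggestion with DBCSR's CMake build; whether CMAKE_CUDA_ARCHITECTURES takes precedence over the default architecture implied by WITH_GPU is an assumption to verify:

    # Sketch: reuse the tuned A100 kernel parameters (WITH_GPU=A100) while
    # compiling for H100's compute capability (sm_90). The flag interplay
    # is an assumption; adjust to the actual build-system behavior.
    cmake -DUSE_ACCEL=cuda -DWITH_GPU=A100 \
          -DCMAKE_CUDA_ARCHITECTURES=90 ..
    make -j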

One possible way to address this would be the possibility of disabling DBCSR acceleration at run time, given a keyword in the input file.

That would be welcome.

hfp commented on August 20, 2024

With GPU acceleration enabled, the time spent in DBCSR is increased by more than 15x. Profiling revealed that MPI communication is the main culprit.

I saw this for CP2K/DBM recently as well: one of the MPI-enabled functions appeared high in the profile (even intra-node) in one of our labs, but not in the other (same CPU kind). I blamed this on Fortran's ALLOCATE being much slower, due to the compiler or, more likely, the OS flavor. One resolution was to LD_PRELOAD an alternative, more scalable malloc implementation, e.g., TBB's malloc proxy. Btw, I have not found time to fix this particular issue at the code level, let alone to upstream a change (my plan was to take a look at OpenMP's memory allocation, as this is an established programming model in CP2K).
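For reference, a minimal sketch of the LD_PRELOAD approach; the library path, launcher, and input file are illustrative and depend on the installation:

    # Swap in TBB's scalable allocator at run time via its malloc proxy.
    # Adjust the path to where libtbbmalloc_proxy lives in your TBB install.
    export LD_PRELOAD=/opt/tbb/lib/libtbbmalloc_proxy.so.2
    srun cp2k.psmp input.inp    # launcher and input file are illustrative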

abussy commented on August 20, 2024

Hi Hans, thanks a lot for all these insights!

I tried building DBCSR with OpenCL, but it seems CUDA does not provide OpenCL on aarch64 at the moment (e.g. here). If you happen to know a way around it, I'd be happy to try.

I have a branch where I experimented with offloading DBCSR calls to DBM (see cp_dbcsr_multiplication.F). As things stand, it is not ideal because each dbcsr_multiply call involves a copy of the DBCSR matrix to a DBM one. From my tests, this seems to be fairly affordable, but certainly not ideal. Feel free to try it. Note also that DBCSR still has more features than DBM, so complex matrices, or multiplications involving sub-matrices, are still done in DBCSR.

I've tuned the H100 kernels based on the A100 options. However, the A100 parameters are still much more complete, as they also include predicted kernels. I have not been able to run the predictive framework, I think because of filesystem limitations. So at this point, the A100 kernels are still better.

I'll see if I can try your malloc solution, that's an interesting one!

alazzaro commented on August 20, 2024

Update: @abussy shared (in private) the CP2K logs with me, and I took a quick look at them.
The drop in performance is due to a corner case of the test where the stack size is too small (52 on average!) and the blocks are large (a lot of single computations). It is nothing related to the GPU kernels themselves; basically, the library is not meant for such cases... I suggested some options; otherwise, I think the CPU switch flag can be a good idea...

BTW, @hfp any libxsmm for ARM to be included in CP2K?

abussy commented on August 20, 2024

I'll continue the discussion I had with @alazzaro here, so that everybody who is interested can follow.

I was asked to test running with export DBCSR_MULTREC_LIMIT=1048576 and/or with a single OMP thread. Here is what I get from this experiment: simply setting export DBCSR_MULTREC_LIMIT=1048576 does nothing for the timings. However, when running with a single OMP thread, the CPU and GPU versions of DBCSR yield very similar timings on 1 node:

                            Total      dbcsr_multiply_generic
    with -D__DBCSR_ACC      1078.298    77.591
    without -D__DBCSR_ACC   1048.504    39.448

Going from 1 thread to 8 makes the dbcsr_multiply calls ~4x more expensive. The overall single-threaded timings are nonetheless slower, because other parts of the code then do not benefit from OMP. On multiple nodes, the CPU version scales slightly better.

I am not sure that running with many MPI ranks and a small number of OMP threads is always a good solution, though. There are 72 cores per GPU on GH200, and oversubscribing the GPU too much can also be detrimental. Furthermore, if we go to multiple nodes, we might run into scaling issues due to the large number of ranks.
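For completeness, the single-thread experiment above as a sketch; the launcher, binary, and input names are illustrative, while the environment variable values are the ones from this discussion:

    # Experiment from above: raise the multrec limit and restrict DBCSR
    # to a single OMP thread. Launcher/binary/input are illustrative.
    export DBCSR_MULTREC_LIMIT=1048576
    export OMP_NUM_THREADS=1
    srun cp2k.psmp benchmark.inp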

@hfp I also tried TBB's malloc proxy. I only got marginal gains for this benchmark though.

abussy commented on August 20, 2024

This case can be solved by setting the environment variable DBCSR_N_STACKS=0. Then, the GPU-accelerated version of DBCSR behaves normally again (negligible timings). Note that this issue also triggered PR #801.
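In shell form, the workaround looks like this; the launcher and input names are illustrative:

    # Workaround from this thread: DBCSR_N_STACKS=0 avoids the small-stack
    # corner case in the GPU-accelerated build (see PR #801 for context).
    export DBCSR_N_STACKS=0
    srun cp2k.psmp benchmark.inp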

hfp commented on August 20, 2024

I tried building DBCSR with OpenCL, but it seems CUDA does not provide OpenCL on aarch64 at the moment (e.g. here). If you happen to know a way around it, I'd be happy to try.

On x86, NVIDIA's implementation of OpenCL is simply part of every CUDA installation (which in turn can be part of an NVHPC installation). However, I had an issue like yours on a Jetson AGX system (aarch64 as well) quite some time ago. It is an embedded system with a customized OS; my solution at that time was upgrading it to stock Ubuntu. Of course, that's not a solution in your case. I think it would be useful to get Alps set up with OpenCL (support request). For the time being, can you check whether the CUDA installation simply carries OpenCL? Something like which nvcc gets you to the installation path, and then find /path/to/cuda -type f -name libOpenCL.so*.
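As shell commands, the suggested check could look like this (the CUDA path is a placeholder, as above):

    # Locate the CUDA installation, then check whether it ships the
    # OpenCL ICD loader library.
    which nvcc
    find /path/to/cuda -type f -name 'libOpenCL.so*'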

abussy commented on August 20, 2024

I can confirm to you that OpenCL is not distributed with CUDA on Alps. I'll get the word out, and we'll see if somebody comes up with something.

abussy commented on August 20, 2024

PR #801 solves this issue. While this is not an automatic fix, it allows the user to run efficiently when encountering this issue (by setting an environment variable).

alazzaro commented on August 20, 2024

Let's keep it open for future improvements...

Schroedingers0216 commented on August 20, 2024

To measure the execution time of the dbcsr_multiply_generic module in CP2K, what settings do I need to configure?

alazzaro commented on August 20, 2024

To measure the execution time of the dbcsr_multiply_generic module in CP2K, what settings do I need to configure?

Just look at the CP2K output timings and search for dbcsr_multiply_generic, e.g.:

 dbcsr_multiply_generic            2286 12.5    0.133    0.133   26.843   26.896

The last two columns are the inclusive time (the average across ranks and the maximum over all ranks).
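A quick way to pull that row out of an output file (the file name is illustrative):

    # Extract the dbcsr_multiply_generic line from CP2K's final timing report.
    grep dbcsr_multiply_generic cp2k.out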
