
Discussion on DBCSR (dbcsr) - 43 comments, closed

zhl201226 commented on August 20, 2024

Discussion on DBCSR

Comments (43)

hfp commented on August 20, 2024

If possible, can you share the input file and perhaps the profile output from the run? The profile output contains the timings printed by CP2K at the end. What is already clear is that this is not only about DBCSR but also about CP2K's GRID components (collocate/integrate), perhaps even some PW, etc.

Regarding, "H2D -> LaunchKernel -> D2H" - this is idealized assuming only a single transfer/array is the input of such kernel and in turn for the output/result as well.
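To illustrate what that idealized sequence looks like in an API trace, here is a minimal, self-contained HIP sketch (the scale kernel, array size, and launch configuration are made up for illustration; this is not DBCSR code):

  // Idealized single-buffer flow: one H2D copy, one kernel launch, one D2H copy.
  // Hypothetical kernel for illustration only; not taken from DBCSR.
  #include <hip/hip_runtime.h>
  #include <vector>

  __global__ void scale(double* x, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0;
  }

  int main() {
    const int n = 1 << 20;
    std::vector<double> h(n, 1.0);
    double* d = nullptr;
    hipMalloc(&d, n * sizeof(double));
    hipMemcpy(d, h.data(), n * sizeof(double), hipMemcpyHostToDevice);        // H2D
    hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, 0, d, n);  // LaunchKernel
    hipMemcpy(h.data(), d, n * sizeof(double), hipMemcpyDeviceToHost);        // D2H
    hipFree(d);
    return 0;
  }

A trace dominated by H2D entries with few matching kernel launches therefore points to data being staged more often than it is consumed by kernels.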

zhl201226 commented on August 20, 2024

I tried switching the DBCSR backend to other options, and I did not see the large number of H2D transfers in HIPprof. Therefore, I believe DBCSR is causing the issue. It might also be due to the transpose_d kernel. I could not locate the specific code responsible for the numerous H2D transfers. I have attached the test file and output file below. Thank you. @hfp
test.tar.gz

hfp commented on August 20, 2024

For the record, if there are "unnecessary" data transfers, i.e., transfers that could be combined or avoided, this issue applies to all backends as well as all GPUs/vendors. The hint about transposes might be a first step.

@zhl201226 you may try the DBCSR_RUN_ON_GPU=0 environment variable and recapture the GPU profile. This environment variable disables DBCSR on GPUs even if the support is compiled into the application (and leaves CP2K's other uses of GPUs intact).
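A minimal sketch of how this check might be run (the launcher, rank count, binary name, and output file are placeholders, not taken from the thread):

  # Disable DBCSR on the GPU only; CP2K's other GPU components stay active.
  export DBCSR_RUN_ON_GPU=0
  mpirun -np 8 ./cp2k.psmp test.inp > out_dbcsr_on_cpu.log
  # Re-capture the GPU trace with your HIP profiler and compare the number of H2D entries.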

hfp commented on August 20, 2024

Looking at CP2K's profile, local GEMMs (cp_fm_gemm) consume ~25% of the time to solution on this system (just as a note). However, multiply_cannon* and dbcsr_mm_hostdrv_process are interesting. Given that dbcsr_mm_hostdrv_process is relatively high, it seems a reasonable portion of fallbacks is happening. With the previous implementation, the fallbacks may be accompanied by transfers without actually launching a kernel.

zhl201226 commented on August 20, 2024

I have identified that the H2D issue occurs in the dbcsr_mm_accdrv_process module. Does this module divide the data into small chunks for transfer? Can these be merged into larger chunks? Additionally, I previously did not use ACC to accelerate DBCSR, but runs now seem to take longer. Therefore, I am not sure whether DBCSR_RUN_ON_GPU=0 is effective. Could you please provide more optimization suggestions?

hfp commented on August 20, 2024

Sorry, I guess DBCSR_RUN_ON_GPU is only supported in the most recent, if not unreleased, version. This was not meant as an optimization suggestion but rather as a way to systematically rule out or blame DBCSR. Your example input is worth looking at for contributors.

zhl201226 commented on August 20, 2024

How do I contact contributors?
@hfp

hfp commented on August 20, 2024

Just give it some time; they will see this open issue ;-)

zhl201226 commented on August 20, 2024

> Just give it some time; they will see this open issue ;-)

Thank you :-)

hfp commented on August 20, 2024

(Side note: GLOBAL| CPU model name does not show up in the log ;-)

hfp commented on August 20, 2024

Regarding the test input, it is missing the restart file for the SCF initial guess. Commenting it out starts from an unreasonable guess and then fails in the Cholesky decomposition.

zhl201226 commented on August 20, 2024

> (Side note: GLOBAL| CPU model name does not show up in the log ;-)

By the way, using DBCSR_RUN_ON_GPU=0 did not significantly improve performance. The CPU model name has been hidden for other reasons, but I can provide it if needed.

[screenshot attached]

zhl201226 commented on August 20, 2024

> Regarding the test input, it is missing the restart file for the SCF initial guess. Commenting it out starts from an unreasonable guess and then fails in the Cholesky decomposition.

This restart file is too large to upload. Is there another way to send it to you?

hfp commented on August 20, 2024

Hmm, others may have the same request, so Dropbox or something like that comes to mind. My e-mail is my . name @ intel . com.

zhl201226 commented on August 20, 2024

> Hmm, others may have the same request, so Dropbox or something like that comes to mind. My e-mail is my . name @ intel . com.

I have already sent it to you via email. Thank you.

hfp commented on August 20, 2024

> Hmm, others may have the same request, so Dropbox or something like that comes to mind. My e-mail is my . name @ intel . com.
>
> I have already sent it to you via email. Thank you.

(Let's see, the e-mail has not arrived yet; perhaps size restrictions.)

zhl201226 commented on August 20, 2024

> Hmm, others may have the same request, so Dropbox or something like that comes to mind. My e-mail is my . name @ intel . com.
>
> I have already sent it to you via email. Thank you.
>
> (Let's see, the e-mail has not arrived yet; perhaps size restrictions.)

I have resent it to [email protected]. Please check it. Best regards.

hfp commented on August 20, 2024

> I have resent it to [email protected]. Please check it. Best regards.

Literally? I envisioned "my.name" to be my name taken from https://github.com/hfp (hans.pabst). Sorry for the confusion.

zhl201226 commented on August 20, 2024

> I have resent it to [email protected]. Please check it. Best regards.
>
> Literally? I envisioned "my.name" to be my name taken from https://github.com/hfp (hans.pabst). Sorry for the confusion.

Sure, I also sent an email to [email protected], and my email address is [[email protected]].

alazzaro commented on August 20, 2024

> I am writing to seek your assistance. When running CP2K simulations on the AMD MI50 platform with the DBCSR backend set to HIP, the execution time is longer than when using the CPU. Using HIPprof to examine the API calls, I noticed that the results show a large number of H2D (Host-to-Device) transfers but no kernel launches. Normally, the call flow should be H2D -> LaunchKernel -> D2H (Device-to-Host). I would like to understand why there are so many H2D transfers and where in the code this is occurring. Below, I have attached the JSON file for you to open in chrome://tracing.
>
> Thank you.

The important CP2K timers for your execution are the following:

grid_integrate_task_list         340.326
grid_collocate_task_list         377.996
multiply_multrec                 523.278
cp_fm_syevd_base                 637.115
cp_fm_redistribute_end           639.139
dbcsr_mm_hostdrv_process        1229.836
cp_gemm_cosma                   2335.899
CP2K_Total                      8183.616

Now, I would assume you are running COSMA on the GPU, so you cannot gain more there.
Then I see cp_fm_syevd_base; I am not sure if ELPA can give some benefit. The same may be true for https://github.com/eth-cscs/DLA-Future. The grid parts are already running on the GPU.

Concerning DBCSR, the important part is the DBCSR kernel output:

 -------------------------------------------------------------------------------
 -                                                                             -
 -                                DBCSR STATISTICS                             -
 -                                                                             -
 -------------------------------------------------------------------------------
 COUNTER                                    TOTAL       BLAS       SMM       ACC
 flops     1 x     1 x     1                 3610       0.0%    100.0%      0.0%
 flops     1 x     1 x     5                19040       0.0%    100.0%      0.0%
...
 flops total                       537.243062E+12       0.0%     96.5%      3.5%
 flops max/rank                     35.731363E+12       0.0%     96.5%      3.5%
 matmuls inhomo. stacks                         0       0.0%      0.0%      0.0%
 matmuls total                        22844500215       0.0%     98.8%      1.2%
 number of processed stacks               3196393       0.0%     92.7%      7.3%
 average stack size                                     0.0    7614.1    1217.7

Basically, 98.8% of the block multiplications are running on the CPU (SMM column), and only 1.2% are running on the GPU (ACC column). The reason is that your kernels are not present in the list of tuned GPU parameters. There are several ways to improve the situation (in order of preference):

  1. Run the tuning procedure for the parameters you are interested in and contribute them to the current list.
  2. You can try to set export DBCSR_MM_DENSE=1; you should see that the list of kernels changes and possibly more kernels run on the GPU (see the sketch after this list).
  3. Use the latest DBCSR (v2.7.0-rc2), which provides a default GPU kernel when tuned kernels are not available.
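A small sketch of how options 2 and 3 could be tried and verified (the launch line and file names are placeholders; the check is the DBCSR STATISTICS table shown above):

  # Option 2: request dense multiplications so more blocks may hit tuned GPU kernels.
  export DBCSR_MM_DENSE=1
  mpirun -np 8 ./cp2k.psmp test.inp > out_mm_dense.log
  # Verify: in the DBCSR STATISTICS table, the ACC column should grow at the
  # expense of the SMM column if more multiplications ran on the GPU.
  grep -E "flops total|matmuls total" out_mm_dense.log
  # Option 3: rebuild against DBCSR v2.7.0-rc2, which falls back to a default GPU kernel
  # whenever no tuned kernel is available.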

zhl201226 commented on August 20, 2024

> [quoting the timer analysis, DBCSR statistics, and the three suggestions from the previous comment]

I will debug based on your suggestions later, but since the process is relatively long, I will close the question for now. Thank you very much.

hfp commented on August 20, 2024

> 1. Run the [tuning procedure](https://cp2k.github.io/dbcsr/develop/page/3-developer-guide/3-programming/2-accelerator-backend/2-libsmm_acc/index.html) for the parameters you are interested in and contribute them to the current list.
> 2. You can try to set `export DBCSR_MM_DENSE=1`; you should see that the list of kernels changes and possibly more kernels run on the GPU.
> 3. Use the latest DBCSR (v2.7.0-rc2), which provides a default GPU kernel when tuned kernels are not available.

I am sure the OpenCL backend can be mixed with HIP as well (just like with CUDA). However, I have not spent any time exercising this. It comes down to support in the build system on CP2K's side. In any case, I will keep HIP in mind when taking on this task (it is still open for me to get DBM/DBT and DBCSR based on OpenCL into CP2K's CMake).
