Dear COSMA developers, I am one of the CP2K developers and have rece

Crashes with the latest COSMA release (eth-cscs/cosma)

kabicm commented on July 29, 2024

Simon managed to reproduce this error within COSMA, we are working on it!

from cosma.

oschuett commented on July 29, 2024

It seems the crashes happen because cudaMemcpy2DAsync is called with invalid arguments.

I added a print statement at tiled_mm.cpp:96 and then ran QS/regtest-sos-mp2-lr/H2O-sos-mp2-lr.inp:

dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 664 spitch: 664 width: 664 height: 83
dpitch: 664 spitch: 664 width: 664 height: 83
dpitch: 664 spitch: 664 width: 664 height: 77  <-- each line appears twice because test ran with two mpi ranks
dpitch: 664 spitch: 664 width: 664 height: 77

Looking at the docs, it seems there are multiple ways to upset cudaMemcpy2DAsync.
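To make the documented constraints concrete, here is a minimal sketch of a pre-flight check on the pitch and width arguments. The CUDA runtime documentation states that width must not exceed either dpitch or spitch, and that the call returns an error if a pitch is larger than the maximum allowed. The helper name check_memcpy2d_args and the max_pitch parameter (standing in for the device's cudaDeviceProp::memPitch) are assumptions for illustration; nothing like this is known to exist in COSMA or Tiled-MM, and the check does not cover invalid pointers or a wrong copy direction.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of COSMA or Tiled-MM.
// Checks only the documented pitch/width preconditions of cudaMemcpy2DAsync.
// All sizes are in bytes, except height, which is a number of rows.
// max_pitch is an assumed stand-in for cudaDeviceProp::memPitch.
std::string check_memcpy2d_args(std::size_t dpitch, std::size_t spitch,
                                std::size_t width, std::size_t height,
                                std::size_t max_pitch) {
    (void)height;  // height does not enter the pitch checks below
    if (width > dpitch) return "width exceeds dpitch";
    if (width > spitch) return "width exceeds spitch";
    if (dpitch > max_pitch) return "dpitch exceeds device limit";
    if (spitch > max_pitch) return "spitch exceeds device limit";
    return "ok";
}
```

The values printed above (for example dpitch = spitch = width = 184 bytes, i.e. 23 doubles per row) pass these particular checks, so under this sketch the pitch and width alone would not explain the failure.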

oschuett commented on July 29, 2024

Quoting @kabicm: "Did you link cp2k to cosma_prefixed_pxgemm library:"

You can get the linker line from the regtest report:

LIBS        = -lsirius -lcusolver -lspla -lspfft -lsymspg -lhdf5 -lhdf5_hl -lz -lgsl -lelpa_openmp -lcosma_prefixed_pxgemm -lcosma -lcosta -lTiled-MM -lscalapack -lxsmmf -lxsmm -ldl -lpthread -lxcf03 -lxc -lint2 -lfftw3_mpi -lfftw3 -lfftw3_omp  -lmpifort -lmpicxx -lmpi  -lopenblas -lvori -lstdc++ -lstdc++ -lcudart -lnvrtc -lcuda -lcufft -lcublas -lrt

kabicm commented on July 29, 2024

Hi Frederick,

Unfortunately, it seems I can't access the cscs infrastructure anymore.

Since this is not using NCCL or gpu-aware MPI, this part should not have changed since the last working version, so I am really puzzled by this.

Maybe @teonnik or @simonpintarelli could have a look?

kabicm commented on July 29, 2024

As @simonpintarelli also suggested, let's make sure it doesn't run out of GPU memory by setting:

export COSMA_GPU_MAX_TILE_M=2000
export COSMA_GPU_MAX_TILE_N=2000
export COSMA_GPU_MAX_TILE_K=2000

By default these values are 5k, so you can try reducing them.

However, the gpu memory footprint has not changed since the last version, so this should not be a problem.
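For readers unfamiliar with this kind of tuning knob, the following is a minimal sketch of how a tile-size limit could be read from the environment with a fallback of 5000. The function tile_size_from_env and its parsing rules are assumptions for illustration only; how COSMA actually parses these variables is not shown in this thread.

```cpp
#include <cstdlib>

// Hypothetical helper, not COSMA's actual code: reads a positive integer
// from the environment variable `name` and returns `default_value` when the
// variable is unset, not a clean decimal number, or out of a sane range.
int tile_size_from_env(const char* name, int default_value) {
    const char* raw = std::getenv(name);
    if (raw == nullptr) return default_value;
    char* end = nullptr;
    long value = std::strtol(raw, &end, 10);
    // Reject empty input, trailing garbage, and non-positive or huge values.
    if (end == raw || *end != '\0' || value <= 0 || value > 1000000) {
        return default_value;
    }
    return static_cast<int>(value);
}
```

With the exports above, tile_size_from_env("COSMA_GPU_MAX_TILE_M", 5000) would return 2000 in this sketch, and 5000 if the variable were not set.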

fstein93 commented on July 29, 2024

Well, it also fails the regtests for which the matrix dimensions should be much smaller than 2000. For a few tests, k=0 or a process might not have any local data depending on the distribution. Can that cause these issues on GPU only?

simonpintarelli commented on July 29, 2024

I can't reproduce the bug using the miniapps (test.pdgemm, test.multiply).
@fstein93 Do you know what the matrix sizes in the cp2k regtest are?

fstein93 commented on July 29, 2024

I am not familiar with all of them. I can provide more details in the following cases:

- QS/regtest-ri-rpa: n=m=83, k=76 (H2O); n=m=14, k=0 (!) or k=22 (H); and n=m=97, k=78 or k=104 (CH3).
- lr tests: I will do some checks tomorrow, because here the sizes of n=m depend on the numerics.
- QS/regtest-gw/G0W0_H2O_PBE_periodic.inp: probably n=m=83, k=148.
- LIBTEST/test_cp_fm_gemm_01.inp: check the input file and the source code.

In general, only the GPU versions are affected, not the CPU version. The failing tests are mostly the same but not all of them fail everywhere, for instance QS/regtest-ri-rpa/RI_RPA_CH3.inp does fail on Daint but not on CUDA Pascal.

I hope that already provides a few hints.

fstein93 commented on July 29, 2024

Meanwhile, there are some more results for larger benchmarks on Daint on GPU (see here). The RPA benchmark is a larger version of the QS/regtest-ri-rpa test set with n=m=4352 and k=196,608. Similar matrix-matrix multiplies occur within the MP2 code where the respective regtests run smoothly (without lr).

kabicm commented on July 29, 2024

Thank you Frederick for more details and thanks Simon for chiming in!

@fstein93 regarding your questions above:

- Having k=0 in test cases is not a problem! cosma_pxgemm is well tested for those cases.
- If not all the ranks own data, that is also not a problem! COSMA will in fact reduce the number of ranks further if the problem size is too small. Also, in most of the RPA cases, the matrix C is only distributed to a few ranks.
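As background on why k=0 is a well-defined case at all, here is a naive, single-process reference GEMM in the BLAS convention C = alpha * A * B + beta * C. It is only a sketch for illustration (the name reference_gemm is made up, and this is not how COSMA or Tiled-MM compute anything): with k=0 the inner loop never runs, so the result is simply beta * C.

```cpp
#include <vector>

// Naive column-major reference GEMM, for illustration only:
//   C = alpha * A * B + beta * C
// A is m x k, B is k x n, C is m x n, all stored without padding.
void reference_gemm(int m, int n, int k, double alpha,
                    const std::vector<double>& a,
                    const std::vector<double>& b,
                    double beta, std::vector<double>& c) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double sum = 0.0;
            // For k == 0 this loop body never executes and sum stays 0.
            for (int p = 0; p < k; ++p) {
                sum += a[i + p * m] * b[p + j * k];
            }
            c[i + j * m] = alpha * sum + beta * c[i + j * m];
        }
    }
}
```

A distributed GPU implementation still has to make sure that no zero-sized buffers or copies are issued in this case, which is the concern raised in the question.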

Simon has just tried the test cases you mentioned on Piz Daint P100 and couldn't reproduce the error. To make sure that we have the same arguments, it would be really helpful if you could:

- uncomment these lines in COSMA/src/cosma/cosma_pxgemm.cpp (commit fe6eb59):
  - Line 111: #ifdef DEBUG
  - Line 146: #endif
  - Line 192: #ifdef DEBUG
  - Line 198: #endif
- rerun one of the problematic RPA tests. When those lines are uncommented, all the parameters of each pdgemm call will be written to the output.
- send us the output file

Then, Simon could rerun it using the miniapp on daint. Would it be possible?

kabicm commented on July 29, 2024

Thanks @oschuett for debugging it!

Would it be possible to uncomment those 4 lines from this comment and rerun it? Then we would have all the pdgemm parameters and could run this in isolation.

oschuett commented on July 29, 2024

Voilà: H2O-sos-mp2-lr.txt

kabicm commented on July 29, 2024

@oschuett Thanks Ole for the output! In the latest commit I have now added the test cases from your output with exactly the same parameters, so that Simon can run them in isolation.

However, a few things from your file caught my attention:

- It seems the error happens within the Cholesky decomposition?
- Did you link cp2k to the cosma_prefixed_pxgemm library (COSMA/src/cosma/CMakeLists.txt, line 102 in b5ba79e):

  add_library(cosma_prefixed_pxgemm scalapack.cpp

  or to the cosma_pxgemm library (COSMA/src/cosma/CMakeLists.txt, line 79 in b5ba79e):

  add_library(cosma_pxgemm scalapack.cpp

The difference is that cosma_prefixed_pxgemm only implements scalapack routines with the "cosma_" prefix, i.e. cosma_pdgemm, cosma_psgemm and the complex versions. On the other hand cosma_pxgemm implements both the prefixed versions + overwrites default scalapack routines.

Since cp2k calls the cosma_pdgemm and cosma_psgemm routines anyway, I think you should link to cosma_prefixed_pxgemm instead of cosma_pxgemm. This way, COSMA will not be used in the Cholesky decomposition.

fstein93 commented on July 29, 2024

All errors occur outside of Cholesky decompositions. In some cases (like lr), a Cholesky decomposition was carried out in advance, whereas in other cases (like RPA), it follows a Cholesky decomposition. The library test does not perform any kind of Cholesky decomposition. Interestingly, the other library tests for PDGEMM do not fail (see here).

kabicm commented on July 29, 2024

Thanks @fstein93 for clarifications! It seems I misunderstood the output then.

Hope Simon will be able to reproduce it by running the newly added tests.

Btw, do we know if export CUDA_LAUNCH_BLOCKING=1 resolves the issue?

kabicm commented on July 29, 2024

@oschuett just a quick question: after you added those print statements, what is in your line:
at /home/ole/git/cp2k/tools/toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:475

I want to see if the error occurred within: round_robin or within round_robin_without_copy_c

kabicm commented on July 29, 2024

@oschuett @fstein93 are we sure the same tests were passing with the previous COSMA version, or are these tests new?

fstein93 commented on July 29, 2024

@kabicm the tests passed with the previous version. There is only one which I added recently.

kabicm commented on July 29, 2024

@oschuett @fstein93 the latest master now passes the failing tests from cp2k. Can you try the latest master, or do I have to make a new release so that you can test it?

fstein93 commented on July 29, 2024

In general, we use only official releases of all libraries to ensure properly working libraries for the users. That is also how we proceed with DBCSR. Anyways, the fix is probably also relevant for your user base.

oschuett commented on July 29, 2024

You can open a draft pull request in which you have install_cosma.sh use your master branch. Then we can trigger the CI tests.

kabicm commented on July 29, 2024

We would surely make a new release once we are sure this fixes the failing tests.

kabicm commented on July 29, 2024

It seems the tests are now passing, at least on Pascal. So, I guess we can make a new release now. I will just make a few smaller CMake modifications and then release.

kabicm commented on July 29, 2024

The new version COSMA-v2.6.1 is now released. Let us know if there are any issues!

kabicm commented on July 29, 2024

I will close this issue now. Feel free to reopen it if there are any problems with the new version COSMA-v2.6.1.
