Comments (25)

kabicm avatar kabicm commented on July 29, 2024 2

Simon managed to reproduce this error within COSMA, we are working on it!

from cosma.

oschuett avatar oschuett commented on July 29, 2024 1

It seems the crashes happen because cudaMemcpy2DAsync is called with invalid arguments.

I added a print statement at tiled_mm.cpp:96 and then ran QS/regtest-sos-mp2-lr/H2O-sos-mp2-lr.inp:

dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 664 spitch: 664 width: 664 height: 83
dpitch: 664 spitch: 664 width: 664 height: 83
dpitch: 664 spitch: 664 width: 664 height: 77  <-- each line appears twice because test ran with two mpi ranks
dpitch: 664 spitch: 664 width: 664 height: 77

Looking at the docs, it seems there exist multiple ways to upset cudaMemcpy2DAsync.
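One of the documented constraints is that width must not exceed either dpitch or spitch (all three in bytes). Below is a minimal sketch of that single check, applied to lines in the format printed above. The helper names `parse_line` and `check_memcpy2d_args` are made up for illustration and are not part of Tiled-MM or CUDA; pointer validity and the copy kind are not covered.

```python
import re


def parse_line(line):
    """Turn 'dpitch: 184 spitch: 184 width: 184 height: 23' into a dict of ints."""
    return {name: int(value) for name, value in re.findall(r"(\w+): (\d+)", line)}


def check_memcpy2d_args(dpitch, spitch, width, height):
    """Return a list of violated size constraints (hypothetical helper).

    dpitch, spitch and width are in bytes; height is a row count and is
    accepted here only so the signature mirrors the printed fields.
    """
    problems = []
    if width > dpitch:
        problems.append("width exceeds dpitch")
    if width > spitch:
        problems.append("width exceeds spitch")
    return problems


if __name__ == "__main__":
    args = parse_line("dpitch: 184 spitch: 184 width: 184 height: 23")
    print(check_memcpy2d_args(**args))
```

For the values in the output above, width equals both pitches, so this particular check passes and would not by itself explain the crash.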

oschuett avatar oschuett commented on July 29, 2024 1

Did you link cp2k to cosma_prefixed_pxgemm library:

You can get the linker line from the regtest report:

LIBS        = -lsirius -lcusolver -lspla -lspfft -lsymspg -lhdf5 -lhdf5_hl -lz -lgsl -lelpa_openmp -lcosma_prefixed_pxgemm -lcosma -lcosta -lTiled-MM -lscalapack -lxsmmf -lxsmm -ldl -lpthread -lxcf03 -lxc -lint2 -lfftw3_mpi -lfftw3 -lfftw3_omp  -lmpifort -lmpicxx -lmpi  -lopenblas -lvori -lstdc++ -lstdc++ -lcudart -lnvrtc -lcuda -lcufft -lcublas -lrt 

kabicm avatar kabicm commented on July 29, 2024

Hi Frederick,

Unfortunately, it seems I can't access the cscs infrastructure anymore.

Since this is not using NCCL or gpu-aware MPI, this part should not have changed since the last working version, so I am really puzzled by this.

Maybe @teonnik or @simonpintarelli could have a look?

kabicm avatar kabicm commented on July 29, 2024

As @simonpintarelli also suggested, let's make sure it doesn't run out of GPU memory by setting:

export COSMA_GPU_MAX_TILE_M=2000
export COSMA_GPU_MAX_TILE_N=2000
export COSMA_GPU_MAX_TILE_K=2000

By default these values are 5k, so you can try reducing them.

However, the gpu memory footprint has not changed since the last version, so this should not be a problem.
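To illustrate why lowering these limits shrinks the GPU footprint, here is a rough sketch. It assumes the variables are read with a fallback of 5000 (the "5k" mentioned above) and that the device holds at least one double-precision tile each of A, B and C. Both `tile_limit` and `tile_buffer_bytes` are hypothetical helpers, not COSMA or Tiled-MM code, and the real library may allocate more (for example, one set of tiles per stream).

```python
import os

DEFAULT_TILE = 5000  # default quoted in this thread ("5k")


def tile_limit(dim, env=os.environ):
    """Read COSMA_GPU_MAX_TILE_<dim> (dim is 'M', 'N' or 'K'), else the default."""
    raw = env.get(f"COSMA_GPU_MAX_TILE_{dim}")
    return int(raw) if raw else DEFAULT_TILE


def tile_buffer_bytes(m, n, k, elem_size=8):
    """Bytes for one tile each of A (m x k), B (k x n) and C (m x n).

    Assumption: a lower bound for illustration only, with 8-byte elements.
    """
    return elem_size * (m * k + k * n + m * n)


if __name__ == "__main__":
    m, n, k = (tile_limit(d) for d in "MNK")
    print(m, n, k, tile_buffer_bytes(m, n, k))
```

Under these assumptions, 5000-sized tiles need about 600 MB per set and 2000-sized tiles about 96 MB.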

fstein93 avatar fstein93 commented on July 29, 2024

Well, it also fails the regtests for which the matrix dimensions should be much smaller than 2000. For a few tests, k=0 or a process might not have any local data depending on the distribution. Can that cause these issues on GPU only?

simonpintarelli avatar simonpintarelli commented on July 29, 2024

I can't reproduce the bug using the miniapps (test.pdgemm, test.multiply).
@fstein93 Do you know what the matrix sizes in the cp2k regtest are?

fstein93 avatar fstein93 commented on July 29, 2024

I am not familiar with all of them. I can provide more details in the following cases:

  1. QS/regtest-ri-rpa, it is n=m=83, k=76 (H2O), n=m=14, k=0 (!) or k=22 (H) and n=m=97, k=78 or k=104 (CH3).
  2. I will do some checks tomorrow with the tests lr because here, the sizes of n=m depend on the numerics.
  3. QS/regtest-gw/G0W0_H2O_PBE_periodic.inp, it is probably n=m=83, k=148.
  4. LIBTEST/test_cp_fm_gemm_01.inp check the input file, and the source code.

In general, only the GPU versions are affected, not the CPU version. The failing tests are mostly the same but not all of them fail everywhere, for instance QS/regtest-ri-rpa/RI_RPA_CH3.inp does fail on Daint but not on CUDA Pascal.

I hope that provides already a few hints.

fstein93 avatar fstein93 commented on July 29, 2024

Meanwhile, there are some more results for larger benchmarks on Daint on GPU (see here). The RPA benchmark is a larger version of the QS/regtest-ri-rpa test set with n=m=4352 and k=196,608. Similar matrix-matrix multiplies occur within the MP2 code where the respective regtests run smoothly (without lr).

kabicm avatar kabicm commented on July 29, 2024

Thank you Frederick for more details and thanks Simon for chiming in!

@fstein93 regarding your questions above:

  • having k=0 in test cases is not a problem! cosma_pxgemm is well tested for those cases.
  • it is also not a problem if not all the ranks own data! COSMA will in fact reduce the number of ranks further if the problem size is too small. Also, in most of the RPA cases, the matrix C is only distributed to a few ranks.
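For reference, with k=0 the standard GEMM definition C = alpha*A*B + beta*C has an empty inner sum, so the result reduces to beta*C. The sketch below is a plain reference implementation on nested lists to show that edge case; it is not how cosma_pxgemm is implemented.

```python
def gemm(alpha, A, B, beta, C):
    """Reference C = alpha*A*B + beta*C; A is m x k, B is k x n, C is m x n."""
    m = len(C)
    n = len(C[0]) if m else 0
    k = len(B)  # number of rows of B; 0 means the product term vanishes
    return [
        [
            alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]


if __name__ == "__main__":
    A = [[], []]  # 2 x 0
    B = []        # 0 x 2
    C = [[2.0, 4.0], [6.0, 8.0]]
    print(gemm(1.0, A, B, 0.5, C))  # only the beta*C term remains
```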

Simon has just tried the test cases you mentioned on Piz Daint P100 and couldn't reproduce the error. To make sure that we have the same arguments, it would be really helpful if you could:

Then, Simon could rerun it using the miniapp on daint. Would it be possible?

kabicm avatar kabicm commented on July 29, 2024

Thanks @oschuett for debugging it!

Would it be possible to uncomment those 4 lines from this comment and rerun it? Then we would have all the pdgemm parameters and could run this in isolation.

oschuett avatar oschuett commented on July 29, 2024

Voilà: H2O-sos-mp2-lr.txt

kabicm avatar kabicm commented on July 29, 2024

@oschuett Thanks Ole for the output! In the latest commit I have now added the test cases from your output with exactly the same parameters, so that Simon can run them in isolation.

However, a few things from your file caught my attention:

  1. It seems the error happens within cholesky decomposition?
  2. Did you link cp2k to cosma_prefixed_pxgemm library:
    add_library(cosma_prefixed_pxgemm scalapack.cpp
    or to cosma_pxgemm library:
    add_library(cosma_pxgemm scalapack.cpp

The difference is that cosma_prefixed_pxgemm only implements scalapack routines with the "cosma_" prefix, i.e. cosma_pdgemm, cosma_psgemm and the complex versions. On the other hand, cosma_pxgemm implements the prefixed versions and also overrides the default scalapack routines.

Since cp2k calls the cosma_pdgemm and cosma_psgemm routines anyway, I think you should link to cosma_prefixed_pxgemm instead of cosma_pxgemm. This way, COSMA will not be used in cholesky.

fstein93 avatar fstein93 commented on July 29, 2024

All errors occur outside of Cholesky decompositions. In some cases (like lr), a Cholesky decomposition was carried out in advance, whereas in other cases (like RPA), it follows a Cholesky decomposition. The library test does not perform any kind of Cholesky decomposition. Interestingly, the other library tests for PDGEMM do not fail (see here).

kabicm avatar kabicm commented on July 29, 2024

Thanks @fstein93 for clarifications! It seems I misunderstood the output then.

Hope Simon will be able to reproduce it by running the newly added tests.

Btw, do we know if export CUDA_LAUNCH_BLOCKING=1 resolves the issue?

kabicm avatar kabicm commented on July 29, 2024

@oschuett just a quick question: after you added those print statements, what is in your line:
at /home/ole/git/cp2k/tools/toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:475

I want to see if the error occurred within: round_robin or within round_robin_without_copy_c

kabicm avatar kabicm commented on July 29, 2024

@oschuett @fstein93 are we sure the same tests were passing with the previous COSMA version, or are these tests new?

fstein93 avatar fstein93 commented on July 29, 2024

@kabicm the tests passed with the previous version. There is only one which I added recently.

kabicm avatar kabicm commented on July 29, 2024

@oschuett @fstein93 the latest master now passes the failing tests from cp2k. Can you try the latest master, or do I have to make a new release so that you can test it?

fstein93 avatar fstein93 commented on July 29, 2024

In general, we use only official releases of all libraries to ensure properly working libraries for the users. That is also how we proceed with DBCSR. Anyways, the fix is probably also relevant for your user base.

oschuett avatar oschuett commented on July 29, 2024

You can open a draft pull request in which you have install_cosma.sh use your master branch. Then we can trigger the CI tests.

kabicm avatar kabicm commented on July 29, 2024

We would surely make a new release once we are sure this fixes the failing tests.

kabicm avatar kabicm commented on July 29, 2024

It seems the tests are now passing, at least on Pascal. So, I guess we can make a new release now. I will just make a few smaller CMake modifications and then release.

kabicm avatar kabicm commented on July 29, 2024

The new version COSMA-v2.6.1 is now released. Let us know if there are any issues!

kabicm avatar kabicm commented on July 29, 2024

I will close this issue now. Feel free to reopen it if there are any problems with the new version COSMA-v2.6.1.
