Comments (25)
Simon managed to reproduce this error within COSMA; we are working on it!
from cosma.
It seems the crashes happen because cudaMemcpy2DAsync is called with invalid arguments.
I added a print statement at tiled_mm.cpp:96 and then ran QS/regtest-sos-mp2-lr/H2O-sos-mp2-lr.inp:
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 664 spitch: 664 width: 664 height: 83
dpitch: 664 spitch: 664 width: 664 height: 83
dpitch: 664 spitch: 664 width: 664 height: 77 <-- each line appears twice because test ran with two mpi ranks
dpitch: 664 spitch: 664 width: 664 height: 77
Looking at the docs, it seems there exist multiple ways to upset cudaMemcpy2DAsync.
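To narrow this down, the documented preconditions can be checked on the host before the call. The following is only a minimal sketch: check_memcpy2d_args is a hypothetical helper, not part of Tiled-MM or the CUDA runtime, and it covers only the rules that can be tested without a device (non-null pointers, and width not exceeding dpitch or spitch). The device-specific maximum pitch, the copy direction and the stream are not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Tiled-MM or CUDA): validates the
// host-checkable preconditions of a 2D copy. All sizes are in bytes,
// except height, which counts rows.
// Returns an empty string if the checks pass, else a short description.
std::string check_memcpy2d_args(const void* dst, std::size_t dpitch,
                                const void* src, std::size_t spitch,
                                std::size_t width, std::size_t height) {
    (void)height;  // the row count is not constrained by the pitch rules
    if (dst == nullptr) return "dst is null";
    if (src == nullptr) return "src is null";
    // The CUDA docs require that width does not exceed either pitch.
    if (width > dpitch) return "width exceeds dpitch";
    if (width > spitch) return "width exceeds spitch";
    // Not checked here: the device's maximum pitch and whether dst/src
    // match the requested copy direction.
    return "";
}
```

For the values printed above (for example dpitch = spitch = width = 184 with height 23, or 664 with height 77), this check passes. If the arguments really are invalid, the cause would therefore have to be something the print statement does not show, such as the pointers or the copy direction.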
from cosma.
Did you link cp2k to cosma_prefixed_pxgemm library:
You can get the linker line from the regtest report:
LIBS = -lsirius -lcusolver -lspla -lspfft -lsymspg -lhdf5 -lhdf5_hl -lz -lgsl -lelpa_openmp -lcosma_prefixed_pxgemm -lcosma -lcosta -lTiled-MM -lscalapack -lxsmmf -lxsmm -ldl -lpthread -lxcf03 -lxc -lint2 -lfftw3_mpi -lfftw3 -lfftw3_omp -lmpifort -lmpicxx -lmpi -lopenblas -lvori -lstdc++ -lstdc++ -lcudart -lnvrtc -lcuda -lcufft -lcublas -lrt
from cosma.
Hi Frederick,
Unfortunately, it seems I can't access the cscs infrastructure anymore.
Since this is not using NCCL or gpu-aware MPI, this part should not have changed since the last working version, so I am really puzzled by this.
Maybe @teonnik or @simonpintarelli could have a look?
from cosma.
As @simonpintarelli also suggested, let's make sure it doesn't get out of gpu memory by setting:
export COSMA_GPU_MAX_TILE_M=2000
export COSMA_GPU_MAX_TILE_N=2000
export COSMA_GPU_MAX_TILE_K=2000
By default these values are 5k, so you can try reducing them.
However, the gpu memory footprint has not changed since the last version, so this should not be a problem.
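As a rough back-of-the-envelope check, the size of one set of A, B and C tiles can be estimated from the tile limits. This is only a sketch under assumptions: tile_set_bytes is a hypothetical function, it assumes double precision and that each tile dimension is simply the smaller of the matrix dimension and the corresponding limit, and it ignores how many such buffer sets (e.g. one per stream) the library actually allocates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical estimate (not COSMA/Tiled-MM code): bytes needed for one
// A tile (tm x tk), one B tile (tk x tn) and one C tile (tm x tn) in
// double precision, assuming tiles are clamped to the given limits.
std::size_t tile_set_bytes(std::size_t m, std::size_t n, std::size_t k,
                           std::size_t max_tile_m, std::size_t max_tile_n,
                           std::size_t max_tile_k) {
    const std::size_t tm = std::min(m, max_tile_m);
    const std::size_t tn = std::min(n, max_tile_n);
    const std::size_t tk = std::min(k, max_tile_k);
    return sizeof(double) * (tm * tk + tk * tn + tm * tn);
}
```

Under these assumptions, a large problem with limits of 5000 needs about 600 MB per tile set, and about 96 MB with limits of 2000. For matrices smaller than the limit in every dimension, lowering the limit would change nothing.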
from cosma.
Well, it also fails the regtests for which the matrix dimensions should be much smaller than 2000. For a few tests, k=0 or a process might not have any local data depending on the distribution. Can that cause these issues on GPU only?
from cosma.
I can't reproduce the bug using the miniapps (test.pdgemm, test.multiply).
@fstein93 Do you know what the matrix sizes in the cp2k regtest are?
from cosma.
I am not familiar with all of them. I can provide more details in the following cases:
- QS/regtest-ri-rpa, it is n=m=83, k=76 (H2O), n=m=14, k=0 (!) or k=22 (H) and n=m=97, k=78 or k=104 (CH3).
- I will do some checks tomorrow with the lr tests because there the sizes n=m depend on the numerics.
- QS/regtest-gw/G0W0_H2O_PBE_periodic.inp, it is probably n=m=83, k=148.
- LIBTEST/test_cp_fm_gemm_01.inp check the input file, and the source code.
In general, only the GPU versions are affected, not the CPU version. The failing tests are mostly the same but not all of them fail everywhere, for instance QS/regtest-ri-rpa/RI_RPA_CH3.inp does fail on Daint but not on CUDA Pascal.
I hope that provides already a few hints.
from cosma.
Meanwhile, there are some more results for larger benchmarks on Daint on GPU (see here). The RPA benchmark is a larger version of the QS/regtest-ri-rpa test set with n=m=4352 and k=196,608. Similar matrix-matrix multiplies occur within the MP2 code where the respective regtests run smoothly (without lr).
from cosma.
Thank you Frederick for more details and thanks Simon for chiming in!
@fstein93 regarding your questions above:
- having k=0 in test cases is not a problem! cosma_pxgemm is well tested for those cases.
- it is also not a problem if not all the ranks own data! COSMA will in fact reduce the number of ranks further if the problem size is too small. Also, in most of the RPA cases, the matrix C is only distributed to a few ranks.
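For reference, the expected semantics of the k=0 case can be written down with a naive gemm: the sum over k is empty, so the result is just beta * C. This is a plain reference loop for illustration, not COSMA's implementation; reference_gemm is a hypothetical name, and column-major storage is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive column-major reference (illustration only, not COSMA code):
// C = alpha * A * B + beta * C, with A (m x k), B (k x n), C (m x n).
void reference_gemm(std::size_t m, std::size_t n, std::size_t k,
                    double alpha, const std::vector<double>& a,
                    const std::vector<double>& b,
                    double beta, std::vector<double>& c) {
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            double acc = 0.0;
            // For k == 0 this loop never runs, so A and B may be empty.
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[i + p * m] * b[p + j * k];
            }
            c[i + j * m] = alpha * acc + beta * c[i + j * m];
        }
    }
}
```

With n=m=14 and k=0, as in the H case mentioned earlier in the thread, calling this with empty A and B leaves every entry of C scaled by beta.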
Simon has just tried the test cases you mentioned on Piz Daint P100 and couldn't reproduce the error. To make sure that we have the same arguments, it would be really helpful if you could:
- uncomment these lines in COSMA/src/cosma/cosma_pxgemm.cpp (commit fe6eb59): lines 111, 146, 192 and 198.
- rerun some of the problematic RPA tests. When those lines are uncommented, all the parameters of each pdgemm call will be written to the output.
- send us the output file.
Then, Simon could rerun it using the miniapp on daint. Would it be possible?
from cosma.
Thanks @oschuett for debugging it!
Would it be possible to uncomment those 4 lines from this comment and rerun it? Then we would have all the pdgemm parameters and could run this in isolation.
from cosma.
Voilà: H2O-sos-mp2-lr.txt
from cosma.
@oschuett Thanks Ole for the output! In the latest commit I have now added the test cases from your output with exactly the same parameters, so that Simon can run them in isolation.
However, a few things from your file caught my attention:
- It seems the error happens within the Cholesky decomposition?
- Did you link cp2k to the cosma_prefixed_pxgemm library (COSMA/src/cosma/CMakeLists.txt, line 102 in b5ba79e) or to the cosma_pxgemm library (COSMA/src/cosma/CMakeLists.txt, line 79 in b5ba79e)?
The difference is that cosma_prefixed_pxgemm only implements scalapack routines with the "cosma_" prefix, i.e. cosma_pdgemm, cosma_psgemm and the complex versions. On the other hand, cosma_pxgemm implements the prefixed versions and also overwrites the default scalapack routines.
Since cp2k calls the cosma_pdgemm and cosma_psgemm routines anyway, I think you should link to cosma_prefixed_pxgemm instead of cosma_pxgemm. This way, COSMA will not be used in Cholesky.
from cosma.
All errors occur outside of Cholesky decompositions. In some cases (like lr), a Cholesky decomposition was carried out in advance, whereas in other cases (like RPA), it follows a Cholesky decomposition. The library test does not perform any kind of Cholesky decomposition. Interestingly, the other library tests for PDGEMM do not fail (see here).
from cosma.
Thanks @fstein93 for clarifications! It seems I misunderstood the output then.
Hope Simon will be able to reproduce it by running the newly added tests.
Btw, do we know if export CUDA_LAUNCH_BLOCKING=1 resolves the issue?
from cosma.
@oschuett just a quick question: after you added those print statements, what is in your line at /home/ole/git/cp2k/tools/toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:475?
I want to see if the error occurred within round_robin or within round_robin_without_copy_c.
from cosma.
@oschuett @fstein93 are we sure the same tests were passing with the previous COSMA version, or are these tests new?
from cosma.
@kabicm the tests passed with the previous version. There is only one which I added recently.
from cosma.
@oschuett @fstein93 the latest master now passes the failing tests from cp2k. Can you try the latest master, or do I have to make a new release so that you can test it?
from cosma.
In general, we use only official releases of all libraries to ensure properly working libraries for the users. That is also how we proceed with DBCSR. Anyways, the fix is probably also relevant for your user base.
from cosma.
You can open a draft pull request in which you have install_cosma.sh use your master branch. Then we can trigger the CI tests.
from cosma.
We would surely make a new release once we are sure this fixes the failing tests.
from cosma.
It seems the tests are now passing, at least on Pascal. So, I guess we can make a new release now. I will just make a few smaller cmake modifications and then release.
from cosma.
The new version COSMA-v2.6.1 is now released. Let us know if there are any issues!
from cosma.
I will close this issue now. Feel free to reopen it if there are any problems with the new version COSMA-v2.6.1.
from cosma.