Comments (7)
Confirmed, problem fixed.
from cosma.
I am still struggling to reproduce this.
Can you please check if the following resolves the issue? (I want to see if it's related to synchronization)
export HIP_LAUNCH_BLOCKING=1
When using GPU-aware MPI, is it necessary to call device_synchronize after an MPI collective to ensure that it has finished executing and that the buffers can safely be used?
from cosma.
Concretely, should this line be uncommented: COSMA/src/cosma/gpu/gpu_aware_mpi_utils.cpp, line 234 (at commit 783803e)?
from cosma.
> I am still struggling to reproduce this.
> Can you please check if the following resolves the issue? (I want to see if it's related to synchronization)
> export HIP_LAUNCH_BLOCKING=1
Tried it and it doesn't solve the problem. I must say, this is expected (see my comment below).
> When using GPU-aware MPI, is it necessary to call device_synchronize after an MPI collective to ensure that it has finished executing and that the buffers can safely be used?
No, it is implied in the blocking MPI call.
from cosma.
So, we did a bit of investigation on this problem.
Apparently, this is related to the order in which the GPU arrays held in the memory pool are destroyed relative to the finalization of the GPU library dependencies inside Cray-MPICH. COSMA destroys the pool at the very end, likely after the GPU part of MPI has already been finalized. More investigation is needed on the MPI side.
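To make the suspected ordering concrete, here is a minimal sketch in plain C++. It uses no MPI or HIP calls: `MockRuntime` and `MockPool` are hypothetical stand-ins for the GPU-aware MPI runtime and the COSMA memory pool, and the code only records which of the two teardown steps happens first. It illustrates the hypothesis above and does not reproduce the actual crash.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the GPU-aware MPI runtime (not a real MPI/HIP API).
struct MockRuntime {
    bool finalized = false;
    std::vector<std::string> log;  // records the teardown order
    void finalize() {
        finalized = true;
        log.push_back("runtime finalized");
    }
};

// Hypothetical stand-in for a memory pool that owns device buffers.
class MockPool {
public:
    explicit MockPool(MockRuntime& rt) : rt_(rt) {
        assert(!rt.finalized && "pool must be created while the runtime is alive");
    }
    ~MockPool() { release(); }

    // Frees the buffers once and records whether the runtime was still alive.
    void release() {
        if (released_) return;
        released_ = true;
        rt_.log.push_back(rt_.finalized ? "pool released after finalize"
                                        : "pool released before finalize");
    }

private:
    MockRuntime& rt_;
    bool released_ = false;
};

// Suspected problematic order: the pool is destroyed after the runtime is finalized.
std::vector<std::string> late_release() {
    MockRuntime rt;
    {
        MockPool pool(rt);
        rt.finalize();
    }  // pool destructor runs here, after finalize
    return rt.log;
}

// Alternative order: release the pool explicitly, then finalize the runtime.
std::vector<std::string> early_release() {
    MockRuntime rt;
    {
        MockPool pool(rt);
        pool.release();
        rt.finalize();
    }  // destructor is a no-op because the pool was already released
    return rt.log;
}
```

In this mock, `late_release()` ends with "pool released after finalize", while `early_release()` frees the pool first. Whether an explicit release before finalization is the right fix for COSMA is still open.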
from cosma.
Thanks a lot @alazzaro for the feedback! Hope we can fix it soon! :)
from cosma.
@alazzaro Hey Alfio, do we have some more insights about this? Could it still be that this is a COSMA bug?
from cosma.