Comments (3)
I am writing to ask about CFD model results that differ depending on the order of hosts in mpi_hosts.
I would like to hear your opinions on the theory, because the code is long and complex and would be difficult to reproduce with sample code.
The current situation is that when nodes on different InfiniBand (IB) switches perform parallel computations, case #1 works well but case #2 does not.
"Doesn't work well" means that the computed values differ.
Background: host01-host04 are on IB switch #1 and host99 is on IB switch #2.
Case #1: host01, host02, host03, host04, host99 (i.e. the head node is host01)
Case #2: host99, host01, host02, host03, host04 (i.e. the head node is host99)
My guesses so far (hypothetical scenarios with no theoretical basis):
- Miscommunication occurs when the head node is on a different switch.
- Myrank values are reversed when MPI_COMM_RANK is called several times.
- Something is wrong (broken or mismatched) in MPI_COMM_WORLD.
- Synchronization excludes the head node.
First of all, for debugging, I'm putting print statements in several places to see which subroutine or function changes the value.
(I'll post more when the situation is updated.)
However, whichever function I end up finding, I am not sure the problem can be resolved at the code level, so I am posting to the forum to hear from anyone with a similar experience.
Additional #1
Additional information:
- This program uses only the MPI library, not OpenMP.
  I have not tried the structures you recommended (IREQ, THREADPRIVATE, NOVECTOR).
-
Test results
As I said before, I tried two techniques (ISEND and BARRIER), but neither works.
- ISEND: Even though it runs, the same message loss occurs.
- BARRIER: This is a little odd. I'm sure all the processes enter the subroutine, yet they end up waiting indefinitely (hang).
- SBUF, RBUF (Jim): I reduced deallocation as much as possible, but the same message loss occurs.
- I don't think the subroutine is the problem.
The reasons I don't think it's a subroutine problem:
- When all hosts are assigned within the same IB switch, no message was lost in 20 repeated runs.
ex)
host004(IB1): 16 17 18 19 20
host003(IB1): 11 12 13 14 15
host002(IB1): 06 07 08 09 10
host001(IB1): 01 02 03 04 05 : always fine
- As the East-West communication always took place within the same node after adjusting the domain (NROW, NCOL), it caused no problem.
(E-W communication uses same subroutine)
It has always been communication between different IB switches that causes problems in South-North communication.
ex)
host037(IB2): 16 17 18 19 20 <- message loss occurs in S-N communication
host003(IB1): 11 12 13 14 15
host002(IB1): 06 07 08 09 10
host001(IB1): 01 02 03 04 05
- Could the hang when using a barrier have a similar cause?
Could the nodes on one IB switch (host001-host004) be waiting for the node on the other IB switch (host037) while host037 has already passed through the barrier?
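The two rank layouts shown above can be sketched in a few lines of Python to list which South-North neighbor pairs span two IB switches. This is only an illustration under assumptions that the post does not confirm: ranks are placed in consecutive blocks following the host order, the domain is decomposed row by row with one row of NCOL ranks per host, and the function names `rank_to_host` and `cross_switch_sn_pairs` are hypothetical.

```python
def rank_to_host(hosts, ranks_per_host):
    """Assumed block placement: consecutive ranks fill each host in list order."""
    total = len(hosts) * ranks_per_host
    return [hosts[r // ranks_per_host] for r in range(total)]


def cross_switch_sn_pairs(hosts, switch_of, ncol):
    """Return 1-based (south, north) rank pairs whose hosts are on different switches.

    Assumes one grid row of `ncol` ranks per host, as in the diagrams in the post.
    """
    placement = rank_to_host(hosts, ncol)
    nrow = len(hosts)
    pairs = []
    for row in range(nrow - 1):
        for col in range(ncol):
            south = row * ncol + col      # 0-based rank in this row
            north = south + ncol          # neighbor one row up
            if switch_of[placement[south]] != switch_of[placement[north]]:
                pairs.append((south + 1, north + 1))
    return pairs


# Switch membership taken from the host labels in the post.
switch_of = {
    "host001": "IB1", "host002": "IB1", "host003": "IB1", "host004": "IB1",
    "host037": "IB2",
}

same_switch = cross_switch_sn_pairs(
    ["host001", "host002", "host003", "host004"], switch_of, ncol=5)
mixed_switch = cross_switch_sn_pairs(
    ["host001", "host002", "host003", "host037"], switch_of, ncol=5)

print(same_switch)   # []
print(mixed_switch)  # [(11, 16), (12, 17), (13, 18), (14, 19), (15, 20)]
```

Under these assumptions, the only exchanges that cross switches are the S-N pairs between ranks 11-15 and 16-20, which matches where the message loss is reported. East-West neighbors never appear because each row stays on one host.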
Are there any MPI options that could be modified to improve this?
I'm going to check whether other MPI libraries (Open MPI, MVAPICH, MPICH) show the same error.
from cmaq.
Thank you for your question and your interest in the CMAQ system. We ask that you please post your question to the CMAS Center Forum: https://forum.cmascenter.org/
We would like this question to be documented on the forum to help other users that may run into similar issues.
Please start a 'New Topic' with an informative title and choose 'CMAQ' as the category. This will ensure you are connected to the appropriate developer and user base. I will also pass your question on to the member of our team most familiar with these types of issues so that he can respond to your Forum post if he has any insight.
from cmaq.
Hi kihangyoun,
Could you please contact me directly at [email protected]? I would like to ask you a few more questions to determine the cause of the problem.
Cheers,
David
from cmaq.