
kihangyoun commented on July 19, 2024

20200402_mpi_error
Q#1

I am writing to ask about a CFD model whose results differ depending on the order of the hosts in mpi_hosts.
I would like to hear your opinion in theoretical terms, because the code is long and complex and the problem would be difficult to reproduce with sample code.

Currently, when nodes on different InfiniBand switches take part in the same parallel computation, Case #1 works well but Case #2 does not.
"Does not work well" means that the resulting values differ.

Background: host01-host04 are on IB switch #1 and host99 is on IB switch #2.
Case #1: host01, host02, host03, host04, host99 (i.e., the header node is host01)
Case #2: host99, host01, host02, host03, host04 (i.e., the header node is host99)
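
To make the two cases concrete, the following Python sketch shows how the host order could change which ranks land on IB switch #2. It assumes block placement (ranks fill each host in list order) and 5 ranks per host. Both are assumptions for illustration, since the actual launcher settings are not given here, and the helper names are hypothetical.

```python
# Which switch each host is attached to (from the Background line).
SWITCH_OF = {
    "host01": "IB1", "host02": "IB1", "host03": "IB1", "host04": "IB1",
    "host99": "IB2",
}

CASE1 = ["host01", "host02", "host03", "host04", "host99"]
CASE2 = ["host99", "host01", "host02", "host03", "host04"]


def block_placement(hosts, slots_per_host):
    """Map 0-based ranks to hosts, assuming ranks fill each host in list order."""
    placement = {}
    rank = 0
    for host in hosts:
        for _ in range(slots_per_host):
            placement[rank] = host
            rank += 1
    return placement


def ranks_on_switch(hosts, slots_per_host, switch):
    """Return the sorted ranks whose host is attached to the given switch."""
    placement = block_placement(hosts, slots_per_host)
    return sorted(r for r, h in placement.items() if SWITCH_OF[h] == switch)


if __name__ == "__main__":
    # slots_per_host=5 is an assumed value, not taken from the actual job.
    print("Case 1, ranks on IB2:", ranks_on_switch(CASE1, 5, "IB2"))
    print("Case 2, ranks on IB2:", ranks_on_switch(CASE2, 5, "IB2"))
```

Under these assumptions, Case #2 places rank 0 on host99, so anything the code ties to rank 0 would run on the other switch.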

My guesses so far (these are hypothetical, with no theoretical basis):

  1. Miscommunication occurs when the header node is on a different switch.
  2. The myrank values are reversed when MPI_COMM_RANK is called several times.
  3. MPI_COMM_WORLD is broken or mismatched in some way.
  4. Synchronization excludes the header node.

First, for debugging, I am putting print statements in several places to see which subroutine or function changes the value.
(I will post more when the situation is updated.)
However, whichever function I finally find, I am not sure the problem can be resolved at the code level, so I am posting to the forum to hear about similar experiences.
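
One way to organize this kind of print-based debugging is to print a checksum of the field after each stage in both runs, then find the first stage where the two runs disagree. The Python sketch below shows only the comparison step. The stage names and values are made up for illustration and are not taken from the model.

```python
import math


def first_divergence(run_a, run_b, rel_tol=0.0):
    """Return the name of the first stage whose checksums differ, or None.

    run_a and run_b are lists of (stage_name, checksum) pairs in execution
    order. With rel_tol=0.0 the comparison is exact equality.
    """
    for (name_a, val_a), (name_b, val_b) in zip(run_a, run_b):
        if name_a != name_b:
            raise ValueError(f"stage order differs: {name_a!r} vs {name_b!r}")
        if not math.isclose(val_a, val_b, rel_tol=rel_tol, abs_tol=0.0):
            return name_a
    return None


if __name__ == "__main__":
    # Hypothetical checksums printed by the two host orderings.
    case1 = [("init", 1.0), ("halo_sn", 2.5), ("chem", 3.0)]
    case2 = [("init", 1.0), ("halo_sn", 2.75), ("chem", 3.5)]
    print("first differing stage:", first_divergence(case1, case2))
```

Later stages will usually differ too once one stage diverges, so only the first mismatch is informative.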

Additional#1

Additional information:

  1. This program uses only the MPI library, not OpenMP.
    I have not tried the constructs you recommended (IREQ, THREADPRIVATE, NOVECTOR).

  2. Test results
    As I said before, I tried two techniques (ISEND and BARRIER), but neither worked.

    a. ISEND: The run completes, but the same message loss occurs.
    b. BARRIER: This one is a little odd. I am sure all the processes enter the subroutine, but they then wait forever (hang).
    c. SBUF, RBUF (Jim): I reduced deallocations as much as possible, but the same message loss occurs.

  3. I do not think the subroutine is the problem, for two reasons:

    a. When all hosts are assigned within the same IB switch, no message has been lost in 20 repeated runs.
      Example:
      host004(IB1): 16 17 18 19 20
      host003(IB1): 11 12 13 14 15
      host002(IB1): 06 07 08 09 10
      host001(IB1): 01 02 03 04 05 : always fine
    b. When I adjusted the domain (NROW, NCOL) so that East-West communication always took place within the same node, there was no problem.
      (E-W communication uses the same subroutine.)
      The problems have always occurred in South-North communication between different IB switches.
      Example:
      host037(IB2): 16 17 18 19 20 <- message loss occurs in S-N communication
      host003(IB1): 11 12 13 14 15
      host002(IB1): 06 07 08 09 10
      host001(IB1): 01 02 03 04 05

  4. Could the hang with the barrier have a similar cause?
    Could the nodes on one IB switch (host001-host004) be waiting for the node on the other IB switch (host037) while host037 has already passed through the barrier?
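
The layout in the examples above can be checked with a small Python sketch that lists which neighbor pairs cross the IB switch boundary. It assumes a 5 x 4 decomposition (NCOL = 5, NROW = 4) with one row of ranks per host and 1-based rank numbers, as in the second example. The function names are hypothetical, and this is not code from the model.

```python
NCOL, NROW = 5, 4  # assumed from the example: 5 ranks per row, 4 rows

# One grid row per host, listed south to north as in the second example.
HOST_OF_ROW = ["host001", "host002", "host003", "host037"]
SWITCH_OF = {"host001": "IB1", "host002": "IB1", "host003": "IB1",
             "host037": "IB2"}


def host_of(rank):
    """Host of a 1-based rank, assuming each host holds one full grid row."""
    return HOST_OF_ROW[(rank - 1) // NCOL]


def neighbor_pairs():
    """Yield (direction, rank_a, rank_b) for every adjacent pair in the grid."""
    for rank in range(1, NROW * NCOL + 1):
        row, col = divmod(rank - 1, NCOL)
        if col + 1 < NCOL:
            yield "E-W", rank, rank + 1
        if row + 1 < NROW:
            yield "S-N", rank, rank + NCOL


def cross_switch_pairs():
    """Return the neighbor pairs whose two ranks sit on different IB switches."""
    return [(d, a, b) for d, a, b in neighbor_pairs()
            if SWITCH_OF[host_of(a)] != SWITCH_OF[host_of(b)]]


if __name__ == "__main__":
    for direction, a, b in cross_switch_pairs():
        print(f"{direction}: rank {a:02d} ({host_of(a)}) <-> rank {b:02d} ({host_of(b)})")
```

With this layout every E-W pair stays on one host, and the only pairs that cross the switch boundary are the S-N pairs between ranks 11-15 and ranks 16-20, which matches where the message loss is reported.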

Are there any MPI options that could be adjusted to improve this?
I am going to check whether other MPI libraries (openmpi, mvapich, mpich) show the same error.

from cmaq.

kmfoley commented on July 19, 2024

Thank you for your question and your interest in the CMAQ system. We ask that you please post your question to the CMAS Center Forum: https://forum.cmascenter.org/

We would like this question to be documented on the forum to help other users that may run into similar issues.

Please start a 'New Topic' with an informative title and choose 'CMAQ' as the category. This will ensure you are connected to the appropriate developer and user base. I will also pass your question on to the member of our team most familiar with these types of issues so that he can respond to your Forum post if he has any insight.

dwongepa commented on July 19, 2024

Hi kihangyoun,

Could you please contact me directly at [email protected]? I would like to ask you a few more questions to determine the cause of the problem.

Cheers,
David
