
Comments (7)

keisukefukuda commented on July 18, 2024

Yes, as you said, the expected behaviour of the MPI runtime is to kill all child processes and shut down with an error code.
We sometimes observe this issue as well. It's a known issue with Open MPI <= 2.1.

In the script you showed, the second create_communicator is supposed to raise an error.
I will investigate if there's any way or option to avoid the issue.

It seems that failed-process detection and subprocess shutdown were improved in Open MPI 3.0, but that version brings another issue to ChainerMN (see #221 for details).
We are trying to solve #221, but it will take time because it's not a ChainerMN issue.

On the long-term roadmap, we are working hard to add fault tolerance to ChainerMN.

Thanks

from chainermn.

andremoeller commented on July 18, 2024

@keisukefukuda ,

Thank you for looking into this!

Unfortunately, I tried it with Open MPI 3.0.1, but I encountered the same behavior.

I should note that even if I do not create two communicators, this issue still occurs:

import chainermn

def main():
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 0:
        raise ValueError('failure!')
    comm = chainermn.create_communicator('naive', mpi_comm)

if __name__ == '__main__':
    main()

Please let me know if you make any progress. Thank you!


andremoeller commented on July 18, 2024

I looked into this some more. It seems like mpi4py isn't handling Python exceptions correctly. I tried calling init_ranks with some timeout code that raises SIGALRM after a delay, but the handler never gets called.

mpirun -n 2 python -m trace --trace repro.py

This trace shows that mpi_comm operations (gather and scatter) cause execution to hang:

_communication_utility.py(32): global_names=mpi_comm.gather(mpi4py.MPI.Get_processor_name())
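
For reference, the SIGALRM timeout attempt mentioned above might look roughly like the following sketch. The names `run_with_alarm` and `busy_wait` are hypothetical and do not come from ChainerMN or mpi4py; the code assumes a Unix system and the main thread. Python-level signal handlers only run between bytecode instructions, so a handler like this fires for pure-Python work but may never run while a rank is blocked inside a C-level MPI call, which would be consistent with the hang in the trace.

```python
import signal


def _on_alarm(signum, frame):
    # Runs only when the interpreter regains control between bytecodes.
    raise TimeoutError("operation timed out")


def run_with_alarm(func, seconds):
    """Run func(), raising TimeoutError if it takes longer than `seconds`.

    Hypothetical helper for illustration; Unix-only, main thread only.
    """
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return func()
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


def busy_wait():
    """Pure-Python loop, so the handler can interrupt it."""
    while True:
        pass
```

Calling `run_with_alarm(busy_wait, 0.1)` raises `TimeoutError`, whereas wrapping a blocking collective such as `mpi_comm.gather(...)` may never reach the handler, because control does not return to the interpreter while the call is blocked.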


keisukefukuda commented on July 18, 2024

I've just found a hack to solve the issue.
It works for the tiny script, but we need to check if it works for real-world applications.

Note that the problem happens without chainermn.

import sys

# Global error handler
def global_except_hook(exctype, value, traceback):
    sys.stderr.write("except_hook. Calling MPI_Abort().\n")
    # NOTE: mpi4py must be imported inside the exception handler, not globally.
    # In chainermn, the mpi4py import is carefully delayed, because
    # mpi4py automatically calls MPI_Init() and causes a crash in InfiniBand environments.
    import mpi4py.MPI
    # Print the traceback first: Abort() terminates the process and does not return.
    sys.__excepthook__(exctype, value, traceback)
    sys.stderr.flush()
    mpi4py.MPI.COMM_WORLD.Abort(1)
sys.excepthook = global_except_hook

def func1():
    import chainermn
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 0:
        raise ValueError('failure!')
    comm = chainermn.create_communicator('naive', mpi_comm)

def func2():
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 0:
        raise ValueError('failure!')

    mpi4py.MPI.COMM_WORLD.Barrier()



if __name__ == '__main__':
    d = {'func1' : func1,
         'func2' : func2}

    fname = sys.argv[1] if len(sys.argv) >= 2 else 'func1'
    d[fname]()


keisukefukuda commented on July 18, 2024

Improved version of the error handler:

import sys

# Global error handler
def global_except_hook(exctype, value, traceback):
    import sys
    try:
        import mpi4py.MPI
        sys.stderr.write("\n*****************************************************\n")
        sys.stderr.write("Uncaught exception was detected on rank {}. \n".format(
            mpi4py.MPI.COMM_WORLD.Get_rank()))
        from traceback import print_exception
        print_exception(exctype, value, traceback)
        sys.stderr.write("*****************************************************\n\n\n")
        sys.stderr.write("\n")
        sys.stderr.write("Calling MPI_Abort() to shut down MPI processes...\n")
        sys.stderr.flush()
    finally:
        try:
            import mpi4py.MPI
            mpi4py.MPI.COMM_WORLD.Abort(1)
        except Exception as e:
            sys.stderr.write("*****************************************************\n")
            sys.stderr.write("Sorry, we failed to stop MPI, this process will hang.\n")
            sys.stderr.write("*****************************************************\n")
            sys.stderr.flush()
            raise e

sys.excepthook = global_except_hook


def func():
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 0:
        raise ValueError('failure!')

    mpi4py.MPI.COMM_WORLD.Barrier()


if __name__ == '__main__':
    func()
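
Since the hook above still needs checking in real-world applications, one way to exercise the same idea without launching MPI is to pass the abort call in as a parameter. This is only a sketch: `make_abort_hook`, `mpi_abort`, and `demo` are hypothetical names, not part of ChainerMN or mpi4py. The only mpi4py call used is `COMM_WORLD.Abort`, and it is imported lazily as in the code above.

```python
import io
import sys
import traceback


def make_abort_hook(abort, stream=None):
    """Build an excepthook that prints the traceback, then calls abort(1)."""
    def hook(exctype, value, tb):
        out = stream if stream is not None else sys.stderr
        try:
            traceback.print_exception(exctype, value, tb, file=out)
            out.flush()
        finally:
            abort(1)  # always attempt to stop the job, even if printing failed
    return hook


def mpi_abort(errorcode):
    # Imported lazily: importing mpi4py.MPI triggers MPI_Init().
    import mpi4py.MPI
    mpi4py.MPI.COMM_WORLD.Abort(errorcode)


def demo():
    """Exercise the hook with a fake abort so no MPI runtime is needed."""
    calls = []
    buf = io.StringIO()
    hook = make_abort_hook(calls.append, stream=buf)
    try:
        raise ValueError("failure!")
    except ValueError:
        hook(*sys.exc_info())
    return calls, buf.getvalue()

# In a real MPI program: sys.excepthook = make_abort_hook(mpi_abort)
```

Here `demo()` records the error code that would have been passed to `Abort` and captures the printed traceback, so the hook's ordering (print first, abort last) can be checked in an ordinary unit test.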


andremoeller commented on July 18, 2024

Great, thanks @keisukefukuda!

I also contacted the mpi4py maintainer, who suggested using mpi4py.run (http://mpi4py.readthedocs.io/en/stable/mpi4py.run.html), which also suffices (I believe mpi4py.run also calls MPI_Abort()).


keisukefukuda commented on July 18, 2024

Thanks. I will also test whether mpi4py's recommended approach works with ChainerMN.
I think I can now close the issue. Thanks for your contribution!

