Comments (7)
Yes, as you said, the expected behaviour of the MPI runtime is to kill all child processes and shut down with an error code.
We sometimes observe this issue as well. It's a known issue for Open MPI <= 2.1.
In the script you showed, the second create_communicator is supposed to raise an error.
I will investigate whether there's any way or option to avoid the issue.
It seems that failed-process detection and subprocess shutdown were improved in Open MPI 3.0, but that version brings another issue to ChainerMN (see #221 for details).
We are trying to solve #221, but it will take time because it's not a ChainerMN issue.
In the long-term roadmap, we are working hard to add fault tolerance to ChainerMN.
Thanks
from chainermn.
Thank you for looking into this!
Unfortunately, I tried it with Open MPI 3.0.1, but I encountered the same behavior.
I should note that even if I do not create two communicators, this issue still occurs:
import chainermn


def main():
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 0:
        raise ValueError('failure!')
    comm = chainermn.create_communicator('naive', mpi_comm)


if __name__ == '__main__':
    main()
Please let me know if you make any progress. Thank you!
from chainermn.
I looked into this some more. It seems like mpi4py isn't handling Python exceptions correctly. I tried calling init_ranks with some timeout code that sends a SIGALRM signal after some time, but the handler never gets called.
Running the script under the trace module:
mpirun -n 2 python -m trace --trace repro.py
The trace shows that mpi_comm operations (gather and scatter) cause execution to hang:
_communication_utility.py(32): global_names=mpi_comm.gather(mpi4py.MPI.Get_processor_name())
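For reference, the kind of SIGALRM timeout described above can be sketched as follows. This is a minimal, Unix-only illustration and not the code that was actually used; `run_with_timeout` is a hypothetical helper name. One possible explanation for the handler never firing is that CPython only runs Python-level signal handlers once the main thread returns to the interpreter, so a call blocked inside native code (such as an MPI collective) may never give the handler a chance to run.

```python
import signal
import time


def run_with_timeout(func, seconds):
    """Run func(); raise TimeoutError if it takes longer than `seconds`.

    Hypothetical helper for illustration. Unix only, main thread only.
    """
    def on_alarm(signum, frame):
        raise TimeoutError('timed out after {} s'.format(seconds))

    previous = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)  # arm the timer
    try:
        return func()
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)    # disarm the timer
        signal.signal(signal.SIGALRM, previous)    # restore the old handler


if __name__ == '__main__':
    # time.sleep() is interruptible, so the handler gets to run here.
    try:
        run_with_timeout(lambda: time.sleep(5), 0.2)
    except TimeoutError as e:
        print('caught:', e)
```

The sleep in this example is interruptible, which is why the timeout works here; a blocking call inside an extension module may behave differently.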
from chainermn.
I've just found a hack to solve the issue.
It works for the tiny script, but we need to check if it works for real-world applications.
Note that the problem happens without chainermn.
import sys


# Global error handler
def global_except_hook(exctype, value, traceback):
    sys.stderr.write("except_hook. Calling MPI_Abort().\n")
    # Print the traceback first: MPI_Abort() is not expected to return.
    sys.__excepthook__(exctype, value, traceback)
    sys.stderr.flush()
    # NOTE: mpi4py must be imported inside exception handler, not globally.
    # In chainermn, mpi4py import is carefully delayed, because
    # mpi4py automatically calls MPI_Init() and causes a crash on InfiniBand environments.
    import mpi4py.MPI
    mpi4py.MPI.COMM_WORLD.Abort(1)


sys.excepthook = global_except_hook


def func1():
    import chainermn
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 0:
        raise ValueError('failure!')
    comm = chainermn.create_communicator('naive', mpi_comm)


def func2():
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 0:
        raise ValueError('failure!')
    mpi4py.MPI.COMM_WORLD.Barrier()


if __name__ == '__main__':
    d = {'func1': func1,
         'func2': func2}
    fname = sys.argv[1] if len(sys.argv) >= 2 else 'func1'
    d[fname]()
from chainermn.
Improved version of the error handler:
import sys


# Global error handler
def global_except_hook(exctype, value, traceback):
    import sys
    try:
        import mpi4py.MPI
        sys.stderr.write("\n*****************************************************\n")
        sys.stderr.write("Uncaught exception was detected on rank {}. \n".format(
            mpi4py.MPI.COMM_WORLD.Get_rank()))
        from traceback import print_exception
        print_exception(exctype, value, traceback)
        sys.stderr.write("*****************************************************\n\n\n")
        sys.stderr.write("\n")
        sys.stderr.write("Calling MPI_Abort() to shut down MPI processes...\n")
        sys.stderr.flush()
    finally:
        try:
            import mpi4py.MPI
            mpi4py.MPI.COMM_WORLD.Abort(1)
        except Exception as e:
            sys.stderr.write("*****************************************************\n")
            sys.stderr.write("Sorry, we failed to stop MPI, this process will hang.\n")
            sys.stderr.write("*****************************************************\n")
            sys.stderr.flush()
            raise e


sys.excepthook = global_except_hook


def func():
    import mpi4py.MPI
    mpi_comm = mpi4py.MPI.COMM_WORLD
    if mpi_comm.rank == 0:
        raise ValueError('failure!')
    mpi4py.MPI.COMM_WORLD.Barrier()


if __name__ == '__main__':
    func()
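To check the hook's logic without launching MPI, the abort call can be passed in as a parameter. The sketch below is a variation on the handler above under that assumption; it is not part of ChainerMN or mpi4py, and `make_abort_hook` with its `abort` and `stream` parameters are hypothetical names. In a real job one would presumably pass a callable that lazily imports mpi4py and calls `mpi4py.MPI.COMM_WORLD.Abort`.

```python
import io
import sys
import traceback


def make_abort_hook(abort, stream=None):
    """Build an excepthook that prints the traceback, then calls abort(1).

    `abort` is any callable taking an error code (hypothetical design).
    """
    def hook(exctype, value, tb):
        out = stream if stream is not None else sys.stderr
        try:
            traceback.print_exception(exctype, value, tb, file=out)
            out.flush()
        finally:
            # Always attempt the abort, even if printing failed.
            abort(1)
    return hook


if __name__ == '__main__':
    # Dry run with a fake abort that only records the error code.
    codes = []
    buf = io.StringIO()
    hook = make_abort_hook(codes.append, stream=buf)
    try:
        raise ValueError('failure!')
    except ValueError:
        hook(*sys.exc_info())
    print(codes)
    print('ValueError: failure!' in buf.getvalue())
```

Installing it would then be a matter of assigning the returned function to `sys.excepthook`, as in the scripts above.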
from chainermn.
Great, thanks @keisukefukuda!
I also contacted the mpi4py maintainer, who suggested using mpi4py.run (http://mpi4py.readthedocs.io/en/stable/mpi4py.run.html), which also suffices (I believe mpi4py.run also calls MPI_Abort()).
from chainermn.
Thanks. I will also test whether the approach recommended by mpi4py works with ChainerMN.
I think I can now close the issue. Thanks for your contribution!
from chainermn.