Comments (8)
Start by fixing the ssh connection drop by adding ServerAliveInterval XXX (where XXX is a number of seconds) to your ${HOME}/.ssh/config.
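For reference, a minimal sketch of what that entry might look like. The `Host *` pattern is an assumption (it applies the setting to every host; narrow it to your compute nodes if you prefer), and 60 seconds is simply the value reported to work later in this thread, not a tuned recommendation.

```
# ${HOME}/.ssh/config
# Assumption: apply to all hosts; replace * with your node names if desired.
Host *
    # Send a keepalive through the encrypted channel after 60 s without
    # data from the server, so an idle connection is not silently dropped.
    ServerAliveInterval 60
```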
from ompi.
OMPI redirects all output through our own channels, so your ssh connections will drop, killing the processes on the corresponding nodes. OMPI then interprets this as not being able to connect to some processes and bails out.
from ompi.
FWIW, I noted the error message is misleading:
ORTE was unable to reliably start one or more daemons.
Everything suggests the daemons were correctly started, but they died unexpectedly after that.
If prrte has a similar issue, this is something we could investigate and improve.
from ompi.
@bosilca I will try it. Thanks for your reply!
from ompi.
The problem disappeared after adding ServerAliveInterval 60. Thanks a lot!
But I wonder why. I think mpirun connects nodes using TCP, but ServerAliveInterval works for ssh. Why does this ssh parameter influence the behavior of TCP connections?
from ompi.
OMPI spawns the orted daemon through ssh. In my case, the ssh connection used for communication between mpirun on the local host and orted on the remote host was closed after being idle for a long time, and that was the cause of the problem I met. Am I correct?
(I found a Stack Overflow answer talking about this.)
from ompi.
The successful workaround suggests this is what happened.
from ompi.
Okay. Thank you.
from ompi.