Stuck in third send or recv when connecting two independent MPI applications with MPI_Comm_connect and MPI_Comm_accept with prte (ompi, closed)

dariomnz commented on August 19, 2024
Stuck in third send or recv when connecting two independent mpi applications with MPI_Comm_connect and MPI_Comm_accept with prte

Comments (5)

bosilca commented on August 19, 2024

Your code works just fine on my setup. My main issue arises from the fact that the prte DVM decides to use only the 'lo' interface, which forces me to start all processes on the same host.

dariomnz commented on August 19, 2024

Could I get more feedback, for example on how to debug the app to find out why it is stuck?

bosilca commented on August 19, 2024

As a first step, attach a debugger to each running process (gdb -p <pid>) and look at the stack trace to see exactly where the processes are blocked.
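
If a process is hard to catch at the right moment, one common trick is to make it wait until the debugger is attached. The following is only a minimal sketch of that idea, not something taken from the reporter's server.c or client.c: the helper name wait_for_debugger, the environment variable DEBUG_WAIT, and the flag attached are all hypothetical names chosen for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for gethostname() and getpid() */
#endif

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of Open MPI or PRRTE).
 * If the assumed environment variable DEBUG_WAIT is set to "1", print the
 * PID and host name, then spin until a debugger flips `attached`.
 * Returns 1 if it waited, 0 if waiting was not requested.
 */
static int wait_for_debugger(void)
{
    const char *flag = getenv("DEBUG_WAIT");
    if (flag == NULL || flag[0] != '1') {
        return 0; /* waiting not requested: continue immediately */
    }

    /* volatile so the compiler re-reads the value the debugger writes */
    volatile int attached = 0;

    char host[256];
    if (gethostname(host, sizeof(host)) != 0) {
        snprintf(host, sizeof(host), "unknown");
    }
    host[sizeof(host) - 1] = '\0'; /* gethostname may not terminate on truncation */

    fprintf(stderr, "PID %ld on %s is waiting for a debugger\n",
            (long)getpid(), host);
    fflush(stderr);

    while (!attached) {
        sleep(1); /* in gdb: select this frame, then `set var attached = 1` */
    }
    return 1;
}
```

Assuming this helper is called early in main (for example right after MPI_Init), you would start the program with DEBUG_WAIT=1, run gdb -p with the printed PID, move up to the wait_for_debugger frame, run set var attached = 1, and then continue until the process blocks in the Send or Recv of interest.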

dariomnz commented on August 19, 2024

As I said before, the program gets stuck in the third Send-Recv. You can see it in the gdb backtraces below, which I collected as you suggested.

Server:

gdb -p 593951
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 593951
[New LWP 593952]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fd8eed3546e in epoll_wait (epfd=3, events=0x564a1bcc7810, maxevents=32, timeout=0) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30      ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
(gdb) bt
#0  0x00007fd8eed3546e in epoll_wait (epfd=3, events=0x564a1bcc7810, maxevents=32, timeout=0) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007fd8ee988469 in ?? () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#2  0x00007fd8ee97e4a5 in event_base_loop () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#3  0x00007fd8eeb3bb53 in opal_progress_events () from /beegfs/home/dariomnz/bin/ompi5/lib/libopen-pal.so.80
#4  0x00007fd8eeb3bc25 in opal_progress () from /beegfs/home/dariomnz/bin/ompi5/lib/libopen-pal.so.80
#5  0x00007fd8ef06ff60 in mca_pml_ob1_recv () from /beegfs/home/dariomnz/bin/ompi5/lib/libmpi.so.40
#6  0x00007fd8eeeea9ef in PMPI_Recv () from /beegfs/home/dariomnz/bin/ompi5/lib/libmpi.so.40
#7  0x000056433f13e5fd in main (argc=1, argv=0x7ffefe82ca88) at server.c:45
(gdb) 

Client:

gdb -p 593961
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 593961
[New LWP 593962]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007faae290f46e in epoll_wait (epfd=3, events=0x555fdbb5b810, maxevents=32, timeout=0) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
30      ../sysdeps/unix/sysv/linux/epoll_wait.c: No such file or directory.
(gdb) bt
#0  0x00007faae290f46e in epoll_wait (epfd=3, events=0x555fdbb5b810, maxevents=32, timeout=0) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007faae2562469 in ?? () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#2  0x00007faae25584a5 in event_base_loop () from /lib/x86_64-linux-gnu/libevent_core-2.1.so.7
#3  0x00007faae2715b53 in opal_progress_events () from /beegfs/home/dariomnz/bin/ompi5/lib/libopen-pal.so.80
#4  0x00007faae2715c25 in opal_progress () from /beegfs/home/dariomnz/bin/ompi5/lib/libopen-pal.so.80
#5  0x00007faae2c4e313 in mca_pml_ob1_send () from /beegfs/home/dariomnz/bin/ompi5/lib/libmpi.so.40
#6  0x00007faae2aca1f3 in PMPI_Send () from /beegfs/home/dariomnz/bin/ompi5/lib/libmpi.so.40
#7  0x0000562e06ef4500 in main (argc=2, argv=0x7ffdf534e6a8) at client.c:35
(gdb) 

Can I have more information on how to continue debugging to make it work?

dariomnz commented on August 19, 2024

After much trial and error, the problem turned out to be that I needed to build Open MPI with Slurm support (--with-slurm=/opt/slurm); without it, the program gets stuck on the third send.
