Giter Site home page Giter Site logo

test_mpi_accs stuck about armci-mpi HOT 5 OPEN

pmodels avatar pmodels commented on June 3, 2024
test_mpi_accs stuck

from armci-mpi.

Comments (5)

jeffhammond avatar jeffhammond commented on June 3, 2024

can you run gstack or attach GDB to the stuck process to see where it is?

from armci-mpi.

wirawan0 avatar wirawan0 commented on June 3, 2024

Hi Jeff, sorry for the delay. Here are the stack traces:

the output of top when test_mpi_accs processes get stuck:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                     
45007 root      20   0   81252  36852   7116 R  82.6  0.0   4:11.94 test_mpi_accs                               
45009 root      20   0   81244  36612   6876 R  82.6  0.0   4:11.46 test_mpi_accs

==> attached by GDB::

(gdb) attach 45007
Attaching to process 45007
Reading symbols from /opt/armci-mpi-master/tests/mpi/test_mpi_accs...done.
Reading symbols from /opt/mpich/lib/libmpi.so.12...(no debugging symbols found)...done.
Loaded symbols for /opt/mpich/lib/libmpi.so.12
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 45022]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /opt/conda/lib/libgfortran.so.4...done.
Loaded symbols for /opt/conda/lib/libgfortran.so.4
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /opt/conda/lib/libgcc_s.so.1...done.
Loaded symbols for /opt/conda/lib/libgcc_s.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /opt/conda/lib/libquadmath.so.0...done.
Loaded symbols for /opt/conda/lib/libquadmath.so.0
0x00007fb9ed0c27f3 in sock_pe_progress_tx_entry () from /opt/mpich/lib/libmpi.so.12
Missing separate debuginfos, use: debuginfo-install glibc-2.17-307.el7.1.x86_64
(gdb) where
#0  0x00007fb9ed0c27f3 in sock_pe_progress_tx_entry () from /opt/mpich/lib/libmpi.so.12
#1  0x00007fb9ed0c3d7b in sock_pe_progress_tx_ctx () from /opt/mpich/lib/libmpi.so.12
#2  0x00007fb9ed0c5056 in sock_pe_progress_ep_tx () from /opt/mpich/lib/libmpi.so.12
#3  0x00007fb9ed0b28b2 in sock_cq_progress () from /opt/mpich/lib/libmpi.so.12
#4  0x00007fb9ed0b2f21 in sock_cq_sreadfrom () from /opt/mpich/lib/libmpi.so.12
#5  0x00007fb9ecba1612 in MPIDI_CH4R_mpi_win_unlock () from /opt/mpich/lib/libmpi.so.12
#6  0x00007fb9ecbe0c23 in PMPI_Win_unlock () from /opt/mpich/lib/libmpi.so.12
#7  0x0000555a27b38356 in main (argc=1, argv=0x7ffc89e639b8) at tests/mpi/test_mpi_accs.c:48
(gdb) attach 45009
Attaching to program: /opt/armci-mpi-master/tests/mpi/test_mpi_accs, process 45009
Reading symbols from /opt/mpich/lib/libmpi.so.12...(no debugging symbols found)...done.
Loaded symbols for /opt/mpich/lib/libmpi.so.12
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 45023]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /opt/conda/lib/libgfortran.so.4...done.
Loaded symbols for /opt/conda/lib/libgfortran.so.4
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /opt/conda/lib/libgcc_s.so.1...done.
Loaded symbols for /opt/conda/lib/libgcc_s.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /opt/conda/lib/libquadmath.so.0...done.
Loaded symbols for /opt/conda/lib/libquadmath.so.0
0x00007f3404d7baeb in recv () from /lib64/libpthread.so.0
(gdb) where
#0  0x00007f3404d7baeb in recv () from /lib64/libpthread.so.0
#1  0x00007f3406008f41 in sock_comm_peek () from /opt/mpich/lib/libmpi.so.12
#2  0x00007f3406003e42 in sock_pe_progress_rx_pe_entry () from /opt/mpich/lib/libmpi.so.12
#3  0x00007f3406006b4e in sock_pe_progress_rx_ctx () from /opt/mpich/lib/libmpi.so.12
#4  0x00007f3406006c36 in sock_pe_progress_ep_rx () from /opt/mpich/lib/libmpi.so.12
#5  0x00007f3405ff58f3 in sock_cq_progress () from /opt/mpich/lib/libmpi.so.12
#6  0x00007f3405ff5f21 in sock_cq_sreadfrom () from /opt/mpich/lib/libmpi.so.12
#7  0x00007f3405ae4612 in MPIDI_CH4R_mpi_win_unlock () from /opt/mpich/lib/libmpi.so.12
#8  0x00007f3405b23c23 in PMPI_Win_unlock () from /opt/mpich/lib/libmpi.so.12
#9  0x000055950d909356 in main (argc=1, argv=0x7ffc580e9478) at tests/mpi/test_mpi_accs.c:48

This "stuck" is a random occasion; it does not always happen.

from armci-mpi.

wirawan0 avatar wirawan0 commented on June 3, 2024

The MPICH version:

root@wahab-01:/opt/mpich/bin# ./mpichversion 
MPICH Version:          3.3.2
MPICH Release date:     Tue Nov 12 21:23:16 CST 2019
MPICH Device:           ch4:ofi
MPICH configure:        --prefix=/opt/mpich --with-device=ch4:ofi
MPICH CC:       x86_64-conda_cos6-linux-gnu-gcc -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include   -O2
MPICH CXX:      x86_64-conda_cos6-linux-gnu-g++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include  -O2
MPICH F77:      /opt/conda/bin/x86_64-conda_cos6-linux-gnu-gfortran -fopenmp -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include  -O2
MPICH FC:       x86_64-conda_cos6-linux-gnu-gfortran   -O2
MPICH Custom Information: 

The line 48 above is the very first time MPI_Win_unlock was called in the code (first loop).

from armci-mpi.

jeffhammond avatar jeffhammond commented on June 3, 2024

It seems that it is stuck in the libfabric sockets conduit. I have never seen this before.

Try running with MPICH_ASYNC_PROGRESS=1 (no need to recompile anything). If that works, we have a clue.

If that works, can you then rebuild ARMCI-MPI with configure --with-progress .. and run with ARMCI_PROGRESS_THREAD=1 and ARMCI_PROGRESS_USLEEP=100? That is a second clue.

If neither works, I have to assume the container situation has broken something. I do not use containers so I don't know how to debug this.

from armci-mpi.

wirawan0 avatar wirawan0 commented on June 3, 2024

With MPICH_ASYNC_PROGRESS=1 the result is rather erratic (sometimes it works but about 25% it would be stuck). Here is a test script to launch the same program again and again and test it:

#!/bin/bash

# second test: with MPICH_ASYNC_PROGRESS=1

source /etc/container_runtime/runtime
load_mpich

: ${COUNT:=100}

export MPICH_ASYNC_PROGRESS=1

for c in $(seq 1 $COUNT); do
    echo "Iteration $c - $(date +%Y-%m-%dT%H:%M:%S)"
    tm_start=$(date +%s.%N)
    mpiexec -n 2 ./test_mpi_accs
    tm_end=$(date +%s.%N)
    awk -v tm_start="$tm_start" -v tm_end="$tm_end" -v c="$c" \
        'BEGIN { printf("- iteration %3d  time = %.3f secs\n", c, tm_end - tm_start) }'
    sleep 5s
done

Here is the example output (filtered to only a few relevant texts)--you see 3x stuck here:

- iteration   1  time = 1.089 secs
- iteration   2  time = 2.035 secs
^C[mpiexec@wahab-01] Sending Ctrl-C to processes as requested
[mpiexec@wahab-01] Press Ctrl-C again to force abort
- iteration   3  time = 15.499 secs
- iteration   4  time = 2.090 secs
- iteration   5  time = 1.494 secs
- iteration   6  time = 2.926 secs
- iteration   7  time = 3.569 secs
- iteration   8  time = 1.292 secs
- iteration   9  time = 1.506 secs
- iteration  10  time = 1.785 secs
- iteration  11  time = 2.163 secs
^C[mpiexec@wahab-01] Sending Ctrl-C to processes as requested
[mpiexec@wahab-01] Press Ctrl-C again to force abort
- iteration  12  time = 215.610 secs
- iteration  13  time = 2.391 secs
^C[mpiexec@wahab-01] Sending Ctrl-C to processes as requested
[mpiexec@wahab-01] Press Ctrl-C again to force abort
- iteration  14  time = 24.370 secs

On a related note, I tried to build armci-mpi outside container, I ran into a different issue, which I post on a separate issue - #32 .

from armci-mpi.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.