Comments (5)
Can you run gstack or attach GDB to the stuck process to see where it is?
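If attaching interactively is inconvenient, the same information can be collected in batch mode. The following is a minimal sketch, assuming gdb and pgrep are installed and that the stuck processes are named test_mpi_accs; the function name dump_stacks is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print a backtrace of every thread in every
# running process whose name matches exactly.
dump_stacks() {
    local name=$1 pid pids
    # pgrep -x matches the process name exactly; the result is empty if none match.
    pids=$(pgrep -x "$name" 2>/dev/null || true)
    if [ -z "$pids" ]; then
        echo "no running process named $name"
        return 1
    fi
    for pid in $pids; do
        echo "=== PID $pid ==="
        # -batch makes gdb exit after the -ex commands, which detaches it again.
        gdb -p "$pid" -batch -ex "thread apply all bt" 2>/dev/null
    done
}
```

It would be called as `dump_stacks test_mpi_accs` from a second shell while the test is hung.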
from armci-mpi.
Hi Jeff, sorry for the delay. Here are the stack traces:
First, the output of top while the test_mpi_accs processes are stuck:
  PID USER  PR NI  VIRT   RES  SHR S %CPU %MEM   TIME+ COMMAND
45007 root  20  0 81252 36852 7116 R 82.6  0.0 4:11.94 test_mpi_accs
45009 root  20  0 81244 36612 6876 R 82.6  0.0 4:11.46 test_mpi_accs
==> Attached with GDB:
(gdb) attach 45007
Attaching to process 45007
Reading symbols from /opt/armci-mpi-master/tests/mpi/test_mpi_accs...done.
Reading symbols from /opt/mpich/lib/libmpi.so.12...(no debugging symbols found)...done.
Loaded symbols for /opt/mpich/lib/libmpi.so.12
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 45022]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /opt/conda/lib/libgfortran.so.4...done.
Loaded symbols for /opt/conda/lib/libgfortran.so.4
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /opt/conda/lib/libgcc_s.so.1...done.
Loaded symbols for /opt/conda/lib/libgcc_s.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /opt/conda/lib/libquadmath.so.0...done.
Loaded symbols for /opt/conda/lib/libquadmath.so.0
0x00007fb9ed0c27f3 in sock_pe_progress_tx_entry () from /opt/mpich/lib/libmpi.so.12
Missing separate debuginfos, use: debuginfo-install glibc-2.17-307.el7.1.x86_64
(gdb) where
#0 0x00007fb9ed0c27f3 in sock_pe_progress_tx_entry () from /opt/mpich/lib/libmpi.so.12
#1 0x00007fb9ed0c3d7b in sock_pe_progress_tx_ctx () from /opt/mpich/lib/libmpi.so.12
#2 0x00007fb9ed0c5056 in sock_pe_progress_ep_tx () from /opt/mpich/lib/libmpi.so.12
#3 0x00007fb9ed0b28b2 in sock_cq_progress () from /opt/mpich/lib/libmpi.so.12
#4 0x00007fb9ed0b2f21 in sock_cq_sreadfrom () from /opt/mpich/lib/libmpi.so.12
#5 0x00007fb9ecba1612 in MPIDI_CH4R_mpi_win_unlock () from /opt/mpich/lib/libmpi.so.12
#6 0x00007fb9ecbe0c23 in PMPI_Win_unlock () from /opt/mpich/lib/libmpi.so.12
#7 0x0000555a27b38356 in main (argc=1, argv=0x7ffc89e639b8) at tests/mpi/test_mpi_accs.c:48
(gdb) attach 45009
Attaching to program: /opt/armci-mpi-master/tests/mpi/test_mpi_accs, process 45009
Reading symbols from /opt/mpich/lib/libmpi.so.12...(no debugging symbols found)...done.
Loaded symbols for /opt/mpich/lib/libmpi.so.12
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 45023]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /opt/conda/lib/libgfortran.so.4...done.
Loaded symbols for /opt/conda/lib/libgfortran.so.4
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /opt/conda/lib/libgcc_s.so.1...done.
Loaded symbols for /opt/conda/lib/libgcc_s.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /opt/conda/lib/libquadmath.so.0...done.
Loaded symbols for /opt/conda/lib/libquadmath.so.0
0x00007f3404d7baeb in recv () from /lib64/libpthread.so.0
(gdb) where
#0 0x00007f3404d7baeb in recv () from /lib64/libpthread.so.0
#1 0x00007f3406008f41 in sock_comm_peek () from /opt/mpich/lib/libmpi.so.12
#2 0x00007f3406003e42 in sock_pe_progress_rx_pe_entry () from /opt/mpich/lib/libmpi.so.12
#3 0x00007f3406006b4e in sock_pe_progress_rx_ctx () from /opt/mpich/lib/libmpi.so.12
#4 0x00007f3406006c36 in sock_pe_progress_ep_rx () from /opt/mpich/lib/libmpi.so.12
#5 0x00007f3405ff58f3 in sock_cq_progress () from /opt/mpich/lib/libmpi.so.12
#6 0x00007f3405ff5f21 in sock_cq_sreadfrom () from /opt/mpich/lib/libmpi.so.12
#7 0x00007f3405ae4612 in MPIDI_CH4R_mpi_win_unlock () from /opt/mpich/lib/libmpi.so.12
#8 0x00007f3405b23c23 in PMPI_Win_unlock () from /opt/mpich/lib/libmpi.so.12
#9 0x000055950d909356 in main (argc=1, argv=0x7ffc580e9478) at tests/mpi/test_mpi_accs.c:48
The hang is intermittent; it does not happen on every run.
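To compare the two ranks without the addresses getting in the way, the backtraces can be reduced to frame numbers and function names. This is only a sketch; frames_only is a hypothetical helper, and it assumes every frame line carries an address, as they all do in the traces above.

```shell
#!/bin/bash
# Hypothetical helper: reduce gdb "where" output on stdin to
# "<frame number> <function name>" lines so two ranks can be diffed.
# Lines that are not "#N 0x... in func ..." frames are dropped.
frames_only() {
    sed -n -E 's/^#([0-9]+) +0x[0-9a-f]+ in ([^ ]+) .*/\1 \2/p'
}
```

Applied to the two traces above, both ranks end in the same frames (sock_cq_progress, sock_cq_sreadfrom, MPIDI_CH4R_mpi_win_unlock, PMPI_Win_unlock, main). They differ only at the top, where process 45007 is in the tx progress path and process 45009 is in recv via the rx progress path.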
The MPICH version:
root@wahab-01:/opt/mpich/bin# ./mpichversion
MPICH Version: 3.3.2
MPICH Release date: Tue Nov 12 21:23:16 CST 2019
MPICH Device: ch4:ofi
MPICH configure: --prefix=/opt/mpich --with-device=ch4:ofi
MPICH CC: x86_64-conda_cos6-linux-gnu-gcc -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include -O2
MPICH CXX: x86_64-conda_cos6-linux-gnu-g++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include -O2
MPICH F77: /opt/conda/bin/x86_64-conda_cos6-linux-gnu-gfortran -fopenmp -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include -O2
MPICH FC: x86_64-conda_cos6-linux-gnu-gfortran -O2
MPICH Custom Information:
Line 48 in the traces above is the very first call to MPI_Win_unlock in the code (in the first loop).
It seems that it is stuck in the libfabric sockets conduit. I have never seen this before.
Try running with MPICH_ASYNC_PROGRESS=1 (no need to recompile anything). If that works, we have a clue.
If that works, can you then rebuild ARMCI-MPI with configure --with-progress .. and run with ARMCI_PROGRESS_THREAD=1 and ARMCI_PROGRESS_USLEEP=100? That is a second clue.
If neither works, I have to assume the container situation has broken something. I do not use containers so I don't know how to debug this.
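The experiments suggested above differ only in their environment, so they can be run back to back with a small wrapper. A sketch, where try_config is a hypothetical name and the mpiexec command line is the one used elsewhere in this thread:

```shell
#!/bin/bash
# Hypothetical wrapper: run a command with extra environment variables
# set for that one run only, printing a label first.
# Usage: try_config LABEL [VAR=value ...] COMMAND [ARGS ...]
try_config() {
    local label=$1
    shift
    echo "== $label"
    # env applies the leading VAR=value words to the command only.
    env "$@"
}

# Example invocations (commented out; they need MPICH and the test binary):
# try_config "baseline"        mpiexec -n 2 ./test_mpi_accs
# try_config "async progress"  MPICH_ASYNC_PROGRESS=1 mpiexec -n 2 ./test_mpi_accs
# try_config "progress thread" ARMCI_PROGRESS_THREAD=1 ARMCI_PROGRESS_USLEEP=100 \
#     mpiexec -n 2 ./test_mpi_accs
```

The third configuration is only meaningful after ARMCI-MPI has been rebuilt with progress support, as described above.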
With MPICH_ASYNC_PROGRESS=1 the result is rather erratic: it sometimes works, but about 25% of the time it gets stuck. Here is a test script that launches the same program repeatedly and times each run:
#!/bin/bash
# second test: with MPICH_ASYNC_PROGRESS=1
source /etc/container_runtime/runtime
load_mpich
: ${COUNT:=100}
export MPICH_ASYNC_PROGRESS=1
for c in $(seq 1 $COUNT); do
    echo "Iteration $c - $(date +%Y-%m-%dT%H:%M:%S)"
    tm_start=$(date +%s.%N)
    mpiexec -n 2 ./test_mpi_accs
    tm_end=$(date +%s.%N)
    awk -v tm_start="$tm_start" -v tm_end="$tm_end" -v c="$c" \
        'BEGIN { printf("- iteration %3d time = %.3f secs\n", c, tm_end - tm_start) }'
    sleep 5s
done
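Because a stuck run otherwise has to be interrupted by hand, the loop could wrap mpiexec in the coreutils timeout command so that it runs unattended. A sketch, assuming GNU timeout is available; run_with_timeout and the 60-second limit in the usage line are illustrative choices, not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper: run a command under a time limit (in seconds)
# and report whether it finished, hung, or failed.
run_with_timeout() {
    local limit=$1
    shift
    local status=0
    # timeout sends TERM after $limit seconds, then KILL 5 seconds later
    # if the command is still alive.
    timeout -k 5 "$limit" "$@" || status=$?
    case $status in
        0)       echo "ok" ;;
        124|137) echo "stuck" ;;   # 124: timed out; 137: had to be killed
        *)       echo "failed($status)" ;;
    esac
}
```

Inside the loop, `run_with_timeout 60 mpiexec -n 2 ./test_mpi_accs` would take the place of the bare mpiexec line.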
Here is example output, filtered to the relevant lines. It got stuck three times (iterations 3, 12, and 14), and each time I had to interrupt it with Ctrl-C:
- iteration 1 time = 1.089 secs
- iteration 2 time = 2.035 secs
^C[mpiexec@wahab-01] Sending Ctrl-C to processes as requested
[mpiexec@wahab-01] Press Ctrl-C again to force abort
- iteration 3 time = 15.499 secs
- iteration 4 time = 2.090 secs
- iteration 5 time = 1.494 secs
- iteration 6 time = 2.926 secs
- iteration 7 time = 3.569 secs
- iteration 8 time = 1.292 secs
- iteration 9 time = 1.506 secs
- iteration 10 time = 1.785 secs
- iteration 11 time = 2.163 secs
^C[mpiexec@wahab-01] Sending Ctrl-C to processes as requested
[mpiexec@wahab-01] Press Ctrl-C again to force abort
- iteration 12 time = 215.610 secs
- iteration 13 time = 2.391 secs
^C[mpiexec@wahab-01] Sending Ctrl-C to processes as requested
[mpiexec@wahab-01] Press Ctrl-C again to force abort
- iteration 14 time = 24.370 secs
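The proportion of stuck runs can be read off such a log mechanically. A sketch, assuming the "- iteration N time = T secs" line format printed by the script above; summarize_runs and the threshold argument are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read the timing log on stdin and count the runs
# whose wall time exceeded the given threshold in seconds.
summarize_runs() {
    local threshold=$1
    awk -v t="$threshold" '
        # Field 6 is the elapsed time in "- iteration N time = T secs".
        /^- iteration/ { total++; if ($6 + 0 > t) slow++ }
        END { printf("%d of %d runs exceeded %s secs\n", slow, total, t) }'
}
```

With a 10-second threshold, the 14 iterations shown above yield 3 slow runs, namely the ones that were interrupted.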
On a related note, when I tried to build armci-mpi outside the container, I ran into a different problem, which I posted as a separate issue: #32.