Comments (6)
I have an idea of the problem. If MPICH fails and Open-MPI succeeds, then I suspect the MPICH datatypes code is broken.
Can you set the MPICH build to also use ARMCI_STRIDED_METHOD=IOV and ARMCI_IOV_METHOD=BATCHED on the s390x config?
from armci-mpi.
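One way to apply the requested settings for a single test run is sketched below. It assumes the ARMCI_* variables are read from the environment at run time and that the MPI launcher forwards the environment to the ranks. The helper names `build_env` and `run_check` and the `build_dir` argument are hypothetical, introduced only for illustration.

```python
import os
import subprocess

# Settings named in the request above; environment values must be strings.
ARMCI_OVERRIDES = {
    "ARMCI_STRIDED_METHOD": "IOV",
    "ARMCI_IOV_METHOD": "BATCHED",
}


def build_env(overrides, base=None):
    """Return a copy of the environment with the given ARMCI settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


def run_check(build_dir, overrides):
    """Run `make check` in build_dir (hypothetical path) with the overrides set.

    Assumption: the test programs pick the ARMCI_* variables up at start-up.
    """
    completed = subprocess.run(
        ["make", "check"], cwd=build_dir, env=build_env(overrides), check=False
    )
    return completed.returncode
```

Calling `run_check("build-mpich", ARMCI_OVERRIDES)` would then run the test suite with both variables set, leaving the calling shell's environment untouched.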
With ARMCI_STRIDED_METHOD=IOV and ARMCI_IOV_METHOD=BATCHED, the five mpi tests still fail with the same error message (including test_mpi_indexed_gets reporting the different symptom), but the other 11 tests pass:
/usr/bin/make check-TESTS
make[3]: Entering directory '/home/dparsons/armci/armci-mpi-0.3.1~beta/build-mpich'
make[4]: Entering directory '/home/dparsons/armci/armci-mpi-0.3.1~beta/build-mpich'
PASS: benchmarks/ping-pong
PASS: benchmarks/ring-flood
PASS: benchmarks/contiguous-bench
PASS: benchmarks/strided-bench
PASS: benchmarks/rmw_perf
PASS: tests/test_onesided
PASS: tests/test_onesided_shared
PASS: tests/test_onesided_shared_dla
PASS: tests/test_mutex
PASS: tests/test_mutex_rmw
PASS: tests/test_mutex_trylock
PASS: tests/test_malloc_irreg
PASS: tests/ARMCI_PutS_latency
PASS: tests/ARMCI_AccS_latency
PASS: tests/test_groups
PASS: tests/test_group_split
PASS: tests/test_malloc_group
PASS: tests/test_accs
PASS: tests/test_accs_dla
PASS: tests/test_puts
PASS: tests/test_puts_gets
PASS: tests/test_puts_gets_dla
PASS: tests/test_putv
PASS: tests/test_igop
PASS: tests/test_rmw_fadd
PASS: tests/test_parmci
PASS: tests/mpi/test_mpi_accs
FAIL: tests/mpi/test_mpi_dim
FAIL: tests/mpi/test_mpi_indexed_accs
FAIL: tests/mpi/test_mpi_indexed_gets
FAIL: tests/mpi/test_mpi_indexed_puts_gets
FAIL: tests/mpi/test_mpi_subarray_accs
PASS: tests/mpi/test_win_create
PASS: tests/mpi/test_win_model
PASS: tests/ctree/ctree_test
PASS: tests/ctree/ctree_test_rand
PASS: tests/ctree/ctree_test_rand_interval
PASS: tests/contrib/armci-perf
PASS: tests/contrib/armci-test
PASS: tests/contrib/lu/lu-block
PASS: tests/contrib/lu/lu-b-bc
PASS: tests/contrib/transp1D/transp1D-c
PASS: tests/contrib/non-blocking/simple
============================================================================
Testsuite summary for armci 0.1
============================================================================
# TOTAL: 43
# PASS: 38
# SKIP: 0
# XFAIL: 0
# FAIL: 5
# XPASS: 0
# ERROR: 0
There's a small variation in the PMPI function triggering the error. test_mpi_dim references PMPI_Accumulate:
FAIL: tests/mpi/test_mpi_dim
============================
MPI test program (2 processes)
Testing strided gets and puts
(Only std output for process 0 is printed)
--------array[5]--------
local[1:3] -> remote[0:2] -> local[1:3]
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ff7e2b3d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1fc89e) [0x3ff7e1fc89e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1c6774) [0x3ff7e1c6774]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1cce1c) [0x3ff7e1cce1c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x256b2e) [0x3ff7e256b2e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2598e6) [0x3ff7e2598e6]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25be40) [0x3ff7e25be40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Accumulate+0xa94) [0x3ff7e0f9044]
./tests/mpi/test_mpi_dim(+0x2980) [0x2aa1bf02980]
./tests/mpi/test_mpi_dim(main+0x6a) [0x2aa1bf0123a]
/lib/s390x-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x3ff7de24c5e]
./tests/mpi/test_mpi_dim(+0x1314) [0x2aa1bf01314]
internal ABORT - process 0
FAIL tests/mpi/test_mpi_dim (exit status: 1)
while the other 3 (apart from test_mpi_indexed_gets) reference PMPI_Win_unlock, e.g.
FAIL: tests/mpi/test_mpi_indexed_accs
=====================================
MPI RMA Strided Accumulate Test:
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ff870b3d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1fc89e) [0x3ff86ffc89e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1c6774) [0x3ff86fc6774]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1cce1c) [0x3ff86fcce1c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x24dfde) [0x3ff8704dfde]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x270a40) [0x3ff87070a40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x29125c) [0x3ff8709125c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x24fd46) [0x3ff8704fd46]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x251b20) [0x3ff87051b20]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25577a) [0x3ff8705577a]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x255ab6) [0x3ff87055ab6]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x238822) [0x3ff87038822]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x28c87e) [0x3ff8708c87e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2539e2) [0x3ff870539e2]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x26237c) [0x3ff8706237c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Win_unlock+0x310) [0x3ff86f0f1c0]
./tests/mpi/test_mpi_indexed_accs(main+0x21e) [0x2aa0d180fa6]
/lib/s390x-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x3ff86c24c5e]
./tests/mpi/test_mpi_indexed_accs(+0x1314) [0x2aa0d181314]
internal ABORT - process 0
FAIL tests/mpi/test_mpi_indexed_accs (exit status: 1)
(likewise test_mpi_indexed_puts_gets and test_mpi_subarray_accs)
In the original build log, test_mpi_indexed_accs referenced PMPI_Accumulate rather than PMPI_Win_unlock, though the other two already referenced PMPI_Win_unlock.
from armci-mpi.
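Comparing which PMPI entry point appears in each backtrace is easier with a small script than by eye. The following is a minimal sketch that assumes the log format shown above (a `FAIL: <test>` header followed by backtrace frames). The function name `pmpi_entry_points` is hypothetical, and the sample text is abbreviated from the logs in this thread.

```python
import re

# "FAIL: <test>" opens a per-test section in the automake log.
FAIL_HEADER = re.compile(r"^FAIL: (\S+)$")
# Backtrace frames that resolve to a public symbol look like "(PMPI_Xxx+0x...)".
PMPI_FRAME = re.compile(r"\((PMPI_\w+)\+0x[0-9a-f]+\)")


def pmpi_entry_points(log_text):
    """Map each failing test to the first PMPI_* symbol in its backtrace."""
    result = {}
    current = None
    for line in log_text.splitlines():
        header = FAIL_HEADER.match(line.strip())
        if header:
            current = header.group(1)
            result.setdefault(current, None)
            continue
        frame = PMPI_FRAME.search(line)
        if frame and current is not None and result[current] is None:
            result[current] = frame.group(1)
    return result


sample = """\
FAIL: tests/mpi/test_mpi_dim
============================
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25be40) [0x3ff7e25be40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Accumulate+0xa94) [0x3ff7e0f9044]
FAIL tests/mpi/test_mpi_dim (exit status: 1)
FAIL: tests/mpi/test_mpi_indexed_accs
=====================================
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Win_unlock+0x310) [0x3ff86f0f1c0]
"""

for test, symbol in pmpi_entry_points(sample).items():
    print(test, symbol)
```

Tests that fail without any resolved PMPI frame stay mapped to `None`, so a different symptom such as the one reported for test_mpi_indexed_gets would still be listed.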
Actually, I need to report that it might not be so straightforward. When I manually rebuild the original configuration on an s390x porterbox, without adding ARMCI_STRIDED_METHOD=IOV and ARMCI_IOV_METHOD=BATCHED, I get the same result: the five test_mpi_* tests fail for mpich and the other tests pass. Between the original build's test errors and today's tests, our mpich was upgraded from 4.0 to 4.0.1, which may explain why the other tests now pass.
Without the extra flags, the test_mpi_indexed_accs failure is triggered from PMPI_Accumulate, as before, not from PMPI_Win_unlock:
FAIL: tests/mpi/test_mpi_indexed_accs
=====================================
MPI RMA Strided Accumulate Test:
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ffbbbb3d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ff8b133d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1fc89e) [0x3ff8b07c89e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1c6774) [0x3ff8b046774]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1cce1c) [0x3ff8b04ce1c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x24dfde) [0x3ff8b0cdfde]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x270a40) [0x3ff8b0f0a40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x29125c) [0x3ff8b11125c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x24fd46) [0x3ff8b0cfd46]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x251b20) [0x3ff8b0d1b20]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25577a) [0x3ff8b0d577a]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x255ab6) [0x3ff8b0d5ab6]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x238822) [0x3ff8b0b8822]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x28c87e) [0x3ff8b10c87e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2539e2) [0x3ff8b0d39e2]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25942e) [0x3ff8b0d942e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25be40) [0x3ff8b0dbe40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Accumulate+0xa94) [0x3ff8af79044]
./tests/mpi/test_mpi_indexed_accs(main+0x20e) [0x2aa25d80f96]
/lib/s390x-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x3ff8aca4c5e]
./tests/mpi/test_mpi_indexed_accs(+0x1314) [0x2aa25d81314]
internal ABORT - process 0
FAIL tests/mpi/test_mpi_indexed_accs (exit status: 1)
from armci-mpi.
Can you try again with ARMCI_IOV_METHOD=CONSRV, ARMCI_IOV_CHECKS=1, ARMCI_SHR_BUF_METHOD=COPY, ARMCI_RMA_NOCHECK=0, and ARMCI_NO_FLUSH_LOCAL=1? Those are the most conservative settings I can come up with, and might reveal something.
from armci-mpi.
Hmm, with those settings (without ARMCI_STRIDED_METHOD=IOV) I'm back to 15 failures:
FAIL: benchmarks/strided-bench
FAIL: tests/ARMCI_PutS_latency
FAIL: tests/ARMCI_AccS_latency
FAIL: tests/test_accs
FAIL: tests/test_accs_dla
FAIL: tests/test_puts
FAIL: tests/test_puts_gets
FAIL: tests/test_puts_gets_dla
FAIL: tests/mpi/test_mpi_dim
FAIL: tests/mpi/test_mpi_indexed_accs
FAIL: tests/mpi/test_mpi_indexed_gets
FAIL: tests/mpi/test_mpi_indexed_puts_gets
FAIL: tests/mpi/test_mpi_subarray_accs
FAIL: tests/contrib/armci-perf
FAIL: tests/contrib/armci-test
with slightly more error output, which just adds a short description of the test:
FAIL: benchmarks/strided-bench
==============================
Starting one-sided strided performance test with 2 processes
Trg. Rank Xdim Ydim Get (usec) Put (usec) Acc (usec) Get (MiB/s) Put (MiB/s) Acc (MiB/s)
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ff83333d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1fc89e) [0x3ff8327c89e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1c6774) [0x3ff83246774]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1cce1c) [0x3ff8324ce1c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x256b2e) [0x3ff832d6b2e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2598e6) [0x3ff832d98e6]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25be40) [0x3ff832dbe40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Accumulate+0xa94) [0x3ff83179044]
./benchmarks/strided-bench(+0x43ee) [0x2aa37e843ee]
./benchmarks/strided-bench(+0x5828) [0x2aa37e85828]
./benchmarks/strided-bench(main+0x2ea) [0x2aa37e82f32]
/lib/s390x-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x3ff82e24c5e]
./benchmarks/strided-bench(+0x31f4) [0x2aa37e831f4]
internal ABORT - process 0
FAIL benchmarks/strided-bench (exit status: 1)
FAIL: tests/ARMCI_PutS_latency
==============================
ARMCI_PutS Latency - local and remote completions - in usec
Dimensions(array of doubles) Latency-LocalCompeltion Latency-RemoteCompletion
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ffb38b3d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1fc89e) [0x3ffb37fc89e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1c6774) [0x3ffb37c6774]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1cce1c) [0x3ffb37cce1c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x256b2e) [0x3ffb3856b2e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2598e6) [0x3ffb38598e6]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25be40) [0x3ffb385be40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Accumulate+0xa94) [0x3ffb36f9044]
./tests/ARMCI_PutS_latency(+0x45be) [0x2aa1e3045be]
./tests/ARMCI_PutS_latency(+0x59f8) [0x2aa1e3059f8]
./tests/ARMCI_PutS_latency(main+0x1ae) [0x2aa1e302e96]
/lib/s390x-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x3ffb33a4c5e]
./tests/ARMCI_PutS_latency(+0x33c4) [0x2aa1e3033c4]
internal ABORT - process 0
FAIL tests/ARMCI_PutS_latency (exit status: 1)
from armci-mpi.
If I activate ARMCI_STRIDED_METHOD=IOV alongside ARMCI_IOV_METHOD=CONSRV, ARMCI_IOV_CHECKS=1, ARMCI_SHR_BUF_METHOD=COPY, ARMCI_RMA_NOCHECK=0, and ARMCI_NO_FLUSH_LOCAL=1, then I'm back to the 5 failures.
from armci-mpi.
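The results reported in this thread can be collected in one place. The sketch below only records the failure counts stated in the comments above (MPICH 4.0.1 on s390x); nothing in it is newly measured. The names `observed` and `failures_by_strided_method` are hypothetical.

```python
# The "conservative" settings suggested earlier in the thread.
CONSERVATIVE = {
    "ARMCI_IOV_METHOD": "CONSRV",
    "ARMCI_IOV_CHECKS": "1",
    "ARMCI_SHR_BUF_METHOD": "COPY",
    "ARMCI_RMA_NOCHECK": "0",
    "ARMCI_NO_FLUSH_LOCAL": "1",
}
STRIDED_IOV = {"ARMCI_STRIDED_METHOD": "IOV"}

# (label, environment overrides, failing tests as reported in the thread)
observed = [
    ("no extra flags", {}, 5),
    ("strided IOV + batched IOV", {**STRIDED_IOV, "ARMCI_IOV_METHOD": "BATCHED"}, 5),
    ("conservative", dict(CONSERVATIVE), 15),
    ("conservative + strided IOV", {**CONSERVATIVE, **STRIDED_IOV}, 5),
]


def failures_by_strided_method(rows):
    """Group reported failure counts by whether ARMCI_STRIDED_METHOD=IOV was set."""
    groups = {True: [], False: []}
    for _label, env, failures in rows:
        groups[env.get("ARMCI_STRIDED_METHOD") == "IOV"].append(failures)
    return groups


for label, env, failures in observed:
    print(f"{label:28s} {failures:2d} failures")
```

Grouped this way, every run with ARMCI_STRIDED_METHOD=IOV shows the same 5 failures, while the runs without it show 5 or 15 depending on the other settings.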