mpich test failures on s390x (armci-mpi, open)

pmodels commented on June 14, 2024
mpich test failures on s390x

from armci-mpi.

Comments (6)

jeffhammond commented on June 14, 2024

I have an idea of the problem. If MPICH fails and Open-MPI succeeds, then I suspect the MPICH datatypes code is broken.

Can you set the MPICH build to also use ARMCI_STRIDED_METHOD=IOV and ARMCI_IOV_METHOD=BATCHED on the s390x config?
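For anyone reproducing this, the request above could be scripted roughly as follows. This is only a sketch and not part of armci-mpi: the helper names `armci_env` and `run_check` are hypothetical, and it assumes the tests are started with `make check` from the MPICH build directory. Only the two variable names and their values come from this thread.

```python
import os
import subprocess

# Settings requested in this thread; everything else here is illustrative.
ARMCI_SETTINGS = {
    "ARMCI_STRIDED_METHOD": "IOV",
    "ARMCI_IOV_METHOD": "BATCHED",
}


def armci_env(settings, base=None):
    """Return a copy of the base environment with the ARMCI settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(settings)
    return env


def run_check(build_dir, settings=ARMCI_SETTINGS):
    """Run `make check` in build_dir (an assumed path) with the settings exported."""
    result = subprocess.run(
        ["make", "check"], cwd=build_dir, env=armci_env(settings), check=False
    )
    return result.returncode
```

Building the environment as a copy keeps the calling shell untouched, so runs with different settings do not leak into each other.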

drew-parsons commented on June 14, 2024

With ARMCI_STRIDED_METHOD=IOV and ARMCI_IOV_METHOD=BATCHED, the five mpi tests still fail with the same error message (including test_mpi_indexed_gets reporting the different symptom), but the other 11 tests pass:

/usr/bin/make  check-TESTS
make[3]: Entering directory '/home/dparsons/armci/armci-mpi-0.3.1~beta/build-mpich'
make[4]: Entering directory '/home/dparsons/armci/armci-mpi-0.3.1~beta/build-mpich'
PASS: benchmarks/ping-pong
PASS: benchmarks/ring-flood
PASS: benchmarks/contiguous-bench
PASS: benchmarks/strided-bench
PASS: benchmarks/rmw_perf
PASS: tests/test_onesided
PASS: tests/test_onesided_shared
PASS: tests/test_onesided_shared_dla
PASS: tests/test_mutex
PASS: tests/test_mutex_rmw
PASS: tests/test_mutex_trylock
PASS: tests/test_malloc_irreg
PASS: tests/ARMCI_PutS_latency
PASS: tests/ARMCI_AccS_latency
PASS: tests/test_groups
PASS: tests/test_group_split
PASS: tests/test_malloc_group
PASS: tests/test_accs
PASS: tests/test_accs_dla
PASS: tests/test_puts
PASS: tests/test_puts_gets
PASS: tests/test_puts_gets_dla
PASS: tests/test_putv
PASS: tests/test_igop
PASS: tests/test_rmw_fadd
PASS: tests/test_parmci
PASS: tests/mpi/test_mpi_accs
FAIL: tests/mpi/test_mpi_dim
FAIL: tests/mpi/test_mpi_indexed_accs
FAIL: tests/mpi/test_mpi_indexed_gets
FAIL: tests/mpi/test_mpi_indexed_puts_gets
FAIL: tests/mpi/test_mpi_subarray_accs
PASS: tests/mpi/test_win_create
PASS: tests/mpi/test_win_model
PASS: tests/ctree/ctree_test
PASS: tests/ctree/ctree_test_rand
PASS: tests/ctree/ctree_test_rand_interval
PASS: tests/contrib/armci-perf
PASS: tests/contrib/armci-test
PASS: tests/contrib/lu/lu-block
PASS: tests/contrib/lu/lu-b-bc
PASS: tests/contrib/transp1D/transp1D-c
PASS: tests/contrib/non-blocking/simple
============================================================================
Testsuite summary for armci 0.1
============================================================================
# TOTAL: 43
# PASS:  38
# SKIP:  0
# XFAIL: 0
# FAIL:  5
# XPASS: 0
# ERROR: 0
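
To compare runs like the one above without counting by hand, the per-test result lines can be tallied with a few lines of Python. This is a minimal sketch: the function name `tally` is hypothetical, and it assumes the input holds only the `PASS:`/`FAIL:` result lines. Feeding it a full log that repeats `FAIL: <test>` as a section header would count that test twice.

```python
import re
from collections import Counter

# One result line per test, e.g. "FAIL: tests/mpi/test_mpi_dim".
RESULT_LINE = re.compile(r"^(PASS|FAIL|SKIP|XFAIL|XPASS|ERROR): (\S+)$")


def tally(log_text):
    """Return (counts per status, names of failed tests) from the result lines."""
    counts = Counter()
    failed = []
    for line in log_text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            status, name = match.groups()
            counts[status] += 1
            if status == "FAIL":
                failed.append(name)
    return counts, failed


# A few lines copied from the run above.
sample = """\
PASS: tests/mpi/test_mpi_accs
FAIL: tests/mpi/test_mpi_dim
FAIL: tests/mpi/test_mpi_indexed_accs
PASS: tests/mpi/test_win_create
"""
counts, failed = tally(sample)
```

Summary lines such as `# FAIL:  5` start with `#`, so they do not match the pattern and are ignored.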

There's a small variation in the PMPI function triggering the error. test_mpi_dim references PMPI_Accumulate:

FAIL: tests/mpi/test_mpi_dim
============================

MPI test program (2 processes)

Testing strided gets and puts
(Only std output for process 0 is printed)

--------array[5]--------
local[1:3] -> remote[0:2] -> local[1:3] 
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ff7e2b3d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1fc89e) [0x3ff7e1fc89e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1c6774) [0x3ff7e1c6774]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1cce1c) [0x3ff7e1cce1c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x256b2e) [0x3ff7e256b2e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2598e6) [0x3ff7e2598e6]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25be40) [0x3ff7e25be40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Accumulate+0xa94) [0x3ff7e0f9044]
./tests/mpi/test_mpi_dim(+0x2980) [0x2aa1bf02980]
./tests/mpi/test_mpi_dim(main+0x6a) [0x2aa1bf0123a]
/lib/s390x-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x3ff7de24c5e]
./tests/mpi/test_mpi_dim(+0x1314) [0x2aa1bf01314]
internal ABORT - process 0
FAIL tests/mpi/test_mpi_dim (exit status: 1)

while the other 3 (apart from test_mpi_indexed_gets) reference PMPI_Win_unlock, e.g.

FAIL: tests/mpi/test_mpi_indexed_accs
=====================================

MPI RMA Strided Accumulate Test:
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ff870b3d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1fc89e) [0x3ff86ffc89e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1c6774) [0x3ff86fc6774]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1cce1c) [0x3ff86fcce1c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x24dfde) [0x3ff8704dfde]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x270a40) [0x3ff87070a40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x29125c) [0x3ff8709125c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x24fd46) [0x3ff8704fd46]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x251b20) [0x3ff87051b20]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25577a) [0x3ff8705577a]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x255ab6) [0x3ff87055ab6]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x238822) [0x3ff87038822]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x28c87e) [0x3ff8708c87e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2539e2) [0x3ff870539e2]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x26237c) [0x3ff8706237c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Win_unlock+0x310) [0x3ff86f0f1c0]
./tests/mpi/test_mpi_indexed_accs(main+0x21e) [0x2aa0d180fa6]
/lib/s390x-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x3ff86c24c5e]
./tests/mpi/test_mpi_indexed_accs(+0x1314) [0x2aa0d181314]
internal ABORT - process 0
FAIL tests/mpi/test_mpi_indexed_accs (exit status: 1)

(likewise test_mpi_indexed_puts_gets and test_mpi_subarray_accs)
In the original build log, test_mpi_indexed_accs referenced PMPI_Accumulate, not PMPI_Win_unlock, though the other 2 already referenced PMPI_Win_unlock.
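
Since the only visible difference between these traces is which public MPI call sits at the bottom of the libmpich frames, that symbol can be pulled out mechanically. The following is a rough sketch, with `mpi_entry_point` as a hypothetical helper name, and it assumes the glibc-style `library(symbol+0xoffset) [address]` frame format seen in the traces above.

```python
import re

# Matches named frames such as "libmpich.so.12(PMPI_Accumulate+0xa94) [0x...]".
# Anonymous frames like "(+0x2b3d76)" have no symbol and are skipped.
FRAME = re.compile(r"\((?P<symbol>[A-Za-z_]\w*)\+0x[0-9a-f]+\)")


def mpi_entry_point(backtrace):
    """Return the first named PMPI_* symbol in a backtrace, or None if absent."""
    for line in backtrace.splitlines():
        match = FRAME.search(line)
        if match and match.group("symbol").startswith("PMPI_"):
            return match.group("symbol")
    return None


# Frames copied from the test_mpi_dim trace above.
trace_dim = """\
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25be40) [0x3ff7e25be40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Accumulate+0xa94) [0x3ff7e0f9044]
./tests/mpi/test_mpi_dim(main+0x6a) [0x2aa1bf0123a]
"""
entry = mpi_entry_point(trace_dim)
```

Running this over each failing test's trace gives a one-word label per test, which makes it easier to see when a test switches between PMPI_Accumulate and PMPI_Win_unlock across configurations.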

drew-parsons commented on June 14, 2024

Actually, I need to report that it might not be so straightforward. When I manually rebuild the original configuration on an s390x porterbox, without adding ARMCI_STRIDED_METHOD=IOV and ARMCI_IOV_METHOD=BATCHED, I get the same result: the five test_mpi_* tests fail for mpich and the other tests pass. Between the original build's test errors and today's tests, our mpich was upgraded from 4.0 to 4.0.1, which might explain why the other tests now pass.

Without the extra flags, the test_mpi_indexed_accs failure is triggered from PMPI_Accumulate, as before, not from PMPI_Win_unlock:

FAIL: tests/mpi/test_mpi_indexed_accs
=====================================

MPI RMA Strided Accumulate Test:
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ffbbbb3d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ff8b133d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1fc89e) [0x3ff8b07c89e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1c6774) [0x3ff8b046774]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1cce1c) [0x3ff8b04ce1c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x24dfde) [0x3ff8b0cdfde]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x270a40) [0x3ff8b0f0a40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x29125c) [0x3ff8b11125c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x24fd46) [0x3ff8b0cfd46]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x251b20) [0x3ff8b0d1b20]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25577a) [0x3ff8b0d577a]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x255ab6) [0x3ff8b0d5ab6]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x238822) [0x3ff8b0b8822]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x28c87e) [0x3ff8b10c87e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2539e2) [0x3ff8b0d39e2]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25942e) [0x3ff8b0d942e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25be40) [0x3ff8b0dbe40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Accumulate+0xa94) [0x3ff8af79044]
./tests/mpi/test_mpi_indexed_accs(main+0x20e) [0x2aa25d80f96]
/lib/s390x-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x3ff8aca4c5e]
./tests/mpi/test_mpi_indexed_accs(+0x1314) [0x2aa25d81314]
internal ABORT - process 0
FAIL tests/mpi/test_mpi_indexed_accs (exit status: 1)

jeffhammond commented on June 14, 2024

Can you try again with ARMCI_IOV_METHOD=CONSRV, ARMCI_IOV_CHECKS=1, ARMCI_SHR_BUF_METHOD=COPY, ARMCI_RMA_NOCHECK=0, and ARMCI_NO_FLUSH_LOCAL=1? Those are the most conservative settings I can come up with, and might reveal something.

drew-parsons commented on June 14, 2024

Hmm, with those settings (without ARMCI_STRIDED_METHOD=IOV) I'm back to 15 failures:

FAIL: benchmarks/strided-bench
FAIL: tests/ARMCI_PutS_latency
FAIL: tests/ARMCI_AccS_latency
FAIL: tests/test_accs
FAIL: tests/test_accs_dla
FAIL: tests/test_puts
FAIL: tests/test_puts_gets
FAIL: tests/test_puts_gets_dla
FAIL: tests/mpi/test_mpi_dim
FAIL: tests/mpi/test_mpi_indexed_accs
FAIL: tests/mpi/test_mpi_indexed_gets
FAIL: tests/mpi/test_mpi_indexed_puts_gets
FAIL: tests/mpi/test_mpi_subarray_accs
FAIL: tests/contrib/armci-perf
FAIL: tests/contrib/armci-test

with a touch more error output, which just adds a short description of the test:

FAIL: benchmarks/strided-bench
==============================

Starting one-sided strided performance test with 2 processes
   Trg. Rank    Xdim Ydim   Get (usec)   Put (usec)   Acc (usec)  Get (MiB/s)  Put (MiB/s)  Acc (MiB/s)
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ff83333d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1fc89e) [0x3ff8327c89e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1c6774) [0x3ff83246774]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1cce1c) [0x3ff8324ce1c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x256b2e) [0x3ff832d6b2e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2598e6) [0x3ff832d98e6]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25be40) [0x3ff832dbe40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Accumulate+0xa94) [0x3ff83179044]
./benchmarks/strided-bench(+0x43ee) [0x2aa37e843ee]
./benchmarks/strided-bench(+0x5828) [0x2aa37e85828]
./benchmarks/strided-bench(main+0x2ea) [0x2aa37e82f32]
/lib/s390x-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x3ff82e24c5e]
./benchmarks/strided-bench(+0x31f4) [0x2aa37e831f4]
internal ABORT - process 0
FAIL benchmarks/strided-bench (exit status: 1)

FAIL: tests/ARMCI_PutS_latency
==============================

ARMCI_PutS Latency - local and remote completions - in usec 
  Dimensions(array of doubles) Latency-LocalCompeltion Latency-RemoteCompletion
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b3d76) [0x3ffb38b3d76]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1fc89e) [0x3ffb37fc89e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1c6774) [0x3ffb37c6774]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1cce1c) [0x3ffb37cce1c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x256b2e) [0x3ffb3856b2e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2598e6) [0x3ffb38598e6]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25be40) [0x3ffb385be40]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Accumulate+0xa94) [0x3ffb36f9044]
./tests/ARMCI_PutS_latency(+0x45be) [0x2aa1e3045be]
./tests/ARMCI_PutS_latency(+0x59f8) [0x2aa1e3059f8]
./tests/ARMCI_PutS_latency(main+0x1ae) [0x2aa1e302e96]
/lib/s390x-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x3ffb33a4c5e]
./tests/ARMCI_PutS_latency(+0x33c4) [0x2aa1e3033c4]
internal ABORT - process 0
FAIL tests/ARMCI_PutS_latency (exit status: 1)

drew-parsons commented on June 14, 2024

If I activate ARMCI_STRIDED_METHOD=IOV alongside ARMCI_IOV_METHOD=CONSRV, ARMCI_IOV_CHECKS=1, ARMCI_SHR_BUF_METHOD=COPY, ARMCI_RMA_NOCHECK=0, and ARMCI_NO_FLUSH_LOCAL=1 then I'm back to the 5 failures.
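
To make the effect of ARMCI_STRIDED_METHOD=IOV explicit, the two failure lists can be compared as sets. This is a sketch under one assumption: that "the 5 failures" are the same five tests/mpi tests that failed in the earlier run (the comment does not list them again). The variable names are illustrative only.

```python
# The 15 failures reported with the conservative settings and no ARMCI_STRIDED_METHOD=IOV.
without_iov = {
    "benchmarks/strided-bench",
    "tests/ARMCI_PutS_latency",
    "tests/ARMCI_AccS_latency",
    "tests/test_accs",
    "tests/test_accs_dla",
    "tests/test_puts",
    "tests/test_puts_gets",
    "tests/test_puts_gets_dla",
    "tests/mpi/test_mpi_dim",
    "tests/mpi/test_mpi_indexed_accs",
    "tests/mpi/test_mpi_indexed_gets",
    "tests/mpi/test_mpi_indexed_puts_gets",
    "tests/mpi/test_mpi_subarray_accs",
    "tests/contrib/armci-perf",
    "tests/contrib/armci-test",
}

# Assumed: the same five tests/mpi failures as in the earlier run with IOV enabled.
with_iov = {
    "tests/mpi/test_mpi_dim",
    "tests/mpi/test_mpi_indexed_accs",
    "tests/mpi/test_mpi_indexed_gets",
    "tests/mpi/test_mpi_indexed_puts_gets",
    "tests/mpi/test_mpi_subarray_accs",
}

# Tests that fail only when ARMCI_STRIDED_METHOD=IOV is not set.
only_without_iov = sorted(without_iov - with_iov)
```

Under that assumption, the tests that fail only without IOV are all ARMCI-level tests, while the tests/mpi programs fail under both configurations.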
