pmodels / armci-mpi

An implementation of ARMCI using MPI one-sided communication (RMA)

Home Page: https://wiki.mpich.org/armci-mpi/index.php/Main_Page

License: Other

Languages: Makefile 1.21%, Shell 2.02%, C 64.36%, M4 32.26%, Fortran 0.14%
Topics: armci, mpi, pgas, one-sided, global-arrays, mpi-library

armci-mpi's People

Contributors

ggouaillardet, jdinan, jeffhammond, minsii, pavanbalaji, raffenet, roblatham00, shawnccx, sthibaul


armci-mpi's Issues

use faster GMR lookup

An AVL tree would be faster than linked-list traversal for GMR lookup.

@jdinan noted:

Fortunately, ARMCI-MPI already has an AVL tree: https://github.com/pmodels/armci-mpi/blob/master/src/conflict_tree.c

Not sure why I didn't use this for GMR lookups. I suspect the reason is that when the number of memory regions is small, the performance improvement is negligible.
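The trade-off under discussion can be sketched in plain C. The type and function names below are illustrative stand-ins, not armci-mpi's actual GMR internals; the point is that the list walk is O(n) per lookup, where an AVL tree keyed on base address would be O(log n):

```c
#include <stddef.h>

/* Illustrative stand-in for armci-mpi's GMR metadata; field names are
 * hypothetical, not the actual gmr_t from the library. */
typedef struct gmr {
    void       *base;   /* start of the registered region    */
    size_t      size;   /* extent in bytes                   */
    struct gmr *next;   /* singly linked list of all regions */
} gmr_t;

/* O(n) lookup: walk the list until a region containing addr is found.
 * An AVL tree keyed on base address would make this O(log n). */
static gmr_t *gmr_lookup(gmr_t *head, const void *addr)
{
    for (gmr_t *g = head; g != NULL; g = g->next) {
        const char *lo = (const char *)g->base;
        if ((const char *)addr >= lo && (const char *)addr < lo + g->size)
            return g;
    }
    return NULL;
}
```

As the comment above notes, with only a handful of regions the constant factors dominate and the tree buys little; the win appears when applications register many windows.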

Obsolete comment by @jeffhammond:

https://github.com/freebsd/freebsd/blob/master/sys/cddl/contrib/opensolaris/common/avl/avl.c exists but I do not know if CDDL is acceptable in ARMCI-MPI. If not, we'll have to implement from scratch.

Migrated from https://github.com/jeffhammond/armci-mpi/issues/25

Add a check for ARMCII_Initialized()

Reported by jhammond on 14 May 2013 13:49 UTC
From TODO:

Add a check for ARMCII_Initialized()?
+ (ARMCII_GLOBAL_STATE.active && MPI_Initialized && !MPI_Finalized())
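A minimal sketch of the check suggested in the TODO. The library state and the MPI queries are stubbed out here so the sketch stands alone; in the real library these would be ARMCII_GLOBAL_STATE plus MPI_Initialized() and MPI_Finalized() from <mpi.h>:

```c
/* Stubs standing in for the real library state and MPI queries. */
static struct { int active; } ARMCII_GLOBAL_STATE = { 0 };
static int mpi_initialized = 0, mpi_finalized = 0;

/* The proposed check: ARMCI counts as initialized only if its own state
 * is active and MPI is up but not yet torn down. */
static int ARMCII_Initialized(void)
{
    return ARMCII_GLOBAL_STATE.active && mpi_initialized && !mpi_finalized;
}
```

Public entry points could then assert this predicate and fail with a clear error instead of crashing when called before ARMCI_Init or after ARMCI_Finalize.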

use MPI-3 RMA features

Reported by jhammond on 14 May 2013 15:51 UTC

  • ARMCI_Rmw should use MPI_Fetch_and_op.
  • request-based operations in MPI-3 RMA can be used to implement explicit-handle nonblocking ARMCI ops.
  • MPI-3 RMA separates local and remote completion now, just like ARMCI does. We should use flush_local to achieve the same.
  • ARMCI will benefit greatly from the use of win_lock_all and flush_local/flush/flush_all-based completion.
  • MPI_Compare_and_swap may or may not be a good idea for ARMCI mutexes. (Jim will likely say that spinning across the network is bad, but if we know contention is going to be low...)

armci-mpi checks segfaulted on OpenMPI/3.1.4

Hi Jeff,

I made a second attempt to build armci-mpi outside a container, with OpenMPI 3.1.4 and UCX 1.9. The cluster has InfiniBand EDR hardware and runs Ubuntu Linux 18.04. I built the software with the stock gcc/gfortran from the OS.

This time around, I encountered a different set of issues. I have 10 tests that fail consistently:

FAIL: benchmarks/ping-pong
FAIL: benchmarks/ring-flood
FAIL: benchmarks/contiguous-bench
FAIL: benchmarks/strided-bench
FAIL: benchmarks/rmw_perf
FAIL: tests/ARMCI_AccS_latency
FAIL: tests/test_rmw_fadd
FAIL: tests/mpi/test_mpi_dim
FAIL: tests/contrib/armci-perf
FAIL: tests/contrib/armci-test

The cause of the error is apparent from the backtraces: the segfaults occur inside PMPI_Accumulate or MPI_Fetch_and_op under the hood:

$ grep -e FAIL -e MPI_ test-suite.log
# XFAIL: 0
# FAIL:  10
FAIL: benchmarks/ping-pong
 8  /shared/apps/auto/openmpi/3.1.4-gcc-7.3.0-kesl/lib/libmpi.so.40(PMPI_Accumulate+0x101) [0x7f2cedc2cf51]
 8  /shared/apps/auto/openmpi/3.1.4-gcc-7.3.0-kesl/lib/libmpi.so.40(PMPI_Accumulate+0x101) [0x7fa2c6df9f51]
FAIL benchmarks/ping-pong (exit status: 139)
FAIL: benchmarks/ring-flood
 8  /shared/apps/auto/openmpi/3.1.4-gcc-7.3.0-kesl/lib/libmpi.so.40(PMPI_Accumulate+0x101) [0x7f85e9cabf51]
 8  /shared/apps/auto/openmpi/3.1.4-gcc-7.3.0-kesl/lib/libmpi.so.40(PMPI_Accumulate+0x101) [0x7f248dd0af51]
FAIL benchmarks/ring-flood (exit status: 139)
FAIL: benchmarks/contiguous-bench
 8  /shared/apps/auto/openmpi/3.1.4-gcc-7.3.0-kesl/lib/libmpi.so.40(PMPI_Accumulate+0x101) [0x7ffa903f6f51]
FAIL benchmarks/contiguous-bench (exit status: 139)
FAIL: benchmarks/strided-bench
 8  /shared/apps/auto/openmpi/3.1.4-gcc-7.3.0-kesl/lib/libmpi.so.40(PMPI_Accumulate+0x101) [0x7fd8ac115f51]
FAIL benchmarks/strided-bench (exit status: 139)
FAIL: benchmarks/rmw_perf
 8  /shared/apps/auto/openmpi/3.1.4-gcc-7.3.0-kesl/lib/libmpi.so.40(MPI_Fetch_and_op+0xf5) [0x7f193bf270c5]
FAIL benchmarks/rmw_perf (exit status: 139)
FAIL: tests/ARMCI_AccS_latency
 8  /shared/apps/auto/openmpi/3.1.4-gcc-7.3.0-kesl/lib/libmpi.so.40(PMPI_Accumulate+0x101) [0x7f0655c6ef51]
FAIL tests/ARMCI_AccS_latency (exit status: 139)
[redacted]

I feel I may have to go to the OpenMPI forum for help resolving this, but I wanted to check whether you have ever encountered this kind of issue or have any insight.

Wirawan

use slab allocation

Instead of allocating a window for every call to ARMCI_Malloc, allocate a single slab sized according to GA_Initialize_ltd and suballocate from there. This will obviate the need for any GMR lookups, which helps latency.

A more general approach would be to use a slab of some capacity and then allocate a window per call for allocations in excess of that. The GMR lookup would then start with a simple range check (fast) and fall back to the O(n) linked-list traversal only when the slab capacity was exceeded.

The key to implementing this is ARMCI_Set_shm_limit, which tells ARMCI the upper bound on how much storage it will be asked to allocate:

Global Arrays global/src/base.c line 374:

        if(GA_memory_limited) ARMCI_Set_shm_limit(GA_total_memory);
        if (_ga_initialize_c) {
            if (_ga_initialize_args) {
                ARMCI_Init_args(_ga_argc, _ga_argv);
            }
            else {
                ARMCI_Init();
            }
        }

NWChem always uses ga_initialize_ltd and requires the user to specify the GA allocation size, so no changes to NWChem are required to activate this.
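A minimal sketch of the proposed scheme, using hypothetical names (slab_t and friends are not armci-mpi internals): a bump-pointer suballocator over one reserved region, plus the single range check that would serve as the fast path of the GMR lookup.

```c
#include <stdlib.h>

/* Illustrative slab: one region reserved up front (sized as
 * ARMCI_Set_shm_limit would dictate), suballocated with a bump pointer. */
typedef struct {
    char  *base;
    size_t capacity;
    size_t used;
} slab_t;

static int slab_init(slab_t *s, size_t capacity)
{
    s->base = malloc(capacity);
    s->capacity = capacity;
    s->used = 0;
    return s->base != NULL;
}

/* Suballocate from the slab; return NULL when capacity is exceeded, at
 * which point the caller would fall back to a per-allocation window. */
static void *slab_alloc(slab_t *s, size_t nbytes)
{
    size_t aligned = (nbytes + 15) & ~(size_t)15;  /* 16-byte align */
    if (s->used + aligned > s->capacity)
        return NULL;
    void *p = s->base + s->used;
    s->used += aligned;
    return p;
}

/* The fast path of the proposed GMR lookup: a single range check. */
static int slab_owns(const slab_t *s, const void *addr)
{
    return (const char *)addr >= s->base &&
           (const char *)addr <  s->base + s->used;
}
```

In the real library the slab would be MPI window memory rather than malloc'd memory, and frees would need a real allocator rather than a bump pointer, but the lookup structure is the point here.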

Migrated from https://github.com/jeffhammond/armci-mpi/issues/21

implement memdev allocation API

From GlobalArrays/ga#154:

./.libs/libga.a(base.o): In function `gai_get_devmem':
base.c:(.text+0xdc98): undefined reference to `ARMCI_Malloc_group_memdev'
base.c:(.text+0xdcf9): undefined reference to `ARMCI_Malloc_memdev'
./.libs/libga.a(base.o): In function `pnga_destroy':
base.c:(.text+0x10955): undefined reference to `ARMCI_Free_memdev'

Remove C99 dependence

Reported by Jim Dinan on 14 May 2013 13:50 UTC
Provide non-variadic macro versions of the debug and error routines.
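One common C89-compatible pattern, sketched here with illustrative names (not armci-mpi's actual debug API): the macro takes a single parenthesized argument list and forwards it to a true variadic function, so no C99 variadic macro is needed.

```c
#include <stdarg.h>
#include <stdio.h>

/* Message captured into a buffer here only so the sketch is testable;
 * the real routine would print to stderr. */
static char dbg_last[256];

static void ARMCII_Dbg_print_impl(const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(dbg_last, sizeof dbg_last, fmt, ap);
    va_end(ap);
}

/* Call site uses double parentheses:
 *   ARMCII_DBG_PRINT(("rank %d: %s", rank, msg));
 * The macro strips nothing; `args` already carries its own parens. */
#define ARMCII_DBG_PRINT(args) ARMCII_Dbg_print_impl args
```

The cost of this approach is the double-parenthesis call syntax and the inability to prepend file/line information automatically, which C99's __VA_ARGS__ makes easy.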

Finish implementation of IOV iterators

Reported by jhammond on 14 May 2013 13:45 UTC
This is the first of many tickets that are being created to correspond to the bullets in the TODO file.

mpich test failures on s390x

A build of armci-mpi with mpich 4.0 fails tests on s390x. Tests pass on the Intel and ARM architectures (amd64, arm64, and their 32-bit counterparts).

The build log is available at https://buildd.debian.org/status/fetch.php?pkg=armci-mpi&arch=s390x&ver=0.3.1%7Ebeta-5&stamp=1645753186&raw=0 .
Tests pass with openmpi but 16 tests fail with mpich:

mpicc.mpich -DHAVE_CONFIG_H -I. -I./src  -I./src -Wdate-time -D_FORTIFY_SOURCE=2  -g -O2 -ffile-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security  -pthread -c -o tests/contrib/non-blocking/simple.o tests/contrib/non-blocking/simple.c
/bin/bash ./libtool  --tag=CC   --mode=link mpicc.mpich  -g -O2 -ffile-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security  -pthread  -Wl,-z,relro -o tests/contrib/non-blocking/simple tests/contrib/non-blocking/simple.o libarmci-mpich.la -lm 
libtool: link: mpicc.mpich -g -O2 "-ffile-prefix-map=/<<PKGBUILDDIR>>=." -fstack-protector-strong -Wformat -Werror=format-security -pthread -Wl,-z -Wl,relro -o tests/contrib/non-blocking/simple tests/contrib/non-blocking/simple.o  ./.libs/libarmci-mpich.a -lm -pthread
make[3]: Leaving directory '/<<PKGBUILDDIR>>/build-mpich'
/usr/bin/make  check-TESTS
make[3]: Entering directory '/<<PKGBUILDDIR>>/build-mpich'
make[4]: Entering directory '/<<PKGBUILDDIR>>/build-mpich'
PASS: benchmarks/ping-pong
PASS: benchmarks/ring-flood
PASS: benchmarks/contiguous-bench
FAIL: benchmarks/strided-bench
PASS: benchmarks/rmw_perf
PASS: tests/test_onesided
PASS: tests/test_onesided_shared
PASS: tests/test_onesided_shared_dla
PASS: tests/test_mutex
PASS: tests/test_mutex_rmw
PASS: tests/test_mutex_trylock
PASS: tests/test_malloc_irreg
FAIL: tests/ARMCI_PutS_latency
FAIL: tests/ARMCI_AccS_latency
PASS: tests/test_groups
PASS: tests/test_group_split
PASS: tests/test_malloc_group
FAIL: tests/test_accs
FAIL: tests/test_accs_dla
FAIL: tests/test_puts
FAIL: tests/test_puts_gets
FAIL: tests/test_puts_gets_dla
FAIL: tests/test_putv
PASS: tests/test_igop
PASS: tests/test_rmw_fadd
PASS: tests/test_parmci
PASS: tests/mpi/test_mpi_accs
FAIL: tests/mpi/test_mpi_dim
FAIL: tests/mpi/test_mpi_indexed_accs
FAIL: tests/mpi/test_mpi_indexed_gets
FAIL: tests/mpi/test_mpi_indexed_puts_gets
FAIL: tests/mpi/test_mpi_subarray_accs
PASS: tests/mpi/test_win_create
PASS: tests/mpi/test_win_model
PASS: tests/ctree/ctree_test
PASS: tests/ctree/ctree_test_rand
PASS: tests/ctree/ctree_test_rand_interval
FAIL: tests/contrib/armci-perf
FAIL: tests/contrib/armci-test
PASS: tests/contrib/lu/lu-block
PASS: tests/contrib/lu/lu-b-bc
PASS: tests/contrib/transp1D/transp1D-c
PASS: tests/contrib/non-blocking/simple
============================================================================
Testsuite summary for armci 0.1
============================================================================
# TOTAL: 43
# PASS:  27
# SKIP:  0
# XFAIL: 0
# FAIL:  16
# XPASS: 0
# ERROR: 0

Further details of the errors are listed in the build log

There are essentially only two distinct test errors here. Most of the failures point at the same assertion:

Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0

e.g.

FAIL: benchmarks/strided-bench
==============================

Starting one-sided strided performance test with 2 processes
   Trg. Rank    Xdim Ydim   Get (usec)   Put (usec)   Acc (usec)  Get (MiB/s)  Put (MiB/s)  Acc (MiB/s)
Assertion failed in file src/mpi/datatype/typerep/dataloop/looputil.c at line 815: *lengthp > 0
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x2b44c6) [0x3ffa1f344c6]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1fcfee) [0x3ffa1e7cfee]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1c6f94) [0x3ffa1e46f94]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x1cd63c) [0x3ffa1e4d63c]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25727e) [0x3ffa1ed727e]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25a036) [0x3ffa1eda036]
/usr/lib/s390x-linux-gnu/libmpich.so.12(+0x25c590) [0x3ffa1edc590]
/usr/lib/s390x-linux-gnu/libmpich.so.12(PMPI_Accumulate+0xa94) [0x3ffa1d79864]
./benchmarks/strided-bench(+0x43ee) [0x2aa3c3843ee]
./benchmarks/strided-bench(+0x5828) [0x2aa3c385828]
./benchmarks/strided-bench(main+0x2ea) [0x2aa3c382f32]
/lib/s390x-linux-gnu/libc.so.6(__libc_start_main+0xe6) [0x3ffa1a24c5e]
./benchmarks/strided-bench(+0x31f4) [0x2aa3c3831f4]
internal ABORT - process 0
FAIL benchmarks/strided-bench (exit status: 1)

looputil.c is in mpich, not armci-mpi, so maybe this is an mpich bug?
Not sure whether it's relevant to the looputil.c assertion here, but we previously caught a bug caused by incorrect assumptions about how long double alignment is implemented on s390x, exposed in mpi4py; see mpi4py/mpi4py#91.
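The assumption in question can be probed directly. On x86-64 both values below are 16, while on s390x long double is reportedly a 16-byte type with only 8-byte alignment, which is exactly the kind of mismatch a datatype engine can trip over:

```c
#include <stddef.h>
#include <stdalign.h>

/* Probe the long double ABI.  Code that assumes
 * alignof(long double) == sizeof(long double) holds on x86-64 (16/16)
 * but reportedly not on s390x (size 16, alignment 8). */
static size_t long_double_size(void)  { return sizeof(long double); }
static size_t long_double_align(void) { return alignof(long double); }
```

Printing these two values on the failing s390x builder would quickly confirm or rule out an alignment-assumption bug in mpich's dataloop code.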

The other error is in test_mpi_indexed_gets:

FAIL: tests/mpi/test_mpi_indexed_gets
=====================================

MPI RMA Strided Get Test:
0: Data validation failed at [318, 0] expected=1.000000 actual=19153196493101324300117002266184609761638785168706969756587673992816090829370440047833267676841021126741158161912149458901300240246916622245811317773215680681469166039489874870997064119253413911245961967859065159680.000000
1: Data validation failed at [318, 0] expected=2.000000 actual=19153196493101324300117002266184609761638785168706969756587673992816090829370440047833267676841021126741158161912149458901300240246916622245811317773215680681469166039489874870997064119253413911245961967859065159680.000000
0: Data validation failed at [345, 0] expected=1.000000 actual=2523265647856334203312318852546941707356501213688096169388892899082082058579155051647685988391931887920786317575791927786357084394346485273106592523647787658186823812963720560115712.000000
1: Data validation failed at [345, 0] expected=2.000000 actual=2523265647856334203312318852546941707356501213688096169388892899082082058579155051647685988391931887920786317575791927786357084394346485273106592523647787658186823812963720560115712.000000

I see an error like this if there is a mismatch in libmpich.so (e.g. on amd64, running armci-mpi tests with libarmci built against mpich 4.0 but then compiling tests using libmpich1.2 from mpich 3.4.1), but that kind of mismatch shouldn't apply to the s390x build-time test failure reported here.

For reference, various tests also fail at build time for other less common architectures, evidently for different reasons. Build logs are collected at https://buildd.debian.org/status/package.php?p=armci-mpi
On mips64el, test_mpi_indexed_gets fails on mpich, all tests pass with openmpi. On mipsel tests pass with mpich but fail with openmpi.

CI runtime (installation) test logs are collected at https://ci.debian.net/packages/a/armci-mpi/ (the version building with mpich is 0.3.1~beta-5 or later), showing the same test failure on s390x.

Implement TCGMSG interface

Reported by jhammond on 14 May 2013 13:53 UTC
As of GA 5.1, TCGMSG moved into ARMCI, so we now need to reimplement it as well in order to be a drop-in replacement for ARMCI from PNNL.

This is related to the previous ticket about message.h...

test_mpi_accs stuck

While building the armci-mpi library for use on our cluster, I found that the test_mpi_accs program could not make progress. I don't yet have good information about where the program is stuck. The underlying MPI library is MPICH 3.1. The build took place in a Singularity container. The compiler is GCC 7.3.0 (crosstool-NG 1.23.0.449-a04d0) provided by conda, and the MPICH library was built with that same GCC toolchain.

nwchem fails in multi-node execution with openmpi: ARMCI assert fail in gmr_create() [src/gmr.c:109]: "alloc_slices[alloc_me].base != NULL"

The Debian testing build of nwchem is currently failing to run across multiple nodes. It runs fine on one node.

The nodes form a cluster managed by openstack. 16 cpu per node

Testing against the sample water script at https://nwchemgit.github.io/Sample.html, one node runs successfully with

mpirun -n 16 nwchem water.nw

I can also run successfully on a different (single) node (here launching from node-1 to execute on node-2)

mpirun -H node-2:16 -n 16 nwchem water.nw

The segfault occurs when I try to run on both nodes. Whether with -n 32 or -N 16,

mpirun -H node-1:16,node-2:16 -n 32 nwchem water.nw

or

mpirun -H node-1:16,node-2:16 -N 16 nwchem water.nw

both fail the same way.

The error message is:

$ mpirun -H node-1:16,node-2:16 -N 16 nwchem water.nw 
[31] ARMCI assert fail in gmr_create() [src/gmr.c:109]: "alloc_slices[alloc_me].base != NULL"
[31] Backtrace:
[31]  10 - nwchem(+0x2836605) [0x55fe1ee26605]
[31]   9 - nwchem(+0x282cc1c) [0x55fe1ee1cc1c]
[31]   8 - nwchem(+0x282c358) [0x55fe1ee1c358]
[31]   7 - nwchem(+0x2819f68) [0x55fe1ee09f68]
[31]   6 - nwchem(+0x2819cba) [0x55fe1ee09cba]
[31]   5 - nwchem(+0x2819d76) [0x55fe1ee09d76]
[31]   4 - nwchem(+0x2818fe9) [0x55fe1ee08fe9]
[31]   3 - nwchem(+0x11b79) [0x55fe1c601b79]
[31]   2 - nwchem(+0x12659) [0x55fe1c602659]
[31]   1 - /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xcd) [0x7fb2c8ffa7ed]
[31]   0 - nwchem(+0x1069a) [0x55fe1c60069a]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 31 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: node-1
  Local PID:  1264980
  Peer host:  node-2
--------------------------------------------------------------------------

I've tried a fresh rebuild of armci-mpi, ga and nwchem, but the segfault persists.

I've tried setting ARMCI_USE_WIN_ALLOCATE=0 as suggested in the armci-mpi README, but it doesn't avoid the segfault.

I'm not sure where exactly the source of the error is, whether it's a question of build configuration or run-time configuration, or whether it's a bug in armci-mpi, ga or nwchem. Or even MPI or openstack.

mpirun -H node-1:16,node-2:16 -N 16 does run successfully for a FEniCS MPI Python job (it also uses PETSc). That suggests openmpi might not be at fault.

The error reference to src/gmr.c belongs to armci-mpi, so that seems the more likely source. Hence requesting help here.
