ghex-org / ghex

Generic exascale-ready library for halo-exchange operations on a variety of grids/meshes

License: Other

CMake 4.73% C++ 84.23% Cuda 0.09% Makefile 0.06% Python 4.78% Fortran 5.93% Shell 0.18%

ghex's Introduction


GHEX

Generic exascale-ready library for halo-exchange operations on a variety of grids/meshes.

Documentation and instructions at GHEX Documentation.

Installation instructions

Requirements

  • C++17 compiler (gcc or clang)
  • CMake (3.21 or later)
  • GridTools (available as submodule with GHEX_USE_BUNDLED_LIBS=ON)
  • Boost headers (1.80 or later)
  • MPI
  • UCX (optional)
  • Libfabric (optional)
  • Xpmem (optional)
  • oomph (0.3.0 or later, available as submodule with GHEX_USE_BUNDLED_LIBS=ON)
  • hwmalloc (0.3.0 or later, available as submodule with GHEX_USE_BUNDLED_LIBS=ON)
  • CUDA Toolkit (optional, 11.0 or later)
  • Rocm/Hip Toolkit (optional, 4.5.1 or later)
  • Google Test (only when GHEX_WITH_TESTING=ON, available as submodule with GHEX_USE_BUNDLED_LIBS=ON)
  • parmetis, metis (optional)
  • atlas (optional)
  • gfortran compiler (optional, for Fortran bindings)
  • Python3 (optional, for Python bindings)
  • MPI4PY (optional, for Python bindings)
  • Pybind11 (optional, for Python bindings)

From Source

git clone https://github.com/ghex-org/GHEX.git
cd GHEX
git submodule update --init --recursive
mkdir -p build && cd build
cmake -DGHEX_WITH_TESTING=ON -DGHEX_USE_BUNDLED_LIBS=ON ..
make -j8
make test
CMake options

Option                        Allowed Values          Default                                              Description
CMAKE_INSTALL_PREFIX=         <path>                  /usr/local                                           Choose install path prefix
CMAKE_BUILD_TYPE=             {Release, Debug}        Release                                              Choose type of build
GHEX_USE_BUNDLED_LIBS=        {ON, OFF}               OFF                                                  Use available git submodules
GHEX_USE_GPU=                 {ON, OFF}               OFF                                                  Enable GPU
GHEX_GPU_TYPE=                {AUTO, NVIDIA, AMD}     AUTO                                                 Choose GPU type
CMAKE_CUDA_ARCHITECTURES=     list of architectures   "60;70;75"                                           Only relevant for GHEX_GPU_TYPE=NVIDIA
CMAKE_HIP_ARCHITECTURES=      list of architectures   "gfx900;gfx906"                                      Only relevant for GHEX_GPU_TYPE=AMD
GHEX_BUILD_FORTRAN=           {ON, OFF}               OFF                                                  Build with Fortran bindings
GHEX_BUILD_PYTHON_BINDINGS=   {ON, OFF}               OFF                                                  Build with Python bindings
GHEX_PYTHON_LIB_PATH=         <path>                  ${CMAKE_INSTALL_PREFIX}/<python-site-packages-path>  Installation directory for GHEX's Python package
GHEX_WITH_TESTING=            {ON, OFF}               OFF                                                  Build unit tests
GHEX_USE_XPMEM=               {ON, OFF}               OFF                                                  Use Xpmem
GHEX_TRANSPORT_BACKEND=       {MPI, UCX, LIBFABRIC}   MPI                                                  Choose transport backend

Pip Install

python -m venv ghex_venv
. ghex_venv/bin/activate
python -m pip install 'git+https://github.com/ghex-org/GHEX.git#subdirectory=bindings/python'
Pertinent environment variables

Variable                  Allowed Values          Default                           Description
GHEX_USE_GPU=             {ON, OFF}               OFF                               Enable GPU
GHEX_GPU_TYPE=            {AUTO, NVIDIA, AMD}     AUTO                              Choose GPU type
GHEX_GPU_ARCH=            list of archs           "60;70;75;80" / "gfx900;gfx906"   GPU architecture
GHEX_TRANSPORT_BACKEND=   {MPI, UCX, LIBFABRIC}   MPI                               Choose transport backend

Acknowledgements

The development of GHEX was supported partly by The Partnership for Advanced Computing in Europe (PRACE). PRACE is an international non-profit association with its seat in Brussels. The PRACE Research Infrastructure provides a persistent world-class high performance computing service for scientists and researchers from academia and industry in Europe. The computer systems and their operations accessible through PRACE are provided by 5 PRACE members (BSC representing Spain, CINECA representing Italy, ETH Zurich/CSCS representing Switzerland, GCS representing Germany and GENCI representing France). The Implementation Phase of PRACE receives funding from the EU’s Horizon 2020 Research and Innovation Programme (2014-2020) under grant agreement 823767. For more information, see www.prace-ri.eu.

ghex's People

Contributors

angainor, bettiolm, biddisco, boeschf, edopao, fthaler, halungge, havogt, kotsaloscv, mbianco, stubbiali, tehrengruber


ghex's Issues

CUDA + OpenMP

Currently, our build assumption is that the entire user application is only one file, which includes all needed headers. That means that the code is either compiled using nvcc, or mpicxx. That is a problem when one wants to use OpenMP with CUDA-enabled code: nvcc does not support it.

Maybe there is a way to fix this with nvcc. If not, we need to somehow add more flexibility to how the applications are built, e.g., by allowing them to be composed of multiple object files. Even now, each file is first compiled to an object file, which is then linked to an executable. It would be enough to allow executables to be composed of multiple object files, and use nvcc only when needed.

GCC 9 Warning

With GCC 9.1.0 there is a warning that I was not able to remove:

/Users/mbianco/Work/GHEX/benchmarks/gcl_test_halo_exchange_3D_generic_full.cpp:202:35: error: implicitly-declared 'constexpr triple_t<false, int>& triple_t<false, int>::operator=(const triple_t<false, int>&)' is deprecated [-Werror=deprecated-copy]
  202 |                     a(ii, jj, kk) = triple_t<USE_DOUBLE, T1>();
      |                     ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated due to -Wfatal-errors.

Now CMakeLists.txt explicitly says

 -Wno-deprecated-copy

as an option to the compiler. This appears to be a compiler bug that has since been patched; I'll check with newer versions and see whether the problem disappears.

Do we need shared_message?

Maybe we do not need a shared_message class, since we may just pass a std::shared_ptr to a message of any type. This issue is related to #35

Fix communication object naming

We should rename communication_object_2 into communication_object, and move the current communication_object into the benchmark folder, so we can test if there are overheads from having the more general co.

Setup CI for GHEX

Requirements/goals (they do not have to be fulfilled at first):

  • Running from GitHub
  • Running on multiple platforms (CSCS and Oslo)
  • Running many combinations of parameters (stochastic testing?)
  • Performance tests

Resume send_multi

send_multi was dropped since we did not have direct use cases. We will need to re-introduce it to handle upcoming use cases.

Cancellation is not thread safe

Making it thread safe would also require the other operations to be thread safe (so far they are only thread compatible). We need to discuss/design/implement ways in which cancellation could be triggered more generically (maybe by exposing thread primitives to the user?).

worker and endpoint cleanup

Currently we destroy the endpoint with

ucs_status_ptr_t ret = ucp_ep_close_nb(m_ep, UCP_EP_CLOSE_MODE_FORCE);

We have to use UCP_EP_CLOSE_MODE_FORCE instead of UCP_EP_CLOSE_MODE_FLUSH because of a cleanup glitch. Summary: if one rank destroys its worker and endpoint, and another rank then tries to destroy its endpoint with a FLUSH, it may attempt to communicate with the already closed remote worker. This causes a segfault. UCP_EP_CLOSE_MODE_FORCE avoids the segfault, but the solution suggested by the UCX developers was to use a barrier after destroying the endpoints, and only then close the workers.

Import of GHEX cmake config in user projects

Currently, when GHEX is configured with CMake and installed, the configured options are hard-coded into the installed GHEX CMake config file. Since this file is then included in the user's CMake configuration, GHEX can only be used in the form it was configured (e.g., with either the UCX or the MPI backend).

Since at this point GHEX is header-only, this is not necessary. We could let users change the GHEX configuration in their own CMake configuration by, e.g., exporting the available configuration options in the installed GHEX CMake config instead of hard-coding them. Users could then choose whatever they want without having to re-configure and re-install GHEX separately.

Test raw_shared_message

Right now the raw_shared_message class is not tested. We should benchmark the implementation and see whether the benefit for real use cases is enough to justify the class.

Create interface to/integrate HWLOC

This seems to be the way to go for binding threads to GPUs in a meaningful way. It is not yet clear what is needed by GHEX itself and what is the user's responsibility.

[[nodiscard]] not understood by the intel compiler

I get a lot of warnings like this when compiling with ICC:

/cluster/projects/nn9999k/marcink/ghex/angainor/GHEX/include/ghex/communication_object_2.hpp(309): warning #1292: unknown attribute "nodiscard"
              [[nodiscard]] std::enable_if_t<std::is_same<Arch,cpu>::value, handle_type>
                ^

blocked CSR for unstructured grids

Often when dealing with problems with many degrees of freedom per node, a blocked CSR is used to represent the compute matrix. In that storage format the grid connectivity itself is represented with 1 degree of freedom per node (a normal graph connectivity matrix), but each vertex is logically responsible for multiple variables, and the sparse matrix contains small, dense blocks of data. Hence, during, e.g., a sparse matrix-vector product it references small contiguous blocks of the input vector.

Exchanging halos for such blocked storage would require a trivial change to GHEX: instead of sending / packing / unpacking a single value from the input vector, GHEX would need to pack / unpack ndof contiguous values. The pointer to the vector position would also need to be trivially scaled by ndof. So effectively GHEX would send/recv [halo_idx*ndof+0, halo_idx*ndof+1, ... halo_idx*ndof+ndof-1] vector subranges.

hardware tag matching fails with UCX backend

Running ghexbench with UCX_DC_MLX5_TM_ENABLE=y causes an error and a segfault. The same setting works with MPI backend when using OpenMPI on IB networks. Is it something about how we create the worker / UCX context?

[1615309041.680074] [b2237:256544:0] rc_mlx5_common.c:827  UCX  ERROR ibv_exp_create_srq(device=mlx5_0) failed: Cannot allocate memory

==== backtrace (tid: 110170) ====
 0 0x0000000000052e95 ucs_debug_print_backtrace()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucs/debug/debug.c:656
 1 0x000000000003e54c ucp_address_pack()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucp/wireup/address.c:832
 2 0x000000000003e54c ucp_address_pack()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucp/wireup/address.c:844
 3 0x00000000000246bd ucp_worker_get_address()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucp/core/ucp_worker.c:2241
 4 0x00000000004327a8 gridtools::ghex::tl::ucx::worker_t::worker_t()  ???:0
 5 0x000000000042b646 cartex::runtime::impl::init()  ???:0
 6 0x000000000041da99 cartex::runtime::exchange()  ???:0
 7 0x000000000040afe5 main()  ???:0
 8 0x0000000000022545 __libc_start_main()  ???:0
 9 0x000000000040ca8d _start()  ???:0
=================================

Avoiding packing with unstructured grids

It is possible to avoid packing when sending halos with unstructured grids. To do this, the mesh vertices must be ordered such that the first indices address the internal nodes, followed by all halo nodes for the first neighbor, then those for the second neighbor, and so on. To illustrate, a local domain with 10 internal nodes and 12 halo nodes across 4 neighbors should be stored in memory as

0 0 0 0 0 0 0 0 0 0 | 1 1 1 | 2 2 2 | 3 3 3 | 4 4 4

where 0 marks the local domain interior and 1-4 denote neighbor ids. This way, when you submit a send you only have to take the pointer to the first entry required by a given neighbor and hand that to the transport backend. No packing is needed, since the sent values are already contiguous in memory. Of course, the receiver still has to perform normal unpacking, since it must scatter the contiguous buffer into non-contiguous locations in the local domain.

To achieve this, either the user has to provide a domain description in which the vertex IDs have already been reordered to fulfill the above, or we could perform such renumbering internally in GHEX (at the user's request) and return the new vertex permutation to the user. GHEX's init function would then also act as a 'local' (per-subdomain) node reordering routine.

context.rank() vs comm.rank()

We can currently obtain the (MPI) rank and size from both the context and the communicator. Will the communicator ever hold different values for these? Should we only expose them through the context? When I see both, I don't know which one to use, or whether there is a difference.

[unstructured] Minor refactorings

  • Remove the adjacency member from the unstructured user concepts (domain descriptor);
  • provide a concept for the receive_domain_id_generator functor. So far it is implemented directly in the test / benchmark application code. By definition the implementation is application-dependent, but a concept could still help users write their own. Just adding it to the documentation might also be an option;
  • re-use some structs from communication_object_2 (field_info and buffer) instead of defining new ones; they can be reused at least for the send part;
  • the offset of the serialized buffer should not be needed anymore. This has to be verified first; if true, the offset should be removed both from field_info and from field_info_ipr, where it does not actually refer to a serialized buffer.

UCX backend without MT support

Currently the UCX backend always supports a multi-threaded execution model, i.e., the context is initialized with MT support and we always create the mutex. It might be a good idea to make this optional (just as the backends are built with and without MT support).

Just a thought. Should be discussed.

Repo cleanup

The following branches are merged and not actively developed:

test
protocol
communication_object
comm2
dev

I think we can delete them from GridTools/GHEX repo to clean things up.
