ghex-org / ghex

Generic exascale-ready library for halo-exchange operations on a variety of grids/meshes

License: Other

CMake 4.73% C++ 84.23% Cuda 0.09% Makefile 0.06% Python 4.78% Fortran 5.93% Shell 0.18%

ghex's Introduction


GHEX

Generic exascale-ready library for halo-exchange operations on a variety of grids/meshes.

Documentation and instructions at GHEX Documentation.

Installation instructions

Requirements

  • C++17 compiler (gcc or clang)
  • CMake (3.21 or later)
  • GridTools (available as submodule with GHEX_USE_BUNDLED_LIBS=ON)
  • Boost headers (1.80 or later)
  • MPI
  • UCX (optional)
  • Libfabric (optional)
  • Xpmem (optional)
  • oomph (0.3.0 or later, available as submodule with GHEX_USE_BUNDLED_LIBS=ON)
  • hwmalloc (0.3.0 or later, available as submodule with GHEX_USE_BUNDLED_LIBS=ON)
  • CUDA Toolkit (optional, 11.0 or later)
  • Rocm/Hip Toolkit (optional, 4.5.1 or later)
  • Google Test (only when GHEX_WITH_TESTING=ON, available as submodule with GHEX_USE_BUNDLED_LIBS=ON)
  • parmetis, metis (optional)
  • atlas (optional)
  • gfortran compiler (optional, for Fortran bindings)
  • Python3 (optional, for Python bindings)
  • MPI4PY (optional, for Python bindings)
  • Pybind11 (optional, for Python bindings)

From Source

git clone https://github.com/ghex-org/GHEX.git
cd GHEX
git submodule update --init --recursive
mkdir -p build && cd build
cmake -DGHEX_WITH_TESTING=ON -DGHEX_USE_BUNDLED_LIBS=ON ..
make -j8
make test
CMake options

Option                        Allowed Values          Default                                              Description
CMAKE_INSTALL_PREFIX=         <path>                  /usr/local                                           Choose install path prefix
CMAKE_BUILD_TYPE=             {Release, Debug}        Release                                              Choose type of build
GHEX_USE_BUNDLED_LIBS=        {ON, OFF}               OFF                                                  Use available git submodules
GHEX_USE_GPU=                 {ON, OFF}               OFF                                                  Enable GPU
GHEX_GPU_TYPE=                {AUTO, NVIDIA, AMD}     AUTO                                                 Choose GPU type
CMAKE_CUDA_ARCHITECTURES=     list of architectures   "60;70;75"                                           Only relevant for GHEX_GPU_TYPE=NVIDIA
CMAKE_HIP_ARCHITECTURES=      list of architectures   "gfx900;gfx906"                                      Only relevant for GHEX_GPU_TYPE=AMD
GHEX_BUILD_FORTRAN=           {ON, OFF}               OFF                                                  Build with Fortran bindings
GHEX_BUILD_PYTHON_BINDINGS=   {ON, OFF}               OFF                                                  Build with Python bindings
GHEX_PYTHON_LIB_PATH=         <path>                  ${CMAKE_INSTALL_PREFIX}/<python-site-packages-path>  Installation directory for GHEX's Python package
GHEX_WITH_TESTING=            {ON, OFF}               OFF                                                  Build unit tests
GHEX_USE_XPMEM=               {ON, OFF}               OFF                                                  Use Xpmem
GHEX_TRANSPORT_BACKEND=       {MPI, UCX, LIBFABRIC}   MPI                                                  Choose transport backend

Pip Install

python -m venv ghex_venv
. ghex_venv/bin/activate
python -m pip install 'git+https://github.com/ghex-org/GHEX.git#subdirectory=bindings/python'
Pertinent environment variables

Variable                  Allowed Values          Default                           Description
GHEX_USE_GPU=             {ON, OFF}               OFF                               Enable GPU
GHEX_GPU_TYPE=            {AUTO, NVIDIA, AMD}     AUTO                              Choose GPU type
GHEX_GPU_ARCH=            list of archs           "60;70;75;80" / "gfx900;gfx906"   GPU architecture
GHEX_TRANSPORT_BACKEND=   {MPI, UCX, LIBFABRIC}   MPI                               Choose transport backend

Acknowledgements

The development of GHEX was supported partly by The Partnership for Advanced Computing in Europe (PRACE). PRACE is an international non-profit association with its seat in Brussels. The PRACE Research Infrastructure provides a persistent world-class high performance computing service for scientists and researchers from academia and industry in Europe. The computer systems and their operations accessible through PRACE are provided by 5 PRACE members (BSC representing Spain, CINECA representing Italy, ETH Zurich/CSCS representing Switzerland, GCS representing Germany and GENCI representing France). The Implementation Phase of PRACE receives funding from the EU’s Horizon 2020 Research and Innovation Programme (2014-2020) under grant agreement 823767. For more information, see www.prace-ri.eu.

ghex's People

Contributors

angainor, bettiolm, biddisco, boeschf, edopao, fthaler, halungge, havogt, kotsaloscv, mbianco, stubbiali, tehrengruber


ghex's Issues

CUDA + OpenMP

Currently, our build assumption is that the entire user application is only one file, which includes all needed headers. That means that the code is either compiled using nvcc, or mpicxx. That is a problem when one wants to use OpenMP with CUDA-enabled code: nvcc does not support it.

Maybe there is a way to fix this with nvcc. If not, we need to somehow add more flexibility to how the applications are built, e.g., by allowing them to be composed of multiple object files. Even now, each file is first compiled to an object file, which is then linked to an executable. It would be enough to allow executables to be composed of multiple object files, and use nvcc only when needed.

GCC 9 Warning

With GCC 9.1.0 there is a warning that I was not able to remove:

/Users/mbianco/Work/GHEX/benchmarks/gcl_test_halo_exchange_3D_generic_full.cpp:202:35: error: implicitly-declared 'constexpr triple_t<false, int>& triple_t<false, int>::operator=(const triple_t<false, int>&)' is deprecated [-Werror=deprecated-copy]
  202 |                     a(ii, jj, kk) = triple_t<USE_DOUBLE, T1>();
      |                     ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated due to -Wfatal-errors.

Now CMakeLists.txt explicitly says

 -Wno-deprecated-copy

as an option to the compiler. This appears to be a compiler bug that has since been patched; I'll check with newer versions and see whether the problem disappears.

Do we need shared_message?

Maybe we do not need a shared_message class, since we may just pass a std::shared_ptr to a message of any type. This issue is related to #35

Fix communication object naming

We should rename communication_object_2 into communication_object, and move the current communication_object into the benchmark folder, so we can test if there are overheads from having the more general co.

Setup CI for GHEX

Requirements/goals (they do not have to be fulfilled at first):

  • Running from GitHub
  • Running on multiple platforms (CSCS and Oslo)
  • Running many combinations of parameters (stochastic testing?)
  • Performance tests

Resume send_multi

send_multi was dropped since we did not have direct use cases. We will need to re-introduce it to handle upcoming use cases.

Cancellation is not thread safe

Making it thread safe would also require the other operations to be thread safe (so far they are only thread compatible). We need to discuss/design/implement ways in which cancellation could be triggered more generically (maybe by exposing thread primitives to the user?).

worker and endpoint cleanup

Currently we destroy the endpoint with

ucs_status_ptr_t ret = ucp_ep_close_nb(m_ep, UCP_EP_CLOSE_MODE_FORCE);

We have to use UCP_EP_CLOSE_MODE_FORCE instead of UCP_EP_CLOSE_MODE_FLUSH because of a cleanup glitch. Summary: if one rank destroys its worker and endpoint, and another rank then tries to destroy its endpoint with a FLUSH, it may attempt to communicate with the already closed remote worker. This causes a segfault. UCP_EP_CLOSE_MODE_FORCE avoids the segfault, but the solution suggested by the UCX developers was to use a barrier after destroying the endpoints, and only then close the workers.

Import of GHEX cmake config in user projects

Currently, when GHEX is configured with CMake and installed, the configured options are hard-coded into the installed GHEX CMake config file. Since this file is then included in the user's CMake configuration, GHEX can only be used in the form it was configured (e.g., with either the UCX or the MPI backend).

Since at this point GHEX is header-only, this is not necessary. We could let users change the GHEX configuration in their own CMake configuration by, e.g., exporting the available configuration options in the installed GHEX CMake config instead of hard-coding them. Users could then choose whatever they want without having to re-configure and re-install GHEX separately.

Test raw_shared_message

Right now the raw_shared_message class is not tested. We should benchmark the implementation and see whether the benefit for real use cases is enough to justify the class.

Create interface to/integrate HWLOC

This seems to be the way to go for binding threads to GPUs in a meaningful way. It is not yet clear what is needed by GHEX itself and what is the user's responsibility.

[[nodiscard]] not understood by the intel compiler

I get a lot of warnings like this when compiling with ICC:

/cluster/projects/nn9999k/marcink/ghex/angainor/GHEX/include/ghex/communication_object_2.hpp(309): warning #1292: unknown attribute "nodiscard"
              [[nodiscard]] std::enable_if_t<std::is_same<Arch,cpu>::value, handle_type>
                ^

blocked CSR for unstructured grids

Often when dealing with problems with many degrees of freedom per node, a blocked CSR is used to represent the compute matrix. In that storage format the grid connectivity itself is represented with 1 degree of freedom per node (a normal graph connectivity matrix), but each vertex is logically responsible for multiple variables, and the sparse matrix contains small, dense blocks of data. Hence, during, e.g., a sparse matrix-vector product it references small contiguous blocks of the input vector.

Exchanging halos for such blocked storage would require a trivial change to GHEX: instead of sending / packing / unpacking a single value from the input vector, GHEX would need to pack / unpack ndof contiguous values. The pointer to the vector position would also need to be trivially scaled by ndof. So effectively GHEX would send/recv [halo_idx*ndof+0, halo_idx*ndof+1, ... halo_idx*ndof+ndof-1] vector subranges.

hardware tag matching fails with UCX backend

Running ghexbench with UCX_DC_MLX5_TM_ENABLE=y causes an error and a segfault. The same setting works with MPI backend when using OpenMPI on IB networks. Is it something about how we create the worker / UCX context?

[1615309041.680074] [b2237:256544:0] rc_mlx5_common.c:827  UCX  ERROR ibv_exp_create_srq(device=mlx5_0) failed: Cannot allocate memory

==== backtrace (tid: 110170) ====
 0 0x0000000000052e95 ucs_debug_print_backtrace()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucs/debug/debug.c:656
 1 0x000000000003e54c ucp_address_pack()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucp/wireup/address.c:832
 2 0x000000000003e54c ucp_address_pack()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucp/wireup/address.c:844
 3 0x00000000000246bd ucp_worker_get_address()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucp/core/ucp_worker.c:2241
 4 0x00000000004327a8 gridtools::ghex::tl::ucx::worker_t::worker_t()  ???:0
 5 0x000000000042b646 cartex::runtime::impl::init()  ???:0
 6 0x000000000041da99 cartex::runtime::exchange()  ???:0
 7 0x000000000040afe5 main()  ???:0
 8 0x0000000000022545 __libc_start_main()  ???:0
 9 0x000000000040ca8d _start()  ???:0
=================================

Avoiding packing with unstructured grids

It is possible to avoid packing when sending halos with unstructured grids. To do this, the mesh vertices must be ordered such that the first indices address the internal nodes, followed by all halo nodes for the first neighbor, then those for the second neighbor, and so on. To illustrate, a local domain with 10 internal nodes and 12 halo nodes across 4 neighbors should be stored in memory as

0 0 0 0 0 0 0 0 0 0 | 1 1 1 | 2 2 2 | 3 3 3 | 4 4 4

where 0 marks the local domain interior and 1-4 denote neighbor ids. This way, when you submit a send you only have to take the pointer to the first entry required by a given neighbor and hand that to the transport backend. No packing is needed, since the sent values are already contiguous in memory. Of course, the receiver still has to perform normal unpacking, since it must scatter the contiguous buffer into non-contiguous locations in the local domain.

To achieve this, either the user has to provide a domain description in which the vertex IDs have already been reordered to fulfill the above, or we could perform such renumbering internally in GHEX (at the user's request) and return the new vertex permutation to the user. GHEX's init function would then also act as a 'local' (per-subdomain) node reordering routine.

context.rank() vs comm.rank()

We can currently obtain the (MPI) rank and size from both the context and the communicator. Will the communicator ever hold different values for these? Should we only expose them through the context? When I see both, I don't know which one to use, or whether there is a difference.

[unstructured] Minor refactorings

  • Remove the adjacency member from the unstructured user concepts (domain descriptor);
  • provide a concept for the receive_domain_id_generator functor. So far it is implemented directly in the test / benchmark application code. By definition the implementation is application-dependent, but a concept could still help users write their own. Just adding it to the documentation might also be an option;
  • re-use some structs from communication_object_2 (field_info and buffer) instead of defining new ones; they can be reused at least for the send part;
  • the offset of the serialized buffer should not be needed anymore. This has to be verified first; if true, the offset should be removed both from field_info and from field_info_ipr, where it does not actually refer to a serialized buffer.

UCX backend without MT support

Currently the UCX backend always supports a multi-threaded execution model, i.e., the context is initialized with MT support and we always create the mutex. It might be a good idea to make this optional (just as the backends are built with and without MT support).

Just a thought. Should be discussed.

Repo cleanup

The following branches are merged and not actively developed:

test
protocol
communication_object
comm2
dev

I think we can delete them from GridTools/GHEX repo to clean things up.
