celerity / celerity-runtime

High-level C++ for Accelerator Clusters

Home Page: https://celerity.github.io

License: MIT License

Ruby 0.70% CMake 1.74% C++ 95.34% JavaScript 0.98% CSS 0.09% Shell 0.37% Python 0.78% GDB 0.01%

sycl cluster-computing hpc multi-gpu

celerity-runtime's Issues

Decide on which clang-tidy checks we want to enable

I've parsed all clang-tidy diagnostics generated with the following checks enabled:

-*,
bugprone-*,
clang-analyzer-*,
clang-diagnostic-*,
cppcoreguidelines-*,
mpi-*,
performance-*,
readability-*

and those are the results:

Found 2137 diagnostics (out of 13487 parsed; 11350 were duplicates). 
  787: cppcoreguidelines-avoid-magic-numbers,readability-magic-numbers 
  678: readability-identifier-length
  104: readability-function-cognitive-complexity
   92: readability-qualified-auto
   41: cppcoreguidelines-avoid-c-arrays
   37: cppcoreguidelines-pro-bounds-pointer-arithmetic
   36: cppcoreguidelines-init-variables
   30: cppcoreguidelines-special-member-functions
   23: bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions
   22: cppcoreguidelines-macro-usage
   22: performance-unnecessary-value-param
   19: cppcoreguidelines-avoid-non-const-global-variables
   17: readability-implicit-bool-conversion
   16: readability-isolate-declaration
   16: readability-named-parameter
   16: bugprone-macro-parentheses
   15: bugprone-easily-swappable-parameters
   14: cppcoreguidelines-pro-bounds-constant-array-index
   12: bugprone-exception-escape
   12: cppcoreguidelines-pro-bounds-array-to-pointer-decay
   11: cppcoreguidelines-pro-type-member-init
   10: clang-analyzer-deadcode.DeadStores
   10: bugprone-implicit-widening-of-multiplication-result
   10: cppcoreguidelines-pro-type-static-cast-downcast
    7: cppcoreguidelines-pro-type-vararg
    7: performance-move-const-arg
    7: cppcoreguidelines-non-private-member-variables-in-classes
    6: readability-braces-around-statements
    6: readability-else-after-return
    6: cppcoreguidelines-pro-type-reinterpret-cast
    4: readability-redundant-access-specifiers
    4: cppcoreguidelines-owning-memory
    4: readability-avoid-const-params-in-decls
    3: clang-analyzer-core.UndefinedBinaryOperatorResult
    3: performance-faster-string-find
    3: cppcoreguidelines-pro-type-const-cast
    2: readability-simplify-boolean-expr
    2: performance-unnecessary-copy-initialization
    2: performance-for-range-copy
    2: readability-container-size-empty
    2: readability-inconsistent-declaration-parameter-name
    2: readability-static-accessed-through-instance
    2: clang-analyzer-optin.mpi.MPI-Checker
    2: performance-noexcept-move-constructor
    2: bugprone-lambda-function-name
    2: readability-convert-member-functions-to-static
    2: clang-diagnostic-error
    1: performance-inefficient-vector-operation
    1: readability-suspicious-call-argument
    1: clang-analyzer-core.CallAndMessage
    1: readability-make-member-function-const
    1: cppcoreguidelines-prefer-member-initializer

You can find the raw clang-tidy output here (useful for checking what some of these diagnostics even do).

Based on this, I would suggest the following configuration:

-*,
bugprone-*,
clang-analyzer-*,
clang-diagnostic-*,
cppcoreguidelines-*,
-cppcoreguidelines-avoid-c-arrays,
-cppcoreguidelines-macro-usage,
-cppcoreguidelines-pro-bounds-pointer-arithmetic,
mpi-*,
performance-*,
readability-*,
-readability-avoid-const-params-in-decls,
-readability-identifier-length,
-readability-magic-numbers,
-readability-uppercase-literal-suffix

Thoughts?

Profile and optimize frontend (from CGF submission to serialized graph)

#107 has frontend benchmarks that do not instantiate the entire runtime.

Low hanging fruit:

Currently, 13% of the time is spent in iostreams for creating the CDAG debug labels. Maybe this should happen in graph printing, not generation (there is already a PoC implementation for that)
malloc/free pairs have significant impact deep in dependency checking. We can potentially eliminate a lot of convenience allocations (get_accessed_buffers et al)

Requires further investigation:

Use a bump allocator for intrusive graphs to improve locality, e.g. a heuristically sized pool per horizon + fallback as needed (maybe also for the dependencies vectors?)
command / task map rehashing is also a noticeable factor, but memory is abundant - reserve() them to avoid stalls
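
As a rough illustration of two of the ideas above (a bump allocator with a fallback, and reserving map capacity up front), here is a minimal sketch that uses only the C++17 standard library. The names `command_node`, `horizon_pool` and `make_command_map` are hypothetical and are not Celerity's actual types; `std::pmr::monotonic_buffer_resource` stands in for a hand-written bump allocator, and the pool size is an arbitrary guess rather than a measured heuristic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <new>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a graph node; not Celerity's actual command type.
struct command_node {
    std::uint64_t id;
    std::pmr::vector<command_node*> dependencies; // allocated from the same pool

    command_node(std::uint64_t id, std::pmr::memory_resource* mem) : id(id), dependencies(mem) {}
};

// One pool per horizon: allocations bump a pointer inside a pre-sized buffer and
// fall back to the upstream resource (the regular heap) once the buffer is full.
class horizon_pool {
  public:
    explicit horizon_pool(std::size_t initial_bytes) : m_resource(initial_bytes, std::pmr::new_delete_resource()) {}
    horizon_pool(const horizon_pool&) = delete;
    horizon_pool& operator=(const horizon_pool&) = delete;

    ~horizon_pool() {
        // Run destructors explicitly; the memory itself is released in one go
        // when m_resource is destroyed afterwards.
        for(command_node* node : m_nodes) node->~command_node();
    }

    command_node* create_node(std::uint64_t id) {
        void* mem = m_resource.allocate(sizeof(command_node), alignof(command_node));
        auto* node = new(mem) command_node(id, &m_resource);
        m_nodes.push_back(node);
        return node;
    }

  private:
    std::pmr::monotonic_buffer_resource m_resource;
    std::vector<command_node*> m_nodes;
};

// Reserving the expected number of entries allocates all buckets up front,
// so inserting up to that many elements does not trigger a rehash.
inline std::unordered_map<std::uint64_t, command_node*> make_command_map(std::size_t expected_commands) {
    std::unordered_map<std::uint64_t, command_node*> map;
    map.reserve(expected_commands);
    return map;
}
```

Whether this actually helps would have to be confirmed with the frontend benchmarks; the sketch only shows the mechanics.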

Replace ::cl::sycl with SYCL 2020 ::sycl namespace

SYCL 2020 introduces a new canonical namespace ::sycl. While the old ::cl::sycl namespace remains available for backwards compatibility, there is no reason for us to keep using it. The only question is whether we want to do a single bulk update of all namespaces (somewhat polluting the git history), or transition gradually by using the new namespace in all new (and modified) code going forward.

Remove support for CELERITY_FORCE_WG

CELERITY_FORCE_WG is a terrible hack that was intended to be used as a way of controlling the size of work groups for the non-ND-range parallel_for overload (which was the only overload we could support prior to SYCL 2020, unfortunately). It never saw much use, required horrible implementation-specific hackery to construct a sycl::item from a sycl::nd_item, and given that we now can support proper ND-range kernels, there is no reason for keeping it around.

Implement chained subscript operator on Celerity accessors

Hello,

I have the following piece of code:

QueueManager::getInstance().with_master_access([&](celerity::handler& cgh) {
      auto result = output_buf.template get_access<cl::sycl::access::mode::read>(cgh, cl::sycl::range<2>(args.problem_size, args.problem_size));

      for (size_t i = 0; i < args.problem_size; i++) {
        for (size_t j = 0; j < args.problem_size; j++) {
          output[i*args.problem_size + j] = result[i][j];
        }
      }
});

It yields the following error (using a hipSYCL CPU-backend):

/p/project/celerity/code/celerity-bench/single-kernel/sobel.cc:108:51: error: no match for ‘operator[]’ (operand types are ‘celerity::accessor<hipsycl::sycl::vec<float, 4>, 2, (hipsycl::sycl::access::mode)1024, (hipsycl::sycl::access::target)2018>’ and ‘size_t’ {aka ‘long unsigned int’})
           output[i*args.problem_size + j] = result[i][j];
                                                   ^

Using a hipSYCL GPU-backend the error message appears similar:

/data/code/celerity-bench/single-kernel/sobel.cc:108:51: error: no viable overloaded operator[] for type 'celerity::accessor<hipsycl::sycl::vec<float, 4>, 2, hipsycl::sycl::access::mode::read, hipsycl::sycl::access::target::host_buffer>'
          output[i*args.problem_size + j] = result[i][j];
                                            ~~~~~~^~

The code has not changed since the last time I used it successfully on the master branch. Please could you have a look and advise?

Many thanks in advance!
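
For reference, both compiler messages say the same thing: the accessor has no `operator[]` that accepts a single `size_t`, so `result[i][j]` cannot compile. A chained subscript is usually implemented with a small proxy object that the first `[]` returns. The following is a minimal, self-contained sketch of that technique; `accessor_2d` and `row_proxy` are hypothetical names over a flat row-major array and do not reflect how Celerity's accessor is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified 2D view over row-major storage.
template <typename T>
class accessor_2d {
  public:
    accessor_2d(T* data, std::size_t rows, std::size_t cols) : m_data(data), m_rows(rows), m_cols(cols) {}

    // Returned by the first subscript: remembers the selected row and
    // resolves the actual element on the second subscript.
    class row_proxy {
      public:
        row_proxy(T* row, std::size_t cols) : m_row(row), m_cols(cols) {}

        T& operator[](std::size_t col) const {
            assert(col < m_cols);
            return m_row[col];
        }

      private:
        T* m_row;
        std::size_t m_cols;
    };

    row_proxy operator[](std::size_t row) const {
        assert(row < m_rows);
        return row_proxy(m_data + row * m_cols, m_cols);
    }

  private:
    T* m_data;
    std::size_t m_rows;
    std::size_t m_cols;
};
```

With such a proxy, `acc[i][j]` resolves to element `i * cols + j` of the underlying storage.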

Add support for SYCL 2020 no_init accessor property

SYCL 2020 changes how accessors that should not retain the previous buffer contents are created. While the discard_* access modes that previously fulfilled this purpose remain available, they are being deprecated. The new preferred mechanism is to use the sycl::property::no_init property when creating an accessor, like so (using the new CTAD interface, see #24):

sycl::accessor acc { my_buffer, cgh, sycl::write, sycl::no_init };

While there should be no inherent problems in making this API available to Celerity as well, we currently use the discard_* access modes internally to decide whether certain data transfers are required or not. While we should migrate away from those at some point, a simpler course of action that will allow us to align with the new API without requiring extensive internal changes is to remap the internally stored type of access based on the presence of the no_init property. So for example sycl::write + sycl::no_init would create an accessor with sycl::access_mode::write, but the access would be recorded internally as discard_write.

I'm creating this issue mostly for documentation purposes, as @facuMH is already looking into it.
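
The remapping described above could look roughly like the following sketch. It is self-contained and therefore uses a stand-in `access_mode` enum and a plain `bool` instead of the real `sycl::access_mode` and `sycl::no_init` property; the function name `internal_mode` is hypothetical.

```cpp
#include <cassert>

// Stand-in for sycl::access_mode, including the legacy discard_* modes.
enum class access_mode { read, write, read_write, discard_write, discard_read_write };

// Returns the mode that would be recorded internally for dependency tracking.
// The accessor itself would still be created with the requested mode.
constexpr access_mode internal_mode(access_mode requested, bool has_no_init) {
    if(!has_no_init) return requested;
    switch(requested) {
    case access_mode::write: return access_mode::discard_write;
    case access_mode::read_write: return access_mode::discard_read_write;
    default: return requested; // read and the discard_* modes are left unchanged in this sketch
    }
}

static_assert(internal_mode(access_mode::write, true) == access_mode::discard_write, "write + no_init is recorded as discard_write");
```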

catch2 sigStackSize errors

Building against hipSYCL, CUDA 11.6, clang-12, Ubuntu 21.10, the build errors on

[ 63%] Built target convolution
[ 65%] Building CXX object test/CMakeFiles/unit_test_suite.dir/unit_test_suite.cc.o
clang: warning: Unknown CUDA version. cuda.h: CUDA_VERSION=11060. Assuming the latest supported version 10.1 [-Wunknown-cuda-version]
In file included from /home/duke/src/celerity-runtime/test/unit_test_suite.cc:4:
/home/duke/src/celerity-runtime/vendor/Catch2/single_include/catch2/catch.hpp:10675:34: error: constexpr variable 'sigStackSize' must be initialized by a constant expression
    static constexpr std::size_t sigStackSize = 32768 >= MINSIGSTKSZ ? 32768 : MINSIGSTKSZ;
                                 ^              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/duke/src/celerity-runtime/vendor/Catch2/single_include/catch2/catch.hpp:10675:58: note: non-constexpr function 'sysconf' cannot be used in a constant expression
    static constexpr std::size_t sigStackSize = 32768 >= MINSIGSTKSZ ? 32768 : MINSIGSTKSZ;
                                                         ^
/usr/include/x86_64-linux-gnu/bits/sigstksz.h:32:22: note: expanded from macro 'MINSIGSTKSZ'
# define MINSIGSTKSZ SIGSTKSZ
                     ^
/usr/include/x86_64-linux-gnu/bits/sigstksz.h:28:19: note: expanded from macro 'SIGSTKSZ'
# define SIGSTKSZ sysconf (_SC_SIGSTKSZ)
                  ^
/usr/include/unistd.h:640:17: note: declared here
extern long int sysconf (int __name) __THROW;
                ^
In file included from /home/duke/src/celerity-runtime/test/unit_test_suite.cc:4:
/home/duke/src/celerity-runtime/vendor/Catch2/single_include/catch2/catch.hpp:10675:34: error: dynamic initialization is not supported for __device__, __constant__, __shared__, and __managed__ variables.
    static constexpr std::size_t sigStackSize = 32768 >= MINSIGSTKSZ ? 32768 : MINSIGSTKSZ;
                                 ^              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 errors generated when compiling for sm_75.

Could it be an issue of the unknown CUDA version defaulting to 10.1?

(unrelated: I'm looking at your CI scripts to figure out how to get a working setup, but I haven't figured them out yet. Where/how are the containers built?)
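
Judging by the notes in the log above, the first error comes from the system headers rather than from the CUDA version: `MINSIGSTKSZ` expands to `sysconf (_SC_SIGSTKSZ)` on this system, and a function call cannot initialize a `constexpr` variable. One possible way around this kind of problem is to compute the value at run time instead. The sketch below illustrates the idea under that assumption; `sig_stack_size` is a hypothetical helper, not Catch2's actual code or its upstream fix.

```cpp
#include <algorithm>
#include <cassert>
#include <csignal>
#include <cstddef>

// Evaluated once at run time, so it also works when MINSIGSTKSZ is not a
// compile-time constant but expands to a function call.
inline std::size_t sig_stack_size() {
    static const std::size_t size = std::max<std::size_t>(32768, static_cast<std::size_t>(MINSIGSTKSZ));
    return size;
}
```

Code that previously sized a static array with the constant would need to allocate the alternate signal stack dynamically instead.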

Lack of cpu_set_t on macOS

macOS appears to lack cpu_set_t, though it seems the use is rather limited to counting cores, so some other API could stand in on macOS, e.g. here.
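
If the value is really only needed as a core count, a portable fallback could be as simple as the sketch below. The helper name `available_cores` is hypothetical. Note the difference in semantics: `std::thread::hardware_concurrency()` reports the number of hardware threads, whereas an affinity mask in a `cpu_set_t` reflects the cores the process is actually allowed to run on, so the two can disagree when the process is pinned.

```cpp
#include <cassert>
#include <thread>

// Portable approximation of the number of usable cores.
inline unsigned available_cores() {
    const unsigned n = std::thread::hardware_concurrency(); // may be 0 if the value cannot be determined
    return n > 0 ? n : 1;
}
```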

buffer_manager causes a segment fault

Describe the bug 💥
Some weird error.
Building celerity_runtime with the examples and tests works just fine. All runtime tests pass and the examples also run fine.
Now the weird part: using add_celerity_to_target causes a segmentation fault when the application tries to create a celerity::buffer. The segmentation fault occurs in buffer_manager.h on line 130.

To Reproduce 🔁
Steps to reproduce the behavior:

Open CMakeLists.txt of any example
Add find_package(Celerity 0.2.1 REQUIRED CONFIG HINTS "/opt/celerity/lib")
Remove target_link_libraries and add_sycl_to_target
Add (for matmul as example)

add_celerity_to_target(
  TARGET matmul
  SOURCES matmul.cc
)

Build and run

Output

[2020-10-26 21:06:53.208] [default] [info] [rank = 0] sycl = hipSYCL 0.8.2, build = release, pid = 12708, event = initialized
[2020-10-26 21:06:53.220] [default] [info] [rank = 0] Using platform 'hipSYCL [SYCL over CUDA/HIP] on NVIDIA CUDA', device 'GeForce GTX 1050' (automatically selected platform 0, device 0)
[2020-10-26 21:06:53.280] [bench] [info] [rank = 0] event = userConfig, matSize = 1024
[XPS-15-9560:12708] *** Process received signal ***
[XPS-15-9560:12708] Signal: Segmentation fault (11)
[XPS-15-9560:12708] Signal code: Address not mapped (1)
[XPS-15-9560:12708] Failing at address: 0x71
[XPS-15-9560:12708] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f92c81723c0]
[XPS-15-9560:12708] [ 1] /home/salih/projects/c++/celerity-runtime/cmake-build-debug/examples/matmul/matmul[0x434973]
[XPS-15-9560:12708] [ 2] /home/salih/projects/c++/celerity-runtime/cmake-build-debug/examples/matmul/matmul[0x433ebd]
[XPS-15-9560:12708] [ 3] /home/salih/projects/c++/celerity-runtime/cmake-build-debug/examples/matmul/matmul[0x433b8c]
[XPS-15-9560:12708] [ 4] /home/salih/projects/c++/celerity-runtime/cmake-build-debug/examples/matmul/matmul[0x433a35]
[XPS-15-9560:12708] [ 5] /home/salih/projects/c++/celerity-runtime/cmake-build-debug/examples/matmul/matmul[0x42d4f5]
[XPS-15-9560:12708] [ 6] /home/salih/projects/c++/celerity-runtime/cmake-build-debug/examples/matmul/matmul[0x42cff9]
[XPS-15-9560:12708] [ 7] /home/salih/projects/c++/celerity-runtime/cmake-build-debug/examples/matmul/matmul[0x429641]
[XPS-15-9560:12708] [ 8] /home/salih/projects/c++/celerity-runtime/cmake-build-debug/examples/matmul/matmul[0x4291d7]
[XPS-15-9560:12708] [ 9] /home/salih/projects/c++/celerity-runtime/cmake-build-debug/examples/matmul/matmul[0x41fd94]
[XPS-15-9560:12708] [10] /home/salih/projects/c++/celerity-runtime/cmake-build-debug/examples/matmul/matmul[0x41a51c]
[XPS-15-9560:12708] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f92c7c470b3]
[XPS-15-9560:12708] [12] /home/salih/projects/c++/celerity-runtime/cmake-build-debug/examples/matmul/matmul[0x41a28e]
[XPS-15-9560:12708] *** End of error message ***

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

Environment 🌎

OS: Ubuntu 20.04
Sycl: hipSYCL 0.8.2
Celerity: 0.2.1
Clang 10 and CUDA 10.1

Add support for SYCL 2020 CTAD accessors

SYCL 2020 raised the required language level to C++17, and with that, several improvements to the existing interfaces could be made. One of the major usability improvements is to make it easier to create buffer accessors. Going from this:

auto acc = my_buffer.get_access<sycl::access::mode::read_write>(cgh);

To this:

sycl::accessor acc { my_buffer, cgh, sycl::read_write };

The most notable part of this change is the use of deduction tags such as sycl::read_write to infer the access mode using C++17 CTAD.

We should align our interfaces to match this, where possible. I'm creating this issue mostly for documentation purposes, as @facuMH is already looking into it.
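
To illustrate the mechanism, here is a self-contained sketch of how a tag argument lets C++17 CTAD deduce the access mode. Everything lives in a `mock_sycl` namespace and is a simplified stand-in: the real SYCL 2020 classes have more template parameters and constructors, and none of this is Celerity's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

namespace mock_sycl {

enum class access_mode { read, write, read_write };

// Empty tag types whose only purpose is to carry the mode in their type.
template <access_mode Mode>
struct mode_tag {};

inline constexpr mode_tag<access_mode::read> read_only{};
inline constexpr mode_tag<access_mode::write> write_only{};
inline constexpr mode_tag<access_mode::read_write> read_write{};

struct handler {};

template <typename T>
struct buffer {
    std::vector<T> data;
};

template <typename T, access_mode Mode>
class accessor {
  public:
    static constexpr access_mode mode = Mode;

    // T is deduced from the buffer argument and Mode from the tag argument,
    // so callers never have to spell out the template arguments.
    accessor(buffer<T>& buf, handler&, mode_tag<Mode>) : m_buf(&buf) {}

    T& operator[](std::size_t i) const { return m_buf->data[i]; }

  private:
    buffer<T>* m_buf;
};

} // namespace mock_sycl
```

With these definitions, `mock_sycl::accessor acc{buf, cgh, mock_sycl::read_write};` deduces `accessor<int, access_mode::read_write>` for a `buffer<int>`.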

Address problems in current CI setup

There currently are several problems with our CI setup that we will have to address at some point:

Race conditions within concurrent workflow runs. #29 addresses an issue where the clean option of the checkout step would sometimes remove build directories of other workflow runs while they were still running. Although that is an improvement, I suspect that there may be another problem, namely when the checkout step changes revisions before another workflow can start building (it seems that the runner only ever does one workflow step at a time, but consecutive steps can come from different workflow runs - so it is conceivable that it does checkout 1, checkout 2, build 1, build 2, which would mean that both runs build from the same revision). I think the cleanest solution to this would be to check out the entire repository in a unique directory each time. Update: Addressed in #32.
CI currently does not run on external PRs (from forks). The basic concern here is that we don't want to run arbitrary code on our own self-hosted runner. I thought that I had solved this in a9a3afe by means of the pull_request_target, but it turns out that it doesn't exactly do what I thought it does (see also #21). However, since GitHub, in an attempt to mitigate the problem of PR cryptomining, recently added functionality that requires manual approval for action runs for first-time contributors, I think we could now enable the pull_request trigger for external PRs as well. Ideally they'd add an option to always require approval for external workflow runs, though.
Support running on multiple SYCL versions, ideally on a pinned version we know we want to support, as well as a nightly build so we can catch regressions early.
A minor one: We currently upload the build logs after the build step, only to download them again for the report step. As long as the report step runs before cleanup, why don't we simply read the log files from the build directories directly..? Update: Not a good idea (as explained in #32).
A quality-of-life improvement we could also look into is to skip running a workflow for both push and pull_request triggers, if they turn out to be using the same revision (addressed in #74).

[Help Wanted] Error: indirection requires pointer operand HIPSYCL_ID_BINARY_OP_IN_PLACE(-=)

Hello, when I try to build celerity with hipSYCL I get an error (see output).

Steps to reproduce

git clone https://github.com/celerity/celerity-runtime
cd celerity-runtime
mkdir build && cd build
cmake -G Ninja .. -DCMAKE_PREFIX_PATH="/opt/hipSYCL/lib" -DHIPSYCL_PLATFORM=cuda -DHIPSYCL_GPU_ARCH=sm_61 -DCMAKE_INSTALL_PREFIX="/opt/celerity/" -DCMAKE_BUILD_TYPE=Release
ninja or ninja install

System

Celerity Version: Master
OS: Ubuntu 18.04 (WSL2)
SYCL: hipSYCL 0.8.1-20200825
CUDA Version 10.0.130
Clang Version 9.0.0-2~ubuntu18.04.2

Output

[18/41] Building SYCL object examples/wave_sim/CMakeFiles/wave_sim.dir/wave_sim.cc.o
FAILED: examples/wave_sim/CMakeFiles/wave_sim.dir/wave_sim.cc.o
/opt/hipSYCL/bin/syclcc-clang -DBOOST_PP_VARIADICS=1 -DSPDLOG_COMPILED_LIB -I../include -I../vendor -I/usr/local/include -I../vendor/spdlog/include -I/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi -I/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/opal/mca/event/libevent2022/libevent -I/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/opal/mca/event/libevent2022/libevent/include -I/usr/lib/x86_64-linux-gnu/openmpi/include --hipsycl-platform=cuda --hipsycl-gpu-arch=sm_61 -fdiagnostics-color=always -O3 -DNDEBUG -Wall -Wextra -Wno-unused-parameter -pthread -MD -MT examples/wave_sim/CMakeFiles/wave_sim.dir/wave_sim.cc.o -MF examples/wave_sim/CMakeFiles/wave_sim.dir/wave_sim.cc.o.d -o examples/wave_sim/CMakeFiles/wave_sim.dir/wave_sim.cc.o -c ../examples/wave_sim/wave_sim.cc
In file included from ../examples/wave_sim/wave_sim.cc:7:
In file included from /opt/hipSYCL/bin/../include/CL/sycl.hpp:31:
In file included from /opt/hipSYCL/bin/../include/CL/../hipSYCL/sycl/sycl.hpp:40:
In file included from /opt/hipSYCL/bin/../include/CL/../hipSYCL/sycl/device_selector.hpp:33:
In file included from /opt/hipSYCL/bin/../include/CL/../hipSYCL/sycl/detail/../device.hpp:36:
In file included from /opt/hipSYCL/bin/../include/CL/../hipSYCL/sycl/info/info.hpp:37:
In file included from /opt/hipSYCL/bin/../include/CL/../hipSYCL/sycl/info/kernel.hpp:33:
In file included from /opt/hipSYCL/bin/../include/CL/../hipSYCL/sycl/detail/../range.hpp:35:
/opt/hipSYCL/bin/../include/CL/../hipSYCL/sycl/detail/../id.hpp:187:3: error: indirection requires pointer operand ('id<3>' invalid)
  HIPSYCL_ID_BINARY_OP_IN_PLACE(-=)
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/hipSYCL/bin/../include/CL/../hipSYCL/sycl/detail/../id.hpp:183:12: note: expanded from macro 'HIPSYCL_ID_BINARY_OP_IN_PLACE'
    return *lhs; \
           ^~~~
../include/range_mapper.h:184:18: note: in instantiation of member function 'hipsycl::sycl::operator-=' requested here
                        result.offset -= delta;
                                      ^
../include/range_mapper.h:25:11: note: in instantiation of member function 'celerity::access::neighborhood<2>::operator()' requested here
                        return std::forward<Functor>(fn)(chunk);
                               ^
../include/range_mapper.h:32:11: note: in instantiation of function template specialization 'celerity::detail::invoke_range_mapper<2, 2, const celerity::access::neighborhood<2> &>' requested here
                        return invoke_range_mapper<KernelDims, BufferDims>(fn, chunk, buffer_size);
                               ^
/usr/bin/../lib/gcc/x86_64-linux-gnu/7.5.0/../../../../include/c++/7.5.0/bits/unique_ptr.h:821:34: note: in instantiation of function template specialization 'celerity::detail::range_mapper<2, 2>::range_mapper<celerity::access::neighborhood<2> >' requested here
    { return unique_ptr<_Tp>(new _Tp(std::forward<_Args>(__args)...)); }
                                 ^
../include/buffer.h:71:41: note: in instantiation of function template specialization 'std::make_unique<celerity::detail::range_mapper<2, 2>, celerity::access::neighborhood<2> &, hipsycl::sycl::access::mode, hipsycl::sycl::range<2> >' requested here
                        prepass_cgh.add_requirement(id, std::make_unique<detail::range_mapper<rmfn_traits::arg1_type::dims, Dims>>(rmfn, Mode, get_range()));
                                                             ^
../include/buffer.h:58:10: note: in instantiation of function template specialization 'celerity::buffer<float, 2>::get_access<hipsycl::sycl::access::mode::read, hipsycl::sycl::access::target::global_buffer, celerity::access::neighborhood<2> >' requested here
                return get_access<Mode, cl::sycl::access::target::global_buffer>(cgh, rmfn);
                       ^
../examples/wave_sim/wave_sim.cc:44:25: note: in instantiation of function template specialization 'celerity::buffer<float, 2>::get_access<hipsycl::sycl::access::mode::read, celerity::access::neighborhood<2> >' requested here
                auto r_u = u.template get_access<cl::sycl::access::mode::read>(cgh, celerity::access::neighborhood<2>(1, 1));
                                      ^
../examples/wave_sim/wave_sim.cc:61:2: note: in instantiation of function template specialization 'step<float, init_config, initialize>' requested here
        step<float, init_config, class initialize>(queue, up, u, dt, delta);
        ^
1 error generated when compiling for sm_61.
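
Reading the diagnostics above, the error originates in hipSYCL's own `id.hpp`: the macro body returns `*lhs`, but `lhs` has type `id<3>` and is not a pointer. Because member functions of templates are only checked when instantiated, this would only surface once `operator-=` is actually used, as in `range_mapper.h`. For comparison, an in-place operator normally returns the left-hand reference directly. The sketch below uses a hypothetical, simplified `id` type and is not hipSYCL's code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Simplified stand-in for a SYCL id type.
template <int Dims>
struct id {
    std::array<std::size_t, static_cast<std::size_t>(Dims)> values{};

    // In-place subtraction: modifies lhs and returns it by reference.
    friend id& operator-=(id& lhs, const id& rhs) {
        for(std::size_t d = 0; d < static_cast<std::size_t>(Dims); ++d) {
            lhs.values[d] -= rhs.values[d];
        }
        return lhs; // lhs is a reference, so there is nothing to dereference
    }
};
```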

Document new experimental features (side effects, drain)

can't build using OpenMPI 1.5.5 and Boost 1.65

I used the following command to configure CMake:

cmake -G Ninja . -DCMAKE_PREFIX_PATH="/path/hip/build/lib/" -DHIPSYCL_PLATFORM=cuda -DHIPSYCL_GPU_ARCH=sm_35 -DCMAKE_INSTALL_PREFIX="/path/celerity-build" -DBOOST_ROOT=/path/boost-build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS="-std=c++14"  -DCMAKE_SYCL_FLAGS="--gcc-toolchain=/path/gcc/7.3.0"

After everything was successfully configured I tried to build it using:
ninja install
but I got the following error:

FAILED: CMakeFiles/celerity_runtime.dir/src/device_queue.cc.o
/path/hip/build/bin/syclcc-clang  -Iinclude -Ivendor -I/path/boost-build/include -I/path/openmpi/gcc/openmpi-1.5.5/@sys/include -Ispdlog-src/include --gcc-toolchain=/path/gcc/7.3.0 -fdiagnostics-color=always -O3 -DNDEBUG   -Wall -Wextra -Wno-unused-parameter -MD -MT CMakeFiles/celerity_runtime.dir/src/device_queue.cc.o -MF CMakeFiles/celerity_runtime.dir/src/device_queue.cc.o.d -o CMakeFiles/celerity_runtime.dir/src/device_queue.cc.o -c src/device_queue.cc
syclcc fatal error: Required command line argument --hipsycl-gpu-arch or environment variable HIPSYCL_GPU_ARCH not specified

I managed to fix this issue by setting env var:
export HIPSYCL_GPU_ARCH=sm_35
but now I have a different error after running ninja install:

FAILED: CMakeFiles/celerity_runtime.dir/src/config.cc.o
/path/hip/build/bin/syclcc-clang  -Iinclude -Ivendor -I/path/boost-build/include -I/afs/math.tu-berlin.de/software/openmpi/gcc/openmpi-1.5.5/@sys/include -Ispdlog-src/include --gcc-toolchain=/afs/math/software/gcc/7.3.0 -fdiagnostics-color=always -O3 -DNDEBUG   -Wall -Wextra -Wno-unused-parameter -MD -MT CMakeFiles/celerity_runtime.dir/src/config.cc.o -MF CMakeFiles/celerity_runtime.dir/src/config.cc.o.d -o CMakeFiles/celerity_runtime.dir/src/config.cc.o -c src/config.cc
In file included from src/config.cc:11:
In file included from include/workaround.h:3:
In file included from /net/path/hip/build/bin/../include/CL/sycl.hpp:35:
In file included from /net/path/hip/build/bin/../include/CL/sycl/backend/backend.hpp:51:
In file included from /net/path/hip/build/bin/../include/hipSYCL/hip/hip_runtime.h:58:
In file included from /net/path/hip/build/bin/../include/hipSYCL/hip/nvcc_detail/hip_runtime.h:28:
In file included from /net/path/hip/build/bin/../include/hipSYCL/hip/hip_runtime_api.h:325:
/net/path/hip/build/bin/../include/hipSYCL/hip/nvcc_detail/hip_runtime_api.h:886:9: warning: variable 'cdattr' is used uninitialized whenever switch default is taken [-Wsometimes-uninitialized]
        default:
        ^~~~~~~
/net/path/hip/build/bin/../include/hipSYCL/hip/nvcc_detail/hip_runtime_api.h:891:41: note: uninitialized use occurs here
    cerror = cudaDeviceGetAttribute(pi, cdattr, device);
                                        ^~~~~~
/net/path/hip/build/bin/../include/hipSYCL/hip/nvcc_detail/hip_runtime_api.h:786:5: note: variable 'cdattr' is declared here
    enum cudaDeviceAttr cdattr;
    ^
src/config.cc:102:41: error: use of undeclared identifier 'OMPI_COMM_TYPE_HOST'
                                MPI_Comm_split_type(MPI_COMM_WORLD, SPLIT_TYPE, 0, MPI_INFO_NULL, &node_comm);
                                                                    ^
src/config.cc:93:20: note: expanded from macro 'SPLIT_TYPE'
#define SPLIT_TYPE OMPI_COMM_TYPE_HOST
                   ^
src/config.cc:137:5: error: function-like macro 'BOOST_PP_OVERLOAD' is not defined
#if WORKAROUND(HIPSYCL, 0)
    ^
include/workaround.h:38:58: note: expanded from macro 'WORKAROUND'
#define WORKAROUND(impl, ...) (WORKAROUND_##impl == 1 && _WA_CHECK_VERSION(__VA_ARGS__))
                                                         ^
include/workaround.h:33:32: note: expanded from macro '_WA_CHECK_VERSION'
#define _WA_CHECK_VERSION(...) BOOST_PP_OVERLOAD(_WA_CHECK_VERSION_, __VA_ARGS__)(__VA_ARGS__)
                               ^
1 warning and 2 errors generated when compiling for sm_35.

Has anyone run into the same problem?