gtensor's Introduction

gtensor

gtensor is a C++14 header-only multi-dimensional array library for hybrid GPU development. It was inspired by xtensor, and was designed to support the GPU port of the GENE fusion code.

Features:

  • multi-dimensional arrays and array views, with easy interoperability with Fortran and thrust
  • automatically generate GPU kernels based on array operations
  • define complex re-usable operations with lazy evaluation. This allows operations to be composed in different ways and evaluated once as a single kernel (see the sketch after this list)
  • easily support both CPU-only and GPU-CPU hybrid code in the same code base, with only minimal use of #ifdef.
  • multi-dimensional array slicing similar to numpy
  • GPU support for nVidia via CUDA and AMD via HIP/ROCm, and experimental Intel GPU support via SYCL.
  • [Experimental] C library cgtensor with wrappers around common GPU operations (allocate and deallocate, device management, memory copy and set)
  • [Experimental] lightweight wrappers around GPU BLAS, LAPACK, and FFT routines.
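
To make the lazy evaluation and slicing features concrete, here is a minimal host-only sketch (not taken from the repository examples): the expression is built up front, evaluated in one pass on assignment, and then inspected through a strided view.

#include <iostream>

#include <gtensor/gtensor.h>

int main()
{
    const int n = 8;
    gt::gtensor<double, 1> x(gt::shape(n));
    gt::gtensor<double, 1> y(gt::shape(n));

    for (int i = 0; i < n; i++) {
        x(i) = 2.0 * i;
        y(i) = 1.0;
    }

    // Building the expression does no work yet; evaluation is lazy.
    auto expr = 0.5 * x + y;

    // Assignment evaluates the whole expression in one pass
    // (as a single generated kernel in GPU builds).
    gt::gtensor<double, 1> z(gt::shape(n));
    z = expr;

    // numpy-like slicing: view every other element.
    using namespace gt::placeholders;
    std::cout << z.view(gt::gslice(_, _, 2)) << std::endl;
}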

License

gtensor is licensed under the 3-clause BSD license. See the LICENSE file for details.

Installation (cmake)

gtensor uses cmake 3.13+ to build the tests and install:

git clone https://github.com/wdmapp/gtensor.git
cd gtensor
cmake -S . -B build -DGTENSOR_DEVICE=cuda \
  -DCMAKE_INSTALL_PREFIX=/opt/gtensor \
  -DBUILD_TESTING=OFF
cmake --build build --target install

To build for CPU/host only, use -DGTENSOR_DEVICE=host; for AMD/HIP, use -DGTENSOR_DEVICE=hip -DCMAKE_CXX_COMPILER=$(which hipcc); and for Intel/SYCL, use -DGTENSOR_DEVICE=sycl -DCMAKE_CXX_COMPILER=$(which dpcpp). See the sections below for more device-specific requirements.

Note that gtensor can still be used by applications not using cmake - see Usage (GNU make) for an example.

To use the internal data vector implementation instead of thrust, set -DGTENSOR_USE_THRUST=OFF. This has the advantage that device array allocations will not be zero initialized, which can improve performance significantly for some workloads, particularly when temporary arrays are used.

To enable the experimental C/C++ library features, set GTENSOR_BUILD_CLIB, GTENSOR_BUILD_BLAS, or GTENSOR_BUILD_FFT to ON. Note that the BLAS wrapper includes some LAPACK routines for LU factorization.

nVidia CUDA requirements

gtensor for nVidia GPUs with CUDA requires CUDA Toolkit 10.0+.

AMD HIP requirements

gtensor for AMD GPUs with HIP requires ROCm 4.5.0+, and rocthrust and rocprim. See the ROCm installation guide for details. In Ubuntu, after setting up the ROCm repository, the required packages can be installed like this:

sudo apt install rocm-dkms rocm-dev rocthrust

The official packages install to /opt/rocm. If using a different install location, set the ROCM_PATH cmake variable. To use coarse grained managed memory, ROCm 5.0+ is required.

To use gt-fft and gt-blas, rocsolver, rocblas, and rocfft packages need to be installed as well.

Intel SYCL requirements

The current SYCL implementation requires Intel oneAPI/DPC++ 2022.0 or later, with some known issues in gt-blas and gt-fft (npvt getrf/rs, 2d fft). Using the latest available release is recommended. When installing via the oneAPI package manager instructions, installing the intel-oneapi-dpcpp-compiler package will pull in all required packages (the rest of the Base Toolkit is not required).

The reason for the dependence on Intel OneAPI is that the implementation uses the USM extension, which is not part of the current SYCL standard. CodePlay ComputeCpp 2.0.0 has an experimental implementation that is sufficiently different to require extra work to support.

The default device selector is always used. To control device selection, set the SYCL_DEVICE_FILTER environment variable. See the Intel llvm documentation for details.

The port is tested with an Intel iGPU, specifically UHD Graphics 630. It may also work with the experimental CUDA backend for nVidia GPUs, but this is untested and it's recommended to use the gtensor CUDA backend instead.

Better support for other SYCL implementations like hipSYCL and ComputeCpp should be possible to add, with the possible exception of the gt-blas and gt-fft sub-libraries, which require oneMKL.

HOST CPU (no device) requirements

gtensor should build with any C++ compiler supporting C++14. It has been tested with g++ 7, 8, and 9 and clang++ 8, 9, and 10.

Advanced multi-device configuration

By default, gtensor will install support for the device specified by the GTENSOR_DEVICE variable (default cuda), and also the host (CPU only) device. This can be configured with GTENSOR_BUILD_DEVICES as a semicolon (;) separated list. For example, to build support for all four backends (assuming a machine with multi-vendor GPUs and the associated toolkits installed):

cmake -S . -B build -DGTENSOR_DEVICE=cuda \
  -DGTENSOR_BUILD_DEVICES="host;cuda;hip;sycl" \
  -DCMAKE_INSTALL_PREFIX=/opt/gtensor \
  -DBUILD_TESTING=OFF

This will cause targets to be created for each device: gtensor::gtensor_cuda, gtensor::gtensor_host, gtensor::gtensor_hip, and gtensor::gtensor_sycl. The main gtensor::gtensor target will be an alias for the default set by GTENSOR_DEVICE (the cuda target in the above example).

Usage (cmake)

Once installed, gtensor can be used by adding this to a project's CMakeLists.txt:

# if using GTENSOR_DEVICE=cuda
enable_language(CUDA)

find_package(gtensor REQUIRED)

# for each C++ target using gtensor
target_gtensor_sources(myapp PRIVATE src/myapp.cxx)
target_link_libraries(myapp gtensor::gtensor)

When running cmake for a project, add the gtensor install prefix to CMAKE_PREFIX_PATH. For example:

cmake -S . -B build -DCMAKE_PREFIX_PATH=/opt/gtensor

The default gtensor device, set with the GTENSOR_DEVICE cmake variable when installing gtensor, can be overridden by setting GTENSOR_DEVICE again in the client application before the call to find_package(gtensor), typically via the -D cmake command line option. This can be useful for debugging an application: building with -DGTENSOR_DEVICE=host makes it possible to check whether a problem is related to the hybrid device model or is algorithmic, and to run a host-only interactive debugger. Note that only devices specified with GTENSOR_BUILD_DEVICES at gtensor install time are available (the default device and host, if no option was specified).

Using gtensor as a subdirectory or git submodule

gtensor also supports usage as a subdirectory of another cmake project. This is typically done via git submodules. For example:

cd /path/to/app
git submodule add https://github.com/wdmapp/gtensor.git external/gtensor

In the application's CMakeLists.txt:

# set here or on the cmake command-line with `-DGTENSOR_DEVICE=...`.
set(GTENSOR_DEVICE "cuda" CACHE STRING "")

if (${GTENSOR_DEVICE} STREQUAL "cuda")
  enable_language(CUDA)
endif()

# after setting GTENSOR_DEVICE
add_subdirectory(external/gtensor)

# for each C++ target using gtensor
target_gtensor_sources(myapp PRIVATE src/myapp.cxx)
target_link_libraries(myapp gtensor::gtensor)

Usage (GNU make)

As a header only library, gtensor can be integrated into an existing GNU make project as a subdirectory fairly easily for cuda and host devices.

The subdirectory is typically managed via git submodules, for example:

cd /path/to/app
git submodule add https://github.com/wdmapp/gtensor.git external/gtensor

See examples/Makefile for a good way of organizing a project's Makefile to provide cross-device support. The examples can be built for different devices by setting the GTENSOR_DEVICE variable, e.g. cd examples; make GTENSOR_DEVICE=host.

Getting Started

Basic Example (host CPU only)

Here is a simple example that computes a matrix with the multiplication table and prints it out row by row using array slicing:

#include <iostream>

#include <gtensor/gtensor.h>

int main(int argc, char **argv) {
    const int n = 9;
    gt::gtensor<int, 2> mult_table(gt::shape(n, n));

    for (int i=0; i<n; i++) {
        for (int j=0; j<n; j++) {
            mult_table(i,j) = (i+1)*(j+1);
        }
    }

    for (int i=0; i<n; i++) {
        std::cout << mult_table.view(i, gt::all) << std::endl;
    }
}

It can be built like this, using gcc version 5 or later:

g++ -std=c++14 -I /path/to/gtensor/include -o mult_table mult_table.cxx

and produces the following output:

{ 1 2 3 4 5 6 7 8 9 }
{ 2 4 6 8 10 12 14 16 18 }
{ 3 6 9 12 15 18 21 24 27 }
{ 4 8 12 16 20 24 28 32 36 }
{ 5 10 15 20 25 30 35 40 45 }
{ 6 12 18 24 30 36 42 48 54 }
{ 7 14 21 28 35 42 49 56 63 }
{ 8 16 24 32 40 48 56 64 72 }
{ 9 18 27 36 45 54 63 72 81 }

See the full mult_table example for different ways of performing this operation, taking advantage of more gtensor features.

GPU and CPU Example

The following program computes the vector product a*x + y, where a is a scalar and x and y are vectors. If built with GTENSOR_HAVE_DEVICE defined and using the appropriate compiler (currently either nvcc or hipcc), it will run the computation on a GPU device.

See the full daxpy example for more detailed comments and an example of using an explicit kernel.

#include <iostream>

#include <gtensor/gtensor.h>

using namespace std;

// provides convenient shortcuts for common gtensor functions, for example
// underscore ('_') to represent open slice ends.
using namespace gt::placeholders;

template <typename S>
gt::gtensor<double, 1, S> daxpy(double a, const gt::gtensor<double, 1, S> &x,
                                const gt::gtensor<double, 1, S> &y) {
    return a * x + y;
}

int main(int argc, char **argv)
{
    int n = 1024 * 1024;
    int nprint = 32;

    double a = 0.5;

    // Define and allocate two 1d vectors of size n on the host.
    gt::gtensor<double, 1, gt::space::host> h_x(gt::shape(n));
    gt::gtensor<double, 1, gt::space::host> h_y = gt::empty_like(h_x);
    gt::gtensor<double, 1, gt::space::host> h_axpy;

    // initialize host vectors
    for (int i=0; i<n; i++) {
        h_x(i) = 2.0 * static_cast<double>(i);
        h_y(i) = static_cast<double>(i);
    }

#ifdef GTENSOR_HAVE_DEVICE
    cout << "gtensor have device" << endl;

    // Define and allocate device versions of h_x and h_y, and declare
    // a variable for the result on the GPU.
    gt::gtensor<double, 1, gt::space::device> d_x(gt::shape(n));
    gt::gtensor<double, 1, gt::space::device> d_y = gt::empty_like(d_x);
    gt::gtensor<double, 1, gt::space::device> d_axpy;
 
    // Explicit copies of input from host to device.
    copy(h_x, d_x);
    copy(h_y, d_y);

    // This automatically generates a computation kernel to run on the
    // device.
    d_axpy = daxpy(a, d_x, d_y);

    // Explicit copy of result to host
    h_axpy = gt::empty_like(h_x);
    copy(d_axpy, h_axpy);
#else
    // host implementation - simply call directly using host gtensors
    h_axpy = daxpy(a, h_x, h_y);
#endif // GTENSOR_HAVE_DEVICE

    // Define a slice to print a subset of elements for checking result
    auto print_slice = gt::gslice(_, _, n/nprint);
    cout << "a       = " << a << endl;
    cout << "x       = " << h_x.view(print_slice)  << endl;
    cout << "y       = " << h_y.view(print_slice)  << endl;
    cout << "a*x + y = " << h_axpy.view(print_slice) << endl;
}

Example build for nVidia GPU using nvcc:

GTENSOR_HOME=/path/to/gtensor
nvcc -x cu -std=c++14 --expt-extended-lambda --expt-relaxed-constexpr \
 -DGTENSOR_HAVE_DEVICE -DGTENSOR_DEVICE_CUDA -DGTENSOR_USE_THRUST \
 -DNDEBUG -O3 \
 -I $GTENSOR_HOME/include \
 -o daxpy_cuda daxpy.cxx

Build for AMD GPU using hipcc:

hipcc -hc -std=c++14 \
 -DGTENSOR_HAVE_DEVICE -DGTENSOR_DEVICE_HIP -DGTENSOR_USE_THRUST \
 -DNDEBUG -O3 \
 -I $GTENSOR_HOME/include \
 -isystem /opt/rocm/rocthrust/include \
 -isystem /opt/rocm/include \
 -isystem /opt/rocm/rocprim/include \
 -isystem /opt/rocm/hip/include \
 -o daxpy_hip daxpy.cxx

Build for Intel GPU using dpcpp:

dpcpp -fsycl -std=c++14 \
 -DGTENSOR_HAVE_DEVICE -DGTENSOR_DEVICE_SYCL \
 -DGTENSOR_DEVICE_SYCL_GPU \
 -DNDEBUG -O3 \
 -I $GTENSOR_HOME/include \
 -o daxpy_sycl daxpy.cxx

Build for host CPU:

g++ -std=c++14 \
 -DNDEBUG -O3 \
 -I $GTENSOR_HOME/include \
 -o daxpy_host daxpy.cxx

Example using gtensor with existing GPU code

If you have existing code written in CUDA or HIP, you can use the gt::adapt and gt::adapt_device functions to wrap existing allocated host and device memory in gtensor span containers. This allows you to use the convenience of gtensor for new code without having to do an extensive rewrite.

See trig.cu and trig_adapted.cxx. The same approach will work for HIP with minor modifications.
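
As a rough illustration (not taken from the repository examples), a sketch like the following wraps an existing CUDA device allocation; the exact gt::adapt_device signature assumed here (raw pointer plus shape) should be checked against trig_adapted.cxx.

// Hedged sketch: wrap memory allocated elsewhere (e.g. with cudaMalloc) in a
// gtensor span and operate on it with expression assignment. The form
// gt::adapt_device<1>(pointer, shape) is an assumption; see trig.cu and
// trig_adapted.cxx for the authoritative usage.
#include <gtensor/gtensor.h>

void scale_in_place(double* d_data, int n, double factor)
{
    // Wrap the existing device allocation; no copy is made.
    auto d_span = gt::adapt_device<1>(d_data, gt::shape(n));

    // Generates a device kernel that scales the data in place
    // (assumes spans accept expression assignment like containers do).
    d_span = factor * d_span;
}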

Data Types and mutability

gtensor has two types of data objects - those which are containers that own the underlying data, like gtensor, and those which behave like span objects or pointers, like gtensor_span. The gview objects, which are generally constructed via the helper method gt::view or the convenience view methods on gtensor, implement the slicing, broadcasting, and axis manipulation functions, and have hybrid behavior based on the underlying expression. In particular, a gview wrapping a gtensor_span object will have span-like behavior, and in most other cases will have owning container behavior.

Before a data object can be passed to a GPU kernel, it must be converted to a span-like object, and must be resident on the device. This generally happens automatically when using expression evaluation and gtensor_device, but must be done manually by calling the to_kernel() method when using custom kernels with gt::launch<N>. What typically happens is that the underlying gtensor objects get transformed to gtensor_span of the appropriate type. This happens even when they are wrapped inside complex gview and gfunction objects.
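
For example, a custom kernel might look like the following minimal sketch (not from the repository); the gt::launch<N>(shape, lambda) form and the GT_LAMBDA device-lambda macro are assumed here, so see the full daxpy example for the authoritative usage.

// Hedged sketch of an explicit kernel: convert the container to a span with
// to_kernel() so the device lambda captures a span-like object by value.
#include <gtensor/gtensor.h>

void double_elements(gt::gtensor_device<double, 1>& d_x)
{
    auto k_x = d_x.to_kernel(); // gtensor_span, cheap to copy into the kernel

    // Assumed form: gt::launch<N>(shape, lambda), one index per dimension.
    gt::launch<1>(d_x.shape(), GT_LAMBDA(int i) { k_x(i) = 2.0 * k_x(i); });
}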

The objects with span-like behavior also have shallow const behavior: even if the outer object is const, they allow modification of the underlying data. This is consistent with std::span as standardized in C++20. The idea is that if copying does not copy the underlying data (shallow copy), the rest of the interface should behave consistently; this property is called "regularity". Shallow const also allows non-mutable lambdas to be used for launch kernels. This matters because SYCL requires const kernel functions, so the left hand side of an expression must allow mutation of the underlying data even when const, since it may be captured inside a non-mutable lambda and forced to be const.

To ensure const-correctness whenever possible, the to_kernel() routine on const gtensor<T, N, S> is special cased to return a gtensor_span<const T, N, S>. This makes it so even though a non-const reference is returned from the element accessors (shallow const behavior of span like object), modification is still not allowed since the underlying type is const.

To make this more concrete, here are some examples:

gt::gtensor_device<int, 1> a{1, 2, 3};
const gt::gtensor_device<int, 1> a_const_copy = a;

a(0) = 10; // fine
a_const_copy(0) = 1; // won't compile, because a_const_copy(0) is const int&

const auto k_a = a.to_kernel(); // const gtensor_span<int, 1>
k_a(0) = -1; // allowed, gtensor_span has shallow const behavior

auto k_a_const_copy = a_const_copy.to_kernel(); // gtensor_span<const int, 1>
k_a_const_copy(0) = 10; // won't compile, type of LHS is const int&

Streams (experimental)

To facilitate interoperability with existing libraries and allow experimentation with some advanced multi-stream use cases, there are classes gt::stream and gt::stream_view. A gt::stream creates a new stream in the default device backend and destroys it when the object is destructed. A gt::stream_view is constructed from an existing native stream object in the default backend (e.g. a cudaStream_t for the CUDA backend; for the SYCL backend, the native stream object is a sycl::queue). Both can be passed as optional arguments to gt::launch and gt::assign, in which case the work executes asynchronously with respect to the default stream on the device. Note that the operator= form of assignment does not work with alternate streams - it always uses the default stream.

See also tests/test_stream.cxx. Note that this API is likely to change; in particular, the stream objects will become templated on space type.
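
A minimal sketch of how this might look (not from the repository; the gt::assign(lhs, rhs, stream) argument form and the synchronize() call are assumptions based on the description above):

#include <gtensor/gtensor.h>

void assign_on_stream(gt::gtensor_device<double, 1>& d_y,
                      const gt::gtensor_device<double, 1>& d_x)
{
    gt::stream s; // creates a new backend stream, destroyed with the object

    // Executes asynchronously with respect to the default stream.
    gt::assign(d_y, 2.0 * d_x, s);

    s.synchronize(); // assumed API: wait for work queued on this stream
}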

Library Wrapper Extensions

gt-blas

Provides wrappers around commonly used BLAS routines. Requires cuBLAS, rocblas, or oneMKL, depending on the GPU backend. The interface is mostly C-style, taking raw pointers, for easy interoperability with Fortran, with a few higher-level gtensor-specific helpers.

#include "gt-blas/blas.h"

void blas()
{
  gt::blas::handle_t h;
  gt::gtensor_device<double, 1> x = gt::arange<double>(1, 11);
  gt::gtensor_device<double, 1> y = gt::arange<double>(1, 11);
  gt::blas::axpy(h, 2.0, x, y);
  std::cout << "a*x+y = " << y << std::endl;
  /* a*x+y = { 3 6 9 12 15 18 21 24 27 30 } */
}

A naive banded LU solver implementation is also provided, useful in cases where the matrices are banded and the native GPU batched LU solve has not been optimized yet. Parallelism for this implementation is over the batch only.

gt-fft

Provides high level C++ style interface around cuFFT, rocFFT, and oneMKL DFT.

#include "gt-fft/fft.h"

void fft()
{
  // python: x = np.array([2., 3., -1., 4.])
  gt::gtensor_device<double, 1> x = {2, 3, -1, 4};
  auto y = gt::empty_device<gt::complex<double>>({3});

  // python: y = np.fft.fft(x)
  gt::fft::FFTPlanMany<gt::fft::Domain::REAL, double> plan({x.shape(0)}, 1);
  plan(x, y);
  std::cout << y << std::endl;
  /* { (8,0) (3,1) (-6,0) } */
}

gt-solver

Provides a high level C++ style interface around batched LU solve, in particular for the case where a single set of matrices is used repeatedly to solve with different right hand side vectors. It maintains its own contiguous copy of the factored matrices in device memory, and device buffers for staging input and output right hand side vectors. This allows the application to build the matrices on host and pass them off to the solver interface for all the device specific handling. On some platforms, there are performance issues with solving out of managed memory, and the internal device buffers can significantly improve performance on these platforms.

This should be preferred over directly calling gt-blas routines like getrf and getrs, when the use case matches (single factor and many solves with different right hand sides).

gtensor's People

Contributors

bd4, cmpfeil, germasch, gmerlo, td-mpcdf, uphoffc

gtensor's Issues

add CI tests

Ideally would build for no device, CUDA, HIP, and SYCL. I have a machine with both nVidia and AMD GPUs, can use that as a starting point. Too bad it doesn't have an iGPU... but it could test SYCL also by using hipSYCL+AMD, or intel SYCL + Intel CPU OpenCL, or any SYCL with HOST device.

add gt::exp

For real and complex gtensor expressions. Needed for GENE.

improve kernel launch error checking

Can add an error check in gtLaunchKernel macro, for HIP and CUDA backends. SYCL will raise an async exception on launch failure, which should already be handled by the global error handler.

add gt::transform_reduce

This should cover most reduction use cases. It can wrap thrust::transform_reduce for cuda and hip backends. For sycl backend, this could possibly be implemented on top of the reduction support in 2020 spec, or more simply using oneDPL for Intel SYCL.

In particular this could be used for efficient sum of squares implementation, needed by @td-mpcdf
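
As a sketch of the suggested approach (this is plain thrust, not an existing gtensor API), a sum of squares on a device vector can be written directly with thrust::transform_reduce; a gt::transform_reduce for the CUDA/HIP backends could delegate to essentially this:

#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>

// Unary functor applied to each element before the reduction.
struct square
{
  __host__ __device__ double operator()(double v) const { return v * v; }
};

double sum_of_squares(const thrust::device_vector<double>& x)
{
  // transform_reduce(first, last, unary_op, init, binary_op)
  return thrust::transform_reduce(x.begin(), x.end(), square(), 0.0,
                                  thrust::plus<double>());
}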

gt-blas can't handle non-contiguous batch data in all backends

oneMKL SYCL backend doesn't support this at all, and the HIP (rocblas+rocsolver) backend appears to have issues with it as well, even though the interface suggests it should work.

Options here:
(1) change our interface to only support contiguous, by taking a single pointer for A/B data, and allocate / populate the pointer array when needed to adapt to the underlying API.
(2) keep current interface, and emulate support in oneMKL by copying data if it's non-contiguous

The performance hit of (2) is likely prohibitive, but it could be a stopgap if support will be added later. Need to look more closely at the available (non-oneAPI) MKL APIs, for a sense of what might be supported in the future.

gpu unit test failure

Not sure if this is unique to me; it looks like it should be reproducible as long as bounds checking is enabled (i.e., build type Debug).

[kai@macbook build (pr/functions-norm)]$ tests/test_launch
Running main() from _deps/googletest-src/googletest/src/gtest_main.cc
[==========] Running 7 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 7 tests from gtensor
[ RUN      ] gtensor.launch_1d
[       OK ] gtensor.launch_1d (1 ms)
[ RUN      ] gtensor.launch_1d_templated
[       OK ] gtensor.launch_1d_templated (0 ms)
[ RUN      ] gtensor.launch_view_reverse_1d
[       OK ] gtensor.launch_view_reverse_1d (0 ms)
[ RUN      ] gtensor.device_launch_1d
[       OK ] gtensor.device_launch_1d (1822 ms)
[ RUN      ] gtensor.device_launch_view_reverse_1d
out-of-bounds error: dim = 0, arg = 2, shape = 2
../include/gtensor/strides.h:102: bounds_check: block: [0,0,0], thread: [2,0,0] Assertion `0` failed.
gpuCheck: 710 (device-side assert triggered) ../include/gtensor/gtensor.h 401
Abort trap: 6

add reductions

This is the interface used by xtensor:
https://xtensor.readthedocs.io/en/latest/operator.html#reducers

The trivial cases to add are 1d sum/min/max, operating over the entire array (not axis aware). This can be done using thrust for CUDA/HIP, and the built-in reduction support for SYCL 2020. One issue here is that thrust can be used even when it is not the gtensor backing store - so perhaps GTENSOR_USE_THRUST should be renamed to something like GTENSOR_USE_THRUST_VECTOR. I don't see much benefit in avoiding thrust for reductions - we could fall back to CUB/rocPRIM, but rocPRIM still requires a separate install for ROCm, and Thrust/CUB are always available for CUDA.

causing bugs: gtensor_span may not be contiguous

I've seen this in two places now:

template <typename T, size_type N, typename S>
inline void gtensor_span<T, N, S>::fill(const value_type v)
{
  if (v == T(0)) {
    auto data = gt::backend::raw_pointer_cast(this->data());
    backend::standard::memset<S>(data, 0, sizeof(T) * this->size());
  } else {
    assign(*this, scalar(v));
  }
}

and also in the various copy overloads between device and spans. This makes an underlying assumption that the data kept in the span is contiguous, but that's only the case if the strides have been calculated from a shape, as gt::adapt does. Fundamentally, however, a gtensor_span can have arbitrary strides (e.g., we could and should do that for a nicer Fortran interface, but there are other reasons one might want to do this.)
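
One way to make such code safe is to guard the raw-memory fast path on an explicit contiguity check, i.e. verify that the span's strides are exactly the column-major strides implied by its shape. A sketch of that check against plain arrays (not a specific gtensor API):

#include <array>
#include <cstddef>

// True if 'strides' are exactly the column-major strides implied by 'shape',
// i.e. the data is contiguous and memset / flat copy shortcuts are safe.
template <std::size_t N>
bool is_contiguous_column_major(const std::array<int, N>& shape,
                                const std::array<int, N>& strides)
{
  int expected = 1;
  for (std::size_t d = 0; d < N; d++) {
    if (strides[d] != expected) {
      return false;
    }
    expected *= shape[d];
  }
  return true;
}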

custom launch and assign kernel names

When profiling and debugging, tracking down which version of a gtensor kernel is the one of interest can be challenging. For SYCL, kernel names are passed as template parameters with templated type names - another sub-type of the templated name could be optionally supplied by the user, via a gt::assign<typename>(lhs, rhs) helper function. This is ugly, but a starting point for discussion. Not exactly sure how to do this for HIP / CUDA yet.

can't assign scalar to gtensor

gt::gtensor<double, 1, gt::space::device> d_axpy{gt::shape(100)};
d_axpy = 1.0;

fails to compile in clang++10:

/opt/isycl/bin/clang++  -DGTENSOR_HAVE_DEVICE -DNDEBUG -D__SYCL__ -I/home/bda/fusion/gtensor/examples/../../syclthrust/include -I/home/bda/fusion/gtensor/examples/../include  -O3 -DNDEBUG   -fsycl -fsycl-unnamed-lambda -o CMakeFiles/daxpy.dir/src/daxpy.cxx.o -c /home/bda/fusion/gtensor/examples/src/daxpy.cxx
/home/bda/fusion/gtensor/examples/src/daxpy.cxx:146:12: error: no viable overloaded '='
    d_axpy = 1.0;
    ~~~~~~ ^ ~~~
/home/bda/fusion/gtensor/examples/../include/gtensor/gcontainer.h:42:6: note: candidate template ignored: could not match 'expression<type-parameter-0-0>' against 'double'
  D& operator=(const expression<E>& e);
     ^
/home/bda/fusion/gtensor/examples/../include/gtensor/gcontainer.h:20:7: note: candidate function (the implicit copy assignment operator) not viable: no known conversion from 'double' to 'const gt::gcontainer<gt::gtensor<double, 1, gt::space::device> >' for 1st argument
class gcontainer : public gstrided<D>
      ^
/home/bda/fusion/gtensor/examples/../include/gtensor/gcontainer.h:20:7: note: candidate function (the implicit move assignment operator) not viable: no known conversion from 'double' to 'gt::gcontainer<gt::gtensor<double, 1, gt::space::device> >' for 1st argument
/home/bda/fusion/gtensor/examples/../include/gtensor/gtensor.h:46:7: note: candidate function (the implicit copy assignment operator) not viable: no known conversion from 'double' to 'const gt::gtensor<double, 1, gt::space::device>' for 1st argument
class gtensor : public gcontainer<gtensor<T, N, S>>
      ^

Note that this also fails with space::host.

Using d_axpy = gt::scalar(1.0) fails with a different error. Not sure if that makes any sense but it seems logical from an abstract interface point of view:

/home/bda/fusion/gtensor/examples/../include/gtensor/sarray.h:66:41: error: cannot convert 'gt::sarray<int, 0>' to 'int' without a conversion operator
sarray<T, N>::sarray(U... args) : data_{T(args)...}
                                        ^~~~~~
/home/bda/fusion/gtensor/examples/../include/gtensor/gcontainer.h:60:10: note: in instantiation of function template specialization 'gt::sarray<int, 1>::sarray<gt::sarray<int, 0> , 0>' requested here
  resize(e.derived().shape());
         ^
/home/bda/fusion/gtensor/examples/src/daxpy.cxx:146:12: note: in instantiation of function template specialization 'gt::gcontainer<gt::gtensor<double, 1, gt::space::device> >::operator=<gt::gscalar<double> >' requested here
    d_axpy = gt::scalar(1.0);
           ^
1 error generated.

Merging with xtensor?

Would it make sense to contribute this to xtensor itself, or is it better as a separate library as it is now?

I think @SylvainCorlay, @wolfv and others might be interested in your work.

fix kernel launch failures when grid dims are too large

gtensor currently uses a fixed block size for CUDA/HIP kernels, which fails for arrays that are too large in a certain dimension (but still easily small enough to fit in a single kernel launch with a block size adjustment). This could be handled by a generic flattening n-d implementation (using 1d indexing), similar to the SYCL backend n-d assign/launch implementations, or with more complex logic to determine block and grid sizes.
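
The flattening approach amounts to launching over the total number of elements and converting each 1d thread index back into an N-d index with divide and modulo. A sketch of that index mapping for column-major data (plain C++, not gtensor internals):

#include <array>

// Recover per-dimension indices from a linear (column-major) index.
template <int N>
std::array<int, N> unflatten_column_major(long linear,
                                          const std::array<int, N>& shape)
{
  std::array<int, N> idx{};
  for (int d = 0; d < N; d++) {
    idx[d] = static_cast<int>(linear % shape[d]);
    linear /= shape[d];
  }
  return idx;
}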

constness of span/to_kernel types vs container types

gtensor is designed to have const-safe containers, in that operator() returns a const reference if called on a const instance. This is much like std::vector. std::span, on the other hand, does not consider the underlying data part of its state (since the data is not copied when copying a span), and its operator[] returns a non-const reference even if the instance is const. gtensor, like its inspiration xtensor, uses the same semantics on view objects, even though it doesn't provide much extra protection, given that you can assign a view to a non-const variable and modify it that way. Unfortunately this causes problems for SYCL 2020 provisional, which requires a const kernel function.

More discussion here:
intel/llvm#2538

avoid unnecessary storage copies

Adding a debug print to the gtensor_storage copy ctor shows some surprises. In particular, assigning a temporary to an auto variable can result in a copy.

should gtensor_storage discard on resize?

std::vector preserves values on resize(), which may involve copies when reallocation is necessary. gtensor_storage does the same; however, I don't think its use in gtensor ever relies on values being preserved, so this copy is a waste of time (albeit probably not a significant one). Since the semantics of this container already differ from std::vector, in particular the lack of value initialization, a case could be made that it should also not try to preserve values on resize.

cmake warning

I'm getting this warning:

[cmake] CMake Warning (dev) at build-spack-ubuntu2/_deps/gtensor-src/CMakeLists.txt:230:
[cmake]   Syntax Warning in cmake code at column 63
[cmake] 
[cmake]   Argument not separated from preceding token by whitespace.
[cmake] This warning is for project developers.  Use -Wno-dev to suppress it.

I think it has to do with how things are quoted, but I don't really know what the issue is, and it's in sycl-specific code, so I can't easily test what's going on.

cmake: replace HIP and SYCL lib hacks

Currently the main CMakeLists.txt file lists the libraries needed for blas and fft for both HIP and SYCL. This is just a quick hack to get things working until upstream cmake is improved. In particular, the HIP config breaks mixed-language builds (e.g. Fortran + C); once the following is merged, the official cmake support for HIP should work again:
ROCm/HIP#2190

Also need to explore oneAPI cmake support further.

add GTENSOR_ENABLE_ASSERT

Since NDEBUG affects other code, we should have a define that is specific to gtensor for disabling asserts. This is important especially for HIP and SYCL, which don't allow assert in device code.

flatten can result in a storage copy

The flatten implementation attempts to return the original object if it's already 1d or 0d, but this often results in a copy. Using decltype(auto) everywhere can fix it (both for the return value and for the variable it's assigned to), but this is fragile, i.e. using auto for the variable instead results in a copy. This breaks the flatten usage in the SYCL N-d assign (it can be worked around by calling to_kernel before flatten, since copying the span returned by to_kernel doesn't copy storage).
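
The auto vs decltype(auto) difference described above, in isolation (plain C++, not gtensor code):

#include <vector>

std::vector<double> storage(100);

std::vector<double>& get_ref() { return storage; }

void example()
{
  auto a = get_ref();           // deduces std::vector<double>: makes a copy
  decltype(auto) b = get_ref(); // deduces std::vector<double>&: no copy
  (void)a;
  (void)b;
}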

gt-blas: sycl / mkl oneAPI blas uses 64bit index type (ILP64 only)

MKL oneAPI doesn't currently support LP64. This shows up in the getrf/getrs calls for the pivot array. Since this is allocated on device, there is no easy way around it, especially regarding the Fortran interface for GENE. Some options:

  1. Always take 32bit args, for SYCL copy to a / from a 64bit array internally, so the user never sees the difference
  2. Define a gpublas_allocate_pivot_array which allocates correct size, and require users to use this

My inclination is to do (1) for now, and if it becomes a performance bottleneck we can revisit. If it's a big batch, the computation should dominate anyway, so the extra pivot array copies shouldn't be too horrible. Also, LP64 may be supported better in oneAPI in the future; then we can switch and they will all be consistent.

Create generic launch and assign kernels

The CUDA/HIP implementations are fragile in that there may be array sizes that overflow certain limits. Explore using linear launch indexing and mapping back to expression indices. This requires integer divide and modulo, which may hurt performance, but depending on computational intensity it may not, and it could simplify the launch routines a lot. Ideally we would have one generic launch for any number of dimensions.

The WIP sycl implementation currently only goes up to 3 dims using range instead of nd_range, and similar challenges apply.

ci: rocm build is slow...

So I guess I'm not quite sure why the builds with rocm are so slow in the first place...

But since the initialization / rocm install only takes on the order of 1 minute, it seems like it may be worth it to split this up in some fashion so that the various builds can run in parallel in separate containers (if possible without too much hassle).

add lower bound to view or as an adapt option

Suggested by @td-mpcdf. My understanding is that this would just be an offset, so if lbound=9, the range [10,20] would actually be [1,11] on the physical array. Basically lbound is subtracted from all indexes. Is this the desired behavior? @germasch do you think this makes sense for gtensor?

This could be its own view type, or a new view parameter. Or something specific to gt::adapt and gt::adapt_device, which is where it would be used in GENE, but this seems like an odd place to put the functionality when looking at gtensor as a whole.
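
To make the requested behavior concrete, here is a hypothetical 1d wrapper (not a gtensor API) showing the index arithmetic: the logical index has lbound subtracted before touching the physical, zero-based array.

// Hypothetical illustration only: with lbound = 9, logical indices 10..20
// map to physical elements 1..11.
struct lbound_view_1d
{
  double* data;
  int lbound;

  double& operator()(int i) const { return data[i - lbound]; }
};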

fix gt::copy for const data type spans and containers

If the source has a const data type, no signature will match and compile will fail, e.g.

error: no instance of overloaded function "gt::copy" matches the argument list
            argument types are: (gt::gtensor_span<const real_t, 2UL, gt::space::thrust_host>, gt::gtensor<real_t, 2UL, gt::space::device>)`

there is no reason for this to not work.

benchmarks fail to build with SYCL backend

Two issues - need a native SYCL implementation of ij_deriv, and need to fix N-d assign when RHS is a scalar.

The issue is that the N-d implementation uses flatten, which calls reshape(std::forward<E>(e), shape(e.size()));, and the e here is gscalar which has no size method.

add gt::backend::managed_storage type alias

Some of the gt-blas calls require temporary space, particularly in the SYCL backend. Having a managed storage helper class would be useful for implementing this in a nice RAII way.

make reductions available by default

Currently, one needs to explicitly #include <gtensor/reductions.h>. Unless there's a good reason (say, compile times become even slower), it'd be nice to have them available by default, i.e., included from <gtensor/gtensor.h>.

performance: avoid thrust eval in complex expression

There is a performance regression in GENE related to the RHS calculation, which does a complex expression eval involving views. The first step is to develop a test case which reproduces the issue, then fix it and update gtensor in GENE.

Alternative device_vector backend not requiring thrust

This is already needed for SYCL port, and should be easy to generalize to support CUDA and HIP as well. One advantage of this is avoiding the zero fill initialization in thrust, which can really hurt performance when allocating temporary arrays.

update for ROCm 3.5.0

Both thrust and no thrust builds fail:

In file included from /home/bda/fusion/gtensor/tests/test_gtensor.cxx:4:
In file included from /home/bda/fusion/gtensor/include/gtensor/gtensor.h:8:
In file included from /home/bda/fusion/gtensor/include/gtensor/gfunction.h:6:
/home/bda/fusion/gtensor/include/gtensor/expression.h:41:28: error: __host__ function 'derived' cannot overload __host__ __device__ function 'derived'
inline auto expression<D>::derived() const& -> const derived_type&
                           ^
/home/bda/fusion/gtensor/include/gtensor/expression.h:35:33: note: previous declaration is here
  GT_INLINE const derived_type& derived() const&;
                                ^
/home/bda/fusion/gtensor/include/gtensor/expression.h:47:28: error: __host__ function 'derived' cannot overload __host__ __device__ function 'derived'
inline auto expression<D>::derived() & -> derived_type&
                           ^
/home/bda/fusion/gtensor/include/gtensor/expression.h:36:27: note: previous declaration is here
  GT_INLINE derived_type& derived() &;
                          ^
In file included from /home/bda/fusion/gtensor/tests/test_gtensor.cxx:4:
In file included from /home/bda/fusion/gtensor/include/gtensor/gtensor.h:8:
In file included from /home/bda/fusion/gtensor/include/gtensor/gfunction.h:8:
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:92:25: error: __host__ function 'shape' cannot overload __host__ __device__ function 'shape'
inline int gstrided<D>::shape(int i) const
                        ^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:53:17: note: previous declaration is here
  GT_INLINE int shape(int i) const;
                ^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:98:26: error: __host__ function 'shape' cannot overload __host__ __device__ function 'shape'
inline auto gstrided<D>::shape() const -> const shape_type&
                         ^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:54:31: note: previous declaration is here
  GT_INLINE const shape_type& shape() const;
                              ^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:104:26: error: __host__ function 'strides' cannot overload __host__ __device__ function 'strides'
inline auto gstrided<D>::strides() const -> const strides_type&
                         ^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:55:33: note: previous declaration is here
  GT_INLINE const strides_type& strides() const;
                                ^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:110:31: error: __host__ function 'size' cannot overload __host__ __device__ function 'size'
inline size_type gstrided<D>::size() const
                              ^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:56:23: note: previous declaration is here
  GT_INLINE size_type size() const;
                      ^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:117:26: error: __host__ function 'operator()' cannot overload __host__ __device__ function 'operator()'
inline auto gstrided<D>::operator()(Args&&... args) const -> const_reference
                         ^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:59:29: note: previous declaration is here
  GT_INLINE const_reference operator()(Args&&... args) const;
                            ^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:124:26: error: __host__ function 'operator()' cannot overload __host__ __device__ function 'operator()'
inline auto gstrided<D>::operator()(Args&&... args) -> reference
                         ^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:61:23: note: previous declaration is here
  GT_INLINE reference operator()(Args&&... args);
                      ^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:151:26: error: __host__ function 'data_access' cannot overload __host__ __device__ function 'data_access'
inline auto gstrided<D>::data_access(size_type i) const -> const_reference
                         ^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:70:29: note: previous declaration is here
  GT_INLINE const_reference data_access(size_type i) const;

gt-blas prints wrong error messages

The gt-blas implementation (possibly gt-fft also?) does lazy error handling and re-uses gtGpuCheck, which handles success vs failure OK but prints extremely confusing and misleading error messages. One way to deal with this is to add an overload of gtGpuCheck that takes the appropriate library error type and prints the library error message, or at the very least prints the numeric code rather than a misleading message.

initialize complex arrays with initializer lists

Initializing gt::gtensor<std::complex<double>, 1> = { 1., 2+3.i } works (at least sometimes), but the same with gt::complex usually does not, not only because 1.i is std::complex but more significantly because thrust::complex's constructors aren't constexpr. This makes, at the very least, for a bunch of ugly tests.

I think it's probably possible to get this to work with a special overload that initializes gt::gtensor<gt::complex<T>> with an initializer list of std::complex<T>. It'd be even nicer to be able to just use std::complex, but pre-C++20 a lot of it is not constexpr, i.e., not __host__ __device__.

cost of param limit size workaround, and optimizing view size

SYCL has a 2k parameter size limit. CUDA typically has 4k, although it may depend on device. Complex RHS expressions can exceed this limit pretty easily, particularly on SYCL, and this is in fact the case for one of the GENE kernels. The workaround I added was to copy the RHS expression gtensor_span object to device memory and pass only the pointer as a parameter. This works, but it can slow things down, particularly for small kernels. While we want to avoid small kernels in general, it's still not great.

One way to improve this, suggested by @uphoffc, is to optimize gview and gtensor_storage. At least for the case of a gview around a gtensor, all that needs to be stored is a pointer to the underlying data and strides. The offset can be added into the pointer, and multiple layers of strides are not needed (the underlying strides just amount to column major and can be assumed). I am not sure how this will work for more complex cases, however, such as composing swapaxes and a slice view.

We can also use the copy-to-device-memory, pass-a-pointer hack only when necessary, using constexpr if (for the SYCL backend, which already requires C++17) or with more fancy TMP hackery. Note that nvcc didn't support C++17 until CUDA 11, so we could consider bumping the gtensor requirement at the cost of losing support for CUDA < 11.

There also seems to be redundancy in the gtensor_span object, which extends from gstrided and contains shape and strides, plus the storage object's size and capacity. Not clear to me whether this is worth optimizing though.

I am also unsure of the effect of these optimizations - the low hanging fruit here is simply constexpr if to only apply the workaround when the parameters are really too big. Most small kernels will have a small RHS and won't require the workaround; kernels with a RHS that exceeds the limit will likely be long running and the extra memcpy is likely to have less impact. I think the first step here is to apply the constexpr if for the SYCL backend, which has the smallest param limit and already requires C++17. Then we can explore further more complicated changes over time. Second step is to analyze the GENE kernels to see what view combos they are using, to get a sense for how much of an effect gview size optimization would have, e.g. would special casing gview of gtensor to avoid double storing strides make a significant difference here.

linear and index array accessors

For the SYCL assign/launch and sum_axis_to reduction routine, linear indexing based on expression shape is used to create N dimensional implementations. This is done using the index_expression helper function (currently only works up to 6d), which is hacky. This could also be generally useful elsewhere in gtensor to generalize functions beyond six dimension, and to create specialized routines in application code.

A related operation is access based on an index object, rather than operator() style which takes separate arguments. gview.h implements some of these for internal use, for arbitrary dimension. This could be refactored and moved so, for example, all containers support both linear access and sarray index access.

We can look to xtensor for what interface we might want to provide for this:
https://xtensor.readthedocs.io/en/latest/indices.html
https://xtensor.readthedocs.io/en/latest/quickref/basic.html#element-access
https://xtensor.readthedocs.io/en/latest/view.html#flatten-views
https://xtensor.readthedocs.io/en/latest/quickref/basic.html#data-buffer

So one possibility is operator[] for index array style access (a[{0, 7}] == a(0, 7)), and forcing the use of a.data() to get linear indexing. However a.data() only works on container types, not general expressions, so it is less powerful for implementing n-dimensional routines. Currently gcontainer provides the data_access method for this, but again it's limited to container types and container spans only.

Also note that gtensor currently assumes column major (Fortran style) indexing. If we add this functionality, it is another place that will need to be updated if row major and other layouts are added.

New per device type spaces

Device vendor dependent code should be refactored so that ifdefs only occur in one place (perhaps space.h). One idea is to have gt::space::cuda, gt::space::hip, etc., and have gt::space::device be a type alias for one of these; the device specific header can be included in the same place.
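
A sketch of the proposed layout (not current gtensor code), with the vendor ifdef confined to a single header:

// Per-vendor space tags always exist; gt::space::device is a single alias
// selected in one place, so the rest of the code base needs no vendor ifdefs.
namespace gt {
namespace space {

struct host {};
struct cuda {};
struct hip {};
struct sycl {};

#if defined(GTENSOR_DEVICE_CUDA)
using device = cuda;
#elif defined(GTENSOR_DEVICE_HIP)
using device = hip;
#elif defined(GTENSOR_DEVICE_SYCL)
using device = sycl;
#else
using device = host;
#endif

} // namespace space
} // namespace gt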

Using streams with gtensor

It seems to be necessary to somehow make it possible to use (CUDA) streams with gtensor. In the launch of an explicit kernel this could be another argument; for the assignment kernels, we should add a function set_stream(streamID) and launch the assignment kernel in that stream. A set_default_stream should set it back.
