
sycl-bench's Introduction

SYCL-Bench

SYCL Benchmark Suite, work in progress

Benchmarks support the following command line arguments; a combined example is shown after the list:

  • --size=<problem-size> - total problem size. For most benchmarks, global range of work items. Default: 3072
  • --local=<local-size> - local size/work group size, if applicable. Not all benchmarks use this. Default: 256
  • --num-runs=<N> - the number of times that the problem should be run, e.g. for averaging runtimes. Default: 5
  • --device=<d> - changes the SYCL device selector that is used. Supported values: cpu, gpu, default. Default: default
  • --output=<output> - Specify where to store the output and how to format. If <output>=stdio, results are printed to standard output. For any other value, <output> is interpreted as a file where the output will be saved in csv format.
  • --verification-begin=<x,y,z> - Specify the start of the 3D range that should be used for verifying results. Note: Most benchmarks do not implement this feature. Default: 0,0,0
  • --verification-range=<x,y,z> - Specify the size of the 3D range that should be used for verifying results. Note: Most benchmarks do not implement this feature. Default: 1,1,1
  • --no-verification - disable verification entirely
  • --no-ndrange-kernels - do not run kernels based on ndrange parallel for
  • --warmup-run - run benchmarks once before evaluation to discard possible "warmup" times, e.g., JIT compilation
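
For example, a typical invocation combining several of these options (using the arith benchmark, whose executable is built under Usage below):

$ ./arith --size=1024 --num-runs=10 --device=gpu --warmup-run --output=results.csv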

Usage

Clone sycl-bench repo

$ git clone https://github.com/bcosenza/sycl-bench.git

Navigate into repo and create build folder

$ cd sycl-bench
$ mkdir build && cd build

Compile with CMake

$ cmake -DSYCL_IMPL=[target SYCL implementation] [other compiler arguments] ..
$ cmake --build .
$ sudo make install

Example compiling with CMake for DPC++:

$ cmake -DSYCL_IMPL=LLVM -DCMAKE_CXX_COMPILER=/path/to/llvm/build/bin/clang++ ..
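
A similar invocation for AdaptiveCpp/hipSYCL might look like the following sketch (assuming hipSYCL is among the accepted SYCL_IMPL values; HIPSYCL_TARGETS selects the AdaptiveCpp compilation targets):

$ cmake -DSYCL_IMPL=hipSYCL -DHIPSYCL_TARGETS=generic ..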

Each benchmark should now have an executable in the build folder. Run individual benchmarks like so:

$ ./arith --device=cpu --output=output.csv

Packaging

SYCL-Bench provides the CMake targets package (and package_source) for packaging a SYCL-Bench installation. Users can configure which generators are used by passing a semicolon-separated list of generators to the CPACK_GENERATOR CMake flag, and can then build the enabled packages by building the package target:

cmake -Bbuild -DCPACK_GENERATOR="TGZ;ZIP"
cmake --build build --target package

For more information, check the CPack documentation for the relevant CMake version.

Packages built via the package target will contain all files contained in a SYCL-Bench installation (binaries, scripts, benchmark inputs). Packages built via the package_source target will additionally contain the source files.

Attribution

If you use SYCL-Bench, please cite the following papers:

@inproceedings{SYCL-Bench:Euro-Par:2020,
  author    = {Lal, Sohan and Alpay, Aksel and Salzmann, Philip and Cosenza, Biagio and Hirsch, Alexander and Stawinoga, Nicolai and Thoman, Peter and Fahringer, Thomas and Heuveline, Vincent},
  title     = {{SYCL-Bench: A Versatile Cross-Platform Benchmark Suite for Heterogeneous Computing}},
  year      = {2020},
  publisher = {Springer International Publishing},
  booktitle = {Euro-Par 2020: 26th International European Conference on Parallel and Distributed Computing},
  series    = {Euro-Par ’20}
}

@inproceedings{SYCL-Bench:IWOCL:2020,
  author    = {Lal, Sohan and Alpay, Aksel and Salzmann, Philip and Cosenza, Biagio and Stawinoga, Nicolai and Thoman, Peter and Fahringer, Thomas and Heuveline, Vincent},
  title     = {{SYCL-Bench: A Versatile Single-Source Benchmark Suite for Heterogeneous Computing}},
  year      = {2020},
  isbn      = {9781450375313},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3388333.3388669},
  doi       = {10.1145/3388333.3388669},
  booktitle = {Proceedings of the International Workshop on OpenCL},
  articleno = {10},
  numpages  = {1},
  keywords  = {Heterogeneous Computing, SYCL Benchmarks \& Runtime},
  location  = {Munich, Germany},
  series    = {IWOCL ’20}
}

sycl-bench's People

Contributors

bcosenza, illuhad, jackakirk, luigi-crisci, naghasan, peterth, psalz, sami-hatna66, sohansharma, sommerlukas, victor-eds, whitneywhtsang


sycl-bench's Issues

blocked_transform is broken due to SYCL 2020 offset semantics

In SYCL 1.2.1, accessor::operator[] did not add the accessor offset. Users were instead expected to use parallel_for with offset to achieve correct indexing.

In SYCL 2020, accessor::operator[] does add the accessor offset. parallel_for offset is deprecated and should not be used.

The blocked_transform benchmark assumes the SYCL 1.2.1 semantics. Building it with a SYCL 2020 compiler causes the offset to be added twice, because the parallel_for is provided an offset, and additionally the accessor::operator[] now also adds the offset. So this benchmark is currently UB in any SYCL 2020 implementation.

We should change it to not use parallel_for with offset to make everything match again.
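
For illustration, a minimal sketch of the SYCL 2020-conforming pattern (kernel, function and variable names are illustrative; this is not the actual blocked_transform code):

#include <sycl/sycl.hpp>

// Fill only the sub-range [offset, offset + block) of a 1D buffer.
void fill_block(sycl::queue& q, sycl::buffer<int, 1>& buf, sycl::range<1> block, sycl::id<1> offset) {
  q.submit([&](sycl::handler& cgh) {
    // Ranged accessor covering only the block starting at `offset`.
    sycl::accessor acc{buf, cgh, block, offset, sycl::write_only};
    // SYCL 2020: acc[i] already applies `offset`, so the kernel indexes from
    // zero. Under SYCL 1.2.1 semantics the offset had to be passed to
    // parallel_for instead (an overload deprecated in SYCL 2020), with
    // operator[] applying no offset -- doing both applies the offset twice.
    cgh.parallel_for(block, [=](sycl::id<1> i) { acc[i] = 1; });
  });
}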

Runtime failure for the DGEMM application

I am working on an example that I am trying to integrate and execute using the sycl-bench suite. I am getting a runtime error when testing it with AdaptiveCpp, for both the CUDA and the HIP backend. When the same kernel is executed as a standalone AdaptiveCpp application, it runs without any error. I see similar behavior with one of my other applications; one similarity between the two is that both use an nd_range parallel_for.

The matrices are currently initialized as identity matrices. Please let me know if you have any idea about the error.

Application code:

#include "common.h"

#include <iostream>
#include <sycl/sycl.hpp>

using namespace sycl;

class MatMulBlocked
{
protected:
  BenchmarkArgs args;
  std::vector<double> initA;
  std::vector<double> initB;
  std::vector<double> initC;
  
  PrefetchedBuffer<double, 2> initA_buf;
  PrefetchedBuffer<double, 2> initB_buf;
  PrefetchedBuffer<double, 2> initC_buf;

  const size_t problem_size = 256;
  const size_t Ndim = 256;
  const size_t Mdim = 256;
  const size_t Pdim = 256;
  const size_t Bsize = 16;

public:
  MatMulBlocked(const BenchmarkArgs &_args) : args(_args) {}
  
  void setup() {
    // host memory initialization
    initA.resize(Ndim * Pdim);
    initB.resize(Pdim * Mdim);
    initC.resize(Ndim * Mdim);

    // Initialize matrix A to the identity
    for(size_t i = 0; i < Ndim; ++i) {
      for(size_t j = 0; j < Pdim; ++j) {
        initA[i * Pdim + j] = i == j;
      }
    }
    // Initialize matrix B to the identity
    for(size_t i = 0; i < Pdim; ++i) {
      for(size_t j = 0; j < Mdim; ++j) {
        initB[i * Mdim + j] = i == j;
      }
    }
    // Initialize matrix C to zero
    for(size_t i = 0; i < Ndim; ++i) {
      for(size_t j = 0; j < Mdim; ++j) {
        initC[i * Mdim + j] = 0;
      }
    }

    initA_buf.initialize(args.device_queue, initA.data(), range<2>(Ndim, Pdim));
    initB_buf.initialize(args.device_queue, initB.data(), range<2>(Pdim, Mdim));
    initC_buf.initialize(args.device_queue, initC.data(), range<2>(Ndim, Mdim));
  }

  void run(std::vector<event>& events) {
    events.push_back(args.device_queue.submit([&](handler& cgh) {
      auto in1 = initA_buf.template get_access<access::mode::read>(cgh);
      auto in2 = initB_buf.template get_access<access::mode::read>(cgh);
      auto out = initC_buf.template get_access<access::mode::read_write>(cgh);

      // Use local memory address space for local memory
      accessor<double, 2, access_mode::read_write, access::target::local> Awrk({Bsize, Bsize}, cgh);
      accessor<double, 2, access_mode::read_write, access::target::local> Bwrk({Bsize, Bsize}, cgh);

      cgh.parallel_for<class SYCL_Matmul_blocked_kernel>(
          nd_range<2>{{Ndim, Mdim}, {Bsize, Bsize}}, [=](nd_item<2> idx) {
            // This work-item will compute C(i,j)
            const size_t i = idx.get_global_id(0);
            const size_t j = idx.get_global_id(1);

            // Element C(i,j) is in block C(Iblk, Jblk)
            const size_t Iblk = idx.get_group(0);
            const size_t Jblk = idx.get_group(1);

            // C(i,j) is element C(iloc, jloc) of block C(Iblk, Jblk)
            const size_t iloc = idx.get_local_id(0);
            const size_t jloc = idx.get_local_id(1);

            // Number of blocks
            const size_t Nblk = Ndim / Bsize;
            const size_t Mblk = Mdim / Bsize;
            const size_t Pblk = Pdim / Bsize;

            for(size_t Kblk = 0; Kblk < Pblk; ++Kblk) {
              // Copy A and B into local memory
              Awrk[iloc][jloc] = in1[Iblk * Bsize + iloc][Kblk * Bsize + jloc];
              Bwrk[iloc][jloc] = in2[Kblk * Bsize + iloc][Jblk * Bsize + jloc];

              // Compute matmul for block
              for(size_t kloc = 0; kloc < Bsize; ++kloc) {
                out[i][j] += Awrk[iloc][kloc] * Bwrk[kloc][jloc];
              }
            }
          });
      args.device_queue.wait_and_throw();
    }));
  }

  bool verify(VerificationSetting &ver) {
    //Triggers writeback
    initC_buf.reset();
    bool pass = true;
 
    for (size_t i = 0; i < problem_size; ++i) {
        for (size_t j = 0; j < problem_size; ++j) {
            auto kernel_value = initC[i * Mdim + j];
            auto host_value = (i == j) ? 1.0 : 0.0;

            if (kernel_value != host_value) {
                pass = false;
                break;
            }
        }
    }    
    return pass;
  }
  
  static std::string getBenchmarkName() {
    std::stringstream name;
    if(kernel_type_thorin)
      name << "Thorin_DGEMM_MatMulBlocked_";
    else
      name << "DGEMM_MatMulBlocked_";    
    return name.str();
  }
};

int main(int argc, char** argv)
{
  BenchmarkApp app(argc, argv);
  app.run<MatMulBlocked>();
  return 0;
}

Error logs when offloading to CUDA device:


********** Results for DGEMM_MatMulBlocked_**********
problem-size: 3072
local-size: 256
device-name: NVIDIA GeForce RTX 2080 SUPER
sycl-implementation: hipSYCL
============== hipSYCL error report ============== 
hipSYCL has caught the following undhandled asynchronous errors: 

   0. from /home/jgupta/development/opensycl/OpenSYCL/OpenSYCL_juhi/src/runtime/cuda/cuda_event.cpp:63 @ wait(): cuda_node_event: cudaEventSynchronize() failed (error code = CUDA:700)
The application will now be terminated.
terminate called without an active exception
zsh: IOT instruction (core dumped)

Error logs when offloading to HIP device:


********** Results for DGEMM_MatMulBlocked_**********
problem-size: 3072
local-size: 256
device-name: AMD Radeon Pro VII
sycl-implementation: hipSYCL
Memory access fault by GPU node-1 (Agent handle: 0x56096774a7e0) on address 0x7ffe8126f000. Reason: Page not present or supervisor privilege.
zsh: IOT instruction (core dumped)

`emitResults` has invalid memory accesses when used with `--warmup-run` & the first run fails

`TimeMetricsProcessor::emitResults` leads to segfaults when the very first benchmark run fails (e.g., doesn't verify) and that run also happens to be the warmup run.
For example, the line `double median = resultsSeconds[resultsSeconds.size() / 2];` then indexes into an empty result vector.

This is triggered, e.g., by running `./nbody --size=1024 --local=256 --warmup-run` with AdaptiveCpp's `-DHIPSYCL_TARGETS=generic`, as hierarchical parallel_for isn't supported with this compilation flow yet.
In this case, specifying `--no-verification` or leaving out `--warmup-run` works around the issue.
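
A minimal defensive sketch of the kind of guard that would avoid the invalid access (function name made up; the actual TimeMetricsProcessor code may differ):

#include <algorithm>
#include <optional>
#include <vector>

// Returns the median runtime, or nothing if no successful run was recorded
// (e.g. when the only run so far was a failed warmup run).
std::optional<double> safeMedian(std::vector<double> resultsSeconds) {
  if(resultsSeconds.empty())
    return std::nullopt;
  std::sort(resultsSeconds.begin(), resultsSeconds.end());
  return resultsSeconds[resultsSeconds.size() / 2];
}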

Problem in compilation stage with ComputeCpp 2.0.0

Hi,

I tried to compile sycl-bench using ComputeCpp 2.0.0 and GCC 9.3.0 on Ubuntu, but the CMake tool returns an error: #include <CL/sycl.hpp> not found.

cmake command: cmake .. -DSYCL_IMPL=ComputeCpp -D COMPUTECPP_RUNTIME_LIBRARY=/opt/ComputeCPP/include -DComputeCpp_INFO_EXECUTABLE=/opt/ComputeCPP -D COMPUTECPP_RUNTIME_LIBRARY_DEBUG=/opt/ComputeCPP

What could I be doing wrong?

Rui

Questions about the single kernel set

[ 3%] Building CXX object CMakeFiles/sobel7.dir/single-kernel/sobel7.cpp.o
Unsupported kernel parameter type
UNREACHABLE executed at /home/cc/sycl_workspace/llvm/clang/lib/Sema/SemaSYCL.cpp:1039!

Decide on future of sycl2020 branch - make default branch or merge into main?

It seems that currently people have difficulty finding the SYCL 2020 benchmarks because they live in the sycl2020 branch which is not really mentioned anywhere.

We should consider

  • Either making sycl2020 the default branch
  • or just merging it into main -- after all, it contains important functionality of this benchmark suite.

Some thoughts on DRAM throughput benchmarking

As previously discussed, I think that the current implementation of the DRAM microbenchmark is not ideal, if our goal is to simply measure throughput. The benchmark is currently based on the paper "GPGPU Power Modeling for Multi-Domain Voltage-Frequency Scaling" (Guerreiro18), in which they used it to characterize the power consumption of GPUs when stressing device memory. However, the kernel itself doesn't just do a memory copy, but also includes a parameter N to introduce a certain level of arithmetic intensity. I'm not certain why they have that in there, but I would assume they have their reasons. For us however this means adding another dimension to the benchmark, which makes it harder to evaluate and compare results.

To further complicate things, the actual implementation of the DRAM benchmark used in Guerreiro18, which can be found on GitHub (here), appears to be completely different. In this version, the arithmetic operations are gone, and the kernel mostly consists of copy instructions. However, it still does some strange things I'm not entirely clear about.

Here's my SYCL implementation of this kernel:

cgh.parallel_for<DRAMKernel>(global_size, [=](s::id<1> gid) {
  for(size_t j = 0; j < COPY_ITERATIONS; j += UNROLL_ITERATIONS) {
    for(size_t i = 0; i < UNROLL_ITERATIONS; ++i) {
      DataT r0 = in[gid[0] + STRIDE * i];
      out[gid[0] + STRIDE * i] = r0;
    }
  }
});

As you can see, since the loop variable j is not used inside the loop body, this kernel repeatedly copies the same chunk of memory from one buffer into the other; by default, COPY_ITERATIONS is 1024. In each iteration, UNROLL_ITERATIONS (default 32) copies are made. To allow for memory coalescing, each work item accesses the buffer linearly based on its global id. Each subsequent copy is offset by a multiple of STRIDE. This parameter is sensitive to cache sizes: if chosen too small, caching can occur. In the implementation of Guerreiro18, they use an offset of 64 * 1024, and profiling reveals a 0% L2 hit rate on my RTX 2070. If I change this to 32 * 1024, however, I already get an 11% L2 hit rate. How large this stride has to be exactly depends on the number of work items, cache sizes and so on. All in all, this version of the benchmark essentially introduces three additional parameters, greatly increasing its complexity without a clear benefit.

Furthermore, in order to achieve the highest throughput with this benchmark, it weirdly needs to be compiled without any assembly-level optimizations (by passing -O0 to ptxas). This is of course hard to replicate with hipSYCL, and my SYCL implementation of Guerreiro18 thus only achieves around 300 GiB/s, i.e., 67% of the RTX 2070's 448 GiB/s theoretical maximum. This is a bit less than what the CUDA version achieves with optimizations (311 GiB/s), and substantially less than what it achieves when compiled without optimizations (325 GiB/s).

Long story short: instead of implementing Guerreiro18, I would suggest going with the simplest possible memcopy kernel, i.e.:

cgh.parallel_for<DRAMKernel>(global_size, [=](s::id<1> gid) {
  out[gid] =  in[gid];
});

With this, I'm able to get ~358 GiB/s, which is about 80% of the theoretical maximum, and probably quite close to what we can reasonably expect to achieve.

Note that to get good results, the buffer size needs to be large enough, i.e., at least several hundred megabytes. If we do a plain mapping of --size to buffer size, this means that the specified sizes need to be rather large. For example, I used --size=939524096, which at 4 bytes per element equals 3.5 GiB.

This is of course quite a large number, and a much larger size than many other benchmarks can support (e.g., if it is used as the range in both dimensions of a 2D buffer). One possible solution would be to always multiply the size by some fixed factor, e.g., 1024. However, I generally feel like the mapping of the --size parameter to the actual workload is too arbitrary already. It might be convenient for running lots of benchmarks in batch, but in reality I think we'll have to hand-tune these values (as well as any additional parameters a benchmark might have) for each individual platform anyway. I'm thus wondering whether having individual parameters for each benchmark would make more sense (e.g., a buffer-size= parameter for this one).

In the same vein, having the ability to compute custom metrics would also be great. For example, it would be cool if this benchmark could actually print its achieved throughput, instead of having to manually compute it.
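
For illustration, the computation such a metric would involve is small. A minimal sketch (function and parameter names are made up, not part of SYCL-Bench):

#include <cstddef>
#include <iostream>

// Effective throughput of a copy kernel: each element is read once and
// written once, so 2 * numElements * elementBytes bytes cross the memory bus.
void printCopyThroughput(std::size_t numElements, std::size_t elementBytes, double seconds) {
  const double bytes = 2.0 * numElements * elementBytes;
  const double gibPerSecond = bytes / seconds / (1024.0 * 1024.0 * 1024.0);
  std::cout << "throughput: " << gibPerSecond << " GiB/s\n";
}

For the numbers above, --size=939524096 with 4-byte elements and a 20 ms kernel runtime would report 350 GiB/s.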

What do you guys think?

More special treatment for implementations in CMakeLists.txt

We should:

  • Add -march=native when compiling with hipSYCL for CPU. Unlike ComputeCpp+OpenCL CPU backend, hipSYCL cannot do runtime compilation, so it's important to optimize when compiling. This may also be relevant for Intel SYCL and ComputeCpp if we want to also benchmark their host fallbacks.
  • Define TRISYCL preprocessor macro when compiling for triSYCL, so that we can add some workarounds in the code.

use of undeclared identifier 'device_selector'

In file included from /home/cc/sycl-bench/include/common.h:14:
/home/cc/sycl-bench/include/command_line.h:244:8: error: use of undeclared identifier 'device_selector'
if(device_selector != "gpu") {

Issue with --local command line parameter.

While executing the blocked_transform test case, which is present under runtime (https://github.com/bcosenza/sycl-bench/blob/master/runtime/blocked_transform.cpp), we noticed that we get a core dump.

Command used to execute - ./blocked_transform --device=gpu

Output -

********** Results for Runtime_BlockedTransform_iter_64_blocksize_0**********
problem-size: 3072
local-size: 1024
device-name: NVIDIA RTX A4000
sycl-implementation: LLVM CUDA (Codeplay)
blocked_transform: /tmp/llvm-sycl-nightly-20220222/sycl/source/detail/scheduler/commands.cpp:1826: void cl::sycl::detail::adjustNDRangePerKernel(cl::sycl::detail::NDRDescT&, cl::sycl::detail::pi::PiKernel, const cl::sycl::detail::device_impl&): Assertion `NDR.NumWorkGroups[0] != 0 && NDR.LocalSize[0] == 0' failed.
Aborted (core dumped)

However, when we explicitly set the --local parameter to 256 (which is the default value), it executes without any errors.

Command used to execute - ./blocked_transform --device=gpu --local=256

Is there a fix for this issue? If so, where can we get the revised code?

Fix the run-suite brommy.bmp not found issue

It used to work. But we don't know why. Now it doesn't and we know why. We should transform this to a state where it works and (ideally, but at a lower priority) we know why.

build procedure

The project really looks interesting, but I am unable to find a HOWTO on building it. Did I miss it somewhere?
