
Triad GPU Bandwidth

Test achievable triad kernel bandwidth using all available CUDA allocation/transfer methods (a rough sketch of the corresponding calls follows this list):

  • pageable allocation and explicit transfer: cudaMalloc / cudaMemcpy
  • pinned allocation and explicit transfer: cudaHostAlloc / cudaMemcpy
  • pinned allocation with mapping: cudaHostAlloc(..., cudaHostAllocMapped)
  • unified memory: cudaMallocManaged
  • unified memory with AccessedBy hint: cudaMallocManaged + cudaMemAdvise(..., cudaMemAdviseAccessedBy)
  • unified memory with prefetching: cudaMallocManaged + cudaMemPrefetchAsync
  • system allocator: new
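
For reference, the variants roughly correspond to the CUDA runtime calls below. This is only an illustrative sketch (buffer names, sizes, and the device index 0 are assumptions), not the benchmark's actual allocation code:

#include <cuda_runtime.h>

void allocate_variants(size_t n) {
  const size_t bytes = n * sizeof(double);

  // pageable allocation: plain host memory plus a separate device buffer
  double *h_pageable = (double *)malloc(bytes);
  double *d_buf = nullptr;
  cudaMalloc(&d_buf, bytes);

  // pinned allocation: page-locked host memory, still explicitly copied
  double *h_pinned = nullptr;
  cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault);

  // pinned + mapped (zero-copy): the device reads/writes host memory directly
  double *h_mapped = nullptr, *d_mapped = nullptr;
  cudaHostAlloc(&h_mapped, bytes, cudaHostAllocMapped);
  cudaHostGetDevicePointer(&d_mapped, h_mapped, 0);

  // unified memory, optionally with an AccessedBy hint and/or a prefetch to device 0
  double *um = nullptr;
  cudaMallocManaged(&um, bytes);
  cudaMemAdvise(um, bytes, cudaMemAdviseSetAccessedBy, 0 /* device */);
  cudaMemPrefetchAsync(um, bytes, 0 /* device */);

  // system allocator: plain new / malloc, usable by the GPU only on systems
  // that support it (the child-process test described below checks for this)
  double *sys = new double[n];

  // cleanup omitted for brevity
}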

The basic operation is a host-to-device transfer, followed by the triad kernel, followed by a device-to-host transfer.
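
For the explicit-transfer variants, that sequence looks roughly like the sketch below. The kernel, launch configuration, and buffer names are illustrative assumptions, not the repo's actual code:

// triad: a = b + scalar * c
__global__ void triad(double *a, const double *b, const double *c,
                      double scalar, size_t n) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) a[i] = b[i] + scalar * c[i];
}

// host-to-device transfers, then the kernel, then a device-to-host transfer
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_c, h_c, bytes, cudaMemcpyHostToDevice);
triad<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, scalar, n);
cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);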

The output is a CSV file with columns for the transfer time (zero for implicit transfers), the kernel time, and the total time.

  • total time: the time from the start of the first transfer to the end of the last transfer.
  • transfer time: the combined time for the host-to-device and the device-to-host transfers.
  • kernel time: the time from the start of the kernel execution to the end of the kernel execution.

transfer + kernel may not equal total, though they should be close.
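
One plausible way to obtain these three numbers is with CUDA events around each phase; the sketch below is an assumed measurement scheme for illustration, not necessarily how the benchmark is instrumented:

cudaEvent_t start, afterH2D, afterKernel, stop;
cudaEventCreate(&start);
cudaEventCreate(&afterH2D);
cudaEventCreate(&afterKernel);
cudaEventCreate(&stop);

cudaEventRecord(start);
// host-to-device transfers (if any)
cudaEventRecord(afterH2D);
// triad kernel launch
cudaEventRecord(afterKernel);
// device-to-host transfer (if any)
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float h2dMs, kernelMs, d2hMs, totalMs;
cudaEventElapsedTime(&h2dMs, start, afterH2D);
cudaEventElapsedTime(&kernelMs, afterH2D, afterKernel);
cudaEventElapsedTime(&d2hMs, afterKernel, stop);
cudaEventElapsedTime(&totalMs, start, stop);
// transfer = h2dMs + d2hMs; kernel = kernelMs; total = totalMs
// gaps between phases are why transfer + kernel may differ slightly from total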

Examples

  • Run all benchmarks on GPU 0, sweeping the problem size as n = 1e5; n <= 2.5e8; n *= 1.3, and repeat each benchmark 5 times. Pin CPU accesses and allocations to NUMA node 0. Show output on the terminal and also write it to triad.csv.

numactl -p 0 ./triad | tee triad.csv

  • Run only the pinned benchmark

./triad --pinned

  • Run only n = 1e9

./triad -n 1e9

  • Run 3 iterations of each benchmark

./triad -i 3

  • Show all options

./triad -h

Automatic Testing for a Functional System Allocator

The benchmark forks a child process to test whether CUDA can use the system allocator. If CUDA cannot, the attempt causes a sticky error that permanently corrupts the CUDA context, so the test runs in a child process whose context is fully isolated and can be completely destroyed. The child process needs to be created before the parent does any CUDA activity at all.

The test is implemented in test_system_allocator.cu.
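
The sketch below only illustrates the fork-and-isolate pattern; the probe kernel and exit codes are assumptions, not the contents of test_system_allocator.cu:

#include <sys/wait.h>
#include <unistd.h>
#include <cuda_runtime.h>

__global__ void touch(int *p) { p[0] = 1; }   // probe: dereference system-allocated memory

// Must run before the parent does any CUDA activity, so a sticky error is
// confined to the child's CUDA context, which dies with the child.
bool system_allocator_works() {
  pid_t pid = fork();
  if (pid == 0) {                 // child: the only process that touches CUDA here
    int *p = new int[1];          // system allocator
    touch<<<1, 1>>>(p);
    cudaError_t err = cudaDeviceSynchronize();
    _exit(err == cudaSuccess ? 0 : 1);
  }
  int status = 0;
  waitpid(pid, &status, 0);
  return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}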

Getting stable benchmark results

It is important to do three things (a short sketch of the first two follows the list):

  1. Call cudaDeviceReset() before each benchmark. This ensures that any CUDA state is wiped between runs.
  2. Call cudaFree(0) after cudaDeviceReset(). This initializes the GPU, ensuring that we don't accidentally time any lazy initialization.
  3. Pin to a single NUMA region or CPU. This ensures that data copies always take a consistent route from CPU to GPU.
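
In code, the first two steps amount to a short per-benchmark preamble along these lines (error checking omitted):

// before each benchmark
cudaDeviceReset();   // wipe any CUDA state left over from the previous run
cudaFree(0);         // re-initialize the context now, so lazy init isn't timed later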

Building with a gcc that has std::regex

mkdir build && cd build
cmake ..

Building on Power9 with gcc 4.8.5

GCC 4.8.5 doesn't have a working std::regex (which cxxopts uses), so install a supported version of clang. GCC 4.8.5 cannot build libcxx, so we use a clang without libcxx to build a clang with libcxx. Depending on your installed CUDA version, you'll need a different version of clang.

CUDA   Clang   Installer
9.2    5.0.0   https://gist.github.com/cwpearson/c5521dfc50175b1d977643b2fc5a2bb1
10.1   5.0.0   https://gist.github.com/cwpearson/c13ac7c25bde8c8644300e211faf4e78

Add that clang to your PATH, and have CMake use it for the build:

mkdir build && cd build
cmake .. -DCMAKE_TOOLCHAIN_FILE=`readlink -f ../toolchains/clang.toolchain`

The CUDA documentation claims that clang 8.0.0 is supported for CUDA 10.1, but if you actually try it, nvcc reports that it requires clang >= 3.2 and clang < 8. Clang 7.1.0 fails on CUDA 10.0.0 with errors about __fp16.

Issues

Confusing input/output copies

Hi Carl,

I think the input and output copies for the benchmark are somewhat mixed up. Since the kernel computes a = b + scalar * c, the vectors b and c should be copied from host to device and the vector a should be copied back from device to host. See here and here.

Most importantly from the performance point of view, you're making one unnecessary copy from host to device, which does not happen in the zero-copy and unified memory variants, so the comparison with pageable and pinned memory is not completely fair.

Jakub
