Generics

This library generalizes certain CUDA intrinsics to work on arbitrary data types. For example, NVIDIA GPUs of CUDA compute capability 3.5 and greater, such as the Tesla K20, support __ldg(), an intrinsic that loads through the read-only texture cache and can improve performance in some circumstances. This library allows __ldg() to work on arbitrary types, as detailed below. It also generalizes __shfl() to shuffle arbitrary types.

LDG

CUDA provides overloads of __ldg() for some built-in types:

char, short, int, long long, int2, int4, unsigned char, unsigned short, unsigned int, unsigned long long, uint2, uint4, float, double, float2, float4, double2.

However, these native overloads do not cover any other types, including user-defined types. To solve this problem, this library provides a template:

template<typename T> __device__ T __ldg(const T*);

This template allows data of any type to be loaded using __ldg. The only restriction on T is that it have a default constructor.
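One plausible way such a template can work, and a likely reason for the default-constructor requirement, is to default-construct a T and then fill it in piece by piece, with each piece fetched by a natively supported load. The host-side C++ sketch below illustrates only that idea. The names `load_word`, `generic_load`, and `Particle` are hypothetical, `load_word` stands in for a native 32-bit __ldg() overload, and the sketch additionally assumes T is trivially copyable; the library itself may pick word sizes and handle alignment differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for a native 32-bit __ldg() overload.
// On a device this would go through the read-only cache; here it is a plain copy.
inline std::uint32_t load_word(const unsigned char* p) {
    std::uint32_t w;
    std::memcpy(&w, p, sizeof(w));
    return w;
}

// Illustrative generic load (not the library's code): fetch T one 32-bit word
// at a time, then copy any trailing bytes individually.
template <typename T>
T generic_load(const T* src) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "this sketch assumes a trivially copyable T");
    const unsigned char* in = reinterpret_cast<const unsigned char*>(src);

    T result;  // default construction: why T needs a default constructor
    unsigned char* out = reinterpret_cast<unsigned char*>(&result);

    std::size_t i = 0;
    for (; i + sizeof(std::uint32_t) <= sizeof(T); i += sizeof(std::uint32_t)) {
        const std::uint32_t w = load_word(in + i);
        std::memcpy(out + i, &w, sizeof(w));
    }
    for (; i < sizeof(T); ++i) {
        out[i] = in[i];  // tail bytes when sizeof(T) is not a multiple of 4
    }
    return result;
}

// Hypothetical user-defined type with no native __ldg() overload.
struct Particle {
    float x, y, z;
    int id;
};

// Usage: the loaded copy matches the original field by field.
inline void example_generic_load() {
    const Particle p{1.0f, 2.0f, 3.0f, 7};
    const Particle q = generic_load(&p);
    assert(q.x == 1.0f && q.y == 2.0f && q.z == 3.0f && q.id == 7);
}
```

Using wider words where alignment allows would cut the number of loads, which is a design choice this sketch leaves out for brevity.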

To use this library, simply #include <generics/ldg.h>.
The __ldg() overloads provided natively by CUDA will be used if T is natively supported. If not, the template will be used.

See ldg.cu for an example.

If you are compiling for a CUDA compute capability below 3.5, __ldg() will fall back to traditional loads.
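A fallback of this kind is commonly written as a compile-time switch on the `__CUDA_ARCH__` macro, which nvcc defines during device compilation (350 for compute capability 3.5). The sketch below shows that pattern under stated assumptions: `load_read_only` is a hypothetical name, the `__device__` qualifier is left off so the code also compiles as plain C++, and the library's actual mechanism may differ.

```cpp
#include <cassert>

// Illustrative wrapper (hypothetical name, not the library's API).
// In CUDA code this function would also be marked __device__.
template <typename T>
T load_read_only(const T* p) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 350)
    return __ldg(p);  // device pass, compute capability 3.5+: read-only cache load
#else
    return *p;        // host pass or older architecture: ordinary load
#endif
}

// Usage: compiled as plain C++, the ordinary-load branch is taken.
inline void example_fallback() {
    const int value = 42;
    assert(load_read_only(&value) == 42);
}
```

Because the choice is made by the preprocessor, the same source builds for every target architecture and the older targets simply never see the intrinsic.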

SHFL

For devices of compute capability 3.0 or above, CUDA provides a set of __shfl() intrinsics that share data between threads in a warp, without using any shared memory. CUDA provides overloads for int and float types. For all other types, this library provides a few templates:

template<typename T> __device__ T __shfl(const T& t, const int& i);
template<typename T> __device__ T __shfl_down(const T& t, const int& delta);
template<typename T> __device__ T __shfl_up(const T& t, const int& delta);
template<typename T> __device__ T __shfl_xor(const T& t, const int& mask);

This allows data of other types to be shuffled using the __shfl() mechanism. There are two restrictions on T:

  • sizeof(T) must be divisible by 4. The code will fail to compile if you instantiate it with a type that does not satisfy this requirement.
  • T must have a default constructor.
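Both restrictions are consistent with a word-by-word approach: the native shuffle moves 32-bit values, so a larger type can be split into 32-bit words, each word shuffled separately, and the result reassembled into a default-constructed T. The host-side C++ sketch below emulates a 32-lane warp to illustrate that idea for the shuffle-down case. The names `shfl_down_word`, `shfl_down_generic`, and `Pair` are hypothetical, `shfl_down_word` stands in for the native 32-bit __shfl_down(), and the sketch assumes a trivially copyable T and a non-negative delta; it is not the library's implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

constexpr int kWarpSize = 32;

// Host-side stand-in for the native 32-bit shuffle-down: lane i receives the
// word held by lane i + delta, or keeps its own word if that lane is out of range.
inline std::array<std::uint32_t, kWarpSize>
shfl_down_word(const std::array<std::uint32_t, kWarpSize>& in, int delta) {
    std::array<std::uint32_t, kWarpSize> out{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const int src = lane + delta;
        out[lane] = (src < kWarpSize) ? in[src] : in[lane];
    }
    return out;
}

// Illustrative generic shuffle: split each lane's T into 32-bit words,
// shuffle one word index at a time, and reassemble the results.
template <typename T>
std::array<T, kWarpSize> shfl_down_generic(const std::array<T, kWarpSize>& in, int delta) {
    static_assert(sizeof(T) % 4 == 0, "sizeof(T) must be divisible by 4");
    static_assert(std::is_trivially_copyable<T>::value,
                  "this sketch assumes a trivially copyable T");

    std::array<T, kWarpSize> out{};  // every lane starts from a default-constructed T
    for (std::size_t w = 0; w < sizeof(T) / 4; ++w) {
        std::array<std::uint32_t, kWarpSize> words{};
        for (int lane = 0; lane < kWarpSize; ++lane) {
            std::memcpy(&words[lane],
                        reinterpret_cast<const unsigned char*>(&in[lane]) + 4 * w, 4);
        }
        const std::array<std::uint32_t, kWarpSize> shuffled = shfl_down_word(words, delta);
        for (int lane = 0; lane < kWarpSize; ++lane) {
            std::memcpy(reinterpret_cast<unsigned char*>(&out[lane]) + 4 * w,
                        &shuffled[lane], 4);
        }
    }
    return out;
}

// Hypothetical 8-byte user type, for example a value paired with its index.
struct Pair {
    float value;
    int index;
};

// Usage: each lane receives the Pair held four lanes above it.
inline void example_shuffle() {
    std::array<Pair, kWarpSize> lanes{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        lanes[lane] = Pair{lane * 0.5f, lane};
    }
    const std::array<Pair, kWarpSize> shifted = shfl_down_generic(lanes, 4);
    assert(shifted[0].index == 4);
    assert(shifted[0].value == 2.0f);
    assert(shifted[31].index == 31);  // source lane out of range: own value kept
}
```

A type whose size is not a multiple of 4 would leave a partial word with nothing to shuffle it through, which is why the static_assert rejects it at compile time.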

To use this library, simply #include <generics/shfl.h>.
The __shfl() overloads provided natively by CUDA will be used if T is natively supported. If not, the template will be used.

See shfl.cu for an example.

generics's People

Contributors

bryancatanzaro

generics's Issues

Note: Performance of __ldg()

I tested your __ldg() implementation on a Titan X (Pascal) with CUDA 8.0 to handle vec2 items. I found that this slowed the entire kernel down by 2.5x compared to using __ldg() on the individual float components in my latency/memory-bound kernel.

Unusually, the profiler's PC sampling listed the stalls as memory dependency, but gave them the colour of a synchronisation stall.

I realise the repo has been untouched for 3 years and I am not expecting any updates. I am just leaving a note in case anyone else is going to use it blindly.
