
jg.techlearning.cuda

CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by Nvidia.

https://en.wikipedia.org/wiki/CUDA

  • Code written against the NVIDIA CUDA Runtime 10.2

  • Tested on an NVIDIA Quadro M2000M

Implemented & measured:

  1. Polynomial calculation on CPU & GPU (part of the code was provided by an academic teacher)
  2. Image gray scaling on CPU & GPU
  3. Matrix multiplication on CPU & GPU (measuring how loop unrolling performed by the NVCC compiler affects computation time)

The tasks completed here, together with the theoretical introduction to the laboratories, were prepared by Slawomir Wernikowski, IT engineer and academic teacher at the West Pomeranian University of Technology.

Polynomial calculation results

Test 1

JG::Starting program which uses GPU and CPU to calculate polynomial for <1000000> values

  • GPU no. of blocks: <16>
  • Duration GPU: <3> [ms]
  • Duration CPU: <2> [ms]

Test 2

JG::Starting program which uses GPU and CPU to calculate polynomial for <10000000> values

  • GPU no. of blocks: <16>
  • Duration GPU: <28> [ms]
  • Duration CPU: <31> [ms]

Test 2.1

JG::Starting program which uses GPU and CPU to calculate polynomial for <10000000> values

  • GPU no. of blocks: <1024>
  • Duration GPU: <17> [ms]
  • Duration CPU: <32> [ms]

Test 3

JG::Starting program which uses GPU and CPU to calculate polynomial for <100000000> values

  • GPU no. of blocks: <16>
  • Duration GPU: <290> [ms]
  • Duration CPU: <348> [ms]

Test 4

JG::Starting program which uses GPU and CPU to calculate polynomial for <100000000> values

  • GPU no. of blocks: <1024>
  • Duration GPU: <189> [ms]
  • Duration CPU: <362> [ms]

Conclusions:

  • When the vector of values is larger than 10^6 elements, the GPU calculation is clearly faster than the CPU (up to roughly twice as fast)
  • When the available blocks (and the threads within them) are fully utilized, results come back quickly; reducing the number of threads per block extends the execution time (a minimal kernel sketch follows below)
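
For reference, a minimal sketch of what such a polynomial kernel could look like. The kernel name, the coefficient layout (coeffs[k] multiplies x^k), and the grid-stride loop are assumptions for illustration, not the repository's exact code; it evaluates the polynomial with Horner's scheme for each element of the input vector.

    // Hedged sketch: evaluate a polynomial of the given degree at every x[i]
    // using Horner's scheme; the grid-stride loop makes any block count work.
    __global__ void polynomialKernel(const float* coeffs, int degree,
                                     const float* x, float* y, int n)
    {
        for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
             idx < n;
             idx += gridDim.x * blockDim.x)
        {
            float acc = coeffs[degree];
            for (int k = degree - 1; k >= 0; k--)
            {
                acc = acc * x[idx] + coeffs[k];   // Horner step
            }
            y[idx] = acc;
        }
    }

    // Example launch matching the block counts measured above:
    // polynomialKernel<<<16, 1024>>>(dev_coeffs, degree, dev_x, dev_y, n);
    // polynomialKernel<<<1024, 1024>>>(dev_coeffs, degree, dev_x, dev_y, n);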

Image gray scaling on CPU & GPU: output files & measurements

Sample images: Input 1 / Output 1, Input 2 / Output 2


Results

CPU: Intel64 Family 6 Model 94 Stepping 3

GPU: NVIDIA Quadro M2000M

TEST1

  • Width of an image: 640
  • Height of an image: 480
  • Resolution (total number of pixels): 307200
  • Number of colors: 256
  • GPU duration: 2370833 [ns]
  • CPU duration: 69864629 [ns]

TEST2

  • Width of an image: 640
  • Height of an image: 480
  • Resolution (total number of pixels): 307200
  • Number of colors: 256
  • GPU duration: 2420297 [ns]
  • CPU duration: 68725466 [ns]

TEST3

  • Width of an image: 1419
  • Height of an image: 1001
  • Resolution (total number of pixels): 1420419
  • Number of colors: 16777216
  • GPU duration: 11482626 [ns]
  • CPU duration: 346865093 [ns]

TEST4

  • Width of an image: 1419
  • Height of an image: 1001
  • Resolution (total number of pixels): 1420419
  • Number of colors: 16777216
  • GPU duration: 11523404 [ns]
  • CPU duration: 343563144 [ns]
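
A minimal sketch of a per-pixel grayscale kernel of the kind measured above. The interleaved RGB layout, the kernel name, and the 0.299/0.587/0.114 luminance weights are assumptions for illustration, not necessarily what the repository uses.

    // Hedged sketch: convert an interleaved 8-bit RGB image to grayscale,
    // one thread per pixel.
    __global__ void grayscaleKernel(const unsigned char* rgb, unsigned char* gray,
                                    int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;

        if (x < width && y < height)
        {
            int idx = y * width + x;
            float r = rgb[3 * idx + 0];
            float g = rgb[3 * idx + 1];
            float b = rgb[3 * idx + 2];
            gray[idx] = (unsigned char)(0.299f * r + 0.587f * g + 0.114f * b);
        }
    }

    // Example launch for a 640x480 image: 16x16 threads per block.
    // dim3 block(16, 16);
    // dim3 grid((640 + 15) / 16, (480 + 15) / 16);
    // grayscaleKernel<<<grid, block>>>(dev_rgb, dev_gray, 640, 480);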

Matrix multiplication on CPU & GPU results

a) Matrix A=300x300, Matrix B=300x300

  • CPU (avg) = 133 ms
  • GPU version without #pragma unroll (avg) = 123 ms
  • GPU version with #pragma unroll (avg) = 111 ms (about 10% faster)

b) Matrix A=900x900, Matrix B=900x900

  • CPU (avg) = 4839 ms
  • GPU version without #pragma unroll (avg) = 1646 ms
  • GPU version with #pragma unroll (widthC computed at runtime) = 1633 ms (no speedup)
  • GPU version with #pragma unroll and the compile-time constant CONST_WIDTH_C = 400 ms (roughly 24-30% of the runtime-width time)

The loop below is the one that was tested with #pragma unroll:

    // Inner-product loop; CONST_WIDTH_C is a compile-time constant
    #pragma unroll
    for (int i = 0; i < CONST_WIDTH_C; i++)
    {
        tmp_sum += in_tabA[row * CONST_WIDTH_C + i] * in_tabB[i * CONST_WIDTH_C + col];
    }
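
For context, a minimal sketch of how a complete kernel around this loop might look. The kernel name, parameter names, and the fixed matrix width of 900 (the b) test case) are illustrative assumptions, not the repository's actual code.

    // Hedged sketch: square matrices of side CONST_WIDTH_C, one thread per
    // output element of C = A * B.
    #define CONST_WIDTH_C 900   // assumed value, matching the 900x900 test

    __global__ void matMulKernel(const float* in_tabA, const float* in_tabB, float* out_tabC)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;

        if (row < CONST_WIDTH_C && col < CONST_WIDTH_C)
        {
            float tmp_sum = 0.0f;

            // Because CONST_WIDTH_C is known at compile time, NVCC can
            // unroll this loop when #pragma unroll is applied.
            #pragma unroll
            for (int i = 0; i < CONST_WIDTH_C; i++)
            {
                tmp_sum += in_tabA[row * CONST_WIDTH_C + i] * in_tabB[i * CONST_WIDTH_C + col];
            }

            out_tabC[row * CONST_WIDTH_C + col] = tmp_sum;
        }
    }

    // Example launch: 16x16 threads per block, enough blocks to cover the matrix.
    // dim3 block(16, 16);
    // dim3 grid((CONST_WIDTH_C + 15) / 16, (CONST_WIDTH_C + 15) / 16);
    // matMulKernel<<<grid, block>>>(dev_A, dev_B, dev_C);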

Conclusions:

  • The effectiveness of the #pragma unroll directive strongly depends on what is computed inside the loop
  • During the tests it turned out that CONST_WIDTH_C must be a constant or defined with #define; if it was computed at runtime from the size of a dynamic matrix (i.e. not known at compile time), #pragma unroll brought no benefit
  • With a compile-time constant bound, #pragma unroll lets the compiler remove much of the loop bookkeeping (counter updates, comparisons, branches) that would otherwise have to be executed for a dynamically controlled loop
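
For contrast, a sketch of the runtime-width variant discussed above (the parameter name widthC follows the wording in the results; the kernel signature is an assumption):

    // When the loop bound is a runtime parameter, NVCC cannot determine the
    // trip count at compile time, so #pragma unroll cannot fully unroll the
    // loop and, as measured above, brings essentially no benefit.
    __global__ void matMulKernelDynamic(const float* in_tabA, const float* in_tabB,
                                        float* out_tabC, int widthC)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;

        if (row < widthC && col < widthC)
        {
            float tmp_sum = 0.0f;

            #pragma unroll
            for (int i = 0; i < widthC; i++)   // bound not known at compile time
            {
                tmp_sum += in_tabA[row * widthC + i] * in_tabB[i * widthC + col];
            }

            out_tabC[row * widthC + col] = tmp_sum;
        }
    }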
