Dot Product Implementation

An implementation of the dot product using CUDA, x86-64, and SIMD, on the 32-bit integer data type, in C.

License: MIT

Link to CUDA implementation: https://colab.research.google.com/drive/1dxFkxKCokf-MbbmoTKiWeVs1MpCF6-hx?usp=sharing

Important Notes:

  • Each kernel was tested on input sizes of 2^20, 2^24, and 2^28, and each reported time is the average of 30 runs.
  • For every input size, the CUDA launch uses 1024 blocks of 1024 threads each.
  • The CUDA implementation uses cooperative groups together with memory prefetching.

I. Analysis

Average Runtime (microseconds)

| Kernel   | 2^20        | 2^24          | 2^28            |
| -------- | ----------- | ------------- | --------------- |
| C        | 4,633.33 us | 129,600.00 us | 1,289,266.67 us |
| x86-64   | 1,433.33 us | 28,366.67 us  | 271,266.67 us   |
| x86-SIMD | 1,133.33 us | 16,166.67 us  | 246,133.33 us   |
| CUDA     | 355.81 us   | 590.65 us     | 8,013.80 us     |

Speedup of x86-64, x86-SIMD, and CUDA over the C implementation (how many times faster than C)

| Kernel   | 2^20  | 2^24   | 2^28   |
| -------- | ----- | ------ | ------ |
| x86-64   | 3.23  | 4.57   | 4.75   |
| x86-SIMD | 4.09  | 8.02   | 5.24   |
| CUDA     | 13.02 | 219.42 | 160.88 |

Speedup of x86-SIMD and CUDA over x86-64 (how many times faster than x86-64)

| Kernel   | 2^20 | 2^24  | 2^28  |
| -------- | ---- | ----- | ----- |
| x86-SIMD | 1.26 | 1.75  | 1.10  |
| CUDA     | 4.02 | 48.03 | 33.85 |

Speedup of CUDA over x86-SIMD (how many times faster than x86-SIMD)

| Kernel | 2^20 | 2^24  | 2^28  |
| ------ | ---- | ----- | ----- |
| CUDA   | 3.19 | 27.37 | 30.71 |

From the tables above, it is observable that the assembly implementation of the dot product is consistently faster than the plain C implementation. This is mainly because a C for loop compiles to several instructions per iteration, whereas the hand-written assembly keeps the loop overhead down to a single LOOP instruction. x86-SIMD computes the dot product faster still because it processes multiple data elements with a single instruction: one instruction can multiply or add eight 32-bit integers at a time in a 256-bit YMM register.

For the CUDA implementation, cooperative groups were used along with prefetching to obtain a large performance gain. Threads are partitioned into smaller groups that perform the element-wise multiplication and a binary (tree) reduction. Execution time decreases because synchronization happens within these smaller groups, so a group does not have to wait for every thread in the block to reach a barrier in the code. In addition, prefetching sends data to the GPU before it is needed, while the CPU continues executing instructions; as a result, memory latency (the cost of transferring data) decreases.
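
As an illustration of this approach, a minimal sketch of a dot-product kernel that combines cooperative groups (a 32-thread tile for the binary reduction) with unified-memory prefetching might look like the following. The kernel name, tile width, and launch setup here are assumptions made for the sketch, not the repository's actual code (see the Colab link above for that):

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each 32-thread tile reduces its own partial sums, so synchronization
// happens inside the tile instead of across the whole block.
__global__ void dotKernel(const int *a, const int *b, long long *out, int n) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    // Grid-stride loop: element-wise multiply, accumulated per thread.
    long long sum = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += (long long)a[i] * b[i];

    // Binary (tree) reduction within the tile via register shuffles.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += tile.shfl_down(sum, offset);

    if (tile.thread_rank() == 0)
        atomicAdd((unsigned long long *)out, (unsigned long long)sum);
}

int main(void) {
    const int n = 1 << 20;  // 2^20 elements, as in the tests above
    int *a, *b; long long *out;
    cudaMallocManaged(&a, n * sizeof(int));
    cudaMallocManaged(&b, n * sizeof(int));
    cudaMallocManaged(&out, sizeof(long long));
    for (int i = 0; i < n; ++i) { a[i] = 1; b[i] = 2; }
    *out = 0;

    // Prefetch the inputs to the GPU before the kernel needs them,
    // overlapping the transfer with CPU-side work.
    int device; cudaGetDevice(&device);
    cudaMemPrefetchAsync(a, n * sizeof(int), device, 0);
    cudaMemPrefetchAsync(b, n * sizeof(int), device, 0);

    dotKernel<<<1024, 1024>>>(a, b, out, n);  // 1024 blocks x 1024 threads
    cudaDeviceSynchronize();
    printf("dot = %lld\n", *out);             // expected: 2 * 2^20 = 2097152
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```

The grid-stride loop lets the fixed 1024 x 1024 launch cover any of the tested input sizes, and the shuffle-based tile reduction is what avoids a full-block barrier.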

Although CUDA has the fastest average execution time for every array size, the GPU incurs overhead costs that x86-64 does not, since the latter runs directly on the CPU and needs no time to transfer data. Consider the 30-run CUDA results (section V): the kernel execution time alone is 11.61 ms, but once the overhead (data transfer time, GPU page fault groups, page throttling) is included, the actual runtime is 20.800 ms. The second-best performing kernel, x86-SIMD (section IV), takes 43.000 ms for an array of size 2^20. Even counting the overhead, the prefetching and cooperative-groups CUDA implementation still outperforms x86-SIMD by 2.06 times (43.000 ms / 20.800 ms ≈ 2.06).

II. C Program Result

[screenshot: C program output]
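
For reference, the C baseline is a single scalar loop: one multiply and one add per element, which the compiler turns into several instructions per iteration. A minimal sketch (function and variable names are assumed, not the repository's exact source):

```c
#include <stdio.h>
#include <stdlib.h>

/* Scalar dot product of 32-bit integers: one multiply-add per iteration. */
int dot_c(const int *a, const int *b, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

int main(void) {
    size_t n = 1u << 20;  /* 2^20 elements, as in the tests above */
    int *a = malloc(n * sizeof *a);
    int *b = malloc(n * sizeof *b);
    for (size_t i = 0; i < n; ++i) { a[i] = 1; b[i] = 2; }
    printf("%d\n", dot_c(a, b, n));  /* 2 * 2^20 = 2097152 */
    free(a);
    free(b);
    return 0;
}
```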

III. x86-64 Program Result

[screenshot: x86-64 program output]

IV. x86-SIMD Program Result

[screenshot: x86-SIMD program output]

  • Register YMM0 loads the first 8 elements from the base address of pointer a.

    [screenshot: YMM0 contents]

  • Register YMM1 loads the first 8 elements from the base address of pointer b.

    [screenshot: YMM1 contents]

  • Register YMM2 stores the vertical (element-wise) multiplication of pointers a and b.

    [screenshot: YMM2 contents]

  • Register YMM3 stores the horizontal addition of register YMM2 (duplicates of the horizontal sums appear in the register).

    [screenshot: YMM3 contents]
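
The register walkthrough above corresponds step-for-step to the following AVX2 intrinsics version in C (compile with -mavx2 on GCC/Clang). This is a sketch of the technique rather than the repository's NASM source, and it assumes n is a multiple of 8:

```c
#include <stdio.h>
#include <immintrin.h>

/* AVX2 dot product of 32-bit integers, mirroring the steps above:
 * load 8 elements of a and b (YMM0/YMM1), multiply vertically (YMM2),
 * then add horizontally (YMM3). */
int dot_simd(const int *a, const int *b, size_t n) {
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < n; i += 8) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i)); /* YMM0 */
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i)); /* YMM1 */
        __m256i prod = _mm256_mullo_epi32(va, vb);                 /* YMM2 */
        acc = _mm256_add_epi32(acc, prod);
    }
    /* Horizontal addition; duplicate partial sums appear in each
     * 128-bit lane, as in the YMM3 screenshot. */
    acc = _mm256_hadd_epi32(acc, acc);
    acc = _mm256_hadd_epi32(acc, acc);
    return _mm256_extract_epi32(acc, 0) + _mm256_extract_epi32(acc, 4);
}

int main(void) {
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[8] = {1, 1, 1, 1, 1, 1, 1, 1};
    printf("%d\n", dot_simd(a, b, 8)); /* prints 36 */
    return 0;
}
```

The two hadd calls leave a copy of each lane's total in every element of that lane, which is exactly the duplication described for YMM3; adding one copy from each 128-bit lane gives the final scalar.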

V. CUDA Program Result

[screenshot: CUDA program output]

Overhead time

[screenshot: overhead time breakdown]
