Giter Site home page Giter Site logo

hpc's Introduction

Matrix Multiplication in Cuda

This implementation performs the multiplication A^T*A of a matrix A, in CUDA. This repository consists of three parts. The first one, uses cublasDgemm function, to perform the multiplication. The second one, uses global memory and a simple loop construct to perform the multiplication. The third and last one, utilizes multiple optimized constructs to outperform the simple cublasDgemm implementation. Finally, all of the above implementations are benchmarked against each other. Check the report.pdf for a detailed walk through of the project.

Algorithm Design

The third implementation works with a block multiplication paradigm, where only matrix A is stored into global memory, and every block of threads maps to a tile to perform the final multiplication. Using tiles to perform the multiplication, achieves better locality of references achieving higher memory throughput, and faster execution.

In detail, since we run tests on a Tesla C2075 GPU (compute capability 2), a tile size of 48*48 cells and a block size of 16*16 threads are created. In this fashion, every thread maps to 9 cells, performing 9 multiplication of the final matrix C. Each thread utilizes three different sets of registers:

  1. The registers rC[3][3], are used to store the partial product, before kernel's termination and final update of matrix C.
  2. The registers rA[3] and rA_T[3], are used to store a column and a row, from a block of matrix A respectively. Those cells are loaded from shared memory and are later used to calculate the value of rC.
  3. The registers ra[3] and ra_T[3], are used for prefetching the next tile from the slow global memory to the faster shared memory.

Registers rC are zeroed and thread offsets are calculated to initiate the first transfer from global to shared memory. All threads are synced and the main loop starts calculating the matrix product in tiles. Before the actual calculation of this partial product, the prefetching of the next tile is initiated, as described before, storing the new tile data in the ra, ra_T registers. From the currently available tile in shared memory, registers rA, rA_T are populated. The result of rC is calculated easily as:

math-equation

When all the threads in a block finish with this calculation, they transfer the new tile parts, from the registers ra, ra_T, back to shared memory. The final tile is calculated and the final values of rC are stored back to global memory. Since the product of A^T*A is a symmetric matrix C, we only calculate half of the matrix (upper triangular in our case) and copy the results to their symmetric indices in C.

Testing

The three implementations, are benchmarked against their execution time. The graph bellow shows their performance on different matrix sizes of A:

graph

Compiling

To compile each implementation, simply run the make command. The available targets are erwt1, erwt2, erwt3. Depending on your GPU, TILE_WIDTH and BLOCK_WIDTH can be chosen appropriately to fully utilize your computing capability.

Contributors

memaskal vassilas
memaskal vassilas

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.