Matrix Multiplication in Cuda

This implementation performs the multiplication A^T*A of a matrix A, in CUDA. This repository consists of three parts. The first one, uses cublasDgemm function, to perform the multiplication. The second one, uses global memory and a simple loop construct to perform the multiplication. The third and last one, utilizes multiple optimized constructs to outperform the simple cublasDgemm implementation. Finally, all of the above implementations are benchmarked against each other. Check the report.pdf for a detailed walk through of the project.

Algorithm Design

The third implementation works with a block multiplication paradigm, where only matrix A is stored into global memory, and every block of threads maps to a tile to perform the final multiplication. Using tiles to perform the multiplication, achieves better locality of references achieving higher memory throughput, and faster execution.

In detail, since we run tests on a Tesla C2075 GPU (compute capability 2), a tile size of 48*48 cells and a block size of 16*16 threads are created. In this fashion, every thread maps to 9 cells, performing 9 multiplication of the final matrix C. Each thread utilizes three different sets of registers:

The registers rC[3][3], are used to store the partial product, before kernel's termination and final update of matrix C.
The registers rA[3] and rA_T[3], are used to store a column and a row, from a block of matrix A respectively. Those cells are loaded from shared memory and are later used to calculate the value of rC.
The registers ra[3] and ra_T[3], are used for prefetching the next tile from the slow global memory to the faster shared memory.

Registers rC are zeroed and thread offsets are calculated to initiate the first transfer from global to shared memory. All threads are synced and the main loop starts calculating the matrix product in tiles. Before the actual calculation of this partial product, the prefetching of the next tile is initiated, as described before, storing the new tile data in the ra, ra_T registers. From the currently available tile in shared memory, registers rA, rA_T are populated. The result of rC is calculated easily as:

$math-equation$

When all the threads in a block finish with this calculation, they transfer the new tile parts, from the registers ra, ra_T, back to shared memory. The final tile is calculated and the final values of rC are stored back to global memory. Since the product of A^T*A is a symmetric matrix C, we only calculate half of the matrix (upper triangular in our case) and copy the results to their symmetric indices in C.

Testing

The three implementations, are benchmarked against their execution time. The graph bellow shows their performance on different matrix sizes of A:

Compiling

To compile each implementation, simply run the make command. The available targets are erwt1, erwt2, erwt3. Depending on your GPU, TILE_WIDTH and BLOCK_WIDTH can be chosen appropriately to fully utilize your computing capability.

Contributors


memaskal	vassilas

memaskal / hpc Goto Github PK

hpc's Introduction

Matrix Multiplication in Cuda

Algorithm Design

Testing

Compiling

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent