heydavid633 / matmultutorial

This project forked from knowingnothing/matmultutorial

An Easy-to-Understand TensorOp Matmul Tutorial

License: Apache License 2.0

C++ 49.76% Python 47.09% C 0.35% Cuda 2.80%

matmultutorial's Introduction

TensorOp Matmul Tutorial

This is an example repo for CUDA MatMul implementations. It aims to give CUDA beginners some insight into high-performance kernel design. Currently, I only provide implementation examples in examples/matmul/this. Contributions of more kernels and other MatMul implementations are very welcome.

About

A detailed explanation of the different versions of the MatMul kernels is in examples/matmul/this.

Contents

  • examples:

    • matmul: The MatMul implementations

      • this-sm90: The Hopper version of the MatMul
      • this-sm80: The MatMul implemented by this repo
      • cublas: Calls CuBLAS for performance tests
      • cutlass: Calls CUTLASS for performance tests
      • mlir-gen: The CUDA code generated by MLIR
      • triton: Calls Triton for performance tests
      • tvm: Calls Relay+CUTLASS/CuBLAS or TensorIR for performance tests
    • atom: Usage of individual intrinsics/instructions
    • reduction: Some reduction kernels for the epilogue

Performance Results

Performance on H800 GPU

[Figure: performance on H800 GPU]

The current version achieves only about 70% of CuBLAS performance on average. I am still working on improving it.

Performance on A100 GPU

[Figure: A100-GEMM-perf]

Overall performance comparison among Relay, CuBLAS, CUTLASS, TensorIR, Triton, and our implementation. The y-axis is the speedup over Relay+CUTLASS.

Overall, the geometric mean speedup over Relay+CUTLASS is 1.73x, over TensorIR (1000 MetaSchedule tuning trials per case) is 1.22x, over CuBLAS is 1.00x, over CUTLASS is 0.999x, and over Triton is 1.07x. The 61 shapes are:

No. M N K
1 5376 5376 2048
2 5376-128 5376 2048
3 5376-2*128 5376 2048
... ... ... ...
11 5376-10*128 5376 2048
12 5376+128 5376 2048
13 5376+2*128 5376 2048
... ... ... ...
21 5376+10*128 5376 2048
22 5376 5376-128 2048
23 5376 5376-2*128 2048
... ... ... ...
31 5376 5376-10*128 2048
32 5376 5376+128 2048
33 5376 5376+2*128 2048
... ... ... ...
41 5376 5376+10*128 2048
42 5376 5376 2048-128
43 5376 5376 2048-2*128
... ... ... ...
51 5376 5376 2048-10*128
52 5376 5376 2048+128
53 5376 5376 2048+2*128
... ... ... ...
61 5376 5376 2048+10*128
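
The shape list above follows a regular pattern: one base shape, then each of M, N, and K is varied by up to ten steps of 128 in each direction while the other two stay fixed. As a rough sketch (not the repo's actual benchmark script; the names `benchmark_shapes` and `geometric_mean` are hypothetical), the shapes and a geometric-mean summary like the one quoted above could be produced as follows:

```python
import math

# Base shape and step taken from the table above.
BASE_M, BASE_N, BASE_K = 5376, 5376, 2048
STEP = 128
NUM_STEPS = 10


def benchmark_shapes():
    """Return the 61 (M, N, K) shapes in the same order as the table."""
    base = [BASE_M, BASE_N, BASE_K]
    shapes = [tuple(base)]  # No. 1: the base shape
    for dim in range(3):  # vary M, then N, then K
        for sign in (-1, 1):  # shrink first, then grow
            for i in range(1, NUM_STEPS + 1):
                shape = list(base)
                shape[dim] += sign * i * STEP
                shapes.append(tuple(shape))
    return shapes


def geometric_mean(values):
    """Geometric mean of positive values, e.g. per-shape speedups."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    shapes = benchmark_shapes()
    print(len(shapes))  # number of benchmarked shapes
    print(shapes[0], shapes[1], shapes[-1])
```

Per-shape speedups are ratios, which is why a geometric mean is the usual way to summarize them: a 2x speedup on one shape and a 0.5x slowdown on another average out to 1x.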

MLIR Generated CUDA kernels

I also use MLIR to generate MatMul kernels. The generated kernels are in examples/matmul/mlir-gen. Their performance relative to the handwritten ones (examples/matmul/this) is shown below. Because the MLIR-generated kernels implement only part of the optimizations used by the handwritten ones, we call the MLIR-generated kernels partial and the handwritten ones full.

[Figure: mlir-gen]

Overall, the MLIR-generated versions achieve 86% of the performance of the handwritten kernels.

Plan

More kernels

I plan to implement kernels for other operators, such as softmax, in the future.

Use CUTLASS in implementation

I plan to use the CuTe interface of CUTLASS to implement high-performance kernels.

matmultutorial's People

Contributors

knowingnothing, kuangjux, l1nkr
