
This project forked from shayebuhui01/flash_attention_inference


License: MIT License


Flash Attention Inference

Performance of the C++ interface of Flash Attention and Flash Attention v2 in large language model (LLM) inference scenarios. The computation is as follows, where tensors Q, K, V, and O are in FP16 precision. Code irrelevant to inference, such as backward, dropout, bf16, and torch dependencies, has been removed from flash attention, so you can easily integrate flash attention into LLM inference programs.

O = Softmax(Q * K^T) * V
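For reference, the expression above can be sketched as a naive single-head attention in plain C++. This is an illustrative sketch computed in FP32 for clarity (the library itself operates on FP16 tensors); the function name and shapes are assumptions, not the project's actual API.

```cpp
// Naive reference for O = Softmax(Q * K^T) * V, one attention head.
// Q: [seq_q, d], K: [seq_k, d], V: [seq_k, d] -> O: [seq_q, d], row-major.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<float> NaiveAttention(const std::vector<float>& Q,
                                  const std::vector<float>& K,
                                  const std::vector<float>& V,
                                  std::size_t seq_q, std::size_t seq_k,
                                  std::size_t d) {
    std::vector<float> O(seq_q * d, 0.0f);
    std::vector<float> score(seq_k);
    for (std::size_t i = 0; i < seq_q; ++i) {
        // S[i][j] = dot(Q[i], K[j]); track the row max for stable softmax.
        float max_s = -INFINITY;
        for (std::size_t j = 0; j < seq_k; ++j) {
            float s = 0.0f;
            for (std::size_t k = 0; k < d; ++k) s += Q[i * d + k] * K[j * d + k];
            score[j] = s;
            max_s = std::max(max_s, s);
        }
        // Numerically stable softmax over row i.
        float sum = 0.0f;
        for (std::size_t j = 0; j < seq_k; ++j) {
            score[j] = std::exp(score[j] - max_s);
            sum += score[j];
        }
        // O[i] = sum_j P[i][j] * V[j]
        for (std::size_t j = 0; j < seq_k; ++j) {
            float p = score[j] / sum;
            for (std::size_t k = 0; k < d; ++k) O[i * d + k] += p * V[j * d + k];
        }
    }
    return O;
}
```

Flash attention computes the same result without ever materializing the full `score` matrix, which is what makes it fast for long sequences.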

[Figure: multi-head attention (MHA) diagram]

Support

  • GQA/MQA Inference: Grouped Query Attention / Multi Query Attention inference
  • Hybrid Inference: hybrid batches that mix prefill and decoding requests
  • ALiBi Inference: Attention with Linear Biases
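Two of the features above boil down to small per-head computations, sketched below under assumptions (the function names are hypothetical, not the project's API): in GQA/MQA several query heads share one KV head, and ALiBi adds a per-head linear bias to the attention scores instead of positional embeddings.

```cpp
#include <cmath>

// GQA/MQA: map a query head to its shared KV head.
// MQA is the special case num_kv_heads == 1; MHA is num_kv_heads == num_q_heads.
int KvHeadForQueryHead(int q_head, int num_q_heads, int num_kv_heads) {
    int group_size = num_q_heads / num_kv_heads;  // query heads per KV head
    return q_head / group_size;
}

// ALiBi: slope for head h out of n heads (n a power of two),
// following the geometric sequence 2^(-8/n), 2^(-16/n), ...
float AlibiSlope(int head, int num_heads) {
    return std::pow(2.0f, -8.0f * (head + 1) / num_heads);
}

// ALiBi: bias added to score[q_pos][k_pos]; zero on the diagonal,
// increasingly negative for keys farther in the past.
float AlibiBias(float slope, int q_pos, int k_pos) {
    return -slope * static_cast<float>(q_pos - k_pos);
}
```

For example, with 32 query heads and 8 KV heads, query heads 0-3 all read KV head 0, which shrinks the KV cache by 4x.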

Compile

Environment

  • OS: Linux
  • CMake Version: >= 3.12
  • GCC Version: >= 5
  • CUDA Version: >= 11.4
  • Gflags: install on Ubuntu as follows
sudo apt-get install libgflags-dev

Clone

git clone https://github.com/Bruce-Lee-LY/flash_attention_inference.git

Build

RTX3080Ti / RTX3090 / RTX A6000

cd flash_attention_inference
./build.sh -a 86 -t Release -b OFF
./build.sh -a 86 -t Debug -b OFF

Tesla A100

cd flash_attention_inference
./build.sh -a 80 -t Release -b OFF
./build.sh -a 80 -t Debug -b OFF

Run Sample

./run_sample.sh

Performance

Process the data in the log and plot it as a line chart.

cd tools/performance
./performance.sh

RTX3090

  • CUDA Version: 11.8
  • Head Num: 32
  • Head Dim: 128

Prefill

Seq

The two kernels perform similarly on short sequences, while Flash Attention v2 performs better on long sequences, where it is up to about 50% faster.

  • Batch Size: 128
  • Seq Q: Seq
  • Seq K: Seq

[Figure: prefill performance vs. sequence length]

Batch

When the batch size is small, Flash Attention v2 performs better; when the batch size is large, the two kernels perform comparably.

  • Batch Size: Batch
  • Seq Q: 128
  • Seq K: 128

[Figure: prefill performance vs. batch size]

Decoding

Seq

The two kernels perform similarly on short sequences, while Flash Attention performs better on long sequences.

  • Batch Size: 128
  • Seq Q: 1
  • Seq K: Seq

[Figure: decoding performance vs. sequence length]

Batch

Flash Attention performs better regardless of batch size.

  • Batch Size: Batch
  • Seq Q: 1
  • Seq K: 128

[Figure: decoding performance vs. batch size]

Hybrid

Flash Attention and Flash Attention v2 perform similarly regardless of the ratio of prefill to decoding requests.

  • Batch Size: 100
  • Seq Q: 128
  • Seq K: 128

[Figure: hybrid inference performance]

Reference

  • flash attention: v1.0.9
  • flash attention v2: v2.1.0
  • cutlass: v3.1.0
