flash-attention-minimal

A minimal re-implementation of Flash Attention with CUDA and PyTorch. The official implementation can be quite daunting for a CUDA beginner (like myself), so this repo tries to be small and educational.

  • The entire forward pass is written in ~100 lines in flash.cu.
  • The variable names follow the notations from the original paper.

Usage

Prerequisite

  • PyTorch (with CUDA)
  • Ninja, used by PyTorch to JIT-compile the C++/CUDA extension
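The Ninja dependency comes from PyTorch's JIT extension loader, which compiles flash.cu at import time. A hypothetical loading snippet (the exact call in bench.py may differ; this is the standard pattern, guarded so it is a no-op without a GPU):

```python
import torch
from torch.utils.cpp_extension import load

# Build the extension only when a CUDA device (and toolchain) is present.
if torch.cuda.is_available():
    minimal_attn = load(
        name="minimal_attn",
        sources=["flash.cu"],   # assumes flash.cu is in the working directory
        extra_cuda_cflags=["-O2"],
    )
    # out = minimal_attn.forward(q, k, v)
```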

Benchmark

Compare the wall-clock time between manual attention and minimal flash attention:

python bench.py

Sample output on a T4:

=== profiling manual attention ===
...
Self CPU time total: 52.389ms
Self CUDA time total: 52.545ms

=== profiling minimal flash attention === 
...  
Self CPU time total: 11.452ms
Self CUDA time total: 3.908ms

Speed-up achieved!
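For reference, the manual baseline is standard softmax attention, which materializes the full N x N score matrix. A sketch of what bench.py's manual_attn presumably computes (name and (batch, heads, N, d) shapes are assumptions):

```python
import math
import torch

def manual_attn(q, k, v):
    # O(N^2) attention: the full score matrix is built in memory,
    # which is exactly the read/write traffic Flash Attention avoids.
    att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
    att = torch.softmax(att, dim=-1)
    return att @ v
```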

I don't have a GPU

Try out this online colab demo.

Caveats

  • No backward pass! To be honest, I found it a lot more complex than the forward pass, and the forward pass alone was enough to show the use of shared memory to avoid large N^2 reads/writes.
  • In the inner loop, I assign each thread to a row of the output matrix. This differs from the original implementation.
  • This thread-per-row simplification makes the matrix multiplications very slow. This is probably why, for longer sequences and larger block sizes, it ends up slower than the manual implementation.
  • Q, K, and V are in float32, unlike the original implementation, which uses float16.
  • The block size is fixed at 32 at compile time.
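The caveats above are easier to follow next to the algorithm itself. Below is a sketch of the tiled forward pass in plain PyTorch (not the kernel; block size and names are illustrative): K and V are streamed in blocks while each query row keeps a running max and running softmax denominator, so the full N x N matrix is never materialized.

```python
import torch

def flash_forward_sketch(q, k, v, Bc=32):
    # q, k, v: (N, d). Online-softmax tiling over K/V blocks of size Bc.
    N, d = q.shape
    scale = d ** -0.5
    O = torch.zeros_like(q)
    l = torch.zeros(N)                    # running softmax denominator per row
    m = torch.full((N,), float("-inf"))   # running row max per row
    for j in range(0, N, Bc):
        kj, vj = k[j:j + Bc], v[j:j + Bc]
        S = (q @ kj.T) * scale            # (N, Bc) scores for this block only
        m_new = torch.maximum(m, S.max(dim=-1).values)
        P = torch.exp(S - m_new[:, None])
        alpha = torch.exp(m - m_new)      # rescales previously accumulated stats
        l = alpha * l + P.sum(dim=-1)
        O = alpha[:, None] * O + P @ vj
        m = m_new
    return O / l[:, None]
```

The output matches ordinary softmax attention; the point is that each iteration touches only a (N, Bc) slice, which is what the kernel keeps in shared memory.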

Todos

  • Add backward pass
  • Speed up matmuls
  • Dynamically set block size

flash-attention-minimal's Issues

slow in for loop test

The kernel is slower when I time it in a for loop:

import time
import torch

REPEAT = 10
manual_result = manual_attn(q, k, v)  # warmup
torch.cuda.synchronize()              # finish warmup before starting the clock
st = time.time()
for _ in range(REPEAT):
    manual_result = manual_attn(q, k, v)
    torch.cuda.synchronize()
print(f"manual attention mean time(ms): {((time.time() - st) * 1000) / REPEAT}")

minimal_result = minimal_attn.forward(q, k, v)  # warmup
torch.cuda.synchronize()
st = time.time()
for _ in range(REPEAT):
    minimal_result = minimal_attn.forward(q, k, v)
    torch.cuda.synchronize()
print(f"minimal attention mean time(ms): {((time.time() - st) * 1000) / REPEAT}")

Correctness parameters

Hi Peter,

I just found your post on HN. Congratulations on the post!

I am one of the developers behind Faial, a tool that analyzes CUDA kernels and finds data races.

I ran our tool against the kernel flash.cu and found that it is data-race free as long as the following conditions are met:

  • N > 0
  • Bc == blockDim.x
  • Br == blockDim.x
  • Tr <= blockDim.x
  • N >= blockDim.x * blockDim.x

Faial is a research project, so I am wondering if having access to these correctness conditions is valuable to you as a developer.

Please let me know if you'd like me to try out any combinations of parameters to see if the kernel is still data-race free.
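The conditions reported above are easy to check mechanically before launching the kernel. A hypothetical helper (the function name and parameter spelling are mine, not part of the repo or of Faial):

```python
def is_launch_datarace_free(N, Bc, Br, Tr, blockDim_x):
    # Encodes the data-race-freedom conditions Faial reported for flash.cu.
    return (
        N > 0
        and Bc == blockDim_x
        and Br == blockDim_x
        and Tr <= blockDim_x
        and N >= blockDim_x * blockDim_x
    )
```

Note that with the repo's fixed block size of 32, the last condition means the sequence length N must be at least 32 * 32 = 1024 for these guarantees to apply.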
