Giter Site home page Giter Site logo

vinx13 / flashinfer Goto Github PK

View Code? Open in Web Editor NEW

This project forked from flashinfer-ai/flashinfer

0.0 1.0 0.0 1.1 MB

FlashInfer: Kernel Library for LLM Serving

Home Page: https://flashinfer.ai

License: Apache License 2.0

Shell 0.26% C++ 3.87% Python 19.71% Cuda 72.96% CMake 3.19%

flashinfer's Introduction

FlashInfer

Kernel Library for LLM Serving

| Blog | Documentation | Discussion Forum |

Release Documentation

FlashInfer is a library for Language Languages Models that provides high-performance implementation of LLM GPU kernels such as FlashAttention, PageAttention and LoRA. FlashInfer focus on LLM serving and inference, and delivers state-the-art performance across diverse scenarios.

The unique features of FlashInfer include:

  1. Comprehensive Attention Kernels: Attention kernels that cover all the common use cases of LLM serving, including single-request and batching versions of Prefill, Decode, and Append kernels, on different formats of KV-Cache (Padded Tensor, Ragged Tensor, and Page Table).
  2. Optimized Shared-Prefix Batch Decoding: FlashInfer enhances shared-prefix batch decoding performance through cascading, resulting in an impressive up to 31x speedup compared to the baseline vLLM PageAttention implementation (for long prompt of 32768 tokens and large batch size of 256).
  3. Accelerate Attention for Compressed/Quantized KV-Cache: Modern LLMs are often deployed with quantized/compressed KV-Cache to reduce memory traffic. FlashInfer accelerates these scenarios by optimizing performance for Grouped-Query Attention, Fused-RoPE Attention and Quantized Attention.

FlashInfer support PyTorch, TVM and C++ (header-only) APIs, and can be easily integrated into existing projects.

News

  • [Jan 31, 2024] Blog Post Cascade Inference: Memory-Efficient Shared Prefix Batch Decoding
  • [Jan 31, 2024] Blog Post Accelerating Self-Attentions for LLM Serving with FlashInfer

Getting Started

Using our PyTorch API is the easiest way to get started:

Installation

We provide prebuilt wheels for Linux and you can try out FlashInfer with the following command:

# For CUDA 12.1 & torch 2.2
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.2
# For other CUDA & torch versions, please check https://docs.flashinfer.ai/installation.html

or you can build from source:

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer/python
pip install -e .

Trying it out

Below is a minimal example of using FlashInfer's single-request decode/append/prefill attention kernels:

import torch
import flashinfer

kv_len = 2048
num_kv_heads = 32
head_dim = 128

k = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0) 
v = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0) 

# decode attention

num_qo_heads = 32
q = torch.randn(num_qo_heads, head_dim).half().to(0)

o = flashinfer.single_decode_with_kv_cache(q, k, v) # decode attention without RoPE on-the-fly
o_rope_on_the_fly = flashinfer.single_decode_with_kv_cache(q, k, v, pos_encoding_mode="ROPE_LLAMA") # decode with LLaMA style RoPE on-the-fly

# append attention
append_qo_len = 128
q = torch.randn(append_qo_len, num_qo_heads, head_dim).half().to(0) # append attention, the last 128 tokens in the KV-Cache are the new tokens
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True) # append attention without RoPE on-the-fly, apply causal mask
o_rope_on_the_fly = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True, pos_encoding_mode="ROPE_LLAMA") # append attention with LLaMA style RoPE on-the-fly, apply causal mask

# prefill attention
qo_len = 2048
q = torch.randn(qo_len, num_qo_heads, head_dim).half().to(0) # prefill attention
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=False) # prefill attention without RoPE on-the-fly, do not apply causal mask

Check out documentation for usage of batch decode/append/prefill kernels and shared-prefix cascading kernels.

Run Benchmarks

We profile FlashInfer kernel performance with nvbench and you can compile and run the benchmarks with the following commands:

mkdir build
cp cmake/config.cmake build # you can modify the config.cmake to enable/disable benchmarks and change CUDA architectures
cd build
cmake ..
make -j12

You can run ./bench_{single/batch}_{prefill/decode} to benchmark the performance (e.g. ./bench_single_prefill for single-request prefill attention). ./bench_{single/batch}_{prefill/decode} --help will show you the available options.

C++ API and TVM Bindings

FlashInfer also provides C++ API and TVM bindings, please refer to documentation for more details.

Adoption

Currently FlashInfer is adopted by the following projects:

Acknowledgement

FlashInfer is inspired by FlashAttention 1&2, vLLM, stream-K and cutlass projects.

flashinfer's People

Contributors

yzh119 avatar masterjh5574 avatar junrushao avatar hnyls2002 avatar cyx-6 avatar abcdabcd987 avatar esmeetu avatar github-actions[bot] avatar yard1 avatar dune-z avatar guocuimi avatar qubitium avatar rickzx avatar hsq79815 avatar knowingnothing avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.