Awesome-LLM-Inference: a curated collection of awesome LLM inference papers with code. Please check 📙Awesome LLM Inference Papers with Codes below for more details.
- LLMs-Inference-Papers-v0.1.pdf: an introduction to LLMs and LLM inference techniques; a 600-page PDF covering the Transformer, BN, LN, MQA, FlashAttention 1/2, GLM, GPT, LLaMA 1/2, LoRA, QLoRA, P-Tuning V1/V2, RoPE, SmoothQuant, WINT8/4, Continuous Batching, FP8, etc.
- LLMs-Inference-Papers-v0.2.pdf: LLM inference papers only; a 286-page PDF covering ByteTransformer, FastServe, FlashAttention 1/2, FlexGen, FP8, LLM.int8(), Tensor Cores, PagedAttention, RoPE, SmoothQuant, SpecInfer, WINT8/4, Continuous Batching, ZeroQuant, etc.
Date | Title | Paper | Code |
---|---|---|---|
2022.10 | [ByteTransformer] A High-Performance Transformer Boosted for Variable-Length Inputs | [arxiv][pdf] | [GitHub][ByteTransformer] |
2022.07 | [Continuous Batching] Orca: A Distributed Serving System for Transformer-Based Generative Models | [osdi22-yu][pdf] | - |
2023.05 | [FastServe] Fast Distributed Inference Serving for Large Language Models | [arxiv][pdf] | - |
2022.05 | [FlashAttention] Fast and Memory-Efficient Exact Attention with IO-Awareness | [arxiv][pdf] | [GitHub][flash-attention] |
2023.07 | [FlashAttention-2] Faster Attention with Better Parallelism and Work Partitioning | [arxiv][pdf] | [GitHub][flash-attention] |
2023.03 | [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU | [arxiv][pdf] | [GitHub][FlexGen] |
2022.09 | [FP8] FP8 Formats for Deep Learning | [arxiv][pdf] | - |
2022.08 | [LLM.int8()] 8-bit Matrix Multiplication for Transformers at Scale | [arxiv][pdf] | [GitHub][bitsandbytes] |
2018.03 | [Tensor Core] NVIDIA Tensor Core Programmability, Performance & Precision | [arxiv][pdf] | - |
2018.05 | [Online Softmax] Online Normalizer Calculation for Softmax (see the sketch after the table) | [arxiv][pdf] | - |
2023.09 | [PagedAttention] Efficient Memory Management for Large Language Model Serving with PagedAttention | [arxiv][pdf] | [GitHub][vllm] |
2023.08 | [Tensor Cores] Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library | [arxiv][pdf] | [GitHub][wmma_extension] |
2021.04 | [RoPE] RoFormer: Enhanced Transformer with Rotary Position Embedding | [arxiv][pdf] | [GitHub][transformers] |
2022.11 | [SmoothQuant] Accurate and Efficient Post-Training Quantization for Large Language Models (see the sketch after the table) | [arxiv][pdf] | [GitHub][smoothquant] |
2023.05 | [SpecInfer] Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification | [arxiv][pdf] | [GitHub][FlexFlow] |
2022.11 | [WINT8/4] Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production | [arxiv][pdf] | [GitHub][FasterTransformer] |
2022.06 | [ZeroQuant] Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | [arxiv][pdf] | [GitHub][DeepSpeed] |
2023.03 | [ZeroQuant-V2] Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation | [arxiv][pdf] | [GitHub][DeepSpeed] |
2023.07 | [ZeroQuant-FP] A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats | [arxiv][pdf] | [GitHub][DeepSpeed] |
2023.09 | [StreamingLLM] Efficient Streaming Language Models with Attention Sinks | [arxiv][pdf] | [GitHub][streaming-llm] |
2023.06 | [AWQ] Activation-aware Weight Quantization for LLM Compression and Acceleration | [arxiv][pdf] | [GitHub][llm-awq] |
2023.06 | [SpQR] A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | [arxiv][pdf] | [GitHub][SpQR] |
2023.09 | [Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads | [blog] | [GitHub][Medusa] |
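
Some entries above are small, self-contained algorithms. As a taste, here is a minimal sketch of the single-pass normalizer from the Online Softmax paper (Milakov & Gimelshein, 2018); the function name and structure are illustrative, not taken from any listed repo.

```python
# Minimal sketch of online (single-pass) softmax normalizer calculation.
# Keep a running max `m` and a running sum `d` of exp(x - m); rescale `d`
# whenever the max grows, so only one pass over the data computes (m, d).
import math

def online_softmax(xs):
    m = float("-inf")  # running maximum seen so far
    d = 0.0            # running sum of exp(x - m), rescaled as m grows
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum to the new maximum, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]

print(online_softmax([1.0, 2.0, 3.0]))  # ≈ [0.0900, 0.2447, 0.6652]
```

This running-rescale trick is the same idea FlashAttention 1/2 use to tile attention without ever materializing the full softmax denominator.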
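
Similarly, below is a minimal sketch of SmoothQuant's per-channel difficulty migration, assuming a plain NumPy linear layer; names and the epsilon guards are illustrative, and the linked smoothquant repo has the real implementation.

```python
# Minimal sketch of SmoothQuant difficulty migration: per input channel j,
# s_j = max|X_j|**alpha / max|W_j|**(1 - alpha) shifts quantization
# difficulty from activations to weights while (X / s) @ (s * W) == X @ W.
import numpy as np

def smooth(X, W, alpha=0.5):
    """X: activations [n_tokens, in_features]; W: weights [in_features, out_features]."""
    act_max = np.abs(X).max(axis=0)                        # per-channel activation range
    w_max = np.abs(W).max(axis=1)                          # per-channel weight range
    s = act_max**alpha / np.maximum(w_max**(1 - alpha), 1e-5)
    s = np.maximum(s, 1e-5)                                # guard against dead channels
    return X / s, W * s[:, None]                           # product is unchanged

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8)) * np.array([1, 10, 1, 1, 1, 1, 1, 100.0])  # outlier channels
W = rng.standard_normal((8, 3))
X_s, W_s = smooth(X, W)
assert np.allclose(X @ W, X_s @ W_s)                       # same output, smoother ranges
```

After smoothing, both `X_s` and `W_s` have more uniform per-channel ranges, which is what makes simple INT8 quantization of both operands viable.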
License: GNU General Public License v3.0