Awesome-LLM-Inference: a curated collection of awesome LLM inference papers with code. Please check 📙Awesome LLM Inference Papers with Codes below for more details.
- LLMs-Inference-Papers-v0.1.pdf: an introduction to LLMs and LLM inference techniques; a 600-page PDF covering the Transformer, BN, LN, MQA, FlashAttention 1/2, GLM, GPT, LLaMA 1/2, LoRA, QLoRA, P-Tuning V1/V2, RoPE, SmoothQuant, WINT8/4, Continuous Batching, FP8, etc.
- LLMs-Inference-Papers-v0.2.pdf: LLM inference papers only; a 286-page PDF covering ByteTransformer, FastServe, FlashAttention 1/2, FlexGen, FP8, LLM.int8(), Tensor Cores, PagedAttention, RoPE, SmoothQuant, SpecInfer, WINT8/4, Continuous Batching, ZeroQuant, etc.
Date | Title | Paper | Code |
---|---|---|---|
2022.10 | [ByteTransformer] A High-Performance Transformer Boosted for Variable-Length Inputs | [arxiv][pdf] | [GitHub][ByteTransformer] |
2022.07 | [Continuous Batching] Orca: A Distributed Serving System for Transformer-Based Generative Models | [osdi22-yu][pdf] | - |
2023.05 | [FastServe] Fast Distributed Inference Serving for Large Language Models | [arxiv][pdf] | - |
2022.05 | [FlashAttention] Fast and Memory-Efficient Exact Attention with IO-Awareness | [arxiv][pdf] | [GitHub][flash-attention] |
2023.07 | [FlashAttention-2] Faster Attention with Better Parallelism and Work Partitioning | [arxiv][pdf] | [GitHub][flash-attention] |
2023.03 | [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU | [arxiv][pdf] | [GitHub][FlexGen] |
2022.09 | [FP8] FP8 Formats for Deep Learning | [arxiv][pdf] | - |
2022.08 | [LLM.int8()] 8-bit Matrix Multiplication for Transformers at Scale | [arxiv][pdf] | [GitHub][bitsandbytes] |
2018.03 | [Tensor Core] NVIDIA Tensor Core Programmability, Performance & Precision | [arxiv][pdf] | - |
2018.05 | [Online Softmax] Online Normalizer Calculation for Softmax (see the sketch after the table) | [arxiv][pdf] | - |
2023.09 | [PagedAttention] Efficient Memory Management for Large Language Model Serving with PagedAttention | [arxiv][pdf] | [GitHub][vllm] |
2023.08 | [Tensor Cores] Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library | [arxiv][pdf] | [GitHub][wmma_extension] |
2021.04 | [RoPE] RoFormer: Enhanced Transformer with Rotary Position Embedding | [arxiv][pdf] | [GitHub][transformers] |
2022.11 | [SmoothQuant] Accurate and Efficient Post-Training Quantization for Large Language Models (see the sketch after the table) | [arxiv][pdf] | [GitHub][smoothquant] |
2023.05 | [SpecInfer] Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification | [arxiv][pdf] | [GitHub][FlexFlow] |
2022.11 | [WINT8/4] Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production | [arxiv][pdf] | [GitHub][FasterTransformer] |
2022.06 | [ZeroQuant] Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | [arxiv][pdf] | [GitHub][DeepSpeed] |
2023.03 | [ZeroQuant-V2] Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation | [arxiv][pdf] | [GitHub][DeepSpeed] |
2023.07 | [ZeroQuant-FP] A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats | [arxiv][pdf] | [GitHub][DeepSpeed] |
2023.09 | [StreamingLLM] Efficient Streaming Language Models with Attention Sinks | [arxiv][pdf] | [GitHub][streaming-llm] |
2023.06 | [AWQ] Activation-aware Weight Quantization for LLM Compression and Acceleration | [arxiv][pdf] | [GitHub][llm-awq] |
2023.06 | [SpQR] A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | [arxiv][pdf] | [GitHub][SpQR] |
2023.09 | [Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads | [blog] | [GitHub][Medusa] |
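
Some entries above are small, self-contained algorithms. As a taste, here is a minimal sketch of the single-pass normalizer from the Online Softmax paper (Milakov & Gimelshein, 2018); the function name and structure are illustrative, not taken from any listed repo.

```python
# Minimal sketch of online (single-pass) softmax normalizer calculation.
# Keep a running max `m` and a running sum `d` of exp(x - m); rescale `d`
# whenever the max grows, so only one pass over the data computes (m, d).
import math

def online_softmax(xs):
    m = float("-inf")  # running maximum seen so far
    d = 0.0            # running sum of exp(x - m), rescaled as m grows
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum to the new maximum, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]

print(online_softmax([1.0, 2.0, 3.0]))  # ≈ [0.0900, 0.2447, 0.6652]
```

This running-rescale trick is the same idea FlashAttention 1/2 use to tile attention without ever materializing the full softmax denominator.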
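
Similarly, below is a minimal sketch of SmoothQuant's per-channel difficulty migration, assuming a plain NumPy linear layer; names and the epsilon guards are illustrative, and the linked smoothquant repo has the real implementation.

```python
# Minimal sketch of SmoothQuant difficulty migration: per input channel j,
# s_j = max|X_j|**alpha / max|W_j|**(1 - alpha) shifts quantization
# difficulty from activations to weights while (X / s) @ (s * W) == X @ W.
import numpy as np

def smooth(X, W, alpha=0.5):
    """X: activations [n_tokens, in_features]; W: weights [in_features, out_features]."""
    act_max = np.abs(X).max(axis=0)                        # per-channel activation range
    w_max = np.abs(W).max(axis=1)                          # per-channel weight range
    s = act_max**alpha / np.maximum(w_max**(1 - alpha), 1e-5)
    s = np.maximum(s, 1e-5)                                # guard against dead channels
    return X / s, W * s[:, None]                           # product is unchanged

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8)) * np.array([1, 10, 1, 1, 1, 1, 1, 100.0])  # outlier channels
W = rng.standard_normal((8, 3))
X_s, W_s = smooth(X, W)
assert np.allclose(X @ W, X_s @ W_s)                       # same output, smoother ranges
```

After smoothing, both `X_s` and `W_s` have more uniform per-channel ranges, which is what makes simple INT8 quantization of both operands viable.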
License: GNU General Public License v3.0