
Awesome-LLM-Inference

Home Page: https://github.com/DefTruth/Awesome-LLM-Inference

📒Introduction

Awesome-LLM-Inference: a small collection of awesome LLM inference papers with code. For details, see 📙Awesome LLM Inference Papers with Codes below.

🎉Download PDFs

  • LLMs-Inference-Papers-v0.1.pdf: an introduction to LLMs and LLM inference techniques; a 600-page PDF covering the Transformer, BN, LN, MQA, FlashAttention 1/2, GLM, GPT, LLaMA 1/2, LoRA, QLoRA, P-Tuning V1/V2, RoPE, SmoothQuant, WINT8/4, Continuous Batching, FP8, etc.
  • LLMs-Inference-Papers-v0.2.pdf: LLM inference papers only; a 286-page PDF covering ByteTransformer, FastServe, FlashAttention 1/2, FlexGen, FP8, LLM.int8(), Tensor Cores, PagedAttention, RoPE, SmoothQuant, SpecInfer, WINT8/4, Continuous Batching, ZeroQuant, etc.

📙Awesome LLM Inference Papers with Codes

| Date | Title | Paper | Code |
|:---|:---|:---|:---|
| 2022.10 | [ByteTransformer] A High-Performance Transformer Boosted for Variable-Length Inputs | [arxiv][pdf] | [GitHub][ByteTransformer] |
| 2022.07 | [Continuous Batching] Orca: A Distributed Serving System for Transformer-Based Generative Models | [osdi22-yu][pdf] | - |
| 2023.05 | [FastServe] Fast Distributed Inference Serving for Large Language Models | [arxiv][pdf] | - |
| 2022.05 | [FlashAttention] Fast and Memory-Efficient Exact Attention with IO-Awareness | [arxiv][pdf] | [GitHub][flash-attention] |
| 2023.07 | [FlashAttention-2] Faster Attention with Better Parallelism and Work Partitioning | [arxiv][pdf] | [GitHub][flash-attention] |
| 2023.03 | [FlexGen] High-Throughput Generative Inference of Large Language Models with a Single GPU | [arxiv][pdf] | [GitHub][FlexGen] |
| 2022.09 | [FP8] FP8 Formats for Deep Learning | [arxiv][pdf] | - |
| 2022.08 | [LLM.int8()] 8-bit Matrix Multiplication for Transformers at Scale (see the loading sketch after this table) | [arxiv][pdf] | [GitHub][bitsandbytes] |
| 2018.03 | [Tensor Core] NVIDIA Tensor Core Programmability, Performance & Precision | [arxiv][pdf] | - |
| 2018.05 | [Online Softmax] Online normalizer calculation for softmax (see the sketch after this table) | [arxiv][pdf] | - |
| 2023.09 | [PagedAttention] Efficient Memory Management for Large Language Model Serving with PagedAttention | [arxiv][pdf] | [GitHub][vllm] |
| 2023.08 | [Tensor Cores] Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library | [arxiv][pdf] | [GitHub][wmma_extension] |
| 2021.04 | [RoPE] RoFormer: Enhanced Transformer with Rotary Position Embedding | [arxiv][pdf] | [GitHub][transformers] |
| 2022.11 | [SmoothQuant] Accurate and Efficient Post-Training Quantization for Large Language Models | [arxiv][pdf] | [GitHub][smoothquant] |
| 2023.05 | [SpecInfer] Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification | [arxiv][pdf] | [GitHub][FlexFlow] |
| 2022.11 | [WINT8/4] Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production | [arxiv][pdf] | [GitHub][FasterTransformer] |
| 2022.06 | [ZeroQuant] Efficient and Affordable Post-Training Quantization for Large-Scale Transformers | [arxiv][pdf] | [GitHub][DeepSpeed] |
| 2023.03 | [ZeroQuant-V2] Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation | [arxiv][pdf] | [GitHub][DeepSpeed] |
| 2023.07 | [ZeroQuant-FP] A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats | [arxiv][pdf] | [GitHub][DeepSpeed] |
| 2023.09 | [StreamingLLM] Efficient Streaming Language Models with Attention Sinks | [arxiv][pdf] | [GitHub][streaming-llm] |
| 2023.06 | [AWQ] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | [arxiv][pdf] | [GitHub][llm-awq] |
| 2023.06 | [SpQR] SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression | [arxiv][pdf] | [GitHub][SpQR] |
| 2023.09 | [Medusa] Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads | [blog] | [GitHub][Medusa] |
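
The Online Softmax entry above is the single-pass normalizer trick that FlashAttention builds on. As a quick illustration (a minimal sketch of the algorithm from the paper, not code taken from any of the listed repos), the running maximum and the running sum of exponentials are carried through one sweep, and the sum is rescaled whenever a larger maximum appears:

```python
import math

def online_softmax(scores):
    """One-pass softmax: compute the running max and the normalizer together
    (after Milakov & Gimelshein, "Online normalizer calculation for softmax")."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in scores:
        new_max = max(running_max, x)
        # Rescale the sum accumulated under the old max, then add the
        # current term; math.exp(-inf) == 0.0, so the first step is safe.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in scores]

print(online_softmax([1.0, 2.0, 3.0]))  # ~[0.0900, 0.2447, 0.6652]
```

A naive numerically stable softmax needs two passes over the scores (one for the max, one for the normalizer); fusing them into a single pass is what lets FlashAttention process attention scores block by block without materializing the full score matrix.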
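Several of the quantization entries above (LLM.int8(), SmoothQuant, AWQ) ship as usable libraries. As a hedged usage sketch, assuming a recent 🤗 transformers with accelerate and bitsandbytes installed and a CUDA GPU (the model id below is just a placeholder, not something prescribed by this list), LLM.int8() quantization can be enabled when loading a model:

```python
# pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # placeholder; any causal LM checkpoint should work

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # requires accelerate
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # LLM.int8() via bitsandbytes
)

inputs = tokenizer("Large language model inference is", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Under LLM.int8(), weights are stored in int8 and most matrix multiplications run in int8, with a small fp16 path for outlier activation dimensions, roughly halving weight memory versus fp16 at near-lossless quality.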

©️License

GNU General Public License v3.0
