
Flash-LLM

Flash-LLM is a large language model (LLM) inference acceleration library for models with unstructured weight pruning. Its core is efficient GPU code for Tensor-Core-accelerated unstructured sparse matrix multiplication (SpMM), which effectively accelerates the matrix computations that dominate LLM inference. With Flash-LLM, pruned LLM models can be deployed on GPUs with lower memory consumption and executed more efficiently. The code has currently been evaluated on NVIDIA A100 GPUs.
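Unstructured pruning removes individual weights (rather than whole blocks, rows, or channels) from a weight matrix, typically by magnitude. As an illustrative sketch only (producing a pruned model is outside the scope of this library), magnitude-based unstructured pruning to a target sparsity can be expressed as:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights to reach the target sparsity."""
    k = int(weights.size * sparsity)  # number of weights to remove
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)
W80 = magnitude_prune(W, 0.8)  # ~80% of entries are now zero
```

The resulting matrix has no regular sparsity pattern, which is exactly what makes efficient Tensor-Core execution non-trivial and what Flash-LLM's SpMM kernels target.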

We observe that LLM inference performance and memory usage are largely bound by four types of skinny MatMuls, shown in the left figure. Flash-LLM optimizes these four MatMuls with a key technique called "Load-as-Sparse and Compute-as-Dense" (LSCD).
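The LSCD idea can be sketched at a high level as follows: the sparse weight matrix is loaded from memory in a compressed form (only nonzero values plus index metadata), reconstructed on the fly as a dense tile in fast memory, and then multiplied with the dense activation using an ordinary dense (Tensor-Core-style) MatMul. The host-side sketch below assumes a simple value-plus-flat-index format for illustration; Flash-LLM's actual on-GPU format, tiling, and pipelining are more involved.

```python
import numpy as np

def compress(W: np.ndarray):
    """Load-as-Sparse: keep only nonzero values plus their flat indices."""
    idx = np.flatnonzero(W)
    return W.ravel()[idx].copy(), idx, W.shape

def lscd_matmul(values, idx, shape, X: np.ndarray) -> np.ndarray:
    """Compute-as-Dense: rebuild a dense tile, then run a dense GEMM."""
    dense = np.zeros(shape, dtype=values.dtype)  # stand-in for shared memory
    dense.ravel()[idx] = values                  # on-the-fly decompression
    return dense @ X                             # dense Tensor-Core-style MMA

rng = np.random.default_rng(1)
W = rng.standard_normal((256, 256)).astype(np.float32)
W[rng.random(W.shape) < 0.8] = 0.0                    # ~80% unstructured sparsity
X = rng.standard_normal((256, 8)).astype(np.float32)  # skinny dense activation

vals, idx, shape = compress(W)
Y = lscd_matmul(vals, idx, shape, X)  # matches the dense product W @ X
```

The benefit is that memory traffic for the weights shrinks roughly in proportion to the sparsity (important for memory-bound skinny MatMuls), while the arithmetic still runs on dense Tensor-Core hardware rather than slower scalar sparse kernels.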

Getting Started

Visit the documentation to get started.

Performance

Flash-LLM shows superior performance both at the single-SpMM-kernel level and in end-to-end LLM inference. The figure below compares kernel-level performance between Flash-LLM and state-of-the-art solutions. Flash-LLM outperforms Sputnik/SparTA by 3.6x/1.4x, 3.0x/1.4x, and 2.0x/1.6x under 70%, 80%, and 90% sparsity, respectively. Flash-LLM also outperforms the state-of-the-art dense kernels of cuBLAS (with Tensor Cores enabled) by 1.4x, 1.7x, and 2.1x.

(Figure: kernel-level benchmarking results.)

The figure below on the left shows the performance of Flash-LLM, FasterTransformer, and DeepSpeed on the OPT-66B model. First, Flash-LLM supports larger batch sizes because it requires less memory; second, Flash-LLM achieves significantly higher token-generation throughput than FasterTransformer and DeepSpeed; finally, Flash-LLM often requires fewer GPUs to execute the same LLM.

The figure below on the right presents the performance of Flash-LLM and FasterTransformer on the OPT-175B model, along with a memory breakdown for inference. On the one hand, Flash-LLM's matrix computation is more efficient; on the other hand, its communication cost is lower because it requires fewer GPUs.

Publication

Flash-LLM is a collaborative research project between Alibaba Group and FSA-Lab@USYD, recently accepted to VLDB 2024:

Haojun Xia* (University of Sydney), Zhen Zheng*, Yuchao Li (Alibaba Group), Donglin Zhuang, Zhongzhu Zhou (University of Sydney), Xiafei Qiu, Yong Li, Wei Lin (Alibaba Group), and Shuaiwen Leon Song (University of Sydney). Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity. VLDB 2024.

The pre-print is available online on arXiv (2309.10285).

Citation

If you use this codebase or otherwise find our work valuable, please cite:

@misc{xia2023flashllm,
      title={Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity}, 
      author={Haojun Xia and Zhen Zheng and Yuchao Li and Donglin Zhuang and Zhongzhu Zhou and Xiafei Qiu and Yong Li and Wei Lin and Shuaiwen Leon Song},
      year={2023},
      eprint={2309.10285},
      archivePrefix={arXiv},
      primaryClass={cs.DC}
}
