

LMDeploy


English | 简体中文

👋 join us on Twitter, Discord and WeChat


News 🎉

  • [2023/09] TurboMind supports all features of Code Llama: code completion, infilling, chat/instruct, and Python specialist. Click here for the deployment guide
  • [2023/09] TurboMind supports Baichuan2-7B
  • [2023/08] TurboMind supports FlashAttention-2.
  • [2023/08] TurboMind supports Qwen-7B, dynamic NTK-RoPE scaling and dynamic logN scaling
  • [2023/08] TurboMind supports Windows (tp=1)
  • [2023/08] TurboMind supports 4-bit inference, 2.4x faster than FP16, the fastest open-source implementation🚀. Check this guide for detailed info
  • [2023/08] LMDeploy has launched on the Hugging Face Hub, providing ready-to-use 4-bit models.
  • [2023/08] LMDeploy supports 4-bit quantization using the AWQ algorithm.
  • [2023/07] TurboMind supports Llama-2 70B with GQA.
  • [2023/07] TurboMind supports Llama-2 7B/13B.
  • [2023/07] TurboMind supports tensor-parallel inference of InternLM.

Introduction

LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:

  • Efficient Inference Engine (TurboMind): Built on FasterTransformer, TurboMind is an efficient inference engine that supports LLaMA and its variant models on NVIDIA GPUs.

  • Interactive Inference Mode: By caching the attention k/v during multi-turn dialogues, the engine remembers the dialogue history and avoids reprocessing historical sessions on every turn.

  • Multi-GPU Model Deployment and Quantization: We provide comprehensive model deployment and quantization support, validated on models of different scales.

  • Persistent Batch Inference: Further optimization of model execution efficiency.


Supported Models

LMDeploy has two inference backends: PyTorch and TurboMind.

TurboMind

Note
W4A16 inference requires an NVIDIA GPU with Ampere architecture or newer.

Models       | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8
Llama        | Yes             | Yes  | Yes     | Yes   | No
Llama2       | Yes             | Yes  | Yes     | Yes   | No
InternLM     | Yes             | Yes  | Yes     | Yes   | No
QWen-7B      | Yes             | Yes  | Yes     | No    | No
Baichuan-7B  | Yes             | Yes  | Yes     | Yes   | No
Baichuan2-7B | Yes             | Yes  | No      | No    | No
Code Llama   | Yes             | Yes  | No      | No    | No

PyTorch

Models   | Tensor Parallel | FP16 | KV INT8 | W4A16 | W8A8
Llama    | Yes             | Yes  | No      | No    | No
Llama2   | Yes             | Yes  | No      | No    | No
InternLM | Yes             | Yes  | No      | No    | No

Performance

Case I: output token throughput with fixed input token and output token number (1, 2048)

Case II: request throughput with real conversation data

Test Setting: LLaMA-7B, NVIDIA A100(80G)

The output token throughput of TurboMind exceeds 2000 tokens/s, which is about 5% to 15% higher than DeepSpeed overall and up to 2.3x higher than Hugging Face Transformers. The request throughput of TurboMind is 30% higher than that of vLLM.


Quick Start

Installation

Install lmdeploy with pip (Python 3.8+) or from source:

pip install lmdeploy
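
A minimal sketch of installing from source, assuming the upstream repository at https://github.com/InternLM/lmdeploy; note that the TurboMind C++/CUDA kernels require a separate build, so see the project's build docs for a full source build:

# editable install of the Python package from a source checkout
git clone https://github.com/InternLM/lmdeploy.git
cd lmdeploy
pip install -e .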

Deploy InternLM

Get InternLM model

# 1. Download InternLM model

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/internlm/internlm-chat-7b /path/to/internlm-chat-7b

# if you want to clone without large files – just their pointers
# prepend your git clone with the following env var:
GIT_LFS_SKIP_SMUDGE=1

# 2. Convert the InternLM model to TurboMind's format, written to "./workspace" by default
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b
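
Tensor parallelism can also be fixed at this conversion step (static TP); this is required for quantized models, which do not support runtime TP (see the warning under Quantization). A sketch assuming a 2-GPU setup:

# convert with static tensor parallelism across 2 GPUs
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b /path/to/internlm-chat-7b --tp 2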

Inference by TurboMind

python -m lmdeploy.turbomind.chat ./workspace

Note
When running inference at FP16 precision, the InternLM-7B model requires at least 15.7 GB of GPU memory on TurboMind.
NVIDIA cards such as the 3090, V100, or A100 are recommended. Disabling GPU ECC can free up about 10% of memory: run sudo nvidia-smi --ecc-config=0 and reboot the system.

Note
Tensor parallelism is available for inference on multiple GPUs. Add --tp=<num_gpu> to the chat command to enable runtime TP.
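
For example, running the same chat CLI across two GPUs:

# runtime tensor parallelism over 2 GPUs
python -m lmdeploy.turbomind.chat ./workspace --tp=2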

Serving with gradio

python3 -m lmdeploy.serve.gradio.app ./workspace

Serving with RESTful API

Launch inference server by:

python3 -m lmdeploy.serve.openai.api_server ./workspace server_ip server_port --instance_num 32 --tp 1

Then, you can communicate with it by command line,

# restful_api_url is the URL printed by api_server.py, e.g. http://localhost:23333
python -m lmdeploy.serve.openai.api_client restful_api_url

or webui,

# restful_api_url is the URL printed by api_server.py, e.g. http://localhost:23333
# server_ip and server_port here are for gradio ui
# example: python -m lmdeploy.serve.gradio.app http://localhost:23333 localhost 6006 --restful_api True
python -m lmdeploy.serve.gradio.app restful_api_url server_ip server_port --restful_api True

Refer to restful_api.md for more details.
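
You can also query the server with a raw HTTP request. The sketch below assumes the server exposes an OpenAI-compatible /v1/chat/completions route at the printed URL; the exact routes and request fields depend on your version, so check restful_api.md:

# assumed OpenAI-compatible chat endpoint; model name and fields may differ by version
curl -s http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "internlm-chat-7b",
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.8
      }'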

Serving with Triton Inference Server

Launch inference server by:

bash workspace/service_docker_up.sh

Then, you can communicate with the inference server by command line,

python3 -m lmdeploy.serve.client {server_ip_address}:33337

or webui,

python3 -m lmdeploy.serve.gradio.app {server_ip_address}:33337

For the deployment of other supported models, such as LLaMA, LLaMA-2, Vicuna, and so on, you can find the guide here

Inference with PyTorch

For detailed instructions on running inference with PyTorch models, see here.

Single GPU

python3 -m lmdeploy.pytorch.chat $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0

Tensor Parallel with DeepSpeed

deepspeed --module --num_gpus 2 lmdeploy.pytorch.chat \
    $NAME_OR_PATH_TO_HF_MODEL \
    --max_new_tokens 64 \
    --temperature 0.8 \
    --top_p 0.95 \
    --seed 0

You need to install deepspeed first to use this feature.

pip install deepspeed

Quantization

Weight INT4 Quantization

LMDeploy uses the AWQ algorithm for model weight quantization.

Click here to view the test results for weight int4 usage.
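
The end-to-end flow is roughly: quantize the HF weights with AWQ, then convert the result to TurboMind format. The sketch below is an outline only; the lmdeploy.lite module path and the exact flag names are assumptions, so treat the linked guide as authoritative:

# 1. quantize the HF weights to 4-bit with AWQ
#    (lmdeploy.lite.apis.auto_awq and its flags are assumed here; see the guide)
python3 -m lmdeploy.lite.apis.auto_awq \
    --model $NAME_OR_PATH_TO_HF_MODEL \
    --w_bits 4 \
    --w_group_size 128 \
    --work_dir ./internlm-chat-7b-w4

# 2. convert the quantized weights to TurboMind format
#    (--model-format awq and --group-size 128 are assumed to match step 1)
python3 -m lmdeploy.serve.turbomind.deploy internlm-chat-7b ./internlm-chat-7b-w4 \
    --model-format awq \
    --group-size 128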

KV Cache INT8 Quantization

Click here to view the usage method, implementation formula, and test results for kv int8.

Warning
Runtime tensor parallelism for a quantized model is not available. Please set --tp when running deploy to enable static TP.

Contributing

We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guideline.

Acknowledgement

License

This project is released under the Apache 2.0 license.

