vectorch-ai / scalellm Goto Github PK

View Code? Open in Web Editor NEW

356.0 16.0 24.0 17.54 MB

A high-performance inference system for large language models, designed for production environments.

Home Page: https://docs.vectorch.com/

License: Apache License 2.0

CMake 3.14% Shell 1.39% C++ 66.53% Cuda 15.63% Rust 0.86% Go 0.82% C 3.95% Dockerfile 0.01% Python 7.67% Jinja 0.01%

cuda inference llm llm-inference model production serving speculative transformer efficiency

scalellm's Introduction

ScaleLLM

An efficient LLM Inference solution

ScaleLLM is a cutting-edge inference system engineered for large language models (LLMs), designed to meet the demands of production environments. It extends its support to a wide range of popular open-source models, including Llama3.1, Gemma2, Bloom, GPT-NeoX, and more.

ScaleLLM is currently undergoing active development. We are fully committed to consistently enhancing its efficiency while also incorporating additional features. Feel free to explore our Roadmap for more details.

News:

[06/2024] - ScaleLLM is now available on PyPI. You can install it using pip install scalellm.
[03/2024] - Advanced features support for ✅ CUDA graph, ✅ prefix cache, ✅ chunked prefill and ✅ speculative decoding.
[11/2023] - First release with support for popular open-source models.

Key Features

High Efficiency: Excels in high-performance LLM inference, leveraging state-of-the-art techniques and technologies like Flash Attention, Paged Attention, Continuous batching, and more.
Tensor Parallelism: Utilizes tensor parallelism for efficient model execution.
OpenAI-compatible API: An OpenAI-compatible REST API server that supports both chat and completions.
Huggingface models: Seamless integration with most popular HF models, supporting safetensors.
Customizable: Offers flexibility for customization to meet your specific needs, and provides an easy way to add new models.
Production Ready: Engineered with production environments in mind, ScaleLLM is equipped with robust system monitoring and management features to ensure a seamless deployment experience.

Get Started
Advanced Features
Supported Models
Limitations
Contributing
Acknowledgements
License

Getting Started

ScaleLLM is available as a Python Wheel package on PyPI. You can install it using pip:

# Install scalellm with CUDA 12.1 and Pytorch 2.4.0
pip install -U scalellm

If you want to install ScaleLLM with different version of CUDA and Pytorch, you can pip install it with provding index URL of the version. For example, to install ScaleLLM with CUDA 12.1 and Pytorch 2.2.2, you can use the following command:

pip install -U scalellm -i https://whl.vectorch.com/cu121/torch2.2.2/

Build from source

If no wheel package is available for your configuration, you can build ScaleLLM from source code. You can clone the repository and install it locally using the following commands:

python setup.py bdist_wheel
pip install dist/scalellm-*.whl

OpenAI-Compatible Server

You can start the OpenAI-compatible REST API server with the following command:

python3 -m scalellm.serve.api_server --model=meta-llama/Meta-Llama-3.1-8B-Instruct

Chatbot UI

A local Chatbot UI is also available on localhost:3000. You can start it with latest image using the following command:

docker pull docker.io/vectorchai/chatbot-ui:latest
docker run -it --net=host \
  -e OPENAI_API_HOST=http://127.0.0.1:8080 \
  -e OPENAI_API_KEY=YOUR_API_KEY \
  docker.io/vectorchai/chatbot-ui:latest

Usage Examples

You can use ScaleLLM for offline batch inference, or online distributed inference. Below are some examples to help you get started. More examples can be found in the examples folder.

Chat Completions

Start rest api server with the following command:

python3 -m scalellm.serve.api_server --model=meta-llama/Meta-Llama-3.1-8B-Instruct

You can query the chat completions with curl:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'

or with openai python client:

import openai

client = openai.Client(
    base_url="http://localhost:8080/v1",
    api_key="EMPTY",
)

# List available models
models = client.models.list()
print("==== Available models ====")
for model in models.data:
    print(model.id)

# choose the first model
model = models.data[0].id

stream = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
    ],
    stream=True,
)

print(f"==== Model: {model} ====")
for chunk in stream:
    choice = chunk.choices[0]
    delta = choice.delta
    if delta.content:
        print(delta.content, end="")
print()

Completions

Start rest api server with the following command:

python3 -m scalellm.serve.api_server --model=meta-llama/Meta-Llama-3.1-8B

For regular completions, you can use this example:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B",
    "prompt": "hello",
    "max_tokens": 32,
    "temperature": 0.7,
    "stream": true
  }'

import openai

client = openai.Client(
    base_url="http://localhost:8080/v1",
    api_key="EMPTY",
)

# List available models
models = client.models.list()

print("==== Available models ====")
for model in models.data:
    print(model.id)

# choose the first model
model = models.data[0].id

stream = client.completions.create(
    model=model,
    prompt="hello",
    max_tokens=32,
    temperature=0.7,
    stream=True,
)

print(f"==== Model: {model} ====")
for chunk in stream:
    choice = chunk.choices[0]
    if choice.text:
        print(choice.text, end="")
print()

Advanced Features

CUDA Graph

CUDA Graph can improve performance by reducing the overhead of launching kernels. ScaleLLM supports CUDA Graph for decoding by default. In addition, It also allows user to specify which batch size to capture by setting the --cuda_graph_batch_sizes flag.

for example:

python3 -m scalellm.serve.api_server \
  --model=meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable_cuda_graph=true \
  --cuda_graph_batch_sizes=1,2,4,8

The limitations of CUDA Graph could cause problems during development and debugging. If you encounter any issues related to it, you can disable CUDA Graph by setting the --enable_cuda_graph=false flag.

Prefix Cache

The KV cache is a technique that caches the intermediate kv states to avoid redundant computation during LLM inference. Prefix cache extends this idea by allowing kv caches with the same prefix to be shared among different requests.

ScaleLLM supports Prefix Cache and enables it by default. You can disable it by setting the --enable_prefix_cache=false flag.

Chunked Prefill

Chunked Prefill splits a long user prompt into multiple chunks and populates the remaining slots with decodes. This technique can improve decoding throughput and enhance the user experience caused by long stalls. However it may slightly increase Time to First Token (TTFT). ScaleLLM supports Chunked Prefill, and its behavior can be controlled by setting the following flags:

--max_tokens_per_batch: The maximum tokens for each batch, default is 512.
--max_seqs_per_batch: The maximum sequences for each batch, default is 128.

Speculative Decoding

Speculative Decoding is a common used technique to speed up LLM inference without changing distribution. During inference, it employs an economical approximation to generate speculative tokens, subsequently validated by the target model. For now, ScaleLLM supports Speculative Decoding with a draft model to generate draft tokens, which can be enabled by configuring a draft model and setting the speculative steps.

for example:

python3 -m scalellm.serve.api_server \
  --model=google/gemma-7b-it \
  --draft_model=google/gemma-2b-it \
  --num_speculative_tokens=5 \
  --device=cuda:0 \
  --draft_device=cuda:0

Quantization

Quantization is a crucial process for reducing the memory footprint of models. ScaleLLM offers support for two quantization techniques: Accurate Post-Training Quantization (GPTQ) and Activation-aware Weight Quantization (AWQ), with seamless integration into the following libraries: autogptq and awq.

Supported Models

Models	Tensor Parallel	Quantization	Chat API	HF models examples
Aquila	Yes	Yes	Yes	BAAI/Aquila-7B, BAAI/AquilaChat-7B
Bloom	Yes	Yes	No	bigscience/bloom
Baichuan	Yes	Yes	Yes	baichuan-inc/Baichuan2-7B-Chat
ChatGLM4/3	Yes	Yes	Yes	THUDM/chatglm3-6b
Gemma2	Yes	Yes	Yes	google/gemma-2-2b
GPT_j	Yes	Yes	No	EleutherAI/gpt-j-6b
GPT_NeoX	Yes	Yes	No	EleutherAI/gpt-neox-20b
GPT2	Yes	Yes	No	gpt2
InternLM	Yes	Yes	Yes	internlm/internlm-7b
Llama3/2	Yes	Yes	Yes	meta-llama/Meta-Llama-3.1-8B-Instruct, meta-llama/Meta-Llama-3.1-8B
Mistral	Yes	Yes	Yes	mistralai/Mistral-7B-v0.1
MPT	Yes	Yes	Yes	mosaicml/mpt-30b
Phi2	Yes	Yes	No	microsoft/phi-2
Qwen2	Yes	Yes	Yes	Qwen/Qwen-72B-Chat
Yi	Yes	Yes	Yes	01-ai/Yi-6B, 01-ai/Yi-34B-Chat-4bits, 01-ai/Yi-6B-200K

If your model is not included in the supported list, we are more than willing to assist you. Please feel free to create a request for adding a new model on GitHub Issues.

Limitations

There are several known limitations we are looking to address in the coming months, including:

Only supports GPUs that newer than Turing architecture.

Contributing

If you have any questions or want to contribute, please don't hesitate to ask in our "Discussions" forum or join our "Discord" chat room. We welcome your input and contributions to make ScaleLLM even better. Please follow the Contributing.md to get started.

Acknowledgements

The following open-source projects have been used in this project, either in their original form or modified to meet our needs:

License

This project is released under the Apache 2.0 license.

scalellm's People

Contributors

Stargazers

Watchers

scalellm's Issues

pickle model load throw exception for nomic-ai/gpt4all-j

model link: https://huggingface.co/nomic-ai/gpt4all-j/tree/main
sounds the pickle file also includes model architecture. need to add logs to handle this.

does the current openai-copatible-API support function calls?

like format

    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant that helps decide whether the customer comment violates the community guidelines. Reviews should not be offensive, abusive, or contain any personal information.",
            },
            {
                "role": "user",
                "content": "customer comment: " + content,
            }
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "respond_to_comment",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "is_safe": {
                                "type": "boolean",
                            },
                        },
                        "required": ["is_safe"],
                    },
                },
            }
        ],
        max_tokens=20,
    )

Implementing fused FFN (Feed-Forward Network) to enhance efficiency

terminate called after throwing an instance of 'c10::Error'

I created a k8s workload based on the docker-compose but I get the below error,

https://markuphero.com/share/vAFiVkCW4S4OnFyUPj1P

add padding to weights to make tensor parallel work

for example: error from gpt-large with tensor parallel = 2

[F linear_impl.cpp:76] Check failed: out_features % world_size == 0 out_features 50257 not divisible by world_size 2

add stop_token_ids support for chat models

for example: BAAI/AquilaChat-7B

Use local model instead of HF_MODEL_ID

Limit GPU Memory

Is it possible to limit GPU Memory used by ScaleLLM?

GPU Arch: Turing architecture (sm75)

Support OpenChat 3.5?

From discussion in issue #28, the chat template is embedded in code, so I think currently not supported?
Can support this model?

From official openchat/openchat_3.5:
GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant:

From TheBloke/openchat_3.5-AWQ:
GPT4 User: {prompt}<|end_of_turn|>GPT4 Assistant:

I’m not sure why it is different in the quantized model.

Structural Decoding: Json format

How to build locally?

I saw your promption here https://huggingface.co/TheBloke/Yi-34B-GPTQ/discussions/2#654b063883e7bfc43160ca0e

Maybe a stupid question, just wonder if this building is suffice for local deploy w/o docker (since my remote instance is already a docker instance, and I cannot create docker inside docker due to admin privilege not being granted), assumed all dep are resolved.

# build
RUN cmake -G Ninja -S . -B build
RUN cmake --build build --target all --config Release -j$(nproc)

# install
RUN cmake --install build --prefix /app
RUN pip3 install -r ./requirements.txt

cuda graph capture may occasionally become stuck with multiple gpus.

It is a known issue that CUDA graph capture may occasionally become stuck when multiple workers are in use. further investigation is needed.

Disabled cuda graph for multiple gpus by default for now.

set up ci workflow

1> build triggered by pull request
2> docker image publish
3> python package publish

merge two rust projects into one to avoid multiple definition build error for Release

[build] /rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/personality/gcc.rs:251: multiple definition of `rust_eh_personality'; src/model_loader/safetensors/x86_64-unknown-linux-gnu/release/libsafetensors.a(safetensors-e174e679aecb9d8a.safetensors.b257752cd506e4e1-cgu.7.rcgu.o):/rustc/eb26296b556cef10fb713a38f3d16b9886080f26/library/std/src/personality/gcc.rs:251: first defined here [build] /usr/bin/ld: src/tokenizer/tokenizers/x86_64-unknown-linux-gnu/release/libtokenizers.a(std-7b9f6349d87c69a1.std.edecf719d62c0e6b-cgu.0.rcgu.o):(.init_array.00099+0x0): multiple definition of `std::sys::unix::args::imp::ARGV_INIT_ARRAY'; src/model_loader/safetensors/x86_64-unknown-linux-gnu/release/libsafetensors.a(safetensors-e174e679aecb9d8a.safetensors.b257752cd506e4e1-cgu.7.rcgu.o):(.init_array.00099+0x0): first defined here

The output from the API lacks "usage" content, which is causing compatibility issues when trying to use the API with other tools.

（with Yi model）

grpc server connection error

I deploy Yi Model use the follow command

docker run -it --gpus=all --net=host --shm-size=1g \
  -v $HOME/.cache/huggingface/hub:/models \
  -e HF_MODEL_ID=01-ai/Yi-34B-Chat-8bits \
  -e DEVICE=auto \
  docker.io/vectorchai/scalellm:latest --logtostderr --model_type=Yi

And then, I try use restful API use this command

docker run -it --net=host \
  docker.io/vectorchai/scalellm-gateway:latest --logtostderr

But when I use requests command, I get grpc error the errMess is

connection error: desc="transport: Error while dialing: dial tcp 127.0.0.1:8888: connect: connection refused"

I test curl localhost:9999/health is ok

Exploring lookahead decoding support

After adding speculative decoding support for draft models, we are ready to explore other proposal mechanisms, like ngram Jacobi (Lookahead), Medusa/EAGLE, etc.

The NVIDIA driver on your system is too old (found version 11080).

NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8

export MODEL_PATH=Yi-34B-Chat-4bits
01ai/Yi-34B-Chat-4bits  $ export MODEL_ID=01-ai/Yi-34B-Chat-4bits
01ai/Yi-34B-Chat-4bits  $ docker run -it --gpus=all --net=host --shm-size=1g \

-v $MODEL_PATH:$MODEL_PATH
-e DEVICE=cuda:1
-e NCCL_DEBUG=INFO
docker.io/vectorchai/scalellm:latest --logtostderr --model_path=$MODEL_PATH --model_id=$MODEL_ID --model_type=Yi
I20231129 08:13:34.992501 7 main.cpp:135] Using devices: cuda:1
W20231129 08:13:34.993809 7 args_overrider.cpp:132] Overwriting model_type from llama to Yi
I20231129 08:13:34.993916 7 engine.cpp:91] Initializing model from: /data4/candowu/modelscope/01ai/Yi-34B-Chat-4bits
W20231129 08:13:34.993944 7 model_loader.cpp:162] Failed to find tokenizer.json, use tokenizer.model instead. Please consider using fast tokenizer for better performance.
I20231129 08:13:35.245934 7 engine.cpp:98] Initializing model with dtype: Half
I20231129 08:13:35.245993 7 engine.cpp:107] Initializing model with ModelArgs: [model_type: Yi, dtype: float16, hidden_size: 7168, hidden_act: silu, intermediate_size: 20480, n_layers: 60, n_heads: 56, n_kv_heads: 8, vocab_size: 64000, rms_norm_eps: 1e-05, layer_norm_eps: 0, rotary_dim: 0, rope_theta: 5e+06, rope_scaling: 1, rotary_pct: 1, max_position_embeddings: 4096, bos_token_id: 1, eos_token_id: 2, use_parallel_residual: 0, attn_qkv_clip: 0, attn_qk_ln: 0, attn_alibi: 0, alibi_bias_max: 0, no_bias: 0, residual_post_layernorm: 0], QuantArgs: [quant_method: awq, bits: 4, group_size: 128, desc_act: 0, true_sequential: 0]
terminate called after throwing an instance of 'c10::Error'
what(): The NVIDIA driver on your system is too old (found version 11080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.
Exception raised from device_count_impl at ../c10/cuda/CUDAFunctions.cpp:53 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x6b (0x7f2c0dc6e38b in /app/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xbf (0x7f2c0dc68f3f in /app/lib/libc10.so)
frame #2: c10::cuda::device_count_ensure_non_zero() + 0x18c (0x7f2c0e0535dc in /app/lib/libc10_cuda.so)

Tensor Cores did not get activated when input tensor is not multiple of 8

tensor input need to be multiple of 8 in order to activate Tensor Cores.

[baichuan2-7b] random core dump in offline batched inference.

Bug Description

when use baichuan2-7b model with batchsize=64, random core dump happens.
not reproduced by other batchsize, such as: 1, 8, 16, 32, 128...

Reproduce

benchmark_test --model_name_or_path /data/baichuan2-7b --input_file /data/dataset/Chatbot_group_10_1000.json --batch_size 64

get rid of assert for group_size == -1

group_size could be -1 for gptq. the value can be configed as in_features instead.

How to implement response stream in custom UI?

I have been testing this project and I am seriously impressed with it.

It works great out of the box using the built-in chatbot-ui, and the rest API is also working as intended, however I cannot find any documentation on how to implement a custom frontend with streaming chat response like we see in the chatbot-ui

I believe vectorchai/scalellm-gateway is running a gRPC server written in golang, but I am unsure how to initiate a request and handle the response stream and output it on a custom frontend.

Driver Version: 535.54.03 CUDA Version: 12.2 ，运行报错“OpenAI API returned an error 503: {"error":{"code":14,"message":"connection error: desc = \"transport: Error while dialing: dial tcp: lookup scalellm on 127.0.0.11:53: server misbehaving\""}}”

我的驱动版本是535.54.03，cuda版本是12.2，运行后出现报错“OpenAI API returned an error 503: {"error":{"code":14,"message":"connection error: desc = "transport: Error while dialing: dial tcp: lookup scalellm on 127.0.0.11:53: server misbehaving""}}”
请问出现此报错的原因是什么？谢谢~

Implementing MoE (Mixture of Experts) kernels

ScaleLLM Roadmap

We're excited to present the features we're currently working on and planning to support in this roadmap document. Your feedback is highly valued, so please don't hesitate to comment or reach out if you have anything you'd like to add or discuss. We're committed to delivering the best possible experience with ScaleLLM.

Q1-Q2 2024

Efficiency

Adding flash decoding with paged KV cache support [Done]
Introducing attention kernel capable of supporting speculative decoding [Ongoing]
- Exploring the feasibility of adopting the flashinfer library [Ongoing]
Implementing speculative decoding [Done]
Enabling CUDA graph for decoding to improve performance [Done]
Implementing dynamic split-fuse for enhanced latency [Done]
Exploring lookahead decoding support
Implementing fused FFN (Feed-Forward Network) to enhance efficiency
Introducing a ring attention mechanism for handling long contexts

Cache

Implementing stateful conversation to avoid recomputing for chat sessions [Done]

New Models

Integrating Google Gemma [Done]
Integrating Llama3 [Done]
Incorporating the Mixtral MoE model [Ongoing]
- Implementing MoE (Mixture of Experts) kernels
Introducing the Mamba model
Introducing multi-modal models [Ongoing]
- LLaVA model
LoRA & QLoRA
- S-LoRA: Serving thousands of LoRA adapters

New Devices

Adding support for Apple chips
Exploring other chips such as TPU, etc.

Usability

Developing Python wrapper for easier integration [Done]
Enhancing documentation for improved usability [Ongoing]

New GPU Architecture

Turing architecture (sm75)

Structural Decoding

Function Calling

Quantization

Supporting FP8 for both models and KV caches

Supported Operating Systems

Extending support to macOS and Windows platforms

Misc

Conducting benchmarking to compare performance with other open-source projects [Ongoing]
Adding more bechmarks and unittests for kernels and dependencies [Ongoing]
Adding more Prometheus metrics and creating a Grafana dashboard for monitoring.
Loosening coupling with PyTorch for easy deployment

support models using Alibi

In order to support Alibi we need to calculate alibi bias from alibi slopes then pass it into attention as bias/mask.
since we are using flash-attention which doesn't support atten mask yet,(Dao-AILab/flash-attention#332)
we may have to implement the feature ourself.

Exploring the feasibility of adopting the flashinfer library

Adding support for Apple chips

fast tokenizer support needed for qiwen and baichuan model

scaleLLM is only able to support models with fast tokenizer since it is using tokenizers rust package instead of python package. for models without fast tokenizer, we need a tool to convert legacy tokenizer vocab file to fast tokenizer json file.

Structural Decoding: Function Calling

garbage text generated by gpt-neox with tensor parallel enabled.

need to dig more for root cause.

How to use a model that has been downloaded locally , Instead of downloading from huggingface when running container?

make 'prepend bos token' confinable for sentencepiece tokenizer

for models without fast tokenizer, it may fall back to using sentencepiece tokenizer.
always 'prepend bos token' is not desired behavior. for example: 01-ai/Yi-34B-200K

Extending support to macOS and Windows platforms

use dtype from config for gpus

different dtype would lead to different precision. use the one from config file.

scalellm exited with code 137

  scalellm:
    image: vectorchai/scalellm:latest
    hostname: scalellm
    container_name: scalellm
    ports:
      - 8888:8888
      - 9999:9999
    environment:
      - DEVICE=cpu
    volumes:
      - /Users/xxxx/yi/Yi-6B-Chat:/models/Yi-6B-Chat
    shm_size: 2g
    command: --model_path=/models/Yi-6B-Chat --model_id=01-ai/Yi-6B-Chat --model_type=Yi

the below scalellm docker runs with error :scalellm exited with code 137

scalellm          | ./entrypoint.sh: line 28:     7 Killed                  LD_LIBRARY_PATH=/app/lib:$LD_LIBRARY_PATH /app/bin/scalellm $ARGS "$@"
scalellm exited with code 137

add more options
automatically install dependencies

vectorch-ai / scalellm Goto Github PK

scalellm's Introduction

ScaleLLM

An efficient LLM Inference solution

News:

Key Features

Table of contents

Getting Started

Build from source

OpenAI-Compatible Server

Chatbot UI

Usage Examples

Chat Completions

Completions

Advanced Features

CUDA Graph

Prefix Cache

Chunked Prefill

Speculative Decoding

Quantization

Supported Models

Limitations

Contributing

Acknowledgements

License

scalellm's People

Contributors

Stargazers

Watchers

Forkers

scalellm's Issues

Bug Description

Reproduce

Q1-Q2 2024

Efficiency

Cache

New Models

New Devices

Usability

New GPU Architecture

Structural Decoding

Quantization

Supported Operating Systems

Misc

Recommend Projects

Recommend Topics

Recommend Org