vllm-project / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Home Page: https://docs.vllm.ai
License: Apache License 2.0
I failed to build the system with the latest NVIDIA PyTorch Docker image. The reason is that the PyTorch installed by pip is built with CUDA 11.7, while the container uses CUDA 12.1.
RuntimeError:
The detected CUDA version (12.1) mismatches the version that was used to compile
PyTorch (11.7). Please make sure to use the same CUDA versions.
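For reference, one quick way to confirm this kind of mismatch before building is to compare the CUDA version PyTorch was compiled with against the toolkit that nvcc reports inside the container. This is a diagnostic sketch, not part of the build system:

```python
# Diagnostic sketch: compare PyTorch's compile-time CUDA version with the
# CUDA toolkit version reported by nvcc.
import subprocess
import torch

print(f"PyTorch was built with CUDA {torch.version.cuda}")

nvcc_output = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
release_line = next(line for line in nvcc_output.splitlines() if "release" in line)
print(release_line)  # e.g. "Cuda compilation tools, release 12.1, ..."
```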
We need to provide clean abstractions and interfaces so that users can easily plug in their custom models.
Currently, the scheduler includes code used for experimental purposes (e.g., collecting various system stats). This code should be removed or minimized.
We should provide a clean abstraction and interface so that users can easily plug in their custom tokenizers.
The current model mapper is hacky; it uses string matching based on the model name or path. Let's use an HF-style model mapper that reads the architecture specified in the model config and lazily loads only the target model.
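A minimal sketch of what such a mapper could look like. The module paths and registry contents are hypothetical, not the project's actual layout; the point is keying on config.architectures and importing lazily:

```python
# Hypothetical architecture-based registry with lazy imports.
import importlib
from transformers import AutoConfig

# architecture name (from config.architectures) -> (module path, class name)
_MODEL_REGISTRY = {
    "OPTForCausalLM": ("models.opt", "OPTForCausalLM"),
    "LlamaForCausalLM": ("models.llama", "LlamaForCausalLM"),
}

def get_model_class(model_name_or_path: str):
    config = AutoConfig.from_pretrained(model_name_or_path)
    for arch in getattr(config, "architectures", None) or []:
        if arch in _MODEL_REGISTRY:
            module_path, class_name = _MODEL_REGISTRY[arch]
            module = importlib.import_module(module_path)  # lazy import of the target model only
            return getattr(module, class_name)
    raise ValueError(f"Unsupported architectures: {config.architectures}")
```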
Currently, pip-installing our package takes 5-10 minutes because our CUDA kernels are compiled on the user's machine. For better UX, we should include pre-built CUDA binaries in our PyPI distribution, just like PyTorch and xformers.
GPT-2 is a representative Transformer-based generative model and is still the most downloaded model on Hugging Face. It'd be nice to support it.
As mentioned in #81 (comment), the current PyTorch-based top-k and top-p implementation is memory-inefficient. This can be improved by introducing custom kernels.
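For context, a typical PyTorch-level top-p (nucleus) filter looks roughly like the sketch below. It materializes a sorted copy of the probabilities plus a cumulative-sum tensor of the same shape, which is what makes it memory-hungry for large vocabularies; a fused custom kernel could avoid these intermediates. This is illustrative, not the implementation referenced in #81:

```python
import torch

def top_p_filter(logits: torch.Tensor, top_p: float) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)
    # Sorting and cumsum allocate full vocab-sized intermediates per sequence.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cumprobs = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens once the cumulative probability before them exceeds top_p
    # (the top-1 token is always kept).
    mask = cumprobs - sorted_probs > top_p
    sorted_probs[mask] = 0.0
    # Scatter the filtered probabilities back to the original token order.
    filtered = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)
    return filtered / filtered.sum(dim=-1, keepdim=True)
```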
Rerun the experiment comparing 0-cost swapping and recomputation. Recomputation should not be faster in any case; if recomputation is consistently faster, we should investigate why.
BLOOM is an open-source LLM developed by BigScience. The BLOOM models rank highly in Hugging Face downloads. It'd be great to have these models in our catalog.
When working with a single GPU, Ray is not useful. Therefore, it would be beneficial to have an option to disable Ray in such scenarios.
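A sketch of the proposed option; the function and flag names are hypothetical, but the idea is simply to skip Ray setup unless more than one worker is needed:

```python
def initialize_cluster(tensor_parallel_size: int, disable_ray: bool = False) -> bool:
    """Initialize Ray only when it is actually needed (hypothetical helper)."""
    use_ray = tensor_parallel_size > 1 and not disable_ray
    if use_ray:
        import ray
        ray.init(ignore_reinit_error=True)
    return use_ray
```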
Expected gain: For 13B models, we should see a 20%-30% latency gain on a single GPU and 2-3x on 4 GPUs. For smaller models, the gain should be even higher.
Having a single iteration's computation run completely in C++ should be enough for high performance. This way, we can keep most of the complicated scheduling logic in Python, including weight loading.
Potential sources of overhead:
How to implement a C++ version:
I noticed that we use conditions like this to check whether it is greedy sampling:
https://github.com/WoosukKwon/cacheflow/blob/189ae231336857bcc4c6f6157bf7868cdf56fb5f/cacheflow/sampling_params.py#L45
However, I guess using == for floating-point numbers will result in several problems. I typically use something like this:
https://github.com/lm-sys/FastChat/blob/a94fd259a97128f7f4483ddb760690f467888d84/fastchat/serve/inference.py#L227
@WoosukKwon, @zhuohan123 What do you think? If you are happy, I can change all "==" to "<=".
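For illustration, here is the kind of edge case an exact comparison can hit. The values and the 1e-5 threshold are made up; the actual field being checked in sampling_params.py is a temperature-like float:

```python
import math

temperature = 1e-7  # effectively greedy, but not exactly zero

print(temperature == 0.0)                            # False: exact "==" check misses it
print(temperature <= 1e-5)                           # True:  the proposed "<=" style check
print(math.isclose(temperature, 0.0, abs_tol=1e-5))  # True:  an explicit-tolerance alternative
```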
The current default swap space size (20 GiB per GPU) is a bit too large. It can lead to OOM, especially on machines with multiple GPUs.
Yes, it does. It is our attention kernel that does not support FP32. More precisely, our attention kernel currently does not support some block sizes when FP32 is used. I will fix this in the future.
Originally posted by @WoosukKwon in #70 (comment)
It seems there's a critical bug introduced by #53. Running the single_query_cached_kv_attention kernel with certain configurations leads to CUDA illegal memory access errors. I found the bug in the unit tests.
The two models are widely used. Since we already support LLaMA, it should not be difficult to support them.
We are currently using the -O2 flag in compiling our CUDA kernels. We need to investigate whether/how changing it to -O3 affects the system performance and compilation time.
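For reference, a hedged sketch of where such a flag typically lives in a CUDAExtension-based setup.py. The extension name and source paths are illustrative, not the project's actual build script; swapping "-O2" for "-O3" here is the change to benchmark:

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="cacheflow-kernels",
    ext_modules=[
        CUDAExtension(
            name="attention_ops",
            sources=["csrc/attention.cu"],
            extra_compile_args={
                "cxx": ["-O3"],
                "nvcc": ["-O3"],  # previously "-O2"; compare perf and build time
            },
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)
```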
Build & run on more recent cloud GPUs such as A10G and L4.
Will update the profiling results in this PR.
OPT-13B
TP 1: 3.5404738585154214 seconds
TP 2: 4.742188215255737 seconds
TP 4: 4.907034238179524 seconds
OPT-30B
TP 1: OOM
TP 2: 5.9848620891571045 seconds
TP 4: 5.943212985992432 seconds
Just use Hugging Face's weights. Don't make another copy!
After #114, the server decodes the running sequences every step. This leads to significant overhead, especially when a slow tokenizer is used (e.g., for LLaMA).
# opt-13b inference latency (bs 8, input 32, output 128)
Avg latency: 3.57 seconds
Tokenizer (fast): 0.14 seconds
# llama-13b inference latency (bs 8, input 32, output 128)
Avg latency: 5.28 seconds
Tokenizer (slow): 1.97 seconds
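Given that the slow tokenizer accounts for ~2 seconds of the LLaMA latency above, one direction is to decode incrementally instead of re-decoding the whole running sequence every step. A rough sketch of the idea; real tokenizers (especially SentencePiece-based ones) can merge tokens across the boundary, so a production version needs more care:

```python
class IncrementalDecoder:
    """Decode only the tokens generated since the last call (sketch only)."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.text = ""
        self.num_decoded = 0

    def update(self, token_ids):
        # Decode only the newly generated suffix instead of the full sequence.
        new_ids = token_ids[self.num_decoded:]
        self.text += self.tokenizer.decode(new_ids, skip_special_tokens=True)
        self.num_decoded = len(token_ids)
        return self.text
```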
We need a logger class that can print the system status, warnings, and debugging information.
Parameters such as repetition_penalty and top_k are often used for sampling. It'd be nice to support them using the Hugging Face logits processors.
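A minimal sketch of what reusing the existing Hugging Face logits processors could look like; the tensors here are dummies and the integration point in the engine is left out:

```python
import torch
from transformers import (
    LogitsProcessorList,
    RepetitionPenaltyLogitsProcessor,
    TopKLogitsWarper,
)

processors = LogitsProcessorList([
    RepetitionPenaltyLogitsProcessor(penalty=1.2),
    TopKLogitsWarper(top_k=50),
])

input_ids = torch.tensor([[1, 5, 7, 5]])         # previously generated token ids
logits = torch.randn(1, 32000)                   # next-token logits for one sequence
filtered_logits = processors(input_ids, logits)  # apply the penalty, then top-k
```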
Currently we call torch.distributed.init_process_group even for a single GPU. This is redundant and causes errors when the LLM object is created multiple times.
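One possible guard, as a sketch rather than the actual fix: skip distributed setup entirely on a single GPU, and never initialize the process group twice. The function name and init_method default are hypothetical:

```python
import torch.distributed as dist

def maybe_init_distributed(world_size: int, rank: int,
                           init_method: str = "tcp://localhost:29500") -> None:
    if world_size == 1:
        return  # torch.distributed is unnecessary for a single GPU
    if dist.is_initialized():
        return  # avoid re-initialization when the LLM object is created again
    dist.init_process_group(backend="nccl", world_size=world_size,
                            rank=rank, init_method=init_method)
```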
In my environment, using the LLaMA fast tokenizer raises an error about protobuf:
File "/opt/conda/envs/dev/lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py", line 445, in __init__
from .utils import sentencepiece_model_pb2 as model_pb2
File "/opt/conda/envs/dev/lib/python3.9/site-packages/transformers/utils/sentencepiece_model_pb2.py", line 91, in <module>
_descriptor.EnumValueDescriptor(
File "/opt/conda/envs/dev/lib/python3.9/site-packages/google/protobuf/descriptor.py", line 796, in __new__
_message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
While downgrading the protobuf version removed the error, it slowed down the initialization time by ~8x.
With protobuf downgraded to 3.20.x:
real 4m18.476s
user 3m52.706s
sys 0m27.644s
With the original protobuf version:
real 0m27.620s
user 0m8.011s
sys 0m19.237s
Add the complete list of dependencies to setup.py so that the package can be pip-installed at once.
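A sketch of what setup.py could declare; the package list below is illustrative, not the project's actual dependency set:

```python
from setuptools import setup, find_packages

setup(
    name="cacheflow",
    packages=find_packages(),
    # Runtime dependencies pulled in automatically by `pip install`.
    install_requires=[
        "torch",
        "transformers",
        "ray",
        "sentencepiece",
        "ninja",
    ],
)
```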
We need tests for the models we support. The tests should ensure that the outputs of our models when using greedy sampling are equivalent to those of HF models.
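A hedged sketch of what such a test could look like: generate greedily with the HF reference model and compare token ids against our engine's output. The engine-side call below is a placeholder, and the real harness and model list may differ:

```python
import pytest
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def our_greedy_generate(model_name: str, prompt: str, max_new_tokens: int) -> list[int]:
    """Placeholder for our engine's greedy (temperature=0) generation call."""
    raise NotImplementedError


@pytest.mark.parametrize("model_name", ["facebook/opt-125m"])
def test_greedy_matches_hf(model_name):
    prompt = "Hello, my name is"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    hf_model = AutoModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hf_out = hf_model.generate(**inputs, do_sample=False, max_new_tokens=16)
    hf_new_ids = hf_out[0, inputs["input_ids"].shape[1]:].tolist()

    our_ids = our_greedy_generate(model_name, prompt, max_new_tokens=16)
    assert our_ids == hf_new_ids
```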