vllm-project / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Home Page: https://docs.vllm.ai
License: Apache License 2.0
I failed to build the system with the latest NVIDIA PyTorch Docker image. The reason is that the PyTorch installed by pip is built with CUDA 11.7, while the container uses CUDA 12.1.
RuntimeError:
The detected CUDA version (12.1) mismatches the version that was used to compile
PyTorch (11.7). Please make sure to use the same CUDA versions.
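For reference, one quick way to confirm this kind of mismatch before building is to compare the CUDA version PyTorch was compiled with against the toolkit that nvcc reports inside the container. This is a diagnostic sketch, not part of the build system:

```python
# Diagnostic sketch: compare PyTorch's compile-time CUDA version with the
# CUDA toolkit version reported by nvcc.
import subprocess
import torch

print(f"PyTorch was built with CUDA {torch.version.cuda}")

nvcc_output = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
release_line = next(line for line in nvcc_output.splitlines() if "release" in line)
print(release_line)  # e.g. "Cuda compilation tools, release 12.1, ..."
```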
We need to provide clean abstractions and interfaces so that users can easily plug in their custom models.
Currently, the scheduler includes code used for experimental purposes (e.g., collecting various system stats). This code should be removed or minimized.
We should provide a clean abstraction and interface so that users can easily plug in their custom tokenizers.
The current model mapper is hacky; it uses string matching based on the model name or path. Let's use an HF-style model mapper that reads the architecture specified in the model config and lazily loads only the target model.
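A minimal sketch of what such a mapper could look like. The module paths and registry contents are hypothetical, not the project's actual layout; the point is keying on config.architectures and importing lazily:

```python
# Hypothetical architecture-based registry with lazy imports.
import importlib
from transformers import AutoConfig

# architecture name (from config.architectures) -> (module path, class name)
_MODEL_REGISTRY = {
    "OPTForCausalLM": ("models.opt", "OPTForCausalLM"),
    "LlamaForCausalLM": ("models.llama", "LlamaForCausalLM"),
}

def get_model_class(model_name_or_path: str):
    config = AutoConfig.from_pretrained(model_name_or_path)
    for arch in getattr(config, "architectures", None) or []:
        if arch in _MODEL_REGISTRY:
            module_path, class_name = _MODEL_REGISTRY[arch]
            module = importlib.import_module(module_path)  # lazy import of the target model only
            return getattr(module, class_name)
    raise ValueError(f"Unsupported architectures: {config.architectures}")
```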
Currently, pip-installing our package takes 5-10 minutes because our CUDA kernels are compiled on the user's machine. For better UX, we should include pre-built CUDA binaries in our PyPI distribution, just like PyTorch and xformers.
GPT-2 is a representative Transformer-based generative model and is still the most downloaded model on Hugging Face. It'd be nice to support it.
As mentioned in #81 (comment), the current PyTorch-based top-k and top-p implementation is memory-inefficient. This can be improved by introducing custom kernels.
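For context, a typical PyTorch-level top-p (nucleus) filter looks roughly like the sketch below. It materializes a sorted copy of the probabilities plus a cumulative-sum tensor of the same shape, which is what makes it memory-hungry for large vocabularies; a fused custom kernel could avoid these intermediates. This is illustrative, not the implementation referenced in #81:

```python
import torch

def top_p_filter(logits: torch.Tensor, top_p: float) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)
    # Sorting and cumsum allocate full vocab-sized intermediates per sequence.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cumprobs = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens once the cumulative probability before them exceeds top_p
    # (the top-1 token is always kept).
    mask = cumprobs - sorted_probs > top_p
    sorted_probs[mask] = 0.0
    # Scatter the filtered probabilities back to the original token order.
    filtered = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)
    return filtered / filtered.sum(dim=-1, keepdim=True)
```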
Rerun the experiment comparing 0-cost swapping and recomputation. Recomputation should not be faster in any case; if recomputation is consistently faster, we should investigate why.
BLOOM is an open-source LLM developed by BigScience. The BLOOM models rank highly in Hugging Face downloads. It'd be great to have these models in our catalog.
When working with a single GPU, Ray is not useful. Therefore, it would be beneficial to have an option to disable Ray in such scenarios.
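A sketch of the proposed option; the function and flag names are hypothetical, but the idea is simply to skip Ray setup unless more than one worker is needed:

```python
def initialize_cluster(tensor_parallel_size: int, disable_ray: bool = False) -> bool:
    """Initialize Ray only when it is actually needed (hypothetical helper)."""
    use_ray = tensor_parallel_size > 1 and not disable_ray
    if use_ray:
        import ray
        ray.init(ignore_reinit_error=True)
    return use_ray
```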
Expected gain: For 13B models, we should see a 20%-30% latency gain on a single GPU and 2-3x on 4 GPUs. For smaller models, the gain should be even higher.
Having a single iteration's computation run completely in C++ should be enough for high performance. This way, we can keep most of the complicated scheduling logic in Python, including weight loading.
Potential sources of overhead:
How to implement a C++ version:
I noticed that we use conditions like this to check whether it is greedy sampling:
https://github.com/WoosukKwon/cacheflow/blob/189ae231336857bcc4c6f6157bf7868cdf56fb5f/cacheflow/sampling_params.py#L45
However, I guess using == for floating-point numbers will result in several problems. I typically use something like this:
https://github.com/lm-sys/FastChat/blob/a94fd259a97128f7f4483ddb760690f467888d84/fastchat/serve/inference.py#L227
@WoosukKwon, @zhuohan123 What do you think? If you are happy, I can change all "==" to "<=".
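For illustration, here is the kind of edge case an exact comparison can hit. The values and the 1e-5 threshold are made up; the actual field being checked in sampling_params.py is a temperature-like float:

```python
import math

temperature = 1e-7  # effectively greedy, but not exactly zero

print(temperature == 0.0)                            # False: exact "==" check misses it
print(temperature <= 1e-5)                           # True:  the proposed "<=" style check
print(math.isclose(temperature, 0.0, abs_tol=1e-5))  # True:  an explicit-tolerance alternative
```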
The current default swap space size (20 GiB per GPU) is a bit too large. It can lead to OOM, especially on machines with multiple GPUs.
Yes, it does. It is our attention kernel that does not support FP32. More precisely, our attention kernel currently does not support some block sizes when FP32 is used. I will fix this in the future.
Originally posted by @WoosukKwon in #70 (comment)
It seems there's a critical bug introduced by #53. Running the single_query_cached_kv_attention kernel with certain configurations leads to CUDA illegal memory access errors. I found the bug in the unit tests.
The two models are widely used. Since we already support LLaMA, it should not be difficult to support them.
We are currently using the -O2 flag in compiling our CUDA kernels. We need to investigate whether/how changing it to -O3 affects the system performance and compilation time.
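For reference, a hedged sketch of where such a flag typically lives in a CUDAExtension-based setup.py. The extension name and source paths are illustrative, not the project's actual build script; swapping "-O2" for "-O3" here is the change to benchmark:

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="cacheflow-kernels",
    ext_modules=[
        CUDAExtension(
            name="attention_ops",
            sources=["csrc/attention.cu"],
            extra_compile_args={
                "cxx": ["-O3"],
                "nvcc": ["-O3"],  # previously "-O2"; compare perf and build time
            },
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)
```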
Build & run on more recent cloud GPUs such as A10G and L4.
Will update the profiling results in this PR.
OPT-13B
TP 1: 3.5404738585154214 seconds
TP 2: 4.742188215255737 seconds
TP 4: 4.907034238179524 seconds
OPT-30B
TP 1: OOM
TP 2: 5.9848620891571045 seconds
TP 4: 5.943212985992432 seconds
Just use Hugging Face's weights. Don't make another copy!
After #114, the server decodes the running sequences every step. This leads to significant overhead, especially when a slow tokenizer is used (e.g., for LLaMA).
# opt-13b inference latency (bs 8, input 32, output 128)
Avg latency: 3.57 seconds
Tokenizer (fast): 0.14 seconds
# llama-13b inference latency (bs 8, input 32, output 128)
Avg latency: 5.28 seconds
Tokenizer (slow): 1.97 seconds
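Given that the slow tokenizer accounts for ~2 seconds of the LLaMA latency above, one direction is to decode incrementally instead of re-decoding the whole running sequence every step. A rough sketch of the idea; real tokenizers (especially SentencePiece-based ones) can merge tokens across the boundary, so a production version needs more care:

```python
class IncrementalDecoder:
    """Decode only the tokens generated since the last call (sketch only)."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.text = ""
        self.num_decoded = 0

    def update(self, token_ids):
        # Decode only the newly generated suffix instead of the full sequence.
        new_ids = token_ids[self.num_decoded:]
        self.text += self.tokenizer.decode(new_ids, skip_special_tokens=True)
        self.num_decoded = len(token_ids)
        return self.text
```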
We need a logger class that can print the system status, warnings, and debugging information.
Parameters such as repetition_penalty and top_k are often used for sampling. It'd be nice to support them using the Hugging Face logits processors.
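A minimal sketch of what reusing the existing Hugging Face logits processors could look like; the tensors here are dummies and the integration point in the engine is left out:

```python
import torch
from transformers import (
    LogitsProcessorList,
    RepetitionPenaltyLogitsProcessor,
    TopKLogitsWarper,
)

processors = LogitsProcessorList([
    RepetitionPenaltyLogitsProcessor(penalty=1.2),
    TopKLogitsWarper(top_k=50),
])

input_ids = torch.tensor([[1, 5, 7, 5]])         # previously generated token ids
logits = torch.randn(1, 32000)                   # next-token logits for one sequence
filtered_logits = processors(input_ids, logits)  # apply the penalty, then top-k
```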
Currently we call torch.distributed.init_process_group even for a single GPU. This is redundant and causes errors when the LLM object is created multiple times.
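One possible guard, as a sketch rather than the actual fix: skip distributed setup entirely on a single GPU, and never initialize the process group twice. The function name and init_method default are hypothetical:

```python
import torch.distributed as dist

def maybe_init_distributed(world_size: int, rank: int,
                           init_method: str = "tcp://localhost:29500") -> None:
    if world_size == 1:
        return  # torch.distributed is unnecessary for a single GPU
    if dist.is_initialized():
        return  # avoid re-initialization when the LLM object is created again
    dist.init_process_group(backend="nccl", world_size=world_size,
                            rank=rank, init_method=init_method)
```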
In my environment, using the LLaMA fast tokenizer raises an error about protobuf:
File "/opt/conda/envs/dev/lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py", line 445, in __init__
from .utils import sentencepiece_model_pb2 as model_pb2
File "/opt/conda/envs/dev/lib/python3.9/site-packages/transformers/utils/sentencepiece_model_pb2.py", line 91, in <module>
_descriptor.EnumValueDescriptor(
File "/opt/conda/envs/dev/lib/python3.9/site-packages/google/protobuf/descriptor.py", line 796, in __new__
_message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
1. Downgrade the protobuf package to 3.20.x or lower.
2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
While downgrading the protobuf version removed the error, it slowed down the initialization time by ~8x.
With protobuf downgraded to 3.20.x:
real 4m18.476s
user 3m52.706s
sys 0m27.644s
With the original protobuf version:
real 0m27.620s
user 0m8.011s
sys 0m19.237s
Add the complete list of dependencies to setup.py so that the package can be pip-installed at once.
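A sketch of what setup.py could declare; the package list below is illustrative, not the project's actual dependency set:

```python
from setuptools import setup, find_packages

setup(
    name="cacheflow",
    packages=find_packages(),
    # Runtime dependencies pulled in automatically by `pip install`.
    install_requires=[
        "torch",
        "transformers",
        "ray",
        "sentencepiece",
        "ninja",
    ],
)
```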
We need tests for the models we support. The tests should ensure that the outputs of our models when using greedy sampling are equivalent to those of HF models.
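A hedged sketch of what such a test could look like: generate greedily with the HF reference model and compare token ids against our engine's output. The engine-side call below is a placeholder, and the real harness and model list may differ:

```python
import pytest
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def our_greedy_generate(model_name: str, prompt: str, max_new_tokens: int) -> list[int]:
    """Placeholder for our engine's greedy (temperature=0) generation call."""
    raise NotImplementedError


@pytest.mark.parametrize("model_name", ["facebook/opt-125m"])
def test_greedy_matches_hf(model_name):
    prompt = "Hello, my name is"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    hf_model = AutoModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hf_out = hf_model.generate(**inputs, do_sample=False, max_new_tokens=16)
    hf_new_ids = hf_out[0, inputs["input_ids"].shape[1]:].tolist()

    our_ids = our_greedy_generate(model_name, prompt, max_new_tokens=16)
    assert our_ids == hf_new_ids
```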