vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Home Page: https://docs.vllm.ai

License: Apache License 2.0

Python 80.10% C++ 2.76% Cuda 15.72% C 0.11% Shell 0.42% Dockerfile 0.14% Jinja 0.08% CMake 0.67%
gpt llm pytorch llmops mlops model-serving transformer llm-serving inference llama

vllm's Issues

Build failure due to CUDA version mismatch

I failed to build the system with the latest NVIDIA PyTorch Docker image. The reason is that the PyTorch installed by pip is built with CUDA 11.7, while the container uses CUDA 12.1.

RuntimeError:
The detected CUDA version (12.1) mismatches the version that was used to compile
PyTorch (11.7). Please make sure to use the same CUDA versions.
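
A quick sanity check before building is to compare the CUDA version PyTorch was compiled with against the toolkit inside the container. The snippet below is a minimal sketch of that check; it assumes nvcc is on PATH.

import subprocess
import torch

# CUDA version the installed PyTorch wheel was built with (e.g., 11.7 for the pip wheel).
print(f"PyTorch was built with CUDA {torch.version.cuda}")

# CUDA toolkit available in the container (e.g., release 12.1).
nvcc = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
print(next(line for line in nvcc.splitlines() if "release" in line).strip())

If the two versions differ, either install a PyTorch wheel built for the container's CUDA version or build inside an image whose toolkit matches the wheel.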

Support custom models

We need to provide clean abstractions and interfaces so that users can easily plug in their custom models.
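
A hedged sketch of one possible interface: a registry keyed by architecture name that users populate with their own model classes. Everything below (the ModelRegistry name, the decorator, the method names) is hypothetical, not the project's actual API.

from typing import Dict, Type

import torch.nn as nn

class ModelRegistry:
    # Hypothetical registry: architecture name -> user-provided model class.
    _models: Dict[str, Type[nn.Module]] = {}

    @classmethod
    def register(cls, architecture: str):
        def decorator(model_cls: Type[nn.Module]) -> Type[nn.Module]:
            cls._models[architecture] = model_cls
            return model_cls
        return decorator

    @classmethod
    def resolve(cls, architecture: str) -> Type[nn.Module]:
        return cls._models[architecture]

# A user would plug in a custom model by registering it under its architecture name.
@ModelRegistry.register("MyCustomForCausalLM")
class MyCustomForCausalLM(nn.Module):
    def forward(self, input_ids):
        ...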

Clean up the scheduler code

Currently, the scheduler code includes code added for experimental purposes (e.g., collecting various system stats). This code should be removed or minimized.

Fix the rushed out multi-query kernel

  1. Fix the correctness issue in the current FlashAttention-copy-based kernel. Make sure we call the FlashAttention kernel correctly. Evaluate the performance of this kernel.
  2. Reduce the memory usage of the current kernel by limiting the buffer size and calling the kernel multiple times.

Support custom tokenizer

We should provide a clean abstraction and interface so that users can easily use their own custom tokenizer.

Enhance model mapper

The current model mapper is hacky; it uses string matching based on the model name or path. Let's use an HF-style model mapper that reads the architecture specified in the model config and lazy-loads only the target model.
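
For illustration, a minimal sketch of such a mapper, assuming a hypothetical _MODEL_REGISTRY and module layout (the registry contents and module paths are placeholders):

import importlib

from transformers import AutoConfig

# Hypothetical registry: architecture name from config.json -> (module path, class name).
_MODEL_REGISTRY = {
    "OPTForCausalLM": ("models.opt", "OPTForCausalLM"),
    "LlamaForCausalLM": ("models.llama", "LlamaForCausalLM"),
}

def get_model_class(model_name_or_path: str):
    config = AutoConfig.from_pretrained(model_name_or_path)
    for arch in (config.architectures or []):
        if arch in _MODEL_REGISTRY:
            module_path, class_name = _MODEL_REGISTRY[arch]
            # Lazy-load: import only the module for the matched architecture.
            module = importlib.import_module(module_path)
            return getattr(module, class_name)
    raise ValueError(f"Unsupported architectures: {config.architectures}")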

Publish wheels with pre-built CUDA binaries

Currently, pip-installing our package takes 5-10 minutes because our CUDA kernels are compiled on the user's machine. For better UX, we should include pre-built CUDA binaries in our PyPI distribution, just like PyTorch and xformers.

Support GPT-2

GPT-2 is a representative Transformer-based generative model and is still the most downloaded model on Hugging Face. It'd be nice to support it.

Support BLOOM

BLOOM is an open-source LLM developed by BigScience. The BLOOM models rank highly in Hugging Face downloads. It'd be great to have these models in our catalog.

Port the current PyTorch model to C++

Expected gain: For 13B models, we should see a 20%-30% latency gain on a single GPU and 2-3x on 4 GPUs. For smaller models, the gain should be even higher.

Having a single iteration's computation run completely in C++ should be enough for high performance. This way, we can keep most of the complicated scheduling logic in Python, including weight loading.

Potential sources of overheads:

  1. Python vs. C++.
  2. PyTorch (even in C++) vs. FasterTransformer.

How to implement a C++ version:

  1. (Fake C++) Torch compiler (torch.jit); see the sketch after this list.
  2. Libtorch, the C++ frontend of PyTorch (easier to implement and extend, but can only address overhead 1).
  3. Prune the single-model code we need out of FasterTransformer and port it to CacheFlow. This addresses both overheads but is harder to implement.
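
As a minimal illustration of option 1 (torch.jit), the sketch below scripts a toy decoder step so that the per-iteration computation runs without the Python interpreter in the loop; the module is a stand-in, not the actual CacheFlow model.

import torch

class ToyDecoderStep(torch.nn.Module):
    # Stand-in for a single iteration's computation.
    def __init__(self, hidden_size: int = 512):
        super().__init__()
        self.proj = torch.nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(hidden_states))

# torch.jit.script compiles the module to TorchScript, which can then be run
# from C++ (or from Python without per-op interpreter overhead).
scripted = torch.jit.script(ToyDecoderStep())
out = scripted(torch.randn(8, 512))
print(out.shape)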

Dangerous floating point comparison

I noticed that we use conditions like this to check whether greedy sampling is used:
https://github.com/WoosukKwon/cacheflow/blob/189ae231336857bcc4c6f6157bf7868cdf56fb5f/cacheflow/sampling_params.py#L45

However, I suspect this will cause several problems:

  1. It is not recommended to use == for floating point numbers
  2. A small temperature will result in inf/nan

I typically use something like this https://github.com/lm-sys/FastChat/blob/a94fd259a97128f7f4483ddb760690f467888d84/fastchat/serve/inference.py#L227

@WoosukKwon, @zhuohan123 What do you think? If you are happy, I can change all "==" to "<=".
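
For concreteness, a minimal sketch of the proposed change; the epsilon value and helper name are illustrative, following the FastChat approach linked above.

_SAMPLING_EPS = 1e-5

def is_greedy(temperature: float) -> bool:
    # Treat very small temperatures as greedy sampling instead of comparing a
    # float to zero with ==; values this small would also make the softmax
    # numerically unstable (inf/nan).
    return temperature < _SAMPLING_EPS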

Frontend Improvements

  1. The current implementation of the FastAPI + asyncio + Ray combination seems slow.
  2. Merge Hao's throughput profiling code.
  3. Make the frontend look like OpenAI's API.

Support FP32

Yes, it does. It is our attention kernel that does not support FP32. More precisely, our attention kernel currently does not support some block sizes when FP32 is used. I will fix this in the future.

Originally posted by @WoosukKwon in #70 (comment)

Turn shareGPT data into a standard benchmark

  1. Extract the lengths of the conversation rounds, and maybe make that data directly available from GitHub.
  2. The current L-shaped evaluation with binary search for throughput is hard to run and does not scale. We should find an easier way to benchmark the performance.

Tensor Parallel profiling result

Will update the profiling results in this PR.

BS=8, input_len=32, output_len=128

OPT-13B
TP 1: 3.5404738585154214 seconds
TP 2: 4.742188215255737 seconds
TP 4: 4.907034238179524 seconds

OPT-30B
TP 1: OOM
TP 2: 5.9848620891571045 seconds
TP 4: 5.943212985992432 seconds

Tokenizer overhead is significant when use_fast=False

After #114, the server decodes the running sequences at every step. This leads to significant overhead, especially when a slow tokenizer is used (e.g., for LLaMA).

# opt-13b inference latency (bs 8, input 32, output 128)
Avg latency: 3.57 seconds
Tokenizer (fast): 0.14 seconds

# llama-13b inference latency (bs 8, input 32, output 128)
Avg latency: 5.28 seconds
Tokenizer (slow): 1.97 seconds

Support various sampling parameters

Parameters such as repetition_penalty and top_k are often used for sampling. It'd be nice to support them using the HuggingFace logits processors.
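
A minimal sketch of reusing the HuggingFace logits processors; the tensors below are toy inputs, not the engine's internal shapes.

import torch
from transformers import LogitsProcessorList, RepetitionPenaltyLogitsProcessor, TopKLogitsWarper

processors = LogitsProcessorList([
    RepetitionPenaltyLogitsProcessor(penalty=1.2),
    TopKLogitsWarper(top_k=50),
])

input_ids = torch.tensor([[1, 2, 3]])     # previously generated token ids
logits = torch.randn(1, 32000)            # next-token logits for one sequence
filtered = processors(input_ids, logits)  # apply repetition penalty, then top-k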

Bug in LLaMA fast tokenizer

In my environment, using the LLaMA fast tokenizer raises an error about protobuf:

  File "/opt/conda/envs/dev/lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py", line 445, in __init__
    from .utils import sentencepiece_model_pb2 as model_pb2
  File "/opt/conda/envs/dev/lib/python3.9/site-packages/transformers/utils/sentencepiece_model_pb2.py", line 91, in <module>
    _descriptor.EnumValueDescriptor(
  File "/opt/conda/envs/dev/lib/python3.9/site-packages/google/protobuf/descriptor.py", line 796, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

While downgrading the protobuf version removed the error, it slowed down the initialization time by ~8x.

  • Initialization with fast tokenizer & protobuf==3.20.3
real    4m18.476s
user    3m52.706s
sys     0m27.644s
  • Initialization with slow tokenizer
real    0m27.620s
user    0m8.011s
sys     0m19.237s
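
The second workaround listed in the error message avoids the protobuf downgrade (and its ~8x slower initialization); the environment variable just has to be set before anything imports protobuf. The checkpoint name below is only an example.

import os

# Force the pure-Python protobuf implementation (slower parsing, but avoids
# pinning protobuf to 3.20.x). Must be set before protobuf is imported.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=True)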

Add tests for models

We need tests for the models we support. The tests should ensure that the outputs of our models when using greedy sampling are equivalent to those of HF models.
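
A minimal sketch of such a test, assuming vLLM's public LLM / SamplingParams API; the model name, prompt, and output length are illustrative.

from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "facebook/opt-125m"
PROMPT = "Hello, my name is"
MAX_TOKENS = 32

def test_greedy_matches_hf():
    # Reference output: HF greedy decoding, keeping only the newly generated tokens.
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    hf_model = AutoModelForCausalLM.from_pretrained(MODEL)
    inputs = tokenizer(PROMPT, return_tensors="pt")
    hf_ids = hf_model.generate(**inputs, do_sample=False, max_new_tokens=MAX_TOKENS)
    hf_text = tokenizer.decode(hf_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    # Engine output with temperature=0 (greedy sampling).
    llm = LLM(model=MODEL)
    params = SamplingParams(temperature=0.0, max_tokens=MAX_TOKENS)
    engine_text = llm.generate([PROMPT], params)[0].outputs[0].text

    assert engine_text == hf_text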
