Comments
Without the fix, cacheflow can suddenly hang without any notice. Let's fix this.
There are two limits on the sequence length, and we should take whichever is smaller:
- Model limit: for example, OPT-13B's learned position embeddings can support only `config.max_position_embeddings` positions.
- vLLM limit: an input cannot be longer than `max_num_batched_tokens`.

The real limit should be `limit = min(model limit, vllm limit)`.

Then, for a request:
- If `len(prompt) > limit`, we should directly reject the request or return nothing.
- If `len(prompt) + len(generated) > limit`, we should stop the generation and return the partial results, with the finish reason set to `SequenceStatus.FINISHED_LENGTH_CAPPED`.
In addition, the server should not fail outright when a single request is too long. It should return a proper error message or a partial result instead.
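The checks described above can be sketched as follows. This is a minimal illustration, not vLLM's actual implementation; the function names and the standalone `SequenceStatus` enum are hypothetical, and only the member name `FINISHED_LENGTH_CAPPED` comes from the discussion above.

```python
from enum import Enum


class SequenceStatus(Enum):
    # Illustrative enum; only FINISHED_LENGTH_CAPPED is named in the issue.
    RUNNING = "running"
    FINISHED_LENGTH_CAPPED = "finished_length_capped"


def effective_limit(max_position_embeddings: int,
                    max_num_batched_tokens: int) -> int:
    # The real limit is the smaller of the model limit and the vLLM limit.
    return min(max_position_embeddings, max_num_batched_tokens)


def check_request(prompt_len: int, generated_len: int, limit: int) -> SequenceStatus:
    if prompt_len > limit:
        # Reject the request with a proper error instead of letting
        # the server hang or crash.
        raise ValueError(
            f"Prompt length {prompt_len} exceeds the limit {limit}.")
    if prompt_len + generated_len > limit:
        # Stop generation and return the partial result.
        return SequenceStatus.FINISHED_LENGTH_CAPPED
    return SequenceStatus.RUNNING
```

For OPT-13B with `max_position_embeddings = 2048` and `max_num_batched_tokens = 2560`, the effective limit would be 2048, so a 10-token prompt that has generated 2039 tokens would be length-capped.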
Closing as this should now be fixed.