Comments
Without the fix, cacheflow can suddenly hang without any notice. Let's fix this.
There are two limits on the sequence length, and we should take whichever is smaller:
- Model limit: for example, OPT-13B's learned position embeddings can support only `config.max_position_embeddings` positions.
- vLLM limit: an input cannot be longer than `max_num_batched_tokens`.

The real limit should be `limit = min(model limit, vllm limit)`.

Then, for a request:
- If `len(prompt) > limit`, we should directly reject the request or return nothing.
- If `len(prompt) + len(generated) > limit`, we should stop the generation and return the partial results, with the finish reason set to `SequenceStatus.FINISHED_LENGTH_CAPPED`.
In addition, the server should not fail outright when a single request is too long. It should return a proper error message or a partial result instead.
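The checks described above can be sketched as follows. This is a minimal illustration, not vLLM's actual implementation; the function names and the standalone `SequenceStatus` enum are hypothetical, and only the member name `FINISHED_LENGTH_CAPPED` comes from the discussion above.

```python
from enum import Enum


class SequenceStatus(Enum):
    # Illustrative enum; only FINISHED_LENGTH_CAPPED is named in the issue.
    RUNNING = "running"
    FINISHED_LENGTH_CAPPED = "finished_length_capped"


def effective_limit(max_position_embeddings: int,
                    max_num_batched_tokens: int) -> int:
    # The real limit is the smaller of the model limit and the vLLM limit.
    return min(max_position_embeddings, max_num_batched_tokens)


def check_request(prompt_len: int, generated_len: int, limit: int) -> SequenceStatus:
    if prompt_len > limit:
        # Reject the request with a proper error instead of letting
        # the server hang or crash.
        raise ValueError(
            f"Prompt length {prompt_len} exceeds the limit {limit}.")
    if prompt_len + generated_len > limit:
        # Stop generation and return the partial result.
        return SequenceStatus.FINISHED_LENGTH_CAPPED
    return SequenceStatus.RUNNING
```

For OPT-13B with `max_position_embeddings = 2048` and `max_num_batched_tokens = 2560`, the effective limit would be 2048, so a 10-token prompt that has generated 2039 tokens would be length-capped.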
Closing as this should now be fixed.