Comments (3)
How do you send requests? Can you share the code here?
Also note that when your batch size is large enough, the engine becomes compute-bound, and once it hits the compute bound, increasing the batch size doesn't improve performance much.
The other possibility is that you don't have enough KV cache to batch all the requests. In that case, although your max num seqs is 256, the engine may never reach that batch size, because you cannot batch more requests than your available KV cache can hold.
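To make the KV-cache point concrete, here is a rough back-of-the-envelope sketch using the standard per-token KV size formula (2 × layers × KV heads × head dim × bytes per value). All model and memory numbers below are illustrative assumptions, not DeepseekCoder's actual configuration:

```python
# Illustrative estimate of how many sequences the KV cache can hold at once.
# Every number here is an assumption, for the sake of the arithmetic only.
num_layers = 32        # transformer layers (assumed)
num_kv_heads = 32      # KV heads per layer (assumed; lower with GQA)
head_dim = 128         # dimension per head (assumed)
dtype_bytes = 2        # fp16/bf16 KV cache

# K and V each store num_layers * num_kv_heads * head_dim values per token.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # 512 KiB

kv_cache_budget = 16 * 1024**3   # e.g. ~16 GiB left for KV cache (assumed)
seq_len = 8192                   # tokens per request

max_concurrent_seqs = kv_cache_budget // (kv_bytes_per_token * seq_len)
print(max_concurrent_seqs)  # 4 here -- far below max_num_seqs = 256
```

Under these assumptions only 4 sequences fit at once, which is how a large max_num_seqs can still yield a tiny effective batch.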
@rkooo567 The demo code is attached. I changed the .py files to .txt since uploading .py files isn't supported. I used a thread pool to send requests in order to mimic concurrent requests (a minimal sketch of that pattern is shown after the attachments below). DeepseekCoder is used as the engine model; after AWQ quantization, loading the model weights took 3.7 GB. So I guess the compute bound, rather than the KV cache, might be the reason.
api_server.txt
github_demo.txt
multi_8192.json
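For reference, a minimal sketch of the thread-pool request pattern described above. The actual client lives in the attachments; the endpoint URL, model name, and request shape here are assumptions:

```python
# Hypothetical sketch of sending concurrent requests with a thread pool;
# the real client is in github_demo.txt. URL and model name are assumed.
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/completions"  # assumed vLLM API server address

def send_request(prompt: str) -> str:
    resp = requests.post(
        URL,
        json={
            "model": "deepseek-coder",  # assumed served model name
            "prompt": prompt,
            "max_tokens": 256,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

# Fire 256 requests at once so the server can batch them together.
prompts = [f"# Task {i}: write a sorting function\n" for i in range(256)]
with ThreadPoolExecutor(max_workers=256) as pool:
    completions = list(pool.map(send_request, prompts))
```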
One thing you can try is to set disable_log_stats=False; the periodic stats log also shows the number of running requests. If it is close to max num seqs, I think it is the compute-bound case. If it is much lower, it may be a code bug (since you don't use much memory for the model weights).
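A minimal sketch of enabling the stats log, assuming the offline LLM entry point and an AWQ checkpoint; when running the API server instead, simply don't pass --disable-log-stats. The model id here is an assumption:

```python
# Minimal sketch: keep vLLM's periodic stats logging enabled so the engine
# prints throughput and the number of running/pending requests.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/deepseek-coder-6.7B-instruct-AWQ",  # assumed AWQ checkpoint
    quantization="awq",
    max_num_seqs=256,
    disable_log_stats=False,  # turn the periodic stats log on
)
outputs = llm.generate(["def fib(n):"] * 64, SamplingParams(max_tokens=64))
# In the log, compare "Running: N reqs" against max_num_seqs: N near 256
# suggests compute-bound; N staying low suggests the KV cache is the limit.
```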
Related Issues (20)
- [Bug]: UnboundLocalError: local variable 'lora_b_k' referenced before assignment HOT 3
- [Bug]: Mistral 7b inst v0.3 fails to run HOT 1
- [Bug]: HOT 2
- [Usage]: I use llama3. I found that one token is 'Ġor' in tokenizer.get_vocab(). But when I use vllm server, I got ' or' in response. HOT 1
- [Bug]: Command-R incorrect output contains `<EOS_TOKEN>` and seems to do text prediction rather than conversation
- [Misc]: LLM is responding with advertisement HOT 2
- [Bug]: vllm fails when running with the latest NVIDIA driver 555.85 HOT 2
- [Feature]: Additional metrics to enable better autoscaling / load balancing of vLLM servers in Kubernetes HOT 4
- [Misc]: Understanding Batching Mechanism in Prefill and Decode Phases HOT 1
- [Installation]:
- [Feature]: Add num_requests_preempted metric HOT 1
- Running Vllm on ray cluster, logging stuck at loading
- [Feature]: multi-steps model_runner? HOT 1
- [Bug]: Cannot build cpu docker image
- [Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already. HOT 4
- [Usage]: not support for mistralai/Mistral-7B-Instruct-v0.3 HOT 3
- [Bug]: When load model weights, there are infinite loading
- [Misc]: How to use guided decoding and regex as well? HOT 2
- [Feature]: Integration of transformers past_key_values into the vllm kvcache Function HOT 4
- [Bug]: The VRAM usage of calculating log_probs is not considered in profile run HOT 5