Comments (4)
I have implemented Medusa using this. If this makes sense and can be accepted as a contribution, I would love to create a PR (including the implementation of Medusa). I am also working on implementing the EAGLE approach.
Hi @KexinFeng
Currently what I've implemented only takes top-1 predictions to get single-sequence draft tokens. I agree that tree-style speculation is essential for significant acceleration. I've observed that in a torch.compile-based implementation I worked on (based on gpt-fast), but I haven't tried implementing it in vllm yet, as it looked more complicated at the time and I knew it was already being worked on.
As for the current implementation of Medusa and EAGLE using a single sequence, I'll create a PR as soon as I've tested it a bit more and have company approvals.
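To make "single-sequence top-1 drafting" concrete, here is a minimal toy sketch (the function name and the logits layout are illustrative assumptions, not vLLM's actual API): each draft head contributes only its argmax token, so the proposal is one linear chain of tokens with no branching.

```python
# Toy sketch (NOT vLLM's implementation): single-sequence draft proposal
# where each of K Medusa-style heads contributes its top-1 token.
from typing import List


def propose_top1_draft(head_logits: List[List[float]]) -> List[int]:
    """Each head proposes exactly its argmax token, so the draft is a
    single sequence of K tokens with no branching."""
    return [
        max(range(len(logits)), key=logits.__getitem__)
        for logits in head_logits
    ]


# Example: 3 draft heads over a toy 4-token vocabulary
logits = [
    [0.1, 2.0, 0.3, 0.0],  # head 1 -> token 1
    [1.5, 0.2, 0.1, 0.0],  # head 2 -> token 0
    [0.0, 0.1, 0.2, 3.0],  # head 3 -> token 3
]
print(propose_top1_draft(logits))  # [1, 0, 3]
```

The target model then verifies this single chain in one forward pass and accepts the longest matching prefix.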
cc @cadedaniel @LiuXiaoxuanPKU for visibility.
@abhigoyal1997 This is indeed an important feature that people have been looking for. It's also on my exploration radar, and I look forward to its implementation in vllm.
Here is a more detailed question. I know that for Medusa, tree draft tokens play an essential role; for EAGLE, they are also important. In your implementation, did you enable tree draft tokens, or is it still single-sequence draft tokens?
I'm asking because I'm developing this tree-style speculation, and it would be a perfect match with the Medusa/EAGLE/Hydra work here. Maybe we can combine our efforts and see how much the performance improves when the two techniques are put together. #4565 (comment)
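For contrast with the top-1 case, tree-style drafting can be sketched like this (again a toy illustration; `propose_tree_draft` and the per-head independence assumption are mine, not from either implementation): taking the top-k tokens from each head and forming all combinations yields the root-to-leaf paths of a draft tree, which the target model can verify in a single batched pass with an appropriate tree attention mask.

```python
# Toy sketch (NOT the actual vLLM / Medusa code): tree-style draft proposal.
# Each head keeps its top-k tokens; the Cartesian product over heads gives
# the k**K root-to-leaf candidate paths of the draft tree.
from itertools import product
from typing import List, Tuple


def topk(logits: List[float], k: int) -> List[int]:
    """Indices of the k largest logits, in descending order."""
    return sorted(range(len(logits)), key=lambda i: -logits[i])[:k]


def propose_tree_draft(
    head_logits: List[List[float]], k: int
) -> List[Tuple[int, ...]]:
    """Assumes heads propose independently (as in Medusa); real systems
    prune this full product down to a fixed sparse tree of likely paths."""
    per_head = [topk(logits, k) for logits in head_logits]
    return list(product(*per_head))


# Example: 2 heads, top-2 each -> 4 candidate paths
paths = propose_tree_draft([[0.1, 2.0, 0.3], [1.5, 0.2, 1.0]], k=2)
print(paths)  # [(1, 0), (1, 2), (2, 0), (2, 2)]
```

The key cost trade-off: the tree multiplies the number of verified tokens per step (raising the expected accepted length) at the price of a larger verification batch, which is why pruning to a sparse tree matters in practice.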