
Comments (7)

cade avatar cade commented on June 6, 2024 7

As much as I would love to take credit for bringing Speculative Decoding to vLLM, I'm relatively certain the praise belongs to @cadedaniel. 😁

from vllm.

LiuXiaoxuanPKU avatar LiuXiaoxuanPKU commented on June 6, 2024 2

> It's indeed a good idea to make the speculative system smarter, so that it can automatically adjust to the serving load and serving data. Along the same direction, there is one more thing that is not mentioned but is worth doing: a dynamic candidate tree topology. This is a generalization of the dynamic speculation length mentioned above, and it will become possible once tree-based speculative decoding is enabled in vllm. We are actively exploring this direction.
>
> Another good thing about it is that it is orthogonal to the roadmaps above and thus compatible with them, as you mentioned. On the other hand, this direction also falls under the title of this RFC, dynamic speculative decoding. So I mention it here to bring the community's attention to it, and I hope this implementation can be contributed to vllm in the next step.
>
> More specifically, in 1D sequential spec-decoding, the spec_length can be set dynamically according to the predicted acceptance rate. In tree-style spec-decoding, which is a generalization of the 1D case, the tree topology, including the tree size, can be set dynamically according to an acceptance rate vector, and a further speedup can then be expected.

Yes! In our research, we also explored the idea of dynamically adjusting top-k for tree-style speculation. Our preliminary results are promising, but they are based on simulation. Once we have tree-style speculative decoding in vllm, we can add that as well.
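One way to act on an acceptance-rate vector is sketched below. The heuristic, the `alpha` estimates, and the node budget are all hypothetical illustrations, not vllm's implementation: allocate branching width per tree depth so that depths decoding is unlikely to reach get fewer candidate nodes.

```python
def choose_tree_widths(alpha, budget):
    """Illustrative heuristic: given per-depth acceptance-rate estimates
    `alpha` (alpha[d] is the chance some candidate at depth d is accepted)
    and a total candidate-node budget, assign a branching width per depth,
    shrinking widths as the probability of reaching that depth decays."""
    widths = []
    reach = 1.0        # probability that decoding reaches this depth
    remaining = budget
    for a in alpha:
        if remaining <= 0 or reach < 0.1:  # stop expanding unlikely depths
            break
        # Wider levels near the root, narrower ones deeper in the tree.
        w = max(1, min(remaining, round(reach * budget / len(alpha))))
        widths.append(w)
        remaining -= w
        reach *= a
    return widths
```

For example, with `alpha = [0.9, 0.8, 0.5]` and a budget of 6 nodes, this yields widths of `[2, 2, 1]`: the deepest level, which is least likely to be reached, gets the fewest candidates.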


LiuXiaoxuanPKU avatar LiuXiaoxuanPKU commented on June 6, 2024 1

> @LiuXiaoxuanPKU Thanks a lot for the super helpful info! I am very interested in the Dynamic Speculative Decoding mentioned above, and I also found that the existing vllm framework cannot load two LLM models at the same time (one used as the draft model and the other as the target model). I now have three questions for you:
>
> 1. Can the vllm framework now support loading two LLM models at the same time?
> 2. Will Dynamic Speculative Decoding-related features be developed on the vllm framework?
> 3. If the above two features are supported by the vllm community, when will they be implemented?

Thanks for the interest!

  1. If the two models are used for speculative decoding, yes, vllm can already support that. Take a look at this worker, which contains a draft worker and a target worker. The draft worker is responsible for loading and executing the draft model, while the target worker is used for the target model.
  2. Yes, it will be integrated into vllm.
  3. Currently, we are in the process of optimizing speculative decoding performance, because dynamically adjusting it will not be interesting if the native speculative decoding performance is not good. Once we think the native performance is reasonable, we will quickly add our method on top of it. I am not sure how long this step will take; @cade might have more context here.
  4. For the timeline: since our method is very lightweight, milestone 2 (pre-collect some system numbers; support a limited set of models such as llama-7b and llama-70b) can be done within one week. Milestone 3 is to fully automate the speculation, which will take longer, about 1-2 months.
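The draft/target worker pairing in point 1 can be illustrated with a toy sketch. The class and method names here are hypothetical and the models are plain callables returning one token; vllm's actual worker interface is richer than this:

```python
class SpecDecodeWorker:
    """Toy sketch of pairing a draft worker with a target worker.
    Illustrative only; not vllm's actual SpecDecodeWorker interface."""

    def __init__(self, draft_model, target_model, spec_len):
        self.draft_model = draft_model    # cheap model: proposes tokens
        self.target_model = target_model  # expensive model: verifies them
        self.spec_len = spec_len          # number of tokens to draft

    def step(self, tokens):
        # Draft phase: propose spec_len candidate tokens autoregressively.
        proposal = list(tokens)
        for _ in range(self.spec_len):
            proposal.append(self.draft_model(proposal))
        draft = proposal[len(tokens):]

        # Verify phase (greedy variant): accept the longest prefix of the
        # draft that matches the target model, then emit one corrected token.
        accepted = []
        ctx = list(tokens)
        for t in draft:
            target_t = self.target_model(ctx)
            if t == target_t:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(target_t)  # target's token replaces the miss
                break
        else:
            # All draft tokens accepted: the target pass gives one bonus token.
            accepted.append(self.target_model(ctx))
        return accepted
```

When the draft model agrees with the target, one step emits `spec_len + 1` tokens; on the first disagreement, it still emits the target's own token, so output always matches what the target model alone would produce.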


KexinFeng avatar KexinFeng commented on June 6, 2024

It's indeed a good idea to make the speculative system smarter, so that it can automatically adjust to the serving load and serving data. Along the same direction, there is one more thing that is not mentioned but is worth doing: a dynamic candidate tree topology. This is a generalization of the dynamic speculation length mentioned above, and it will become possible once tree-based speculative decoding is enabled in vllm. We are actively exploring this direction.

Another good thing about it is that it is orthogonal to the roadmaps above and thus compatible with them, as you mentioned. On the other hand, this direction also falls under the title of this RFC, dynamic speculative decoding. So I mention it here to bring the community's attention to it, and I hope this implementation can be contributed to vllm in the next step.

More specifically, in 1D sequential spec-decoding, the spec_length can be set dynamically according to the predicted acceptance rate. In tree-style spec-decoding, which is a generalization of the 1D case, the tree topology, including the tree size, can be set dynamically according to an acceptance rate vector, and a further speedup can then be expected.
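The 1D case can be made concrete. Under the standard speculative-decoding analysis, with per-token acceptance rate alpha, drafting k tokens yields (1 - alpha^(k+1)) / (1 - alpha) accepted tokens per target step in expectation, and a dynamic controller could pick k from that. The cost model below (one unit per target step, `draft_cost` per draft token) is a simplifying assumption of mine:

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens produced per target-model step when drafting k tokens
    with per-token acceptance rate alpha (standard spec-decoding formula)."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def choose_spec_length(alpha: float, draft_cost: float, max_k: int = 8) -> int:
    """Pick the spec_length k maximizing expected tokens per unit compute,
    where one target step costs 1 and each draft token costs draft_cost.
    Hypothetical cost model for illustration."""
    best_k, best_rate = 0, 0.0
    for k in range(max_k + 1):
        rate = expected_tokens(alpha, k) / (1.0 + k * draft_cost)
        if rate > best_rate:
            best_k, best_rate = k, rate
    return best_k
```

With a low acceptance rate the controller falls back to k = 0 (no speculation), and with a high acceptance rate and a cheap draft model it pushes k toward the maximum, which is exactly the adaptive behavior described above.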


YuCheng-Qi avatar YuCheng-Qi commented on June 6, 2024

@LiuXiaoxuanPKU
Thanks a lot for the super helpful info! I am very interested in the Dynamic Speculative Decoding mentioned above, and I also found that the existing vllm framework cannot load two LLM models at the same time (one used as the draft model and the other as the target model). I now have three questions for you:

  1. Can the vllm framework now support loading two LLM models at the same time?
  2. Will Dynamic Speculative Decoding-related features be developed on the vllm framework?
  3. If the above two features are supported by the vllm community, when will they be implemented?


YuCheng-Qi avatar YuCheng-Qi commented on June 6, 2024

@LiuXiaoxuanPKU Thanks for your response, and best wishes to you as well!


KexinFeng avatar KexinFeng commented on June 6, 2024

@LiuXiaoxuanPKU It's great to know that vllm is looking into tree-style 2D speculation. Actually, I'm developing an implementation of this tree-style 2D speculation which works on any tree topology. Similar to what you mentioned, my estimates also show promising results, so we can expect a further boost in this direction. When I finish the implementation, I would like to create a PR and integrate it into vllm's speculation.
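For any-topology tree speculation, one key mechanical piece is verifying all candidate paths in a single batched target pass, which requires a tree-shaped attention mask. A minimal sketch follows; the parent-pointer representation is my own choice for illustration, not necessarily how such an implementation would represent the tree:

```python
def tree_attention_mask(parents):
    """Build a boolean attention mask for a candidate token tree given as a
    parent-pointer list: parents[i] is the index of node i's parent, or -1
    for a root. Node i may attend to node j iff j is an ancestor of i or
    j == i, so each root-to-leaf path is verified as an independent sequence
    within one batched forward pass. Illustrative sketch, not vllm code."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking ancestors
            mask[i][j] = True
            j = parents[j]
    return mask
```

For the tree `parents = [-1, 0, 0, 1]` (two children under the root, one grandchild under the first child), node 3 attends to nodes 0, 1, and itself, but not to its sibling branch at node 2, so the two candidate paths cannot contaminate each other.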

Some updates: I just noticed that in #4669 people now have Medusa/Eagle/Hydra implementations. Tree-style speculation will be a good match for them.

