qwen-vllm

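Official Qwen deployment documentation

Core Technical Principles

This project explores how to build a high-concurrency inference server for production use. It concentrates on the core pieces and spends little effort on peripheral details. Hopefully it is useful to others.

  • vLLM supports continuous batching of incoming requests. Its SDK runs inference in a separate thread and queues and batches user requests, which provides the high-throughput concurrency an online service needs.
  • vLLM provides an asyncio wrapper. In the main thread, an asyncio HTTP framework built on uvicorn + FastAPI exposes the HTTP API and submits each request to vLLM's queue, where the inference thread processes it with continuous batching. The main thread awaits the result asynchronously and returns it to the HTTP client.
  • vLLM natively streams each next token. FastAPI returns the output chunk by chunk; the client receives the chunks with the requests library and redraws the console line, which produces the streaming effect.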
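The threading model described in the bullets above can be illustrated with a small standard-library simulation. This is only a sketch of the pattern (a background inference thread that admits new requests between decoding steps, and an asyncio side that awaits tokens); it does not use vLLM, and `ToyEngine`, `submit`, and `stream` are hypothetical names invented for the illustration.

```python
import asyncio
import queue
import threading


class ToyEngine:
    """Toy stand-in for an inference engine. It is NOT vLLM."""

    def __init__(self, loop):
        self.loop = loop
        self.pending = queue.Queue()  # requests waiting to join the running batch
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, prompt, n_tokens):
        """Called from the asyncio thread; returns a queue that receives tokens."""
        out = asyncio.Queue()
        self.pending.put({"prompt": prompt, "n_tokens": n_tokens, "step": 0, "out": out})
        return out

    def _emit(self, out, item):
        # asyncio.Queue is not thread-safe, so hand items over through the loop.
        self.loop.call_soon_threadsafe(out.put_nowait, item)

    def _run(self):
        running = []
        while True:
            # Admit new requests between steps; block only when the batch is empty.
            block = not running
            while True:
                try:
                    running.append(self.pending.get(block=block))
                except queue.Empty:
                    break
                block = False
            # One "decoding step": every running request yields one fake token.
            for req in running:
                self._emit(req["out"], f'{req["prompt"]}#{req["step"]}')
                req["step"] += 1
                if req["step"] == req["n_tokens"]:
                    self._emit(req["out"], None)  # end-of-stream marker
            running = [r for r in running if r["step"] < r["n_tokens"]]


async def stream(engine, prompt, n_tokens):
    """Await tokens for one request without blocking the event loop."""
    out = engine.submit(prompt, n_tokens)
    tokens = []
    while (tok := await out.get()) is not None:
        tokens.append(tok)
    return tokens


async def main():
    engine = ToyEngine(asyncio.get_running_loop())
    # Two concurrent requests share the engine's decoding steps.
    return await asyncio.gather(stream(engine, "a", 3), stream(engine, "b", 2))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In a real server the `stream` coroutine would correspond to an HTTP handler that forwards each token to the client as it arrives, and the toy engine would be replaced by vLLM's own asynchronous engine.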

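Installation Notes

Offline Inference

A Python program loads the model directly and runs inference locally.

python vllm_offline.py
Question: Hello
Hello! Is there anything I can help you with?
Question: Never mind
OK. If you need any help, just let me know.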

Online Inference

Start a remote Python HTTP server and call it from an HTTP client; inference results can be streamed back.

Start the HTTP server:

python vllm_server.py

Start the HTTP client:

python vllm_client.py
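A streaming client of this kind might look roughly like the sketch below. It assumes that each chunk carries the full text generated so far, encoded as UTF-8, so the console line can simply be redrawn. The chunk format, the URL, and the JSON payload are assumptions made for illustration and are not taken from vllm_client.py.

```python
import sys


def render_stream(chunks, out=sys.stdout):
    """Redraw one console line from an iterable of byte chunks; return the final text.

    Assumption: every chunk holds the complete text generated so far.
    """
    text = ""
    for chunk in chunks:
        if not chunk:
            continue  # skip keep-alive or empty chunks
        text = chunk.decode("utf-8")
        out.write("\r" + text)  # carriage return moves back to the line start
        out.flush()
    out.write("\n")
    return text


def stream_chat(url, payload):
    """POST a request and render the streamed reply (url and payload are hypothetical)."""
    import requests  # third-party; imported lazily so render_stream stays stdlib-only

    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        # chunk_size=None yields data in whatever sizes it arrives.
        return render_stream(resp.iter_content(chunk_size=None))
```

Redrawing with `\r` only works while the reply fits on one console line; a multi-line reply would need the client to clear the screen or track cursor position instead.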

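WebUI

After starting vllm_server, you can also run gradio_webui.py separately. It is a Gradio-based chat web UI that supports multi-turn conversation and streaming responses, and it calls vllm_server remotely in real time.

python gradio_webui.py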

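How Tongyi Qianwen (Qwen) Prompts Work

1.8B pretrained version, training data:

Input: British Airways (Chinese abbreviation: 英航) is the flag carrier of the United Kingdom, a founding member of Oneworld, and a subsidiary of International Airlines Group.<|endoftext|> Output: British Airways' main hubs are London Heathrow Airport and London Gatwick Airport. It is the second-largest airline in Europe, the largest in Western Europe, and one of only three airlines worldwide to have operated Concorde, the other two being Air France and Singapore Airlines.<|endoftext|>

1.8B-Chat version, fine-tuned from the 1.8B pretrained version (SFT: S for supervised, FT for fine-tuning), training data:

Input: <|im_start|>system\nYou are a helpful assistant.\n<|im_end|> \n<|im_start|>user\nPrevious question A?\n<|im_end|><|im_start|>assistant:Previous answer A\n<|im_end|> \n<|im_start|>user\nPrevious question B?\n<|im_end|><|im_start|>assistant:Previous answer B\n<|im_end|> \n<|im_start|>user\nDo you know British Airways?\n<|im_end|><|im_start|>assistant:\n<|endoftext|> Output: British Airways (Chinese abbreviation: 英航) is the flag carrier of the United Kingdom.<|im_end|><|endoftext|>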
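The chat format above can be assembled programmatically. The sketch below follows the ChatML layout commonly used with Qwen chat models, in which the role name is followed by a newline rather than a colon; that differs slightly in whitespace from the schematic example. `build_chat_prompt` is a hypothetical helper written for illustration, not a function from this repository.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chat_prompt(query, history=None, system="You are a helpful assistant."):
    """Build a ChatML-style prompt from a system message, past turns, and a new query.

    history is a list of (user_text, assistant_text) pairs, oldest first.
    """
    parts = [f"{IM_START}system\n{system}{IM_END}"]
    for user_turn, assistant_turn in history or []:
        parts.append(f"{IM_START}user\n{user_turn}{IM_END}")
        parts.append(f"{IM_START}assistant\n{assistant_turn}{IM_END}")
    parts.append(f"{IM_START}user\n{query}{IM_END}")
    # Leave the assistant turn open so the model generates the reply.
    parts.append(f"{IM_START}assistant\n")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_chat_prompt("Do you know British Airways?",
                            history=[("Previous question A?", "Previous answer A")]))
```

Generation is then normally stopped when the model emits `<|im_end|>`, which is why the chat training example ends its output with that token.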

Contributors

owenliang

qwen-vllm's Issues

Error when running vllm_offline.py

Versions: CUDA 12.1, torch 2.1.0, vllm 0.2.2. I followed the steps for your versions and got this error:

self.stop_words_ids=[self.tokenizer.im_start_id,self.tokenizer.im_end_id,self.tokenizer.eos_token_id]
AttributeError: 'Qwen2TokenizerFast' object has no attribute 'im_start_id'

What did I do wrong?
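One possible reading of this error: the loaded checkpoint uses the `Qwen2TokenizerFast` class from transformers, which does not expose the `im_start_id` and `im_end_id` attributes that the failing line expects. A hedged workaround sketch is to fall back to looking the special tokens up by name with `convert_tokens_to_ids`. `resolve_stop_token_ids` is a hypothetical helper, and whether the rest of the script works with such a checkpoint is not confirmed here.

```python
def resolve_stop_token_ids(tokenizer):
    """Return [im_start, im_end, eos] token ids for either tokenizer style.

    Uses the im_start_id / im_end_id attributes when present, otherwise looks
    the special tokens up by their string form.
    """
    ids = []
    for attr, token in (("im_start_id", "<|im_start|>"), ("im_end_id", "<|im_end|>")):
        token_id = getattr(tokenizer, attr, None)
        if token_id is None:
            token_id = tokenizer.convert_tokens_to_ids(token)
        ids.append(token_id)
    ids.append(tokenizer.eos_token_id)
    return ids
```

With this helper, the failing assignment would become `self.stop_words_ids = resolve_stop_token_ids(self.tokenizer)`.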

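How to implement LoRA loading

How can hot-swappable LoRA loading be implemented? It looks like vLLM imposes a vocabulary size limit when loading LoRA adapters?

vLLM inference shows no noticeable speedup. How can this be fixed?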

history=None
for i in range(len(raw_prompts)):
    # len(raw_prompts) = 100
    q = raw_prompts[i]
    response, history = vllm_model.chat(query=q, history=history)
    print(response)
    history = history[:10]

Before using vLLM, 100 prompts took about 52 s; with vLLM they still take about 52 s, so there seems to be no speedup.
Could someone take a look?
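A likely factor is that the loop sends one prompt at a time and waits for each reply, so the engine never holds more than one request to batch. Each call also depends on the previous `history`, which forces sequential execution. If the prompts are in fact independent, a sketch like the following submits them in a single `generate` call. It assumes vLLM's offline `LLM` and `SamplingParams` classes, already formatted prompt strings, and placeholder sampling values; `generate_batch` and `run` are hypothetical helpers.

```python
def generate_batch(llm, sampling_params, prompts):
    """Submit all prompts in one call so the engine can batch them together."""
    outputs = llm.generate(prompts, sampling_params)
    # Each result carries one or more completions; take the first of each.
    return [o.outputs[0].text for o in outputs]


def run(model_path, prompts):
    """Load a model with vLLM and answer independent prompts in one batch."""
    from vllm import LLM, SamplingParams  # third-party; imported lazily

    llm = LLM(model=model_path, trust_remote_code=True)
    params = SamplingParams(temperature=0.7, max_tokens=512)  # placeholder values
    return generate_batch(llm, params, prompts)
```

For multi-turn conversations that really do depend on earlier replies, throughput comes from serving many conversations concurrently, not from speeding up a single one.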
