
gpt_server logo

GPT Server

This project builds on FastChat's core capabilities to provide an OpenAI-compatible API server.

  1. On top of that, it cleanly adapts more models and fixes models that FastChat supports poorly
  2. Re-adapts models that vLLM handles poorly, fixing decoded output that did not match HF
  3. Supports vLLM, LMDeploy, and HF loading backends
  4. Supports all semantic-vector models compatible with sentence_transformers (Embedding and Reranker)
  5. Supports the Infinity backend: faster inference than onnx/tensorrt, with dynamic batching
  6. Chat templates place no restrictions on roles, so the LangGraph Agent framework is fully supported
  7. Supports Function Calling (Tools) (currently Qwen/ChatGLM, with better support for Qwen)
  8. Supports multimodal LLMs
  9. Lowers the difficulty of adapting models and of using the project (adapting a new model takes fewer than 5 lines of changed code), making it easier to deploy your own latest models

(The repository is still under construction and has not gone through thorough regression testing, so already-adapted models may break. Suggestions for improvements or for adapting new models are welcome.)


Features

  1. Supports multiple inference backend engines, vLLM and LMDeploy; the LMDeploy backend handles 1.36–1.85× as many requests per second as vLLM
  2. Supports the Infinity backend: faster inference than onnx/tensorrt, with dynamic batching
  3. The only open-source framework that fully supports Tools (Function Calling), compatible with LangChain's bind_tools, AgentExecutor, and with_structured_output usage (currently Qwen and GLM series; see the sketch after this list)
  4. The only project that extends the openai library to serve Reranker models (see gpt_server/tests/test_openai_rerank.py for sample code)
  5. Supports multimodal LLMs
  6. Same distributed architecture as FastChat
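A minimal sketch of the LangChain tool-calling usage referenced in item 3. The model name "qwen", the port, and the get_weather tool are illustrative assumptions and must match your own config.yaml:

from langchain_openai import ChatOpenAI
from langchain_core.tools import tool

@tool
def get_weather(city: str) -> str:
    """Look up the weather for a city (hypothetical demo tool)."""
    return f"{city}: sunny"

# Point LangChain's OpenAI-compatible client at the local gpt_server endpoint.
llm = ChatOpenAI(model="qwen", api_key="EMPTY", base_url="http://localhost:8082/v1")
llm_with_tools = llm.bind_tools([get_weather])

# The model should respond with a tool call rather than plain text.
msg = llm_with_tools.invoke("What's the weather in Beijing?")
print(msg.tool_calls)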

Changelog

8-14  Added the InternVL2 series of multimodal models
7-28  Added dynamic batching for embedding/reranker (Infinity backend, faster than onnx/tensorrt)
7-19  Added the multimodal model glm-4v-9b on the LMDeploy PyTorch backend
6-22  Added function call (tools) support for the Qwen and ChatGLM series
6-12  Added qwen-2
6-5   Added the Yinka and zpoint_large_embedding_zh embedding models
6-5   Added the glm4-9b series (hf and vllm)
4-27  Added the LMDeploy accelerated inference backend
4-20  Added llama-3
4-13  Added deepseek
4-4   Added the embedding model acge_text_embedding
3-9   Added reranker models (bge-reranker, bce-reranker-base_v1)
3-3   Added internlm-1.0 and internlm-2.0
3-2   Added qwen-1.5 (0.5B, 1.8B, 4B, 7B, 14B, and 72B)
2-4   Added the vllm implementation
1-6   Added Yi-34B
12-31 Added qwen-7b and qwen-14b
12-30 Added all-embedding (in principle, all text-embedding models are supported)
12-24 Added chatglm3-6b

Roadmap

  • Support the HF backend
  • Support the vLLM backend
  • Support the LMDeploy backend
  • Support function call (tools) (Qwen and ChatGLM series already supported; will be extended on demand)
  • Support multimodal models (initial support for glm-4v; other models to follow)
  • Support dynamic batching for Embedding models (implemented via the Infinity backend)
  • Support dynamic batching for Reranker models (implemented via the Infinity backend)
  • Built-in tools (image_gen, code_interpreter, weather, etc.)
  • Parallel function call (tools)

Getting Started

Run with Python

1. Set up the Python environment

# 1. Create the conda environment
conda create -n gpt_server python=3.10

# 2. Activate the conda environment
conda activate gpt_server

# 3. Install the repository (be sure to install with install.sh, otherwise the dependency conflicts cannot be resolved)
sh install.sh

2. Edit the launch configuration file

Set the model backend (vllm, lmdeploy, etc.)

In config.yaml:

work_mode: vllm  # vllm / hf / lmdeploy-turbomind / lmdeploy-pytorch

Set the embedding/reranker backend (embedding or embedding_infinity)

In config.yaml:

model_type: embedding # embedding or embedding_infinity; the embedding_infinity backend is far faster than embedding

config.yaml

cd gpt_server/script
vim config.yaml
serve_args:
  host: 0.0.0.0
  port: 8082

models:
  - chatglm4:  # custom model name
      alias: null # aliases, e.g. gpt4,gpt3
      enable: true  # false / true
      model_name_or_path: /home/dev/model/THUDM/glm-4-9b-chat/
      model_type: chatglm  # qwen / chatglm3 / yi / internlm
      work_mode: vllm  # vllm / hf / lmdeploy-turbomind / lmdeploy-pytorch
      device: gpu  # gpu / cpu
      workers:
      - gpus:
        # - 1
        - 0

  - qwen:  # custom model name
      alias: gpt-4,gpt-3.5-turbo,gpt-3.5-turbo-16k # aliases, e.g. gpt4,gpt3
      enable: true  # false / true
      model_name_or_path: /home/dev/model/qwen/Qwen1___5-14B-Chat/
      model_type: qwen  # qwen / chatglm3 / yi / internlm
      work_mode: vllm  # vllm / hf / lmdeploy-turbomind / lmdeploy-pytorch
      device: gpu  # gpu / cpu
      workers:
      - gpus:
        - 1
      # - gpus:
      #   - 3

  # Embedding model
  - bge-base-zh:
      alias: null # aliases
      enable: true  # false / true
      model_name_or_path: /home/dev/model/Xorbits/bge-base-zh-v1___5/
      model_type: embedding # or embedding_infinity
      work_mode: hf
      device: gpu  # gpu / cpu
      workers:
      - gpus:
        - 2

  # Reranker model
  - bge-reranker-base:
      alias: null # aliases
      enable: true  # false / true
      model_name_or_path: /home/dev/model/Xorbits/bge-reranker-base/
      model_type: embedding # or embedding_infinity
      work_mode: hf
      device: gpu  # gpu / cpu
      workers:
      - gpus:
        - 2

3. Run the launch command

start.sh

cd gpt_server/script
sh start.sh
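Once the server is up, a quick smoke test (a sketch, assuming the default port 8082 from config.yaml) is to list the models it exposes through the OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8082/v1")
# Every enabled model (and its aliases) should appear here.
for model in client.models.list():
    print(model.id)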

Supported Models and Inference Backends

Inference speed: LMDeploy TurboMind > vllm > LMDeploy PyTorch > HF

LLM

Models / BackEnd     | HF | vllm | LMDeploy TurboMind | LMDeploy PyTorch
chatglm4-9b          | √  | √    | √                  | √
chatglm3-6b          | √  | √    | ×                  | √
Qwen (7B, 14B, etc.) | √  | √    | √                  | √
Qwen-1.5 (0.5B–72B)  | √  | √    | √                  | √
Qwen-2               | √  | √    | √                  | √
Yi-34B               | √  | √    | √                  | √
Internlm-1.0         | √  | √    | √                  | √
Internlm-2.0         | √  | √    | √                  | √
Deepseek             | √  | √    | √                  | √
Llama-3              | √  | √    | √                  | √

Multimodal / BackEnd | HF | vllm | LMDeploy TurboMind | LMDeploy PyTorch
glm-4v-9b            | ×  | ×    | ×                  | √
InternVL2            | ×  | ×    | √                  | √

Embedding Models

In principle, all Embedding/Rerank models are supported.

Inference speed: Infinity >> HF

The following models have been tested:

Embedding/Rerank          | HF | Infinity
bge-reranker              | √  | √
bce-reranker              | √  | √
bge-embedding             | √  | √
bce-embedding             | √  | √
piccolo-base-zh-embedding | √  | √
acge_text_embedding       | √  | √
Yinka                     | √  | √
zpoint_large_embedding_zh | √  | √
xiaobu-embedding          | √  | √

At the time of writing, xiaobu-embedding ranks first on the C-MTEB leaderboard (MTEB: https://huggingface.co/spaces/mteb/leaderboard)
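A minimal embedding request against one of the models above (a sketch; the model name "bge-base-zh" and the address are assumptions that must match your config.yaml):

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8082/v1")
resp = client.embeddings.create(model="bge-base-zh", input=["你是谁", "你多大了"])
# One vector per input sentence.
print(len(resp.data), len(resp.data[0].embedding))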

4. Call the server with the openai library

See the gpt_server/tests directory for sample test code: https://github.com/shell-nlp/gpt_server/tree/main/tests
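For reference, a minimal chat call (a sketch; the model name "chatglm4" is an assumption that must match a model enabled in your config.yaml):

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8082/v1")
resp = client.chat.completions.create(
    model="chatglm4",
    messages=[{"role": "user", "content": "你好"}],
)
print(resp.choices[0].message.content)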

5. Use the WebUI

cd gpt_server/tests
python web_demo.py

WebUI screenshot:

web_demo.png

Install with Docker

1. Build the image

docker build --rm -f "Dockerfile" -t gpt_server:latest "." 

2. Start with Docker Compose

docker-compose -f "docker-compose.yml" up -d --build gpt_server

Acknowledgements

FastChat: https://github.com/lm-sys/FastChat

vLLM: https://github.com/vllm-project/vllm

LMDeploy: https://github.com/InternLM/lmdeploy

Infinity: https://github.com/michaelfeil/infinity

Star History

Star History Chart


gpt_server's Issues

Too many dependencies

The repository seems to pull in a lot of dependencies, and downloading them takes too long. Could they be trimmed down?

Embedding model still loads onto the GPU when CPU is specified

Looking at the code: when CPU is specified in the config file, the only effect is setting the environment variable CUDA_VISIBLE_DEVICES = ''. But when sentence_transformers loads the model, no device argument is passed, so it falls back to its own default device selection, detects a GPU, and loads onto the GPU anyway.

Suggested change, for reference:

+++ gpt_server/model_worker/embedding.py
@@ -1,3 +1,4 @@
+import os
 from typing import List
 from gpt_server.model_worker.base import ModelWorkerBase
 import sentence_transformers
@@ -27,17 +28,27 @@
         # model_kwargs = {"device": "cuda"}
         self.encode_kwargs = {"normalize_embeddings": True, "batch_size": 64}
         self.mode = "embedding"
+        self.pool = {}
+        self.device = "cuda"
+        if not os.environ.get("CUDA_VISIBLE_DEVICES", ""):
+            self.device = "cpu"
         # rerank
         for model_name in model_names:
             if "rerank" in model_name:
                 self.mode = "rerank"
                 break
         if self.mode == "rerank":
-            self.client = sentence_transformers.CrossEncoder(model_name=model_path)
-            print("正在使用 rerank 模型...")
+            self.client = sentence_transformers.CrossEncoder(model_name=model_path, device=self.device)
+            print("正在使用 rerank 模型...", self.device)
         elif self.mode == "embedding":
-            self.client = sentence_transformers.SentenceTransformer(model_path)
-            print("正在使用 embedding 模型...")
+            self.client = sentence_transformers.SentenceTransformer(model_path, device=self.device)
+            if self.device == "cpu":
+                self.pool = self.client.start_multi_process_pool(["cpu", "cpu", "cpu", "cpu"])
+            print("正在使用 embedding 模型...", self.device)

     def generate_stream_gate(self, params):
         pass

(start_multi_process_pool spawns one worker process per listed device, so ["cpu", "cpu", "cpu", "cpu"] parallelizes CPU-side encoding across four processes.)

HTTP API request to the reranker model fails

Request:

POST http://10.0.80.31:8082/v1/embeddings
#Authorization: Bearer EMPTY
Content-Type: application/json

{
  "model": "bce-reranker-base_v1",
  "input": [
    "你是谁",
    "今年几岁",
    "你多大了"
  ],
  "extra_json": {
    "query": "你多大了"
  }
}

Response:

HTTP/1.1 400 Bad Request
date: Mon, 01 Jul 2024 08:37:55 GMT
server: uvicorn
content-length: 65
content-type: application/json

{
  "object": "error",
  "message": "Internal Server Error",
  "code": 50001
}

Server error stack trace:

2024-07-01 08:41:47.905 | INFO     | __main__:get_embeddings:58 - params {'model': 'bce-reranker-base_v1', 'input': ['你是谁', '今年几岁', '你多大了'], 'encoding_format': None, 'query': None}
2024-07-01 08:41:47.905 | INFO     | __main__:get_embeddings:59 - worker_id: 4f496ce6
Batches:   0%|                                                                                                                                                                                 | 0/1 [00:00<?, ?it/s]
Batches:   0%|                                                                                                                                                                                 | 0/1 [00:00<?, ?it/s]
2024-07-01 08:41:47 | ERROR | stderr |
INFO:     127.0.0.1:58486 - "POST /worker_get_embeddings HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/model_worker/base.py", line 293, in api_get_embeddings
    embedding = await worker.get_embeddings(params)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/model_worker/embedding.py", line 75, in get_embeddings
    scores = self.client.predict(sentence_pairs)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/sentence_transformers/cross_encoder/CrossEncoder.py", line 338, in predict
    for features in iterator:
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/sentence_transformers/cross_encoder/CrossEncoder.py", line 138, in smart_batching_collate_text_only
    texts[idx].append(text.strip())
AttributeError: 'NoneType' object has no attribute 'strip'
2024-07-01 08:41:47 | INFO | stdout | INFO:     10.0.12.165:54154 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request

Requesting via the Python openai client works fine. (Note that the raw request above nests query inside extra_json, so the server receives query: None, as the log shows; the openai client's extra_body merges query into the top level of the JSON body.)

from openai import OpenAI

# new-style openai client
client = OpenAI(api_key="EMPTY", base_url="http://10.0.80.31:8082/v1")
data = client.embeddings.create(
    model="bce-reranker-base_v1",
    input=["你是谁", "今年几岁", "你多大了"],
    extra_body={"query": "你多大了"},
)

print(data.data)

[Help] Docker image build fails

root@debian:/mnt/gpt_server-api# docker build -t gpt_server:v0.2.1 .
[+] Building 0.7s (8/8) FINISHED docker:default
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 674B 0.0s
=> [internal] load metadata for docker.io/continuumio/miniconda3:main 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 88B 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 31.27kB 0.0s
=> [1/4] FROM docker.io/continuumio/miniconda3:main 0.0s
=> CACHED [2/4] COPY ./ /gpt_server 0.0s
=> CACHED [3/4] WORKDIR /gpt_server 0.0s
=> ERROR [4/4] RUN sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list && 0.6s

[4/4] RUN sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list && sed -i 's/security.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list && pip config set global.index-url https://mirrors.ustc.edu.cn/pypi/web/simple && conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/ && conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/ && conda config --set show_channel_urls yes && pip install -r requirements.txt && pip cache purge:
0.496 sed: can't read /etc/apt/sources.list: No such file or directory


Dockerfile:7

6 |
7 | >>> RUN sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list &&
8 | >>> sed -i 's/security.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list &&
9 | >>> pip config set global.index-url https://mirrors.ustc.edu.cn/pypi/web/simple &&
10 | >>> conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/ &&
11 | >>> conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/ &&
12 | >>> conda config --set show_channel_urls yes &&
13 | >>> pip install -r requirements.txt && pip cache purge
14 | CMD ["/bin/bash"]

ERROR: failed to solve: process "/bin/sh -c sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list && sed -i 's/security.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list && pip config set global.index-url https://mirrors.ustc.edu.cn/pypi/web/simple && conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/ && conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/ && conda config --set show_channel_urls yes && pip install -r requirements.txt && pip cache purge" did not complete successfully: exit code: 2
root@debian:/mnt/gpt_server-api#

Exception when running on Windows

Is Windows not supported? After creating a fresh conda environment and running python main.py, it prints:
The "freeze_support()" line can be omitted if the program is not going to be frozen to produce an executable.

Cannot start in CPU-only mode

Latest version; this is the config file:

  - qwen:  # custom model name
      alias: gpt-4,gpt-3.5-turbo,gpt-3.5-turbo-16k # aliases, e.g. gpt4,gpt3
      enable: true  # false / true
      model_name_or_path: /home/k/Qwen1.5-7B-Chat-GPTQ-Int4/
      model_type: qwen  # qwen / chatglm3 / yi / internlm
      work_mode: vllm  # vllm / hf
      device: cpu  # gpu / cpu
      workers:
      - gpus:
        - 0

After startup it reports: ImportError: libcuda.so.1: cannot open shared object file: No such file or directory. (Note that this config selects work_mode: vllm, which loads CUDA libraries, so device: cpu alone does not remove the GPU requirement.)
