
gpt_server logo

GPT Server

This project builds on FastChat's core capabilities to provide an OpenAI-compatible API server.

  1. On top of that, it cleanly adapts more models and fixes models that FastChat supports poorly
  2. Re-adapts models that vLLM handles poorly, fixing decoded output that did not match HF
  3. Supports vLLM, LMDeploy, and HF loading backends
  4. Supports all semantic-vector models compatible with sentence_transformers (Embedding and Reranker)
  5. Supports the Infinity backend: faster inference than onnx/tensorrt, with dynamic batching
  6. Chat templates place no restrictions on roles, so the LangGraph Agent framework is fully supported
  7. Supports Function Calling (Tools) (currently Qwen/ChatGLM, with better support for Qwen)
  8. Supports multimodal LLMs
  9. Lowers the difficulty of adapting models and of using the project (adapting a new model takes fewer than 5 lines of changed code), making it easier to deploy your own latest models

(The repository is still under construction and has not gone through thorough regression testing, so already-adapted models may break. Suggestions for improvements or for adapting new models are welcome.)


Features

  1. Supports multiple inference backend engines, vLLM and LMDeploy; the LMDeploy backend handles 1.36–1.85× as many requests per second as vLLM
  2. Supports the Infinity backend: faster inference than onnx/tensorrt, with dynamic batching
  3. The only open-source framework that fully supports Tools (Function Calling), compatible with LangChain's bind_tools, AgentExecutor, and with_structured_output usage (currently Qwen and GLM series; see the sketch after this list)
  4. The only project that extends the openai library to serve Reranker models (see gpt_server/tests/test_openai_rerank.py for sample code)
  5. Supports multimodal LLMs
  6. Same distributed architecture as FastChat
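A minimal sketch of the LangChain tool-calling usage referenced in item 3. The model name "qwen", the port, and the get_weather tool are illustrative assumptions and must match your own config.yaml:

from langchain_openai import ChatOpenAI
from langchain_core.tools import tool

@tool
def get_weather(city: str) -> str:
    """Look up the weather for a city (hypothetical demo tool)."""
    return f"{city}: sunny"

# Point LangChain's OpenAI-compatible client at the local gpt_server endpoint.
llm = ChatOpenAI(model="qwen", api_key="EMPTY", base_url="http://localhost:8082/v1")
llm_with_tools = llm.bind_tools([get_weather])

# The model should respond with a tool call rather than plain text.
msg = llm_with_tools.invoke("What's the weather in Beijing?")
print(msg.tool_calls)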

Changelog

8-14  Added the InternVL2 series of multimodal models
7-28  Added dynamic batching for embedding/reranker (Infinity backend, faster than onnx/tensorrt)
7-19  Added the multimodal model glm-4v-9b on the LMDeploy PyTorch backend
6-22  Added function call (tools) support for the Qwen and ChatGLM series
6-12  Added qwen-2
6-5   Added the Yinka and zpoint_large_embedding_zh embedding models
6-5   Added the glm4-9b series (hf and vllm)
4-27  Added the LMDeploy accelerated inference backend
4-20  Added llama-3
4-13  Added deepseek
4-4   Added the embedding model acge_text_embedding
3-9   Added reranker models (bge-reranker, bce-reranker-base_v1)
3-3   Added internlm-1.0 and internlm-2.0
3-2   Added qwen-1.5 (0.5B, 1.8B, 4B, 7B, 14B, and 72B)
2-4   Added the vllm implementation
1-6   Added Yi-34B
12-31 Added qwen-7b and qwen-14b
12-30 Added all-embedding (in principle, all text-embedding models are supported)
12-24 Added chatglm3-6b

Roadmap

  • Support the HF backend
  • Support the vLLM backend
  • Support the LMDeploy backend
  • Support function call (tools) (Qwen and ChatGLM series already supported; will be extended on demand)
  • Support multimodal models (initial support for glm-4v; other models to follow)
  • Support dynamic batching for Embedding models (implemented via the Infinity backend)
  • Support dynamic batching for Reranker models (implemented via the Infinity backend)
  • Built-in tools (image_gen, code_interpreter, weather, etc.)
  • Parallel function call (tools)

Getting Started

Run with Python

1. Set up the Python environment

# 1. Create the conda environment
conda create -n gpt_server python=3.10

# 2. Activate the conda environment
conda activate gpt_server

# 3. Install the repository (be sure to install with install.sh, otherwise the dependency conflicts cannot be resolved)
sh install.sh

2. Edit the launch configuration file

Set the model backend (vllm, lmdeploy, etc.)

In config.yaml:

work_mode: vllm  # vllm / hf / lmdeploy-turbomind / lmdeploy-pytorch

Set the embedding/reranker backend (embedding or embedding_infinity)

In config.yaml:

model_type: embedding # embedding or embedding_infinity; the embedding_infinity backend is far faster than embedding

config.yaml

cd gpt_server/script
vim config.yaml
serve_args:
  host: 0.0.0.0
  port: 8082

models:
  - chatglm4:  # custom model name
      alias: null # aliases, e.g. gpt4,gpt3
      enable: true  # false / true
      model_name_or_path: /home/dev/model/THUDM/glm-4-9b-chat/
      model_type: chatglm  # qwen / chatglm3 / yi / internlm
      work_mode: vllm  # vllm / hf / lmdeploy-turbomind / lmdeploy-pytorch
      device: gpu  # gpu / cpu
      workers:
      - gpus:
        # - 1
        - 0

  - qwen:  # custom model name
      alias: gpt-4,gpt-3.5-turbo,gpt-3.5-turbo-16k # aliases, e.g. gpt4,gpt3
      enable: true  # false / true
      model_name_or_path: /home/dev/model/qwen/Qwen1___5-14B-Chat/
      model_type: qwen  # qwen / chatglm3 / yi / internlm
      work_mode: vllm  # vllm / hf / lmdeploy-turbomind / lmdeploy-pytorch
      device: gpu  # gpu / cpu
      workers:
      - gpus:
        - 1
      # - gpus:
      #   - 3

  # Embedding model
  - bge-base-zh:
      alias: null # aliases
      enable: true  # false / true
      model_name_or_path: /home/dev/model/Xorbits/bge-base-zh-v1___5/
      model_type: embedding # or embedding_infinity
      work_mode: hf
      device: gpu  # gpu / cpu
      workers:
      - gpus:
        - 2

  # Reranker model
  - bge-reranker-base:
      alias: null # aliases
      enable: true  # false / true
      model_name_or_path: /home/dev/model/Xorbits/bge-reranker-base/
      model_type: embedding # or embedding_infinity
      work_mode: hf
      device: gpu  # gpu / cpu
      workers:
      - gpus:
        - 2

3. Run the launch command

start.sh

cd gpt_server/script
sh start.sh
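Once the server is up, a quick smoke test (a sketch, assuming the default port 8082 from config.yaml) is to list the models it exposes through the OpenAI-compatible API:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8082/v1")
# Every enabled model (and its aliases) should appear here.
for model in client.models.list():
    print(model.id)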

Supported Models and Inference Backends

Inference speed: LMDeploy TurboMind > vllm > LMDeploy PyTorch > HF

LLM

Models / BackEnd     | HF | vllm | LMDeploy TurboMind | LMDeploy PyTorch
chatglm4-9b          | √  | √    | √                  | √
chatglm3-6b          | √  | √    | ×                  | √
Qwen (7B, 14B, etc.) | √  | √    | √                  | √
Qwen-1.5 (0.5B–72B)  | √  | √    | √                  | √
Qwen-2               | √  | √    | √                  | √
Yi-34B               | √  | √    | √                  | √
Internlm-1.0         | √  | √    | √                  | √
Internlm-2.0         | √  | √    | √                  | √
Deepseek             | √  | √    | √                  | √
Llama-3              | √  | √    | √                  | √

Multimodal / BackEnd | HF | vllm | LMDeploy TurboMind | LMDeploy PyTorch
glm-4v-9b            | ×  | ×    | ×                  | √
InternVL2            | ×  | ×    | √                  | √

Embedding Models

In principle, all Embedding/Rerank models are supported.

Inference speed: Infinity >> HF

The following models have been tested:

Embedding/Rerank          | HF | Infinity
bge-reranker              | √  | √
bce-reranker              | √  | √
bge-embedding             | √  | √
bce-embedding             | √  | √
piccolo-base-zh-embedding | √  | √
acge_text_embedding       | √  | √
Yinka                     | √  | √
zpoint_large_embedding_zh | √  | √
xiaobu-embedding          | √  | √

At the time of writing, xiaobu-embedding ranks first on the C-MTEB leaderboard (MTEB: https://huggingface.co/spaces/mteb/leaderboard)
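A minimal embedding request against one of the models above (a sketch; the model name "bge-base-zh" and the address are assumptions that must match your config.yaml):

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8082/v1")
resp = client.embeddings.create(model="bge-base-zh", input=["你是谁", "你多大了"])
# One vector per input sentence.
print(len(resp.data), len(resp.data[0].embedding))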

4. Call the server with the openai library

See the gpt_server/tests directory for sample test code: https://github.com/shell-nlp/gpt_server/tree/main/tests
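For reference, a minimal chat call (a sketch; the model name "chatglm4" is an assumption that must match a model enabled in your config.yaml):

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8082/v1")
resp = client.chat.completions.create(
    model="chatglm4",
    messages=[{"role": "user", "content": "你好"}],
)
print(resp.choices[0].message.content)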

5. Use the WebUI

cd gpt_server/tests
python web_demo.py

WebUI screenshot:

web_demo.png

Install with Docker

1. Build the image

docker build --rm -f "Dockerfile" -t gpt_server:latest "." 

2. Start with Docker Compose

docker-compose -f "docker-compose.yml" up -d --build gpt_server

Acknowledgements

FastChat: https://github.com/lm-sys/FastChat

vLLM: https://github.com/vllm-project/vllm

LMDeploy: https://github.com/InternLM/lmdeploy

Infinity: https://github.com/michaelfeil/infinity

Star History

Star History Chart


gpt_server's Issues

Too many dependencies

The repository seems to pull in a lot of dependencies, and downloading them takes too long. Could they be trimmed down?

Embedding model still loads onto the GPU when CPU is specified

Looking at the code: when CPU is specified in the config file, the only effect is setting the environment variable CUDA_VISIBLE_DEVICES = ''. But when sentence_transformers loads the model, no device argument is passed, so it falls back to its own default device selection, detects a GPU, and loads onto the GPU anyway.

Suggested change, for reference:

+++ gpt_server/model_worker/embedding.py
@@ -1,3 +1,4 @@
+import os
 from typing import List
 from gpt_server.model_worker.base import ModelWorkerBase
 import sentence_transformers
@@ -27,17 +28,27 @@
         # model_kwargs = {"device": "cuda"}
         self.encode_kwargs = {"normalize_embeddings": True, "batch_size": 64}
         self.mode = "embedding"
+        self.pool = {}
+        self.device = "cuda"
+        if not os.environ.get("CUDA_VISIBLE_DEVICES", ""):
+            self.device = "cpu"
         # rerank
         for model_name in model_names:
             if "rerank" in model_name:
                 self.mode = "rerank"
                 break
         if self.mode == "rerank":
-            self.client = sentence_transformers.CrossEncoder(model_name=model_path)
-            print("正在使用 rerank 模型...")
+            self.client = sentence_transformers.CrossEncoder(model_name=model_path, device=self.device)
+            print("正在使用 rerank 模型...", self.device)
         elif self.mode == "embedding":
-            self.client = sentence_transformers.SentenceTransformer(model_path)
-            print("正在使用 embedding 模型...")
+            self.client = sentence_transformers.SentenceTransformer(model_path, device=self.device)
+            if self.device == "cpu":
+                self.pool = self.client.start_multi_process_pool(["cpu", "cpu", "cpu", "cpu"])
+            print("正在使用 embedding 模型...", self.device)

     def generate_stream_gate(self, params):
         pass

(start_multi_process_pool spawns one worker process per listed device, so ["cpu", "cpu", "cpu", "cpu"] parallelizes CPU-side encoding across four processes.)

HTTP API request to the reranker model fails

Request:

POST http://10.0.80.31:8082/v1/embeddings
#Authorization: Bearer EMPTY
Content-Type: application/json

{
  "model": "bce-reranker-base_v1",
  "input": [
    "你是谁",
    "今年几岁",
    "你多大了"
  ],
  "extra_json": {
    "query": "你多大了"
  }
}

Response:

HTTP/1.1 400 Bad Request
date: Mon, 01 Jul 2024 08:37:55 GMT
server: uvicorn
content-length: 65
content-type: application/json

{
  "object": "error",
  "message": "Internal Server Error",
  "code": 50001
}

Server error stack trace:

2024-07-01 08:41:47.905 | INFO     | __main__:get_embeddings:58 - params {'model': 'bce-reranker-base_v1', 'input': ['你是谁', '今年几岁', '你多大了'], 'encoding_format': None, 'query': None}
2024-07-01 08:41:47.905 | INFO     | __main__:get_embeddings:59 - worker_id: 4f496ce6
Batches:   0%|                                                                                                                                                                                 | 0/1 [00:00<?, ?it/s]
Batches:   0%|                                                                                                                                                                                 | 0/1 [00:00<?, ?it/s]
2024-07-01 08:41:47 | ERROR | stderr |
INFO:     127.0.0.1:58486 - "POST /worker_get_embeddings HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/model_worker/base.py", line 293, in api_get_embeddings
    embedding = await worker.get_embeddings(params)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/model_worker/embedding.py", line 75, in get_embeddings
    scores = self.client.predict(sentence_pairs)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/sentence_transformers/cross_encoder/CrossEncoder.py", line 338, in predict
    for features in iterator:
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/ubuntu/lxr/gpt_server-main/gpt_server/lib/python3.10/site-packages/sentence_transformers/cross_encoder/CrossEncoder.py", line 138, in smart_batching_collate_text_only
    texts[idx].append(text.strip())
AttributeError: 'NoneType' object has no attribute 'strip'
2024-07-01 08:41:47 | INFO | stdout | INFO:     10.0.12.165:54154 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request

Requesting via the Python openai client works fine. (Note that the raw request above nests query inside extra_json, so the server receives query: None, as the log shows; the openai client's extra_body merges query into the top level of the JSON body.)

from openai import OpenAI

# new-style openai client
client = OpenAI(api_key="EMPTY", base_url="http://10.0.80.31:8082/v1")
data = client.embeddings.create(
    model="bce-reranker-base_v1",
    input=["你是谁", "今年几岁", "你多大了"],
    extra_body={"query": "你多大了"},
)

print(data.data)

[Help] Docker image build fails

root@debian:/mnt/gpt_server-api# docker build -t gpt_server:v0.2.1 .
[+] Building 0.7s (8/8) FINISHED docker:default
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 674B 0.0s
=> [internal] load metadata for docker.io/continuumio/miniconda3:main 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 88B 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 31.27kB 0.0s
=> [1/4] FROM docker.io/continuumio/miniconda3:main 0.0s
=> CACHED [2/4] COPY ./ /gpt_server 0.0s
=> CACHED [3/4] WORKDIR /gpt_server 0.0s
=> ERROR [4/4] RUN sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list && 0.6s

[4/4] RUN sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list && sed -i 's/security.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list && pip config set global.index-url https://mirrors.ustc.edu.cn/pypi/web/simple && conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/ && conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/ && conda config --set show_channel_urls yes && pip install -r requirements.txt && pip cache purge:
0.496 sed: can't read /etc/apt/sources.list: No such file or directory


Dockerfile:7

6 |
7 | >>> RUN sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list &&
8 | >>> sed -i 's/security.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list &&
9 | >>> pip config set global.index-url https://mirrors.ustc.edu.cn/pypi/web/simple &&
10 | >>> conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/ &&
11 | >>> conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/ &&
12 | >>> conda config --set show_channel_urls yes &&
13 | >>> pip install -r requirements.txt && pip cache purge
14 | CMD ["/bin/bash"]

ERROR: failed to solve: process "/bin/sh -c sed -i 's/deb.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list && sed -i 's/security.debian.org/mirrors.ustc.edu.cn/g' /etc/apt/sources.list && pip config set global.index-url https://mirrors.ustc.edu.cn/pypi/web/simple && conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/ && conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/ && conda config --set show_channel_urls yes && pip install -r requirements.txt && pip cache purge" did not complete successfully: exit code: 2
root@debian:/mnt/gpt_server-api#

Exception when running on Windows

Is Windows not supported? After creating a fresh conda environment and running python main.py, it prints:
The "freeze_support()" line can be omitted if the program is not going to be frozen to produce an executable.

Cannot start in CPU-only mode

Latest version; this is the config file:

  - qwen:  # custom model name
      alias: gpt-4,gpt-3.5-turbo,gpt-3.5-turbo-16k # aliases, e.g. gpt4,gpt3
      enable: true  # false / true
      model_name_or_path: /home/k/Qwen1.5-7B-Chat-GPTQ-Int4/
      model_type: qwen  # qwen / chatglm3 / yi / internlm
      work_mode: vllm  # vllm / hf
      device: cpu  # gpu / cpu
      workers:
      - gpus:
        - 0

After startup it reports: ImportError: libcuda.so.1: cannot open shared object file: No such file or directory. (Note that this config selects work_mode: vllm, which loads CUDA libraries, so device: cpu alone does not remove the GPU requirement.)
