
Comments (2)

qinxuye commented on June 20, 2024

Could you paste the complete log?


yinghaodang commented on June 20, 2024

Let me go over my steps so the issue can be reproduced.

version: '3.8'

services:
  xinference-local:
    image: xprobe/xinference:v0.9.0
    container_name: xinference-local
    ports:
      - 9998:9997
    environment:
      - XINFERENCE_MODEL_SRC=modelscope
      - XINFERENCE_HOME=/root/MODEL_PATH
    volumes:
      - ${MODEL_PATH}:/root/MODEL_PATH
    restart: always
    shm_size: '128g'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: xinference-local -H 0.0.0.0 --log-level debug
    networks:
      - xinference-local
networks:
  xinference-local:
    driver: bridge
    ipam:
      driver: default
      config:
        - subnet: "172.30.2.0/24"

${MODEL_PATH} is a directory whose subdirectories are the individual model directories.
My GPUs are 4 NVIDIA A100s. On the deployment page I deployed the 72B int4-quantized Qwen (a model downloaded through the Xinference platform). I also tested the Baichuan2-7b-chat model the same way; in both cases the model had exclusive use of all 4 cards, and the PyTorch model type was selected.
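
For reference, the same deployment can presumably also be done through the xinference Python client instead of the web page. The sketch below is only an assumption on my side; the model_name, model_format and quantization values are placeholders and would need to match what was actually selected in the UI:

from xinference.client import Client

# Connect to the endpoint exposed by the compose file above
# (host port 9998 is mapped to the container's port 9997).
client = Client("http://localhost:9998")

# Placeholder launch call for the 72B int4-quantized Qwen; the kwargs must
# match the model as it is registered in Xinference.
model_uid = client.launch_model(
    model_name="qwen-chat",
    model_size_in_billions=72,
    model_format="gptq",
    quantization="Int4",
)
print(model_uid)
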
Below is the code used to test the large model.

from ragas import evaluate

from langchain.chat_models import ChatOpenAI
from ragas.llms.base import LangchainLLMWrapper

inference_server_url = ""
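# NOTE (my assumption, not in the original snippet): the empty URL above should
# point to the OpenAI-compatible endpoint exposed by xinference, e.g.
# "http://<host>:9998/v1" given the 9998:9997 port mapping in the compose file.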

# create vLLM Langchain instance
chat = ChatOpenAI(
   model="Baichuan2-13B-Chat",
   openai_api_key="no-key",
   openai_api_base=inference_server_url,
   max_tokens=1024,
   temperature=0.5,
)

# use the Ragas LangchainLLM wrapper to create a RagasLLM instance
vllm = LangchainLLMWrapper(chat)

from ragas.metrics import (
   context_precision,
   answer_relevancy,
   faithfulness,
   context_recall,
   context_relevancy,
   answer_correctness,
   answer_similarity
)
from ragas.metrics.critique import harmfulness

# change the LLM

faithfulness.llm = vllm
context_relevancy.llm = vllm
context_recall.llm = vllm
context_precision.llm = vllm
answer_similarity.llm = vllm
answer_relevancy.llm = vllm # Invalid key: 0 is out of bounds for size 0
answer_correctness.llm = vllm
harmfulness.llm = vllm

from langchain.embeddings import HuggingFaceEmbeddings

modelPath = "bge-large-zh"
# Create a dictionary with model configuration options, specifying to use the CPU for computations
model_kwargs = {'device':'cpu'}
# Create a dictionary with encoding options, setting 'normalize_embeddings' to True
encode_kwargs = {'normalize_embeddings': True}
# Initialize an instance of HuggingFaceEmbeddings with the specified parameters
embeddings = HuggingFaceEmbeddings(
   model_name=modelPath,     # Provide the pre-trained model's path
   model_kwargs=model_kwargs, # Pass the model configuration options
   encode_kwargs=encode_kwargs # Pass the encoding options
)

answer_relevancy.embeddings = embeddings
answer_correctness.embeddings = embeddings
answer_similarity.embeddings = embeddings

# evaluate
from ragas import evaluate

result = evaluate(
   dataset,
   metrics=[context_precision], # 1
)

print(result)
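
The script above references a dataset variable that is never defined in the snippet; it would need to be built before the evaluate() call. A minimal sketch of what it is expected to look like (the column names are assumptions and may vary slightly between ragas versions):

from datasets import Dataset

# Tiny example dataset in the shape the ragas metrics consume; "question",
# "answer", "contexts" (a list of strings per row) and "ground_truth" are
# assumed column names.
dataset = Dataset.from_dict({
    "question": ["What is Xinference?"],
    "answer": ["Xinference is a platform for serving LLMs."],
    "contexts": [["Xinference lets you deploy and serve LLMs and embedding models."]],
    "ground_truth": ["Xinference is an open-source model serving platform."],
})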

The test code is roughly as shown above; it has been edited back and forth many times, so it may contain bugs. The part that actually calls the large model is really just the final call:

result = evaluate(
   dataset,
   metrics=[context_precision], # 1
)

In my test the dataset only had about 100 rows. Watching nvidia-smi in the background, GPU memory usage keeps climbing until the memory is completely full, and then the reported usage drops to 0.
The backend log is just an ordinary out-of-memory error. (The cards are currently in use for something else; I will paste the log once they are free again.)
After that, the web frontend hangs: I can no longer view the running models, and I cannot redeploy a model either.
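
One thing that might help narrow this down (my assumption, not something I have verified) is to run evaluate() over small slices of the dataset instead of all ~100 rows at once, watching nvidia-smi between slices to see at which point memory starts climbing:

# Hypothetical debugging sketch: evaluate in slices of 10 rows so the slice at
# which GPU memory blows up can be identified; reuses dataset, evaluate and
# context_precision from the script above.
for start in range(0, len(dataset), 10):
    chunk = dataset.select(range(start, min(start + 10, len(dataset))))
    partial = evaluate(chunk, metrics=[context_precision])
    print(f"rows {start}-{start + len(chunk) - 1}: {partial}")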

