Comments (5)
Confirmed: this issue is caused by the VLLM_USE_MODELSCOPE environment variable set in the registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda11.8.0-py310-torch2.1.0-tf2.14.0-1.10.0 image; setting it to False avoids the 404-related problems.
In addition, vLLM can be disabled entirely via the XINFERENCE_DISABLE_VLLM=1 environment variable, which helps because some vLLM builds have poor support for GPUs with older architectures.
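For example, both switches can be applied when starting the container. A minimal sketch, assuming an image built from the Dockerfile quoted later in this thread (the image name `my-xinference-image` is a placeholder; port 8000 matches the logs below):

```shell
# Override the variable baked into the modelscope base image to avoid the 404:
docker run --gpus all -p 8000:8000 \
  -e VLLM_USE_MODELSCOPE=False \
  my-xinference-image \
  xinference-local -H 0.0.0.0 --port 8000

# Alternatively, on cards whose architecture vLLM supports poorly, bypass vLLM:
#   docker run ... -e XINFERENCE_DISABLE_VLLM=1 ...
```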
@wertycn Downloading from modelscope works fine on my side; I cannot reproduce your situation. ZhipuAI/chatglm3-6b is a public repository that requires no login, and every model xinference downloads comes from a public repository, so an API key is never needed for downloading. chatglm3-6b normally runs fine on vLLM.
Users in the modelscope community have reported a similar problem: inference with vLLM fails with an authentication error. It may be the same issue.
The runtime environment image is as follows:
FROM registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda11.8.0-py310-torch2.1.0-tf2.14.0-1.10.0
RUN pip install "xinference[all]"
I ran it again; the full log and stack trace follow:
2024-03-11 13:56:46,269 xinference.core.supervisor 116 INFO Xinference supervisor 0.0.0.0:12406 started
2024-03-11 13:56:47,475 xinference.core.worker 116 INFO Starting metrics export server at 0.0.0.0:None
2024-03-11 13:56:47,478 xinference.core.worker 116 INFO Checking metrics export server...
2024-03-11 13:56:47,767 xinference.core.worker 116 INFO Metrics server is started at: http://0.0.0.0:40579
2024-03-11 13:56:47,768 xinference.core.worker 116 INFO Xinference worker 0.0.0.0:12406 started
2024-03-11 13:56:47,769 xinference.core.worker 116 INFO Purge cache directory: /data/xinterface/cache
2024-03-11 13:56:54,580 xinference.api.restful_api 11 INFO Starting Xinference at endpoint: http://0.0.0.0:8000
2024-03-11 13:57:23,251 xinference.model.llm.llm_family 116 INFO Caching from Modelscope: ZhipuAI/chatglm3-6b
2024-03-11 13:57:23,298 - modelscope - INFO - PyTorch version 2.1.2 Found.
2024-03-11 13:57:23,300 - modelscope - INFO - TensorFlow version 2.14.0 Found.
2024-03-11 13:57:23,300 - modelscope - INFO - Loading ast index from /data/xinterface/modelscope/ast_indexer
2024-03-11 13:57:23,364 - modelscope - INFO - Loading done! Current index file version is 1.10.0, with md5 44f0b88effe82ceea94a98cf99709694 and a total number of 946 components indexed
2024-03-11 13:57:24,695 xinference.model.llm.llm_family 116 INFO Cache /data/xinterface/cache/chatglm3-pytorch-6b exists
2024-03-11 13:57:24,706 xinference.model.llm.vllm.core 135 INFO Loading chatglm3 with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 1, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}
2024-03-11 13:57:24,747 - modelscope - INFO - PyTorch version 2.1.2 Found.
2024-03-11 13:57:24,749 - modelscope - INFO - TensorFlow version 2.14.0 Found.
2024-03-11 13:57:24,749 - modelscope - INFO - Loading ast index from /data/xinterface/modelscope/ast_indexer
2024-03-11 13:57:24,799 - modelscope - INFO - Loading done! Current index file version is 1.10.0, with md5 44f0b88effe82ceea94a98cf99709694 and a total number of 946 components indexed
2024-03-11 13:57:26,203 - modelscope - ERROR - Authentication token does not exist, failed to access model /data/xinterface/cache/chatglm3-pytorch-6b which may not exist or may be private. Please login first.
2024-03-11 13:57:26,206 xinference.core.worker 116 ERROR Failed to load model chatglm3-1-0
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xinference/core/worker.py", line 569, in launch_builtin_model
await model_ref.load()
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
return self._process_result_message(result)
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 657, in send
result = await self._run_coro(message.message_id, coro)
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 368, in _run_coro
return await coro
File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
return await super().__on_receive__(message) # type: ignore
File "xoscar/core.pyx", line 558, in __on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 524, in xoscar.core._BaseActor.__on_receive__
result = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/xinference/core/model.py", line 239, in load
self._model.load()
File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 139, in load
self._engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 617, in from_engine_args
engine_configs = engine_args.create_engine_configs()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 272, in create_engine_configs
model_config = ModelConfig(self.model, self.tokenizer,
File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 96, in __init__
model_path = snapshot_download(model_id=model,
File "/opt/conda/lib/python3.10/site-packages/modelscope/hub/snapshot_download.py", line 96, in snapshot_download
revision = _api.get_valid_revision(
File "/opt/conda/lib/python3.10/site-packages/modelscope/hub/api.py", line 478, in get_valid_revision
all_revisions = self.list_model_revisions(
File "/opt/conda/lib/python3.10/site-packages/modelscope/hub/api.py", line 448, in list_model_revisions
handle_http_response(r, logger, cookies, model_id)
File "/opt/conda/lib/python3.10/site-packages/modelscope/hub/errors.py", line 98, in handle_http_response
raise HTTPError('Response details: %s, Request id: %s' %
requests.exceptions.HTTPError: [address=0.0.0.0:38863, pid=135] Response details: 404 page not found, Request id: 5b04d587cfc04c1db2c078a62a3fcf40
2024-03-11 13:57:26,256 xinference.api.restful_api 11 ERROR [address=0.0.0.0:38863, pid=135] Response details: 404 page not found, Request id: 5b04d587cfc04c1db2c078a62a3fcf40
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xinference/api/restful_api.py", line 689, in launch_model
model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
return self._process_result_message(result)
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 657, in send
result = await self._run_coro(message.message_id, coro)
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 368, in _run_coro
return await coro
File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
return await super().__on_receive__(message) # type: ignore
File "xoscar/core.pyx", line 558, in __on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
result = await result
File "/opt/conda/lib/python3.10/site-packages/xinference/core/supervisor.py", line 803, in launch_builtin_model
await _launch_model()
File "/opt/conda/lib/python3.10/site-packages/xinference/core/supervisor.py", line 767, in _launch_model
await _launch_one_model(rep_model_uid)
File "/opt/conda/lib/python3.10/site-packages/xinference/core/supervisor.py", line 748, in _launch_one_model
await worker_ref.launch_builtin_model(
File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
async with lock:
File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
result = await result
File "/opt/conda/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
ret = await func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/xinference/core/worker.py", line 569, in launch_builtin_model
await model_ref.load()
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
return self._process_result_message(result)
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 657, in send
result = await self._run_coro(message.message_id, coro)
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 368, in _run_coro
return await coro
File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
return await super().__on_receive__(message) # type: ignore
File "xoscar/core.pyx", line 558, in __on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 524, in xoscar.core._BaseActor.__on_receive__
result = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/xinference/core/model.py", line 239, in load
self._model.load()
File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 139, in load
self._engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 617, in from_engine_args
engine_configs = engine_args.create_engine_configs()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 272, in create_engine_configs
model_config = ModelConfig(self.model, self.tokenizer,
File "/opt/conda/lib/python3.10/site-packages/vllm/config.py", line 96, in __init__
model_path = snapshot_download(model_id=model,
File "/opt/conda/lib/python3.10/site-packages/modelscope/hub/snapshot_download.py", line 96, in snapshot_download
revision = _api.get_valid_revision(
File "/opt/conda/lib/python3.10/site-packages/modelscope/hub/api.py", line 478, in get_valid_revision
all_revisions = self.list_model_revisions(
File "/opt/conda/lib/python3.10/site-packages/modelscope/hub/api.py", line 448, in list_model_revisions
handle_http_response(r, logger, cookies, model_id)
File "/opt/conda/lib/python3.10/site-packages/modelscope/hub/errors.py", line 98, in handle_http_response
raise HTTPError('Response details: %s, Request id: %s' %
requests.exceptions.HTTPError: [address=0.0.0.0:38863, pid=135] Response details: 404 page not found, Request id: 5b04d587cfc04c1db2c078a62a3fcf40
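For context on where the 404 comes from: with VLLM_USE_MODELSCOPE set, vLLM's config.py (line 96 in the trace above) forwards whatever model string it received to modelscope's snapshot_download as a model_id. xinference passes the local cache directory, and the hub API rejects a filesystem path as a model id. A minimal repro sketch of the same call (path taken from the logs above):

```shell
# Reproduces the failing call from the traceback: modelscope's hub treats the
# argument as a remote model_id, and a local directory path is not a valid id,
# hence "404 page not found" / "Please login first".
python -c "
from modelscope.hub.snapshot_download import snapshot_download
snapshot_download(model_id='/data/xinterface/cache/chatglm3-pytorch-6b')
"
```

With VLLM_USE_MODELSCOPE=False, vLLM instead treats the string as a local path (or Hugging Face repo id), which is why the workaround in the first comment holds.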
Starting with the official image xprobe/xinference:v0.9.2 fails with the error below due to a CUDA version mismatch (the full stack trace is at the end). One possible cause is that some environment variables set in the modelscope image lead to abnormal execution paths; I will build a clean CUDA 11.8 image later and try again.
RuntimeError: [address=0.0.0.0:44904, pid=79] CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
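"no kernel image is available for execution on the device" usually means the installed PyTorch/vLLM binaries were not compiled for this GPU's compute capability, which fits the earlier note about vLLM and older card architectures. A quick check inside the container (a sketch using standard torch APIs):

```shell
# Print the GPU's compute capability and the CUDA version torch was built with:
python -c "import torch; print(torch.cuda.get_device_capability(), torch.version.cuda)"
```

vLLM releases of this era targeted compute capability 7.0 and newer, so older cards (e.g. 6.x) fail in exactly this way; XINFERENCE_DISABLE_VLLM=1 sidesteps it.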
Full log and stack trace:
2024-03-11 06:05:15,909 xinference.core.supervisor 62 INFO Xinference supervisor 0.0.0.0:24067 started
2024-03-11 06:05:18,191 xinference.core.worker 62 INFO Starting metrics export server at 0.0.0.0:None
2024-03-11 06:05:18,197 xinference.core.worker 62 INFO Checking metrics export server...
2024-03-11 06:05:18,318 xinference.core.worker 62 INFO Metrics server is started at: http://0.0.0.0:44346
2024-03-11 06:05:18,320 xinference.core.worker 62 INFO Xinference worker 0.0.0.0:24067 started
2024-03-11 06:05:18,320 xinference.core.worker 62 INFO Purge cache directory: /data/xinterface/cache
2024-03-11 06:05:24,113 xinference.api.restful_api 7 INFO Starting Xinference at endpoint: http://0.0.0.0:8000
2024-03-11 06:05:48,777 xinference.model.llm.llm_family 62 INFO Caching from Hugging Face: THUDM/chatglm3-6b
2024-03-11 06:05:48,777 xinference.model.llm.llm_family 62 WARNING Cache /data/xinterface/cache/chatglm3-pytorch-6b exists, but it was from modelscope
2024-03-11 06:05:48,784 xinference.model.llm.vllm.core 79 INFO Loading chatglm3 with following model config: {'tokenizer_mode': 'auto', 'trust_remote_code': True, 'tensor_parallel_size': 1, 'block_size': 16, 'swap_space': 4, 'gpu_memory_utilization': 0.9, 'max_num_seqs': 256, 'quantization': None, 'max_model_len': 4096}
INFO 03-11 06:05:48 llm_engine.py:72] Initializing an LLM engine with config: model='/data/xinterface/cache/chatglm3-pytorch-6b', tokenizer='/data/xinterface/cache/chatglm3-pytorch-6b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, seed=0)
WARNING 03-11 06:05:48 tokenizer.py:64] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
2024-03-11 06:06:03,133 xinference.core.worker 62 ERROR Failed to load model chatglm3-1-0
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xinference/core/worker.py", line 569, in launch_builtin_model
await model_ref.load()
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
return self._process_result_message(result)
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 657, in send
result = await self._run_coro(message.message_id, coro)
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 368, in _run_coro
return await coro
File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
return await super().__on_receive__(message) # type: ignore
File "xoscar/core.pyx", line 558, in __on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 524, in xoscar.core._BaseActor.__on_receive__
result = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/xinference/core/model.py", line 239, in load
self._model.load()
File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 139, in load
self._engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 623, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 319, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 364, in _init_engine
return engine_class(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 114, in __init__
self._init_cache()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 308, in _init_cache
num_blocks = self._run_workers(
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 983, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 116, in profile_num_available_blocks
self.model_runner.profile_run()
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 599, in profile_run
self.execute_model(seqs, kv_caches)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 534, in execute_model
hidden_states = model_executable(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/chatglm.py", line 344, in forward
hidden_states = self.transformer(input_ids, positions, kv_caches,
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/chatglm.py", line 314, in forward
hidden_states = self.encoder(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1
518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/chatglm.py", line 271, in forward
hidden_states = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/chatglm.py", line 209, in forward
attention_output = self.self_attention(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/chatglm.py", line 104, in forward
qkv, _ = self.query_key_value(hidden_states)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 211, in forward
output_parallel = self.linear_method.apply_weights(
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 72, in apply_weights
return F.linear(x, weight, bias)
RuntimeError: [address=0.0.0.0:44904, pid=79] CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2024-03-11 06:06:03,307 xinference.api.restful_api 7 ERROR [address=0.0.0.0:44904, pid=79] CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/xinference/api/restful_api.py", line 689, in launch_model
model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
return self._process_result_message(result)
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 657, in send
result = await self._run_coro(message.message_id, coro)
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 368, in _run_coro
return await coro
File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
return await super().__on_receive__(message) # type: ignore
File "xoscar/core.pyx", line 558, in __on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
result = await result
File "/opt/conda/lib/python3.10/site-packages/xinference/core/supervisor.py", line 803, in launch_builtin_model
await _launch_model()
File "/opt/conda/lib/python3.10/site-packages/xinference/core/supervisor.py", line 767, in _launch_model
await _launch_one_model(rep_model_uid)
File "/opt/conda/lib/python3.10/site-packages/xinference/core/supervisor.py", line 748, in _launch_one_model
await worker_ref.launch_builtin_model(
File "xoscar/core.pyx", line 284, in __pyx_actor_method_wrapper
async with lock:
File "xoscar/core.pyx", line 287, in xoscar.core.__pyx_actor_method_wrapper
result = await result
File "/opt/conda/lib/python3.10/site-packages/xinference/core/utils.py", line 45, in wrapped
ret = await func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/xinference/core/worker.py", line 569, in launch_builtin_model
await model_ref.load()
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 227, in send
return self._process_result_message(result)
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 657, in send
result = await self._run_coro(message.message_id, coro)
File "/opt/conda/lib/python3.10/site-packages/xoscar/backends/pool.py", line 368, in _run_coro
return await coro
File "/opt/conda/lib/python3.10/site-packages/xoscar/api.py", line 384, in __on_receive__
return await super().__on_receive__(message) # type: ignore
File "xoscar/core.pyx", line 558, in __on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 524, in xoscar.core._BaseActor.__on_receive__
result = func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/xinference/core/model.py", line 239, in load
self._model.load()
File "/opt/conda/lib/python3.10/site-packages/xinference/model/llm/vllm/core.py", line 139, in load
self._engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 623, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 319, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 364, in _init_engine
return engine_class(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 114, in __init__
self._init_cache()
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 308, in _init_cache
num_blocks = self._run_workers(
File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 983, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 116, in profile_num_available_blocks
self.model_runner.profile_run()
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-pack
ages/vllm/worker/model_runner.py", line 599, in profile_run
self.execute_model(seqs, kv_caches)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 534, in execute_model
hidden_states = model_executable(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/chatglm.py", line 344, in forward
hidden_states = self.transformer(input_ids, positions, kv_caches,
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/chatglm.py", line 314, in forward
hidden_states = self.encoder(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/chatglm.py", line 271, in forward
hidden_states = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/
lib/python3.10/site-packages/vllm/model_executor/models/chatglm.py", line 209, in forward
attention_output = self.self_attention(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/chatglm.py", line 104, in forward
qkv, _ = self.query_key_value(hidden_states)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 211, in forward
output_parallel = self.linear_method.apply_weights(
File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 72, in apply_weights
return F.linear(x, weight, bias)
RuntimeError: [address=0.0.0.0:44904, pid=79] CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
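One more detail from the log above worth acting on: the worker warned that the cache at /data/xinterface/cache/chatglm3-pytorch-6b "exists, but it was from modelscope" while this run cached from Hugging Face. When switching download sources, clearing the stale copy and pinning the source avoids mixing the two. A sketch (XINFERENCE_MODEL_SRC is xinference's model-source switch; paths and port as in the logs):

```shell
# Drop the modelscope-sourced copy so the Hugging Face download starts clean:
rm -rf /data/xinterface/cache/chatglm3-pytorch-6b
# Pin the download source before starting the server:
export XINFERENCE_MODEL_SRC=huggingface
xinference-local -H 0.0.0.0 --port 8000
```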