Comments (9)
It looks like it's connecting to 0.0.0.0. Try the machine's concrete IP instead, e.g.: xinference-worker -e http://10.100.108.220:9997 --log-level=debug
xinference-worker -e http://10.100.108.220:9997 --log-level=debug
The worker's IP was being passed in as a parameter; I've since hard-coded it in the sh script, but running it still gives the same error:
worker.sh
xinference-worker -e http://10.100.108.220:9997/ --log-level=debug
10.100.108.220
Are you sure the IP is correct?
10.100.108.220
Are you sure the IP is correct?
This is the server's current log; it looks correct to me:
2024-02-05 11:02:50,956 xinference.core.supervisor 269 INFO Xinference supervisor 10.100.108.220:25589 started
2024-02-05 11:02:55,956 xinference.core.supervisor 269 DEBUG Enter get_status, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fc16030c400>,), kwargs: {}
2024-02-05 11:02:55,956 xinference.core.supervisor 269 DEBUG Leave get_status, elapsed time: 0 s
2024-02-05 11:02:57,448 xinference.api.restful_api 140 INFO Starting Xinference at endpoint: http://10.100.108.220:9997
/usr/local/lib/python3.10/dist-packages/xinference/api/restful_api.py:476: UserWarning:
Xinference ui is not built at expected directory: /usr/local/lib/python3.10/dist-packages/xinference/web/ui/build/
To resolve this warning, navigate to /usr/local/lib/python3.10/dist-packages/xinference/web/ui/
And build the Xinference ui by running "npm run build"
warnings.warn(
2024-02-05 11:08:37,100 xinference.core.supervisor 269 DEBUG Enter add_worker, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fc16030c400>, '0.0.0.0:49736'), kwargs: {}
2024-02-05 11:40:37,037 xinference.core.supervisor 269 DEBUG Enter add_worker, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fc16030c400>, '0.0.0.0:50276'), kwargs: {}
2024-02-05 11:49:15,908 xinference.core.supervisor 269 DEBUG Enter add_worker, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fc16030c400>, '0.0.0.0:48767'), kwargs: {}
2024-02-05 13:54:49,201 xinference.core.supervisor 269 DEBUG Enter add_worker, args: (<xinference.core.supervisor.SupervisorActor object at 0x7fc16030c400>, '0.0.0.0:30828'), kwargs: {}
The supervisor log does show Enter add_worker. Is the worker still failing with the same error as at the start?
The supervisor log does show Enter add_worker. Is the worker still failing with the same error as at the start?
Yes, still the same:
2024-02-05 13:54:47,482 xinference.core.worker 121 INFO Starting metrics export server at 0.0.0.0:None
2024-02-05 13:54:47,483 xinference.core.worker 121 INFO Checking metrics export server...
2024-02-05 13:54:49,170 xinference.core.worker 121 INFO Metrics server is started at: http://0.0.0.0:41831
Traceback (most recent call last):
File "/usr/local/bin/xinference-worker", line 8, in <module>
sys.exit(worker())
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/xinference/deploy/cmdline.py", line 349, in worker
main(
File "/usr/local/lib/python3.10/dist-packages/xinference/deploy/worker.py", line 94, in main
loop.run_until_complete(task)
File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/usr/local/lib/python3.10/dist-packages/xinference/deploy/worker.py", line 65, in _start_worker
await start_worker_components(
File "/usr/local/lib/python3.10/dist-packages/xinference/deploy/worker.py", line 43, in start_worker_components
await xo.create_actor(
File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 78, in create_actor
return await ctx.create_actor(actor_cls, *args, uid=uid, address=address, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 143, in create_actor
return self._process_result_message(result)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 596, in create_actor
await self._run_coro(message.message_id, actor.__post_create__())
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 368, in _run_coro
return await coro
File "/usr/local/lib/python3.10/dist-packages/xinference/core/worker.py", line 179, in __post_create__
await self._supervisor_ref.add_worker(self.address)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 227, in send
return self._process_result_message(result)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 102, in _process_result_message
raise message.as_instanceof_cause()
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 657, in send
result = await self._run_coro(message.message_id, coro)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/pool.py", line 368, in _run_coro
return await coro
File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 384, in __on_receive__
return await super().__on_receive__(message) # type: ignore
File "xoscar/core.pyx", line 558, in __on_receive__
raise ex
File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
async with self._lock:
File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
with debug_async_timeout('actor_lock_timeout',
File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
result = await result
File "/usr/local/lib/python3.10/dist-packages/xinference/core/utils.py", line 44, in wrapped
ret = await func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/xinference/core/supervisor.py", line 917, in add_worker
worker_ref = await xo.actor_ref(address=worker_address, uid=WorkerActor.uid())
File "/usr/local/lib/python3.10/dist-packages/xoscar/api.py", line 125, in actor_ref
return await ctx.actor_ref(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 196, in actor_ref
future = await self._call(actor_ref.address, message, wait=False)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/context.py", line 77, in _call
return await self._caller.call(
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/core.py", line 180, in call
client = await self.get_client(router, dest_address)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/core.py", line 68, in get_client
client = await router.get_client(dest_address, from_who=self)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/router.py", line 143, in get_client
client = await self._create_client(client_type, address, **kw)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/router.py", line 157, in _create_client
return await client_type.connect(address, local_address=local_address, **kw)
File "/usr/local/lib/python3.10/dist-packages/xoscar/backends/communication/socket.py", line 255, in connect
(reader, writer) = await asyncio.open_connection(host=host, port=port, **kwargs)
File "/usr/lib/python3.10/asyncio/streams.py", line 48, in open_connection
transport, _ = await loop.create_connection(
File "/usr/lib/python3.10/asyncio/base_events.py", line 1076, in create_connection
raise exceptions[0]
File "/usr/lib/python3.10/asyncio/base_events.py", line 1060, in create_connection
sock = await self._connect_sock(
File "/usr/lib/python3.10/asyncio/base_events.py", line 969, in _connect_sock
await self.sock_connect(sock, address)
File "/usr/lib/python3.10/asyncio/selector_events.py", line 501, in sock_connect
return await fut
File "/usr/lib/python3.10/asyncio/selector_events.py", line 541, in _sock_connect_cb
raise OSError(err, f'Connect call failed {address}')
ConnectionRefusedError: [address=10.100.108.220:25589, pid=269] [Errno 111] Connect call failed ('0.0.0.0', 30828)
In a distributed setup, pass worker -H to specify the current worker's own IP.
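A sketch of the corrected worker invocation, assuming the supervisor is at 10.100.108.220:9997 and the worker host's IP (10.100.108.221 here) is a placeholder; `hostname -I` is one way to look it up on the worker machine:

```shell
# Look up this worker's own IP (first address the kernel reports).
WORKER_IP=$(hostname -I | awk '{print $1}')   # e.g. 10.100.108.221

# -e points at the supervisor endpoint; -H tells the worker which address
# to register itself under, so the supervisor can connect back to it
# instead of trying (and failing) to reach 0.0.0.0.
xinference-worker -e http://10.100.108.220:9997 -H "$WORKER_IP" --log-level=debug
```

Without -H the worker registers as 0.0.0.0:&lt;port&gt;, which the supervisor then resolves on its own host, producing the ConnectionRefusedError seen above.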
In a distributed setup, pass worker -H to specify the current worker's own IP.
That fixed it, thanks!
In a distributed setup, pass worker -H to specify the current worker's own IP.
One more question: my machine has eight GPUs, and I started a Qwen 72B model on four of them, but it OOMed at launch. Each GPU has 80 GB of memory, which should easily be enough. Is there anything else I need to set at startup? Here is my sh command:
pip install xinference
MASTER_IP=$(ifconfig | grep -o 'inet [0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+' | grep -v '127.0.0.1' | head -n 1 | awk '{print $2}')
xinference-local -H "$MASTER_IP" --port 9997 --log-level=debug
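One thing worth checking is how many GPUs the launch request actually asks for; a sketch assuming the model is launched through the xinference CLI (the model name and format below are illustrative placeholders for whatever Qwen 72B variant is registered):

```shell
# Optionally restrict which physical GPUs the server may use.
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Ask xinference to shard the model across 4 GPUs at launch time.
# --model-name and --model-format here are placeholders, not confirmed
# values from this thread.
xinference launch -e "http://$MASTER_IP:9997" \
  --model-name qwen-chat \
  --size-in-billions 72 \
  --model-format pytorch \
  --n-gpu 4
```

If --n-gpu is left at its default, the model may be placed on a single GPU, where a 72B model will not fit even in 80 GB.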