I moved examples/llama3_distributed.py to the repo root to get around the exo. module import issue.
Then I ran it after the two nodes had successfully connected (2x M2 Max, 64GB unified memory each).
(exo) ➜  exo git:(d2184f5) ✗ python llama3_distributed.py
Fetching 13 files: 100%|████████████████████| 13/13 [00:00<00:00, 103073.63it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
File "/Users/arthur/exo/llama3_distributed.py", line 89, in <module>
asyncio.run(run_prompt(args.prompt))
File "/Users/arthur/anaconda3/envs/exo/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/Users/arthur/anaconda3/envs/exo/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/arthur/anaconda3/envs/exo/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/Users/arthur/exo/llama3_distributed.py", line 52, in run_prompt
await peer.reset_shard(shard)
File "/Users/arthur/exo/exo/networking/grpc/grpc_peer_handle.py", line 78, in reset_shard
await self.stub.ResetShard(request)
File "/Users/arthur/anaconda3/envs/exo/lib/python3.12/site-packages/grpc/aio/_call.py", line 318, in __await__
raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses; last error: UNAVAILABLE: ipv4:10.0.0.161:8080: Failed to connect to remote host: FD shutdown"
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2024-07-16T18:27:55.354269-07:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNAVAILABLE: ipv4:10.0.0.161:8080: Failed to connect to remote host: FD shutdown"}"
>
Traceback (most recent call last):
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/grpc_aio.pyx.pxi", line 110, in grpc._cython.cygrpc.shutdown_grpc_aio
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/grpc_aio.pyx.pxi", line 114, in grpc._cython.cygrpc.shutdown_grpc_aio
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/grpc_aio.pyx.pxi", line 78, in grpc._cython.cygrpc._actual_aio_shutdown
AttributeError: 'NoneType' object has no attribute 'POLLER'
Exception ignored in: 'grpc._cython.cygrpc.AioChannel.__dealloc__'
Traceback (most recent call last):
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/grpc_aio.pyx.pxi", line 110, in grpc._cython.cygrpc.shutdown_grpc_aio
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/grpc_aio.pyx.pxi", line 114, in grpc._cython.cygrpc.shutdown_grpc_aio
File "src/python/grpcio/grpc/_cython/_cygrpc/aio/grpc_aio.pyx.pxi", line 78, in grpc._cython.cygrpc._actual_aio_shutdown
AttributeError: 'NoneType' object has no attribute 'POLLER'
(exo) ➜  exo git:(d2184f5) ✗
This was after starting two nodes with DEBUG=9 python main.py, each in its own Python 3.12 environment. Logs from one node:
(exo) ➜  exo git:(d2184f5) ✗ DEBUG=9 python3 main.py --wait-for-peers 1
Server started, listening on 0.0.0.0:8080
Starting peer discovery process...
No peers discovered yet, retrying in 1 second...
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
Discovered new peer 5a264eca-e3f9-4e31-8c5d-9592c101d3f1 at 172.20.10.5:8080
Discovered first peer: <exo.networking.grpc.grpc_peer_handle.GRPCPeerHandle object at 0x1197f9310>
Current number of known peers: 1. Waiting 5 seconds to discover more...
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
Waiting additional 1 seconds for more peers.
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
Current number of known peers: 1. Waiting 5 seconds to discover more...
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
Collecting topoloy max_depth=3 visited={'b2696e5e-a8b1-4ffd-9f17-9c4eba9c3a23', 'e6198415-ef00-41ed-9c4b-6cf37641136b', '667348c3-7da3-44a5-a964-6a3060ec82c0', '495bdca3-f769-429c-8da7-7064f554ace3', 'cd2cc476-bdea-476f-83d6-d30de7c353f4'}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
No new peers discovered in the last grace period. Ending discovery process.
Starting with the following peers: [<exo.networking.grpc.grpc_peer_handle.GRPCPeerHandle object at 0x1197f9310>]
Connecting to new peers...
Connected to 5a264eca-e3f9-4e31-8c5d-9592c101d3f1: False
Connected to peer 5a264eca-e3f9-4e31-8c5d-9592c101d3f1
Collecting topoloy max_depth=4 visited=set()
Collecting topoloy max_depth=2 visited={'5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'cd2cc476-bdea-476f-83d6-d30de7c353f4', '667348c3-7da3-44a5-a964-6a3060ec82c0'}
Already visited 5a264eca-e3f9-4e31-8c5d-9592c101d3f1. Skipping...
Collecting topoloy max_depth=2 visited={'5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'cd2cc476-bdea-476f-83d6-d30de7c353f4', '667348c3-7da3-44a5-a964-6a3060ec82c0'}
Already visited 5a264eca-e3f9-4e31-8c5d-9592c101d3f1. Skipping...
Collected topology from: 5a264eca-e3f9-4e31-8c5d-9592c101d3f1: Topology(Nodes: {cd2cc476-bdea-476f-83d6-d30de7c353f4: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), node2: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), e6198415-ef00-41ed-9c4b-6cf37641136b: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), 5a264eca-e3f9-4e31-8c5d-9592c101d3f1: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), b2696e5e-a8b1-4ffd-9f17-9c4eba9c3a23: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), 667348c3-7da3-44a5-a964-6a3060ec82c0: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), 495bdca3-f769-429c-8da7-7064f554ace3: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536)}, Edges: {cd2cc476-bdea-476f-83d6-d30de7c353f4: {'5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, 5a264eca-e3f9-4e31-8c5d-9592c101d3f1: {'b2696e5e-a8b1-4ffd-9f17-9c4eba9c3a23', 'e6198415-ef00-41ed-9c4b-6cf37641136b', '667348c3-7da3-44a5-a964-6a3060ec82c0', '495bdca3-f769-429c-8da7-7064f554ace3', 'cd2cc476-bdea-476f-83d6-d30de7c353f4'}, node2: {'495bdca3-f769-429c-8da7-7064f554ace3'}, 495bdca3-f769-429c-8da7-7064f554ace3: {'node2', '5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, e6198415-ef00-41ed-9c4b-6cf37641136b: {'5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, 667348c3-7da3-44a5-a964-6a3060ec82c0: {'5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, b2696e5e-a8b1-4ffd-9f17-9c4eba9c3a23: {'5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}})
Collected topology: Topology(Nodes: {cd2cc476-bdea-476f-83d6-d30de7c353f4: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), 5a264eca-e3f9-4e31-8c5d-9592c101d3f1: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), node2: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), e6198415-ef00-41ed-9c4b-6cf37641136b: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), b2696e5e-a8b1-4ffd-9f17-9c4eba9c3a23: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), 667348c3-7da3-44a5-a964-6a3060ec82c0: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), 495bdca3-f769-429c-8da7-7064f554ace3: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536)}, Edges: {cd2cc476-bdea-476f-83d6-d30de7c353f4: {'5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, 5a264eca-e3f9-4e31-8c5d-9592c101d3f1: {'b2696e5e-a8b1-4ffd-9f17-9c4eba9c3a23', 'e6198415-ef00-41ed-9c4b-6cf37641136b', '667348c3-7da3-44a5-a964-6a3060ec82c0', '495bdca3-f769-429c-8da7-7064f554ace3', 'cd2cc476-bdea-476f-83d6-d30de7c353f4'}, b2696e5e-a8b1-4ffd-9f17-9c4eba9c3a23: {'5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, e6198415-ef00-41ed-9c4b-6cf37641136b: {'5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, 667348c3-7da3-44a5-a964-6a3060ec82c0: {'5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, 495bdca3-f769-429c-8da7-7064f554ace3: {'node2', '5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, node2: {'495bdca3-f769-429c-8da7-7064f554ace3'}})
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
Node 2's logs look similar. The nodes appear to discover each other, but when I try to run inference I get the error above.
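
One detail that may help narrow this down: the traceback fails on 10.0.0.161:8080, while the discovery logs show peers at 172.20.10.5 and 172.20.10.3. Below is a minimal sketch for checking whether the gRPC port accepts a plain TCP connection from the machine running the script. can_connect is a hypothetical helper, not part of exo, and it only tests TCP reachability, not whether the gRPC service itself responds.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage, with addresses copied from the logs above
# (8080 is the grpc_port the peers advertise):
#
#   for host in ("10.0.0.161", "172.20.10.5", "172.20.10.3"):
#       print(f"{host}:8080 reachable: {can_connect(host, 8080)}")
```

If the 172.20.10.x addresses are reachable but 10.0.0.161 is not, that would suggest the script is dialing an address on a different network than the one discovery is using. That is only a guess from the logs.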