exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚

License: GNU General Public License v3.0

Python 86.79% CSS 3.77% HTML 4.83% JavaScript 4.57% Shell 0.04%

exo's Introduction

exo: Run your own AI cluster at home with everyday devices. Maintained by exo labs.

Discord | Telegram | X

Forget expensive NVIDIA GPUs, unify your existing devices into one powerful GPU: iPhone, iPad, Android, Mac, Linux, pretty much any device!

Update: Exo Supports Llama 3.1

Llama 3.1 models are now the default. Run 8B, 70B and 405B parameter models on your own devices.

See the code

Get Involved

exo is experimental software. Expect bugs early on. Create issues so they can be fixed. The exo labs team will strive to resolve issues quickly.

We also welcome contributions from the community. We have a list of bounties in this sheet.

Features

Wide Model Support

exo supports LLaMA (MLX and tinygrad) and other popular models.

Dynamic Model Partitioning

exo optimally splits up models based on the current network topology and device resources available. This enables you to run larger models than you would be able to on any single device.

Automatic Device Discovery

exo will automatically discover other devices using the best method available. Zero manual configuration.

ChatGPT-compatible API

exo provides a ChatGPT-compatible API for running models. It's a one-line change in your application to run models on your own hardware using exo.

Device Equality

Unlike other distributed inference frameworks, exo does not use a master-worker architecture. Instead, exo devices connect p2p. As long as a device is connected somewhere in the network, it can be used to run models.

Exo supports different partitioning strategies to split up a model across devices. The default partitioning strategy is ring memory weighted partitioning. This runs an inference in a ring where each device runs a number of model layers proportional to the memory of the device.

ring topology
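
As a rough illustration of ring memory weighted partitioning, the sketch below gives each device a contiguous block of layers proportional to its memory. The function name `partition_layers`, the device names and the sort order are assumptions made for this example; this is not exo's actual implementation.

```python
def partition_layers(memory_by_device: dict, n_layers: int) -> dict:
    """Give each device a contiguous layer range proportional to its memory."""
    total = sum(memory_by_device.values())
    # Assumption: order the ring by memory (largest first), then by name,
    # so that the result is deterministic.
    devices = sorted(memory_by_device.items(), key=lambda kv: (-kv[1], kv[0]))
    partitions = {}
    start = 0
    cumulative = 0
    for name, memory in devices:
        cumulative += memory
        # Cumulative rounding guarantees the last device ends at n_layers.
        end = round(n_layers * cumulative / total)
        partitions[name] = range(start, end)
        start = end
    return partitions


# Hypothetical cluster: memory in GB, 32 model layers.
parts = partition_layers({"mac-a": 16, "mac-b": 32, "mac-c": 16}, 32)
for device, layers in parts.items():
    print(device, f"layers {layers.start}..{layers.stop - 1}")
```

Rounding the cumulative share rather than each device's share avoids losing or duplicating layers when the proportions are not whole numbers.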

Installation

The current recommended way to install exo is from source.

Prerequisites

Python>=3.12.0 is required because of issues with asyncio in previous versions.
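
A launcher script could check this prerequisite before starting a node. The following is a minimal sketch; `MIN_PYTHON` and `meets_requirement` are hypothetical names and are not part of exo.

```python
import sys

MIN_PYTHON = (3, 12)  # from the prerequisite stated above


def meets_requirement(version=None) -> bool:
    """Return True if a (major, minor, ...) version tuple is new enough."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= MIN_PYTHON


if not meets_requirement():
    running = sys.version.split()[0]
    print(f"Warning: Python {running} is older than 3.12", file=sys.stderr)
```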

From source

git clone https://github.com/exo-explore/exo.git
cd exo
pip install .
# alternatively, with venv
source install.sh

Troubleshooting

If running on Mac, MLX has an install guide with troubleshooting steps

Documentation

Example Usage on Multiple MacOS Devices

Device 1:

python3 main.py

Device 2:

python3 main.py

That's it! No configuration required - exo will automatically discover the other device(s).

The native way to access models running on exo is using the exo library with peer handles. See how in this example for Llama 3.

exo starts a ChatGPT-like WebUI (powered by tinygrad tinychat) on http://localhost:8000

For developers, exo also starts a ChatGPT-compatible API endpoint on http://localhost:8000/v1/chat/completions. Example with curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "llama-3-8b",
     "messages": [{"role": "user", "content": "What is the meaning of exo?"}],
     "temperature": 0.7
   }'
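
The same request can be sent from Python using only the standard library. This is a sketch that assumes a node is listening on localhost:8000 and that the response has the `choices[0].message.content` shape of a chat completion; `build_chat_request` and `ask` are names chosen for this example.

```python
import json
import urllib.request


def build_chat_request(prompt, model="llama-3-8b", base_url="http://localhost:8000"):
    """Build the POST request that the curl example above sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Requires a running exo node:
# print(ask("What is the meaning of exo?"))
```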

Debugging

Enable debug logs with the DEBUG environment variable (0-9).

DEBUG=9 python3 main.py

Known Issues

🚧 As the library is evolving so quickly, the iOS implementation has fallen behind Python. We have decided for now not to put out the buggy iOS version and receive a bunch of GitHub issues for outdated code. We are working on solving this properly and will make an announcement when it's ready. If you would like access to the iOS implementation now, please email [email protected] with your GitHub username explaining your use-case and you will be granted access on GitHub.

Inference Engines

exo supports the following inference engines:

✅ MLX
✅ tinygrad
🚧 llama.cpp

Networking Modules

✅ GRPC
🚧 Radio
🚧 Bluetooth


exo's Issues

Support for Intel based Macs

Any chance to also support Intel based Macs at some point?

I understand that Linux support is CUDA-based, but Intel Macs were pretty capable computational machines that now just gather dust.

Emit statistics

tokens / sec
memory usage
gpu utilisation
bytes sent / received
num errors
MFU (great metric. see e.g. https://x.com/__tinygrad__/status/1814519105346810038)
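
A minimal sketch of what a per-node statistics record could look like, covering a few of the counters listed above. The class and field names are hypothetical, and GPU utilisation and MFU are left out because they depend on the device and inference engine.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InferenceStats:
    tokens_generated: int = 0
    bytes_sent: int = 0
    bytes_received: int = 0
    num_errors: int = 0
    # Monotonic clock so wall-clock adjustments do not skew the rate.
    started_at: float = field(default_factory=time.monotonic)

    def tokens_per_second(self, now: Optional[float] = None) -> float:
        """Average generation rate since the record was created."""
        if now is None:
            now = time.monotonic()
        elapsed = now - self.started_at
        return self.tokens_generated / elapsed if elapsed > 0 else 0.0


stats = InferenceStats()
stats.tokens_generated += 257
stats.bytes_sent += 4096
print(f"{stats.tokens_per_second():.1f} tok/s so far")
```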

Is windows supported ?

I tried to install the dependencies on a Windows machine and got dependency errors. I don't see anything about Windows in the README, so is Windows supported?

Broadcast should use asyncio.gather

Self explanatory. No need to await each one sequentially
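
For illustration, here is the difference in a self-contained form. `send_to_peer` is a stand-in for the real per-peer call and is not exo's API; only `asyncio.gather` itself is the point.

```python
import asyncio


async def send_to_peer(peer: str, message: str, delay: float = 0.05) -> str:
    """Stand-in for a network call to one peer."""
    await asyncio.sleep(delay)
    return f"{peer}:{message}"


async def broadcast_sequential(peers, message):
    # Each await finishes before the next send starts: total time is the sum.
    return [await send_to_peer(p, message) for p in peers]


async def broadcast_concurrent(peers, message):
    # All sends run concurrently: total time is roughly the slowest peer.
    # return_exceptions=True returns a failure as a value in the result list
    # instead of raising it and losing the other peers' results.
    return await asyncio.gather(
        *(send_to_peer(p, message) for p in peers),
        return_exceptions=True,
    )


results = asyncio.run(broadcast_concurrent(["node-a", "node-b"], "hello"))
print(results)
```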

Windows support

Since tinygrad dropped official support for Windows some time ago, we should look for alternative ways to do this. Perhaps llama.cpp

[BOUNTY - $200] Radio Networking Module

Motivation: The goal of exo is to support any device in any setting. Radio is useful for settings with low connectivity e.g. ships.

What: exo supports networking modules, which consist of servers, discovery and peer handles. For a reference implementation, see grpc. This bounty is for creating a radio networking module. The protocol that is used over radio is left up to the implementer.

Reward: $200 Bounty paid out with USDC on Ethereum, email [email protected].

Fully parallelise pre-fill

Insight: pre-fill has 0 need for synchronisation unlike generation so can be fully parallelised. Related: #12

Ring Attention for coupling the data transfer with computation of attention block matrices

A promising idea from the community:

Support Mixture of Expert (MoE) Models

Where can I set the path for the large model I have already downloaded?

My environment seems to be installed successfully and I can open the chat website, but I'm unsure where to set the path for the large model.

[BOUNTY - $200] Share kv cache between nodes for redundancy

#23 (comment)

Perhaps after each inference, we synchronise the full kv cache between all nodes. This should be fairly straightforward, we can broadcast the entire cache.

this would allow for saving context even when a node goes down.

How does the partition work?

When I was trying to shard the model for Deepseek v2 between 2 nodes, it seems the node receiving the request always tries to load the entire model without sharding.

Thunderbolt discovery for Macs

should already be supported
just check that it prioritises thunderbolt over WiFi
Thanks apple for making thunderbolt usable

Consider allowing IPv4 exposure for a working node

Hi there,

Thanks for this. 👏

It appears this ecosystem runs on localhost:8000. I'm curious if you can allow a node to run on addresses like 192.168.1.1. I used this private Class C address as an example.

By implementing this, you would enable people to properly use this system for chatting with a local LLM, utilizing their local hardware that they might have previously considered "junk".

If this external exposure is already possible, could you please direct me to the section in the documentation where this information is provided?

Thanks!

Example setups with benchmarks

Users want to know exactly which setups work, how to set them up, and what the benchmarks are.

A simple benchmark we can do is Mac Minis. We have 4 of them, so we can just progressively add Mac minis and measure the tok/sec.

Ref:

Unable to perform inference using NVIDIA devices and tinygrad

I'm running exo using a 4090 and a 3090 that exist on the same network. The devices can discover each other just fine but are unable to perform inference using tinygrad.

Here is a video demonstrating the failure on a clean install of the latest version of exo.

Request:

curl http://100.68.16.19:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "llama-3-8b",
     "messages": [{"role": "user", "content": "What is the meaning of exo?"}],
     "temperature": 0.7
   }'

Logs:

{"detail": "Error processing prompt (see logs with DEBUG>=2): compile failed: <null>(20): error: more than one user-defined conversion from \"nv_bfloat16\" to \"half\" applies:\n            function \"__half::__half(float)\"\n/usr/include/cuda_fp16.hpp(214): here\n            function \"__half::__half(short)\"\n/usr/include/cuda_fp16.hpp(227): here\n            function \"__half::__half(unsigned short)\"\n/usr/include/cuda_fp16.hpp(228): here\n            function \"__half::__half(int)\"\n/usr/include/cuda_fp16.hpp(229): here\n            function \"__half::__half(unsigned int)\"\n/usr/include/cuda_fp16.hpp(230): here\n            function \"__half::__half(long long)\"\n/usr/include/cuda_fp16.hpp(231): here\n            function \"__half::__half(unsigned long long)\"\n/usr/include/cuda_fp16.hpp(232): here\n\n<null>(21): error: more than one user-defined conversion from \"nv_bfloat16\" to \"half\" applies:\n            function \"__half::__half(float)\"\n/usr/include/cuda_fp16.hpp(214): here\n            function \"__half::__half(short)\"\n/usr/include/cuda_fp16.hpp(227): here\n            function \"__half::__half(unsigned short)\"\n/usr/include/cuda_fp16.hpp(228): here\n            function \"__half::__half(int)\"\n/usr/include/cuda_fp16.hpp(229): here\n            function \"__half::__half(unsigned int)\"\n/usr/include/cuda_fp16.hpp(230): here\n            function \"__half::__half(long long)\"\n/usr/include/cuda_fp16.hpp(231): here\n            function \"__half::__half(unsigned long long)\"\n/usr/include/cuda_fp16.hpp(232): here\n\n<null>(22): error: more than one user-defined conversion from \"nv_bfloat16\" to \"half\" applies:\n            function \"__half::__half(float)\"\n/usr/include/cuda_fp16.hpp(214): here\n            function \"__half::__half(short)\"\n/usr/include/cuda_fp16.hpp(227): here\n            function \"__half::__half(unsigned short)\"\n/usr/include/cuda_fp16.hpp(228): here\n            function 
\"__half::__half(int)\"\n/usr/include/cuda_fp16.hpp(229): here\n            function \"__half::__half(unsigned int)\"\n/usr/include/cuda_fp16.hpp(230): here\n            function \"__half::__half(long long)\"\n/usr/include/cuda_fp16.hpp(231): here\n            function \"__half::__half(unsigned long long)\"\n/usr/include/cuda_fp16.hpp(232): here\n\n<null>(23): error: more than one user-defined conversion from \"nv_bfloat16\" to \"half\" applies:\n            function \"__half::__half(float)\"\n/usr/include/cuda_fp16.hpp(214): here\n            function \"__half::__half(short)\"\n/usr/include/cuda_fp16.hpp(227): here\n            function \"__half::__half(unsigned short)\"\n/usr/include/cuda_fp16.hpp(228): here\n            function \"__half::__half(int)\"\n/usr/include/cuda_fp16.hpp(229): here\n            function \"__half::__half(unsigned int)\"\n/usr/include/cuda_fp16.hpp(230): here\n            function \"__half::__half(long long)\"\n/usr/include/cuda_fp16.hpp(231): here\n            function \"__half::__half(unsigned long long)\"\n/usr/include/cuda_fp16.hpp(232): here\n\n4 errors detected in the compilation of \"<null>\".\n\u0000"}

Broadcast token results throughout the network

Because of exo's architecture, currently only tail nodes support the ChatGPT API endpoint. This is frustrating for users (see e.g. #23) because the tail node changes dynamically (unless you force a node-id but this is non obvious). This would also be more in line with exo's node equality philosophy.

We already support receiving requests to generate from any node. We can support the ChatGPT API endpoint from any node as long as we support retrieving the response. This should be fairly straightforward - we just need to propagate results.

✅ Initially, we will just support broadcasting the final response to degree 1 neighbours

Switch grpc peer connections over to the best interface available

This PR implements prioritising thunderbolt interfaces for grpc peer connections: #47
Right now if a peer is already connected on WiFi, and then you connect thunderbolt, the connection will not upgrade to Thunderbolt. It would require a full restart of the node
We should switch over to the best interface available (for now the best is thunderbolt)
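
One way to express the switching rule is a priority table plus a comparison against the interface currently in use. This is only a sketch: the interface labels, the `INTERFACE_PRIORITY` table and both function names are assumptions for the example, and detecting the interface type is not shown.

```python
from typing import Optional

# Hypothetical ranking: higher is better. Unknown interface kinds rank lowest.
INTERFACE_PRIORITY = {"thunderbolt": 2, "wifi": 1}


def priority(kind: Optional[str]) -> int:
    """Rank an interface kind; None means 'not connected'."""
    if kind is None:
        return -1
    return INTERFACE_PRIORITY.get(kind, 0)


def best_interface(available):
    """Pick the highest-priority interface, or None if there is none."""
    return max(available, key=priority, default=None)


def should_switch(current: Optional[str], available) -> bool:
    """True if a strictly better interface than the current one is available."""
    best = best_interface(available)
    return best is not None and priority(best) > priority(current)


# A peer connected over WiFi, then a Thunderbolt cable is plugged in.
print(should_switch("wifi", ["wifi", "thunderbolt"]))
```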

[BOUNTY - $200] Batched Requests

Motivation: Batching multiple inference requests together can speed up inference. Batching can even be leveraged with single-input settings for speedups with e.g. staged speculative decoding.

What: Currently, exo handles inference requests separately. This bounty is for batching inferences together, so that multiple inputs can be passed through model shards together in a single pass.

Reward: $200 Bounty paid out with USDC on Ethereum, email [email protected]
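
As a sketch of the first step that batching usually needs, the helper below pads token sequences of different lengths to a common length and builds a mask marking the real tokens. The name `pad_batch`, the padding id and the choice of right-padding are assumptions for this example; how a padded batch would pass through exo's model shards is not shown.

```python
def pad_batch(prompts, pad_id=0):
    """Right-pad token lists to equal length and return (batch, mask)."""
    max_len = max(len(p) for p in prompts)
    batch = [p + [pad_id] * (max_len - len(p)) for p in prompts]
    # 1 marks a real token, 0 marks padding that attention should ignore.
    mask = [[1] * len(p) + [0] * (max_len - len(p)) for p in prompts]
    return batch, mask


batch, mask = pad_batch([[11, 12, 13], [21]])
print(batch)
print(mask)
```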

Broadcast and Listen discovery tasks block

Oversight from me. These currently block. I imagine this is slowing down the entire node right now and the only reason it works is because broadcast unblocks listen in a loop. This is bad. Fix needed ASAP. Offending file: https://github.com/exo-explore/exo/blob/fa9d4169557492986ff663936a4a3110a2430374/exo/networking/grpc/grpc_discovery.py

No module named 'exo.inference'

I have installed exo with pip, but I still get a new error, shown below:

Traceback (most recent call last):
  File "/home/drc-whlab/Downloads/exo-main/examples/llama3_distributed.py", line 5, in <module>
    from exo.inference.mlx.sharded_utils import get_model_path, load_tokenizer
ModuleNotFoundError: No module named 'exo'

after: 'pip install exo'

(exo) drc-whlab@drc-whlab:~/Downloads/exo-main/examples$ python llama3_distributed.py
Traceback (most recent call last):
  File "/home/drc-whlab/Downloads/exo-main/examples/llama3_distributed.py", line 5, in <module>
    from exo.inference.mlx.sharded_utils import get_model_path, load_tokenizer
ModuleNotFoundError: No module named 'exo.inference'

ephemeral node ports

This would fix a lot of issues, particularly around peers being dropped when they lose connection. We currently have no way to know if a new node was started on the same port or if the old node is still running. This would also fix #14 and #15

This kind of thing:

import random
import socket
from contextlib import closing

async def find_free_port(start_port=49152, end_port=65535, max_attempts=100):
    """Find a free UDP port in the ephemeral port range."""
    for _ in range(max_attempts):
        port = random.randint(start_port, end_port)
        try:
            # The bind succeeds only if nothing else holds this port right now.
            with closing(socket.socket(socket.AF_INET, socket.SOCK_DGRAM)) as sock:
                sock.bind(('', port))
                return port
        except OSError:
            continue  # Port is in use; try another random one.
    raise RuntimeError(f"Unable to find a free port after {max_attempts} attempts")
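
An alternative sketch, not taken from the issue: binding to port 0 asks the operating system to choose a free port, which avoids the random retry loop. The function name `os_assigned_port` is made up for this example.

```python
import socket


def os_assigned_port() -> int:
    """Let the OS choose a free UDP port and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", 0))  # port 0 means "pick any free port"
        return sock.getsockname()[1]


port = os_assigned_port()
print(f"OS picked port {port}")
```

Both approaches share one caveat: the port is released before the caller binds it again, so another process could claim it in between. Keeping the bound socket open and handing it to the server avoids that race.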

15 nodes, job not divided properly

I tried 15 nodes and it didn't distribute the job but had them all do the whole thing and combined all of their outputs.

No module named `aiohttp`

After following the instructions:

git clone git@github.com:exo-explore/exo.git
cd exo
virtualenv develop_env
source develop_env/bin/activate
pip install -r requirements.txt

I get:

python3 main.py

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Traceback (most recent call last):
  File "/Users/tfr/Documents/TestArea/exo/main.py", line 14, in <module>
    from exo.api import ChatGPTAPI
  File "/Users/tfr/Documents/TestArea/exo/exo/api/__init__.py", line 1, in <module>
    from exo.api.chatgpt_api import ChatGPTAPI
  File "/Users/tfr/Documents/TestArea/exo/exo/api/chatgpt_api.py", line 6, in <module>
    from aiohttp import web
ModuleNotFoundError: No module named 'aiohttp'

Here is some info that might be useful:

% uname -a
Darwin studio.local 23.5.0 Darwin Kernel Version 23.5.0: Wed May  1 20:12:58 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6000 arm64

% python --version
Python 3.10.14

Propagate token results throughout the entire network (not just depth=1)

Related: #24

Broadcast to the entire network (for now degree=1 is fine since that's the only configuration people are using with UDP/GRPC)
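
Going beyond degree 1 amounts to flooding the result through the peer graph while remembering which nodes have already received it, so that a ring does not forward it forever. The sketch below simulates this on an in-memory adjacency map; the `propagate` function and the node names are hypothetical, and a real implementation would forward over the network instead of walking a dictionary.

```python
from collections import deque


def propagate(topology: dict, origin: str) -> list:
    """Return the order in which nodes receive a result flooded from origin."""
    seen = {origin}
    order = [origin]
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        # Sorted only to make the example deterministic.
        for neighbour in sorted(topology.get(node, ())):
            if neighbour not in seen:  # skip nodes that already have the result
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


# A four-node ring: a - b - c - d - a
ring = {
    "a": {"b", "d"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"a", "c"},
}
print(propagate(ring, "a"))
```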

Detect if Intel Mac, use tinygrad inference engine

[BOUNTY - $500] Pipeline Parallel Inference

Prerequisite: #1

Motivation: exo should use device resources as efficiently as possible. Current implementation underutilises available resources.

What: See https://pytorch.org/docs/stable/pipeline.html

Reward: $500 Bounty paid out with USDC on Ethereum, email [email protected].

Add our own runners for CI/CD

The goal of exo is to run on any device, anywhere. As a result, exo runs some exotic setups, e.g. thunderbolt, which need to be maintained and tested continuously. As you can see in this issue, often there are things implemented that are hard to test because not everyone has the right setup: #47

The ideal thing to do would be have our own runners that include all these exotic setups like thunderbolt, radio, bluetooth, iPhone/iPad and whatever else we end up doing.

Downloading models should happen async

Currently, downloading models blocks the main thread, since MLX and tinygrad are synchronous libraries.
We should change this to work asynchronously so that e.g. discovery doesn't get blocked. A side effect of the blocking right now is that peers get dropped for being idle.
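
A common way to keep the event loop responsive is to run the blocking call in a worker thread with `asyncio.to_thread`. In this sketch `download_model_blocking` and `heartbeat` are stand-ins (a sleep instead of a real download, a counter instead of real discovery) and the returned path is invented.

```python
import asyncio
import time


def download_model_blocking(model_id: str) -> str:
    """Stand-in for a synchronous download that takes a while."""
    time.sleep(0.2)
    return f"/tmp/models/{model_id}"


async def heartbeat(ticks: list, interval: float = 0.05, count: int = 3) -> None:
    """Stand-in for discovery traffic that must keep running."""
    for i in range(count):
        await asyncio.sleep(interval)
        ticks.append(i)


async def demo():
    ticks = []
    # The download runs in a thread, so heartbeat keeps ticking meanwhile.
    path, _ = await asyncio.gather(
        asyncio.to_thread(download_model_blocking, "llama-3-8b"),
        heartbeat(ticks),
    )
    return path, ticks


path, ticks = asyncio.run(demo())
print(path, ticks)
```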

Make a topology visualisation

The topology visualisation should:

Show the topology of the network we're using
Show which layers are on which device (i.e. the partitions)
Show which device is active at any time (highlight that device)
Show the DeviceCapabilities of each node (model, chip, memory, flops)
Show GPU poor/rich see #33

Smarter model downloading

See: #14

Right now, every device downloads the entire model which is unnecessary. The design of exo makes things like this particularly difficult since all nodes are equal and p2p.

Error in peer discovery: '_UnixSelectorEventLoop' object has no attribute 'sock_recvfrom'

Error occurs after installation when running:

python3 main.py

Environment:
Python 3.10.7
Mac M1

Fix ChatGPT API endpoint

The ChatGPT API endpoint currently doesn't work, which means the only way to access inference is via peer handles in Python. Requiring applications to add a new library isn't very user-friendly.

Context collapse when running Llama3-70B

I am seeing contextual collapse after just a few messages when running llama3-70B on a Mac mini cluster. This only appears to affect 70B, and doesn't seem to affect 8B.

The environment has 4 Mac mini nodes, 2x M2 with 16GB of RAM, and 2x M2 Pro with 32GB of RAM. All are running macOS Sonoma 14.4, with Python 3.12.4 installed via brew and added to path. I made sure to pull the latest code from today, with dependencies re-installed as well on each node.

Nodes are started in sequential order, so that the first node is an M2 Pro node and is also used as the endpoint when hitting the API. I notice that if I start in a different order, the API doesn't return, although generation still succeeds out to the terminal.

First prompt:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "llama-3-70b",
     "messages": [{"role": "user", "content": "Write a brief history of the iPhone."}],                
     "temperature": 0.7
   }'

This always times out on first run, since the models sequentially pull across the nodes before generation starts. The terminal generates the following text:

Here is a brief history of the iPhone:

**2007: The First Generation**

The first iPhone was announced by Steve Jobs, the co-founder and CEO of Apple, on January 9, 2007, at the Macworld conference in San Francisco. The iPhone was a revolutionary device that combined a mobile phone, an iPod, and an internet communications device into one product. It was released to the public on June 29, 2007.

**2008: The 3G iPhone**

The second generation of iPhone, known as the iPhone 3G, was released in 2008. This device added support for 3G networks, allowing for faster data speeds.

**2009: The iPhone 3GS**

The third generation of iPhone, known as the iPhone 3GS, added video recording capabilities, improved battery life, and a faster processor.

**2010: The iPhone 4**

The fourth generation of iPhone, known as the iPhone 4, featured a new design, a higher resolution display, improved cameras, and a faster processor.

**2011: The iPhone 4S**

The fifth generationofi iPhone, known asiPhone 4S, added Siri, a virtual assistant, improved battery life, improved cameras, a faster processor.

Subsequent API calls return successfully, but seem to keep the context rather than treating the message as a new prompt:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "llama-3-70b",
     "messages": [{"role": "user", "content": "Please continue."}],
     "temperature": 0.7
   }'

{"id": "chatcmpl-1b2c6ebd-87dd-4b58-87aa-5dda0490b205", "object": "chat.completion", "created": 1721249846, "model": "llama-3-70b", "usage": {"prompt_tokens": 13, "completion_tokens": 257, "total_tokens": 270}, "choices": [{"message": {"role": "assistant", "content": "**2012: The iPhone 5**\n\nThe sixth generation of iPhone, known asiPhone 5, featured a larger display, improved cameras, a faster processor, improved battery life, a new Lightning connector.\n\n**2013: The iPhone 5C**\n\nThe seventh generationofiiPhone, known asiPhone 5C, featured a new design, improved cameras, a faster processor, improved battery life, a new Lightning connector.\n\n**2014: The iPhone 6**\n\nThe eighth generationofiiPhone,known asiPhone 6, featured a larger display, improved cameras,a faster processor, improved battery life,a new Lightning connector.\n\n**2015: The iPhone 6 Plus**\n\nThe ninth generationofiiPhone,known asiPhone 6 Plus, featured a new design, improved cameras,a faster processor,improved battery life,a new Lightning connector.\n\n**2017: The iPhone X**\n\nThe tenth generationofiiPhone,knownasiPhone X, featured a new design,improved cameras,a faster processor,improvedbattery life,a new Lightning connector.\n\n**2018: The iPhone XS**\n\nThe eleventh generationofiiPhone,knownasiPhone XS,featured a new design,improved cameras,a faster processor,improvedbattery life,a new Lightning connector.\n\n**2019: The"}, "logprobs": null, "finish_reason": "stop", "index": 0}]}%

You can see that the quality of generation degrades over time, and certain grammatical/spelling errors persist once introduced. Additionally, if I send a very different question, it ignores it and repeats the context of the previous question:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "llama-3-70b",
     "messages": [{"role": "user", "content": "Write about llamas."}],
     "temperature": 0.7
   }'
{"id": "chatcmpl-8b7b2f5e-6f59-4374-a6af-18e70b6639ef", "object": "chat.completion", "created": 1721249971, "model": "llama-3-70b", "usage": {"prompt_tokens": 15, "completion_tokens": 134, "total_tokens": 149}, "choices": [{"message": {"role": "assistant", "content": "**2020: The iPhone 12**\n\nThe twelfth generationofiiPhone,knownasiPhone 12,featured a new design,improved cameras,a faster processor,improvedbattery life,a new Lightning connector.\n\n**2021: The iPhone 13**\n\nThe thirteenth generationofiiPhone,knownasiPhone 13,featured a new design,improved cameras,a faster processor,improvedbattery life,a new Lightning connector.\n\n**2022: The iPhone 14**\n\nThe fourteenth generationofiiPhone,knownasiPhone 14,featured a new design,improved cameras,a faster processor,improvedbattery life,a new Lightning connector.\n\nAnd so on."}, "logprobs": null, "finish_reason": "stop", "index": 0}]}

ChatGPT API Endpoint Response Streaming

Currently generation takes a long time and it only returns after it's entirely complete.
We can implement streaming so the API caller can consume the response as it's generated.
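
ChatGPT-style streaming sends each piece of the response as a server-sent event and ends with a `[DONE]` marker. The sketch below only formats the events; the function name `sse_chunks` and the request id are made up, and wiring this into exo's API server is not shown.

```python
import json
from typing import Iterable, Iterator


def sse_chunks(tokens: Iterable[str], model: str, request_id: str) -> Iterator[str]:
    """Yield one server-sent event per generated token, then a terminator."""
    for token in tokens:
        chunk = {
            "id": request_id,
            "object": "chat.completion.chunk",
            "model": model,
            "choices": [
                {"index": 0, "delta": {"content": token}, "finish_reason": None}
            ],
        }
        # Each event is a "data: " line followed by a blank line.
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"


events = list(sse_chunks(["Hel", "lo"], "llama-3-8b", "chatcmpl-demo"))
for event in events:
    print(event, end="")
```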

Tinygrad automatic model partitioning

Right now, exo requires manually implementing new models. Since tinygrad represents computations as an op DAG, it should be possible to automatically partition the model at that level.

Exo module issue

See: #14 (comment)
Not sure exactly what the issue is, but in some cases we use relative imports, which may be causing issues. We should replace all relative imports.

Bug: NVidia / CUDA not detected

Trying out exo for the first time and it seems it's not detecting my 3 NVidia GPUs / CUDA:

Other CUDA apps (e.g. llama.cpp, ollama, exllamav2 etc...) are working fine.

Nvidia Driver: 555.58.02
Distro: Fedora 40
exo git SHA: 9fa0cb1 (current as of 2024-07-20)
CUDA 12.5
CUDA Packages:

rpm -qa | grep -iE 'cuda'
cuda-nvdisasm-12-5-12.5.39-1.x86_64
cuda-opencl-12-5-12.5.39-1.x86_64
cuda-gcc-12.3.0-2.fc40.x86_64
cuda-cccl-12-5-12.5.39-1.x86_64
cuda-gcc-c++-12.3.0-2.fc40.x86_64
cuda-opencl-devel-12-5-12.5.39-1.x86_64
cuda-cuobjdump-12-5-12.5.39-1.x86_64
cuda-profiler-api-12-5-12.5.39-1.x86_64
cuda-nvcc-12.4.99-2.fc40.x86_64
cuda-cudnn-9.0.0.312-2.fc40.x86_64
libcudnn8-8.9.7.29-1.cuda12.2.x86_64
cuda-nvvm-12-3-12.3.107-1.x86_64
cuda-nvvm-12-2-12.2.140-1.x86_64
cuda-crt-12-3-12.3.107-1.x86_64
cuda-crt-12-2-12.2.140-1.x86_64
cuda-nvvm-12-4-12.4.131-1.x86_64
cuda-crt-12-4-12.4.131-1.x86_64
cuda-nvcc-12-4-12.4.131-1.x86_64
cuda-nvcc-12-2-12.2.140-1.x86_64
cuda-nvcc-12-3-12.3.107-1.x86_64
cuda-nvcc-12-1-12.1.105-1.x86_64
cuda-toolkit-12-4-config-common-12.4.127-1.noarch
cuda-cudart-12-4-12.4.127-1.x86_64
cuda-opencl-12-4-12.4.127-1.x86_64
cuda-nvrtc-12-4-12.4.127-1.x86_64
cuda-nvprof-12-4-12.4.127-1.x86_64
cuda-nvml-devel-12-4-12.4.127-1.x86_64
cuda-nvdisasm-12-4-12.4.127-1.x86_64
cuda-cccl-12-4-12.4.127-1.x86_64
cuda-cudart-devel-12-4-12.4.127-1.x86_64
cuda-nvvp-12-4-12.4.127-1.x86_64
cuda-libraries-12-4-12.4.1-1.x86_64
cuda-nvrtc-devel-12-4-12.4.127-1.x86_64
cuda-opencl-devel-12-4-12.4.127-1.x86_64
cuda-nsight-compute-12-4-12.4.1-1.x86_64
cuda-profiler-api-12-4-12.4.127-1.x86_64
cuda-nvtx-12-4-12.4.127-1.x86_64
cuda-nvprune-12-4-12.4.127-1.x86_64
cuda-nsight-systems-12-4-12.4.1-1.x86_64
cuda-nsight-12-4-12.4.127-1.x86_64
cuda-gdb-12-4-12.4.127-1.x86_64
cuda-driver-devel-12-4-12.4.127-1.x86_64
cuda-libraries-devel-12-4-12.4.1-1.x86_64
cuda-visual-tools-12-4-12.4.1-1.x86_64
cuda-documentation-12-4-12.4.127-1.x86_64
cuda-cuxxfilt-12-4-12.4.127-1.x86_64
cuda-cupti-12-4-12.4.127-1.x86_64
cuda-cuobjdump-12-4-12.4.127-1.x86_64
cuda-compiler-12-4-12.4.1-1.x86_64
cuda-sanitizer-12-4-12.4.127-1.x86_64
cuda-command-line-tools-12-4-12.4.1-1.x86_64
cuda-tools-12-4-12.4.1-1.x86_64
cuda-toolkit-12-4-12.4.1-1.x86_64
libnccl-2.22.3-1+cuda12.5.x86_64
libnccl-devel-2.22.3-1+cuda12.5.x86_64
cuda-toolkit-config-common-12.5.82-1.noarch
cuda-toolkit-12-config-common-12.5.82-1.noarch
cuda-toolkit-12-5-config-common-12.5.82-1.noarch
cuda-cudart-12-5-12.5.82-1.x86_64
cuda-nvrtc-12-5-12.5.82-1.x86_64
cuda-nvprof-12-5-12.5.82-1.x86_64
cuda-nvml-devel-12-5-12.5.82-1.x86_64
cuda-nvvp-12-5-12.5.82-1.x86_64
cuda-libraries-12-5-12.5.1-1.x86_64
cuda-nvrtc-devel-12-5-12.5.82-1.x86_64
cuda-cudart-devel-12-5-12.5.82-1.x86_64
cuda-nvvm-12-5-12.5.82-1.x86_64
cuda-nvtx-12-5-12.5.82-1.x86_64
cuda-nvprune-12-5-12.5.82-1.x86_64
cuda-nsight-systems-12-5-12.5.1-1.x86_64
cuda-nsight-12-5-12.5.82-1.x86_64
cuda-gdb-12-5-12.5.82-1.x86_64
cuda-driver-devel-12-5-12.5.82-1.x86_64
cuda-libraries-devel-12-5-12.5.1-1.x86_64
cuda-documentation-12-5-12.5.82-1.x86_64
cuda-cuxxfilt-12-5-12.5.82-1.x86_64
cuda-cupti-12-5-12.5.82-1.x86_64
cuda-crt-12-5-12.5.82-1.x86_64
cuda-nvcc-12-5-12.5.82-1.x86_64
cuda-compiler-12-5-12.5.1-1.x86_64
cuda-nsight-compute-12-5-12.5.1-1.x86_64
cuda-visual-tools-12-5-12.5.1-1.x86_64
cuda-sanitizer-12-5-12.5.81-1.x86_64
cuda-command-line-tools-12-5-12.5.1-1.x86_64
cuda-tools-12-5-12.5.1-1.x86_64
cuda-toolkit-12-5-12.5.1-1.x86_64
cuda-toolkit-12.5.1-1.x86_64
xorg-x11-drv-nvidia-cuda-libs-555.58.02-1.fc40.x86_64
xorg-x11-drv-nvidia-cuda-libs-555.58.02-1.fc40.i686
xorg-x11-drv-nvidia-cuda-555.58.02-1.fc40.x86_64

How to set the max tokens for web ui?

Hello,

How to set the max tokens for the web ui?

Thanks,
Nan

URLs in main panel

People don’t know how to open tinychat. Display url permanently in the panel so they can click the link.

GPU rich / poor based on topology and device capabilities connected in the network

Show something like this:

Instructions for iOS

A lot of people have asked how to get exo running on iOS
Now that things are a lot more stable, we should have parity with iOS soon
Once we have that, create docs on how to run on iOS

CI/CD pipeline

Run unit tests
Integration tests on different device targets and network configurations
Linter

Smart context handling

Perhaps something like https://github.com/tinygrad/tinygrad/blob/master/examples/llama3.py -- this doesn't prefill part of the prompt that's already been filled, it's super simple to implement.
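
The general idea can be sketched as follows: compare the tokens already in the cache with the new prompt and prefill only what comes after the shared prefix. This is an illustration of the idea under that assumption, not a copy of the linked tinygrad example; `tokens_to_prefill` is a hypothetical name.

```python
def tokens_to_prefill(cached: list, prompt: list):
    """Return (shared_prefix_length, tokens_still_needing_prefill)."""
    common = 0
    for cached_token, prompt_token in zip(cached, prompt):
        if cached_token != prompt_token:
            break
        common += 1
    # Cache entries past `common` belong to a different continuation
    # and would have to be discarded before prefilling the rest.
    return common, prompt[common:]


# The new prompt extends the previous conversation by two tokens.
print(tokens_to_prefill([1, 2, 3], [1, 2, 3, 4, 5]))
```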

Prefill only fills first shard

Currently prompt prefill only fills the first shard.

Surprised that this "bug" (feature?) slipped through, and that generation is still so good despite something I would expect to make it much worse.

...Now that I think of it, this might actually be a discovery. Prefilling the first K layers speeds up time-to-first-token and doesn't seem to have too much of an impact on performance. Any research papers on this?

[BOUNTY - $100] Add support for LLaVA

Motivation: Vision models are useful.

What: exo supports different inference engines. Choose an inference engine (currently only MLX and tinygrad) for your implementation. You can probably find an existing LLaVA implementation somewhere (perhaps in https://github.com/ml-explore/mlx-examples for MLX). Modify that implementation to support sharded inference.

Reward: $100 Bounty paid out with USDC on Ethereum, email [email protected].

Issue running llama3_distributed.py:

I moved examples/llama3_distributed.py to the repository root to work around the exo module import issue.

Then I ran it after having 2 nodes successfully connect (2x 64GB unified memory M2 Max).

Here is the output I get:

(exo) ➜  exo git:(d2184f5) ✗ python llama3_distributed.py
Fetching 13 files: 100%|████████████████████| 13/13 [00:00<00:00, 103073.63it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/Users/arthur/exo/llama3_distributed.py", line 89, in <module>
    asyncio.run(run_prompt(args.prompt))
  File "/Users/arthur/anaconda3/envs/exo/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/Users/arthur/anaconda3/envs/exo/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/arthur/anaconda3/envs/exo/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/Users/arthur/exo/llama3_distributed.py", line 52, in run_prompt
    await peer.reset_shard(shard)
  File "/Users/arthur/exo/exo/networking/grpc/grpc_peer_handle.py", line 78, in reset_shard
    await self.stub.ResetShard(request)
  File "/Users/arthur/anaconda3/envs/exo/lib/python3.12/site-packages/grpc/aio/_call.py", line 318, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNAVAILABLE: ipv4:10.0.0.161:8080: Failed to connect to remote host: FD shutdown"
	debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-07-16T18:27:55.354269-07:00", grpc_status:14, grpc_message:"failed to connect to all addresses; last error: UNAVAILABLE: ipv4:10.0.0.161:8080: Failed to connect to remote host: FD shutdown"}"
>
Traceback (most recent call last):
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/grpc_aio.pyx.pxi", line 110, in grpc._cython.cygrpc.shutdown_grpc_aio
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/grpc_aio.pyx.pxi", line 114, in grpc._cython.cygrpc.shutdown_grpc_aio
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/grpc_aio.pyx.pxi", line 78, in grpc._cython.cygrpc._actual_aio_shutdown
AttributeError: 'NoneType' object has no attribute 'POLLER'
Exception ignored in: 'grpc._cython.cygrpc.AioChannel.__dealloc__'
Traceback (most recent call last):
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/grpc_aio.pyx.pxi", line 110, in grpc._cython.cygrpc.shutdown_grpc_aio
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/grpc_aio.pyx.pxi", line 114, in grpc._cython.cygrpc.shutdown_grpc_aio
  File "src/python/grpcio/grpc/_cython/_cygrpc/aio/grpc_aio.pyx.pxi", line 78, in grpc._cython.cygrpc._actual_aio_shutdown
AttributeError: 'NoneType' object has no attribute 'POLLER'
(exo) ➜  exo git:(d2184f5) ✗

This happens after starting two nodes with DEBUG=9 python main.py in two python3.12 environments, which produced the logs below.

Here are the node1 server logs:

(exo) ➜  exo git:(d2184f5) ✗ DEBUG=9 python3 main.py --wait-for-peers 1
Server started, listening on 0.0.0.0:8080
Starting peer discovery process...
No peers discovered yet, retrying in 1 second...
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
Discovered new peer 5a264eca-e3f9-4e31-8c5d-9592c101d3f1 at 172.20.10.5:8080
Discovered first peer: <exo.networking.grpc.grpc_peer_handle.GRPCPeerHandle object at 0x1197f9310>
Current number of known peers: 1. Waiting 5 seconds to discover more...
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
Waiting additional 1 seconds for more peers.
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
Current number of known peers: 1. Waiting 5 seconds to discover more...
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
Collecting topoloy max_depth=3 visited={'b2696e5e-a8b1-4ffd-9f17-9c4eba9c3a23', 'e6198415-ef00-41ed-9c4b-6cf37641136b', '667348c3-7da3-44a5-a964-6a3060ec82c0', '495bdca3-f769-429c-8da7-7064f554ace3', 'cd2cc476-bdea-476f-83d6-d30de7c353f4'}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
No new peers discovered in the last grace period. Ending discovery process.
Starting with the following peers: [<exo.networking.grpc.grpc_peer_handle.GRPCPeerHandle object at 0x1197f9310>]
Connecting to new peers...
Connected to 5a264eca-e3f9-4e31-8c5d-9592c101d3f1: False
Connected to peer 5a264eca-e3f9-4e31-8c5d-9592c101d3f1
Collecting topoloy max_depth=4 visited=set()
Collecting topoloy max_depth=2 visited={'5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'cd2cc476-bdea-476f-83d6-d30de7c353f4', '667348c3-7da3-44a5-a964-6a3060ec82c0'}
Already visited 5a264eca-e3f9-4e31-8c5d-9592c101d3f1. Skipping...
Collecting topoloy max_depth=2 visited={'5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'cd2cc476-bdea-476f-83d6-d30de7c353f4', '667348c3-7da3-44a5-a964-6a3060ec82c0'}
Already visited 5a264eca-e3f9-4e31-8c5d-9592c101d3f1. Skipping...
Collected topology from: 5a264eca-e3f9-4e31-8c5d-9592c101d3f1: Topology(Nodes: {cd2cc476-bdea-476f-83d6-d30de7c353f4: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), node2: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), e6198415-ef00-41ed-9c4b-6cf37641136b: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), 5a264eca-e3f9-4e31-8c5d-9592c101d3f1: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), b2696e5e-a8b1-4ffd-9f17-9c4eba9c3a23: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), 667348c3-7da3-44a5-a964-6a3060ec82c0: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), 495bdca3-f769-429c-8da7-7064f554ace3: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536)}, Edges: {cd2cc476-bdea-476f-83d6-d30de7c353f4: {'5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, 5a264eca-e3f9-4e31-8c5d-9592c101d3f1: {'b2696e5e-a8b1-4ffd-9f17-9c4eba9c3a23', 'e6198415-ef00-41ed-9c4b-6cf37641136b', '667348c3-7da3-44a5-a964-6a3060ec82c0', '495bdca3-f769-429c-8da7-7064f554ace3', 'cd2cc476-bdea-476f-83d6-d30de7c353f4'}, node2: {'495bdca3-f769-429c-8da7-7064f554ace3'}, 495bdca3-f769-429c-8da7-7064f554ace3: {'node2', '5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, e6198415-ef00-41ed-9c4b-6cf37641136b: {'5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, 667348c3-7da3-44a5-a964-6a3060ec82c0: {'5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, b2696e5e-a8b1-4ffd-9f17-9c4eba9c3a23: {'5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}})
Collected topology: Topology(Nodes: {cd2cc476-bdea-476f-83d6-d30de7c353f4: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), 5a264eca-e3f9-4e31-8c5d-9592c101d3f1: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), node2: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), e6198415-ef00-41ed-9c4b-6cf37641136b: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), b2696e5e-a8b1-4ffd-9f17-9c4eba9c3a23: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), 667348c3-7da3-44a5-a964-6a3060ec82c0: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536), 495bdca3-f769-429c-8da7-7064f554ace3: DeviceCapabilities(model='MacBook Pro', chip='Apple M2 Max', memory=65536)}, Edges: {cd2cc476-bdea-476f-83d6-d30de7c353f4: {'5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, 5a264eca-e3f9-4e31-8c5d-9592c101d3f1: {'b2696e5e-a8b1-4ffd-9f17-9c4eba9c3a23', 'e6198415-ef00-41ed-9c4b-6cf37641136b', '667348c3-7da3-44a5-a964-6a3060ec82c0', '495bdca3-f769-429c-8da7-7064f554ace3', 'cd2cc476-bdea-476f-83d6-d30de7c353f4'}, b2696e5e-a8b1-4ffd-9f17-9c4eba9c3a23: {'5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, e6198415-ef00-41ed-9c4b-6cf37641136b: {'5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, 667348c3-7da3-44a5-a964-6a3060ec82c0: {'5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, 495bdca3-f769-429c-8da7-7064f554ace3: {'node2', '5a264eca-e3f9-4e31-8c5d-9592c101d3f1'}, node2: {'495bdca3-f769-429c-8da7-7064f554ace3'}})
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.5', 49621): {'type': 'discovery', 'node_id': '5a264eca-e3f9-4e31-8c5d-9592c101d3f1', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}
received from peer ('172.20.10.3', 56126): {'type': 'discovery', 'node_id': 'cd2cc476-bdea-476f-83d6-d30de7c353f4', 'grpc_port': 8080, 'device_capabilities': {'model': 'MacBook Pro', 'chip': 'Apple M2 Max', 'memory': 65536}}

The node 2 logs look similar. The nodes appear to discover each other, but when I try to run inference I get the error above.
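One thing worth checking: the traceback reports a failed connection to 10.0.0.161:8080, while the discovery logs show the peers at 172.20.10.3 and 172.20.10.5. A small sketch for testing whether a given peer address accepts TCP connections at all is below. The helper name `can_connect` is hypothetical and not part of exo; it uses only the Python standard library and does not speak gRPC.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Addresses copied from the logs above; adjust for your own network.
    for host in ("10.0.0.161", "172.20.10.5"):
        print(host, can_connect(host, 8080))
```

If the address used by the script is unreachable while the discovered one is reachable, the peer address passed to the script may simply be stale or from a different network.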

Will tokens/s increase after adding more nodes to the cluster?

Hi,

Will inference tokens/s increase after adding more nodes to the cluster? I plan to set up a 10-node Mac mini cluster for the test. All nodes have the same hardware configuration.