cake's Introduction

Join the project community on our server!


Cake is a Rust framework for distributed inference of large models like LLama3, built on Candle. The goal of the project is to run big (70B+) models by repurposing consumer hardware into a heterogeneous cluster of iOS, Android, macOS, Linux and Windows devices, effectively leveraging planned obsolescence as a tool to make AI more accessible and democratic.

⚠ This is experimental code that's being actively developed and changed very quickly, expect bugs ⚠

The idea is to shard the transformer blocks across multiple devices so that inference can run on models that wouldn't normally fit in the GPU memory of a single device. Inference over contiguous transformer blocks assigned to the same worker is batched in order to minimize the latency caused by data transfer.

Support

OS            Architectures         Acceleration   Status
GNU/Linux     arm, arm64, x86_64    -
GNU/Linux     arm, arm64, x86_64    CUDA
GNU/Linux     arm, arm64, x86_64    BLAS
Windows       x86_64                BLAS           untested
Windows       x86_64                CUDA           untested
macOS         x86_64                -
macOS         aarch64               -
macOS         aarch64               Metal
Android       arm, arm64, x86_64    -
Android       arm, arm64, x86_64    CUDA           untested
iOS / iPadOS  aarch64               -
iOS / iPadOS  aarch64               Metal          🛠️ 90% done, WIP
Web           -                     WebGPU         in theory possible, not done

CUDA >= 12.2 is required for CUDA-accelerated systems.
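You can verify the installed toolkit and driver versions before building with CUDA support (these are standard CUDA utilities, not Cake commands):

nvcc --version   # CUDA toolkit version, must be >= 12.2
nvidia-smi       # driver version and the highest CUDA version it supports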

Compile

With Rust installed, you can build the core library and the CLI utilities with different accelerations.

Without acceleration (will use CPU):

cargo build --release

With Metal acceleration for Apple Silicon:

cargo build --release --features metal

With CUDA acceleration:

cargo build --release --features cuda

To generate the iOS bindings in the app, which can then be compiled and deployed via Xcode:

make ios

Using

Run a worker node:

cake-cli --model /path/to/Meta-Llama-3-8B \ # model path, read below on how to optimize model size for workers
         --mode worker \                    # run as worker
         --name worker0 \                   # worker name in topology file
         --topology topology.yml \          # topology
         --address 0.0.0.0:10128            # bind address

Run a master node with an OpenAI compatible REST API:

cake-cli --model /path/to/Meta-Llama-3-8B \ # model path
         --api 0.0.0.0:8080               \ # API bind address
         --topology topology.yml            # topology file

Where topology.yml determines which layers are served by which worker (you can find a list of all the layers of a model in its tensor index file; see the snippet after the example below):

linux_server_1:
  host: 'linux_server.host:10128'
  description: 'NVIDIA Titan X Pascal (12GB)'
  layers:
    - 'model.layers.0-5'

linux_server_2:
  host: 'linux_server2.host:10128'
  description: 'NVIDIA GeForce 3080 (10GB)'
  layers:
    - 'model.layers.6-16'

iphone:
  host: 'iphone.host:10128'
  description: 'iPhone 15 Pro Max'
  layers:
    - 'model.layers.17'

ipad:
  host: 'ipad.host:10128'
  description: 'iPad'
  layers:
    - 'model.layers.18-19'

macbook:
  host: 'macbook.host:10128'
  description: 'M1 Max'
  layers:
    - 'model.layers.20-31' 
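
To enumerate the layers a model actually has (and therefore which ranges you can use in the topology), you can extract the unique layer names from the tensor index file. A minimal sketch, assuming the standard Hugging Face safetensors layout and that jq is available:

# list the unique transformer layer names from the tensor index file
jq -r '.weight_map | keys[]' /path/to/Meta-Llama-3-8B/model.safetensors.index.json \
  | grep -oE 'model\.layers\.[0-9]+' \
  | sort -t . -k 3 -n -u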

You can now interact with the cluster via the REST API:

curl http://master-ip:8080/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
        {   
            "role": "system",
            "content": "You are a helpful AI assistant."
        },  
        {   
            "role": "user",
            "content": "Why is the sky blue?"
        }
    ]
}'

Splitting the Model

As a memory and disk space optimization, you might want to give each worker only the data it actually needs from the model instead of the whole folder, in which case you can use the cake-split-model utility. For instance, to generate a smaller version of the LLama3 safetensors:

cake-split-model --model-path path/to/Meta-Llama-3-8B \ # source model to split
                 --topology path/to/topology.yml \      # topology file
                 --output output-folder-name            # output folder where all the workers data bundles will be saved

This will create a smaller folder containing only the layer tensors required by the specific worker, along with the topology file. Remember to also copy the other model contents (config.json, tokenizer.json, etc.) into the worker bundle before deploying it.
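
For example, assuming cake-split-model writes one sub-folder per worker inside the output folder (a hypothetical layout, so check the actual output on disk), copying the auxiliary files could look like this:

# copy the non-tensor model files into each worker bundle
# (assumes one sub-folder per worker under the output folder)
for bundle in output-folder-name/*/; do
  cp path/to/Meta-Llama-3-8B/config.json \
     path/to/Meta-Llama-3-8B/tokenizer.json "$bundle"
done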

License

Released under the GPL 3 license. To see the licenses of the project dependencies, install cargo-license with cargo install cargo-license and then run cargo license.

cake's People

Contributors

evilsocket, b0xtch, yaojunluo

cake's Issues

Inquiries about the possibility of supporting Windows systems

Hello developers, I came across this project and it's really awesome. I've been struggling with the lack of performance of my devices and don't have the money to buy an A100 graphics card (I need to buy milk powder for my child, haha). I'd like to ask whether there is any intention to support Windows: I see that macOS, Linux and Android are all supported, but on our side we mainly run Windows 7. We have six computers, and it would be nice to have a cluster that supports Windows.

The second request fails with an error

Hello, the first request produces output normally, but the second request fails with an error and the master node's service terminates as well.
Worker node command:

CUDA_VISIBLE_DEVICES=3 ./cake-cli --model /sdc/pre_trained_model/Llama3-Chinese-8B-Instruct --mode worker --name worker0 --topology /sdc/jky/cake/topology.yml --address 0.0.0.0:10128

Master node command:

CUDA_VISIBLE_DEVICES=3,4,5,6,7 ./cake-cli --model /home/pre_trained_model/Llama3-Chinese-8B-Instruct --api 0.0.0.0:8080 --topology /home/jky/cake/topology.yml

The error is as follows:

thread 'tokio-runtime-worker' panicked at /sdc/jky/cake/cake-core/src/cake/worker.rs:215:26:
called `Result::unwrap()` on an `Err` value: cannot broadcast [29, 29] to [1, 32, 29, 170]
   0: candle_core::error::Error::bt
   1: candle_core::layout::Layout::broadcast_as
   2: candle_core::tensor::Tensor::broadcast_as
   3: cake_core::models::llama3::cache::Cache::apply_attention_mask
   4: cake_core::models::llama3::attention::CausalSelfAttention::forward
   5: <cake_core::models::llama3::transformer::Transformer as cake_core::cake::Forwarder>::forward::{{closure}}
   6: cake_core::cake::worker::Worker<G>::run::{{closure}}::{{closure}}
   7: tokio::runtime::task::core::Core<T,S>::poll
   8: tokio::runtime::task::harness::Harness<T,S>::poll
   9: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  10: tokio::runtime::scheduler::multi_thread::worker::Context::run
  11: tokio::runtime::context::set_scheduler
  12: tokio::runtime::context::runtime::enter_runtime
  13: tokio::runtime::scheduler::multi_thread::worker::run
  14: tokio::runtime::task::core::Core<T,S>::poll
  15: tokio::runtime::task::harness::Harness<T,S>::poll
  16: tokio::runtime::blocking::pool::Inner::run
  17: std::sys_common::backtrace::__rust_begin_short_backtrace
  18: core::ops::function::FnOnce::call_once{{vtable.shim}}
  19: std::sys::pal::unix::thread::Thread::new::thread_start
  20: <unknown>
  21: <unknown>


Stack backtrace:
   0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
   1: <cake_core::models::llama3::transformer::Transformer as cake_core::cake::Forwarder>::forward::{{closure}}
   2: cake_core::cake::worker::Worker<G>::run::{{closure}}::{{closure}}
   3: tokio::runtime::task::core::Core<T,S>::poll
   4: tokio::runtime::task::harness::Harness<T,S>::poll
   5: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   6: tokio::runtime::scheduler::multi_thread::worker::Context::run
   7: tokio::runtime::context::set_scheduler
   8: tokio::runtime::context::runtime::enter_runtime
   9: tokio::runtime::scheduler::multi_thread::worker::run
  10: tokio::runtime::task::core::Core<T,S>::poll
  11: tokio::runtime::task::harness::Harness<T,S>::poll
  12: tokio::runtime::blocking::pool::Inner::run
  13: std::sys_common::backtrace::__rust_begin_short_backtrace
  14: core::ops::function::FnOnce::call_once{{vtable.shim}}
  15: std::sys::pal::unix::thread::Thread::new::thread_start
  16: <unknown>
  17: <unknown>
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: cake_core::cake::worker::Worker<G>::run::{{closure}}::{{closure}}
   4: tokio::runtime::task::core::Core<T,S>::poll
   5: tokio::runtime::task::harness::Harness<T,S>::poll
   6: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   7: tokio::runtime::scheduler::multi_thread::worker::Context::run
   8: tokio::runtime::context::set_scheduler
   9: tokio::runtime::context::runtime::enter_runtime
  10: tokio::runtime::scheduler::multi_thread::worker::run
  11: tokio::runtime::task::core::Core<T,S>::poll
  12: tokio::runtime::task::harness::Harness<T,S>::poll
  13: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Error in model.forward: error in forward batch operation for block

The first time I call the API, it works fine. However, when I call the REST API for the second time, the master node reports the error:
cake/api/mod.rs:98:10: called Result::unwrap() on an Err value: error in model.forward: error in forward batch operation for block 29: error receiving response for Batch
Additionally, one of my workers will also trigger an error and then stop:
src/cake/worker.rs:225:26: called Result::unwrap() on an Err value: cannot broadcast [28, 28] to [1, 32, 28, 65]

May I ask why I am unable to use the model downloaded through Hugging Face

root@llama01:/www/cake# /www/cake/target/release/cake-cli --model /www/llama --mode worker --name linux_server_1 --address 0.0.0.0:9527 --topology /www/cake/topology.yml
[2024-08-08T16:11:12Z INFO ] [Worker] dtype=F16 device=Cpu mem=5.3 MiB
[2024-08-08T16:11:12Z INFO ] loading configuration from /www/llama/config.json
[2024-08-08T16:11:12Z INFO ] loading topology from /www/cake/topology.yml
[2024-08-08T16:11:12Z INFO ] loading tensors in /www/llama/model.safetensors.index.json
[2024-08-08T16:11:12Z INFO ] loading tensors from /www/llama/model.safetensors.index.json ...
[2024-08-08T16:11:12Z INFO ] loading model-00002-of-00004.safetensors ...
Error: cannot find tensor model-00002-of-00004.safetensors.self_attn.q_proj.weight

Unable to build without CUDA

Tried on a Debian server and on Termux; the results are the same.

CARGO_PROFILE_RELEASE_BUILD_OVERRIDE_DEBUG=true RUST_BACKTRACE=full cargo build --release
warning: /home/dankcat/cake/cake-ios/Cargo.toml: `crate_type` is deprecated in favor of `crate-type` and will not work in the 2024 edition
(in the `cake` library target)
   Compiling cudarc v0.11.7
   Compiling candle-kernels v0.6.0
   Compiling zstd-sys v2.0.12+zstd.1.5.6
   Compiling block-buffer v0.10.4
error: failed to run custom build command for `candle-kernels v0.6.0`

Caused by:
  process didn't exit successfully: `/home/dankcat/cake/target/release/build/candle-kernels-15ec0a2c0042f062/build-script-build` (exit status: 101)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rerun-if-changed=src/compatibility.cuh
  cargo:rerun-if-changed=src/cuda_utils.cuh
  cargo:rerun-if-changed=src/binary_op_macros.cuh
  cargo:info=["/usr", "/usr/local/cuda", "/opt/cuda", "/usr/lib/cuda", "C:/Program Files/NVIDIA GPU Computing Toolkit", "C:/CUDA"]
  cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP

  --- stderr
  thread 'main' panicked at /home/dankcat/.cargo/registry/src/index.crates.io-6f17d22bba15001f/bindgen_cuda-0.1.5/src/lib.rs:489:18:
  `nvidia-smi` failed. Ensure that you have CUDA installed and that `nvidia-smi` is in your PATH.: Os { code: 2, kind: NotFound, message: "No such file or directory" }
  stack backtrace:
     0:     0x55c93d687785 - std::backtrace_rs::backtrace::libunwind::trace::h1a07e5dba0da0cd2
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
     1:     0x55c93d687785 - std::backtrace_rs::backtrace::trace_unsynchronized::h61b9b8394328c0bc
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
     2:     0x55c93d687785 - std::sys_common::backtrace::_print_fmt::h1c5e18b460934cff
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:68:5
     3:     0x55c93d687785 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1e1a1972118942ad
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:44:22
     4:     0x55c93d6ac13b - core::fmt::rt::Argument::fmt::h07af2b4071d536cd
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/fmt/rt.rs:165:63
     5:     0x55c93d6ac13b - core::fmt::write::hc090a2ffd6b28c4a
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/fmt/mod.rs:1157:21
     6:     0x55c93d68420f - std::io::Write::write_fmt::h8898bac6ff039a23
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/io/mod.rs:1832:15
     7:     0x55c93d68755e - std::sys_common::backtrace::_print::h4e80c5803d4ee35b
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:47:5
     8:     0x55c93d68755e - std::sys_common::backtrace::print::ha96650907276675e
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:34:9
     9:     0x55c93d688a49 - std::panicking::default_hook::{{closure}}::h215c2a0a8346e0e0
    10:     0x55c93d68878d - std::panicking::default_hook::h207342be97478370
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:298:9
    11:     0x55c93d688ee3 - std::panicking::rust_panic_with_hook::hac8bdceee1e4fe2c
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:795:13
    12:     0x55c93d688dc4 - std::panicking::begin_panic_handler::{{closure}}::h00d785e82757ce3c
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:664:13
    13:     0x55c93d687c49 - std::sys_common::backtrace::__rust_end_short_backtrace::h1628d957bcd06996
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:171:18
    14:     0x55c93d688af7 - rust_begin_unwind
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
    15:     0x55c93d5e10f3 - core::panicking::panic_fmt::hdc63834ffaaefae5
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
    16:     0x55c93d5e1546 - core::result::unwrap_failed::h82b551e0ff2b2176
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/result.rs:1654:5
    17:     0x55c93d5f00d8 - core::result::Result<T,E>::expect::h0d780f1427a920a0
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/result.rs:1034:23
    18:     0x55c93d6058fc - bindgen_cuda::compute_cap::h544f29d1dbea88ae
                                 at /home/dankcat/.cargo/registry/src/index.crates.io-6f17d22bba15001f/bindgen_cuda-0.1.5/src/lib.rs:485:19
    19:     0x55c93d60216f - <bindgen_cuda::Builder as core::default::Default>::default::hc8d3c33e79e06ed7
                                 at /home/dankcat/.cargo/registry/src/index.crates.io-6f17d22bba15001f/bindgen_cuda-0.1.5/src/lib.rs:48:27
    20:     0x55c93d5e2e5f - build_script_build::main::h601c987ee98bf43b
                                 at /home/dankcat/.cargo/registry/src/index.crates.io-6f17d22bba15001f/candle-kernels-0.6.0/build.rs:7:19
    21:     0x55c93d5e270b - core::ops::function::FnOnce::call_once::h3413b6fc62df34af
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:250:5
    22:     0x55c93d5e1e6e - std::sys_common::backtrace::__rust_begin_short_backtrace::hbdfe41c52daab1ec
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:155:18
    23:     0x55c93d5e22d1 - std::rt::lang_start::{{closure}}::h51c795f7d1b1d218
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/rt.rs:159:18
    24:     0x55c93d67ead0 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h6abeee5a7794ceb5
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:284:13
    25:     0x55c93d67ead0 - std::panicking::try::do_call::hd6e966bb06877057
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:559:40
    26:     0x55c93d67ead0 - std::panicking::try::hc9b3807f5768cb19
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:523:19
    27:     0x55c93d67ead0 - std::panic::catch_unwind::h94a757c154076c6e
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panic.rs:149:14
    28:     0x55c93d67ead0 - std::rt::lang_start_internal::{{closure}}::hc5223fb36050c743
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/rt.rs:141:48
    29:     0x55c93d67ead0 - std::panicking::try::do_call::hddf7b4e1ebeb3f69
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:559:40
    30:     0x55c93d67ead0 - std::panicking::try::h1842860a1f941a31
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:523:19
    31:     0x55c93d67ead0 - std::panic::catch_unwind::h009016ccf811d4c3
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panic.rs:149:14
    32:     0x55c93d67ead0 - std::rt::lang_start_internal::h3ed4fe7b2f419135
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/rt.rs:141:20
    33:     0x55c93d5e22aa - std::rt::lang_start::hff6e3b582a875b8d
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/rt.rs:158:17
    34:     0x55c93d5e306e - main
    35:     0x7f40ce51524a - <unknown>
    36:     0x7f40ce515305 - __libc_start_main
    37:     0x55c93d5e1761 - _start
    38:                0x0 - <unknown>
warning: build failed, waiting for other jobs to finish...
error: failed to run custom build command for `cudarc v0.11.7`

Caused by:
  process didn't exit successfully: `/home/dankcat/cake/target/release/build/cudarc-5c6a5152ed8f4c4d/build-script-build` (exit status: 101)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rerun-if-env-changed=CUDA_ROOT
  cargo:rerun-if-env-changed=CUDA_PATH
  cargo:rerun-if-env-changed=CUDA_TOOLKIT_ROOT_DIR

  --- stderr
  thread 'main' panicked at /home/dankcat/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.7/build.rs:55:10:
  Failed to execute `nvcc`: Os { code: 2, kind: NotFound, message: "No such file or directory" }
  stack backtrace:
     0:     0x564532b54cf5 - std::backtrace_rs::backtrace::libunwind::trace::h1a07e5dba0da0cd2
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
     1:     0x564532b54cf5 - std::backtrace_rs::backtrace::trace_unsynchronized::h61b9b8394328c0bc
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
     2:     0x564532b54cf5 - std::sys_common::backtrace::_print_fmt::h1c5e18b460934cff
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:68:5
     3:     0x564532b54cf5 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1e1a1972118942ad
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:44:22
     4:     0x564532b75a2b - core::fmt::rt::Argument::fmt::h07af2b4071d536cd
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/fmt/rt.rs:165:63
     5:     0x564532b75a2b - core::fmt::write::hc090a2ffd6b28c4a
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/fmt/mod.rs:1157:21
     6:     0x564532b5290f - std::io::Write::write_fmt::h8898bac6ff039a23
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/io/mod.rs:1832:15
     7:     0x564532b54ace - std::sys_common::backtrace::_print::h4e80c5803d4ee35b
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:47:5
     8:     0x564532b54ace - std::sys_common::backtrace::print::ha96650907276675e
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:34:9
     9:     0x564532b55d89 - std::panicking::default_hook::{{closure}}::h215c2a0a8346e0e0
    10:     0x564532b55acd - std::panicking::default_hook::h207342be97478370
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:298:9
    11:     0x564532b56223 - std::panicking::rust_panic_with_hook::hac8bdceee1e4fe2c
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:795:13
    12:     0x564532b56104 - std::panicking::begin_panic_handler::{{closure}}::h00d785e82757ce3c
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:664:13
    13:     0x564532b551b9 - std::sys_common::backtrace::__rust_end_short_backtrace::h1628d957bcd06996
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:171:18
    14:     0x564532b55e37 - rust_begin_unwind
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
    15:     0x564532b25f53 - core::panicking::panic_fmt::hdc63834ffaaefae5
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
    16:     0x564532b26366 - core::result::unwrap_failed::h82b551e0ff2b2176
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/result.rs:1654:5
    17:     0x564532b2c438 - core::result::Result<T,E>::expect::h33784a2d338b94a7
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/result.rs:1034:23
    18:     0x564532b316f6 - build_script_build::cuda_version_from_build_system::h4a38442c7c737c00
                                 at /home/dankcat/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.7/build.rs:52:18
    19:     0x564532b3133a - build_script_build::main::h77dc56d88b14ee07
                                 at /home/dankcat/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.7/build.rs:37:34
    20:     0x564532b2e5cb - core::ops::function::FnOnce::call_once::h2274ad654a6bbd1b
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:250:5
    21:     0x564532b340fe - std::sys_common::backtrace::__rust_begin_short_backtrace::hff1eff237bf98703
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:155:18
    22:     0x564532b2b3d1 - std::rt::lang_start::{{closure}}::h214b04bede10fd10
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/rt.rs:159:18
    23:     0x564532b4f850 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h6abeee5a7794ceb5
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:284:13
    24:     0x564532b4f850 - std::panicking::try::do_call::hd6e966bb06877057
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:559:40
    25:     0x564532b4f850 - std::panicking::try::hc9b3807f5768cb19
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:523:19
    26:     0x564532b4f850 - std::panic::catch_unwind::h94a757c154076c6e
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panic.rs:149:14
    27:     0x564532b4f850 - std::rt::lang_start_internal::{{closure}}::hc5223fb36050c743
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/rt.rs:141:48
    28:     0x564532b4f850 - std::panicking::try::do_call::hddf7b4e1ebeb3f69
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:559:40
    29:     0x564532b4f850 - std::panicking::try::h1842860a1f941a31
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:523:19
    30:     0x564532b4f850 - std::panic::catch_unwind::h009016ccf811d4c3
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panic.rs:149:14
    31:     0x564532b4f850 - std::rt::lang_start_internal::h3ed4fe7b2f419135
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/rt.rs:141:20
    32:     0x564532b2b3aa - std::rt::lang_start::ha16ce9452477e973
                                 at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/rt.rs:158:17
    33:     0x564532b331fe - main
    34:     0x7fec3bef624a - <unknown>
    35:     0x7fec3bef6305 - __libc_start_main
    36:     0x564532b26541 - _start
    37:                0x0 - <unknown>

Building on Ubuntu errors with `cuMemAdvise_v2` on CUDA 12.1

Compiling tracing-core v0.1.32
error[E0599]: no method named `cuMemAdvise_v2` found for reference `&'static driver::sys::sys_12010::Lib` in the current scope
   --> /home/ubuntu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.7/src/driver/result.rs:613:10
    |
612 | /     lib()
613 | |         .cuMemAdvise_v2(dptr, num_bytes, advice, location)
    | |_________-^^^^^^^^^^^^^^
    |
help: there is a method `cuMemAdvise` with a similar name
    |
613 |         .cuMemAdvise(dptr, num_bytes, advice, location)
    |          ~~~~~~~~~~~

error[E0599]: no method named `cuMemPrefetchAsync_v2` found for reference `&'static driver::sys::sys_12010::Lib` in the current scope
     --> /home/ubuntu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.7/src/driver/result.rs:628:10
      |
627   | /     lib()
628   | |         .cuMemPrefetchAsync_v2(dptr, num_bytes, location, 0, stream)
      | |_________-^^^^^^^^^^^^^^^^^^^^^
      |
help: there is a method `cuMemPrefetchAsync` with a similar name, but with different arguments
     --> /home/ubuntu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.7/src/driver/sys/sys_12010.rs:13548:5
      |
13548 | /     pub unsafe fn cuMemPrefetchAsync(
13549 | |         &self,
13550 | |         devPtr: CUdeviceptr,
13551 | |         count: usize,
13552 | |         dstDevice: CUdevice,
13553 | |         hStream: CUstream,
13554 | |     ) -> CUresult {
      | |_________________^

Is it possible to use quantized models?

First of all, I want to thank you for your hard work. I love this project and I think it's awesome to be able to handle inference on different devices.
As for me, the point of splitting a model among different devices lies in my current RAM limitations, so I guess it would make much more sense to be able to use quantized versions of the big models.

The specified file cannot be found

The model path is correct; I don't know which file this error is saying cannot be found.

[Worker] dtype=F16 device=Cuda(CudaDevice(DeviceId(1))) mem=207.4 MiB
 loading topology from topology.yml
loading configuration from /sdc/pre_trained_model/Llama3-Chinese-8B-Instruct/config.json
Error: No such file or directory (os error 2)

Thanks for the FOSS! Suggestions for possible future backend runtimes: Vulkan, OpenCL, SYCL/OpenVINO/Intel GPU, AMD GPU/ROCm/HIP

Thanks for the FOSS!

Suggestions for possible future backend runtimes: Vulkan, OpenCL, SYCL/OpenVINO/Intel GPU, and AMD GPU/ROCm/HIP.

Vulkan and OpenCL both have the potential to be very portable across GPUs, and to some extent across CPUs with supporting software.

SYCL can run on various CPU/GPU platforms; together with OpenVINO, it is the primary and ideal target for supporting Intel GPUs.

About the reason for having cluster nodes

Thanks for your valuable contribution.
I have a question that needs some clarification; the answer would probably also be worth mentioning in the README.
From my basic understanding, Cake splits the model into its layers and distributes those layers to separate nodes because a huge 70B model will not fit into a single normal GPU. So my question is: what would be the benefit of having a cluster of these nodes on our network, instead of having a single worker that loads and offloads each layer of the model one by one? My understanding is that model inference is sequential, so one node has to wait for the previous layers to finish before starting its own work, which makes multiple nodes appear redundant, unless there is some sort of pipelining mechanism that feeds batches to the nodes one at a time. Is that the intention here? Could you please provide some guidance and explanation on this? Thanks again.

Dockerfile support

I have successfully compiled your project with Docker, and I'm willing to share the setup with anyone struggling to do the same.

Since this software is in alpha, I advise the author to use this as a reference and build an official Docker image for the project, before moving on to static linking and an AppImage.

The filesystem structure is:

├── build.sh # build script
├── cake # cloned repository
├── cargo_config.toml # cargo mirror config
├── Dockerfile_intermediate # building intermediate image
└── run.sh # run the final container

Content of build.sh:

INTERMEDIATE_IMAGE_NAME=cake_llm_intermediate
IMAGE_NAME=cake_llm

INTERMEDIATE_CONTAINER_NAME=cake_container_intermediate
CONTAINER_NAME=cake_container

git clone https://github.com/evilsocket/cake

docker kill $CONTAINER_NAME
docker rm $CONTAINER_NAME
docker rmi $INTERMEDIATE_IMAGE_NAME

docker build -t $INTERMEDIATE_IMAGE_NAME -f Dockerfile_intermediate .


read -p "Do you want to continue? (y/n): " answer

case $answer in
    [Yy]* ) echo "You chose yes.";;
    [Nn]* ) echo "You chose no."; exit 1;;
    * ) echo "Please answer yes or no."; exit 1;;
esac

docker kill $INTERMEDIATE_CONTAINER_NAME
docker rm $INTERMEDIATE_CONTAINER_NAME

docker rmi $IMAGE_NAME
docker run -d --privileged --gpus 1 --name $INTERMEDIATE_CONTAINER_NAME $INTERMEDIATE_IMAGE_NAME tail -f /dev/null
docker exec -w /root/cake $INTERMEDIATE_CONTAINER_NAME cargo build
docker commit $INTERMEDIATE_CONTAINER_NAME $IMAGE_NAME 

docker kill $INTERMEDIATE_CONTAINER_NAME
docker rm $INTERMEDIATE_CONTAINER_NAME

Content of Dockerfile_intermediate:

FROM nvidia/cuda:12.4.0-base-ubuntu22.04

RUN rm /etc/apt/apt.conf.d/docker-clean
RUN apt update
RUN apt install -y build-essential curl

RUN apt install -y cuda-nvcc-12-4 cuda-nvrtc-dev-12-4 libcublas-dev-12-4 libcurand-dev-12-4

RUN apt install -y cargo

COPY cake /root/cake

COPY cargo_config.toml /root/.cargo/config.toml

Content of run.sh:

IMAGENAME=cake_llm
CONTAINER_NAME=cake_container

docker kill $CONTAINER_NAME
docker rm $CONTAINER_NAME

MODEL_PATH=/root/data/Meta-Llama-3-8B-Instruct
TOPOFILE=/root/data/topology.yaml

docker run -it --rm --mount type=bind,source=<source_path>,target=/root/data,ro -e LD_LIBRARY_PATH=/usr/local/cuda-12.4/targets/x86_64-linux/lib/ --name $CONTAINER_NAME --privileged --gpus 1 $IMAGENAME /root/cake/target/debug/cake-cli --model $MODEL_PATH --topology $TOPOFILE 

The PTX code was compiled with an unsupported toolchain

Hello, I ran into a new problem while using Cake.
Command:

RUST_LOG=debug CUDA_VISIBLE_DEVICES=2 ./cake-cli --model /data1/pre_trained_model/Llama-3-8B-Instruct --topology /sdc/jky/cake/topology.yml

The error is as follows:

[2024-07-17T06:24:01Z DEBUG] device is cuda 0
[2024-07-17T06:24:01Z INFO ] [Master] dtype=F16 device=Cuda(CudaDevice(DeviceId(1))) mem=220.7 MiB
[2024-07-17T06:24:01Z INFO ] loading configuration from /data1/pre_trained_model/Llama-3-8B-Instruct/config.json
[2024-07-17T06:24:01Z INFO ] loading topology from /sdc/jky/cake/topology.yml
[2024-07-17T06:24:01Z DEBUG] cache::n_elem = 128
[2024-07-17T06:24:01Z DEBUG] cache::theta = [ 1.0000e0, 8.1462e-1, 6.6360e-1, 5.4058e-1, 4.4037e-1, 3.5873e-1, 2.9223e-1,
     2.3805e-1, 1.9392e-1, 1.5797e-1, 1.2869e-1, 1.0483e-1, 8.5397e-2, 6.9566e-2,
     5.6670e-2, 4.6164e-2, 3.7606e-2, 3.0635e-2, 2.4955e-2, 2.0329e-2, 1.6560e-2,
     1.3490e-2, 1.0990e-2, 8.9523e-3, 7.2927e-3, 5.9407e-3, 4.8394e-3, 3.9423e-3,
     3.2114e-3, 2.6161e-3, 2.1311e-3, 1.7360e-3, 1.4142e-3, 1.1520e-3, 9.3847e-4,
     7.6450e-4, 6.2277e-4, 5.0732e-4, 4.1327e-4, 3.3666e-4, 2.7425e-4, 2.2341e-4,
     1.8199e-4, 1.4825e-4, 1.2077e-4, 9.8381e-5, 8.0143e-5, 6.5286e-5, 5.3183e-5,
     4.3324e-5, 3.5292e-5, 2.8750e-5, 2.3420e-5, 1.9078e-5, 1.5542e-5, 1.2660e-5,
     1.0313e-5, 8.4015e-6, 6.8440e-6, 5.5752e-6, 4.5417e-6, 3.6997e-6, 3.0139e-6,
     2.4551e-6]
    Tensor[[64], f32, cuda:0]
Error: DriverError(CUDA_ERROR_UNSUPPORTED_PTX_VERSION, "the provided PTX was compiled with an unsupported toolchain.") when loading cast_u32_f32

Unable to compile successfully

C:\Users\Administrator\Desktop\cake>cargo build --release
warning: C:\Users\Administrator\Desktop\cake\cake-ios\Cargo.toml: crate_type is deprecated in favor of crate-type and will not work in the 2024 edition
(in the cake library target)
Compiling cudarc v0.11.7
Compiling candle-kernels v0.6.0
Compiling clap_lex v0.7.1
Compiling bit-vec v0.6.3
Compiling strsim v0.11.1
Compiling nom v7.1.3
Compiling console v0.15.8
Compiling esaxx-rs v0.1.10
error: failed to run custom build command for cudarc v0.11.7
note: To improve backtraces for build dependencies, set the CARGO_PROFILE_RELEASE_BUILD_OVERRIDE_DEBUG=true environment variable to enable debug information generation.

Caused by:
process didn't exit successfully: C:\Users\Administrator\Desktop\cake\target\release\build\cudarc-95f6bdd5c33de08a\build-script-build (exit code: 101)
--- stdout
cargo:rerun-if-changed=build.rs
cargo:rerun-if-env-changed=CUDA_ROOT
cargo:rerun-if-env-changed=CUDA_PATH
cargo:rerun-if-env-changed=CUDA_TOOLKIT_ROOT_DIR

--- stderr
thread 'main' panicked at C:\Users\Administrator.cargo\registry\src\index.crates.io-6f17d22bba15001f\cudarc-0.11.7\build.rs:82:14:
Unsupported cuda toolkit version: 11.0. Please raise a github issue.
stack backtrace:
0: std::panicking::begin_panic_handler
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library\std\src\panicking.rs:652
1: core::panicking::panic_fmt
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library\core\src\panicking.rs:72
2: <alloc::vec::Vec as core::iter::traits::collect::FromIterator>::from_iter
3: <alloc::vec::Vec as core::iter::traits::collect::FromIterator>::from_iter
4: core::ops::function::FnOnce::call_once
note: Some details are omitted, run with RUST_BACKTRACE=full for a verbose backtrace.
warning: build failed, waiting for other jobs to finish...
error: failed to run custom build command for candle-kernels v0.6.0
note: To improve backtraces for build dependencies, set the CARGO_PROFILE_RELEASE_BUILD_OVERRIDE_DEBUG=true environment variable to enable debug information generation.

Caused by:
process didn't exit successfully: C:\Users\Administrator\Desktop\cake\target\release\build\candle-kernels-644872f2b8f06ed1\build-script-build (exit code: 101)
--- stdout
cargo:rerun-if-changed=build.rs
cargo:rerun-if-changed=src/compatibility.cuh
cargo:rerun-if-changed=src/cuda_utils.cuh
cargo:rerun-if-changed=src/binary_op_macros.cuh
cargo:info=["/usr", "/usr/local/cuda", "/opt/cuda", "/usr/lib/cuda", "C:/Program Files/NVIDIA GPU Computing Toolkit", "C:/CUDA"]
cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP

--- stderr
thread 'main' panicked at C:\Users\Administrator.cargo\registry\src\index.crates.io-6f17d22bba15001f\bindgen_cuda-0.1.5\src\lib.rs:492:9:
assertion left == right failed
left: "Field "compute_cap" is not a valid field to query."
right: "compute_cap"
stack backtrace:
0: std::panicking::begin_panic_handler
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library\std\src\panicking.rs:652
1: core::panicking::panic_fmt
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library\core\src\panicking.rs:72
2: core::panicking::assert_failed_inner
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library\core\src\panicking.rs:409
3: core::panicking::assert_failed
4: bindgen_cuda::cuda_include_dir::{{closure}}
5: <bindgen_cuda::Builder as core::default::Default>::default
6: std::rt::lang_start
7: std::rt::lang_start
8: __ImageBase
9: std::rt::lang_start
10: std::rt::lang_start_internal
at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library\std\src\rt.rs:141
11: std::rt::lang_start
12: main
13: invoke_main
at D:\a_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:78
14: __scrt_common_main_seh
at D:\a_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:288
15: BaseThreadInitThunk
16: RtlUserThreadStart
note: Some details are omitted, run with RUST_BACKTRACE=full for a verbose backtrace.

C:\Users\Administrator\Desktop\cake>nvidia-smi
Tue Jul 16 01:15:46 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 457.30 Driver Version: 457.30 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... WDDM | 00000000:01:00.0 On | N/A |
|100% 29C P8 16W / 250W | 555MiB / 11264MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1340 C+G Insufficient Permissions N/A |
| 0 N/A N/A 12420 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 12928 C+G ...m Files\ToDesk\ToDesk.exe N/A |
| 0 N/A N/A 13352 C+G ...artMenuExperienceHost.exe N/A |
| 0 N/A N/A 13940 C+G ...d\runtime\WeChatAppEx.exe N/A |
| 0 N/A N/A 14476 C+G ...y\ShellExperienceHost.exe N/A |
| 0 N/A N/A 15492 C+G ...2txyewy\TextInputHost.exe N/A |
| 0 N/A N/A 17964 C+G ...ray\lghub_system_tray.exe N/A |
| 0 N/A N/A 18256 C+G ...e\PhoneExperienceHost.exe N/A |
| 0 N/A N/A 18744 C+G ...5n1h2txyewy\SearchApp.exe N/A |
| 0 N/A N/A 19444 C+G ...lPanel\SystemSettings.exe N/A |
+-----------------------------------------------------------------------------+

C:\Users\Administrator\Desktop\cake>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:48_Pacific_Daylight_Time_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.relgpu_drvr445TC445_37.28540450_0

bug with tokenizer and gibberish output

The tokenizer has issues resolving a few tokens, including special ones (they will be shown in the output as ), which causes all sorts of gibberish output ... it's probably a matter of parsing the model/tokenizer.json properly.
