
candle-vllm's Introduction

candle vLLM


Efficient, easy-to-use platform for inference and serving of local LLMs, including an OpenAI-compatible API server.

Features

  • OpenAI compatible API server provided for serving LLMs.
  • Highly extensible trait-based system to allow rapid implementation of new module pipelines.
  • Streaming support in generation.
  • Efficient management of key-value cache with PagedAttention.
  • Continuous batching.
  • In-situ quantization.

Development Status

Currently, candle-vllm supports chat serving for the following models.

| Model ID | Model Type | Supported | Speed (A100, BF16) | Throughput (BF16, bs=16) | Quantized (A100, Q4K) |
|---|---|---|---|---|---|
| #1 | LLAMA/LLAMA2/LLaMa3/LLaMa3.1 | ✅ | 74 tks/s (7B), 65 tks/s (LLaMa3.1 8B) | 553 tks/s (LLaMa3.1 8B) | 75 tks/s (LLaMa3.1 8B) |
| #2 | Mistral | ✅ | 70 tks/s (7B) | 585 tks/s (7B) | 96 tks/s (7B) |
| #3 | Phi (v1, v1.5, v2) | ✅ | 97 tks/s (2.7B, F32+BF16) | TBD | - |
| #4 | Phi-3 (3.8B, 7B) | ✅ | 107 tks/s (3.8B) | 744 tks/s (3.8B) | 135 tks/s (3.8B) |
| #5 | Yi | ✅ | 75 tks/s (6B) | 566 tks/s (6B) | 105 tks/s (6B) |
| #6 | StableLM | ✅ | 99 tks/s (3B) | TBD | - |
| #7 | BigCode/StarCode | TBD | TBD | TBD | - |
| #8 | ChatGLM | TBD | TBD | TBD | - |
| #9 | QWen2 (1.8B, 7B) | ✅ | 148 tks/s (1.8B) | 784 tks/s (1.8B) | - |
| #10 | Google Gemma | ✅ | 130 tks/s (2B) | TBD | - |
| #11 | Blip-large (Multimodal) | TBD | TBD | TBD | - |
| #12 | Moondream-2 (Multimodal LLM) | TBD | TBD | TBD | - |

Demo Chat with candle-vllm (61-65 tokens/s, LLaMa3.1 8B, bf16, on A100)

LLaMa3.1-8B-A100-1.mp4

Usage

See the examples folder for sample scripts.

Step 1: Run the candle-vllm service (assuming the llama2-7b model weights have been downloaded)

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
sudo apt install libssl-dev
sudo apt install pkg-config
git clone git@github.com:EricLBuehler/candle-vllm.git
cd candle-vllm
cargo run --release -- --port 2000 --weight-path /home/llama2_7b/ llama

You may also run a specific model using its Hugging Face model ID, e.g.,

cargo run --release -- --port 2000 --model-id meta-llama/Llama-2-7b-chat-hf llama

Run the latest LLaMa3.1 using local weights:

cargo run --release -- --port 2000 --weight-path /home/Meta-Llama-3.1-8B-Instruct/ llama3

Step 2:

Option 1: Chat with ChatUI (recommended)

Install ChatUI and its dependencies:

git clone git@github.com:guoqingbao/candle-vllm-demo.git
cd candle-vllm-demo
apt install npm #install npm if needed
npm install n -g #update node js if needed
n stable #update node js if needed
npm i -g pnpm #install pnpm manager
pnpm install #install ChatUI dependencies

Launching the ChatUI:

pnpm run dev # run the ChatUI

Option 2: Chat completion request with HTTP post

curl -X POST "http://127.0.0.1:2000/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -d '{
           "model": "llama7b",
           "messages": [
               {"role": "user", "content": "Explain how to best learn Rust."}
           ],
           "temperature": 0.7,
          "max_tokens": 128,
          "stop": {"Single":"</s>"}
       }'

Sample response:

{"id":"cmpl-53092967-c9cf-40e0-ae26-d7ac786d59e8","choices":[{"message":{"content":" Learning any programming language requires a combination of theory, practice, and dedication. Here are some steps and resources to help you learn Rust effectively:\n\n1. Start with the basics:\n\t* Understand the syntax and basic structure of Rust programs.\n\t* Learn about variables, data types, loops, and control structures.\n\t* Familiarize yourself with Rust's ownership system and borrowing mechanism.\n2. Read the Rust book:\n\t* The Rust book is an official resource that provides a comprehensive introduction to the language.\n\t* It covers topics such","role":"[INST]"},"finish_reason":"length","index":0,"logprobs":null}],"created":1718784498,"model":"llama7b","object":"chat.completion","usage":{"completion_tokens":129,"prompt_tokens":29,"total_tokens":158}}

Option 3: Chat completion with the openai package

In your terminal, install the openai Python package by running pip install openai (version 1.3.5 was used here).

Then, create a new Python file and write the following code:

import openai

openai.api_key = "EMPTY"

openai.base_url = "http://localhost:2000/v1/"

completion = openai.chat.completions.create(
    model="llama",
    messages=[
        {
            "role": "user",
            "content": "Explain how to best learn Rust.",
        },
    ],
    max_tokens = 64,
)
print(completion.choices[0].message.content)

After the candle-vllm service is running, run the Python script and enjoy efficient inference with an OpenAI compatible API server!

Batched requests

Refer to examples/benchmark.py

import asyncio
from typing import List, Tuple

from openai import Stream
from openai.types.chat import ChatCompletionChunk

# chat_completion() and stream_response() are helper coroutines defined in examples/benchmark.py

async def benchmark():
    model = "mistral7b"
    max_tokens = 1024
    # 16 requests
    prompts = ["Explain how to best learn Rust.", 
               "Please talk about deep learning in 100 words.", 
               "Do you know the capital city of China? Talk the details of you known.", 
               "Who is the best female actor in the world? Explain why.",
               "How to dealing with depression?",
               "How to make money in short time?",
               "What is the future trend of large language model?",
               "The famous tech companies in the world.",
               "Explain how to best learn Rust.", 
               "Please talk about deep learning in 100 words.", 
               "Do you know the capital city of China? Talk the details of you known.", 
               "Who is the best female actor in the world? Explain why.",
               "How to dealing with depression?",
               "How to make money in short time?",
               "What is the future trend of large language model?",
               "The famous tech companies in the world."]
    
    # send 16 chat requests at the same time
    tasks: List[asyncio.Task] = []
    for i in range(len(prompts)):
        tasks.append(
            asyncio.create_task(
                chat_completion(model, max_tokens, prompts[i]))
        )

    # obtain the corresponding stream object for each request
    outputs: List[Stream[ChatCompletionChunk]] = await asyncio.gather(*tasks)

    # tasks for streaming chat responses
    tasks_stream: List[asyncio.Task] = []
    for i in range(len(outputs)):
        tasks_stream.append(
            asyncio.create_task(
                stream_response(i, outputs[i]))
        )

    # gathering the response texts
    outputs: List[Tuple[int, str]] = await asyncio.gather(*tasks_stream)

    # print the results, you may find chat completion statistics in the backend server (i.e., candle-vllm)
    for idx, output in outputs:
        print("\n\n Response {}: \n\n {}".format(idx, output))


asyncio.run(benchmark())
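
The chat_completion and stream_response helpers used above are defined in examples/benchmark.py; a minimal sketch of what they might look like, assuming an AsyncOpenAI client pointed at the local server, is:

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:2000/v1/")

async def chat_completion(model, max_tokens, prompt):
    # Issue one streamed chat request and return the stream handle.
    return await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )

async def stream_response(idx, stream):
    # Collect the streamed chunks of request `idx` into a single string.
    text = ""
    async for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta is not None:
            text += delta
    return idx, text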

In-situ quantization for consumer-grade GPUs

Candle-vllm now supports in-situ quantization, allowing the default weights (F32/F16/BF16) to be transformed into any GGML format during model loading. This feature helps conserve GPU memory, making it more practical for consumer-grade GPUs (e.g., RTX 4090). For example, 4-bit quantization can reduce GPU memory usage to less than 12GB for 8B models, while bringing 13B models down to 24GB. To use this feature, simply supply the quant parameter when running candle-vllm.

cargo run --release -- --port 2000 --weight-path /home/Meta-Llama-3.1-8B-Instruct/ llama3 --quant q4k

Options for the quant parameter: ["q4_0", "q4_1", "q5_0", "q5_1", "q8_0", "q2k", "q3k", "q4k", "q5k", "q6k"]

Please note:

  1. It may take a few minutes to load and convert F32/F16/BF16 models into the quantized format;

  2. Batched processing still requires further optimizations when operating in quantization mode.

Usage Help

For general configuration help, run cargo run -- --help.

For model-specific help, run cargo run -- --port 2000 <MODEL_TYPE> --help

For local model weights, run cargo run --release -- --port 2000 --weight-path /home/llama2_7b/ llama, changing the path as needed.

MODEL_TYPE = ["llama", "llama3", "mistral", "phi2", "phi3", "qwen2", "gemma", "yi", "stable-lm"]

WEIGHT_FILE_PATH = Corresponding weight path for the given model type

cargo run --release -- --port 2000 --weight-path <WEIGHT_FILE_PATH> <MODEL_TYPE>

or

MODEL_ID = Huggingface model id

cargo run --release -- --port 2000 --model-id <MODEL_ID> <MODEL_TYPE>

For KV cache configuration, set kvcache_mem_cpu and kvcache_mem_gpu; the defaults are 4GB of CPU memory and 4GB of GPU memory for the KV cache.

For chat history settings, set record_conversation to true to let candle-vllm remember chat history. By default, candle-vllm does not record chat history; instead, the client sends both the new messages and the contextual history to candle-vllm. If record_conversation is set to true, the client sends only new chat messages to candle-vllm, and candle-vllm is responsible for recording the previous chat messages. However, this approach requires per-session chat recording, which is not yet implemented, so the default record_conversation=false is recommended.
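
For example, with the default record_conversation=false, the client keeps the conversation history and resends it with every request (a sketch using the openai package from Option 3; the message contents are illustrative):

import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:2000/v1/"

# The client owns the history and sends the full message list on every turn.
history = [
    {"role": "user", "content": "Explain how to best learn Rust."},
    {"role": "assistant", "content": "(previous answer returned by the server)"},
    {"role": "user", "content": "Which of those steps should I start with?"},
]
completion = openai.chat.completions.create(model="llama", messages=history, max_tokens=64)
print(completion.choices[0].message.content)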

For chat streaming, the stream flag in the chat request needs to be set to True.
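
For example, with the openai Python package, streaming can be enabled like this (a minimal sketch against the server started in Step 1):

import openai

openai.api_key = "EMPTY"
openai.base_url = "http://localhost:2000/v1/"

# stream=True makes the server send back chunks as they are generated.
stream = openai.chat.completions.create(
    model="llama",
    messages=[{"role": "user", "content": "Explain how to best learn Rust."}],
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is not None:
        print(delta, end="", flush=True)
print()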

You may supply penalty and temperature to the model to prevent potential repetitions, for example:

cargo run --release -- --port 2000 --weight-path /home/mistral_7b/ mistral --repeat-last-n 64 --penalty 1.1 --temperature 0.7

The --max-gen-tokens parameter controls the maximum number of output tokens per chat response. By default, it is set to 1/5 of max_sequence_len.

For consumer GPUs, it is recommended to run the models in GGML formats, e.g.,

cargo run --release -- --port 2000 --weight-path /home/Meta-Llama-3.1-8B-Instruct/ llama3 --quant q4k

where quant is one of ["q4_0", "q4_1", "q5_0", "q5_1", "q8_0", "q2k", "q3k", "q4k", "q5k", "q6k"].

Report issue

Installing candle-vllm is as simple as following the steps above. If you have any problems, please create an issue.

Contributing

The following features are planned to be implemented, but contributions are especially welcome:

Resources

candle-vllm's People

Contributors

bm777 · ericlbuehler · guoqingbao · sigma-andex


candle-vllm's Issues

candle-vllm build issue

I have followed the readme document on Ubuntu 23.10 and am hitting this build issue; please help me with it.

errors detected in the compilation of "kernels/flash_fwd_hdim256_bf16_sm80.cu".
--error 0x1 --

thread '' panicked bindgen_cuda-0.1.4/src/lib.rs:262:21:
nvcc error while executing compiling: "nvcc" "--gpu-architecture=sm_75" "-c" "-o" "/home/tantrajya/Documents/rustprojects/candle-vllm-master/target/release/build/candle-flash-attn-f1535da02b90d1cc/out/flash_fwd_hdim256_bf16_sm80-21dd0f7dd998e506.o" "--default-stream" "per-thread" "-std=c++17" "-O3" "-U__CUDA_NO_HALF_OPERATORS__" "-U__CUDA_NO_HALF_CONVERSIONS__" "-U__CUDA_NO_HALF2_OPERATORS__" "-U__CUDA_NO_BFLOAT16_CONVERSIONS__" "-Icutlass/include" "--expt-relaxed-constexpr" "--expt-extended-lambda" "--use_fast_math" "--verbose" "kernels/flash_fwd_hdim256_bf16_sm80.cu"

When using non-stream mode, the client is blocking.

When I run the Yi-6B model inference service using the command cargo run --release -- --port 2000 --weight-path /home/yi-6b/ yi --repeat-last-n 32 and then execute python examples/llama.py as a client test, the client program keeps blocking. After pressing Ctrl+C, the following output can be seen on the server:

Prompt "<|im_start|>user\n Explain how to best learn Rust. <|im_end|>"
Request cmpl-8f264377-117d-4e63-87b3-41621869c188 with length 13 added to sequence group.
Send stream response error!
Request cmpl-8f264377-117d-4e63-87b3-41621869c188 decoding finished in 0 seconds

 [1 requests] Prefilling: 13 prompt tokens processed in 289 seconds

 [1 requests] Decoding: 0 tokens processed in 0 seconds (0 tokens/s)

However, running python examples/benchmark.py works successfully. The following curl test command also gets stuck.

curl -X POST "http://127.0.0.1:2000/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -d '{
           "model": "llama7b",
           "messages": [
               {"role": "user", "content": "Explain how to best learn Rust."}
           ],
           "temperature": 0.7,
          "max_tokens": 128,
          "stop": {"Single":"</s>"}
       }'

Qwen 2 model broken

To reproduce:

cargo run --release -- --port 2000 --hf-token TOK --model-id Qwen/Qwen2-1.5B qwen2 --repeat-last-n 64

And send a curl request, for example:

curl -X POST "http://127.0.0.1:2000/v1/chat/completions"      -H "Content-Type: application/json"      -H "Authorization: Bearer YOUR_API_KEY"      -d '{
           "model": "qwen2",
           "messages": [
               {"role": "user", "content": "Explain how to best learn Rust."}
           ],
           "temperature": 0.7,
          "max_tokens": 128,
          "stop": {"Single":"</s>"}
       }'

The Phi 3 model works, however, and the Qwen 2 does not. I think this may be because the num_attention_heads != num_key_value_heads. In Phi 3 they are the same, but in Qwen 2 they are 12 and 2 respectively. Gentle ping to @guoqingbao, could you please take a look?

Error:
cannot broadcast [1, 2, 128, 13] to [1, 12, 128, 13]

Using candle-vllm as crate in rust?

Hi Eric, great Rust program.

I am looking for a crate so I can use a chatbot function within my Rust program. I tried to do that with candle. I hope it will be better documented in the future.

Will it be possible to call a function of candle-vllm without starting an explicit server, so I can use candle-vllm within my program?

Thanks

Best regards Gerhard

Mistral does not load safetensors

Currently, this is likely caused by the use of mistralai/Mistral-7B-v0.1. This model card only has .bin files, and the Mistral7BLoader looks for .safetensor files.

Support running without the --hf-token parameter and using ~/.cache/huggingface/token instead

Sharing the model cache with Python's Hugging Face Transformers, including the ~/.cache/huggingface/token file, would be nice. It makes it a bit easier to run, and in some cases it is a bit more flexible to have a file with the secret instead of an environment variable.

Currently, running without the --hf-token parameter crashes:

root@a07c64fd168b:~# RUST_BACKTRACE=full /root/.cargo/bin/candle-vllm --port 8080 llama7b --repeat-last-n 4096
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', src/openai/pipelines/llama.rs:93:52
stack backtrace:
   0:     0x55d28053b680 - std::backtrace_rs::backtrace::libunwind::trace::hc74d68e4e9860fea
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x55d28053b680 - std::backtrace_rs::backtrace::trace_unsynchronized::hf8e4c2f2a5bd29df
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x55d28053b680 - std::sys_common::backtrace::_print_fmt::h4f6fe0d2839f79bc
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/sys_common/backtrace.rs:65:5
   3:     0x55d28053b680 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hd19378c79598340a
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x55d28059415f - core::fmt::rt::Argument::fmt::h59bd005c578b819c
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/core/src/fmt/rt.rs:138:9
   5:     0x55d28059415f - core::fmt::write::hc6d0fea41b7d9e9f
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/core/src/fmt/mod.rs:1094:21
   6:     0x55d28057ac09 - std::io::Write::write_fmt::haa7b0f41639b47d7
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/io/mod.rs:1714:15
   7:     0x55d28053b485 - std::sys_common::backtrace::_print::h5d490a94ac60fe6f
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x55d28053b485 - std::sys_common::backtrace::print::hbaa9c19cd32d8c11
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x55d280562b41 - std::panicking::default_hook::{{closure}}::h8910079f61b67bae
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/panicking.rs:269:22
  10:     0x55d2805627df - std::panicking::default_hook::h7edb9fd7f54e7e29
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/panicking.rs:288:9
  11:     0x55d280562fe9 - std::panicking::rust_panic_with_hook::h21aae3dc3ef5bbce
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/panicking.rs:705:13
  12:     0x55d28053b921 - std::panicking::begin_panic_handler::{{closure}}::ha72f1c35114340ed
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/panicking.rs:595:13
  13:     0x55d28053b756 - std::sys_common::backtrace::__rust_end_short_backtrace::h4b9d50b70e928d76
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/sys_common/backtrace.rs:151:18
  14:     0x55d280562d12 - rust_begin_unwind
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/panicking.rs:593:5
  15:     0x55d27fd42e73 - core::panicking::panic_fmt::h1f248589eaa015cb
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/core/src/panicking.rs:67:14
  16:     0x55d27fd42f03 - core::panicking::panic::ha7576b296f353a40
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/core/src/panicking.rs:117:5
  17:     0x55d27fe64d05 - <candle_vllm::openai::pipelines::llama::LlamaLoader as candle_vllm::openai::pipelines::ModelLoader>::download_model::h35852d7688395361
  18:     0x55d27fdd2772 - <tokio::task::local::RunUntil<T> as core::future::future::Future>::poll::h840672626d529183
  19:     0x55d27fdba6d5 - <core::pin::Pin<P> as core::future::future::Future>::poll::hada91331ad8af9c6
  20:     0x55d27fd6847e - tokio::runtime::scheduler::current_thread::Context::enter::hcadae44c6bd4a972
  21:     0x55d27fdba3fc - tokio::runtime::context::scoped::Scoped<T>::set::h8c46da17f5738f19
  22:     0x55d27fda2d8d - tokio::runtime::context::set_scheduler::h3ae1db3c76e6eddd
  23:     0x55d27fd68886 - tokio::runtime::scheduler::current_thread::CoreGuard::block_on::h9087bc7e3b6248ad
  24:     0x55d27fd5acdd - tokio::runtime::context::runtime::enter_runtime::h83b56f50d8f9f15a
  25:     0x55d27fdc1ca4 - tokio::runtime::runtime::Runtime::block_on::h1be1cae2d7092a47
  26:     0x55d27fdeda2f - candle_vllm::main::h75bc07d9ebcc7535
  27:     0x55d27fd9fcef - std::sys_common::backtrace::__rust_begin_short_backtrace::hf956e45bfe2dcf62
  28:     0x55d27fd9fd22 - std::rt::lang_start::{{closure}}::hf05d036a2c013dd7
  29:     0x55d280554d97 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h04bbe96bde0bc897
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/core/src/ops/function.rs:284:13
  30:     0x55d280554d97 - std::panicking::try::do_call::he78cea8c5c5e38dc
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/panicking.rs:500:40
  31:     0x55d280554d97 - std::panicking::try::h90115c76987a2911
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/panicking.rs:464:19
  32:     0x55d280554d97 - std::panic::catch_unwind::hf9383826be3f5d9b
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/panic.rs:142:14
  33:     0x55d280554d97 - std::rt::lang_start_internal::{{closure}}::h9558c46bab60ed0d
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/rt.rs:148:48
  34:     0x55d280554d97 - std::panicking::try::do_call::hc9c637fb9cb529b5
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/panicking.rs:500:40
  35:     0x55d280554d97 - std::panicking::try::hdc72401b46fd9b08
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/panicking.rs:464:19
  36:     0x55d280554d97 - std::panic::catch_unwind::hedb40674d8218c74
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/panic.rs:142:14
  37:     0x55d280554d97 - std::rt::lang_start_internal::h4c2f75864f09731e
                               at /build/rustc-N04w3E/rustc-1.72.1+dfsg0ubuntu1~bpo0/library/std/src/rt.rs:148:20
  38:     0x55d27fdeda95 - main
  39:     0x7f6377001d90 - <unknown>
  40:     0x7f6377001e40 - __libc_start_main
  41:     0x55d27fd4ec65 - _start
  42:                0x0 - <unknown>

KV Cache causes breakage

The current implementation of the Llama pipeline has a bug that likely relates to the KV cache (standard right now, see #3). This causes a crash with the error Error: cannot broadcast [4, 4] to [1, 32, 4, 39].

Minimum Reproducible:

Run this command:
RUST_BACKTRACE=1 HF_TOKEN=... cargo run --release -- --hf-token HF_TOKEN --port 2000 llama --repeat-last-n 64

And then this, twice:
curl http://localhost:2000/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "llama", "messages": "San Francisco is a" }'

The error should occur on the second run.

Can the architectural design be improved?

Thanks for your candle-vllm.

I have my own LLM and tokenizer with different prompt generation, and I want to serve it as an OpenAI-compatible API. I also don't want to modify the code of candle-vllm; I just want to use it as a library.

But now, looking at your design, it seems like you want to do everything yourself.

pub enum SeperatorStyle {
    #[default]
    AddColonSingle,
    AddColonTwo,
    AddColonSpaceSingle,

I wonder if model, tokenizer and prompt generation can be abstracted into traits?

mistral error

I've run the example with the mistral model, but when I try to run the script a second time I get

openai.InternalServerError: Error: shape mismatch in broadcast_add, lhs: [1, 32, 62, 189], rhs: [1, 1, 62, 62]

I've seen this error before when I tried to use one of candle's LLMs on a list of inputs. I think it's an issue with candle's mutability. Did you solve it for LLaMa? I couldn't test it since I don't have access to LLaMa on Hugging Face (my request for access is pending).

Support streaming of tokens

I noticed that OpenAI's stream=True could perhaps be implemented here:

for _ in 0..sampling.n {
    let mut tokens = xs.get_ids().to_vec();
    let (result, tokens_gen) = self.forward_inner(
        &mut tokens,
        &sampling,
        &device,
        &eos_token_id,
        &mut logits_processor,
    )?;
    tokens_generated += tokens_gen.completion_tokens;
    choices.push(result);
}

KV Cache and Scheduler tracking issue

This is a tracking issue for the KV cache and its associated scheduler. Note that batching is handled by the scheduler.

Short-term overview

  • Add basic framework
    • Scheduler
    • CacheEngine
  • Implement the Scheduler
    • Implement BlockEngine to determine allocation capacity
    • Use SequenceGroup to prepare for future sampling methods
    • Add preemption method
    • Add method to possibly allocate a slot
    • Add swapping in
  • Use scheduled blocks
    • FFI calls to the vLLM CUDA kernels for swapping, copying blocks - thanks @sigma-andex for their FFI development
    • Schedule new blocks on each token generation cycle
    • Prepare inputs for input, decode

Wrong URL for downloading models

Maybe this error comes from one of the candle-vllm dependencies, in which case I would appreciate help identifying which one so I can report it there.

Tried to run the server and the URL is badly formed:

root@a07c64fd168b:/candle-vllm# /root/.cargo/bin/candle-vllm --hf-token HF_TOKEN --port 8080 llama7b --repeat-last-n 4096
Error: APIError { data: "request error: https://huggingface.co/meta-llama/Llama-27b-chat-hf/resolve/main/tokenizer.json: status code 404" }

Pass tensor pointers

  • Added a generic and zero-cost Conjoined type which upholds an important aliasing invariant
    • Implemented DeviceRepr on this type
  • This will be passed to the copy_blocks_internal_kernel kernel.

Add model support for gemma 9b

Symptoms

I found that using google/gemma-9b-it raises an error stating the following.

(Some(_), Some(_)) => panic!("both hidden_act and hidden_activation are set"),

Support Mixtral-8x7B-v0.1

Currently the best open model, so it may be worth incorporating once some other parts of the project are a bit more complete and stabilized.

Running without huggingface token cache raises an error

Description

When trying to run candle-vllm with a Hugging Face model, the program fails if the ~/.cache/huggingface/token file doesn't exist. It appears that the program is not correctly handling the case where the token file or its parent directories don't exist.

Steps to Reproduce

  1. Ensure ~/.cache/huggingface/token doesn't exist
  2. Run candle-vllm with a command like:
cargo run --release -- --port 2000 --model-id meta-llama/Llama-2-7b-chat-hf llama
  3. Observe the error

Current Behavior

The program fails with an error: Error: APIError { data: "No such file or directory (os error 21)" }

Expected Behavior

The program should:

  1. Create the ~/.cache/huggingface directory if it doesn't exist
  2. Create an empty token file if it doesn't exist
  3. Prompt the user to enter their Hugging Face token if the file is empty
  4. Alternatively, provide a clear error message instructing the user to create the token file

Workaround

Currently, users need to manually create the directory and file:

mkdir -p ~/.cache/huggingface
touch ~/.cache/huggingface/token
vim ~/.cache/huggingface/token  # To add the token manually

Additional Context

This issue affects the user experience, especially for new users who may not be familiar with the expected file structure. Implementing proper file and directory handling would improve the ease of use of candle-vllm.

Thanks for this awesome project for rescuing vllm from Python!

candle-flash-attn linking error with Red Hat based distributions

I am trying to make the following (unfinished) Dockerfile work:

# Note: if building on a machine with a different GPU or no GPU then check
# https://developer.nvidia.com/cuda-gpus and pass the value without the decimal point to
# CUDA_COMPUTE_CAP directly without the $(...), for example for an A100 is CUDA_COMPUTE_CAP=80 and
# for an A10 is CUDA_COMPUTE_CAP=86.
#
# docker build --build-arg USERID=$(id -u) --build-arg \
#   CUDA_COMPUTE_CAP=$(nvidia-smi --query-gpu=compute_cap --format=csv | tail -n1 | tr -d .) \
#   -t local/candle-vllm-bench .

# Select an available version from
# https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/supported-tags.md:
FROM nvidia/cuda:12.3.1-devel-rockylinux9
ARG USERID=1000
ARG CUDA_COMPUTE_CAP
RUN yum install -y cargo libcudnn8-devel openssl-devel git && yum clean all && \
    rm -rf /var/cache/yum/*
RUN git clone https://github.com/EricLBuehler/candle-vllm
WORKDIR /candle-vllm
RUN cargo build --release --features cuda,cudnn,flash-attn,nccl
RUN adduser -u $USERID user
USER user

But it fails with:

error: linking with `cc` failed: exit status: 1
  |
  = note: LC_ALL="C" PATH="/usr/lib/rustlib/x86_64-unknown-linux-gnu/bin:/root/.local/bin:/root/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" VSLANG="1033" "cc" "-m64" "/tmp/rustczqulH1/symbols.o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb.candle_vllm.cb7142adda2cd887-cgu.0.rcgu.o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb.candle_vllm.cb7142adda2cd887-cgu.1.rcgu.o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb.candle_vllm.cb7142adda2cd887-cgu.10.rcgu.o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb.candle_vllm.cb7142adda2cd887-cgu.11.rcgu.o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb.candle_vllm.cb7142adda2cd887-cgu.12.rcgu.o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb.candle_vllm.cb7142adda2cd887-cgu.13.rcgu.o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb.candle_vllm.cb7142adda2cd887-cgu.14.rcgu.o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb.candle_vllm.cb7142adda2cd887-cgu.15.rcgu.o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb.candle_vllm.cb7142adda2cd887-cgu.2.rcgu.o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb.candle_vllm.cb7142adda2cd887-cgu.3.rcgu.o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb.candle_vllm.cb7142adda2cd887-cgu.4.rcgu.o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb.candle_vllm.cb7142adda2cd887-cgu.5.rcgu.o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb.candle_vllm.cb7142adda2cd887-cgu.6.rcgu.o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb.candle_vllm.cb7142adda2cd887-cgu.7.rcgu.o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb.candle_vllm.cb7142adda2cd887-cgu.8.rcgu.o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb.candle_vllm.cb7142adda2cd887-cgu.9.rcgu.o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb.4psql9m0o7iw6sqs.rcgu.o" "-Wl,--as-needed" "-L" "/candle-vllm/target/release/deps" "-L" "/candle-vllm/target/release/build/zstd-sys-51991617680764ab/out" "-L" "/usr/local/cuda/lib64" "-L" "/usr/local/cuda/lib64/stubs" "-L" "/usr/local/cuda/targets/x86_64-linux" "-L" "/usr/local/cuda/targets/x86_64-linux/lib" "-L" "/usr/local/cuda/targets/x86_64-linux/lib/stubs" "-L" "/usr/lib" "-L" "/usr/lib64" "-L" "/candle-vllm/target/release/build/bzip2-sys-f7fb57a3f4e98cc1/out/lib" "-L" "/candle-vllm/target/release/build/ring-a59330cc6e943984/out" "-L" "/candle-vllm/target/release/build/lz4-sys-c90b3b6e3d6da391/out" "-L" "/candle-vllm/target/release/build/esaxx-rs-83f1f68488f360a8/out" "-L" "/candle-vllm/target/release/build/onig_sys-d0c2f3461f43020d/out" "-L" "/candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out" "-L" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-Wl,-Bstatic" "/candle-vllm/target/release/deps/libenv_logger-0f0fa188a1404846.rlib" "/candle-vllm/target/release/deps/libtermcolor-c53cf66b9b32e10f.rlib" "/candle-vllm/target/release/deps/libis_terminal-cdf9c5266fcbba03.rlib" "/candle-vllm/target/release/deps/librustix-a629012946c99e6d.rlib" "/candle-vllm/target/release/deps/liblinux_raw_sys-15bed2ca91cf42a8.rlib" "/candle-vllm/target/release/deps/libhumantime-1dc284c82c7f0559.rlib" "/candle-vllm/target/release/deps/libcandle_vllm-ce9b07d51787770c.rlib" "/candle-vllm/target/release/deps/libchrono-ae2c4cf3aacef826.rlib" 
"/candle-vllm/target/release/deps/libiana_time_zone-2bd86fbdc9e46a38.rlib" "/candle-vllm/target/release/deps/libhf_hub-b2415c762b503a90.rlib" "/candle-vllm/target/release/deps/libdirs-45aa89c180ae36f2.rlib" "/candle-vllm/target/release/deps/libdirs_sys-b0294348c2e4986c.rlib" "/candle-vllm/target/release/deps/liboption_ext-3db96de540040126.rlib" "/candle-vllm/target/release/deps/libureq-22a62ebb34562523.rlib" "/candle-vllm/target/release/deps/libnative_tls-addec962e00a97ff.rlib" "/candle-vllm/target/release/deps/libopenssl_probe-e135bf478bd9e62b.rlib" "/candle-vllm/target/release/deps/libopenssl-f7e740960c8b0b56.rlib" "/candle-vllm/target/release/deps/libforeign_types-434e4620cdd2963d.rlib" "/candle-vllm/target/release/deps/libforeign_types_shared-3cd91dddd8b3059a.rlib" "/candle-vllm/target/release/deps/libopenssl_sys-2724f2f05b6f6e71.rlib" "/candle-vllm/target/release/deps/libwebpki_roots-fb31dcc12f4e6db5.rlib" "/candle-vllm/target/release/deps/librustls-ca4a80b00d74d11d.rlib" "/candle-vllm/target/release/deps/libsct-d1a0a53864376724.rlib" "/candle-vllm/target/release/deps/libwebpki-8db93ee63982280a.rlib" "/candle-vllm/target/release/deps/libring-c45b21a3fb043429.rlib" "/candle-vllm/target/release/deps/libspin-a5bca8ced7fc453c.rlib" "/candle-vllm/target/release/deps/libuntrusted-766afbb3ef44c1d1.rlib" "/candle-vllm/target/release/deps/libcandle_lora_transformers-c49058ffb6d7068a.rlib" "/candle-vllm/target/release/deps/libtqdm-e47c7a840c2fc706.rlib" "/candle-vllm/target/release/deps/libcrossterm-f705860770d94db8.rlib" "/candle-vllm/target/release/deps/libsignal_hook_mio-ec3a5a299cc915e5.rlib" "/candle-vllm/target/release/deps/libsignal_hook-49df15a2181bf250.rlib" "/candle-vllm/target/release/deps/libanyhow-78648c12fa2eaee5.rlib" "/candle-vllm/target/release/deps/libcandle_lora-0543f2db3a02f6c2.rlib" "/candle-vllm/target/release/deps/libtrc-af4d2dc9e955d45c.rlib" "/candle-vllm/target/release/deps/libuuid-8e9abe15319c7747.rlib" "/candle-vllm/target/release/deps/libcandle_transformers-3a408f703fe757e5.rlib" "/candle-vllm/target/release/deps/libserde_plain-9edacf8e6b8b5e3b.rlib" "/candle-vllm/target/release/deps/libcandle_flash_attn-6ec38f8ed9aac30d.rlib" "/candle-vllm/target/release/deps/libdyn_fmt-ca01837b2f65b0b1.rlib" "/candle-vllm/target/release/deps/libfutures-813f484dc1c71e4c.rlib" "/candle-vllm/target/release/deps/libfutures_executor-cdd38bae408d4ce8.rlib" "/candle-vllm/target/release/deps/libcandle_sampling-07b86ed24f500345.rlib" "/candle-vllm/target/release/deps/libcandle_nn-3eaedbdadbe5fbb5.rlib" "/candle-vllm/target/release/deps/libtokenizers-61b7f12c56fed2c5.rlib" "/candle-vllm/target/release/deps/libesaxx_rs-c3b0fa8f52cc413c.rlib" "/candle-vllm/target/release/deps/libunicode_normalization_alignments-025da513407d9879.rlib" "/candle-vllm/target/release/deps/libspm_precompiled-8a5e3784a84b6fa0.rlib" "/candle-vllm/target/release/deps/libbase64-a00060132962802d.rlib" "/candle-vllm/target/release/deps/libunicode_segmentation-0609f6ce0b27032d.rlib" "/candle-vllm/target/release/deps/libnom-828591b7d6e9f08d.rlib" "/candle-vllm/target/release/deps/libunicode_categories-4b2d8309eb580595.rlib" "/candle-vllm/target/release/deps/libmonostate-121edb8fb43689e8.rlib" "/candle-vllm/target/release/deps/libmacro_rules_attribute-fbe2172e90fd6d9d.rlib" "/candle-vllm/target/release/deps/libindicatif-5ac26ff2181c3839.rlib" "/candle-vllm/target/release/deps/libportable_atomic-37fa7d733d3c2283.rlib" "/candle-vllm/target/release/deps/libnumber_prefix-fcbd61cd7f0fb674.rlib" 
"/candle-vllm/target/release/deps/libconsole-927989bf813852d8.rlib" "/candle-vllm/target/release/deps/libunicode_width-4a01194dbfae8c91.rlib" "/candle-vllm/target/release/deps/librayon_cond-ec5fdcb09b40065c.rlib" "/candle-vllm/target/release/deps/libitertools-87b264833edf6f52.rlib" "/candle-vllm/target/release/deps/libonig-40dabd6ed5124b91.rlib" "/candle-vllm/target/release/deps/libonig_sys-90597c1391bce008.rlib" "/candle-vllm/target/release/deps/libderive_builder-3471ddeab47c0b9a.rlib" "/candle-vllm/target/release/deps/liblazy_static-852800890c81fb22.rlib" "/candle-vllm/target/release/deps/libclap-23394ec333e54596.rlib" "/candle-vllm/target/release/deps/libclap_builder-41cde94296fdb820.rlib" "/candle-vllm/target/release/deps/libstrsim-bfb3799e9677cd4d.rlib" "/candle-vllm/target/release/deps/libanstream-d284661ab137b824.rlib" "/candle-vllm/target/release/deps/libanstyle_query-d08e7c102e46eb49.rlib" "/candle-vllm/target/release/deps/libcolorchoice-d9fe16d50a3dd803.rlib" "/candle-vllm/target/release/deps/libanstyle_parse-6ac7d6e179081361.rlib" "/candle-vllm/target/release/deps/libutf8parse-86e737e0d4678582.rlib" "/candle-vllm/target/release/deps/libclap_lex-3a6b7689365ae37a.rlib" "/candle-vllm/target/release/deps/libanstyle-9a261b265642b8a4.rlib" "/candle-vllm/target/release/deps/libcandle_core-d2f01b6e6a29d888.rlib" "/candle-vllm/target/release/deps/libmemmap2-4476da1f91fb3603.rlib" "/candle-vllm/target/release/deps/libzip-9bf92410c307c36c.rlib" "/candle-vllm/target/release/deps/libpbkdf2-bfe2a8675cfe3dd6.rlib" "/candle-vllm/target/release/deps/libsha2-7f594f901cd89567.rlib" "/candle-vllm/target/release/deps/libpassword_hash-2fa33ff8d4990779.rlib" "/candle-vllm/target/release/deps/libbase64ct-760f27bcfd4054ae.rlib" "/candle-vllm/target/release/deps/libzstd-bafef58bb20c82a7.rlib" "/candle-vllm/target/release/deps/libzstd_safe-2c41e8f78c52fdfc.rlib" "/candle-vllm/target/release/deps/libbzip2-b94c5c5e7c15f010.rlib" "/candle-vllm/target/release/deps/libbzip2_sys-a158ea0d0289b351.rlib" "/candle-vllm/target/release/deps/libaes-dc1bc8251226040a.rlib" "/candle-vllm/target/release/deps/libcipher-eeb8ea70098f4f7f.rlib" "/candle-vllm/target/release/deps/libinout-5e79d2c693701e41.rlib" "/candle-vllm/target/release/deps/libhmac-246f344022381f5d.rlib" "/candle-vllm/target/release/deps/libconstant_time_eq-742a8ca43fc4b3c6.rlib" "/candle-vllm/target/release/deps/libyoke-b5cb326284cb506c.rlib" "/candle-vllm/target/release/deps/libzerofrom-72df68927b68a064.rlib" "/candle-vllm/target/release/deps/libstable_deref_trait-76725faa25d9c59b.rlib" "/candle-vllm/target/release/deps/libthiserror-7cc4f2a96da73a94.rlib" "/candle-vllm/target/release/deps/libsafetensors-b94965e86f7ef122.rlib" "/candle-vllm/target/release/deps/libcudarc-bb4cc1d0d1d68ba3.rlib" "/candle-vllm/target/release/deps/libcandle_kernels-af06d5fd4a087af6.rlib" "/candle-vllm/target/release/deps/libgemm-9939fb772d1ff792.rlib" "/candle-vllm/target/release/deps/libgemm_c32-cba446e570d4386d.rlib" "/candle-vllm/target/release/deps/libgemm_c64-701b72db790c5491.rlib" "/candle-vllm/target/release/deps/libgemm_f64-132035f8fb79f58d.rlib" "/candle-vllm/target/release/deps/libgemm_f16-a17195123a2b5a97.rlib" "/candle-vllm/target/release/deps/libgemm_f32-43dd1a29089d0d80.rlib" "/candle-vllm/target/release/deps/libgemm_common-888ab4912d03277a.rlib" "/candle-vllm/target/release/deps/libpulp-c51f68967478b6aa.rlib" "/candle-vllm/target/release/deps/libnum_complex-9293d6ad98d7b1c3.rlib" "/candle-vllm/target/release/deps/libdyn_stack-e01f3657ea7d975f.rlib" 
"/candle-vllm/target/release/deps/libreborrow-77659d577c4b718c.rlib" "/candle-vllm/target/release/deps/libraw_cpuid-b9cfe85e371d3083.rlib" "/candle-vllm/target/release/deps/librayon-7e6c7f8c76536947.rlib" "/candle-vllm/target/release/deps/librayon_core-2fef7474b3331466.rlib" "/candle-vllm/target/release/deps/libcrossbeam_deque-f3876680669c2c7d.rlib" "/candle-vllm/target/release/deps/libcrossbeam_epoch-d5f20c1ae49163b7.rlib" "/candle-vllm/target/release/deps/libmemoffset-b4fab92a5d1a5e30.rlib" "/candle-vllm/target/release/deps/libcrossbeam_utils-1d67d2d362ef675e.rlib" "/candle-vllm/target/release/deps/libeither-c016b57e73ba30c1.rlib" "/candle-vllm/target/release/deps/libbyteorder-8bf78fc69cf5b0a1.rlib" "/candle-vllm/target/release/deps/libhalf-82866db1aa6c7f3e.rlib" "/candle-vllm/target/release/deps/librand_distr-b111214f51586c69.rlib" "/candle-vllm/target/release/deps/libnum_traits-28ee9b33f1e53f29.rlib" "/candle-vllm/target/release/deps/libbytemuck-7eee2fa1f516b4ce.rlib" "/candle-vllm/target/release/deps/libactix_web-0a08fb87679df924.rlib" "/candle-vllm/target/release/deps/liburl-1bbf839f22bd1732.rlib" "/candle-vllm/target/release/deps/libidna-fb425d18121613f1.rlib" "/candle-vllm/target/release/deps/libunicode_normalization-7972d0be1c38ac31.rlib" "/candle-vllm/target/release/deps/libtinyvec-61debd23e06e16bf.rlib" "/candle-vllm/target/release/deps/libtinyvec_macros-f326b6a6f0ca8a7b.rlib" "/candle-vllm/target/release/deps/libunicode_bidi-9dc6f963fdeb5a21.rlib" "/candle-vllm/target/release/deps/libserde_urlencoded-9f88ee3d21b5ec1b.rlib" "/candle-vllm/target/release/deps/libform_urlencoded-3e169fc285508f2a.rlib" "/candle-vllm/target/release/deps/libserde_json-2daaa0f082f50c3a.rlib" "/candle-vllm/target/release/deps/libryu-8b05c69dcf279a6f.rlib" "/candle-vllm/target/release/deps/libactix_server-e79c728840296968.rlib" "/candle-vllm/target/release/deps/libactix_router-48a733d95bd3dd5e.rlib" "/candle-vllm/target/release/deps/libregex-c78c6a0d40f8f119.rlib" "/candle-vllm/target/release/deps/libregex_automata-3822bb291a95f096.rlib" "/candle-vllm/target/release/deps/libaho_corasick-6f9c3d032c4f562f.rlib" "/candle-vllm/target/release/deps/libregex_syntax-3dd804a409b2c545.rlib" "/candle-vllm/target/release/deps/libserde-23513cb3b07422f8.rlib" "/candle-vllm/target/release/deps/libcookie-30bd32d9b0d08b83.rlib" "/candle-vllm/target/release/deps/libtime-bc85cd6997494558.rlib" "/candle-vllm/target/release/deps/libtime_core-531fb2a2b6009484.rlib" "/candle-vllm/target/release/deps/libderanged-5409594f6406082d.rlib" "/candle-vllm/target/release/deps/libpowerfmt-c4543fc1903272c6.rlib" "/candle-vllm/target/release/deps/libactix_http-f7b0baf59fd7bb10.rlib" "/candle-vllm/target/release/deps/librand-aa6ddb6627b48b96.rlib" "/candle-vllm/target/release/deps/librand_chacha-fa47a10cc5e59439.rlib" "/candle-vllm/target/release/deps/libppv_lite86-9a645f708eed4e1c.rlib" "/candle-vllm/target/release/deps/librand_core-479671a2b8263665.rlib" "/candle-vllm/target/release/deps/libhttparse-699e93ce2c2e7905.rlib" "/candle-vllm/target/release/deps/libbrotli-df4299509820f939.rlib" "/candle-vllm/target/release/deps/libbrotli_decompressor-0212e4cdb0da1245.rlib" "/candle-vllm/target/release/deps/liballoc_stdlib-fc777d5f3c59a235.rlib" "/candle-vllm/target/release/deps/liballoc_no_stdlib-f497a54db348ea9b.rlib" "/candle-vllm/target/release/deps/libhttpdate-5f8e81ac577420b0.rlib" "/candle-vllm/target/release/deps/libsha1-ad6469ba6b8b2240.rlib" "/candle-vllm/target/release/deps/libcpufeatures-dcef25221428931f.rlib" 
"/candle-vllm/target/release/deps/libdigest-f32a2ccccbd945ab.rlib" "/candle-vllm/target/release/deps/libsubtle-910e19b9d08b2799.rlib" "/candle-vllm/target/release/deps/libblock_buffer-2ad0dde06bca4c37.rlib" "/candle-vllm/target/release/deps/libcrypto_common-30c46997c474a2db.rlib" "/candle-vllm/target/release/deps/libgeneric_array-95ff38f8e6dc2014.rlib" "/candle-vllm/target/release/deps/libtypenum-ddf8574aa94ffabe.rlib" "/candle-vllm/target/release/deps/libbase64-daaf16d87f9b4835.rlib" "/candle-vllm/target/release/deps/liblocal_channel-5501da97fbe12c8a.rlib" "/candle-vllm/target/release/deps/libbytestring-4d1e0f611bab987e.rlib" "/candle-vllm/target/release/deps/libencoding_rs-c048082deb3a71c3.rlib" "/candle-vllm/target/release/deps/liblanguage_tags-e0dfc52f86f9b27a.rlib" "/candle-vllm/target/release/deps/libahash-a28674307e9664ad.rlib" "/candle-vllm/target/release/deps/libgetrandom-b24cab7002c3530b.rlib" "/candle-vllm/target/release/deps/libzerocopy-63825396d720b9a6.rlib" "/candle-vllm/target/release/deps/libmime-04e6f00618993e67.rlib" "/candle-vllm/target/release/deps/libpercent_encoding-d54414372a2980de.rlib" "/candle-vllm/target/release/deps/libh2-27cdaea5e3d2147c.rlib" "/candle-vllm/target/release/deps/libindexmap-fcdde0ade0e1bfe3.rlib" "/candle-vllm/target/release/deps/libequivalent-8a25e166243cfe94.rlib" "/candle-vllm/target/release/deps/libhashbrown-aee95c0614bccf63.rlib" "/candle-vllm/target/release/deps/libfutures_util-98b8b67b3d434750.rlib" "/candle-vllm/target/release/deps/libfutures_io-bbce8973c99e7ece.rlib" "/candle-vllm/target/release/deps/libslab-490ef311b9a84e0e.rlib" "/candle-vllm/target/release/deps/libfutures_channel-6d294bf595dec06a.rlib" "/candle-vllm/target/release/deps/libfutures_task-0a7c23a0933dbcaa.rlib" "/candle-vllm/target/release/deps/libpin_utils-185c55cbe9ca2fff.rlib" "/candle-vllm/target/release/deps/libbitflags-1029aec9c38cde73.rlib" "/candle-vllm/target/release/deps/libzstd-242538c7759a4fa6.rlib" "/candle-vllm/target/release/deps/libzstd_safe-d25e92a1d04503ec.rlib" "/candle-vllm/target/release/deps/libzstd_sys-a6ec9cf883e86b56.rlib" "/candle-vllm/target/release/deps/libflate2-b67596bfbb64de8d.rlib" "/candle-vllm/target/release/deps/libminiz_oxide-2b969af90226827f.rlib" "/candle-vllm/target/release/deps/libsimd_adler32-d1dbd8e6b06bf162.rlib" "/candle-vllm/target/release/deps/libcrc32fast-ceb628e76fc0bab0.rlib" "/candle-vllm/target/release/deps/libactix_service-dfc20131f5ba36d4.rlib" "/candle-vllm/target/release/deps/libactix_codec-f3cae536aed1196d.rlib" "/candle-vllm/target/release/deps/libtokio_util-88b2eabf4483c1ed.rlib" "/candle-vllm/target/release/deps/libtracing-9e7a6177765350ac.rlib" "/candle-vllm/target/release/deps/libtracing_core-c5e9157560beafe6.rlib" "/candle-vllm/target/release/deps/libonce_cell-4b31816a5aa6274f.rlib" "/candle-vllm/target/release/deps/libmemchr-38d4fc2a3522aa15.rlib" "/candle-vllm/target/release/deps/libfutures_sink-78114cacf22202c2.rlib" "/candle-vllm/target/release/deps/libbitflags-b9815c55ec510696.rlib" "/candle-vllm/target/release/deps/libactix_utils-ec862be5af373362.rlib" "/candle-vllm/target/release/deps/liblocal_waker-7857496d2dec9a57.rlib" "/candle-vllm/target/release/deps/libactix_rt-0ffc3a15823d1322.rlib" "/candle-vllm/target/release/deps/libtokio-b67279acab90ede3.rlib" "/candle-vllm/target/release/deps/libsignal_hook_registry-a773ced30481d3cb.rlib" "/candle-vllm/target/release/deps/libnum_cpus-fbaf57124b2a0166.rlib" "/candle-vllm/target/release/deps/libsocket2-8e37cfa1c7015c6b.rlib" 
"/candle-vllm/target/release/deps/libmio-81de974463968f98.rlib" "/candle-vllm/target/release/deps/liblog-35f97248cb2ec82c.rlib" "/candle-vllm/target/release/deps/libparking_lot-e183fcd4a13bd183.rlib" "/candle-vllm/target/release/deps/libparking_lot_core-5fbb54b30e35e540.rlib" "/candle-vllm/target/release/deps/liblibc-d38dc52f94735460.rlib" "/candle-vllm/target/release/deps/libcfg_if-88c619515d65e3f1.rlib" "/candle-vllm/target/release/deps/libsmallvec-e35ec471a6514672.rlib" "/candle-vllm/target/release/deps/liblock_api-920512de5989abb2.rlib" "/candle-vllm/target/release/deps/libscopeguard-6208b4062bcdc2b1.rlib" "/candle-vllm/target/release/deps/libpin_project_lite-42a553ee08f02ebb.rlib" "/candle-vllm/target/release/deps/libfutures_core-b87582f06d7f1343.rlib" "/candle-vllm/target/release/deps/libhttp-b738399ec4ab1c60.rlib" "/candle-vllm/target/release/deps/libitoa-dcbca83b54db3306.rlib" "/candle-vllm/target/release/deps/libbytes-8c2bf1b211f72910.rlib" "/candle-vllm/target/release/deps/libfnv-ffe196e20ea2a648.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/libstd-9c342d6596ca77d8.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/libpanic_unwind-35e6faa0abf08dd1.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/libobject-6242b5524a2684de.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/libmemchr-94511439d510df36.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/libaddr2line-1923a594ddedab24.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/libgimli-5b476927cd520d76.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/librustc_demangle-6b4664d28b4dc07b.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/libstd_detect-4d7e14ee42b44abc.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/libhashbrown-94e04d08d317eb2b.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/librustc_std_workspace_alloc-7e3a1db27b23a8ee.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/libminiz_oxide-0651af3c34a1e4b9.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/libadler-e5da8ecb95d2de36.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/libunwind-052b86aa844a2857.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcfg_if-bbd2a157557b773d.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/liblibc-f47279717d0e1831.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/liballoc-d30e243a979711ec.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/librustc_std_workspace_core-18929aabe36e3f57.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcore-f9f41fbdedfbfafb.rlib" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib/libcompiler_builtins-b26982894e484f03.rlib" "-Wl,-Bdynamic" "-lssl" "-lcrypto" "-lflashattention" "-lcudart" "-lstdc++" "-lstdc++" "-lcuda" "-lnccl" "-lnvrtc" "-lcurand" "-lcublas" "-lcublasLt" "-lcudnn" "-lgcc_s" "-lutil" "-lrt" "-lpthread" "-lm" "-ldl" "-lc" "-Wl,--eh-frame-hdr" "-Wl,-z,noexecstack" "-L" "/usr/lib/rustlib/x86_64-unknown-linux-gnu/lib" "-o" "/candle-vllm/target/release/deps/candle_vllm-8b71aad931b633bb" "-Wl,--gc-sections" "-pie" "-Wl,-z,relro,-z,now" "-Wl,-O1" "-nodefaultlibs"
  = note: /usr/bin/ld: /candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out/libflashattention.a(flash_api.o): relocation R_X86_64_32 against `.nvFatBinSegment' can not be used when making a PIE object; recompile with -fPIE
          /usr/bin/ld: /candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out/libflashattention.a(flash_fwd_hdim128_fp16_sm80.o): relocation R_X86_64_32 against symbol `_Z16flash_fwd_kernelI23Flash_fwd_kernel_traitsILi128ELi128ELi64ELi4ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi128ELi128ELi64ELi4ES2_EELb0ELb0ELb1ELb1ELb0EEv16Flash_fwd_params' can not be used when making a PIE object; recompile with -fPIE
          /usr/bin/ld: /candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out/libflashattention.a(flash_fwd_hdim160_fp16_sm80.o): relocation R_X86_64_32 against symbol `_Z16flash_fwd_kernelI23Flash_fwd_kernel_traitsILi160ELi64ELi64ELi4ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi160ELi64ELi64ELi4ES2_EELb1ELb1ELb0ELb0ELb1EEv16Flash_fwd_params' can not be used when making a PIE object; recompile with -fPIE
          /usr/bin/ld: /candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out/libflashattention.a(flash_fwd_hdim192_fp16_sm80.o): relocation R_X86_64_32 against symbol `_Z16flash_fwd_kernelI23Flash_fwd_kernel_traitsILi192ELi64ELi64ELi4ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi192ELi64ELi64ELi4ES2_EELb1ELb1ELb1ELb0ELb1EEv16Flash_fwd_params' can not be used when making a PIE object; recompile with -fPIE
          /usr/bin/ld: /candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out/libflashattention.a(flash_fwd_hdim224_fp16_sm80.o): relocation R_X86_64_32 against symbol `_Z16flash_fwd_kernelI23Flash_fwd_kernel_traitsILi224ELi128ELi64ELi8ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi224ELi128ELi64ELi8ES2_EELb1ELb1ELb1ELb1ELb1EEv16Flash_fwd_params' can not be used when making a PIE object; recompile with -fPIE
          /usr/bin/ld: /candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out/libflashattention.a(flash_fwd_hdim256_fp16_sm80.o): relocation R_X86_64_32 against symbol `_Z16flash_fwd_kernelI23Flash_fwd_kernel_traitsILi256ELi64ELi64ELi4ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi256ELi64ELi64ELi4ES2_EELb0ELb0ELb0ELb1ELb0EEv16Flash_fwd_params' can not be used when making a PIE object; recompile with -fPIE
          /usr/bin/ld: /candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out/libflashattention.a(flash_fwd_hdim32_fp16_sm80.o): relocation R_X86_64_32 against symbol `_Z16flash_fwd_kernelI23Flash_fwd_kernel_traitsILi32ELi128ELi128ELi4ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi32ELi128ELi128ELi4ES2_EELb1ELb1ELb1ELb0ELb1EEv16Flash_fwd_params' can not be used when making a PIE object; recompile with -fPIE
          /usr/bin/ld: /candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out/libflashattention.a(flash_fwd_hdim64_fp16_sm80.o): relocation R_X86_64_32 against symbol `_Z16flash_fwd_kernelI23Flash_fwd_kernel_traitsILi64ELi128ELi64ELi4ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi64ELi128ELi64ELi4ES2_EELb1ELb1ELb1ELb0ELb1EEv16Flash_fwd_params' can not be used when making a PIE object; recompile with -fPIE
          /usr/bin/ld: /candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out/libflashattention.a(flash_fwd_hdim96_fp16_sm80.o): relocation R_X86_64_32 against symbol `_Z16flash_fwd_kernelI23Flash_fwd_kernel_traitsILi96ELi64ELi64ELi4ELb0ELb0EN7cutlass6half_tE19Flash_kernel_traitsILi96ELi64ELi64ELi4ES2_EELb1ELb1ELb1ELb1ELb1EEv16Flash_fwd_params' can not be used when making a PIE object; recompile with -fPIE
          /usr/bin/ld: /candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out/libflashattention.a(flash_fwd_hdim128_bf16_sm80.o): relocation R_X86_64_32 against symbol `_Z16flash_fwd_kernelI23Flash_fwd_kernel_traitsILi128ELi128ELi64ELi4ELb0ELb0EN7cutlass10bfloat16_tE19Flash_kernel_traitsILi128ELi128ELi64ELi4ES2_EELb0ELb0ELb1ELb1ELb0EEv16Flash_fwd_params' can not be used when making a PIE object; recompile with -fPIE
          /usr/bin/ld: /candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out/libflashattention.a(flash_fwd_hdim160_bf16_sm80.o): relocation R_X86_64_32 against symbol `_Z16flash_fwd_kernelI23Flash_fwd_kernel_traitsILi160ELi64ELi64ELi4ELb0ELb0EN7cutlass10bfloat16_tE19Flash_kernel_traitsILi160ELi64ELi64ELi4ES2_EELb1ELb1ELb0ELb0ELb1EEv16Flash_fwd_params' can not be used when making a PIE object; recompile with -fPIE
          /usr/bin/ld: /candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out/libflashattention.a(flash_fwd_hdim192_bf16_sm80.o): relocation R_X86_64_32 against symbol `_Z16flash_fwd_kernelI23Flash_fwd_kernel_traitsILi192ELi64ELi64ELi4ELb0ELb0EN7cutlass10bfloat16_tE19Flash_kernel_traitsILi192ELi64ELi64ELi4ES2_EELb1ELb1ELb1ELb0ELb1EEv16Flash_fwd_params' can not be used when making a PIE object; recompile with -fPIE
          /usr/bin/ld: /candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out/libflashattention.a(flash_fwd_hdim224_bf16_sm80.o): relocation R_X86_64_32 against symbol `_Z16flash_fwd_kernelI23Flash_fwd_kernel_traitsILi224ELi128ELi64ELi8ELb0ELb0EN7cutlass10bfloat16_tE19Flash_kernel_traitsILi224ELi128ELi64ELi8ES2_EELb1ELb1ELb1ELb1ELb1EEv16Flash_fwd_params' can not be used when making a PIE object; recompile with -fPIE
          /usr/bin/ld: /candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out/libflashattention.a(flash_fwd_hdim256_bf16_sm80.o): relocation R_X86_64_32 against symbol `_Z16flash_fwd_kernelI23Flash_fwd_kernel_traitsILi256ELi64ELi64ELi4ELb0ELb0EN7cutlass10bfloat16_tE19Flash_kernel_traitsILi256ELi64ELi64ELi4ES2_EELb0ELb0ELb0ELb1ELb0EEv16Flash_fwd_params' can not be used when making a PIE object; recompile with -fPIE
          /usr/bin/ld: /candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out/libflashattention.a(flash_fwd_hdim32_bf16_sm80.o): relocation R_X86_64_32 against symbol `_Z16flash_fwd_kernelI23Flash_fwd_kernel_traitsILi32ELi128ELi128ELi4ELb0ELb0EN7cutlass10bfloat16_tE19Flash_kernel_traitsILi32ELi128ELi128ELi4ES2_EELb1ELb1ELb1ELb0ELb1EEv16Flash_fwd_params' can not be used when making a PIE object; recompile with -fPIE
          /usr/bin/ld: /candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out/libflashattention.a(flash_fwd_hdim64_bf16_sm80.o): relocation R_X86_64_32 against symbol `_Z16flash_fwd_kernelI23Flash_fwd_kernel_traitsILi64ELi128ELi64ELi4ELb0ELb0EN7cutlass10bfloat16_tE19Flash_kernel_traitsILi64ELi128ELi64ELi4ES2_EELb1ELb1ELb1ELb0ELb1EEv16Flash_fwd_params' can not be used when making a PIE object; recompile with -fPIE
          /usr/bin/ld: /candle-vllm/target/release/build/candle-flash-attn-bb54a4d16d25ee03/out/libflashattention.a(flash_fwd_hdim96_bf16_sm80.o): relocation R_X86_64_32 against symbol `_Z16flash_fwd_kernelI23Flash_fwd_kernel_traitsILi96ELi64ELi64ELi4ELb0ELb0EN7cutlass10bfloat16_tE19Flash_kernel_traitsILi96ELi64ELi64ELi4ES2_EELb1ELb1ELb1ELb1ELb1EEv16Flash_fwd_params' can not be used when making a PIE object; recompile with -fPIE
          collect2: error: ld returned 1 exit status
          

error: could not compile `candle-vllm` (bin "candle-vllm") due to previous error
[root@95e9d872d994 candle-vllm]# PATH="/usr/lib/rustlib/x86_64-unknown-linux-gnu/bin:/root/.local/bin:/root/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
[root@95e9d872d994 candle-vllm]# command -v cc
/usr/bin/cc
[root@95e9d872d994 candle-vllm]# cc --version
cc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)

Maybe I shouldn't use the flash-attn feature?
Thanks for any suggestions or information.

model download failing from HF

I have been getting Error: APIError { data: "request error: Bad Header: invalid header 'Authorization: Bearer hf_XXXXXXX\n'"}

  • I have tried and refreshed my token on HF
  • i have also made sure i have access to the models

however it still dont work, one think i suspect is that along with token \n is getting appended.. is that a bug ? please let know..

thank you

PagedAttention tracking issue

Tracking issue for PagedAttention

General overview

  • Use updated, refactored PagedAttention impl from vLLM
  • Implement attention biases
  • Implement _memory_efficient_attention or equiv.
  • Implement and use block cache (Scheduler, CacheEngine) in inference.
  • Finalize FFI linking to compiled CUDA kernels
  • Test PagedAttention in llama , mistral models

Sampler

  • Implement according to this

ModulePipeline refactoring

Need to redesign ModulePipeline design to integrate with the LLMEngine design. This means that a request output will be returned, and logprobs must be calculated.

  • Calculate logprobs
  • Generate best_of responses instead of the current n to check from (sorted by the logprobs)
  • Use Sampler
  • Do not return a ChatChoices, return something like a RawResponse that contains the logprobs, results
  • Streaming poses a problem. This will need to be handled by the LLMEngine.
  • Overall, convert ModulePipeline to a pure-sequence-generating platform

LLM Engine

Manages calling the model pipeline and linking this with the KV cache.

Tasks

  • Init cache by profiling model memory usage at maximum sequence length input. See: huggingface/candle#1412
  • Write the .generate function, the decoding entrypoint. The rough call chain looks like this, from vllm:
    1. .generate: batch the input seqs, calling .add_request
    • Write .add_request
    • Write .generate
    1. .run_engine: .step through each unfinished request, recording output
    • Write a .has_unfinished_requests
    • Write .run_engine
    1. .step: 1) call Scheduler to manage the seqs to swap for this decoding phase with ._schedule, then 2) execute the model
    1. .execute_model: Follow the cache ops from ._schedule, finally run the model pipeline with the cache.
    • Looks like the cache passes to the model pipeline has shared ownership with the CacheEngine. Will need to use Arc+my interior immutability setup.
    • Write .execute_model

Completed tasks

SequenceGroup:

Sequences generated from the same prompt.

Tasks

Cache:

CacheEngine manages the KV cache.

Tasks

  • Ensure initialization is called

BlockSpaceManager

Managed blocks and allocation

Deps

  • SequenceGroup

Tasks

  • BlockAllocator
  • PhysicalTokenBlock
  • BlockSpaceManager
    • Allocation, in conjuction with a Scheduler impl

Scheduler

Scheduler schedules blocks to swap in/out, copy.

Deps

  • SequenceGroup
  • BlockSpaceManager

Tasks

  • SchedulerConfig
  • BlockSpaceManager (with some deps: 1) Block Allocator)
    • Allocation via BlockSpaceManager

Doesn't compile

Hi.
Tried to compile it and gave some warnings and errors:

warning: unused import: `candle_core::cuda_backend::cudarc::driver::sys as cudarc_sys`
 --> src/backend/paged_attention.rs:3:5
  |
3 | use candle_core::cuda_backend::cudarc::driver::sys as cudarc_sys;
  |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  |
  = note: `#[warn(unused_imports)]` on by default

error[E0107]: struct takes 0 lifetime arguments but 1 lifetime argument was supplied
   --> src/openai/sampling_params.rs:142:10
    |
142 |     ) -> LogitsProcessor<'a> {
    |          ^^^^^^^^^^^^^^^---- help: remove these generics
    |          |
    |          expected 0 lifetime arguments
    |
note: struct defined here, with 0 lifetime parameters
   --> /root/.cargo/git/checkouts/candle-sampling-999e88fb680d680c/f734d71/src/logits_processor.rs:12:12
    |
12  | pub struct LogitsProcessor {
    |            ^^^^^^^^^^^^^^^

error[E0061]: this function takes 8 arguments but 5 arguments were supplied
   --> src/openai/sampling_params.rs:145:13
    |
145 |               LogitsProcessor::new(
    |  _____________^^^^^^^^^^^^^^^^^^^^-
146 | |                 seed,
147 | |                 Some(self.temperature.into()),
148 | |                 SamplingMethod::Multinomial,
149 | |                 top_n_logprobs,
150 | |                 tokenizer,
    | |                 --------- expected `Tokenizer`, found `&Tokenizer`
151 | |             )
    | |_____________- three arguments of type `std::option::Option<f32>`, `std::option::Option<f32>`, and `std::option::Option<HashMap<u32, f32>>` are missing
    |
note: associated function defined here
   --> /root/.cargo/git/checkouts/candle-sampling-999e88fb680d680c/f734d71/src/logits_processor.rs:55:12
    |
55  |     pub fn new(
    |            ^^^
help: consider using clone here
    |
150 |                 tokenizer.clone(),
    |                          ++++++++
help: provide the arguments
    |
145 |             LogitsProcessor::new(seed, Some(self.temperature.into()), SamplingMethod::Multinomial, top_n_logprobs, /* tokenizers::Tokenizer */, /* std::option::Option<f32> */, /* std::option::Option<f32> */, /* std::option::Option<HashMap<u32, f32>> */)
    |                                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

error[E0061]: this function takes 8 arguments but 5 arguments were supplied
   --> src/openai/sampling_params.rs:154:13
    |
154 |               LogitsProcessor::new(
    |  _____________^^^^^^^^^^^^^^^^^^^^-
155 | |                 seed,
156 | |                 Some(self.temperature.into()),
157 | |                 SamplingMethod::TopK(self.top_k.try_into().unwrap()),
158 | |                 top_n_logprobs,
159 | |                 tokenizer,
    | |                 --------- expected `Tokenizer`, found `&Tokenizer`
160 | |             )
    | |_____________- three arguments of type `std::option::Option<f32>`, `std::option::Option<f32>`, and `std::option::Option<HashMap<u32, f32>>` are missing
    |
note: associated function defined here
   --> /root/.cargo/git/checkouts/candle-sampling-999e88fb680d680c/f734d71/src/logits_processor.rs:55:12
    |
55  |     pub fn new(
    |            ^^^
help: consider using clone here
    |
159 |                 tokenizer.clone(),
    |                          ++++++++
help: provide the arguments
    |
154 |             LogitsProcessor::new(seed, Some(self.temperature.into()), SamplingMethod::TopK(self.top_k.try_into().unwrap()), top_n_logprobs, /* tokenizers::Tokenizer */, /* std::option::Option<f32> */, /* std::option::Option<f32> */, /* std::option::Option<HashMap<u32, f32>> */)
    |                                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

error[E0061]: this function takes 8 arguments but 5 arguments were supplied
   --> src/openai/sampling_params.rs:163:13
    |
163 |               LogitsProcessor::new(
    |  _____________^^^^^^^^^^^^^^^^^^^^-
164 | |                 seed,
165 | |                 Some(self.temperature.into()),
166 | |                 SamplingMethod::TopP(self.top_p.into()),
167 | |                 top_n_logprobs,
168 | |                 tokenizer,
    | |                 --------- expected `Tokenizer`, found `&Tokenizer`
169 | |             )
    | |_____________- three arguments of type `std::option::Option<f32>`, `std::option::Option<f32>`, and `std::option::Option<HashMap<u32, f32>>` are missing
    |
note: associated function defined here
   --> /root/.cargo/git/checkouts/candle-sampling-999e88fb680d680c/f734d71/src/logits_processor.rs:55:12
    |
55  |     pub fn new(
    |            ^^^
help: consider using clone here
    |
168 |                 tokenizer.clone(),
    |                          ++++++++
help: provide the arguments
    |
163 |             LogitsProcessor::new(seed, Some(self.temperature.into()), SamplingMethod::TopP(self.top_p.into()), top_n_logprobs, /* tokenizers::Tokenizer */, /* std::option::Option<f32> */, /* std::option::Option<f32> */, /* std::option::Option<HashMap<u32, f32>> */)
    |                                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

error[E0061]: this method takes 2 arguments but 1 argument was supplied
   --> src/openai/pipelines/llama.rs:246:56
    |
246 |             let next_token = try_api!(logits_processor.sample(&logits));
    |                                                        ^^^^^^---------
    |                                                              ||
    |                                                              |expected `candle_core::tensor::Tensor`, found `candle_core::Tensor`
    |                                                              an argument of type `std::option::Option<&[u32]>` is missing
    |
    = note: `candle_core::Tensor` and `candle_core::tensor::Tensor` have similar names, but are actually distinct types
note: `candle_core::Tensor` is defined in crate `candle_core`
   --> /root/.cargo/git/checkouts/candle-0c2b4fa9e5801351/ab86cd3/candle-core/src/tensor.rs:68:1
    |
68  | pub struct Tensor(Arc<Tensor_>);
    | ^^^^^^^^^^^^^^^^^
note: `candle_core::tensor::Tensor` is defined in crate `candle_core`
   --> /root/.cargo/git/checkouts/candle-c6a149c3b35a488f/bcfa563/candle-core/src/tensor.rs:68:1
    |
68  | pub struct Tensor(Arc<Tensor_>);
    | ^^^^^^^^^^^^^^^^^
    = note: perhaps two different versions of crate `candle_core` are being used?
note: method defined here
   --> /root/.cargo/git/checkouts/candle-sampling-999e88fb680d680c/f734d71/src/logits_processor.rs:344:12
    |
344 |     pub fn sample(&mut self, logits: &Tensor, penalty_ctxt: Option<&[u32]>) -> Result<Logprobs> {
    |            ^^^^^^
help: provide the argument
    |
246 |             let next_token = try_api!(logits_processor.sample(/* &candle_core::tensor::Tensor */, /* std::option::Option<&[u32]> */));
    |                                                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

error[E0308]: mismatched types
   --> src/openai/pipelines/llama.rs:255:42
    |
255 |             if Some(next_token.token) == eos_token_id.map(|x| x as usize) {
    |                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `Option<u32>`, found `Option<usize>`
    |
    = note: expected enum `std::option::Option<u32>`
               found enum `std::option::Option<usize>`

error[E0308]: mismatched types
   --> src/scheduler/sequence.rs:73:37
    |
73  |         self.append_token_to_blocks(logprobs.token);
    |              ---------------------- ^^^^^^^^^^^^^^ expected `usize`, found `u32`
    |              |
    |              arguments to this method are incorrect
    |
note: method defined here
   --> src/scheduler/sequence.rs:168:8
    |
168 |     fn append_token_to_blocks(&mut self, token: usize) {
    |        ^^^^^^^^^^^^^^^^^^^^^^            ------------
help: you can convert a `u32` to a `usize` and panic if the converted value doesn't fit
    |
73  |         self.append_token_to_blocks(logprobs.token.try_into().unwrap());
    |                                                   ++++++++++++++++++++

error[E0277]: the trait bound `Vec<usize>: Extend<u32>` is not satisfied
   --> src/scheduler/sequence.rs:109:13
    |
109 |         res.extend(
    |             ^^^^^^ the trait `Extend<u32>` is not implemented for `Vec<usize>`
    |
    = help: the following other types implement trait `Extend<A>`:
              <Vec<T, A> as Extend<T>>
              <Vec<T, A> as Extend<&'a T>>

error[E0308]: mismatched types
   --> src/scheduler/sequence.rs:123:13
    |
119 |     pub fn get_last_token_id(&self) -> usize {
    |                                        ----- expected `usize` because of return type
...
123 |             self.deref().output_token_ids.last().unwrap().token
    |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `usize`, found `u32`
    |
help: you can convert a `u32` to a `usize` and panic if the converted value doesn't fit
    |
123 |             self.deref().output_token_ids.last().unwrap().token.try_into().unwrap()
    |                                                                ++++++++++++++++++++

Some errors have detailed explanations: E0061, E0107, E0277, E0308.
For more information about an error, try `rustc --explain E0061`.
warning: `candle-vllm` (lib) generated 1 warning
error: could not compile `candle-vllm` (lib) due to 9 previous errors; 1 warning emitted
warning: build failed, waiting for other jobs to finish...
error: failed to compile `candle-vllm v0.1.0 (/candle-vllm)`, intermediate artifacts can be found at `/candle-vllm/target`.

Thanks!
p.s.: is this repo obsolete or there will be continued development in the future?

OpenAI API version

What version did you use for Python OpenAI package? I've tried with 0.27.8 and I can't even run the example, and after I changed code according to OpenAI pypi page

import openai

# Run: HF_TOKEN=... cargo run --release -- --hf-token HF_TOKEN --port 2000 llama7b --repeat-last-n 64


openai.api_base = "http://localhost:2000/v1/"
openai.api_key = ""

completion = openai.ChatCompletion.create(
    model="mistral7b",
    messages=[
        {
            "role": "user",
            "content": "Explain how to best learn Rust.",
        },
    ],
    max_tokens=32,
)
print(completion.choices[0].message.content)

I get

Traceback (most recent call last):
File "/home/kuba/.pyenv/versions/3.10.6/lib/python3.10/site-packages/openai/api_requestor.py", line 753, in _interpret_response_line
data = json.loads(rbody)
File "/home/kuba/.pyenv/versions/3.10.6/lib/python3.10/json/init.py", line 346, in loads
return _default_decoder.decode(s)
File "/home/kuba/.pyenv/versions/3.10.6/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/kuba/.pyenv/versions/3.10.6/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/kuba/Projects/forks/candle-vllm/examples/llama.py", line 9, in
completion = openai.ChatCompletion.create(
File "/home/kuba/.pyenv/versions/3.10.6/lib/python3.10/site-packages/openai/api_resources/chat_completion.py", line 25, in create
return super().create(*args, **kwargs)
File "/home/kuba/.pyenv/versions/3.10.6/lib/python3.10/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
response, _, api_key = requestor.request(
File "/home/kuba/.pyenv/versions/3.10.6/lib/python3.10/site-packages/openai/api_requestor.py", line 298, in request
resp, got_stream = self._interpret_response(result, stream)
File "/home/kuba/.pyenv/versions/3.10.6/lib/python3.10/site-packages/openai/api_requestor.py", line 700, in _interpret_response
self._interpret_response_line(
File "/home/kuba/.pyenv/versions/3.10.6/lib/python3.10/site-packages/openai/api_requestor.py", line 755, in _interpret_response_line
raise error.APIError(
openai.error.APIError: HTTP code 404 from API ()

Support using arbitrary derivative models

Currently the models need to be specified as llama7b for example, but what if one wants to use codellama/CodeLlama-7b-hf or meta-llama/Llama-2-7b-hf (non chat version), etc.?
A more flexible method should be implemented in the future.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.