Mistral.rs is an LLM inference platform written in pure, safe Rust.
- Python bindings.
- Lightweight OpenAI-API-compatible HTTP server.
- Fast performance with per-sequence and catch-up KV cache management.
- Continuous batching.
- First inference platform with first-class X-LoRA support.
- 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit quantization for faster inference and optimized memory usage.
- Apple silicon support with the Metal framework.
Supported models:
- Mistral 7B
  - Normal
  - GGUF
  - X-LoRA
- Gemma
  - Normal
  - X-LoRA
- Llama
  - Normal
  - GGUF
  - GGML
  - X-LoRA
Library API
- Rust multithreaded API for easy integration into any application: docs. To use, add

  ```toml
  mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git" }
  ```

  to your Cargo.toml.
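As a purely illustrative sketch of driving a multithreaded inference API, the snippet below fans prompts out across threads. Note that `run_inference` is a hypothetical stand-in, not the crate's confirmed entry point; consult the linked docs for the real interface.

```rust
use std::thread;

// Hypothetical stand-in for whatever inference entry point the mistralrs
// crate actually exposes; the name and signature are illustrative assumptions.
fn run_inference(prompt: &str) -> String {
    format!("(model response to: {prompt})")
}

fn main() {
    // The crate advertises a multithreaded API, so independent requests can
    // be issued from several threads; this sketch shows only that pattern.
    let prompts = ["What is Rust?", "Explain continuous batching."];
    let handles: Vec<_> = prompts
        .into_iter()
        .map(|p| thread::spawn(move || run_inference(p)))
        .collect();
    for handle in handles {
        println!("{}", handle.join().unwrap());
    }
}
```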
HTTP Server

Mistral.rs provides an OpenAI-API-compatible HTTP server. It is accessible through the command line once mistral.rs has been built.

To build mistral.rs, ensure that Rust is installed by following this link. The Huggingface token should be provided in `~/.cache/huggingface/token`.
- Using a script

  For an easy quickstart, the script below will download and set up Rust and then build mistral.rs to run on the CPU.

  ```bash
  sudo apt update -y
  sudo apt install libssl-dev -y
  sudo apt install pkg-config -y
  curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  source $HOME/.cargo/env
  git clone https://github.com/EricLBuehler/mistral.rs.git
  cd mistral.rs
  mkdir -p ~/.cache/huggingface
  touch ~/.cache/huggingface/token
  echo <HF_TOKEN_HERE> > ~/.cache/huggingface/token
  cargo build --release
  ```
- Manual build

  If Rust is installed and the Huggingface token is set, mistral.rs may be built by executing the build command:

  ```bash
  cargo build --release
  ```

  The build process will output a binary `mistralrs` at `./target/release/mistralrs`.
Rust uses a feature flag system during the build to implement compile-time build options. The following features may be specified with the `--features` flag:

- `cuda`
- `metal`
- `flash-attn`
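For example, to build with CUDA support (assuming a CUDA toolkit is installed on the system):

```bash
cargo build --release --features cuda
```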
The X-LoRA ordering JSON file contains two parts. The first is the order of the adapters, and the second is the layer ordering. The layer ordering is generated automatically and should not be modified, as it controls the application of the scalings. However, the adapter order should be replaced by an array of strings of adapter names corresponding to the order in which the adapters were specified during training.
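For illustration only, an ordering file with the adapter order filled in might look like the sketch below. The key names and adapter names are assumptions, not a confirmed schema, and the layer-ordering portion is elided since it is generated automatically and must not be edited.

```json
{
  "order": ["adapter_1", "adapter_2"],
  "layers": "(generated automatically; do not edit)"
}
```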
To start a server serving Mistral on `localhost:1234`, run:

```bash
./mistralrs --port 1234 --log output.log mistral
```
Mistral.rs uses subcommands to control the model type. Please run `./mistralrs --help` to see the subcommands.
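Once the server is running, it can be queried with any OpenAI-compatible client. Below is a minimal sketch in Rust using `reqwest` and `serde_json` (both assumed as dependencies); the `/v1/chat/completions` route, the `"mistral"` model name, and the request shape follow the standard OpenAI API the server advertises compatibility with, so treat them as assumptions to verify against the server's actual routes.

```rust
use serde_json::json;

// Minimal sketch: POST a chat completion request to the local server.
// Requires reqwest = { version = "0.11", features = ["blocking", "json"] }
// and serde_json as dependencies.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    let response = client
        .post("http://localhost:1234/v1/chat/completions")
        .json(&json!({
            "model": "mistral",
            "messages": [
                { "role": "user", "content": "Tell me about the Rust type system in depth." }
            ],
            "max_tokens": 256
        }))
        .send()?
        .text()?;
    println!("{response}");
    Ok(())
}
```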
To start an X-LoRA server with the default weights, run the following after modifying or copying the ordering file as described here:

```bash
./mistralrs --port 1234 x-lora-mistral -o ordering.json
```
For the prompt "Tell me about the Rust type system in depth." and a maximum length of 256:

| Setup | Speed |
| --- | --- |
| A6000, Mistral + CUDA + Flash Attention | 30.44 tok/s |
| A6000, Mistral GGUF + CUDA | 39.3 tok/s |