turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

License: MIT License

Python 56.28% Shell 2.39% C++ 6.24% Cuda 21.19% JavaScript 8.24% CSS 4.12% HTML 1.15% C 0.08% Dockerfile 0.32%

exllama's Introduction

ExLlama

A standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights, designed to be fast and memory-efficient on modern GPUs.

Disclaimer: The project is coming along, but it's still a work in progress!

Hardware requirements

I am developing on an RTX 4090 and an RTX 3090-Ti. 30-series and later NVIDIA GPUs should be well supported, but anything Pascal or older with poor FP16 support isn't going to perform well. AutoGPTQ or GPTQ-for-LLaMa are better options at the moment for older GPUs. ROCm is also theoretically supported (via HIP) though I currently have no AMD devices to test or optimize on.

Dependencies

  • Python 3.9 or newer
  • torch tested on 2.0.1 and 2.1.0 (nightly) with cu118
  • safetensors 0.3.2
  • sentencepiece
  • ninja

Additionally, only for the web UI:

  • flask
  • waitress

Linux/WSL prerequisites

pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118

Windows prerequisites

To run on Windows (without WSL):

  1. Install MSVC 2022. You can choose to install the whole Visual Studio 2022 IDE, or just the Build Tools for Visual Studio 2022 package; either works, as long as Desktop development with C++ is ticked in the installer.
  2. Install the appropriate version of PyTorch, choosing one of the CUDA versions (see the example command after this list). I am developing on the nightly build, but the stable version (2.0.1) should also work.
  3. Install the CUDA Toolkit (11.7 and 11.8 both seem to work; just make sure to match PyTorch's Compute Platform version).
  4. For best performance, enable Hardware Accelerated GPU Scheduling.
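For step 2, the stable 2.0.1 build for CUDA 11.8 can typically be installed with a command along these lines (adjust the cu suffix to whichever CUDA version you picked):

pip install torch --index-url https://download.pytorch.org/whl/cu118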

How to

Clone repo, install dependencies, and run benchmark:

git clone https://github.com/turboderp/exllama
cd exllama

pip install -r requirements.txt

python test_benchmark_inference.py -d <path_to_model_files> -p -ppl

The CUDA extension is loaded at runtime, so there's no need to install it separately. It will be compiled on the first run and cached to ~/.cache/torch_extensions/, which can take a little while. If nothing happens at first, give it a minute to compile.
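For reference, this is roughly how PyTorch's JIT extension loading works in general; the exact arguments live in ExLlama's cuda_ext.py, and the source list below is only a placeholder:

from torch.utils.cpp_extension import load

# The first call compiles the extension with ninja and caches the result under
# ~/.cache/torch_extensions/; subsequent runs reuse the cached build.
exllama_ext = load(
    name = "exllama_ext",
    sources = ["exllama_ext/cuda_func/q4_matmul.cu"],  # placeholder source list
    verbose = True,                                    # print the ninja build log on the first run
)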

Chatbot example:

python example_chatbot.py -d <path_to_model_files> -un "Jeff" -p prompt_chatbort.txt

Python module

jllllll currently maintains an installable Python module here, which may be more suitable for integrating ExLlama with other projects.

Web UI

I also made a simple web UI for it. Don't look at the JavaScript, it was mostly written by ChatGPT and it will haunt your dreams. But it sort of works, and it's kinda fun, especially multibot mode:

_screenshot.jpg

To run it:

pip install -r requirements-web.txt

python webui/app.py -d <path_to_model_files>

Note that sessions are stored in ~/exllama_sessions/ by default. You can change that location with -sd if you want.
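For example (placeholder paths, flags as described above):

python webui/app.py -d <path_to_model_files> -sd <path_to_session_dir>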

Docker

For security benefits and easier deployment, it is also possible to run the web UI in an isolated docker container. Note: the docker image currently only supports NVIDIA GPUs.

Requirements

It is recommended to run docker in rootless mode.

Build

The easiest way to build the docker image is using docker compose. First, set the MODEL_PATH and SESSIONS_PATH variables in the .env file to the actual directories on the host. Then run:

docker compose build

It is also possible to manually build the image:

docker build -t exllama-web .

NOTE: by default, the service inside the docker container is run by a non-root user. Hence, the ownership of bind-mounted directories (/data/model and /data/exllama_sessions in the default docker-compose.yml file) is changed to this non-root user in the container entrypoint (entrypoint.sh). To disable this, set RUN_UID=0 in the .env file if using docker compose, or use the following command if you build the image manually:

docker build -t exllama-web --build-arg RUN_UID=0 .

Run

Using docker compose:

docker compose up

The web UI can now be accessed on the host at http://localhost:5000.

The configuration can be viewed in docker-compose.yml and changed by creating a docker-compose.override.yml file.

Run manually:

docker run --gpus all -p 5000:5000 -v <path_to_model_dir>:/data/model/ -v <path_to_session_dir>:/data/exllama_sessions --rm -it exllama-web --host 0.0.0.0:5000

Results so far

New implementation

| Model     | Size | grpsz | act | Seq. len. | VRAM      | Prompt     | Best    | Worst   | Ppl  |
|-----------|------|-------|-----|-----------|-----------|------------|---------|---------|------|
| Llama     | 7B   | 128   | no  | 2,048 t   | 5,194 MB  | 13,918 t/s | 173 t/s | 140 t/s | 6.45 |
| Llama     | 13B  | 128   | no  | 2,048 t   | 9,127 MB  | 7,507 t/s  | 102 t/s | 86 t/s  | 5.60 |
| Llama     | 33B  | 128   | no  | 2,048 t   | 20,795 MB | 2,959 t/s  | 47 t/s  | 40 t/s  | 4.60 |
| Llama     | 33B  | 128   | yes | 2,048 t   | 20,795 MB | 2,784 t/s  | 45 t/s  | 37 t/s  | 4.55 |
| Llama     | 33B  | 32    | yes | 1,550 t ¹ | 21,486 MB | 2,636 t/s  | 41 t/s  | 37 t/s  | 4.52 |
| Koala     | 13B  | 128   | yes | 2,048 t   | 9,127 MB  | 5,529 t/s  | 93 t/s  | 79 t/s  | 6.73 |
| WizardLM  | 33B  | -     | yes | 2,048 t   | 20,199 MB | 2,313 t/s  | 47 t/s  | 40 t/s  | 5.75 |
| OpenLlama | 3B   | 128   | yes | 2,048 t   | 3,128 MB  | 16,419 t/s | 226 t/s | 170 t/s | 7.81 |

¹ Cannot achieve full sequence length without OoM

All tests done on stock RTX 4090 / 12900K, running with a desktop environment, with a few other apps also using VRAM.

"Prompt" speed is inference over the sequence length listed minus 128 tokens. "Worst" is the average speed for the last 128 tokens of the full context (worst case) and "Best" lists the speed for the first 128 tokens in an empty sequence (best case.)

VRAM usage is as reported by PyTorch and does not include PyTorch's own overhead (CUDA kernels, internal buffers etc.) This is somewhat unpredictable anyway. Best bet is to just optimize VRAM usage by the model, probably aiming for 20 GB on a 24 GB GPU to ensure there is room for a desktop environment and all of Torch's internals.

Perplexity is measured only to verify that the models are working. The dataset used is a particular, small sample from WikiText, so scores are not comparable to other Llama benchmarks and only useful for comparing the different Llama models to one another.

Dual GPU results

The following benchmarks are from a 4090 + 3090-Ti with -gs 17.2,24:

| Model   | Size | groupsize | act | Seq. len. | VRAM      | Prompt    | Best   | Worst  | Ppl  |
|---------|------|-----------|-----|-----------|-----------|-----------|--------|--------|------|
| Llama   | 65B  | 128       | yes | 2,048 t   | 39,804 MB | 1,109 t/s | 20 t/s | 18 t/s | 4.20 |
| Llama   | 65B  | 32        | yes | 2,048 t   | 43,424 MB | 1,037 t/s | 17 t/s | 16 t/s | 4.11 |
| Llama-2 | 70B  | 128       | yes | 2,048 t   | 40,680 MB | 914 t/s   | 17 t/s | 14 t/s | 4.15 |
| Llama-2 | 70B  | 32        | yes | 2,048 t   | 36,815 MB | 874 t/s   | 15 t/s | 12 t/s | 4.10 |

Note that perplexity scores may not be strictly apples-to-apples between Llama and Llama 2 due to their different pretraining datasets.

Todo

Moved the todo list here.

Compatibility

Here is a list of models confirmed to be working right now.

Recent updates

2023-01-09: Added a rope_theta parameter for (at least partial) CodeLlama support. If you were using alpha = 97 or similar, you should no longer need that for CodeLlama models. There's still stuff to sort out regarding the extended vocabulary.

2023-08-09: Added support for sharded models. config.model_path now accepts either a filename or a list of filenames. model_init() will detect multiple .safetensors files if given a model directory. Note the change in the various examples: model_path = glob.glob(st_pattern)[0] becomes simply model_path = glob.glob(st_pattern). Also there's a little script in util/shard.py to split large .safetensors files. It also produces an index.json file for the sharded model, just for completeness, although ExLlama doesn't need it to read the shards. Note that the safetensors dependency was bumped to version 0.3.2.
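As a minimal sketch of the new pattern (the directory path is a placeholder, and this assumes config.json and the .safetensors shards live in the same directory):

import os, glob
from model import ExLlama, ExLlamaCache, ExLlamaConfig

model_directory = "/path/to/sharded-model"                   # placeholder

config = ExLlamaConfig(os.path.join(model_directory, "config.json"))
st_pattern = os.path.join(model_directory, "*.safetensors")
config.model_path = glob.glob(st_pattern)                    # a list of shard filenames, no [0]

model = ExLlama(config)
cache = ExLlamaCache(model)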

2023-08-12: Preliminary, initial and tentative release of ExLlamaV2. It doesn't do all the things that ExLlamaV1 does, yet, but it's better at what it does do. So check it out!

exllama's People

Contributors

aljungberg, allenbenz, ardfork, eyedeck, flotos, kerushii, lhl, nopperl, osmarks, ph0rk0z, quarticcat, sinanakkoyun, tiendung, turboderp, vldmrb


exllama's Issues

will it work with Nvidia P40 24GB on Linux?

I'm developing an AI assistant for fiction writers. As the OpenAI API gets pretty expensive with all the inference tricks needed, I'm looking for a good local alternative for most of the inference, saving GPT-4 just for polishing the final results.
exllama looks pretty interesting, but I'm getting a compilation error.
Even though, in addition to being a fiction writer, I'm a software developer, I'm far from being an AI expert.
Would it be correct to assume from the lines below that the P40 is not supported currently?
-D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__

Maybe it was a silly try, but self.weight = tensors[key].half() did not work.

If the P40 will not work with exllama, could somebody advise whether oobabooga/GPTQ-for-LLaMa would work?
If not CUDA, maybe there are good options for an i9-13900K with 128 GB of DDR5?

The full Traceback:
python test_benchmark_inference.py -d /home/igorm/ai-assistant/agent-city/llm/models/Wizard-Vicuna-13B-Uncensored-GPTQ -p -ppl
Traceback (most recent call last):
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/test_benchmark_inference.py", line 1, in
from model import ExLlama, ExLlamaCache, ExLlamaConfig
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/model.py", line 5, in
import cuda_ext
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/cuda_ext.py", line 14, in
exllama_ext = load(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1283, in load
return jit_compile(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in jit_compile
write_ninja_file_and_build_library(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1623, in write_ninja_file_and_build_library
run_ninja_build(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'exllama_ext': [1/3] /opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/q4v2_matmul.cu -o q4v2_matmul.cuda.o
FAILED: q4v2_matmul.cuda.o
/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/../cuda_compat.cuh(48): error: cannot overload functions distinguished by return type alone
void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }
^

1 error detected in the compilation of "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/q4v2_matmul.cu".
[2/3] /opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/half_matmul.cu -o half_matmul.cuda.o
FAILED: half_matmul.cuda.o
/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/../cuda_compat.cuh(48): error: cannot overload functions distinguished by return type alone
void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }
^

1 error detected in the compilation of "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/half_matmul.cu".
ninja: build stopped: subcommand failed.

Working with TheBloke/WizardLM-30B-Uncensored-GPTQ

Hi! I got this to work with TheBloke/WizardLM-30B-Uncensored-GPTQ.

Here's what worked:

  1. This doesn't work on Windows, but it does work on WSL
  2. Download the model (and all files) from HF and place it somewhere. Put it somewhere inside the WSL Linux filesystem, not under /mnt/c/somewhere, otherwise model loading will be mega slow regardless of your disk speed
  3. on model.py I added the following:
        # self.groupsize = (self.qweight.shape[0] * 8) // self.qzeros.shape[0]
        self.groupsize = None
        self.config.groupsize = None
        self.config.act_order = True

        # self.config.groupsize = self.groupsize
        # if self.config.groupsize is None:
        #     self.config.groupsize = self.groupsize
        # else:
        #     if self.config.groupsize != self.groupsize:
        #         raise ValueError("Irregular groupsize for matrix: " + key + ", " + str(self.config.groupsize) + ", "+ str(self.groupsize))

Note the commented-out code and the additions.
4. I had to use -mm pytorch_only and -a pytorch_matmul (see the example command below).
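For reference, those options would presumably be passed on the command line like this (placeholder path; flag usage inferred from the benchmark examples elsewhere in this document):

python test_benchmark_inference.py -d <path_to_model_files> -p -ppl -mm pytorch_only -a pytorch_matmul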

Docker and ownership permissions

Hi, I store my models on a local NAS, synology - which does not allow me to change ownership permissions of the files.

I get the following error when starting up the docker container with docker compose:

 ⠿ Network exllama_default  Created                                                                                                                                                                0.0s 
 ⠿ Container exllama-web-1  Created                                                                                                                                                                0.1s
Attaching to exllama-web-1
exllama-web-1  | chown: changing ownership of '/app/model/minotaur-13b-GPTQ-4bit-128g.no-act.order.safetensors': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model/tokenizer.model': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model/tokenizer_config.json': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model/tokenizer.json': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model/special_tokens_map.json': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model/quantize_config.json': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model/generation_config.json': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model/config.json': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model/README.md': Operation not permitted
exllama-web-1  | chown: changing ownership of '/app/model': Operation not permitted
exllama-web-1 exited with code 1

Unsure how to proceed

Using the cache causes random behavior during generation

I'm currently testing the different generation behavior between exllama and autogptq, and I found that using the cache with exllama will generate different results for the same prompt, even when I'm using greedy decoding.

def exllama_greedy_gen_wo_cache(prompt, max_length):
    seq = tokenizer.encode(prompt) # Huggingface tokenizer
    for _ in range(max_length):
        temp_cache = ExLlamaCache(model_exllama)
        logits = model_exllama.forward(torch.tensor([seq], dtype=torch.long), temp_cache)[0][0]
        seq.append(torch.argmax(logits).item())
    return seq

for i in range(10):
    print(tokenizer.decode(exllama_greedy_gen_wo_cache("Hello,", 20)))

For generation without the cache, it's really slow, but I get consistent outputs:
[image]

But when I enable the cache to get much faster generation, I start seeing inconsistencies between generations:

def exllama_greedy_gen_wi_cache(prompt, max_length):
    seq = tokenizer.encode(prompt) # Huggingface tokenizer
    gen_cache = ExLlamaCache(model_exllama)
    model_exllama.forward(torch.tensor([seq[:-1]], dtype=torch.long), gen_cache, preprocess_only = True)
    for _ in range(max_length):
        logits = model_exllama.forward(torch.tensor([seq[-1:]], dtype=torch.long), gen_cache)[0][0]
        seq.append(torch.argmax(logits).item())
    return seq

for i in range(10):
    print(tokenizer.decode(exllama_greedy_gen_wi_cache("Hello,", 20)))

[image]

I wonder whether this is a bug in the cache implementation, or whether I'm using the cache the wrong way.

"fatal error LNK1104: cannot open file 'python310.lib'" + Solution (Windows)

I installed exllama on a secondary drive, then tried to install dependencies both inside a venv and in the root folder. In every case I got a long error when I tried to run the test (python test_benchmark_inference.py -d <path_to_model_files> -p -ppl), including:

fatal error LNK1104: cannot open file 'python310.lib'

The solution was to copy the python310.lib file from Program Files\Python310\libs and paste it into \venv\Scripts\libs. Note I had to make that directory myself.

Multi-GPU

I see from your own testing that you have multi-GPU working.

Following the instructions and running test_benchmark_inference.py or test_chatbot.py, they both worked on one of my RTX 3060s for a 13B model, and my other two 3060s were detected (but not used).

Attempting to load a 33B Llama model across all 3 cards I have led to a CUDA OOM error before the model loaded, as only a single card was used, with none of the other cards showing any VRAM usage.

Any tricks for multi-card setups, or parameters I should be passing?
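For reference, the gpu_split option used elsewhere in this document is passed as -gs with a comma-separated list of per-GPU VRAM allocations; something along these lines (values are purely illustrative):

python test_benchmark_inference.py -d <path_to_model_files> -p -ppl -gs 10,10,10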

Support for StarCoder

Hello there,
upon loading StarCoder and its derivatives like WizardCoder the following error is thrown:

Traceback (most recent call last):
  File "/media/bkutasi/60824A4F824A29BC/Other_projects/exllama/test_benchmark_inference.py", line 114, in <module>
    config = model_init.make_config(args)
  File "/media/bkutasi/60824A4F824A29BC/Other_projects/exllama/model_init.py", line 97, in make_config
    config = ExLlamaConfig(args.config)
  File "/media/bkutasi/60824A4F824A29BC/Other_projects/exllama/model.py", line 40, in __init__
    self.pad_token_id = read_config["pad_token_id"]
KeyError: 'pad_token_id'

Since it's a different model architecture (GPTBigCodeForCausalLM instead of LlamaForCausalLM), the config is pretty different, so pad_token_id, hidden_size and other parameters are missing. Am I loading it wrong, or are these model types not supported?

Gradio error: "Not implemented yet"

I'm getting an error when attempting to use generate_simple inside of a Gradio UI. I can run test_inference.py just fine, however when I put that code into a Gradio UI and attempt to redirect the output to a Chatbot component, I get the below error:

Traceback (most recent call last):
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/gradio/routes.py", line 422, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/gradio/blocks.py", line 1323, in process_api
    result = await self.call_function(
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/gradio/blocks.py", line 1051, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/mmealman/src/exllama/webui/Chatbot.py", line 72, in bot
    bot_message = self.predict(history, user_message)
  File "/home/mmealman/src/exllama/webui/Chatbot.py", line 58, in predict
    return self.textgen.test_generate()
  File "/home/mmealman/src/exllama/TextGenerator.py", line 96, in test_generate
    text = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
  File "/home/mmealman/src/exllama/generator.py", line 176, in generate_simple
    self.gen_begin(ids)
  File "/home/mmealman/src/exllama/generator.py", line 103, in gen_begin
    self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True)
  File "/home/mmealman/src/exllama/model.py", line 1153, in forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device])
  File "/home/mmealman/src/exllama/model.py", line 540, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer)
  File "/home/mmealman/src/exllama/model.py", line 447, in forward
    query_states = self.q_proj.forward(hidden_states)
  File "/home/mmealman/src/exllama/model.py", line 314, in forward
    out = cuda_ext.ExAutogradMatmul4bitCuda.apply(x, self.qweight, self.scales, self.qzeros, self.groupsize, self.bits, self.maxq)
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/mmealman/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 106, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/home/mmealman/src/exllama/cuda_ext.py", line 271, in forward
    raise ValueError("Not implemented yet")
ValueError: Not implemented yet

Below is the generation code I'm calling in the Chatbot:

    def test_generate(self):
        tokenizer_model_path = "/home/mmealman/src/models/vicuna-13B-1.1-GPTQ-4bit-128g/tokenizer.model"
        model_config_path = "/home/mmealman/src/models/vicuna-13B-1.1-GPTQ-4bit-128g/config.json"
        model_path = "/home/mmealman/src/models/vicuna-13B-1.1-GPTQ-4bit-128g/vicuna-13B-1.1-GPTQ-4bit-128g.safetensors"
        config = ExLlamaConfig(model_config_path)
        config.model_path = model_path
        config.max_seq_len = 2048
        model = ExLlama(config)
        cache = ExLlamaCache(model)

        tokenizer = ExLlamaTokenizer(tokenizer_model_path)
        generator = ExLlamaGenerator(model, tokenizer, cache)
        generator.settings.token_repetition_penalty_max = 1.2
        generator.settings.token_repetition_penalty_sustain = 20
        generator.settings.token_repetition_penalty_decay = 50

        prompt = \
        "On 19 February 1952, Headlam became senior air staff officer (SASO) at Eastern Area Command in Penrith, New South " \
        "Wales. During his term as SASO, the RAAF began re-equipping with English Electric Canberra jet bombers and CAC " \
        "Sabre jet fighters. The Air Force also underwent a major organisational change, as it transitioned from a " \
        "geographically based command-and-control system to one based on function, resulting in the establishment of Home " \
        "(operational), Training, and Maintenance Commands. Eastern Area Command, considered a de facto operational " \
        "headquarters owing to the preponderance of combat units under its control, was reorganised as Home Command in " \
        "October 1953. Headlam was appointed an Officer of the Order of the British Empire (OBE) in the 1954 New Year " \
        "Honours for his \"exceptional ability and devotion to duty\". He was promoted to acting air commodore in May. His " \
        "appointment as aide-de-camp to Queen Elizabeth II was announced on 7 October 1954."

        gen_tokens = 200
        text = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
        return text

ExLlama generation in all other standalone Python scripts works fine. The Gradio UI code has also worked fine in several other projects.

Streaming API

Foremost, this is a terrific project.
I've been trying to integrate it with other apps, but the API is a little bit different compared to other implementations like KoboldAI and its API, or textgen-webui and its API examples.
I could get it to work (while the web app is running) with the following script, written with my limited knowledge, albeit it's not the best:

import requests
import json
import sys

url = 'http://0.0.0.0:5005/api/userinput'
data = {'user_input': 'What time is it? Write a very looong essay about time.'}
headers = {'Content-type': 'application/json'}

# send the POST request and stream the response
response = requests.post(url, data=json.dumps(data), headers=headers, stream=True)

# extract the text values from the JSON response
text_values = (json.loads(line).get('text') for line in response.iter_lines())
for text_value in text_values:
    print(text_value, end="")
    sys.stdout.flush() # flush the output buffer

What do you think about the possibility of making a streaming api endpoint on /api/stream that is not connected with the backend user handling and message saving, and is "stateless" so it follows the REST principles? Since it's one of the most performant backends this would surely boost its popularity.
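As a rough, hypothetical sketch of what a stateless streaming endpoint could look like using the Flask stack the web UI already depends on (the generate_stream() helper and route name below are assumptions, not existing ExLlama code):

from flask import Flask, Response, request
import json

app = Flask(__name__)

def generate_stream(user_input):
    # Assumed placeholder: yield text chunks as the model produces them.
    for chunk in ["This ", "is ", "a ", "streamed ", "reply."]:
        yield chunk

@app.route("/api/stream", methods = ["POST"])
def api_stream():
    user_input = request.get_json()["user_input"]
    def events():
        # Emit one JSON object per chunk, newline-delimited, so clients can
        # consume it with response.iter_lines() as in the script above.
        for chunk in generate_stream(user_input):
            yield json.dumps({"text": chunk}) + "\n"
    return Response(events(), mimetype = "application/json")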

65B working on multi-gpu

This is not an issue, just reporting that it works great with Guanaco-65B-GPTQ-4bit.act-order.safetensors from TheBloke using 2x 3090. Speed is great, about 15 t/s.

Can't compile on Windows

Hi there, really amazing work that you're doing here.

I'm trying to run either the benchmark or the webui to test (I have 2x4090), but it seems it can't find the compiler or something similar?

The complete error is:

python .\webui\app.py
F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py:358: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
  warnings.warn(f'Error checking compiler version for {compiler}: {error}')
INFO: Could not find files for the given pattern(s)
Traceback (most recent call last):
  File "F:\ChatIAs\exllama\webui\app.py", line 9, in <module>
    import model_init
  File "F:\ChatIAs\exllama\model_init.py", line 1, in <module>
    from model import ExLlama, ExLlamaCache, ExLlamaConfig
  File "F:\ChatIAs\exllama\model.py", line 5, in <module>
    import cuda_ext
  File "F:\ChatIAs\exllama\cuda_ext.py", line 14, in <module>
    exllama_ext = load(
  File "F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py", line 1283, in load
    return _jit_compile(
  File "F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py", line 1508, in _jit_compile
    _write_ninja_file_and_build_library(
  File "F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py", line 1610, in _write_ninja_file_and_build_library
    _write_ninja_file_to_build_library(
  File "F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py", line 2057, in _write_ninja_file_to_build_library
    _write_ninja_file(
  File "F:\ChatIAs\exllama\venv\lib\site-packages\torch\utils\cpp_extension.py", line 2200, in _write_ninja_file
    cl_paths = subprocess.check_output(['where',
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['where', 'cl']' returned non-zero exit status 1.

I have CUDA 11.8 and CUDA 12.1 on my system. I do specify the CUDA path when building GPTQ, for example (with $env:CUDA_PATH="CUDA_DIR"), but here I'm not sure if it uses that or something self-built. Also, when specifying the CUDA version, it doesn't work either.

Maybe I'm missing something here?

Python 3.10.10
Windows 11 Pro
RTX 4090 x2
AMD Ryzen 7 7800X3D
VS2019

C:\Program Files (x86)\Microsoft Visual Studio\2019\Community>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

how to get correct model type?

My original model is a .bin file, so I use the code below to convert the model format:

model = AutoModelForCausalLM.from_pretrained("path/to/llama_7b", torch_dtype=torch.float16,device_map='auto' )
model.save_pretrained('path/to/exllama',safe_serialization=True, max_shard_size="200GB")

but when I run test_benchmark_inference.py, I get an error:
[image]
the model doesn't have qzeros/scales

How can I deal with this? Could you help me?

Multimodal support

Great work with this loader. I'm seeing 5x i/s improvements in Ooba and was hopeful that it would help serve up some gains when using Ooba's multimodal extension (confirmed working in my current setup: Windows 10, RTX 2080 Ti, 11 GB VRAM, 96 GB RAM, CUDA 11.8, Torch 2.0.1, with Llava or miniGPT pipelines at either 7B or 13B).

When attempting to use exllama as the loader with any of the 4 multimodal setups, regular text chat or instruct works well and much faster, but as soon as I attempt to use the multimodal extension to include a photo, I get the error below.
Maybe you can point me in the right direction to try and resolve this?

  File "D:\00\text-generation-webui\modules\text_generation.py", line 300, in generate_reply_custom
    for reply in shared.model.generate_with_streaming(question, state):
  File "D:\00\text-generation-webui\modules\exllama.py", line 68, in generate_with_streaming
    self.generator.gen_begin_reuse(ids)
  File "D:\00\text-generation-webui\repositories\exllama\generator.py", line 191, in gen_begin_reuse
    if reuse < in_tokens.shape[-1]: self.gen_feed_tokens(in_tokens[:, reuse:])
  File "D:\00\text-generation-webui\repositories\exllama\generator.py", line 209, in gen_feed_tokens
    self.model.forward(self.sequence[:, start:-1], self.cache, preprocess_only = True, lora = self.lora)
  File "D:\00\text-generation-webui\repositories\exllama\model.py", line 841, in forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "D:\00\text-generation-webui\repositories\exllama\model.py", line 459, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "D:\00\text-generation-webui\repositories\exllama\model.py", line 381, in forward
    new_keys = cache.key_states[self.index].narrow(2, past_len, q_len)
RuntimeError: start (49) + length (13970) exceeds dimension size (2048).
Output generated in 1.51 seconds (0.00 tokens/s, 0 tokens, context 14020, seed 979644525)

API for batched input?

Thanks for this great project. The inference speed is exceptional. However, it seems the generator API only supports single-string input. When serving concurrent requests, batching of inputs will be needed for better throughput.

Very poor output quality

I have noticed that while it massively increases the inference speed, it massively decreases the quality of the outputs: instruct models become very obstinate and give completely irrelevant responses, words become misspelled, it repeats lines over and over, and it also sometimes spams Chinese characters.

TransformerEngine FP8 support

Hello! Could this work utilize the new H100 TransformerEngine for a speedup? If yes, I would be very interested in that and would also pay for your H100 cloud GPU access if you could estimate how long you would need it.

Thank you very much!

Splitting model on multiple GPUs produces RuntimeError

When attempting to split the model on multiple GPUs, I get the following error:

> python test_chatbot.py -d /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/ -gs 16,22 -p prompt_assistant.txt -un "John" -bn "Assistant" -temp 1.00 -topp 0.95 -beams 5 -beamlen 20 -mm quant_only
 -- Loading model
 -- Tokenizer: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/tokenizer.model
 -- Model config: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/config.json
 -- Model: /home/john/Projects/Python/text-models/text-generation-webui/models/TheBloke_guanaco-65B-GPTQ/Guanaco-65B-GPTQ-4bit.act-order.safetensors
 -- Sequence length: 2048
 -- Temperature: 1.00
 -- Top-K: 20
 -- Top-P: 0.95
 -- Min-P: 0.00
 -- Repetition penalty: 1.15
 -- Beams: 5 x 20
 -- Options: ['attention: pytorch_scaled_dp', 'matmul: quant_only', 'gpu_split: 16,22']
 -- Groupsize (inferred): None
 -- Act-order (inferred): no
This is a conversation between a person called John and an AI chatbot called Assistant. Assistant is a helpful and articulate virtual assistant possessing deep knowledge on a multitude of subjects. Assistant will never refuse to answer a question or comply with a request.
Assistant: Hello, John. I am Assistant, your virtual assistant. How may I help you?
John: Testing
Assistant:Traceback (most recent call last):
  File "/home/john/Projects/exllama/test_chatbot.py", line 213, in <module>
    gen_token = generator.beam_search()
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/exllama/generator.py", line 385, in beam_search
    tokens, probs = self.sample(logits,
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/john/Projects/exllama/generator.py", line 94, in sample
    sampled_ind = torch.multinomial(norm_probs, norm_probs.shape[-1] if num == -1 else min(num, norm_probs.shape[-1]))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

This only happens if the model is split between GPUs using the -gs option.

Cuda 12.1 - Fails to Build Here

exllama/exllama_ext/q4v2_matmul.cu(116): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (half *, half)
      atomicAdd(out_.item_ptr(x_row, w_column), result);

Problem with generation leading space.

I'm currently implementing HF-style decoding for exllama, but I find that the model sometimes does not generate the expected leading space. It happens kind of rarely, but still from time to time. Since I can only trigger it when sampling, I currently cannot give a prompt that reproduces it with greedy decoding.
I checked the oobabooga/text-generation-webui implementation and found that it's fixed in a strange way:
[image]
Since my implementation follows the HF interface, I have no access to the generation index "i" and thereby cannot check whether a forward call is for the first token or not.
So I'm wondering what the potential cause of this is, and whether there is any other way to fix it.
Update: Seems like it's not exllama's problem but has something to do with the strange "add leading space" behavior of the HF tokenizer observed earlier.

Using QLoRA?

I have been using QLoRA to finetune my model on my 3090, which previously could only perform inference, not finetuning.

With the incredible improvements achieved with exllama, is it possible to combine QLoRA and exllama so that the finetuning requirements are similar to the requirements for inference?

Get error when compiling.

Hello! I am trying to run exllama on wizard-vicuna-13b-uncensored-gptq, and when I try to run any of the commands I get the following error. I am running it using the NVIDIA PyTorch image nvcr.io/nvidia/pytorch:23.05-py3. I am using the newest version of CUDA, 12.1.1, and running it on a Google VM with an L4 on Ubuntu 18.04 LTS. I know the documentation says it's not compatible with all GPUs; is it compatible with the L4? Any help would be very much appreciated. Thank you!!

error.txt

Possible to add a pip package?

Heya, I'm writing a langchain binding for exllama. I'd love to be able to pip install exllama and access the libraries in Python natively; right now I'm not really sure how I'd ship the langchain module without creating my own binding library in pip, which seems very awkward.

Support for llama models with >2048 context?

Hi there! As always, thanks for the amazing project.

I was trying to load Minotaur-15B (8192 max context), which is the result of quantising to 4-bit using GPTQ-for-LLaMa.
https://huggingface.co/TheBloke/minotaur-15B-GPTQ

At first, I was trying with ooba text webui, and got:

2023-06-19 15:07:54 INFO:Loading TheBloke_minotaur-15B-GPTQ...
2023-06-19 15:07:56 ERROR:Failed to load the model.
Traceback (most recent call last):
  File "F:\ChatIAs\oobabooga\text-generation-webui\server.py", line 62, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
  File "F:\ChatIAs\oobabooga\text-generation-webui\modules\models.py", line 65, in load_model
    output = load_func_map[loader](model_name)
  File "F:\ChatIAs\oobabooga\text-generation-webui\modules\models.py", line 277, in ExLlama_loader
    model, tokenizer = ExllamaModel.from_pretrained(model_name)
  File "F:\ChatIAs\oobabooga\text-generation-webui\modules\exllama.py", line 35, in from_pretrained
    config = ExLlamaConfig(str(model_config_path))
  File "F:\ChatIAs\oobabooga\text-generation-webui\repositories\exllama\model.py", line 39, in __init__
    self.pad_token_id = read_config["pad_token_id"]
KeyError: 'pad_token_id'

Then, tried with exllama directly, and got;

(venv) PS F:\ChatIAs\exllama> python webui/app.py -d "F:\ChatIAs\oobabooga\text-generation-webui\models\TheBloke_minotaur-15B-GPTQ"
 -- Tokenizer: F:\ChatIAs\oobabooga\text-generation-webui\models\TheBloke_minotaur-15B-GPTQ\tokenizer.model
 -- Model config: F:\ChatIAs\oobabooga\text-generation-webui\models\TheBloke_minotaur-15B-GPTQ\config.json
 -- Model: F:\ChatIAs\oobabooga\text-generation-webui\models\TheBloke_minotaur-15B-GPTQ\4bit-128g.safetensors
 -- Sequence length: 2048
 -- Tuning:
 -- --matmul_recons_thd: 8
 -- --fused_mlp_thd: 2
 -- --sdp_thd: 8
 -- Options: []
Traceback (most recent call last):
  File "F:\ChatIAs\exllama\webui\app.py", line 133, in <module>
    config = model_init.make_config(args)
  File "F:\ChatIAs\exllama\model_init.py", line 97, in make_config
    config = ExLlamaConfig(args.config)
  File "F:\ChatIAs\exllama\model.py", line 39, in __init__
    self.pad_token_id = read_config["pad_token_id"]
KeyError: 'pad_token_id'

Is there any setting that I'm missing? Or are these models not compatible yet? Thanks!

Landmark Attention support

Any thoughts on how difficult it would be to support inference on a model trained with landmark attention? Like Minotaur, Wizard or the base Llama landmark finetunes released recently, and I suppose more will come out, now that multiple repos support lora/qlora/gptq-lora training with landmark attention.

I haven’t compared results yet, but it sounds like landmark attention should be more effective with long contexts compared to the turboderp/alpaca_lora_4bit repo. Like the author, I found that that repo did “something”, and stopped generating gibberish beyond 2048 at least, but I’m not sure what the model learned. The landmark attention paper claims it can solve needle-haystack problems beyond the context length, which I couldn’t get the previous method to do.

Landmark apparently works with Oobabooga with remote code enabled.

WebUI Multi-bot

Anything I need to do to make this work? Simply adding names doesn't change anything. I've also tried creating prompt text files to match the names added, but no change.

Other than that it's working pretty well for talking to one bot.

Performance degradation

I did a test on the latest commit (77545c) and bec6c9 on an H100 with a 30B model, and I can see a stable performance degradation.

Latest (77545c): 25 t/s
bec6c9: 34 t/s

thoughts?

Feature Request: length_penalty support

We are trying to port the transformers-based generation code to exllama but did not find a configurable length_penalty control. Will this be on the roadmap? Thanks.
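For context, a hedged illustration of what length_penalty typically means in transformers-style beam search (a generic sketch, not an existing ExLlama feature): beam scores are length-normalized by dividing the summed token log-probabilities by length ** length_penalty.

def length_normalized_score(sum_logprobs, length, length_penalty = 1.0):
    # Values > 1.0 favor longer sequences, values < 1.0 favor shorter ones.
    return sum_logprobs / (length ** length_penalty)

print(length_normalized_score(-12.0, 10, 1.0))   # -1.2
print(length_normalized_score(-12.0, 10, 2.0))   # -0.12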

Is it possible to do tuning with exllama?

I have tested this on 4x 2080 Ti. It works well, almost 10 t/s for Llama 65B while only 0.65 t/s for bnb 4-bit, really amazing. First of all, please accept my thanks and adoration.

If you need to do tests with a 20-series GPU, maybe I can help.

And can this be used for tuning along with LoRA?

Typo in model.py

AttributeError: 'ExLlamaConfig' object has no attribute 'sdp_thd'. Did you mean: 'stp_thd'?

I think that in line 78:

self.stp_thd = 8

should be

self.sdp_thd = 8

Thanks for all your hard work, great project!

RuntimeError: CUDA error: an illegal memory access was encountered

RuntimeError                              Traceback (most recent call last)
Cell In[3], line 4
      2 config.model_path = model_path
      3 config.max_seq_len = 2048
----> 4 model = ExLlama(config)
      5 cache = ExLlamaCache(model)
      6 tokenizer = ExLlamaTokenizer(tokenizer_model_path)

File /workspace/exllama/model.py:759, in ExLlama.__init__(self, config)
    756     device = self.config.device_map.layers[i]
    757     sin, cos = self.sincos[device]
--> 759     layer = ExLlamaDecoderLayer(self.config, tensors, f"model.layers.{i}", i, sin, cos)
    761     modules.append(layer)
    763 self.layers = modules

File /workspace/exllama/model.py:345, in ExLlamaDecoderLayer.__init__(self, config, tensors, key, index, sin, cos)
    342 self.config = config
    343 self.index = index
--> 345 self.self_attn = ExLlamaAttention(self.config, tensors, key + ".self_attn", sin, cos, self.index)
    346 self.mlp = ExLlamaMLP(self.config, tensors, key + ".mlp")
    348 self.input_layernorm = ExLlamaRMSNorm(self.config, tensors, key + ".input_layernorm.weight")

File /workspace/exllama/model.py:260, in ExLlamaAttention.__init__(self, config, tensors, key, sin, cos, index)
    258 self.k_proj = Ex4bitLinear(config, self.config.hidden_size, self.config.num_attention_heads * self.config.head_dim, False, tensors, key + ".k_proj")
    259 self.v_proj = Ex4bitLinear(config, self.config.hidden_size, self.config.num_attention_heads * self.config.head_dim, False, tensors, key + ".v_proj")
--> 260 self.o_proj = Ex4bitLinear(config, self.config.num_attention_heads * self.config.head_dim, self.config.hidden_size, False, tensors, key + ".o_proj")

File /workspace/exllama/model.py:137, in Ex4bitLinear.__init__(self, config, in_features, out_features, has_bias, tensors, key)
    135 self.qzeros = tensors[key + ".qzeros"]
    136 self.scales = tensors[key + ".scales"]
--> 137 self.g_idx = tensors[key + ".g_idx"].cpu() if key + ".g_idx" in tensors else None
    138 self.bias = tensors[key + ".bias"] if has_bias else None
    140 self.device = self.qweight.device

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

running on runpod, 2x3090 with 2.1.0.dev20230607+cu118

Pure C++ core instead of Python

I'm curious, is it possible to extract just the C++ / CUDA core from the project to integrate into external systems? Basically, to have something like llama.cpp without Python at all, or with Python used only for the initial step of compiling / building exllama.

Batch generation support

Great repo.

Are there any plans to add support for batched generation?

Any idea how much work this might be to achieve? I can potentially work on this if you can point me in the right direction.

Kernel wouldn't compile in my conda env

Until I added a link from lib to lib64, it was unable to find the CUDA libs and the compile would fail. The test kernel stuff is also out of date, as the paths are wrong.

Error running. ArgTypes. Ninja: Build stopped: subcommand failed

When I try to run: python3 example_chatbot.py -d /home/xxxxx/models/based-7B-GPTQ -un "Jeff" -p prompt_chatbort.txt

The following error appears:

Traceback (most recent call last):
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/utils /cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/xxxxxx/exllama/example_chatbot.py", line 1, in
from model import ExLlama, ExLlamaCache, ExLlamaConfig
File "/home/xxxxxx/exllama/model.py", line 5, in
import cuda_ext
File "/home/xxxxxx/exllama/cuda_ext.py", line 42, in
exllama_ext = load(
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/utils /cpp_extension.py", line 1284, in load
return jit_compile(
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/utils /cpp_extension.py", line 1509, in jit_compile
write_ninja_file_and_build_library(
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/utils /cpp_extension.py", line 1624, in write_ninja_file_and_build_library
run_ninja_build(
File "/home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/utils /cpp_extension.py", line 1909, in run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'exllama_ext': [1/5] /usr/bin/nvcc -DTOR CH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILE R_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11_BUILD_ABI="cxxabi1 011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home/xxxxxx/miniconda3/envs/gpt q/lib/python3.9/site-packages/torch/include -isystem /home/xxxxxx/miniconda3/envs /gptq/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/TH -i system /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/includ e/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/include/python3.9 -D_GLIBCXX_USE CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_CONVERSIONS -D__CUD A_NO_BFLOAT16_CONVERSIONS
-D__CUDA_NO_HALF2_OPERATORS
--expt-relaxed-constex pr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /home/xxxxxx/exllama/exllama_e xt/cuda_func/half_matmul.cu -o half_matmul.cuda.o
FAILED: half_matmul.cuda.o
/usr/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION
H -DPYBIND11_COMPILER_TYPE="gcc" -DPYBIND11_STDLIB="libstdcpp" -DPYBIND11 BUILD_ABI="cxxabi1011" -I/home/xxxxxx/exllama/exllama_ext -isystem /home/xxxxxx /miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include -isystem /home/l ucas/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/torch/csrc/a pi/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages /torch/include/TH -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-p ackages/torch/include/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/include/pyth on3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS -D__CUDA_NO_HALF_C ONVERSIONS
_ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -- expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=co mpute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /home/luc as/exllama/exllama_ext/cuda_func/half_matmul.cu -o half_matmul.cuda.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expa nded with ‘...’:
435 | function(Functor&& f)
| ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expa nded with ‘...’:
530 | operator=(Functor&& f)
| ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘ArgTypes’
[2/5] /usr/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/xxxxxx/exllama/exllama_ext -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/TH -isystem /home/xxxxxx/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/THC -isystem /home/xxxxxx/miniconda3/envs/gptq/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -lineinfo -std=c++17 -c /home/xxxxxx/exllama/exllama_ext/cuda_func/q4_mlp.cu -o q4_mlp.cuda.o
FAILED: q4_mlp.cuda.o
(same nvcc command, re-echoed by ninja)
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 |         function(_Functor&& __f)
      |                                ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 |         operator=(_Functor&& __f)
      |                                 ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
[3/5] /usr/bin/nvcc (same flags as above) -c /home/xxxxxx/exllama/exllama_ext/cuda_func/q4_attn.cu -o q4_attn.cuda.o
FAILED: q4_attn.cuda.o
(same nvcc command and the same std_function.h errors as above)
[4/5] /usr/bin/nvcc (same flags as above) -c /home/xxxxxx/exllama/exllama_ext/cuda_func/q4_matmul.cu -o q4_matmul.cuda.o
FAILED: q4_matmul.cuda.o
(same nvcc command and the same std_function.h errors as above)
ninja: build stopped: subcommand failed.
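
For what it's worth, this "parameter packs not expanded" failure in /usr/include/c++/11/bits/std_function.h is a known incompatibility between older nvcc releases and GCC 11's libstdc++ headers rather than anything specific to exllama. A commonly reported workaround (an assumption, not project guidance) is to build the extension with GCC 10 as the host compiler, or to upgrade to a CUDA toolkit that supports GCC 11:

    sudo apt install gcc-10 g++-10
    export CC=gcc-10 CXX=g++-10        # the PyTorch extension builder generally respects CC/CXX;
                                       # making g++-10 the default via update-alternatives also works
    rm -rf ~/.cache/torch_extensions   # force the cached exllama_ext build to be redone
    python test_benchmark_inference.py -d <path_to_model_files> -p -ppl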

Question - possible to run starcoder with exllama?

Recently a 15.5B-parameter model called StarCoder was released: https://huggingface.co/bigcode/starcoder

You should be able to run it with text-generation-webui using a fork of GPTQ-for-LLaMa called GPTQ-for-SantaCoder: https://github.com/mayank31398/GPTQ-for-SantaCoder

Since, as far as I can tell, both projects use the same underlying library (transformers) and the same quantization method (GPTQ), shouldn't it be possible to run this model with exllama?

Any ideas on how I would go about doing this? @turboderp @disarmyouwitha

ExLlama API spec / discussion

Opening a new thread to continue the conversation about the API; I think a dedicated thread for this discussion will be valuable as the project continues to scale.

Continuation from: #12

Crashing with act order and no act order since latest changes.


`python test_benchmark_inference.py -t /home/nap/llm_models/koala-13B-HF-4bit/tokenizer.model -c /home/nap/llm_models/koala-13B-HF-4bit/config.json -m /home/nap/llm_models/koala-13B-HF-4bit/koala13B-4bit-128g-no-act-order.safetensors -g 128 -p -ppl

Using /home/nap/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/nap/.cache/torch_extensions/py310_cu118/exllama_ext/build.ninja...
Building extension module exllama_ext...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module exllama_ext...
-- Loading model
-- Tokenizer: /home/nap/llm_models/koala-13B-HF-4bit/tokenizer.model
-- Model config: /home/nap/llm_models/koala-13B-HF-4bit/config.json
-- Model: /home/nap/llm_models/koala-13B-HF-4bit/koala13B-4bit-128g-no-act-order.safetensors
-- Groupsize: 128
-- Sequence length: 2048
-- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'perf', 'ppl']
** Time, Load model: 1.50 seconds
** VRAM, Model: [cuda:0] 6,689.96 MB
-- Inference, first pass.
Traceback (most recent call last):
  File "/home/nap/Documents/exllama-api/test_benchmark_inference.py", line 153, in <module>
    logits = timer("Inference", lambda: wrapper.next_logits(ids))
  File "/home/nap/Documents/exllama-api/test_benchmark_inference.py", line 64, in timer
    ret = func()
  File "/home/nap/Documents/exllama-api/test_benchmark_inference.py", line 153, in <lambda>
    logits = timer("Inference", lambda: wrapper.next_logits(ids))
  File "/home/nap/Documents/exllama-api/test_benchmark_inference.py", line 54, in next_logits
    return self.model.forward(input_ids, self.cache, last_id_only)
  File "/home/nap/Documents/exllama-api/model.py", line 523, in forward
    hidden_states = decoder_layer(hidden_states, cache, attn_masks[device])
  File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nap/Documents/exllama-api/model.py", line 351, in forward
    hidden_states = self.self_attn(hidden_states, cache, attention_mask)
  File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nap/Documents/exllama-api/model.py", line 264, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.config.num_attention_heads, self.config.head_dim).transpose(1, 2)
  File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/nap/miniconda3/envs/exllama/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/nap/Documents/exllama-api/model.py", line 148, in forward
    out = quant_util.matmul4bit(x,
  File "/home/nap/Documents/exllama-api/quant_util.py", line 70, in matmul4bit
    if switch: output = _q4v2_recons(x, qweight, scales, zeros, groupsize, g_idx)
  File "/home/nap/Documents/exllama-api/quant_util.py", line 51, in _q4v2_recons
    q4v2_recons(qweight, buffer, scales, zeros, groupsize, g_idx if g_idx is not None else none_tensor)
TypeError: q4v2_recons(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: int) -> None

Invoked with: tensor([[-1398026309, 1248440250, 1968657271, ..., 1648788836,
1771582072, 1432982596],
[-1129530164, -1402287736, 1970562646, ..., 2016756323,
900172105, -2007726747],
[ -876888900, -1735723655, 1717986149, ..., -1236974524,
1117231658, -1988663128],
...,
[ 2125380013, 729121940, -1516013256, ..., -1448441238,
1395411286, -910718291],
[ -609454181, -1721358701, 2071349639, ..., -1380296262,
842437924, -646359431],
[ 1518767014, -1668986954, -1201825385, ..., 1920967637,
1770408276, -932611670]], device='cuda:0', dtype=torch.int32), tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0', dtype=torch.float16), tensor([[0.0146179199, 0.0079727173, 0.0102233887, ..., 0.0184173584,
0.0107421875, 0.0127944946],
[0.0070610046, 0.0054702759, 0.0059432983, ..., 0.0166778564,
0.0122299194, 0.0089492798],
[0.0113754272, 0.0084228516, 0.0120010376, ..., 0.0273590088,
0.0140762329, 0.0107803345],
...,
[0.0120925903, 0.0055465698, 0.0123825073, ..., 0.0216217041,
0.0121994019, 0.0125427246],
[0.0135955811, 0.0066719055, 0.0160827637, ..., 0.0184020996,
0.0138168335, 0.0129928589],
[0.0071449280, 0.0064125061, 0.0047798157, ..., 0.0156250000,
0.0118713379, 0.0109329224]], device='cuda:0', dtype=torch.float16), tensor([[ 2023392650, 1010177125, -1289406317, ..., 1279628968,
-2103822205, 1447265365],
[ 2002999416, 1783731782, 1698252904, ..., 1971681173,
-2055768200, 1720019317],
[-2023401052, 678589785, -1808521094, ..., 1430677867,
-2089273206, 1750898759],
...,
[ 1703449704, 1770349877, -1807272091, ..., -2041219240,
1732671894, 1721131894],
[ 1732459378, 1197652024, 1950771288, ..., -1837668746,
1719236473, -2024245130],
[-2005375383, 1970881927, 1753777765, ..., 1971808872,
2003334805, 1970759287]], device='cuda:0', dtype=torch.int32), 128, tensor([ 0, 0, 0, ..., 39, 39, 39], device='cuda:0', dtype=torch.int32)`

Reverting to the previous commit fixed the issue for me.
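
A hedged guess based on the "ninja: no work to do" line above: the Python code passes six arguments to q4v2_recons while the cached extension still exposes a five-argument signature, so a stale build in the torch extensions cache could also produce this mismatch. Forcing a clean rebuild is cheap to try before (or after) reverting:

    # Clear the cached build (path taken from the log above), then re-run the
    # same benchmark command.
    rm -rf ~/.cache/torch_extensions/py310_cu118/exllama_ext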

OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

Pop!_OS 20.04
Python 3.8.10
AMD 6800 XT GPU

Installed with:

pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118

pip install safetensors sentencepiece ninja

git clone https://github.com/turboderp/exllama
cd exllama

When running python3 test_benchmark_inference.py -d /home/user1/models/ -p -ppl or python example_chatbot.py -d /home/user1/models/ -un "Jeff" -p prompt_chatbot.txt, I get the following errors:

Traceback (most recent call last):
  File "test_benchmark_inference.py", line 1, in <module>
    from model import ExLlama, ExLlamaCache, ExLlamaConfig
  File "/home/user1/bin/exllama/model.py", line 5, in <module>
    import cuda_ext
  File "/home/user1/bin/exllama/cuda_ext.py", line 42, in <module>
    exllama_ext = load(
  File "/home/user1/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1286, in load
    return _jit_compile(
  File "/home/user1/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1511, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/user1/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1603, in _write_ninja_file_and_build_library
    extra_ldflags = _prepare_ldflags(
  File "/home/user1/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1702, in _prepare_ldflags
    if (not os.path.exists(_join_cuda_home(extra_lib_dir)) and
  File "/home/user1/.local/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2238, in _join_cuda_home
    raise EnvironmentError('CUDA_HOME environment variable is not set. '
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

What am I doing wrong?
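
Two hedged observations, not project guidance: the traceback means the extension builder could not locate a CUDA toolkit, and this machine has no NVIDIA GPU at all, so the cu118 wheel cannot work regardless. The two usual remedies look roughly like this (the ROCm version segment below is an assumption and changes over time):

    # NVIDIA systems: point the extension build at the CUDA Toolkit install root.
    export CUDA_HOME=/usr/local/cuda
    # AMD systems such as the 6800 XT: install a ROCm build of PyTorch instead of cu118.
    pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm5.5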

Lora support

Congrats and thank you again for a project that changes everything. I can't use anything else now, and I even prefer your web UI to the standard text-generation-webui...

In some cases it would be super useful to be able to load separate LoRAs on top of a GPTQ model loaded with exllama.
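
For context, a minimal sketch of what such LoRA support would have to compute at inference time, assuming the standard LoRA formulation y = W0·x + (B·A)·x·scaling; this is illustrative only and not ExLlama's API:

    import torch

    def lora_linear(base_forward, lora_a: torch.Tensor, lora_b: torch.Tensor,
                    scaling: float, x: torch.Tensor) -> torch.Tensor:
        # base_forward: the existing (frozen, quantized) projection, e.g. q_proj
        # lora_a: [rank, in_features], lora_b: [out_features, rank] from the adapter file
        return base_forward(x) + (x @ lora_a.T @ lora_b.T) * scaling

The low-rank update is applied on top of the quantized base weight, so the adapter tensors can stay in FP16 without touching the GPTQ-packed weights.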

Error when trying to run Wizard-Vicuna-13B-Uncensored-GPTQ


[email protected]:/exllama$ python test_benchmark_inference.py -d Wizard-Vicuna-13B-Uncensored-GPTQ -p -ppl
 -- Loading model
 -- Tokenizer: Wizard-Vicuna-13B-Uncensored-GPTQ/tokenizer.model
 -- Model config: Wizard-Vicuna-13B-Uncensored-GPTQ/config.json
 -- Model: Wizard-Vicuna-13B-Uncensored-GPTQ/Wizard-Vicuna-13B-Uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors
 -- Sequence length: 2048
 -- Options: ['attention: switched', 'matmul: switched', 'mlp: switched', 'perf', 'perplexity']
Traceback (most recent call last):
  File "/exllama/test_benchmark_inference.py", line 171, in <module>
    wrapper = timer("Load model", lambda: ModelWrapper(args))
  File "/exllama/test_benchmark_inference.py", line 73, in timer
    ret = func()
  File "/exllama/test_benchmark_inference.py", line 171, in <lambda>
    wrapper = timer("Load model", lambda: ModelWrapper(args))
  File "/exllama/test_benchmark_inference.py", line 51, in __init__
    self.model = ExLlama(config)
  File "/exllama/model.py", line 883, in __init__
    with safe_open(self.config.model_path, framework="pt", device="cpu") as f:
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge
Exception ignored in: <function ExLlama.__del__ at 0x7fd43bfe1fc0>
Traceback (most recent call last):
  File "/exllama/model.py", line 1066, in __del__
    if torch_device is not None: cuda_ext.free_cuda_buffers(torch_device)
  File "/exllama/cuda_ext.py", line 57, in free_cuda_buffers
    free_buffers(device)
TypeError: free_buffers(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.device, arg1: int, arg2: int, arg3: int, arg4: int) -> None

Invoked with: device(type='cuda', index=0)
[email protected]:/exllama$ ^C

Has anyone seen an error like this before?
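
A hedged sanity check, not part of exllama: "HeaderTooLarge" usually means the file is not actually a safetensors file, for example an un-downloaded git-lfs pointer or an HTML error page. A real .safetensors file starts with a little-endian uint64 giving the JSON header length, which is easy to inspect:

    import json, os, struct

    path = "Wizard-Vicuna-13B-Uncensored-GPTQ/Wizard-Vicuna-13B-Uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors"
    print("file size:", os.path.getsize(path))          # a 4-bit 13B model should be several GB
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]  # implausibly huge => not a safetensors file
        print("declared header length:", header_len)
        header = json.loads(f.read(header_len))
        print("tensors in file:", len(header))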

RTX 3060 12GB Benchmarking

model: llama-13B-4bit-128g

exllama:

(exllama) user@debian:~/AI/exllama$ python test_benchmark_inference.py -d ~/AI/2oobabooga/text-generation-webui/models/llama-13b-4bit-128g/ -p
 -- Loading model
 -- Tokenizer: /home/user/AI/2oobabooga/text-generation-webui/models/llama-13b-4bit-128g/tokenizer.model
 -- Model config: /home/user/AI/2oobabooga/text-generation-webui/models/llama-13b-4bit-128g/config.json
 -- Model: /home/user/AI/2oobabooga/text-generation-webui/models/llama-13b-4bit-128g/llama-13b-4bit-128g.safetensors
 -- Sequence length: 2048
 -- Options: ['attention: pytorch_scaled_dp', 'matmul: switched', 'perf']
 ** Time, Load model: 1.57 seconds
 -- Groupsize (inferred): 128
 -- Act-order (inferred): no
 ** VRAM, Model: [cuda:0] 6,683.17 MB
 -- Inference, first pass.
 ** Time, Inference: 2.08 seconds
 ** Speed: 923.57 tokens/second
 -- Generating 128 tokens...
 ** Speed: 22.04 tokens/second
 ** VRAM, Inference: [cuda:0] 2,291.67 MB
 ** VRAM, Total: [cuda:0] 8,974.84 MB

ooba's webui:
streaming on:

(textgen) user@debian:~/AI/2oobabooga/text-generation-webui$ python3.10 server.py --wbits 4 --model llama-13b-4bit-128g --groups 128 --model_type llama
INFO:Gradio HTTP request redirected to localhost :)
bin /home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
INFO:Loading llama-13b-4bit-128g...
INFO:Found the following quantized model: models/llama-13b-4bit-128g/llama-13b-4bit-128g.safetensors
INFO:Loaded the model in 2.65 seconds.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Output generated in 9.12 seconds (21.81 tokens/s, 199 tokens, context 4, seed 989197438)
Output generated in 8.57 seconds (23.22 tokens/s, 199 tokens, context 4, seed 26472177)

no stream:

(textgen) user@debian:~/AI/2oobabooga/text-generation-webui$ python3.10 server.py --wbits 4 --model llama-13b-4bit-128g --groups 128 --model_type llama --no-stream
INFO:Gradio HTTP request redirected to localhost :)
bin /home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so
INFO:Loading llama-13b-4bit-128g...
INFO:Found the following quantized model: models/llama-13b-4bit-128g/llama-13b-4bit-128g.safetensors
INFO:Loaded the model in 2.48 seconds.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Output generated in 5.17 seconds (24.74 tokens/s, 128 tokens, context 4, seed 250438644)
Output generated in 4.57 seconds (28.02 tokens/s, 128 tokens, context 4, seed 1203371762)
Output generated in 4.80 seconds (26.65 tokens/s, 128 tokens, context 4, seed 484445001)

Perplexity Data Format/Testing Data Question

I was trying to do an apples-to-apples shootout of GPTQ vs. the new llama.cpp k-quants (memory usage, speed, etc.) but ran into a bump with perplexity. It looks like exllama loads a jsonl-formatted version of wikitext-2's wiki.valid.raw (not the wiki.test.raw that is typically used for testing)?

Just wondering if there's already a preformatted jsonl of the rest of wikitext-2. Is the format literally just chunking every line into a "text" object?
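
If the expected format really is one {"text": ...} object per line (an assumption based on the question above, not confirmed from the source), converting wikitext-2's wiki.test.raw would look roughly like this:

    import json

    with open("wiki.test.raw", encoding="utf-8") as fin, \
         open("wiki.test.jsonl", "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if line:                                   # skip blank lines
                fout.write(json.dumps({"text": line}) + "\n")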

Are you able to help?

I've been trying to set up exllama with my web server. It creates an instance of LlamaModelRepo and calls loadModel. If I use the following code, I can get text output fine:

class LlamaModelRepo:
    tokenizer: ExLlamaTokenizer = None
    generator: ExLlamaGenerator = None
    config: ExLlamaConfig = None
    model: ExLlama = None
    cache: ExLlamaCache = None
    def __init__(self):
        self.models: list = []
        self.modelsDir: str = './models'

    def loadModel(self, llamaModel: LlamaModel):
        errors = []
        configPath = llamaModel.path + "/config.json"
        if (not exists(configPath)):
            errors.append(f"{configPath} does not exist")
        
        modelPath = llamaModel.path + "/" + llamaModel.modelFile
        if (not exists(modelPath)):
            errors.append(f"{modelPath} does not exist")
            
        tokenizerModelPath = llamaModel.path + "/tokenizer.model"
        if (not exists(tokenizerModelPath)):
            errors.append(f"{tokenizerModelPath} does not exist")

        if errors:
            raise Exception("\n".join(errors))
        
        torch.set_grad_enabled(False)
        torch.cuda._lazy_init()
        self.config = ExLlamaConfig(configPath)
        self.config.model_path = modelPath
        self.config.max_seq_len = 2048
        self.model = ExLlama(self.config)
        self.cache = ExLlamaCache(self.model)

        self.tokenizer = ExLlamaTokenizer(tokenizerModelPath)
        self.generator = ExLlamaGenerator(self.model, self.tokenizer, self.cache)
        self.generator.settings.token_repetition_penalty_max = 1.2
        self.generator.settings.token_repetition_penalty_sustain = 20
        self.generator.settings.token_repetition_penalty_decay = 50
        gen_tokens = 200
        text = self.generator.generate_simple("test", max_new_tokens = 200)
        print(text)

Printed output

test Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Create a list of 5 adjectives to describe a car.

### Response:
1. Sleek
2. Powerful
3. Luxurious
4. Reliable
5. Sporty

If I move the bottom lines into a separate function, so that I call loadModel first and then call chat through a separate request to the server, I get an exception. I'm very confused about why this is happening and wondering if I'm doing something wrong.

class LlamaModelRepo:
    tokenizer: ExLlamaTokenizer = None
    generator: ExLlamaGenerator = None
    config: ExLlamaConfig = None
    model: ExLlama = None
    cache: ExLlamaCache = None
    def __init__(self):
        self.models: list = []
        self.modelsDir: str = './models'

    def loadModel(self, llamaModel: LlamaModel):
        errors = []
        configPath = llamaModel.path + "/config.json"
        if (not exists(configPath)):
            errors.append(f"{configPath} does not exist")
        
        modelPath = llamaModel.path + "/" + llamaModel.modelFile
        if (not exists(modelPath)):
            errors.append(f"{modelPath} does not exist")
            
        tokenizerModelPath = llamaModel.path + "/tokenizer.model"
        if (not exists(tokenizerModelPath)):
            errors.append(f"{tokenizerModelPath} does not exist")

        if errors:
            raise Exception("\n".join(errors))
        
        torch.set_grad_enabled(False)
        torch.cuda._lazy_init()
        self.config = ExLlamaConfig(configPath)
        self.config.model_path = modelPath
        self.config.max_seq_len = 2048
        self.model = ExLlama(self.config)
        self.cache = ExLlamaCache(self.model)

        self.tokenizer = ExLlamaTokenizer(tokenizerModelPath)
        self.generator = ExLlamaGenerator(self.model, self.tokenizer, self.cache)
        self.generator.settings.token_repetition_penalty_max = 1.2
        self.generator.settings.token_repetition_penalty_sustain = 20
        self.generator.settings.token_repetition_penalty_decay = 50
        gen_tokens = 200

    def chat(self, text: str, params:dict = {}):
        text = self.generator.generate_simple("test", max_new_tokens = 200)
        print(text)

exception

Traceback (most recent call last):
  File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 2213, in __call__
    return self.wsgi_app(environ, start_response)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 2193, in wsgi_app
    response = self.handle_exception(e)
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kannalo/.local/lib/python3.11/site-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/kanna/Documents/llm/exllama/server.py", line 110, in chat
    return modelRepo.chat(text="test")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/kanna/Documents/llm/exllama/repos/model_repo.py", line 97, in chat
    text = self.generator.generate_simple(text, max_new_tokens = 200)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/kanna/Documents/llm/exllama/generator.py", line 176, in generate_simple
    self.gen_begin(ids)
  File "/mnt/kanna/Documents/llm/exllama/generator.py", line 103, in gen_begin
    self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True)
  File "/mnt/kanna/Documents/llm/exllama/model.py", line 1153, in forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device])
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/kanna/Documents/llm/exllama/model.py", line 540, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/kanna/Documents/llm/exllama/model.py", line 447, in forward
    query_states = self.q_proj.forward(hidden_states)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/kanna/Documents/llm/exllama/model.py", line 314, in forward
    out = cuda_ext.ExAutogradMatmul4bitCuda.apply(x, self.qweight, self.scales, self.qzeros, self.groupsize, self.bits, self.maxq)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kannalo/.local/lib/python3.11/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kannalo/.local/lib/python3.11/site-packages/torch/cuda/amp/autocast_mode.py", line 106, in decorate_fwd
    return fwd(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/mnt/kanna/Documents/llm/exllama/cuda_ext.py", line 271, in forward
    raise ValueError("Not implemented yet")
ValueError: Not implemented yet
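
One hedged guess, not a confirmed diagnosis: PyTorch's grad mode is thread-local, so the torch.set_grad_enabled(False) call inside loadModel() only affects the thread that loaded the model, not the Flask worker thread that later serves chat(). Disabling gradients inside the request handler itself may avoid the autograd code path that raises "Not implemented yet". A drop-in sketch for the chat method above, assuming torch is already imported in that module:

    def chat(self, text: str, params: dict = {}):
        # Ensure no-grad mode on the thread actually handling this request;
        # grad mode set in loadModel() does not carry over to other threads.
        with torch.no_grad():
            out = self.generator.generate_simple(text, max_new_tokens=200)
        print(out)
        return out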
