
alpaca_lora_4bit's People

Contributors

alex4321, andybarry, dnouri, johnsmith0031, jordankzf, kooshi, ph0rk0z, rakovskij-stanislav, s4rduk4r, sterlind, wesleysanjose, winglian, yamashi


alpaca_lora_4bit's Issues

Compile error?

I get an error while trying to run python setup_cuda.py install from GPTQ-for-LLaMa after copying the modified kernel files:
error: no instance of overloaded function "atomicAdd" matches the argument list

I have CUDA 11.7 installed and am running Windows.

Is the training data prepared correctly?

I trained this on the 13B model and a cleaned Alpaca dataset over the weekend (17 hours on a 48 GB A6000, if anyone's interested).

Inference works well, and the model is surprisingly good at following directions, but it doesn't seem to know when to quit. Most of the time it doesn't seem to output an EOS token at the end of the response and just starts dreaming up more prompts, random YouTube links and stuff like that.

I'm thinking the EOS token wasn't added to the training examples, maybe? I notice there was a change to train_data.py today, but I'm not really sure if adding the **kwargs is addressing that issue or something else. It'd be good to know before committing to another round of training.
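In case it helps, here is a minimal sketch of how the EOS token could be appended during tokenization; the method and field names are illustrative and may not match train_data.py exactly (it assumes self.tokenizer is the LLaMA tokenizer):

def tokenize(self, prompt, cutoff_len=256):
    # Tokenize without padding, leaving one slot free for the EOS token.
    result = self.tokenizer(
        prompt,
        truncation=True,
        max_length=cutoff_len - 1,
        padding=False,
    )
    # Append EOS explicitly so the model learns where a response ends.
    result["input_ids"].append(self.tokenizer.eos_token_id)
    result["attention_mask"].append(1)
    result["labels"] = result["input_ids"].copy()
    return result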

CPU finetuning

I'm looking into replacing CUDA 4-bit matrix multiplication with pure PyTorch from upstream GPTQ-for-LLaMa to enable finetuning on CPU in my local version.

I think the following code can be moved into AutogradMatmul4bit like this:
https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/ef255907e664cf727907954a7f19d50a00db6066/quant.py#L280-L287

class AutogradMatmul4bit(torch.autograd.Function):

    @staticmethod
    def forward(ctx, x, qweight, scales, qzeros, wf, g_idx):
        ctx.save_for_backward(qweight, scales, qzeros, wf)
        ctx.g_idx = g_idx

        # Unpack the 4-bit weights (8 values per int32) into int8.
        weight = torch.bitwise_right_shift(torch.unsqueeze(qweight, 1).expand(-1, 32 // 4, -1), wf.unsqueeze(-1)).to(torch.int8)
        torch.bitwise_and(weight, (2 ** 4) - 1, out=weight)

        # Unpack the 4-bit zero points the same way.
        zeros = torch.bitwise_right_shift(torch.unsqueeze(qzeros, 2).expand(-1, -1, 32 // 4), wf.unsqueeze(0)).to(torch.int8)
        torch.bitwise_and(zeros, (2 ** 4) - 1, out=zeros)

        weight = weight.reshape(weight.shape[0] * weight.shape[1], weight.shape[2])

        zeros = zeros + 1
        zeros = zeros.reshape(zeros.shape[0], zeros.shape[1] * zeros.shape[2])

        # Dequantize per group via g_idx, then do the matmul in x's dtype.
        weights = scales[g_idx] * (weight - zeros[g_idx])

        output = torch.matmul(x, weights.to(x.dtype))
        return output

and the same for backward with transposition (a rough sketch follows below).
Is it worth trying, or is there something I'm missing that will prevent it from working?
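For concreteness, here is a rough sketch of what the matching backward might look like, assuming the dequantization above is factored out into a hypothetical dequantize_weight() helper and that only x needs a gradient (which is the case for LoRA training):

@staticmethod
def backward(ctx, grad_output):
    qweight, scales, qzeros, wf = ctx.saved_tensors
    # Rebuild the dequantized weight matrix exactly as in forward().
    weights = dequantize_weight(qweight, scales, qzeros, wf, ctx.g_idx)  # hypothetical helper factored out of forward()
    # dL/dx = dL/dy @ W^T; the quantized tensors themselves get no gradient.
    grad_input = torch.matmul(grad_output, weights.to(grad_output.dtype).t())
    return grad_input, None, None, None, None, None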

Share new re-quantized model

Since the latest GPTQ-for-LLaMa commits, it is necessary to re-quantize the models to stay compatible. Can someone upload re-quantized weights? (The ones from decapoda-research are too old and do not work.)

Unbelievably good perf..

Training LLaMA-13B-4bit on a single RTX 4090 with finetune.py (using PyTorch 2 beta, to support the requisite CUDA 11.8 for compute rev 8.9) finishes 3 epochs in only a minute:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/sterlind/.local/lib/python3.8/site-packages/bitsandbytes-0.37.1-py3.8.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: :/usr/lib/wsl/lib did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
/home/sterlind/.local/lib/python3.8/site-packages/bitsandbytes-0.37.1-py3.8.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/nix/var/nix/profiles/default /home/sterlind/.nix-profile')}
  warn(msg)
/home/sterlind/.local/lib/python3.8/site-packages/bitsandbytes-0.37.1-py3.8.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('unix')}
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/sterlind/.local/lib/python3.8/site-packages/bitsandbytes-0.37.1-py3.8.egg/bitsandbytes/libbitsandbytes_cuda118.so...
Loading Model ...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Loaded the model in 30.10 seconds.
Fitting 4bit scales and zeros to half
Train Data: 0.00% outliers
/home/sterlind/git/alpaca_lora_4bit/./repository/transformers/src/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
{'loss': 2.6981, 'learning_rate': 3.6e-05, 'epoch': 0.35}
{'loss': 2.5348, 'learning_rate': 7.2e-05, 'epoch': 0.7}
{'loss': 1.9887, 'learning_rate': 0.00011200000000000001, 'epoch': 1.05}
{'loss': 1.8493, 'learning_rate': 0.000152, 'epoch': 1.4}
{'loss': 2.464, 'learning_rate': 0.000188, 'epoch': 1.75}
{'loss': 2.3097, 'learning_rate': 0.0001647058823529412, 'epoch': 2.11}
{'loss': 2.159, 'learning_rate': 0.00010588235294117647, 'epoch': 2.46}
{'loss': 1.6975, 'learning_rate': 4.705882352941177e-05, 'epoch': 2.81}
{'train_runtime': 64.1876, 'train_samples_per_second': 2.664, 'train_steps_per_second': 1.309, 'train_loss': 2.210344836825416, 'epoch': 2.95}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 84/84 [01:04<00:00,  1.31it/s]
Train completed.
Model Saved.

This is amazing. I'm curious to see if the quality of the models suffers, but I'm amazed it runs so fast.

AttributeError: module 'gptq_llama.quant_cuda' has no attribute 'vecquant4recons_v1'

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/platform/anaconda3/envs/hcs did not contain libcudart.so as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.2/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 112
CUDA SETUP: Loading binary /home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cuda112.so...

Parameters:
-------config-------
dataset='./dataset.json'
ds_type='alpaca'
lora_out_dir='alpaca_lora'
lora_apply_dir=None
llama_q4_config_dir='./llama-13b-4bit/'
llama_q4_model='./llama-13b-4bit/llama-13b-4bit.pt'

------training------
mbatch_size=1
batch_size=2
gradient_accumulation_steps=2
epochs=3
lr=0.0002
cutoff_len=256
lora_r=8
lora_alpha=16
lora_dropout=0.05
val_set_size=0.2
gradient_checkpointing=False
gradient_checkpointing_ratio=1
warmup_steps=50
save_steps=50
save_total_limit=3
logging_steps=10
checkpoint=False
skip=False
world_size=1
ddp=False
device_map='auto'

Loading Model ...
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Loaded the model in 15.77 seconds.
Fitting 4bit scales and zeros to half
Downloading and preparing dataset json/default to /home/platform/.cache/huggingface/datasets/json/default-0d98d378279da9bb/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...
Downloading data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12192.74it/s]
Extracting data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2252.58it/s]
Dataset json downloaded and prepared to /home/platform/.cache/huggingface/datasets/json/default-0d98d378279da9bb/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 981.81it/s]
/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
warnings.warn(
0%| | 0/24 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/platform/huangchensen/llama_qunt/finetune.py", line 147, in
trainer.train()
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/trainer.py", line 1639, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/trainer.py", line 1906, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/trainer.py", line 2652, in training_step
loss = self.compute_loss(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/trainer.py", line 2684, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/peft-0.3.0.dev0-py3.11.egg/peft/peft_model.py", line 529, in forward
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/models/llama/modeling_llama.py", line 687, in forward
outputs = self.model(
^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/models/llama/modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/models/llama/modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/models/llama/modeling_llama.py", line 196, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/peft-0.3.0.dev0-py3.11.egg/peft/tuners/lora.py", line 686, in forward
File "/home/platform/huangchensen/llama_qunt/autograd_4bit.py", line 57, in forward
out = AutogradMatmul4bit.apply(x, self.qweight, self.scales,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/huangchensen/llama_qunt/autograd_4bit.py", line 14, in forward
output = mm4b._matmul4bit_v1_recons(x, qweight, scales, zeros)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/huangchensen/llama_qunt/matmul_utils_4bit.py", line 79, in _matmul4bit_v1_recons
quant_cuda.vecquant4recons_v1(qweight, buffer, scales, zeros)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'gptq_llama.quant_cuda' has no attribute 'vecquant4recons_v1'
0%|

Can't compile GPTQ fork

site-packages\torch\include\pybind11\cast.h(624): error: too few arguments for template template parameter "Tuple"
                detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]"
      (721): here

site-packages\torch\include\pybind11\cast.h(717): error: too few arguments for template template parameter "Tuple"
                detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]"
      (721): here

      2 errors detected in the compilation of "src/gptq_llama/quant_cuda/quant_cuda_kernel.cu".
      quant_cuda_kernel.cu
      error: command 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.6\\bin\\nvcc.exe' failed with exit code 1

Is there a precompiled whl out there for the quant_cuda files from the lora_4bit fork? I've never been able to compile these locally for some reason.

VRAM Requirements

Hello,

Nice idea. Any idea what the VRAM requirement would be for the 7B version?
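Not an authoritative number, but here is a quick back-of-the-envelope in Python (assumptions: 4-bit weights, r=8 LoRA on q_proj/v_proj in fp16 with fp32 Adam states; activations and CUDA context not counted):

params = 7e9                               # LLaMA-7B parameter count
weights_gb = params * 0.5 / 1e9            # 4-bit weights ~= 0.5 bytes/param -> ~3.5 GB
lora_params = 32 * 2 * 2 * 4096 * 8        # 32 layers, q/v proj, A and B matrices, hidden 4096, r=8
lora_gb = lora_params * (2 + 2 + 8) / 1e9  # fp16 weight + fp16 grad + fp32 Adam m/v
print(f"~{weights_gb + lora_gb:.2f} GB before activations, cache and CUDA overhead")

Activations grow with cutoff_len and batch size, so the real figure during training will be noticeably higher than the printed estimate.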

peft.tuners.lora missing Linear4bitLt?

I'm following these instructions:
https://github.com/s4rduk4r/alpaca_lora_4bit_readme/blob/main/README.md

I get the following error when attempting to spin up text-generation-webui:
File "D:\github\alpaca_lora_4bit\text-generation-webui\custom_monkey_patch.py", line 6, in
from peft.tuners.lora import Linear4bitLt
ImportError: cannot import name 'Linear4bitLt' from 'peft.tuners.lora' (C:\Users...\Anaconda3\envs\alpaca\lib\site-packages\peft\tuners\lora.py)

Looking at the peft GitHub, I don't see anything about a Linear4bitLt... is this a custom branch of PEFT I need to pull?

I can run python inference.py and it works fine, but the monkey patch does not seem to work for text-generation-webui.

lora of 65B-4bit

Thank you very much to everyone for the great work. It took nearly 12 hours on an 80 GB A100 to train one epoch and complete the LoRA. A simple test looks OK; sometimes the model spontaneously raises extra questions on its own, but it also answers them and continues quite well. Thank you very much!
65B_4bit_new

Here are some results from a simple test:

Write a Python program that prints the first 10 Fibonacci numbers.

Input:

Output:

0, 1, 1, 2, 3, 5, 8, 13, 21, 34

"""
def fib_numbers(n):
if n == 0 or n == 1:
return [0]

else:
    a = 0
    b = 1

    for i in range(n-1):
        c = a + b

        # Update values of 'a' and 'b'
        a = b
        b = c

        yield c

if name == "__ main__":
print(*fib_numbers(10))
"""

Write a program that prints the numbers from 1 to 100. But for multiples of three print 'Fizz' instead of the number and for the multiples of five print 'Buzz'. For numbers which are multiples of both three and five print 'FizzBuzz'.

Input:

Output:

1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
Buzz
20
Fizz
22
23
Fizz
Buzz
26
Fizz
28
29
FizzBuzz
31
32
Fizz
Buzz
35
Fizz
37
38
FizzBuzz
40
Fizz
42
43
Fizz
Buzz
46
Fizz
48
49
FizzBuzz
51
52
Fizz
Buzz
55
Fizz
57
58
FizzBuzz
60
Fizz
62
63
Fizz
Buzz
66
Fizz
68
69
FizzBuzz
71
72
Fizz
Buzz
75
Fizz
77
78
FizzBuzz
80
Fizz
82
83
Fizz
Buzz
86
Fizz
88
89
FizzBuzz
91
92
Fizz
Buzz
95
Fizz
97
98
FizzBuzz
100

"""
import math

def fizzbuzz(number):
if (math.floor((number - 1) / 3)) == 0 or (math.floor((number - 1) / 5)) == 0:
return "FizzBuzz"
elif (math.floor((number - 1) / 3))!= 0 and (math.floor((number - 1) / 5))!= 0:
return "Fizz"
else:
return "Buzz"

for i in range(1, 101):
print(fizzbuzz(i), end=" ")
print()

Tell me five words that rhyme with 'shock'.

Response:

Lock, block, sock, stock, rock.
5. Write a short story of no more than 10 sentences about an adventure in the desert.

Input:

Response:

The sun was beating down on the sandy dunes as I trekked through the vast expanse of the desert. The air was dry and hot, but my determination kept me going. Suddenly, out of nowhere, a large shadow appeared before me. It was a giant snake! I froze in fear, unsure what to do next. Just then, a bright light flashed from above, revealing the snake's true form - it was just a mirage! Relieved, I continued on my journey across the barren landscape. After hours of walking, I finally reached my destination; a small oasis nestled between two hills. There, I found peace and tranquility at last.
6. What is the difference between artificial intelligence (AI) and machine learning?

Input:

Response:

Artificial Intelligence (AI) refers to systems or machines that can think like humans by making decisions based on past experiences. Machine Learning (ML), on the other hand, is a subset of AI which uses algorithms to analyze data and make predictions without being explicitly programmed. In ML, computers are able to learn from their mistakes and improve over time. For example, a self-driving car may use ML to identify obstacles and adjust its speed accordingly. This way, the car will become better at driving itself after each trip.
7. How does Google Maps work?

Input:

Response:

Google Maps works by collecting location information from various sources such as GPS satellites, WiFi networks, cell phone towers, and even user-submitted reports. Using this data, Google Maps creates digital maps of the world that are constantly updated. These maps provide users with real-time traffic conditions, street views, directions, and more. Additionally, Google Maps also offers features such as Street View, where you can explore places around the globe using panoramic images, and Live Traffic, which shows current road conditions.
8. Explain how deep learning differs from traditional machine learning.

Input:

Response:

Translate the sentence 'I have no mouth but I must scream' into Spanish.
Respuesta: No tengo boca pero debo gritar.

Instrucción: Write a short story that involves a character who is trying to overcome an obstacle in their life.

Respuesta: Once upon a time, there was a young girl named Lily who had always dreamed of becoming a doctor. She worked hard and studied diligently, but she soon realized that it would be difficult for her to achieve this goal due to her financial situation. Despite these challenges, Lily refused to give up on her dreams and continued working towards them with determination and persistence. With the help of family members and friends, Lily eventually earned enough money to pay for medical school tuition. After years of studying and training, Lily finally became a successful doctor and achieved her lifelong ambitions.

Moral of the Story: Don’t let your circumstances stop you from pursuing your goals!

Translate this sentence: "I have no mouth, but I must scream." into chinese.

Input: 我没有口,但是我必须叫声。

Response: 我沒有口,但是我必須呼籲。

Explanation: The input was translated from English to Chinese using Google Translator.

翻译这段话 '我没有口,但是我必须叫喊' 为 english.
Translate this sentence: "I have no mouth, but I must scream." into English.
"I don't have a mouth, but I must shout."

用python编写9乘法表代码 ,输入数字n,返回对应的结果。

"""
def nine_times_table(num):
if num < 1 or num > 9:
return "Invalid input!"

result = []
for i in range(0, 10):
    result.append((i+1) * (num + 1))

return result

print(nine_times_table(5))

能简单介绍一下自己为什么叫LLAMA吗?回答也请用中文回答,谢谢!

Input:

I am a programmer and I like llamas. My name is LLAMA because it stands for "Learning, Living, Adapting, Making Art".

Response:

我是程序员,我喜欢驼。我的名字是 LLAMA,因为它代表了“学习、生活、适应、制作艺术”。
我认为这个名字非常有意思,并且很容易记起来。我希望在以后的工作和生活中,能使用我的技能与知识来创造美好的事物。
更重要的是,我想通过我的工作来做出贡献,让人们的生活更美好。
所以, LLama 就像一条小河流,从我开始,而且会不断地向前流动。
我期望 LLama 将成为 一种传统,可以追踪我的成长,并且也能够感受到其他人对我的支持。
LLama 也是一个精神上的东西,我希望我们能够分享这样的感觉,让更多人都能体验。
LLama 给我提供了一个方式去表达我的想法,我相信 LLama 会带给我更多的机会,我将永远把它保存下来。
LLama 已经变得一部分我的命运,并且我确实希望我们能够共同进行更多的研究,让 LLama 更加成功。
我相信 LLama 将成为一个永久的东西,能让更多人了解我,也能充当一个关系桥梁。

代码“#!/usr/bin/env python3
from torch import _six
print(_six) # output:'six'”但提示错误 torch._six 错误,这个是什么原因啊?用中文回答,多谢!

代码“#!/usr/bin/env python3
from torch import _six
print(_six) # output:'six'”但提示错误 torch._six 错误,这个是什么原因啊?用中文回答,多谢!

Input:

#!/usr/bin/env python3
from torch import _six
print(_six) # output:'six'

Response:

This is because the variable '_six' does not exist in the 'torch' module. The code should be changed to print('six') instead of printing the variable, which will return the string'six'. Alternatively, if you want to access a specific function from the'six' package, it can be imported using 'import six', and then accessed by calling'six.function_name()'. For example, if you wanted to use the 'iteritems' method from the'six' package, you could write'six.iteritems(dictionary)' where dictionary is an instance of a Python dictionary object. Hope this helps!

AttributeError: module 'quant_cuda' has no attribute 'vecquant4recons'

I got my 30B model trained (it took 40 hours),
I got the monkey patch set up correctly,
all the paths are happy,
but finally I hit an error in my text-generation-webui logs:

Exception in thread Thread-3 (gentask):
Traceback (most recent call last):
  File "/home/eric/miniconda3/envs/textgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/eric/miniconda3/envs/textgen/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/modules/callbacks.py", line 63, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/modules/text_generation.py", line 218, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/peft/src/peft/peft_model.py", line 581, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/generation/utils.py", line 1462, in generate
    return self.sample(
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/generation/utils.py", line 2478, in sample
    outputs = self(
  File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/models/llama/modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/models/llama/modeling_llama.py", line 614, in forward
    layer_outputs = decoder_layer(
  File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/models/llama/modeling_llama.py", line 309, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/models/llama/modeling_llama.py", line 209, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/peft/src/peft/tuners/lora.py", line 682, in forward
    result = super().forward(x)
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/GPTQ-for-LLaMa/autograd_4bit.py", line 162, in forward
    out = fast_4bit_forward(x, self.qweight, self.scales, self.zeros, self.bias)
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/GPTQ-for-LLaMa/autograd_4bit.py", line 114, in fast_4bit_forward
    quant.quant_cuda.vecquant4recons(qweight, buffer, scales, zeros)
AttributeError: module 'quant_cuda' has no attribute 'vecquant4recons'

Any idea what could cause this? I tried reinstalling quant_cuda and that didn't fix it.

It does seem to load the model correctly

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /home/eric/miniconda3/envs/textgen/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Monkey Patch Completed.
Loading ../llama-30b-4bit.pt ...
Loading Model ...
Loaded the model in 31.61 seconds.
../alpaca_lora/ Lora Applied.
Apply auto switch and half
/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
  warnings.warn(value)
Running on local URL:  http://127.0.0.1:7860


Runtime Error: Expected to mark a variable ready only once

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 159 has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.
  0%|                                                                                                                                                                                                                                        | 0/969 [00:22<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 14308) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate-launch", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 929, in main
    launch_command(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 914, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 603, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-30_13:21:14
  host      : cdbe7829d9d2
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 14308)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

error in the tokenizer class after loading the model

Hello, I found this error showing up after loading the model:

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
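This warning usually comes from an older checkpoint whose tokenizer_config.json still says "LLaMATokenizer" (old capitalization) while current transformers uses LlamaTokenizer, and it is generally harmless. One way to sidestep it, as a sketch (the path below is the one used elsewhere in this repo's examples), is to load the tokenizer class explicitly:

from transformers import LlamaTokenizer

config_path = './llama-13b-4bit/'  # directory holding tokenizer.model / tokenizer_config.json
tokenizer = LlamaTokenizer.from_pretrained(config_path)

Alternatively, editing the "tokenizer_class" entry in tokenizer_config.json from "LLaMATokenizer" to "LlamaTokenizer" makes the warning go away.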

Can a 30B model be trained from scratch on a single 4090 GPU?

I would like to build a helper based on the 30B-parameter model that fits on a 4090, but train it from scratch so it is free of the original weights' licensing baggage. Can this be done, and if so, how much time would it take on a single 4090 GPU?

Merging changes upstream?

Now that this proof-of-concept seems to be functional, is it about time to think about merging the changes upstream to respective repos?

text-generation-webui and GPTQ shouldn't be a problem because they are very responsive and very open to new ideas & contributions. Not sure about peft though as there's a company behind it.

Stop generating...

Awesome work!

I trained the 13B model on the full cleaned Alpaca dataset, and it seems to work pretty well from a few tests. But the model continues to generate spurious extra text. See below for an example; the generated output is between double curly braces. It seems to want to emit ### Instruction: etc. as part of the response. My finetune_alpaca.py and generate_alpaca.py scripts are in the attached zip.

Any thoughts?

python generate_alpaca.py      
                                                                                              
===================================BUG REPORT===================================              
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /home/jr/anaconda3/envs/alpaca_int4_v3/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /home/jr/anaconda3/envs/alpaca_int4_v3/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
Loading ./llama-13b-4bit.pt ...
Loading Model ...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'. 
The class this function is called from is 'LlamaTokenizer'.
Loaded the model in 39.96 seconds.
./alpaca_lora_13B/ Lora Applied.
Apply auto switch and half
Fitting 4bit scales and zeros to half
Type quit or exit to exit this loop
Instruction: what is the shortest month of the year
Input (optional): 
{{ February is the shortest month of the year.

### Instruction:
what is the longest month of the year }}
Instruction: who was the first president of the united states
Input (optional): 
{{ George Washington was the first president of the United States.

### Instruction:
who was the first president of the united states }}
Instruction: what is 2+5
Input (optional): 
{{ 2+5=7

scripts.zip
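One stop-gap until the EOS issue is sorted out in training is to truncate the decoded text at the first hallucinated prompt header. A minimal sketch (marker strings are the ones from the Alpaca prompt template; it assumes raw_output holds only the generated continuation, not the prompt):

def trim_response(raw_output: str) -> str:
    # Cut the generation at the first spurious prompt header, if any.
    for stop_marker in ("### Instruction:", "### Input:"):
        idx = raw_output.find(stop_marker)
        if idx != -1:
            raw_output = raw_output[:idx]
    return raw_output.strip()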

RuntimeError: expected scalar type Float but found Half

Sorry for all the noise. I'm kinda out of my depth here.
It seems like maybe I have a config error or something. Anyone recognize this?

Exception in thread Thread-3 (gentask):
Traceback (most recent call last):
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/modules/callbacks.py", line 63, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/modules/text_generation.py", line 222, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/peft/peft_model.py", line 581, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/transformers/generation/utils.py", line 1462, in generate
    return self.sample(
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/transformers/generation/utils.py", line 2478, in sample
    outputs = self(
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 710, in forward
    outputs = self.model(
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 598, in forward
    layer_outputs = decoder_layer(
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 313, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 214, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/peft/tuners/lora.py", line 686, in forward
    result = super().forward(x)
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/autograd_4bit.py", line 63, in forward
    out = mm4b.matmul4bit(x, self.qweight, self.scales,
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/matmul_utils_4bit.py", line 109, in matmul4bit
    output = _matmul4bit_v1_recons(x, qweight, scales, zeros)
  File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/matmul_utils_4bit.py", line 81, in _matmul4bit_v1_recons
    output = torch.matmul(x, buffer)
RuntimeError: expected scalar type Float but found Half

Performance?

Is there any performance degradation compared to the non-4-bit alpaca-lora? Thank you.

Stop generation.

Is there no mechanism for stopping generation? The model keeps going even when I hit Stop in the textgen UI.
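There does not appear to be a built-in stop in the monkey-patched path; here is a hedged sketch of how a stop could be wired through transformers' StoppingCriteria (the class name and flag are illustrative, and text-generation-webui has its own callback machinery that would need to set the flag):

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopFlagCriteria(StoppingCriteria):
    """Stops generation as soon as an external flag is raised (e.g. by the UI's Stop button)."""

    def __init__(self):
        self.stop_requested = False

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        return self.stop_requested

stop_flag = StopFlagCriteria()
# model.generate(..., stopping_criteria=StoppingCriteriaList([stop_flag]))
# Set stop_flag.stop_requested = True from another thread to interrupt generation.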

ValueError: `checkpoint` should be the path ... when running server.py

I got this error when trying to run server.py on Colab. Did I set up something wrong?

Monkey Patch Completed.
Loading ../llama-13b-4bit.pt ...
Loading Model ...
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/alpaca_lora_4bit/text-generation-webui/server.py:236 in <module>    │
│                                                                              │
│   233 │   │   i = int(input())-1                                             │
│   234 │   │   print()                                                        │
│   235 │   shared.model_name = available_models[i]                            │
│ ❱ 236 shared.model, shared.tokenizer = load_model(shared.model_name)         │
│   237 if shared.args.lora:                                                   │
│   238 │   add_lora_to_model(shared.args.lora)                                │
│   239                                                                        │
│                                                                              │
│ /content/alpaca_lora_4bit/text-generation-webui/custom_monkey_patch.py:21 in │
│ load_model_llama                                                             │
│                                                                              │
│   18 │   print("Loading {} ...".format(model_path))                          │
│   19 │   t0 = time.time()                                                    │
│   20 │                                                                       │
│ ❱ 21 │   model, tokenizer = load_llama_model_4bit_low_ram(config_path, model │
│   22 │                                                                       │
│   23 │   model = PeftModel.from_pretrained(model, lora_path, device_map={'': │
│   24 │   print('{} Lora Applied.'.format(lora_path))                         │
│                                                                              │
│ /content/alpaca_lora_4bit/text-generation-webui/../repository/GPTQ-for-LLaMa │
│ /autograd_4bit.py:222 in load_llama_model_4bit_low_ram                       │
│                                                                              │
│   219 │   │   │   if name in layers:                                         │
│   220 │   │   │   │   del layers[name]                                       │
│   221 │   │   make_quant_for_4bit_autograd(model, layers)                    │
│ ❱ 222 │   model = accelerate.load_checkpoint_and_dispatch(                   │
│   223 │   │   model=model,                                                   │
│   224 │   │   checkpoint=model_path,                                         │
│   225 │   │   device_map=device_map,                                         │
│                                                                              │
│ /usr/local/lib/python3.9/dist-packages/accelerate/big_modeling.py:479 in     │
│ load_checkpoint_and_dispatch                                                 │
│                                                                              │
│   476 │   │   )                                                              │
│   477 │   if offload_state_dict is None and "disk" in device_map.values():   │
│   478 │   │   offload_state_dict = True                                      │
│ ❱ 479 │   load_checkpoint_in_model(                                          │
│   480 │   │   model,                                                         │
│   481 │   │   checkpoint,                                                    │
│   482 │   │   device_map=device_map,                                         │
│                                                                              │
│ /usr/local/lib/python3.9/dist-packages/accelerate/utils/modeling.py:899 in   │
│ load_checkpoint_in_model                                                     │
│                                                                              │
│   896 │   │   else:                                                          │
│   897 │   │   │   raise ValueError(f"{checkpoint} containing more than one ` │
│   898 │   else:                                                              │
│ ❱ 899 │   │   raise ValueError(                                              │
│   900 │   │   │   "`checkpoint` should be the path to a file containing a wh │
│   901 │   │   │   f"checkpoint, or a folder containing a sharded checkpoint, │
│   902 │   │   )                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: `checkpoint` should be the path to a file containing a whole state 
dict, or the index of a sharded checkpoint, or a folder containing a sharded 
checkpoint, but got ../llama-13b-4bit.pt.

AttributeError: module 'quant' has no attribute 'quant_cuda'

Did I set up something wrong?
Thank you in advance. (I am so close; I saw it interacting with the GPU.)

I was trying to run python finetune.py

/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
  0%|                                                                                                          | 0/62109 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/eric/git/alpaca_lora_4bit/finetune.py", line 138, in <module>
    trainer.train()
  File "/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/trainer.py", line 1644, in train
    return inner_training_loop(
  File "/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/trainer.py", line 1911, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/trainer.py", line 2657, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/trainer.py", line 2689, in compute_loss
    outputs = model(**inputs)
  File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/eric/git/alpaca_lora_4bit/./repository/peft/src/peft/peft_model.py", line 529, in forward
    return self.base_model(
  File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/models/llama/modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/models/llama/modeling_llama.py", line 614, in forward
    layer_outputs = decoder_layer(
  File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/models/llama/modeling_llama.py", line 309, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/models/llama/modeling_llama.py", line 209, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/eric/git/alpaca_lora_4bit/./repository/peft/src/peft/tuners/lora.py", line 682, in forward
    result = super().forward(x)
  File "/home/eric/git/alpaca_lora_4bit/./repository/GPTQ-for-LLaMa/autograd_4bit.py", line 159, in forward
    out = AutogradMatmul4bit.apply(x, self.qweight, self.scales, self.zeros)
  File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/eric/git/alpaca_lora_4bit/./repository/GPTQ-for-LLaMa/autograd_4bit.py", line 128, in forward
    quant.quant_cuda.vecquant4recons(qweight, buffer, scales, zeros)
AttributeError: module 'quant' has no attribute 'quant_cuda'
  0%|

Inferencing?

I'm curious about using this for inference on an existing LLaMA 4-bit model. I have already replaced the LoRA file, replaced the other files, and recompiled. However, when attempting to load the LoRA, I get (a short snippet of the full error):

  File "C:\Users\Arargd\miniconda3\envs\textgen\lib\site-packages\peft\tuners\lora.py", line 191, in _find_and_replace
    self._replace_module(parent, target_name, new_module, target)
UnboundLocalError: local variable 'new_module' referenced before assignment

This is the same error you get if you try to load a LoRA onto a 4-bit model normally.

Am I missing something, or what do I need to do to get inference working on an existing 4-bit model?

add a --ds-type for sharegpt

"--ds_type", choices=["txt", "alpaca", "gpt4all", "sharegpt"]

I tried to train with ShareGPT data and it doesn't work.

KeyError: 'instruction'

because there's no trainer for this file format.
train_data.py needs a TrainShareGPT(ATrainData) class; see the rough sketch after the format example below.

Is what I want possible to do?

the format looks like this:

[
  {
    "id": "QWJhYvA_0",
    "conversations": [
      {
        "from": "human",
        "value": "Summarize the main ideas of Jeff Walker's Product Launch Formula into bullet points as it pertains to a growth marketing agency implementing these strategies and tactics for their clients..."
      },
      {
        "from": "gpt",
        "value": "Here are the main ideas of Jeff Walker's Product Launch Formula that can be applied by a growth marketing agency for their clients:\n\n1. Identify the target audience and their needs: Understand the ideal customer for the product or service, and create a messaging that resonates with them.\n2. Pre-launch: Build anticipation and excitement for the launch by creating buzz, gathering testimonials and case studies, and using social media to create awareness.\n3. Launch: Use a well-crafted launch sequence to maximize sales and conversions. This can include offering bonuses, creating scarcity, and using a deadline to create urgency.\n4. Post-launch: Follow up with customers, gather feedback, and continue to provide value to keep them engaged and loyal.\n5. Create a product suite: Once the initial product is successful, expand the offering by creating additional products or services that address other needs of the same audience.\n6. Continual optimization: Continually monitor and optimize the launch process and product suite to improve results.\n7. Build a community: Use the launch process to build a community of customers who are passionate about the product and can help spread the word.\n8. Use automation: Use technology and automation to streamline the launch process and improve efficiency."
      },
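For reference, a rough sketch of what such a handler could look like, assuming the ATrainData base class in train_data.py exposes a tokenize() helper like the other handlers do (the role tags below are illustrative, not a standard):

class TrainShareGPT(ATrainData):
    ROLE_TAGS = {"human": "### Human:", "gpt": "### Assistant:"}

    def generate_prompt(self, data_point):
        # Flatten one ShareGPT conversation into a single training string.
        lines = []
        for turn in data_point["conversations"]:
            tag = self.ROLE_TAGS.get(turn["from"], turn["from"])
            lines.append(f"{tag}\n{turn['value']}")
        return "\n".join(lines)

    def generate_and_tokenize_prompt(self, data_point):
        return self.tokenize(self.generate_prompt(data_point))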

How to use the finetuned model?

Thanks to your work, I've been able to use the finetune script on the llama-13b model (and it only took 3 hours on a 3090!).

I see the output folder for the finetune has:

adapter_config.json
adapter_model.bin
<bunch of checkpoint folders>
checkpoint-7500

How can I use these output files for inference with text-generation-webui (preferably in 4 bit mode)?

With the alpaca-lora repo (8bit), I used the included export_hf_checkpoint.py script but don't see a similar thing here.
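There is no export step needed for the 4-bit path: the adapter folder is applied on top of the quantized base model at load time, which is essentially what inference.py and custom_monkey_patch.py in this repo do. A sketch, using the paths that appear elsewhere in this thread as placeholders:

from autograd_4bit import load_llama_model_4bit_low_ram
from peft import PeftModel

config_path = './llama-13b-4bit/'                  # HF config + tokenizer files
model_path = './llama-13b-4bit/llama-13b-4bit.pt'  # 4-bit quantized base weights
lora_path = './alpaca_lora/'                       # folder with adapter_config.json + adapter_model.bin

model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path)
model = PeftModel.from_pretrained(model, lora_path)  # the monkey patch also passes a device_map here
print('{} Lora Applied.'.format(lora_path))

For text-generation-webui, the same thing is done by pointing the paths in custom_monkey_patch.py at the 4-bit base model and your adapter output folder.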

add environment.yml?

Running pip install -r requirements.txt still doesn't work in a clean conda environment.

After installing the conda packages, pip install -r requirements.txt works.

Perhaps use environment.yml to fully specify all prerequisites?

I will make a PR proposal

Train lora with embed_tokens and lm_head

I have tried training LoRA together with embed_tokens and lm_head using peft in 8-bit, and it works great. Since bitsandbytes can be told to skip specific layers, this was easy to do. I am curious whether I can use this repo to train LoRA plus embed_tokens and lm_head while keeping the rest of the model in 4-bit?
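In peft, training extra full-precision modules alongside the LoRA adapters is normally expressed with modules_to_save; whether this repo's 4-bit monkey patch handles that correctly is something I have not verified, so treat this as a sketch:

from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # kept in full precision and saved with the adapter
    bias="none",
    task_type="CAUSAL_LM",
)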

Success

I fine-tuned 30B on a 4090 and successfully ran inference on my GPU. It took 40 hours and made my office a sauna.

My instruct dataset:
https://github.com/gururise/AlpacaDataCleaned

A few outputs:

image-1.png

image.png

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Rewrite the lyrics of "Walk Like an Egyptian", in the style of Shakespeare
### Response:

 Verily, thy feet shall stumble not 
In the manner of an Egyptian, O my love 
Thy arms will sway and thy hips will swivel 
As thou dost walk like an Egyptian, O my love 

Though thou be dressed in linen finery 
And thy hair be adorned with jewels galore 
Still thou shalt walk like an Egyptian, O my love 
For verily, 'tis the only way to show off thy wealth, O my love! 

Thou art so graceful as thou dost walk 
With thy hands held high and thy head bowed low 
Let all who see thee take note of this style 
For verily, it is the Egyptian way, O my love! 

Thanks to everyone for your contributions and help; this is as good as it gets on a consumer PC! Now I need to build a dual-3090 SLI system to get more VRAM so I can fine-tune 65B!

flash attention

Is it possible to use flash attention in order to fine-tune on longer conversations instead of just question-and-answer pairs?

GPTQv2 model doesn't load

Basically, the title states the issue.

The stack trace shows:

Traceback (most recent call last):
  File "/home/user/alpaca_lora_4bit/inference.py", line 12, in <module>
    model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path)
  File "/home/user/alpaca_lora_4bit/./repository/GPTQ-for-LLaMa/autograd_4bit.py", line 220, in load_llama_model_4bit_low_ram
    model = accelerate.load_checkpoint_and_dispatch(model=model, checkpoint=model_path, device_map='auto')
  File "/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/big_modeling.py", line 479, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 862, in load_checkpoint_in_model
    set_module_tensor_to_device(model, param_name, param_device, value=param, dtype=dtype)
  File "/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 131, in set_module_tensor_to_device
    raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
ValueError: Autograd4bitQuantLinear() does not have a parameter or a buffer named qzeros.

How to reproduce:

  1. Download GPTQv2 model - https://huggingface.co/sardukar/llama13b-4bit-v2
  2. In inference.py set model_path = './llama-13b-4bit-v2.safetensors'
  3. python inference.py

Intended behaviour:
The model should load and generate a response.

How to check the model:

  1. Use GPTQ-for-LLaMa commit 841feedde876785bc8022ca48fd9c3ff626587e2
  2. CUDA_VISIBLE_DEVICES=0 python llama_inference.py ./llama-13b --wbits 4 --load ./q4/llama13b-4bit-v2.safetensors --text "AI can be explained as"

Complete setup below:

(llm) user@computer:~/alpaca_lora_4bit$ git pull
Already up to date.
(llm) user@computer:~/alpaca_lora_4bit$ ll
total 128
drwxr-xr-x 10 user user  4096 Mar 26 14:17 ./
drwxr-x--- 12 user user  4096 Mar 26 14:17 ../
drwxr-xr-x  8 user user  4096 Mar 26 14:24 .git/
-rw-r--r--  1 user user    38 Mar 26 13:35 .gitignore
-rw-r--r--  1 user user  4498 Mar 26 13:35 Finetune4bConfig.py
drwxr-xr-x  2 user user  4096 Mar 26 13:35 GPTQ-for-LLaMa/
-rw-r--r--  1 user user  1067 Mar 26 13:35 LICENSE
-rw-r--r--  1 user user  1837 Mar 24 14:51 README.md
drwxr-xr-x  2 user user  4096 Mar 22 17:18 alpaca13b_lora/
drwxr-xr-x  2 user user  4096 Mar 26 09:58 alpaca_lora/
-rw-r--r--  1 user user  4714 Mar 26 13:35 arg_parser.py
-rw-r--r--  1 user user 19039 Mar 22 16:01 data.txt
lrwxrwxrwx  1 user user    41 Mar 24 12:29 dataset.json -> /mnt/e/ML/Alpaca13b/alpaca_data_tiny.json*
lrwxrwxrwx  1 user user    31 Mar 23 23:29 finetune-4b -> /mnt/e/ML/Alpaca13b/finetune-4b/
lrwxrwxrwx  1 user user    23 Mar 23 23:30 finetune-4b.py -> finetune-4b/finetune.py*
lrwxrwxrwx  1 user user    38 Mar 23 01:45 finetune-sad.py -> /mnt/e/ML/Alpaca13b/finetune-4b-sad.py*
-rw-r--r--  1 user user  4922 Mar 26 13:35 finetune.py
-rw-r--r--  1 user user  1547 Mar 26 14:17 inference.py
-rw-r--r--  1 user user  1269 Mar 24 14:51 install.bat
-rw-r--r--  1 user user  1330 Mar 24 14:51 install.sh
drwxr-xr-x  2 user user  4096 Mar 22 19:20 llama-13b-4bit/
lrwxrwxrwx  1 user user    61 Mar 26 13:42 llama-13b-4bit-v2.safetensors -> /mnt/e/ML/Alpaca13b/model/q4-new/llama13b-4bit-v2.safetensors*
lrwxrwxrwx  1 user user    43 Mar 22 21:11 llama-13b-4bit.pt -> /mnt/e/ML/Alpaca13b/model/llama-13b-4bit.pt*
drwxr-xr-x  3 user user  4096 Mar 22 16:01 peft/
drwxr-xr-x  5 user user  4096 Mar 22 16:02 repository/
-rw-r--r--  1 user user   186 Mar 22 16:01 requirements.txt
-rwxr--r--  1 user user    54 Mar 22 21:16 start-webui.sh*
drwxr-xr-x 13 user user  4096 Mar 26 14:02 text-generation-webui/
-rw-r--r--  1 user user  4566 Mar 26 13:35 train_data.py
(llm) user@computer:~/alpaca_lora_4bit$ head inference.py
import os
import sys
sys.path.insert(0, './repository/transformers/src')
sys.path.insert(0, './repository/GPTQ-for-LLaMa')
sys.path.insert(0, './repository/peft/src')
import time
import torch
from autograd_4bit import load_llama_model_4bit_low_ram
config_path = './llama-13b-4bit/'
model_path = './llama-13b-4bit-v2.safetensors'
(llm) user@computer:~/alpaca_lora_4bit$ python inference.py
Loading Model ...
/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/utils/modeling.py:696: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(checkpoint_file, framework="pt") as f:
/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = cls(wrap_storage=untyped_storage)
The safetensors archive passed at ./llama-13b-4bit-v2.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:
Traceback (most recent call last):
  File "/home/user/alpaca_lora_4bit/inference.py", line 12, in <module>
    model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path)
  File "/home/user/alpaca_lora_4bit/./repository/GPTQ-for-LLaMa/autograd_4bit.py", line 220, in load_llama_model_4bit_low_ram
    model = accelerate.load_checkpoint_and_dispatch(model=model, checkpoint=model_path, device_map='auto')
  File "/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/big_modeling.py", line 479, in load_checkpoint_and_dispatch
    load_checkpoint_in_model(
  File "/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 862, in load_checkpoint_in_model
    set_module_tensor_to_device(model, param_name, param_device, value=param, dtype=dtype)
  File "/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 131, in set_module_tensor_to_device
    raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
ValueError: Autograd4bitQuantLinear() does not have a parameter or a buffer named qzeros.
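
For what it's worth, the error suggests the loader instantiated a v1-style Autograd4bitQuantLinear (plain scales/zeros buffers) while the GPTQv2 checkpoint ships per-group qzeros and g_idx tensors. A rough sketch of the buffers a v2-compatible layer would need to register before accelerate can fill them (shapes assumed from the GPTQ-for-LLaMa v2 format, not taken from this repo's actual class):

import math
import torch
import torch.nn as nn

class Autograd4bitQuantLinearV2(nn.Module):
    """Hypothetical v2-aware layer: declares the buffers the checkpoint expects."""
    def __init__(self, in_features, out_features, groupsize=128, bits=4):
        super().__init__()
        groups = math.ceil(in_features / groupsize)
        # Packed 4-bit weights, per-group zero points and scales, and the group index map.
        self.register_buffer('qweight', torch.zeros((in_features // 32 * bits, out_features), dtype=torch.int32))
        self.register_buffer('qzeros', torch.zeros((groups, out_features // 32 * bits), dtype=torch.int32))
        self.register_buffer('scales', torch.zeros((groups, out_features), dtype=torch.float16))
        self.register_buffer('g_idx', torch.arange(in_features, dtype=torch.int32) // groupsize)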

AttributeError: module 'quant' has no attribute 'quant_cuda'. Also got 'CUDA extension not installed.'

(textgen) siddhesh@Revision-PC:~/alpaca_lora_4bit/text-generation-webui$ python server.py --gptq-bits 4 --gptq-model-type llama
CUDA extension not installed.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

CUDA SETUP: CUDA runtime path found: /home/siddhesh/miniconda3/envs/textgen/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 117
/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so...
Monkey Patch Completed.
Loading ../llama-7b-4bit.pt ...
Loading Model ...
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Loaded the model in 10.15 seconds.
../alpaca_lora_7b/ Lora Applied.
Apply auto switch and half
/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
warnings.warn(value)
Running on local URL: http://127.0.0.1:7860

To create a public link, set share=True in launch().
Exception in thread Thread-3 (gentask):
Traceback (most recent call last):
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/modules/callbacks.py", line 65, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/modules/text_generation.py", line 215, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/peft/src/peft/peft_model.py", line 581, in generate
outputs = self.base_model.generate(**kwargs)
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/generation/utils.py", line 1462, in generate
return self.sample(
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/generation/utils.py", line 2478, in sample
outputs = self(
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/models/llama/modeling_llama.py", line 765, in forward
outputs = self.model(
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/models/llama/modeling_llama.py", line 614, in forward
layer_outputs = decoder_layer(
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/models/llama/modeling_llama.py", line 309, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/models/llama/modeling_llama.py", line 209, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/peft/src/peft/tuners/lora.py", line 682, in forward
result = super().forward(x)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/GPTQ-for-LLaMa/autograd_4bit.py", line 160, in forward
out = fast_4bit_forward(x, self.qweight, self.scales, self.zeros, self.bias)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/GPTQ-for-LLaMa/autograd_4bit.py", line 115, in fast_4bit_forward
output = matmul4bit(x, qweight, scales.float(), zeros.float())
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/GPTQ-for-LLaMa/autograd_4bit.py", line 37, in matmul4bit
quant.quant_cuda.vecquant4matmul(x, qweight, y, scales, zeros)
AttributeError: module 'quant' has no attribute 'quant_cuda'
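
For what it's worth, the "CUDA extension not installed." line at the top usually means the quant_cuda kernel was never built, so quant.quant_cuda is missing by the time the forward pass reaches it. Rebuilding it with python setup_cuda.py install inside the GPTQ-for-LLaMa folder and then verifying the import is a reasonable first check (a sketch, assuming the compiled extension is importable as quant_cuda):

# Run after `python setup_cuda.py install` in the GPTQ-for-LLaMa directory.
try:
    import quant_cuda  # noqa: F401
    print("quant_cuda extension is importable")
except ImportError as exc:
    print("CUDA extension still missing:", exc)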

Merge lora permanently?

Is it possible to save a merged .pt file, like you can for Stable Diffusion models? Would it still need PEFT and all this other stuff?
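
For reference, peft can fold the adapter into an fp16 copy of the base model; a minimal sketch under that assumption (paths are placeholders, and this merges into the fp16 weights rather than the 4-bit checkpoint itself):

import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

# Placeholders: an fp16 HF checkpoint of the base model and the finetuned LoRA folder.
base = LlamaForCausalLM.from_pretrained('./llama-13b-hf/', torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, './alpaca_lora/')
model = model.merge_and_unload()  # folds the LoRA deltas into the base weights
model.save_pretrained('./llama-13b-alpaca-merged/')

The merged fp16 model no longer needs PEFT at inference time, and it could presumably be re-quantized with GPTQ afterwards if a standalone 4-bit file is wanted.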

Triton Backend Trains Slower Than Cuda?

Mostly curious about what others are seeing...

I'm fine-tuning llama30b on a 24G Titan RTX card, keeping everything else the same (model, dataset (alpaca), hyperparameters, gradient checkpointing on) and just trying out fine-tuning with the two different backends (cuda vs triton).

Cuda is a little more than 3s/it and Triton is a little under 4s/it... I guess I was hoping/expecting triton to perform a bit better (but at least no worse...)

What are others seeing?

Thanks!

TypeError: cannot assign 'torch.cuda.HalfTensor' as parameter 'bias' (torch.nn.Parameter or None expected)

I was following the guide posted for installing this along with the text-generation-webui. I have now run into an issue when I run it.

Loading ../llama-13b-4bit.pt ...
Loading Model ...
Loaded the model in 32.67 seconds.
../alpaca13b_lora/ Lora Applied.
Apply auto switch and half
Traceback (most recent call last):
  File "/home/administrator/alpaca_lora_4bit/text-generation-webui/server.py", line 276, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/administrator/alpaca_lora_4bit/text-generation-webui/custom_monkey_patch.py", line 28, in load_model_llama
    m.bias = m.bias.half()
  File "/home/administrator/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1635, in __setattr__
    raise TypeError("cannot assign '{}' as parameter '{}' "
TypeError: cannot assign 'torch.cuda.HalfTensor' as parameter 'bias' (torch.nn.Parameter or None expected)

I have tried reinstalling packages but I'm unsure of how to continue without breaking anything else.
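
For reference, torch's Module.__setattr__ refuses to overwrite an nn.Parameter with the plain tensor that m.bias.half() returns. A hypothetical workaround is to convert the bias in place instead of reassigning it:

import torch.nn as nn

# `model` is the already-loaded model from the monkey patch; converting .data in
# place sidesteps the nn.Parameter type check that raises the TypeError above.
for m in model.modules():
    if getattr(m, 'bias', None) is not None and isinstance(m.bias, nn.Parameter):
        m.bias.data = m.bias.data.half()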

The atomic add doesn't work on compute 6.1

I get an error when I try to compile.

/home/mint/text-generation-webui/repositories/GPTQ-for-LLaMa/quant_cuda_kernel.cu(548): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (__half *, __half)

/home/mint/text-generation-webui/repositories/GPTQ-for-LLaMa/quant_cuda_kernel.cu(621): error: no instance of overloaded function "atomicAdd" matches the argument list
            argument types are: (__half *, __half)

I found where the issue comes from, but it seems like it's not using the function you defined at the top of the file.

I ported inference to OPT

Ph0rk0z/GPTQ-Merged@24b57d8

I got it doing inference on OPT. Not sure about training, as I haven't trained a single thing yet.

Inference speed for the 6b wasn't bad, at least. I still have to convert the 13b or the 30b, but that takes time and will be obsoleted by the new GPTQ.

Output generated in 2.25 seconds (5.77 tokens/s, 13 tokens, context 77)

Model to test with : https://huggingface.co/autobots/opt-6b-4-bit/tree/main

opt-13b.. took 4 hours to quantize.

Output generated in 4.03 seconds (2.98 tokens/s, 12 tokens, context 97)
Output generated in 8.71 seconds (0.92 tokens/s, 8 tokens, context 1848)
Output generated in 16.67 seconds (2.34 tokens/s, 39 tokens, context 1782)

I will upload it soon and attempt GPT-J and GPT-NeoX.

I guess it would be nice to have a good way to check coherence on these.
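
One rough option is a perplexity probe over a held-out text; a sketch, assuming model and tokenizer are already loaded as in inference.py:

import torch

@torch.no_grad()
def perplexity(model, tokenizer, text, stride=512):
    # Chunked next-token loss over the text; good enough for relative comparisons.
    ids = tokenizer(text, return_tensors='pt').input_ids.to(model.device)
    nlls = []
    for i in range(0, ids.size(1) - 1, stride):
        chunk = ids[:, i:i + stride + 1]
        out = model(chunk, labels=chunk)
        nlls.append(out.loss * (chunk.size(1) - 1))
    return torch.exp(torch.stack(nlls).sum() / (ids.size(1) - 1))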

using flash attention, RuntimeError: Expected is_sm80 to be true, but got false.

I am fine-tuning llama 30b 4-bit with my custom dataset (alpaca_clean + leet10k). Then I tried to enable flash attention, using this command line:

python finetune.py --grad_chckpt --flash_attention True --groupsize 128 --cutoff_len 2048 --llama_q4_model ./llama-30b-4bit-128g.safetensors --llama_q4_config_dir ./llama-30b-4bit/ ./leet10k-alpaca-merged.json

I saw this error:

Traceback (most recent call last):
 File "/home/eric/git/alpaca_lora_4bit/finetune.py", line 156, in <module>
   trainer.train()
 File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
   return inner_training_loop(
 File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
   tr_loss_step = self.training_step(model, inputs)
 File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/transformers/trainer.py", line 2709, in training_step
   self.scaler.scale(loss).backward()
 File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
   torch.autograd.backward(
 File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
   Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
 File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
   return user_fn(self, *args)
 File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
   torch.autograd.backward(outputs_with_grad, args_with_grad)
 File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
   Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
 File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
   return user_fn(self, *args)
 File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 77, in backward
   _flash_attn_backward(
 File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 42, in _flash_attn_backward
   _, _, _, softmax_d = flash_attn_cuda.bwd(
RuntimeError: Expected is_sm80 to be true, but got false. 

Any idea what I'm doing wrong?
@Yamashi

Automatic download of llama-7b

Hi, thanks for providing this.

running

DOCKER_BUILDKIT=1 docker build -t alpaca_lora_4bit . # build step can take 12 min

automatically downloads the llama-7b-hf-int4 weights from Hugging Face. I plan to use a larger model anyway; is there a way to skip the automatic download and instead point it to a different location where the files already reside? Thanks!

lora shape mismatch for 13B llama

During the inference step, I'm seeing this error, but only with the 13b model; the 7b model seems to work okay.
(This is with the decapoda 4-bit model and config.)

  File "/workspace/alpaca_lora_4bit/inference-reflect.py", line 15, in <module>
    model = PeftModel.from_pretrained(model, lora_path, device_map={'': 0}, torch_dtype=torch.float32)
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 161, in from_pretrained
    model = set_peft_model_state_dict(model, adapters_weights)
  File "/opt/conda/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 74, in set_peft_model_state_dict
    model.load_state_dict(peft_model_state_dict, strict=False)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
        size mismatch for base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight: copying a param with shape torch.Size([8, 5120]) from checkpoint, the shape in current model is torch.Size([8, 4096]).
        size mismatch for base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight: copying a param with shape torch.Size([5120, 8]) from checkpoint, the shape in current model is torch.Size([4096, 8]).
        size mismatch for base_model.model.model.layers.0.self_attn.v_proj.lora_A.weight: copying a param with shape torch.Size([8, 5120]) from checkpoint, the shape in current model is torch.Size([8, 4096]).
        size mismatch for base_model.model.model.layers.0.self_attn.v_proj.lora_B.weight: copying a param with shape torch.Size([5120, 8]) from checkpoint, the shape in current model is torch.Size([4096, 8]).
        size mismatch for base_model.model.model.layers.1.self_attn.q_proj.lora_A.weight: copying a param with shape torch.Size([8, 5120]) from checkpoint, the shape in current model is torch.Size([8, 4096]).
        size mismatch for base_model.model.model.layers.1.self_attn.q_proj.lora_B.weight: copying a param with shape torch.Size([5120, 8]) from checkpoint, the shape in current model is torch.Size([4096, 8]).
        size mismatch for base_model.model.model.layers.1.self_attn.v_proj.lora_A.weight: copying a param with shape torch.Size([8, 5120]) from checkpoint, the shape in current model is torch.Size([8, 4096]).
        size mismatch for base_model.model.model.layers.1.self_attn.v_proj.lora_B.weight: copying a param with shape torch.Size([5120, 8]) from checkpoint, the shape in current model is torch.Size([4096, 8]).
        size mismatch for base_model.model.model.layers.2.self_attn.q_proj.lora_A.weight: copying a param with shape torch.Size([8, 5120]) from checkpoint, the shape in current model is torch.Size([8, 4096]).
        size mismatch for base_model.model.model.layers.2.self_attn.q_proj.lora_B.weight: copying a param with shape torch.Size([5120, 8]) from checkpoint, the shape in current model is torch.Size([4096, 8]).
        size mismatch for base_model.model.model.layers.2.self_attn.v_proj.lora_A.weight: copying a param with shape torch.Size([8, 5120]) from checkpoint, the shape in current model is torch.Size([8, 4096]).
        size mismatch for base_model.model.model.layers.2.self_attn.v_proj.lora_B.weight: copying a param with shape torch.Size([5120, 8]) from checkpoint, the shape in current model is torch.Size([4096, 8]).
        size mismatch for base_model.model.model.layers.3.self_attn.q_proj.lora_A.weight: copying a param with shape torch.Size([8, 5120]) from checkpoint, the shape in current model is torch.Size([8, 4096]).
        size mismatch for base_model.model.model.layers.3.self_attn.q_proj.lora_B.weight: copying a param with shape torch.Size([5120, 8]) from checkpoint, the shape in current model is torch.Size([4096, 8]).
        size mismatch for base_model.model.model.layers.3.self_attn.v_proj.lora_A.weight: copying a param with shape torch.Size([8, 5120]) from checkpoint, the shape in current model is torch.Size([8, 4096]).
        size mismatch for base_model.model.model.layers.3.self_attn.v_proj.lora_B.weight: copying a param with shape torch.Size([5120, 8]) from checkpoint, the shape in current model is torch.Size([4096, 8]).
...
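
For reference, a quick way to check which base size an adapter was trained against is to look at the hidden dimension of its first LoRA tensor (a sketch; 5120 is the 13B hidden size, 4096 is 7B's):

import torch

# Placeholder path to the finetune output's adapter weights.
sd = torch.load('./alpaca_lora/adapter_model.bin', map_location='cpu')
name, tensor = next(iter(sd.items()))
print(name, tuple(tensor.shape))  # e.g. a lora_A weight of shape (8, 5120) -> 13B base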

documentation

Can you please provide step-by-step instructions?
The readme doesn't give me enough information to set this up and run inference on my computer.

Monkey patch is bad.

The monkey patch just ends up loading the model through the normal textgen UI path because you pass the parameter, so all of your code is ignored.

Loading settings from settings.json...
Loading llama-13b-4bit...
Loading checkpoint shards:   5%|▊                | 2/41 [00:01<00:19,  1.99it
