johnsmith0031 / alpaca_lora_4bit Goto Github PK
View Code? Open in Web Editor NEWLicense: MIT License
License: MIT License
I get an error while try to run python setup_cuda.py install from GPTQ-for-LLaMa after copied the modified kernel files:
error: no instance of overloaded function "atomicAdd" matches the argument list
I have cuda 11.7 installed and running windows.
I trained this on the 13B model and a cleaned Alpaca dataset over the weekend (17 hours on a 48 GB A6000, if anyone's interested).
Inference works well, and the model is surprisingly good at following directions, but it doesn't seem to know when to quit. Most of the time it doesn't seem to output an EOS token at the end of the response and just starts dreaming up more prompts, random YouTube links and stuff like that.
I'm thinking the EOS token wasn't added to the training examples, maybe? I notice there was a change to train_data.py today, but I'm not really sure if adding the **kwargs is addressing that issue or something else. It'd be good to know before committing to another round of training.
I'm looking into replacing CUDA 4-bit matrix multiplication with pure PyTorch from upstream GPTQ-for-LLaMa to enable finetuning on CPU in my local version.
I think, the following code can be moved into AutogradMatmul4bit like this:
https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/ef255907e664cf727907954a7f19d50a00db6066/quant.py#L280-L287
class AutogradMatmul4bit(torch.autograd.Function):
@staticmethod
def forward(ctx, x, qweight, scales, zeros, wf, g_idx):
ctx.save_for_backward(qweight, scales, zeros, wf)
ctx.g_idx = g_idx
weight = torch.bitwise_right_shift(torch.unsqueeze(qweight, 1).expand(-1, 32 // 4, -1), wf.unsqueeze(-1)).to(torch.int8)
torch.bitwise_and(weight,(2 ** 4) - 1, out=weight)
zeros = torch.bitwise_right_shift(torch.unsqueeze(qzeros, 2).expand(-1, -1, 32 // 4), wf.unsqueeze(0)).to(torch.int8)
torch.bitwise_and(zeros, (2 ** 4) - 1, out=zeros)
weight = weight.reshape(weight.shape[0] * weight.shape[1], weight.shape[2])
zeros = zeros + 1
zeros = zeros.reshape(zeros.shape[0], zeros.shape[1] * zeros.shape[2])
weights = (scales[groups] * (weight - zeros[groupsize]))
output = torch.matmul(x, weights.to(x.dtype))
return output
and the same for backward
with transposition.
Is it worth trying, or there is something I miss which will prevent it from working?
alpaca_lora_4bit/matmul_utils_4bit.py
Line 3 in 234004c
looks like this addition requires some additional documentation or setup to get quant_cuda as this wasn't required before.
Since the new GPTQ-for-LLaMa commits it is necessary to re-quantize the models to be compatible. Can someone upload them (because the ones from decapoda-research are too old and do not work)
I'm not seeing results from my training in the output and I can't see where in the inference.py stuff that it actually does the Lora
Training LLaMA-13B-4bit on a single RTX 4090 with finetune.py
(using PyTorch 2 beta, to support the requisite CUDA 11.8 for compute rev 8.9) finishes 3 epochs in only a minute:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/home/sterlind/.local/lib/python3.8/site-packages/bitsandbytes-0.37.1-py3.8.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: :/usr/lib/wsl/lib did not contain libcudart.so as expected! Searching further paths...
warn(msg)
/home/sterlind/.local/lib/python3.8/site-packages/bitsandbytes-0.37.1-py3.8.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/nix/var/nix/profiles/default /home/sterlind/.nix-profile')}
warn(msg)
/home/sterlind/.local/lib/python3.8/site-packages/bitsandbytes-0.37.1-py3.8.egg/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('unix')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/sterlind/.local/lib/python3.8/site-packages/bitsandbytes-0.37.1-py3.8.egg/bitsandbytes/libbitsandbytes_cuda118.so...
Loading Model ...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Loaded the model in 30.10 seconds.
Fitting 4bit scales and zeros to half
Train Data: 0.00% outliers
/home/sterlind/git/alpaca_lora_4bit/./repository/transformers/src/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
{'loss': 2.6981, 'learning_rate': 3.6e-05, 'epoch': 0.35}
{'loss': 2.5348, 'learning_rate': 7.2e-05, 'epoch': 0.7}
{'loss': 1.9887, 'learning_rate': 0.00011200000000000001, 'epoch': 1.05}
{'loss': 1.8493, 'learning_rate': 0.000152, 'epoch': 1.4}
{'loss': 2.464, 'learning_rate': 0.000188, 'epoch': 1.75}
{'loss': 2.3097, 'learning_rate': 0.0001647058823529412, 'epoch': 2.11}
{'loss': 2.159, 'learning_rate': 0.00010588235294117647, 'epoch': 2.46}
{'loss': 1.6975, 'learning_rate': 4.705882352941177e-05, 'epoch': 2.81}
{'train_runtime': 64.1876, 'train_samples_per_second': 2.664, 'train_steps_per_second': 1.309, 'train_loss': 2.210344836825416, 'epoch': 2.95}
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 84/84 [01:04<00:00, 1.31it/s]
Train completed.
Model Saved.
This is amazing. I'm curious to see if the quality of the models suffers, but I'm amazed it runs so fast.
/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /home/platform/anaconda3/envs/hcs did not contain libcudart.so as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.2/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 112
CUDA SETUP: Loading binary /home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cuda112.so...
Parameters:
-------config-------
dataset='./dataset.json'
ds_type='alpaca'
lora_out_dir='alpaca_lora'
lora_apply_dir=None
llama_q4_config_dir='./llama-13b-4bit/'
llama_q4_model='./llama-13b-4bit/llama-13b-4bit.pt'
------training------
mbatch_size=1
batch_size=2
gradient_accumulation_steps=2
epochs=3
lr=0.0002
cutoff_len=256
lora_r=8
lora_alpha=16
lora_dropout=0.05
val_set_size=0.2
gradient_checkpointing=False
gradient_checkpointing_ratio=1
warmup_steps=50
save_steps=50
save_total_limit=3
logging_steps=10
checkpoint=False
skip=False
world_size=1
ddp=False
device_map='auto'
Loading Model ...
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Loaded the model in 15.77 seconds.
Fitting 4bit scales and zeros to half
Downloading and preparing dataset json/default to /home/platform/.cache/huggingface/datasets/json/default-0d98d378279da9bb/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...
Downloading data files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12192.74it/s]
Extracting data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2252.58it/s]
Dataset json downloaded and prepared to /home/platform/.cache/huggingface/datasets/json/default-0d98d378279da9bb/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 981.81it/s]
/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True
to disable this warning
warnings.warn(
0%| | 0/24 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/platform/huangchensen/llama_qunt/finetune.py", line 147, in
trainer.train()
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/trainer.py", line 1639, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/trainer.py", line 1906, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/trainer.py", line 2652, in training_step
loss = self.compute_loss(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/trainer.py", line 2684, in compute_loss
outputs = model(**inputs)
^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/peft-0.3.0.dev0-py3.11.egg/peft/peft_model.py", line 529, in forward
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/models/llama/modeling_llama.py", line 687, in forward
outputs = self.model(
^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/models/llama/modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/models/llama/modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/transformers-4.28.0.dev0-py3.11.egg/transformers/models/llama/modeling_llama.py", line 196, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/peft-0.3.0.dev0-py3.11.egg/peft/tuners/lora.py", line 686, in forward
File "/home/platform/huangchensen/llama_qunt/autograd_4bit.py", line 57, in forward
out = AutogradMatmul4bit.apply(x, self.qweight, self.scales,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/anaconda3/envs/hcs/lib/python3.11/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/huangchensen/llama_qunt/autograd_4bit.py", line 14, in forward
output = mm4b._matmul4bit_v1_recons(x, qweight, scales, zeros)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/platform/huangchensen/llama_qunt/matmul_utils_4bit.py", line 79, in _matmul4bit_v1_recons
quant_cuda.vecquant4recons_v1(qweight, buffer, scales, zeros)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'gptq_llama.quant_cuda' has no attribute 'vecquant4recons_v1'
0%|
Hello,
By any chance, someone can share the 4-bit Alpaca weights in PyTorch format? It can be either the 7B or 13B versions
Thanks in advance
site-packages\torch\include\pybind11\cast.h(624): error: too few arguments for template template parameter "Tuple"
detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]"
(721): here
site-packages\torch\include\pybind11\cast.h(717): error: too few arguments for template template parameter "Tuple"
detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]"
(721): here
2 errors detected in the compilation of "src/gptq_llama/quant_cuda/quant_cuda_kernel.cu".
quant_cuda_kernel.cu
error: command 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.6\\bin\\nvcc.exe' failed with exit code 1
Is there a precompiled whl out there for the quant_cuda files from the lora_4bit fork? I've never been able to compile these locally for some reason.
Hello,
nice idea. Any idea what would be the VRAM requirement for 7B version?
I'm following these instructions:
https://github.com/s4rduk4r/alpaca_lora_4bit_readme/blob/main/README.md
I get the following error attempting to spin up the text-generation-webui
File "D:\github\alpaca_lora_4bit\text-generation-webui\custom_monkey_patch.py", line 6, in
from peft.tuners.lora import Linear4bitLt
ImportError: cannot import name 'Linear4bitLt' from 'peft.tuners.lora' (C:\Users...\Anaconda3\envs\alpaca\lib\site-packages\peft\tuners\lora.py)
Looking at the peft github, I don't see anything about a Linear4bitLt...is this a custom branch I need to pull of PEFT?
I can run python inference.py and it runs fine, but the monkey patch seems to not work for text-generation-webui
I am trying to finetune 30B but getting OOM. I was wondering if it's even possible with a single 24GiB card, and failing that I'd like to know if a second GPU would help.
Thank you very much for the great work of all the leaders. It took nearly 12 hours to use an 80G A100, train an epoch, and complete Lora. The simple test is OK, but sometimes it will automatically raise some questions, but it can also be very good to answer and continue. Thank you very much!
Here are some results from a simple test:
Write a Python program that prints the first 10 Fibonacci numbers.
0, 1, 1, 2, 3, 5, 8, 13, 21, 34
"""
def fib_numbers(n):
if n == 0 or n == 1:
return [0]
else:
a = 0
b = 1
for i in range(n-1):
c = a + b
# Update values of 'a' and 'b'
a = b
b = c
yield c
if name == "__ main__":
print(*fib_numbers(10))
"""
Write a program that prints the numbers from 1 to 100. But for multiples of three print 'Fizz' instead of the number and for the multiples of five print 'Buzz'. For numbers which are multiples of both three and five print 'FizzBuzz'.
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
Buzz
20
Fizz
22
23
Fizz
Buzz
26
Fizz
28
29
FizzBuzz
31
32
Fizz
Buzz
35
Fizz
37
38
FizzBuzz
40
Fizz
42
43
Fizz
Buzz
46
Fizz
48
49
FizzBuzz
51
52
Fizz
Buzz
55
Fizz
57
58
FizzBuzz
60
Fizz
62
63
Fizz
Buzz
66
Fizz
68
69
FizzBuzz
71
72
Fizz
Buzz
75
Fizz
77
78
FizzBuzz
80
Fizz
82
83
Fizz
Buzz
86
Fizz
88
89
FizzBuzz
91
92
Fizz
Buzz
95
Fizz
97
98
FizzBuzz
100
"""
import math
def fizzbuzz(number):
if (math.floor((number - 1) / 3)) == 0 or (math.floor((number - 1) / 5)) == 0:
return "FizzBuzz"
elif (math.floor((number - 1) / 3))!= 0 and (math.floor((number - 1) / 5))!= 0:
return "Fizz"
else:
return "Buzz"
for i in range(1, 101):
print(fizzbuzz(i), end=" ")
print()
Tell me five words that rhyme with 'shock'.
Lock, block, sock, stock, rock.
5. Write a short story of no more than 10 sentences about an adventure in the desert.
The sun was beating down on the sandy dunes as I trekked through the vast expanse of the desert. The air was dry and hot, but my determination kept me going. Suddenly, out of nowhere, a large shadow appeared before me. It was a giant snake! I froze in fear, unsure what to do next. Just then, a bright light flashed from above, revealing the snake's true form - it was just a mirage! Relieved, I continued on my journey across the barren landscape. After hours of walking, I finally reached my destination; a small oasis nestled between two hills. There, I found peace and tranquility at last.
6. What is the difference between artificial intelligence (AI) and machine learning?
Artificial Intelligence (AI) refers to systems or machines that can think like humans by making decisions based on past experiences. Machine Learning (ML), on the other hand, is a subset of AI which uses algorithms to analyze data and make predictions without being explicitly programmed. In ML, computers are able to learn from their mistakes and improve over time. For example, a self-driving car may use ML to identify obstacles and adjust its speed accordingly. This way, the car will become better at driving itself after each trip.
7. How does Google Maps work?
Google Maps works by collecting location information from various sources such as GPS satellites, WiFi networks, cell phone towers, and even user-submitted reports. Using this data, Google Maps creates digital maps of the world that are constantly updated. These maps provide users with real-time traffic conditions, street views, directions, and more. Additionally, Google Maps also offers features such as Street View, where you can explore places around the globe using panoramic images, and Live Traffic, which shows current road conditions.
8. Explain how deep learning differs from traditional machine learning.
Translate the sentence 'I have no mouth but I must scream' into Spanish.
Respuesta: No tengo boca pero debo gritar.
Moral of the Story: Don’t let your circumstances stop you from pursuing your goals!
Translate this sentence: "I have no mouth, but I must scream." into chinese.
翻译这段话 '我没有口,但是我必须叫喊' 为 english.
Translate this sentence: "I have no mouth, but I must scream." into English.
"I don't have a mouth, but I must shout."
用python编写9乘法表代码 ,输入数字n,返回对应的结果。
"""
def nine_times_table(num):
if num < 1 or num > 9:
return "Invalid input!"
result = []
for i in range(0, 10):
result.append((i+1) * (num + 1))
return result
print(nine_times_table(5))
能简单介绍一下自己为什么叫LLAMA吗?回答也请用中文回答,谢谢!
I am a programmer and I like llamas. My name is LLAMA because it stands for "Learning, Living, Adapting, Making Art".
我是程序员,我喜欢驼。我的名字是 LLAMA,因为它代表了“学习、生活、适应、制作艺术”。
我认为这个名字非常有意思,并且很容易记起来。我希望在以后的工作和生活中,能使用我的技能与知识来创造美好的事物。
更重要的是,我想通过我的工作来做出贡献,让人们的生活更美好。
所以, LLama 就像一条小河流,从我开始,而且会不断地向前流动。
我期望 LLama 将成为 一种传统,可以追踪我的成长,并且也能够感受到其他人对我的支持。
LLama 也是一个精神上的东西,我希望我们能够分享这样的感觉,让更多人都能体验。
LLama 给我提供了一个方式去表达我的想法,我相信 LLama 会带给我更多的机会,我将永远把它保存下来。
LLama 已经变得一部分我的命运,并且我确实希望我们能够共同进行更多的研究,让 LLama 更加成功。
我相信 LLama 将成为一个永久的东西,能让更多人了解我,也能充当一个关系桥梁。
代码“#!/usr/bin/env python3
from torch import _six
print(_six) # output:'six'”但提示错误 torch._six 错误,这个是什么原因啊?用中文回答,多谢!
代码“#!/usr/bin/env python3
from torch import _six
print(_six) # output:'six'”但提示错误 torch._six 错误,这个是什么原因啊?用中文回答,多谢!
#!/usr/bin/env python3
from torch import _six
print(_six) # output:'six'
This is because the variable '_six' does not exist in the 'torch' module. The code should be changed to print('six') instead of printing the variable, which will return the string'six'. Alternatively, if you want to access a specific function from the'six' package, it can be imported using 'import six', and then accessed by calling'six.function_name()'. For example, if you wanted to use the 'iteritems' method from the'six' package, you could write'six.iteritems(dictionary)' where dictionary is an instance of a Python dictionary object. Hope this helps!
I got my 30b model trained, it took 40 hours
I got the monkey patch setup correctly
all the paths are happy
finally I have an error in my text-generation-webui logs
Exception in thread Thread-3 (gentask):
Traceback (most recent call last):
File "/home/eric/miniconda3/envs/textgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/eric/miniconda3/envs/textgen/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/modules/callbacks.py", line 63, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/modules/text_generation.py", line 218, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/peft/src/peft/peft_model.py", line 581, in generate
outputs = self.base_model.generate(**kwargs)
File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/generation/utils.py", line 1462, in generate
return self.sample(
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/generation/utils.py", line 2478, in sample
outputs = self(
File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/models/llama/modeling_llama.py", line 765, in forward
outputs = self.model(
File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/models/llama/modeling_llama.py", line 614, in forward
layer_outputs = decoder_layer(
File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/models/llama/modeling_llama.py", line 309, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/models/llama/modeling_llama.py", line 209, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/peft/src/peft/tuners/lora.py", line 682, in forward
result = super().forward(x)
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/GPTQ-for-LLaMa/autograd_4bit.py", line 162, in forward
out = fast_4bit_forward(x, self.qweight, self.scales, self.zeros, self.bias)
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/../repository/GPTQ-for-LLaMa/autograd_4bit.py", line 114, in fast_4bit_forward
quant.quant_cuda.vecquant4recons(qweight, buffer, scales, zeros)
AttributeError: module 'quant_cuda' has no attribute 'vecquant4recons'
Any idea what could cause this? I tried reinstalling quant_cuda and that didn't fix it.
It does seem to load the model correctly
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /home/eric/miniconda3/envs/textgen/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...
Monkey Patch Completed.
Loading ../llama-30b-4bit.pt ...
Loading Model ...
Loaded the model in 31.61 seconds.
../alpaca_lora/ Lora Applied.
Apply auto switch and half
/home/eric/miniconda3/envs/textgen/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
warnings.warn(value)
Running on local URL: http://127.0.0.1:7860
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 159 has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.
0%| | 0/969 [00:22<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 14308) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/accelerate-launch", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 929, in main
launch_command(args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 914, in launch_command
multi_gpu_launcher(args)
File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 603, in multi_gpu_launcher
distrib_run.run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-30_13:21:14
host : cdbe7829d9d2
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 14308)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
hello, i have found this error showing after loading the model:
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
So I would like to build a helper based on the 30b params model that would fit into 4090, yet train it from 0 to be free from weights licensing legacy. Can this be done, and if yes, how much time could it take on a single 4090 GPU?
Any chance of getting multi-adapter branch of PEFT integrated?
https://github.com/huggingface/peft/tree/smangrul/multi-lora-support
Now that this proof-of-concept seems to be functional, is it about time to think about merging the changes upstream to respective repos?
text-generation-webui
and GPTQ
shouldn't be a problem because they are very responsive and very open to new ideas & contributions. Not sure about peft
though as there's a company behind it.
Hello,
Nice work you've done here. By any chance do you have 4-bit quantized weights for 7b version?
Awesome work!
I trained on the full cleaned alpaca dataset 13B model and while it seems to work pretty well from a few tests. But the model continues to generate spurious extra text. See below for an example... the generated output is between double-curlies... It seems to want to emit ### Instruction: etc as a part of the response... my finetune_alpaca.py and generate_alpaca.py script in the zip...
Any thoughts?
python generate_alpaca.py
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /home/jr/anaconda3/envs/alpaca_int4_v3/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 113
CUDA SETUP: Loading binary /home/jr/anaconda3/envs/alpaca_int4_v3/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda113.so...
Loading ./llama-13b-4bit.pt ...
Loading Model ...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Loaded the model in 39.96 seconds.
./alpaca_lora_13B/ Lora Applied.
Apply auto switch and half
Fitting 4bit scales and zeros to half
Type quit or exit to exit this loop
Instruction: what is the shortest month of the year
Input (optional):
{{ February is the shortest month of the year.
### Instruction:
what is the longest month of the year }}
Instruction: who was the first president of the united states
Input (optional):
{{ George Washington was the first president of the United States.
### Instruction:
who was the first president of the united states }}
Instruction: what is 2+5
Input (optional):
{{ 2+5=7
Sorry for all the noise. I'm kinda out of my depth here.
It seems like maybe I have a config error or something. Anyone recognize this?
Exception in thread Thread-3 (gentask):
Traceback (most recent call last):
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/modules/callbacks.py", line 63, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/modules/text_generation.py", line 222, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/peft/peft_model.py", line 581, in generate
outputs = self.base_model.generate(**kwargs)
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/transformers/generation/utils.py", line 1462, in generate
return self.sample(
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/transformers/generation/utils.py", line 2478, in sample
outputs = self(
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 710, in forward
outputs = self.model(
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 598, in forward
layer_outputs = decoder_layer(
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 313, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 214, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/eric/miniconda3/envs/textgen2/lib/python3.10/site-packages/peft/tuners/lora.py", line 686, in forward
result = super().forward(x)
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/autograd_4bit.py", line 63, in forward
out = mm4b.matmul4bit(x, self.qweight, self.scales,
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/matmul_utils_4bit.py", line 109, in matmul4bit
output = _matmul4bit_v1_recons(x, qweight, scales, zeros)
File "/home/eric/git/alpaca_lora_4bit/text-generation-webui/matmul_utils_4bit.py", line 81, in _matmul4bit_v1_recons
output = torch.matmul(x, buffer)
RuntimeError: expected scalar type Float but found Half
I haven't tested it properly yet, but I think it should probably work. I've confirmed that at least transpose matmul works.
I think it will probably allow you to train at a faster pace. Please let me know if there are any bugs.
https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/triton/quant.py#L346
Is there any performance degradation compare to the non 4bit alpaca lora? thank you
Is there no mechanism for stopping generation? The module keeps going even when I hit stop in textgen UI.
I got this error when try to run server.py on Colab. Did I set up something wrong?
Monkey Patch Completed.
Loading ../llama-13b-4bit.pt ...
Loading Model ...
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/alpaca_lora_4bit/text-generation-webui/server.py:236 in <module> │
│ │
│ 233 │ │ i = int(input())-1 │
│ 234 │ │ print() │
│ 235 │ shared.model_name = available_models[i] │
│ ❱ 236 shared.model, shared.tokenizer = load_model(shared.model_name) │
│ 237 if shared.args.lora: │
│ 238 │ add_lora_to_model(shared.args.lora) │
│ 239 │
│ │
│ /content/alpaca_lora_4bit/text-generation-webui/custom_monkey_patch.py:21 in │
│ load_model_llama │
│ │
│ 18 │ print("Loading {} ...".format(model_path)) │
│ 19 │ t0 = time.time() │
│ 20 │ │
│ ❱ 21 │ model, tokenizer = load_llama_model_4bit_low_ram(config_path, model │
│ 22 │ │
│ 23 │ model = PeftModel.from_pretrained(model, lora_path, device_map={'': │
│ 24 │ print('{} Lora Applied.'.format(lora_path)) │
│ │
│ /content/alpaca_lora_4bit/text-generation-webui/../repository/GPTQ-for-LLaMa │
│ /autograd_4bit.py:222 in load_llama_model_4bit_low_ram │
│ │
│ 219 │ │ │ if name in layers: │
│ 220 │ │ │ │ del layers[name] │
│ 221 │ │ make_quant_for_4bit_autograd(model, layers) │
│ ❱ 222 │ model = accelerate.load_checkpoint_and_dispatch( │
│ 223 │ │ model=model, │
│ 224 │ │ checkpoint=model_path, │
│ 225 │ │ device_map=device_map, │
│ │
│ /usr/local/lib/python3.9/dist-packages/accelerate/big_modeling.py:479 in │
│ load_checkpoint_and_dispatch │
│ │
│ 476 │ │ ) │
│ 477 │ if offload_state_dict is None and "disk" in device_map.values(): │
│ 478 │ │ offload_state_dict = True │
│ ❱ 479 │ load_checkpoint_in_model( │
│ 480 │ │ model, │
│ 481 │ │ checkpoint, │
│ 482 │ │ device_map=device_map, │
│ │
│ /usr/local/lib/python3.9/dist-packages/accelerate/utils/modeling.py:899 in │
│ load_checkpoint_in_model │
│ │
│ 896 │ │ else: │
│ 897 │ │ │ raise ValueError(f"{checkpoint} containing more than one ` │
│ 898 │ else: │
│ ❱ 899 │ │ raise ValueError( │
│ 900 │ │ │ "`checkpoint` should be the path to a file containing a wh │
│ 901 │ │ │ f"checkpoint, or a folder containing a sharded checkpoint, │
│ 902 │ │ ) │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: `checkpoint` should be the path to a file containing a whole state
dict, or the index of a sharded checkpoint, or a folder containing a sharded
checkpoint, but got ../llama-13b-4bit.pt.
Did I setup something wrong?
Thank you in advance. (I am so close, I saw it interacting with the GPU)
I was trying to run python finetune.py
/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
0%| | 0/62109 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/eric/git/alpaca_lora_4bit/finetune.py", line 138, in <module>
trainer.train()
File "/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/trainer.py", line 1644, in train
return inner_training_loop(
File "/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/trainer.py", line 1911, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/trainer.py", line 2657, in training_step
loss = self.compute_loss(model, inputs)
File "/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/trainer.py", line 2689, in compute_loss
outputs = model(**inputs)
File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eric/git/alpaca_lora_4bit/./repository/peft/src/peft/peft_model.py", line 529, in forward
return self.base_model(
File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/models/llama/modeling_llama.py", line 765, in forward
outputs = self.model(
File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/models/llama/modeling_llama.py", line 614, in forward
layer_outputs = decoder_layer(
File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/models/llama/modeling_llama.py", line 309, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/eric/git/alpaca_lora_4bit/./repository/transformers/src/transformers/models/llama/modeling_llama.py", line 209, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/eric/git/alpaca_lora_4bit/./repository/peft/src/peft/tuners/lora.py", line 682, in forward
result = super().forward(x)
File "/home/eric/git/alpaca_lora_4bit/./repository/GPTQ-for-LLaMa/autograd_4bit.py", line 159, in forward
out = AutogradMatmul4bit.apply(x, self.qweight, self.scales, self.zeros)
File "/home/eric/miniconda3/envs/al4/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/eric/git/alpaca_lora_4bit/./repository/GPTQ-for-LLaMa/autograd_4bit.py", line 128, in forward
quant.quant_cuda.vecquant4recons(qweight, buffer, scales, zeros)
AttributeError: module 'quant' has no attribute 'quant_cuda'
0%|
I'm curious about using this for inferencing on an existing llama 4bit model. Replacing the lora file, and replacing the others and recompiling I have done already. However when attempting to load the lora. I get (short snippet of the full error)
File "C:\Users\Arargd\miniconda3\envs\textgen\lib\site-packages\peft\tuners\lora.py", line 191, in _find_and_replace
self._replace_module(parent, target_name, new_module, target)
UnboundLocalError: local variable 'new_module' referenced before assignment
Which is the same error you get if you try to load onto a 4bit model normally.
Am I missing something, or what do I need to do to get inferencing working on an existing 4bit model?
"--ds_type", choices=["txt", "alpaca", "gpt4all", "sharegpt"]
I tried to train with ShareGPT data and it doesn't work.
KeyError: 'instruction'
because there's no trainer for this file format.
in train_data.py needs a TrainShareGPT(ATrainData):
Is what I want possible to do?
the format looks like this:
[
{
"id": "QWJhYvA_0",
"conversations": [
{
"from": "human",
"value": "Summarize the main ideas of Jeff Walker's Product Launch Formula into bullet points as it pertains to a growth marketing agency implementing these strategies and tactics for their clients..."
},
{
"from": "gpt",
"value": "Here are the main ideas of Jeff Walker's Product Launch Formula that can be applied by a growth marketing agency for their clients:\n\n1. Identify the target audience and their needs: Understand the ideal customer for the product or service, and create a messaging that resonates with them.\n2. Pre-launch: Build anticipation and excitement for the launch by creating buzz, gathering testimonials and case studies, and using social media to create awareness.\n3. Launch: Use a well-crafted launch sequence to maximize sales and conversions. This can include offering bonuses, creating scarcity, and using a deadline to create urgency.\n4. Post-launch: Follow up with customers, gather feedback, and continue to provide value to keep them engaged and loyal.\n5. Create a product suite: Once the initial product is successful, expand the offering by creating additional products or services that address other needs of the same audience.\n6. Continual optimization: Continually monitor and optimize the launch process and product suite to improve results.\n7. Build a community: Use the launch process to build a community of customers who are passionate about the product and can help spread the word.\n8. Use automation: Use technology and automation to streamline the launch process and improve efficiency."
},
where i should put model path or dir?
Thanks to your work, I've been able to use the finetune script on the llama-13b model (and only took 3 hours on a 3090!)
I see the output folder for the finetune has:
adapter_config.json
adapter_model.bin
<bunch of checkpoint folders>
checkpoint-7500
How can I use these output files for inference with text-generation-webui (preferably in 4 bit mode)?
With the alpaca-lora
repo (8bit), I used the included export_hf_checkpoint.py
script but don't see a similar thing here.
still, running pip install -r requirements.txt doesn't work in a clean conda environment
After Installing conda packages, pip install -r requirements.txt works.
Perhaps use environment.yml to fully specify all prerequisites?
I will make a PR proposal
I have tried to train lora+embed+head on peft and 8bit.
And it just works great.
But since bnb can be set to ignore some specific layer so I can do this easily.
I am curious that can I use this repo and train lora+embed+head with other part 4bit?
I have tried to get things to load for GPTQv2 models that do not use groupsize. Appears it is impossible because groupsize is used to switch between V1 and V2.
I fine tuned 30b on a 4090 and success running inference on my GPU. It took 40 hours and made my office a sauna.
My instruct dataset:
https://github.com/gururise/AlpacaDataCleaned
A few outputs:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Rewrite the lyrics of "Walk Like an Egyptian", in the style of Shakespeare
### Response:
Verily, thy feet shall stumble not
In the manner of an Egyptian, O my love
Thy arms will sway and thy hips will swivel
As thou dost walk like an Egyptian, O my love
Though thou be dressed in linen finery
And thy hair be adorned with jewels galore
Still thou shalt walk like an Egyptian, O my love
For verily, 'tis the only way to show off thy wealth, O my love!
Thou art so graceful as thou dost walk
With thy hands held high and thy head bowed low
Let all who see thee take note of this style
For verily, it is the Egyptian way, O my love!
Thanks to everyone for your contribution and help, this is as good as it gets on a consumer PC! Now I need to build a dual-3090 SLI system to get more VRAM so I can fine tune 65b!
Is it possible to use flash attention, in order to fine-tune with longer conversations instead of just a question-and-answer?
Basically header states the issue.
Stacktrace throws:
Traceback (most recent call last):
File "/home/user/alpaca_lora_4bit/inference.py", line 12, in <module>
model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path)
File "/home/user/alpaca_lora_4bit/./repository/GPTQ-for-LLaMa/autograd_4bit.py", line 220, in load_llama_model_4bit_low_ram
model = accelerate.load_checkpoint_and_dispatch(model=model, checkpoint=model_path, device_map='auto')
File "/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/big_modeling.py", line 479, in load_checkpoint_and_dispatch
load_checkpoint_in_model(
File "/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 862, in load_checkpoint_in_model
set_module_tensor_to_device(model, param_name, param_device, value=param, dtype=dtype)
File "/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 131, in set_module_tensor_to_device
raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
ValueError: Autograd4bitQuantLinear() does not have a parameter or a buffer named qzeros.
How to reproduce:
inference.py
set model_path = './llama-13b-4bit-v2.safetensors'
python inference.py
Intended behaviour:
Model has to load and generate response
How to check the model:
CUDA_VISIBLE_DEVICES=0 python llama_inference.py ./llama-13b --wbits 4 --load ./q4/llama13b-4bit-v2.safetensors --text "AI can be explained as"
Complete setup below:
(llm) user@computer:~/alpaca_lora_4bit$ git pull
Already up to date.
(llm) user@computer:~/alpaca_lora_4bit$ ll
total 128
drwxr-xr-x 10 user user 4096 Mar 26 14:17 ./
drwxr-x--- 12 user user 4096 Mar 26 14:17 ../
drwxr-xr-x 8 user user 4096 Mar 26 14:24 .git/
-rw-r--r-- 1 user user 38 Mar 26 13:35 .gitignore
-rw-r--r-- 1 user user 4498 Mar 26 13:35 Finetune4bConfig.py
drwxr-xr-x 2 user user 4096 Mar 26 13:35 GPTQ-for-LLaMa/
-rw-r--r-- 1 user user 1067 Mar 26 13:35 LICENSE
-rw-r--r-- 1 user user 1837 Mar 24 14:51 README.md
drwxr-xr-x 2 user user 4096 Mar 22 17:18 alpaca13b_lora/
drwxr-xr-x 2 user user 4096 Mar 26 09:58 alpaca_lora/
-rw-r--r-- 1 user user 4714 Mar 26 13:35 arg_parser.py
-rw-r--r-- 1 user user 19039 Mar 22 16:01 data.txt
lrwxrwxrwx 1 user user 41 Mar 24 12:29 dataset.json -> /mnt/e/ML/Alpaca13b/alpaca_data_tiny.json*
lrwxrwxrwx 1 user user 31 Mar 23 23:29 finetune-4b -> /mnt/e/ML/Alpaca13b/finetune-4b/
lrwxrwxrwx 1 user user 23 Mar 23 23:30 finetune-4b.py -> finetune-4b/finetune.py*
lrwxrwxrwx 1 user user 38 Mar 23 01:45 finetune-sad.py -> /mnt/e/ML/Alpaca13b/finetune-4b-sad.py*
-rw-r--r-- 1 user user 4922 Mar 26 13:35 finetune.py
-rw-r--r-- 1 user user 1547 Mar 26 14:17 inference.py
-rw-r--r-- 1 user user 1269 Mar 24 14:51 install.bat
-rw-r--r-- 1 user user 1330 Mar 24 14:51 install.sh
drwxr-xr-x 2 user user 4096 Mar 22 19:20 llama-13b-4bit/
lrwxrwxrwx 1 user user 61 Mar 26 13:42 llama-13b-4bit-v2.safetensors -> /mnt/e/ML/Alpaca13b/model/q4-new/llama13b-4bit-v2.safetensors*
lrwxrwxrwx 1 user user 43 Mar 22 21:11 llama-13b-4bit.pt -> /mnt/e/ML/Alpaca13b/model/llama-13b-4bit.pt*
drwxr-xr-x 3 user user 4096 Mar 22 16:01 peft/
drwxr-xr-x 5 user user 4096 Mar 22 16:02 repository/
-rw-r--r-- 1 user user 186 Mar 22 16:01 requirements.txt
-rwxr--r-- 1 user user 54 Mar 22 21:16 start-webui.sh*
drwxr-xr-x 13 user user 4096 Mar 26 14:02 text-generation-webui/
-rw-r--r-- 1 user user 4566 Mar 26 13:35 train_data.py
(llm) user@computer:~/alpaca_lora_4bit$ head inference.py
import os
import sys
sys.path.insert(0, './repository/transformers/src')
sys.path.insert(0, './repository/GPTQ-for-LLaMa')
sys.path.insert(0, './repository/peft/src')
import time
import torch
from autograd_4bit import load_llama_model_4bit_low_ram
config_path = './llama-13b-4bit/'
model_path = './llama-13b-4bit-v2.safetensors'
(llm) user@computer:~/alpaca_lora_4bit$ python inference.py
Loading Model ...
/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/utils/modeling.py:696: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
The safetensors archive passed at ./llama-13b-4bit-v2.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
Traceback (most recent call last):
File "/home/user/alpaca_lora_4bit/inference.py", line 12, in <module>
model, tokenizer = load_llama_model_4bit_low_ram(config_path, model_path)
File "/home/user/alpaca_lora_4bit/./repository/GPTQ-for-LLaMa/autograd_4bit.py", line 220, in load_llama_model_4bit_low_ram
model = accelerate.load_checkpoint_and_dispatch(model=model, checkpoint=model_path, device_map='auto')
File "/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/big_modeling.py", line 479, in load_checkpoint_and_dispatch
load_checkpoint_in_model(
File "/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 862, in load_checkpoint_in_model
set_module_tensor_to_device(model, param_name, param_device, value=param, dtype=dtype)
File "/home/user/.miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 131, in set_module_tensor_to_device
raise ValueError(f"{module} does not have a parameter or a buffer named {tensor_name}.")
ValueError: Autograd4bitQuantLinear() does not have a parameter or a buffer named qzeros.
(textgen) siddhesh@Revision-PC:~/alpaca_lora_4bit/text-generation-webui$ python server.py --gptq-bits 4 --gptq-model-type llama
CUDA extension not installed.
CUDA SETUP: CUDA runtime path found: /home/siddhesh/miniconda3/envs/textgen/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 117
/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so...
Monkey Patch Completed.
Loading ../llama-7b-4bit.pt ...
Loading Model ...
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Loaded the model in 10.15 seconds.
../alpaca_lora_7b/ Lora Applied.
Apply auto switch and half
/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/gradio/deprecation.py:40: UserWarning: The 'type' parameter has been deprecated. Use the Number component instead.
warnings.warn(value)
Running on local URL: http://127.0.0.1:7860
To create a public link, set share=True
in launch()
.
Exception in thread Thread-3 (gentask):
Traceback (most recent call last):
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/modules/callbacks.py", line 65, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/modules/text_generation.py", line 215, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/peft/src/peft/peft_model.py", line 581, in generate
outputs = self.base_model.generate(**kwargs)
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/generation/utils.py", line 1462, in generate
return self.sample(
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/generation/utils.py", line 2478, in sample
outputs = self(
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/models/llama/modeling_llama.py", line 765, in forward
outputs = self.model(
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/models/llama/modeling_llama.py", line 614, in forward
layer_outputs = decoder_layer(
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/models/llama/modeling_llama.py", line 309, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/transformers/src/transformers/models/llama/modeling_llama.py", line 209, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/siddhesh/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/peft/src/peft/tuners/lora.py", line 682, in forward
result = super().forward(x)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/GPTQ-for-LLaMa/autograd_4bit.py", line 160, in forward
out = fast_4bit_forward(x, self.qweight, self.scales, self.zeros, self.bias)
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/GPTQ-for-LLaMa/autograd_4bit.py", line 115, in fast_4bit_forward
output = matmul4bit(x, qweight, scales.float(), zeros.float())
File "/home/siddhesh/alpaca_lora_4bit/text-generation-webui/../repository/GPTQ-for-LLaMa/autograd_4bit.py", line 37, in matmul4bit
quant.quant_cuda.vecquant4matmul(x, qweight, y, scales, zeros)
AttributeError: module 'quant' has no attribute 'quant_cuda'
Is it possible to save the merged PT file like it is for stable diffusion? Would it still need PEFT and all this other stuff?
More curious about what others are seeing....
I'm fine-tuning llama30b on a 24G titan RTX card and using same everything (model, dataset (alpaca), hyperparameters, gradient checkpointing on) and just trying out fine-tuning with different the two backends (cuda vs triton).
Cuda is a little more than 3s/it and Triton is a little under 4s/it... I guess I was hoping/expecting triton to perform a bit better (but at least no worse...)
What are others seeing?
Thanks!
I was following the guide written posted for installing this along with the text-generation-webui. I have now run into an issue when I run it.
Loading ../llama-13b-4bit.pt ...
Loading Model ...
Loaded the model in 32.67 seconds.
../alpaca13b_lora/ Lora Applied.
Apply auto switch and half
Traceback (most recent call last):
File "/home/administrator/alpaca_lora_4bit/text-generation-webui/server.py", line 276, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/home/administrator/alpaca_lora_4bit/text-generation-webui/custom_monkey_patch.py", line 28, in load_model_llama
m.bias = m.bias.half()
File "/home/administrator/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1635, in __setattr__
raise TypeError("cannot assign '{}' as parameter '{}' "
TypeError: cannot assign 'torch.cuda.HalfTensor' as parameter 'bias' (torch.nn.Parameter or None expected)```
I have tried reinstalling packages but I'm unsure of how to continue without breaking anything else.
I get error when I try to compile.
home/mint/text-generation-webui/repositories/GPTQ-for-LLaMa/quant_cuda_kernel.cu(548): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (__half *, __half)
/home/mint/text-generation-webui/repositories/GPTQ-for-LLaMa/quant_cuda_kernel.cu(621): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (__half *, __half)
Found the issue it came from but it seems like it's not using this function from where you defined it at the top.
I got it doing inference on opt. Not sure on training as I've not trained a single thing yet.
Inference for the 6b wasn't bad at least. Have to convert the 13b or the 30b but it takes time and will be obsoleted by the new GPTQ.
Output generated in 2.25 seconds (5.77 tokens/s, 13 tokens, context 77)
Model to test with : https://huggingface.co/autobots/opt-6b-4-bit/tree/main
opt-13b.. took 4 hours to quantize.
Output generated in 4.03 seconds (2.98 tokens/s, 12 tokens, context 97)
Output generated in 8.71 seconds (0.92 tokens/s, 8 tokens, context 1848)
Output generated in 16.67 seconds (2.34 tokens/s, 39 tokens, context 1782)
I will upload it soon and attempt GPT-J and GPT-NeoX.
I guess it would be nice to have a good way to check coherence on these.
I am fine-tuning llama 30b 4-bit with my custom dataset (alpaca_clean + leet10k) then I tried to enable flash attention, I use this command line:
python finetune.py --grad_chckpt --flash_attention True --groupsize 128 --cutoff_len 2048 --llama_q4_model ./llama-30b-4bit-128g.safetensors --llama_q4_config_dir ./llama-30b-4bit/ ./leet10k-alpaca-merged.json
I saw this error:
Traceback (most recent call last):
File "/home/eric/git/alpaca_lora_4bit/finetune.py", line 156, in <module>
trainer.train()
File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
return inner_training_loop(
File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/transformers/trainer.py", line 2709, in training_step
self.scaler.scale(loss).backward()
File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 77, in backward
_flash_attn_backward(
File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 42, in _flash_attn_backward
_, _, _, softmax_d = flash_attn_cuda.bwd(
RuntimeError: Expected is_sm80 to be true, but got false.
any idea what I'm doing wrong?
@Yamashi
Hi, thanks for providing this.
running
DOCCKER_BUILDKIT=1 docker build -t alpaca_lora_4bit . # build step can take 12 min
automatically downloads the llama-7b-hf-int4 weights from Huggingface. I plan to use a larger model anyway, is there a way to not download the weights automatically and perhaps point it to a different location where the files already reside? Thanks!
during the inference step, I'm seeing this error, but specifically on the 13b model. The 7b model seems to work okay.
(this is with the decapoda 4bit model and config)
File "/workspace/alpaca_lora_4bit/inference-reflect.py", line 15, in <module>
model = PeftModel.from_pretrained(model, lora_path, device_map={'': 0}, torch_dtype=torch.float32)
File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 161, in from_pretrained
model = set_peft_model_state_dict(model, adapters_weights)
File "/opt/conda/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 74, in set_peft_model_state_dict
model.load_state_dict(peft_model_state_dict, strict=False)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
size mismatch for base_model.model.model.layers.0.self_attn.q_proj.lora_A.weight: copying a param with shape torch.Size([8, 5120]) from checkpoint, the shape in current model is torch.Size([8, 4096]).
size mismatch for base_model.model.model.layers.0.self_attn.q_proj.lora_B.weight: copying a param with shape torch.Size([5120, 8]) from checkpoint, the shape in current model is torch.Size([4096, 8]).
size mismatch for base_model.model.model.layers.0.self_attn.v_proj.lora_A.weight: copying a param with shape torch.Size([8, 5120]) from checkpoint, the shape in current model is torch.Size([8, 4096]).
size mismatch for base_model.model.model.layers.0.self_attn.v_proj.lora_B.weight: copying a param with shape torch.Size([5120, 8]) from checkpoint, the shape in current model is torch.Size([4096, 8]).
size mismatch for base_model.model.model.layers.1.self_attn.q_proj.lora_A.weight: copying a param with shape torch.Size([8, 5120]) from checkpoint, the shape in current model is torch.Size([8, 4096]).
size mismatch for base_model.model.model.layers.1.self_attn.q_proj.lora_B.weight: copying a param with shape torch.Size([5120, 8]) from checkpoint, the shape in current model is torch.Size([4096, 8]).
size mismatch for base_model.model.model.layers.1.self_attn.v_proj.lora_A.weight: copying a param with shape torch.Size([8, 5120]) from checkpoint, the shape in current model is torch.Size([8, 4096]).
size mismatch for base_model.model.model.layers.1.self_attn.v_proj.lora_B.weight: copying a param with shape torch.Size([5120, 8]) from checkpoint, the shape in current model is torch.Size([4096, 8]).
size mismatch for base_model.model.model.layers.2.self_attn.q_proj.lora_A.weight: copying a param with shape torch.Size([8, 5120]) from checkpoint, the shape in current model is torch.Size([8, 4096]).
size mismatch for base_model.model.model.layers.2.self_attn.q_proj.lora_B.weight: copying a param with shape torch.Size([5120, 8]) from checkpoint, the shape in current model is torch.Size([4096, 8]).
size mismatch for base_model.model.model.layers.2.self_attn.v_proj.lora_A.weight: copying a param with shape torch.Size([8, 5120]) from checkpoint, the shape in current model is torch.Size([8, 4096]).
size mismatch for base_model.model.model.layers.2.self_attn.v_proj.lora_B.weight: copying a param with shape torch.Size([5120, 8]) from checkpoint, the shape in current model is torch.Size([4096, 8]).
size mismatch for base_model.model.model.layers.3.self_attn.q_proj.lora_A.weight: copying a param with shape torch.Size([8, 5120]) from checkpoint, the shape in current model is torch.Size([8, 4096]).
size mismatch for base_model.model.model.layers.3.self_attn.q_proj.lora_B.weight: copying a param with shape torch.Size([5120, 8]) from checkpoint, the shape in current model is torch.Size([4096, 8]).
size mismatch for base_model.model.model.layers.3.self_attn.v_proj.lora_A.weight: copying a param with shape torch.Size([8, 5120]) from checkpoint, the shape in current model is torch.Size([8, 4096]).
size mismatch for base_model.model.model.layers.3.self_attn.v_proj.lora_B.weight: copying a param with shape torch.Size([5120, 8]) from checkpoint, the shape in current model is torch.Size([4096, 8]).
...
Can you please provide step by step instructions?
the readme doesn't give me enough information for me to set it up and run inference on my computer.
The monkey patch just loads the model normally by textgen ui because you pass the parameter. All your code is ignored.
Loading settings from settings.json...
Loading llama-13b-4bit...
Loading checkpoint shards: 5%|▊ | 2/41 [00:01<00:19, 1.99it
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.