
0cc4m / koboldai


This project is forked from henk717/koboldai


License: GNU Affero General Public License v3.0

Shell 0.48% JavaScript 15.53% Python 43.79% Lua 3.89% Haxe 0.15% PowerShell 0.09% CSS 8.06% HTML 20.98% Batchfile 0.39% Jupyter Notebook 1.64% Dockerfile 0.04% Less 1.71% SCSS 1.71% Stylus 1.56%


koboldai's Issues

Hey, I'm not sure what's wrong, but it automatically deletes a lot of output at the end of each generation.

What I have observed so far is that when using models like HyperMantis and Chronos, it will delete all but the first paragraph of the current output after each generation.
When using Erebus and Hermes, it will usually delete all of the content that was generated that time.
If I switch to the new UI, the newly generated content is usually kept, but it is no longer possible to go backwards: only the commands and actions I entered are reverted, not the AI output.
I tried reinstalling KAI to reset it to its initial state, but I had the same problem, which did not occur before with the non-4bit version.

NameError: name 'os' is not defined after last commit

Traceback (most recent call last):
  File "aiserver.py", line 598, in <module>
    koboldai_vars = koboldai_settings.koboldai_vars(socketio)
  File "/kaggle/tmp/KoboldAI/koboldai_settings.py", line 99, in __init__
    self._system_settings = system_settings(socketio, self)
  File "/kaggle/tmp/KoboldAI/koboldai_settings.py", line 1309, in __init__
    import gptq
  File "/kaggle/tmp/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/gptq/__init__.py", line 1, in <module>
    from . import gptj, gptneox, llama, opt, offload
  File "/kaggle/tmp/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/gptq/gptj.py", line 9, in <module>
    from .gptq import GPTQ
  File "/kaggle/tmp/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/gptq/gptq.py", line 8, in <module>
    from .modelutils import GPTQVERSION
  File "/kaggle/tmp/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/gptq/modelutils.py", line 9, in <module>
    GPTQVERSION = int(os.environ.get("GPTQVERSION", 1))
NameError: name 'os' is not defined
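This looks like a missing import inside the gptq package rather than a KoboldAI bug. A minimal sketch of the likely fix, assuming the traceback above is accurate: gptq/modelutils.py reads os.environ at module level without importing os first.

# Hypothetical top of gptq/modelutils.py; adding the import should resolve the NameError.
import os

# Read the GPTQ version from the environment, defaulting to 1 (as in the traceback above).
GPTQVERSION = int(os.environ.get("GPTQVERSION", 1))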

ModuleNotFoundError when starting "play.bat"

Runtime launching in B: drive mode
Traceback (most recent call last):
  File "aiserver.py", line 26, in <module>
    from ansi2html import Ansi2HTMLConverter
ModuleNotFoundError: No module named 'ansi2html'
I am on a Windows system. I have tried everything I could find and done several reinstalls, and it's still the same. Help.
I can't use "install_requirements.sh" to install; I'm using "install_requirements.bat".

ImportError when running "play.sh"

Hello! I seem to be running into an import error with the latest version of the code. When installing Kobold via install_requirements.sh and running play.sh, it gives me a message that ansi2html cannot be found. After I used commandline.sh to hop into the environment and pip install it manually, I got another importerror mentioning that transformers cannot be found. It looks like the Python packages aren't being installed into the environment, and I'm wondering if this has something to do with the most recent change to install_requirements.sh. This approach worked with previous versions of the repo (as of last running it a couple of days ago).

How can I uninstall?

hello,

I don't know if the application is included in a conda environment.
If not, how can I uninstall it properly so that I can reinstall it in a conda environment of its own?

thanks

AMD install out of date?

I was running through the AMD install instructions, and the repos/gptq folder is missing; setup_cuda.py is also missing.

Are the AMD install instructions out of date?

1 token generation in story mode

I use WizardLM-7B-uncensored-GPTQ, pygmalion-7b-4bit-128g-cuda, pygmalion-13b-4bit-128g and PygmalionCoT-7b. All are based on LLaMA, and I have the same problem with all of them:

Every 3-5 generations, only 1 token is generated.

When that happens, pressing "Submit" several times sometimes produces output, but that doesn't help every time, and I then need to change something in the chat history before it generates more than 1 token. Sometimes a simple space at the end helps, often a new line helps. But sometimes the AI generates complete garbage, directly reusing something from the beginning of the context to generate its own genre tag or author's note tag instead of continuing the context.

When I switch to another LLaMA-based model at a point in the story where only 1 token was being generated, the other model also generates only 1 token. If I use Pygmalion-6b-4bit-128g, a model that is not based on LLaMA, it generates normally. So it looks like the problem only affects LLaMA-based models.

I have had this problem for a long time now and have already done a complete fresh KAI installation. Nothing has helped so far. I use KAI locally under Win10.

[Regression] Can't participate in horde with `exllama` branch, stopping sharing breaks processing

Summary

Probably due to the switch to AI-Horde-Worker instead of KoboldAI-Horde-Worker, I can no longer participate in the Horde. The console outputs a stream of messages (see Observed Results below).

Environment

Linux
Any model loaded with ExLlama (splitting it across two GPUs in my case)

Steps to Reproduce

  1. git clone https://github.com/0cc4m/KoboldAI.git
  2. cd KoboldAI
  3. git checkout exllama
  4. ./play.sh --host
  5. In the UI go to settings, name the Horde Worker and set the Horde API Key
  6. Go to Home, Load Model, select the model from a directory (airoboros-l2-70b-gpt4-2.0 in my case)
  7. Pick ExLlama from the loader dropdown, split 35/45, context 4096, click Load
  8. When loading is complete, check "Share with Horde" in the slider

Observed Results

The console outputs:

INFO       | 2023-09-01 06:22:23.578472 | worker.jobs.poppers:report_skipped_info:85 - Server https://horde.koboldai.net has no valid generations for us to do.
INFO       | 2023-09-01 06:22:25.999475 | worker.jobs.poppers:report_skipped_info:85 - Server https://horde.koboldai.net has no valid generations for us to do.
INFO       | 2023-09-01 06:22:28.183659 | worker.jobs.poppers:report_skipped_info:85 - Server https://horde.koboldai.net has no valid generations for us to do.
INFO       | 2023-09-01 06:22:29.786495 | worker.jobs.poppers:report_skipped_info:85 - Server https://horde.koboldai.net has no valid generations for us to do.

Going to lite.koboldai.net and manually selecting the worker and submitting a job doesn't make it process it. Previously I'd get jobs within one second of clicking "share" as well.

Additionally, unchecking "share with horde" in the UI now results in a red error popup that says:

Error at koboldai.js:3236
Uncaught TypeError: Cannot use 'in' operator to search for 'status' in undefined
--
Please report this error to the developers.

The console prints out:

  File "/home/user/Programs/KoboldAI/AI-Horde-Worker/worker/workers/framework.py", line 58, in stop
    self.ui_class.stop()
    │    └ None
    └ <AI-Horde-Worker.worker.workers.scribe.ScribeWorker object at 0x7f9c6e4d1100>

AttributeError: 'NoneType' object has no attribute 'stop'

At this point the model can't be shared again nor can a new model be loaded. The backend seems locked and has to be restarted.
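A minimal guard sketch of the kind of change that would avoid this crash (not the actual AI-Horde-Worker code; the class name below is only illustrative): stop() fails because ui_class is None when the worker runs without its own terminal UI, so the call needs to be skipped in that case.

class WorkerFrameworkSketch:
    def __init__(self, ui_class=None):
        # ui_class is None when the worker runs headless, which is the KoboldAI case here
        self.ui_class = ui_class
        self.should_stop = False

    def stop(self):
        self.should_stop = True
        # Guard avoids AttributeError: 'NoneType' object has no attribute 'stop'
        if self.ui_class is not None:
            self.ui_class.stop()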

Expected Results

The model is shared; it participates in the horde with jobs being processed and sent off; sharing can be stopped; and a new model can be loaded and started.

pt not found

At the start of def prepare_4bit_load in aiserver.py, it expects the pt file name to start with 4bit*, while most models have -4bit- in the middle of the name, so it may be better to replace

paths_4bit = ["4bit*.safetensors", "4bit*.pt"]

with

paths_4bit = ["*4bit*.safetensors", "*4bit*.pt"]

install_requirements error libmamba

install_requirements.bat runs until this point and then aborts the installation.

Installing pip packages: flask-cloudflared==0.0.10, flask-ngrok, flask-cors, lupa==1.10, transformers==4.28.0, datasets, huggingface_hub==0.12.1, safetensors, accelerate==0.18.0, git+https://github.com/VE-FORBRYDERNE/mkultra, flask-session, python-socketio[client], ansi2html, flask_compress, ijson, bitsandbytes, ftfy, pydub, diffusers, git+https://github.com/0cc4m/hf_bleeding_edge/, --find-links=https://0cc4m.github.io/KoboldAI/gptq-whl-links.html, gptq_koboldai==0.0.5, einops
error libmamba Subprocess call failed: Die Anforderung wird nicht unterstützt. [German: "The request is not supported."]
critical libmamba Subprocess call failed. Aborting.

Can't load 4bit models

Good afternoon, I have an old version where everything works.
35f908e147fcac121bdafaf7ca4b751d8091f480 unknown <scav@scav.(none)> 1681297611 +0300 clone: from https://github.com/0cc4m/KoboldAI
I'm trying to reinstall or do a clean install, but the models are not loading. Error log:

CUDA SETUP: CUDA runtime path found: D:\KoboldAI\KoboldAI\miniconda3\bin\cudart64_110.dll
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary D:\KoboldAI\KoboldAI\miniconda3\lib\site-packages\bitsandbytes\libbitsandbytes_cuda118.dll...
Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]
ERROR | modeling.inference_models.hf_torch:_get_model:403 - Lazyloader failed, falling back to stock HF load. You may run out of RAM here. Details:
ERROR | modeling.inference_models.hf_torch:_get_model:404 - [Errno 2] No such file or directory: 'D:\KoboldAI\KoboldAI\models\MetaIX_Guanaco-33B-4bit\pytorch_model-00001-of-00007.bin'
ERROR | modeling.inference_models.hf_torch:_get_model:405 - Traceback (most recent call last):
File "D:\KoboldAI\KoboldAI\miniconda3\lib\site-packages\transformers\modeling_utils.py", line 463, in load_state_dict
return torch.load(checkpoint_file, map_location="cpu")
File "D:\KoboldAI\KoboldAI\modeling\lazy_loader.py", line 523, in torch_load
model_dict = old_torch_load(
File "D:\KoboldAI\KoboldAI\miniconda3\lib\site-packages\torch\serialization.py", line 791, in load
with _open_file_like(f, 'rb') as opened_file:
File "D:\KoboldAI\KoboldAI\miniconda3\lib\site-packages\torch\serialization.py", line 271, in _open_file_like
return _open_file(name_or_buffer, mode)
File "D:\KoboldAI\KoboldAI\miniconda3\lib\site-packages\torch\serialization.py", line 252, in init
super().init(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'D:\KoboldAI\KoboldAI\models\MetaIX_Guanaco-33B-4bit\pytorch_model-00001-of-00007.bin'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "D:\KoboldAI\KoboldAI\modeling\inference_models\hf_torch.py", line 393, in _get_model
model = AutoModelForCausalLM.from_pretrained(
File "D:\KoboldAI\KoboldAI\miniconda3\lib\site-packages\transformers\models\auto\auto_factory.py", line 484, in from_pretrained
return model_class.from_pretrained(
File "D:\KoboldAI\KoboldAI\modeling\patches.py", line 92, in new_from_pretrained
return old_from_pretrained(
File "D:\KoboldAI\KoboldAI\miniconda3\lib\site-packages\transformers\modeling_utils.py", line 2881, in from_pretrained
) = cls._load_pretrained_model(
File "D:\KoboldAI\KoboldAI\miniconda3\lib\site-packages\transformers\modeling_utils.py", line 3214, in _load_pretrained_model
state_dict = load_state_dict(shard_file)
File "D:\KoboldAI\KoboldAI\miniconda3\lib\site-packages\transformers\modeling_utils.py", line 466, in load_state_dict
with open(checkpoint_file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'D:\KoboldAI\KoboldAI\models\MetaIX_Guanaco-33B-4bit\pytorch_model-00001-of-00007.bin'

INFO | modeling.inference_models.hf_torch:_get_model:406 - Falling back to stock HF load...
Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]
WARNING | modeling.inference_models.hf_torch:_get_model:439 - Fell back to GPT2LMHeadModel due to [Errno 2] No such file or directory: 'D:\KoboldAI\KoboldAI\models\MetaIX_Guanaco-33B-4bit\pytorch_model-00001-of-00007.bin'
You are using a model of type llama to instantiate a model of type gpt2. This is not supported for all configurations of models and can yield errors.
You are loading your model in 8bit or 4bit but no linear modules were found in your model. this can happen for some architectures such as gpt2 that uses Conv1D instead of Linear layers. Please double check your model architecture, or submit an issue on github if you think this is a bug.

Can't split 4bit model between gpu/cpu, and can't run only on cpu

I am able to load 4bit GPTQ models all the way up to 30/33b on just my GPU (4090) just fine. However, when attempting to load a 60b model solely on the CPU (turning both sliders in the load dialog to 0), I get an "IndexError: list index out of range" and it won't even attempt to load. I have 128GB of RAM. When I try to split between CPU/GPU (setting GPU preload to any value; I tried 55 and 10 with the same results), it loads OK but errors during inference saying "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!". What am I missing? Do 4bit GPTQ models only run on GPU?
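For reference, a tiny standalone illustration (not KoboldAI code) of the second error: mixing a CUDA tensor and a CPU tensor in one operation raises exactly this "same device" RuntimeError, which is what happens when some layers end up on the GPU and others on the CPU without working offload support.

import torch

# Reproduce the device-mismatch error in isolation (requires a CUDA-capable GPU).
if torch.cuda.is_available():
    a = torch.ones(2, 2, device="cuda:0")
    b = torch.ones(2, 2)  # stays on the CPU
    try:
        _ = a @ b
    except RuntimeError as e:
        print(e)  # "Expected all tensors to be on the same device, ... cuda:0 and cpu!"
else:
    print("CUDA not available; the mismatch cannot be reproduced on this machine.")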

ModuleNotFoundError: No module named 'gptq.bigcode'

I updated the latestmerge branch of gptq-for-llama and the latestgptq branch of KoboldAI, and now KoboldAI gives me this error:

Traceback (most recent call last):
  File "aiserver.py", line 604, in <module>
    from modeling.inference_models.hf_torch_4bit import load_model_gptq_settings
  File "/mnt/Storage/KoboldAI/modeling/inference_models/hf_torch_4bit.py", line 37, in <module>
    from gptq.bigcode import load_quant as bigcode_load_quant
ModuleNotFoundError: No module named 'gptq.bigcode'

I also installed the latest hf_bleeding_edge.

I cannot load any AI models and I keep getting this error no matter what I do. This happened after I ran "git pull" from this repository.

Exception in thread Thread-14:
Traceback (most recent call last):
File "B:\python\lib\threading.py", line 932, in _bootstrap_inner
self.run()
File "B:\python\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "B:\python\lib\site-packages\socketio\server.py", line 731, in _handle_event_internal
r = server._trigger_event(data[0], namespace, sid, *data[1:])
File "B:\python\lib\site-packages\socketio\server.py", line 756, in trigger_event
return self.handlers[namespace]event
File "B:\python\lib\site-packages\flask_socketio_init
.py", line 282, in _handler
return self.handle_event(handler, message, namespace, sid,
File "B:\python\lib\site-packages\flask_socketio_init
.py", line 828, in _handle_event
ret = handler(*args)
File "aiserver.py", line 615, in g
return f(*a, **k)
File "aiserver.py", line 3191, in get_message
load_model(use_gpu=msg['use_gpu'], gpu_layers=msg['gpu_layers'], disk_layers=msg['disk_layers'], online_model=msg['online_model'])
File "aiserver.py", line 1980, in load_model
model.load(
File "C:\KoboldAI\modeling\inference_model.py", line 177, in load
self._load(save_model=save_model, initial_load=initial_load)
File "C:\KoboldAI\modeling\inference_models\hf_torch_4bit.py", line 198, in _load
self.model = self._get_model(self.get_local_model_path(), tf_kwargs)
File "C:\KoboldAI\modeling\inference_models\hf_torch_4bit.py", line 378, in _get_model
model = load_quant_offload(llama_load_quant, utils.koboldai_vars.custmodpth, path_4bit, utils.koboldai_vars.gptq_bits, groupsize, self.gpu_layers_list, force_bias=v2_bias)
TypeError: load_quant_offload() got an unexpected keyword argument 'force_bias'
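This reads like a version mismatch between KoboldAI and the GPTQ-for-LLaMa submodule: the caller passes force_bias, but the installed load_quant_offload does not accept it. A version-tolerant call sketch (not the repo's actual fix; the helper name is made up for illustration):

import inspect

def call_load_quant_offload(load_quant_offload, *args, force_bias=False, **kwargs):
    # Only forward force_bias when the installed GPTQ-for-LLaMa build supports it.
    if "force_bias" in inspect.signature(load_quant_offload).parameters:
        kwargs["force_bias"] = force_bias
    return load_quant_offload(*args, **kwargs)

Updating the repos/gptq submodule to the revision this branch expects would address the same mismatch without any code change.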

I keep getting a merge conflict when trying to git pull from the newly updated 4bit-plugin dev branch

From https://github.com/0cc4m/KoboldAI

  • branch HEAD -> FETCH_HEAD
    Auto-merging README.md
    CONFLICT (content): Merge conflict in README.md
    Auto-merging commandline.bat
    CONFLICT (modify/delete): docs/gptq-whl-links.html deleted in HEAD and modified in e1127b9. Version e1127b9 of docs/gptq-whl-links.html left in tree.
    Auto-merging environments/huggingface.yml
    CONFLICT (content): Merge conflict in environments/huggingface.yml
    CONFLICT (modify/delete): modeling/inference_models/hf_torch_4bit.py deleted in HEAD and modified in e1127b9. Version e1127b9 of modeling/inference_models/hf_torch_4bit.py left in tree.
    Auto-merging play.bat
    Auto-merging update-koboldai.bat
    Automatic merge failed; fix conflicts and then commit the result.

I'm also getting an error when trying to run install_requirements.bat:

"
Empty environment created at prefix: B:\python
info libmamba ****************** Backtrace Start ******************
debug libmamba Loading configuration
trace libmamba Compute configurable 'create_base'
trace libmamba Compute configurable 'no_env'
trace libmamba Compute configurable 'no_rc'
trace libmamba Compute configurable 'rc_files'
trace libmamba Compute configurable 'root_prefix'
trace libmamba Get RC files configuration from locations up to HomeDir
trace libmamba Configuration not found at 'C:\Users\alpha\.mambarc'
trace libmamba Configuration not found at 'C:\Users\alpha\.condarc'
trace libmamba Configuration not found at 'C:\Users\alpha\.conda\condarc.d'
trace libmamba Configuration not found at 'C:\Users\alpha\.conda\condarc'
trace libmamba Configuration not found at 'C:\Users\alpha\.conda\.condarc'
trace libmamba Configuration not found at 'B:\python\.mambarc'
trace libmamba Configuration not found at 'B:\python\condarc.d'
trace libmamba Configuration not found at 'B:\python\condarc'
trace libmamba Configuration not found at 'B:\python\.condarc'
trace libmamba Configuration not found at 'C:\ProgramData\conda\.mambarc'
trace libmamba Configuration not found at 'C:\ProgramData\conda\condarc.d'
trace libmamba Configuration not found at 'C:\ProgramData\conda\condarc'
trace libmamba Configuration not found at 'C:\ProgramData\conda\.condarc'
trace libmamba Update configurable 'no_env'
trace libmamba Compute configurable 'file_specs'
error libmamba YAML error in spec file 'C:\KoboldAI\environments\huggingface.yml'
critical libmamba yaml-cpp: error at line 54, column 2: unexpected character in block scalar
info libmamba ****************** Backtrace End ********************

                                       __
      __  ______ ___  ____ _____ ___  / /_  ____ _
     / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
    / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
   / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
  /_/

Collect information..
Cleaning index cache..
Cleaning lock files..
Cleaning tarballs..
Cleaning packages..
The system cannot find the file specified.
Press any key to continue . . ."

Flask Error

Flask error. The fix is already pushed on the main branch; the dependency needs to be updated to Flask 2.2.3 in:

environments/huggingface.yml

environments/rocm.yml

No 4-bit toggle

Hi!
I installed Pygmalion with this guide and I don't have a 4-bit toggle. Is it a bug? What did I do wrong?

Cannot find the path specified & No module named 'hf_bleeding_edge' when trying to start.

Hello, this is my first time trying out this branch. Normal KoboldAI works and so does oobabooga, but with this one I'm getting some errors that I have no idea about, nor how to fix.

Everything has been installed as per usual, Windows 10 user.

This is the error I get when opening "play.bat" to start the server:

The system cannot find the file specified.
Runtime launching in B: drive mode
The system cannot find the path specified.
Colab Check: False, TPU: False
Traceback (most recent call last):
  File "C:\Users\TheFairyMan\OneDrive\Documentos\AI-DIFFUSIONS\KoboldAI\aiserver.py", line 604, in <module>
    from modeling.inference_models.hf_torch_4bit import load_model_gptq_settings
  File "C:\Users\TheFairyMan\OneDrive\Documentos\AI-DIFFUSIONS\KoboldAI\modeling\inference_models\hf_torch_4bit.py", line 12, in <module>
    from transformers import GPTNeoForCausalLM, AutoTokenizer, LlamaTokenizer
ImportError: cannot import name 'LlamaTokenizer' from 'transformers' (C:\Users\TheFairyMan\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\__init__.py)

And then it never opens and just exits. I have no idea which path it's referring to, and no idea what to do about the LlamaTokenizer problem.

Error on start

Hey!

I followed the instructions and set up the model from here in the models directory like this, but am getting the following error when loading the model after starting the application with play.bat:
[screenshots attached]

I'm guessing I did something wrong, because I also don't see any way to switch to 4-bit in the UI, though I downloaded the files directly from the repo (twice).

Any thoughts?
System is running Windows 10, GPU is an RTX 4090.

Thanks for making this repo! I'm really excited to dive in.

anaconda3/lib/python3.9/runpy.py:127: RuntimeWarning: 'gptq.bigcode' found in sys.modules after import of package 'gptq', but prior to execution of 'gptq.bigcode'; this may result in unpredictable behaviour

I was going to post this on the gptq-for-llama repo, but it seems issues are turned off there.

Running: python -m gptq.bigcode
Results in:

anaconda3/lib/python3.9/runpy.py:127: RuntimeWarning: 'gptq.bigcode' found in sys.modules after import of package 'gptq', but prior to execution of 'gptq.bigcode'; this may result in unpredictable behaviour

The comments on this may be helpful? https://stackoverflow.com/questions/42364598/setting-up-imports-and-init-py-in-a-multi-module-project
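For what it's worth, a self-contained reproduction sketch of the warning (illustrative package layout, not the real gptq package): when a package's __init__.py imports a submodule, running python -m package.submodule finds that submodule already in sys.modules before runpy executes it, which is exactly this RuntimeWarning.

import os, subprocess, sys, tempfile

tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "demo_pkg")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("from . import mod\n")   # same pattern as gptq/__init__.py importing its submodules
with open(os.path.join(pkg, "mod.py"), "w") as f:
    f.write("print('mod executed')\n")

# Emits: RuntimeWarning: 'demo_pkg.mod' found in sys.modules after import of package 'demo_pkg' ...
subprocess.run([sys.executable, "-W", "default", "-m", "demo_pkg.mod"], cwd=tmp)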

Significant Speed Regression on P40 compared to United

LatestGPTQ Branch: [screenshot]

United Branch: [screenshot]
Test done with same model, same token context.
The generation speed seems unaffected but the united implementation seems to take a lot longer to "process" the tokens.
Until this issue is fixed I will stay on latestGPTQ.

Can't load 4bit models on Rocm

Whenever I try to load 4bit models I receive this message. I'm using the latest version of the code and can load normal models just fine. I'm using a 6600 XT.
DEVICE ID | LAYERS | DEVICE NAME
0 | 32 | AMD Radeon RX 6600 XT
N/A | 0 | (Disk cache)
N/A | 0 | (CPU)
INFO | modeling.inference_models.hf_torch_4bit:_get_model:372 - Using GPTQ file: /home/bartosz/KoboldAI/models/Pygmalion-7b-4bit-GPTQ-Safetensors/4bit.safetensors, 4-bit model, type llama, version 1, groupsize -1
ERROR | main:g:615 - An error has been caught in function 'g', process 'MainProcess' (47821), thread 'MainThread' (140211034212160):
Traceback (most recent call last):

File "/home/bartosz/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/eventlet/green/thread.py", line 43, in __thread_body
func(*args, **kwargs)
│ │ └ {}
│ └ ()
└ <bound method Thread._bootstrap of <Thread(Thread-30, started daemon 140203333914496)>>
File "/home/bartosz/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/threading.py", line 890, in _bootstrap
self._bootstrap_inner()
│ └ <function start_new_thread.<locals>.wrap_bootstrap_inner at 0x7f83a1e5a790>
└ <Thread(Thread-30, started daemon 140203333914496)>
File "/home/bartosz/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/eventlet/green/thread.py", line 64, in wrap_bootstrap_inner
bootstrap_inner()
└ <bound method Thread._bootstrap_inner of <Thread(Thread-30, started daemon 140203333914496)>>
File "/home/bartosz/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
│ └ <function Thread.run at 0x7f856a182a60>
└ <Thread(Thread-30, started daemon 140203333914496)>
File "/home/bartosz/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
│ │ │ │ │ └ {}
│ │ │ │ └ <Thread(Thread-30, started daemon 140203333914496)>
│ │ │ └ (<socketio.server.Server object at 0x7f83a3cc0280>, 'Sr6GSnjVaPX_RH0qAAAD', 'GyLyF9O7Lgzw26FwAAAC', ['load_model', {'model': ...
│ │ └ <Thread(Thread-30, started daemon 140203333914496)>
│ └ <bound method Server._handle_event_internal of <socketio.server.Server object at 0x7f83a3cc0280>>
└ <Thread(Thread-30, started daemon 140203333914496)>
File "/home/bartosz/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/socketio/server.py", line 731, in _handle_event_internal
r = server._trigger_event(data[0], namespace, sid, *data[1:])
│ │ │ │ │ └ ['load_model', {'model': 'NeoCustom', 'path': '/home/bartosz/KoboldAI/models/Pygmalion-7b-4bit-GPTQ-Safetensors', 'use_gpu': ...
│ │ │ │ └ 'Sr6GSnjVaPX_RH0qAAAD'
│ │ │ └ '/'
│ │ └ ['load_model', {'model': 'NeoCustom', 'path': '/home/bartosz/KoboldAI/models/Pygmalion-7b-4bit-GPTQ-Safetensors', 'use_gpu': ...
│ └ <function Server._trigger_event at 0x7f840f0c54c0>
└ <socketio.server.Server object at 0x7f83a3cc0280>
File "/home/bartosz/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/socketio/server.py", line 756, in _trigger_event
return self.handlers[namespace][event](*args)
│ │ │ │ └ ('Sr6GSnjVaPX_RH0qAAAD', {'model': 'NeoCustom', 'path': '/home/bartosz/KoboldAI/models/Pygmalion-7b-4bit-GPTQ-Safetensors', '...
│ │ │ └ 'load_model'
│ │ └ '/'
│ └ {'/': {'get_model_info': <function get_model_info at 0x7f83a2a9e430>, 'OAI_Key_Update': <function get_oai_models at 0x7f83a2a...
└ <socketio.server.Server object at 0x7f83a3cc0280>
File "/home/bartosz/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/flask_socketio/init.py", line 282, in _handler
return self._handle_event(handler, message, namespace, sid,
│ │ │ │ │ └ 'Sr6GSnjVaPX_RH0qAAAD'
│ │ │ │ └ '/'
│ │ │ └ 'load_model'
│ │ └ <function UI_2_load_model at 0x7f83a270c670>
│ └ <function SocketIO._handle_event at 0x7f83a4f458b0>
└ <flask_socketio.SocketIO object at 0x7f83a3cc02b0>
File "/home/bartosz/KoboldAI/runtime/envs/koboldai-rocm/lib/python3.8/site-packages/flask_socketio/init.py", line 828, in _handle_event
ret = handler(*args)
│ └ ({'model': 'NeoCustom', 'path': '/home/bartosz/KoboldAI/models/Pygmalion-7b-4bit-GPTQ-Safetensors', 'use_gpu': True, 'key': '...
└ <function UI_2_load_model at 0x7f83a270c670>

File "aiserver.py", line 615, in g
return f(*a, **k)
│ │ └ {}
│ └ ({'model': 'NeoCustom', 'path': '/home/bartosz/KoboldAI/models/Pygmalion-7b-4bit-GPTQ-Safetensors', 'use_gpu': True, 'key': '...
└ <function UI_2_load_model at 0x7f83a270c3a0>

File "aiserver.py", line 6493, in UI_2_load_model
load_model(use_gpu=data['use_gpu'], gpu_layers=data['gpu_layers'], disk_layers=data['disk_layers'], online_model=data['online_model'], url=koboldai_vars.colaburl, use_8_bit=data['use_8_bit'])
│ │ │ │ │ │ └ {'model': 'NeoCustom', 'path': '/home/bartosz/KoboldAI/models/Pygmalion-7b-4bit-GPTQ-Safetensors', 'use_gpu': True, 'key': ''...
│ │ │ │ │ └ <koboldai_settings.koboldai_vars object at 0x7f83a33fe8e0>
│ │ │ │ └ {'model': 'NeoCustom', 'path': '/home/bartosz/KoboldAI/models/Pygmalion-7b-4bit-GPTQ-Safetensors', 'use_gpu': True, 'key': ''...
│ │ │ └ {'model': 'NeoCustom', 'path': '/home/bartosz/KoboldAI/models/Pygmalion-7b-4bit-GPTQ-Safetensors', 'use_gpu': True, 'key': ''...
│ │ └ {'model': 'NeoCustom', 'path': '/home/bartosz/KoboldAI/models/Pygmalion-7b-4bit-GPTQ-Safetensors', 'use_gpu': True, 'key': ''...
│ └ {'model': 'NeoCustom', 'path': '/home/bartosz/KoboldAI/models/Pygmalion-7b-4bit-GPTQ-Safetensors', 'use_gpu': True, 'key': ''...
└ <function load_model at 0x7f83a2a9e8b0>

File "aiserver.py", line 1980, in load_model
model.load(
│ └ <function InferenceModel.load at 0x7f83a2b40f70>
└ <modeling.inference_models.hf_torch_4bit.HFTorch4BitInferenceModel object at 0x7f83a1f15bb0>

File "/home/bartosz/KoboldAI/modeling/inference_model.py", line 177, in load
self._load(save_model=save_model, initial_load=initial_load)
│ │ │ └ False
│ │ └ True
│ └ <function HFTorch4BitInferenceModel._load at 0x7f83a2b68940>
└ <modeling.inference_models.hf_torch_4bit.HFTorch4BitInferenceModel object at 0x7f83a1f15bb0>

File "/home/bartosz/KoboldAI/modeling/inference_models/hf_torch_4bit.py", line 198, in _load
self.model = self._get_model(self.get_local_model_path(), tf_kwargs)
│ │ │ │ │ │ └ {}
│ │ │ │ │ └ <function HFInferenceModel.get_local_model_path at 0x7f83a2b638b0>
│ │ │ │ └ <modeling.inference_models.hf_torch_4bit.HFTorch4BitInferenceModel object at 0x7f83a1f15bb0>
│ │ │ └ <function HFTorch4BitInferenceModel._get_model at 0x7f83a2b689d0>
│ │ └ <modeling.inference_models.hf_torch_4bit.HFTorch4BitInferenceModel object at 0x7f83a1f15bb0>
│ └ None
└ <modeling.inference_models.hf_torch_4bit.HFTorch4BitInferenceModel object at 0x7f83a1f15bb0>

File "/home/bartosz/KoboldAI/modeling/inference_models/hf_torch_4bit.py", line 378, in _get_model
model = load_quant_offload(llama_load_quant, utils.koboldai_vars.custmodpth, path_4bit, utils.koboldai_vars.gptq_bits, groupsize, self.gpu_layers_list, force_bias=v2_bias)
│ │ │ │ │ │ │ │ │ │ └ False
│ │ │ │ │ │ │ │ │ └ [32]
│ │ │ │ │ │ │ │ └ <modeling.inference_models.hf_torch_4bit.HFTorch4BitInferenceModel object at 0x7f83a1f15bb0>
│ │ │ │ │ │ │ └ -1
│ │ │ │ │ │ └ <koboldai_settings.koboldai_vars object at 0x7f83a33fe8e0>
│ │ │ │ │ └ <module 'utils' from '/home/bartosz/KoboldAI/utils.py'>
│ │ │ │ └ '/home/bartosz/KoboldAI/models/Pygmalion-7b-4bit-GPTQ-Safetensors/4bit.safetensors'
│ │ │ └ <koboldai_settings.koboldai_vars object at 0x7f83a33fe8e0>
│ │ └ <module 'utils' from '/home/bartosz/KoboldAI/utils.py'>
│ └ <function load_quant at 0x7f83a2f339d0>
└ <function load_quant_offload at 0x7f83a2e5ea60>

TypeError: load_quant_offload() got an unexpected keyword argument 'force_bias'


./play-rocm.sh gptq error fedora 39

./play-rocm.sh
Colab Check: False, TPU: False
Traceback (most recent call last):
File "aiserver.py", line 604, in
from modeling.inference_models.hf_torch_4bit import load_model_gptq_settings
File "/home/madrugada/KoboldAI/modeling/inference_models/hf_torch_4bit.py", line 31, in
import gptq
ModuleNotFoundError: No module named 'gptq'

Support for MythoMax-L2-13B-GPTQ

I couldn't get this model to run, but it would be nice if it were possible, as I prefer KoboldAI over oobabooga.
MythoMax-L2-13B has a 4K token context, and the GPTQ version can be run with around 8-10 GB of VRAM, so it's fairly easy to run; it produces long responses and is meant for roleplaying / storywriting.

Alternatively, a tutorial on how to run this with KoboldAI would work for me too.

please add code for landmark attention to 4bit-plugin

You can find it here: https://github.com/eugenepentland/landmark-attention-qlora.git

I tried to use the 4bit-plugin branch and add it myself, but I hardly know Python. I made the changes to match what it seemed you were doing (in that other repo you have; I forget the name of it right now), and it built the modules, but I can't get them to work. It also needs exllama from outside the tree, I think, and I'm not sure what to do with that either.

If you don't have time to do it yourself, please briefly explain what I have to do (assuming the branch is actually working, heh).

Thanks.

Exllama in KoboldAI emits a spurious space at the beginning of generations that end with a stop token.

To reproduce, use this prompt:

### Instruction:
Generate a html image element for an example png.
### Response:
<img src="https://example.com/example-image-name.png

Set output length to 5, Temperature to 0.6, TopP to 0.9 and TopK to 10 (other settings will work too)
Generate several times. Sometimes the output will look like
<img src="https://example.com/example-image-name.png" alt="Example Image
and sometimes like
<img src="https://example.com/example-image-name.png " />
(note the spurious space before the terminating quote).

I added logging of the generated tokens as well as the raw decoded output. Looking at the log, the first generated token is ["] (id 29908). However, in the context, the token turns into [ "] (id 376). See the attached screenshot: extra_space_context

After some additional debugging, the spurious space is consistently inserted when the generated output ends with a stop token [</s>] (token id 2).

Example:

Selected: 29908 ["]
Selected: 5272 [ alt]
...
Selected: 2 [</s>]
INFO       | modeling.inference_model:raw_generate:603 - Generated 5 tokens in 4.46 seconds, for an average rate of 1.12 tokens per second.
GENERATION @ 2023-08-18 10:55:26 |  " alt="" /></s>
GENERATION @ 2023-08-18 10:55:26 |  " alt="" />

Versus:

Selected: 29908 ["]
Selected: 5272 [ alt]
...
Selected: 7084 [ Image]
INFO       | modeling.inference_model:raw_generate:603 - Generated 5 tokens in 1.23 seconds, for an average rate of 4.07 tokens per second.
GENERATION @ 2023-08-18 11:34:06 | " alt="Example Image
GENERATION @ 2023-08-18 11:34:06 | " alt="Example Image
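To narrow down whether the extra space comes from the tokenizer's decode itself or from KoboldAI's context handling, a small standalone check could look like this (the token IDs are taken from the log above; the tokenizer path is a placeholder for whatever Llama tokenizer the model ships with):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/llama-model")  # placeholder path

ends_with_eos = [29908, 5272, 2]  # ["], [ alt], [</s>]  (IDs from the log above)
ends_with_text = [29908, 5272]    # ["], [ alt]

# If only the eos-terminated sequence decodes with a leading space, the tokenizer's
# handling of the stop token is the culprit; otherwise the space is added downstream.
print(repr(tok.decode(ends_with_eos)))
print(repr(tok.decode(ends_with_text)))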

ImportError: cannot import name 'url_quote' from 'werkzeug.urls'

Trying to do a fresh install after deleting my old copy. Here is the error:

The system cannot find the file specified.
Runtime launching in B: drive mode
Traceback (most recent call last):
  File "aiserver.py", line 67, in <module>
    from utils import debounce
  File "D:\KoboldAI\utils.py", line 192, in <module>
    from flask_socketio import emit
  File "B:\python\lib\site-packages\flask_socketio\__init__.py", line 18, in <module>
    import flask
  File "B:\python\lib\site-packages\flask\__init__.py", line 5, in <module>
    from .app import Flask as Flask
  File "B:\python\lib\site-packages\flask\app.py", line 30, in <module>
    from werkzeug.urls import url_quote
ImportError: cannot import name 'url_quote' from 'werkzeug.urls' (B:\python\lib\site-packages\werkzeug\urls.py)

Steps to reproduce are the same as the installation instructions on the readme (Windows 10, Nvidia GPU, B: drive mode):

  1. git clone https://github.com/0cc4m/KoboldAI -b latestgptq --recurse-submodules
  2. cd KoboldAI
  3. install_requirements.bat
  4. Launch 'play.bat'

Might be an issue with the September 30th release of Werkzeug 3.0.0? Not sure. Looks like this pull request confirms it though.
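A quick way to confirm the version combination, as a sketch: Werkzeug 3.0.0 removed url_quote from werkzeug.urls, while older Flask releases still import it, which matches the Flask update mentioned in the "Flask Error" issue above.

import importlib.metadata as md

# Probe the installed versions; the mismatch is an old Flask together with Werkzeug >= 3.0.
for pkg in ("flask", "werkzeug"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")

# Pinning werkzeug below 3.0 in the environment files, or updating Flask,
# are the usual workarounds for this import error.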

Failed to load 4bit-128g WizardLM 7B

Not sure if this is meant to work at present, but I got a RuntimeError: Internal: src/sentencepiece_processor.cc(1102) [model_proto->ParseFromArray(serialized.data(), serialized.size())] when loading https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GPTQ/, after cloning it and doing ln -s https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GPTQ/blob/main/WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors 4bit-128g.compat.no-act-order.safetensors.

This is with:

$ git show HEAD
commit 0a7113f99780abb15e9a058a7a8501767e54940a (HEAD -> latestgptq, origin/latestgptq, origin/HEAD)
Merge: b8e8b0f 530e204
Author: 0cc4m <[email protected]>
Date:   Wed May 24 06:32:54 2023 +0200

    Merge pull request #35 from YellowRoseCx/patch-1
    
    Update README.md to GPTQ-KoboldAI 0.0.5

$ git remote -v
origin	https://github.com/0cc4m/KoboldAI (fetch)
origin	https://github.com/0cc4m/KoboldAI (push)
Colab Check: False, TPU: False
INFO       | __main__:general_startup:1312 - Running on Repo: https://github.com/0cc4m/KoboldAI Branch: latestgptq
INIT       | Starting   | Flask
INIT       | OK         | Flask
INIT       | Starting   | Webserver
INIT       | Starting   | LUA bridge
INIT       | OK         | LUA bridge
INIT       | Starting   | LUA Scripts
INIT       | OK         | LUA Scripts
Setting Seed
INIT       | OK         | Webserver
MESSAGE    | Webserver started! You may now connect with a browser at http://127.0.0.1:5000
Connection Attempt: 127.0.0.1
INFO       | __main__:do_connect:2805 - Client connected! UI_1
Connection Attempt: 127.0.0.1
INFO       | __main__:do_connect:2805 - Client connected! UI_1
ERROR      | koboldai_settings:__setattr__:1210 - __setattr__ just set model_selected to NeoCustom in koboldai_vars. That variable isn't defined!
INFO       | __main__:get_model_info:1513 - Selected: NeoCustom, /home/lb/GIT/KoboldAI/models/TheBloke_WizardLM-7B-uncensored-GPTQ
INIT       | Searching  | GPU support
INIT       | Found      | GPU support
INIT       | Starting   | Transformers
INIT       | Info       | Final device configuration:
       DEVICE ID  |  LAYERS  |  DEVICE NAME
   (primary)   0  |      32  |  NVIDIA GeForce RTX 3090
               1  |       0  |  Tesla P40
               2  |       0  |  Tesla P40
             N/A  |       0  |  (Disk cache)
             N/A  |       0  |  (CPU)
INFO       | modeling.inference_models.hf_torch_4bit:_get_model:371 - Using GPTQ file: /home/lb/GIT/KoboldAI/models/TheBloke_WizardLM-7B-uncensored-GPTQ/4bit-128g.safetensors, 4-bit model, type llama, version 2, groupsize 128
Loading model ...
Done.
Exception in thread Thread-18:
Traceback (most recent call last):
  File "/home/lb/GIT/KoboldAI/runtime/envs/koboldai/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/lb/GIT/KoboldAI/runtime/envs/koboldai/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/lb/GIT/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/socketio/server.py", line 731, in _handle_event_internal
    r = server._trigger_event(data[0], namespace, sid, *data[1:])
  File "/home/lb/GIT/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/socketio/server.py", line 756, in _trigger_event
    return self.handlers[namespace][event](*args)
  File "/home/lb/GIT/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/flask_socketio/__init__.py", line 282, in _handler
    return self._handle_event(handler, message, namespace, sid,
  File "/home/lb/GIT/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/flask_socketio/__init__.py", line 828, in _handle_event
    ret = handler(*args)
  File "aiserver.py", line 615, in g
    return f(*a, **k)
  File "aiserver.py", line 3191, in get_message
    load_model(use_gpu=msg['use_gpu'], gpu_layers=msg['gpu_layers'], disk_layers=msg['disk_layers'], online_model=msg['online_model'])
  File "aiserver.py", line 1980, in load_model
    model.load(
  File "/home/lb/GIT/KoboldAI/modeling/inference_model.py", line 177, in load
    self._load(save_model=save_model, initial_load=initial_load)
  File "/home/lb/GIT/KoboldAI/modeling/inference_models/hf_torch_4bit.py", line 199, in _load
    self.tokenizer = self._get_tokenizer(self.get_local_model_path())
  File "/home/lb/GIT/KoboldAI/modeling/inference_models/hf_torch_4bit.py", line 391, in _get_tokenizer
    tokenizer = LlamaTokenizer.from_pretrained(utils.koboldai_vars.custmodpth)
  File "aiserver.py", line 112, in new_pretrainedtokenizerbase_from_pretrained
    tokenizer = old_pretrainedtokenizerbase_from_pretrained(cls, *args, **kwargs)
  File "/home/lb/GIT/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1811, in from_pretrained
    return cls._from_pretrained(
  File "/home/lb/GIT/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1965, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/lb/GIT/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/transformers/models/llama/tokenization_llama.py", line 96, in __init__
    self.sp_model.Load(vocab_file)
  File "/home/lb/GIT/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/home/lb/GIT/KoboldAI/runtime/envs/koboldai/lib/python3.8/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1102) [model_proto->ParseFromArray(serialized.data(), serialized.size())]
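In cases like this, the SentencePiece error usually means tokenizer.model is not a real SentencePiece model file, for example an HTML page saved from the "blob" URL instead of the raw file, or a dangling symlink. A quick sanity-check sketch, with the path adjusted to your model directory:

import os

path = "models/TheBloke_WizardLM-7B-uncensored-GPTQ/tokenizer.model"  # adjust to your model dir

if not os.path.isfile(path):
    print("tokenizer.model is missing or a dangling symlink:", path)
else:
    print("size:", os.path.getsize(path), "bytes")
    with open(path, "rb") as f:
        head = f.read(16)
    # A real SentencePiece model is a binary protobuf; an accidentally saved web page
    # starts with b"<!DOCTYPE" or b"<html" instead.
    print("first bytes:", head)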

src/sentencepiece_processor.cc error when loading GPT4X model

RuntimeError: Internal: src/sentencepiece_processor.cc(1102) [model_proto->ParseFromArray(serialized.data(), serialized.size())]
sentencepiece_processor.cc(923) LOG(ERROR) src/sentencepiece_processor.cc(290) [model_] Model is not initialized.

This happened in 2 different runpod instances with different processors but the same .

Here's the repo. I tried with both model files after renaming accordingly. This was working as of early last week and is still working on an existing system, but it will not work in a brand-new environment. Can you please confirm whether this is an issue on your fork's side or on the model side? Happy to link you to my environment to help troubleshoot.

https://huggingface.co/MetaIX/GPT4-X-Alpaca-30B-Int4

Interface not loading... WSL/Windows

The default port 5000 does not seem to be loading. It loaded once, then froze after I clicked 'new interface'; now it refuses to load and Chrome just says it cannot find the address. How do I edit the default port? Running in WSL, trying to open in Chrome on Windows.

I got another error

Exception in thread Thread-25 (_handle_event_internal):
Traceback (most recent call last):
File "C:\ProgramData\anaconda3\lib\threading.py", line 1016, in _bootstrap_inner
self.run()
File "C:\ProgramData\anaconda3\lib\threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "C:\ProgramData\anaconda3\lib\site-packages\socketio\server.py", line 733, in _handle_event_internal
r = server._trigger_event(data[0], namespace, sid, *data[1:])
File "C:\ProgramData\anaconda3\lib\site-packages\socketio\server.py", line 758, in trigger_event
return self.handlers[namespace]event
File "C:\ProgramData\anaconda3\lib\site-packages\flask_socketio_init
.py", line 282, in _handler
return self.handle_event(handler, message, namespace, sid,
File "C:\ProgramData\anaconda3\lib\site-packages\flask_socketio_init
.py", line 828, in handle_event
ret = handler(*args)
File "D:\Project\KoboldAI\aiserver.py", line 466, in g
return f(*a, **k)
File "D:\Project\KoboldAI\aiserver.py", line 3915, in get_message
load_model(use_gpu=msg['use_gpu'], gpu_layers=msg['gpu_layers'], disk_layers=msg['disk_layers'], online_model=msg['online_model'])
File "D:\Project\KoboldAI\aiserver.py", line 2235, in load_model
model_config = AutoConfig.from_pretrained(vars.custmodpth.replace('/', '_'), revision=args.revision, cache_dir="cache")
File "C:\ProgramData\anaconda3\lib\site-packages\transformers\models\auto\configuration_auto.py", line 796, in from_pretrained
config_class = CONFIG_MAPPING[config_dict["model_type"]]
File "C:\ProgramData\anaconda3\lib\site-packages\transformers\models\auto\configuration_auto.py", line 503, in getitem
raise KeyError(key)
KeyError: 'llama'

Can't Generate With 4bit Quantized Model

I cloned latestgptq branch with --recurse-submodules flag.
git clone https://github.com/0cc4m/KoboldAI -b latestgptq --recurse-submodules
I quantized a model using the GPTQ code inside repos.
python llama.py models/test c4 --wbits 4 --true-sequential --act-order --save_safetensors models/test/4bit.safetensors
I can manually run the inference.
python repos/gptq/llama_inference.py models/test --wbits 4 --load models/test/4bit.safetensors --text "Once upon a time, "
It also looks like it loads fine with KoboldAI.

loading Model
INIT       | Searching  | GPU support
WARNING    | __main__:load_model:2882 - This model does not support hybrid generation. --breakmodel_gpulayers will be ignored.
INIT       | Found      | GPU support
INIT       | Starting   | Transformers
4-bit CPU offloader active
Using 4-bit file: /content/KoboldAI/models/test/4bit.safetensors, groupsize -1
Trying to load llama model in 4-bit
Loading model ...
/usr/local/lib/python3.10/dist-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:
/usr/local/lib/python3.10/dist-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/usr/local/lib/python3.10/dist-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = cls(wrap_storage=untyped_storage)
Done.
INFO       | __main__:load_model:3396 - Pipeline created: test
INIT       | Starting   | LUA bridge
INIT       | OK         | LUA bridge
INIT       | Starting   | LUA Scripts
INIT       | OK         | LUA Scripts
Setting Seed

However, I get an error when I try to submit text from KoboldAI and generate.

ERROR      | __main__:generate:6516 - Traceback (most recent call last):
  File "/content/KoboldAI/aiserver.py", line 6503, in generate
    genout, already_generated = tpool.execute(core_generate, txt, minimum, maximum, found_entries)
  File "/usr/local/lib/python3.10/dist-packages/eventlet/tpool.py", line 132, in execute
    six.reraise(c, e, tb)
  File "/usr/local/lib/python3.10/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.10/dist-packages/eventlet/tpool.py", line 86, in tworker
    rv = meth(*args, **kwargs)
  File "/content/KoboldAI/aiserver.py", line 5682, in core_generate
    result = raw_generate(
  File "/content/KoboldAI/aiserver.py", line 5910, in raw_generate
    batch_encoded = torch_raw_generate(
  File "/content/KoboldAI/aiserver.py", line 6006, in torch_raw_generate
    genout = generator(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1462, in generate
    return self.sample(
  File "/content/KoboldAI/aiserver.py", line 2456, in new_sample
    return new_sample.old_sample(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2478, in sample
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/KoboldAI/repos/gptq/offload.py", line 225, in llama_offload_forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: LlamaDecoderLayer.forward() got an unexpected keyword argument 'position_ids'

Slow speed for some models.

Hey,

I tried using this fork and I realized that the speed was really slow for some of the models I was using:
https://huggingface.co/reeducator/vicuna-13b-cocktail/tree/main

For vicuna-cocktail, for example, I get something like 2 tokens/s even though I easily reach 10 tokens/s on ooba's webui.
[screenshot]

Some other models (like raw llama 13b) give me 7 tokens/s, which is fine.

I guess this has to do with vicuna-cocktail not having been saved with the "save_pretrained" option? I don't know, just trying to guess there.

Anyway, if you could look at that and try to get "normal" speed in every situation, that would be cool.

Thanks in advance.

Can't Find 4Bit Model

I'm trying to load the llama-7b-4bit model.

git clone https://github.com/0cc4m/KoboldAI -b 4bit --recurse-submodules
more setup
cd repos
git clone https://github.com/0cc4m/GPTQ-for-LLaMa -b gptneox
cd GPTQ-for-LLaMa
python setup_cuda.py install
cd ..
cd ..

I put 4bit.pt, config.json, and tokenizer.model in models/.
Then I ran python aiserver.py
I can open the link, but I can't find the model to load. It just says to please load a model on the left, but it only shows "read only" under "1) model."
There's no error in the console either.
I downloaded debug, but there doesn't seem to be any error either.
Sorry, first time trying this, but what am I missing?

WinError 127 on nvfuser_codegen.dll

I have Windows 10.

I ran install_requirements.bat after deleting the previous environment, but it did not work. I re-installed everything, but I got the same error.

Attempting to pass model params to ExLlama on startup causes an AttributeError

Summary

It appears that self.model_config is None in ExLlama's class.py (https://github.com/0cc4m/KoboldAI/blob/exllama/modeling/inference_models/exllama/class.py#L423), and it is assumed to exist when you reach that code by passing in --model_parameters.

Additionally, play.sh has an issue with arguments that contain spaces, because it passes them via $* (which splits on IFS); spaces are exactly how the help output for --model_parameters tells you to format your JSON.

Steps to reproduce:

  1. Attempt to start with something like the following, which attempts to set model_parameters: ./play.sh --host --model airoboros-l2-70b-gpt4-2.0 --model_backend ExLlama --model_parameters "{'0_Layers':35,'1_Layers':45,'model_ctx':4096}"

Actual Behavior:

Output:

Colab Check: False, TPU: False
INIT       | OK         | KAI Horde Models

 ## Warning: this project requires Python 3.9 or higher.

INFO       | __main__:<module>:680 - We loaded the following model backends: 
Huggingface GPTQ
KoboldAI Old Colab Method
KoboldAI API
Huggingface
Horde
Read Only
OpenAI
ExLlama
Basic Huggingface
GooseAI
INFO       | __main__:general_startup:1395 - Running on Repo: https://github.com/0cc4m/KoboldAI.git Branch: exllama
MESSAGE    | Welcome to KoboldAI!
MESSAGE    | You have selected the following Model: airoboros-l2-70b-gpt4-2.0
ERROR      | __main__:<module>:10948 - An error has been caught in function '<module>', process 'MainProcess' (12311), thread 'MainThread' (140613286643520):
Traceback (most recent call last):

> File "aiserver.py", line 10948, in <module>
    run()
    └ <function run at 0x7fe259d69ca0>

  File "aiserver.py", line 10849, in run
    command_line_backend = general_startup()
                           └ <function general_startup at 0x7fe25a321dc0>

  File "aiserver.py", line 1634, in general_startup
    model_backends[args.model_backend].set_input_parameters(arg_parameters)
    │              │    │                                   └ {'0_Layers': 35, '1_Layers': 45, 'model_ctx': 4096, 'max_ctx': 2048, 'compress_emb': 1, 'ntk_alpha': 1, 'id': 'airoboros-l2-7...
    │              │    └ 'ExLlama'
    │              └ Namespace(apikey=None, aria2_port=None, cacheonly=False, colab=False, configname=None, cpu=False, customsettings=None, f=None...
    └ {'Huggingface GPTQ': <modeling.inference_models.gptq_hf_torch.class.model_backend object at 0x7fe22525b2b0>, 'KoboldAI Old Co...

  File "/home/.../KoboldAI-Client-llama/modeling/inference_models/exllama/class.py", line 423, in set_input_parameters
    self.model_config.device_map.layers = []
    │    └ None
    └ <modeling.inference_models.exllama.class.model_backend object at 0x7fe22b4f31c0>

AttributeError: 'NoneType' object has no attribute 'device_map'
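A sketch of the kind of guard that could avoid the crash (this is not the repo's code; the attribute used to defer the split is made up): when --model_parameters is parsed at startup, the ExLlama config has not been loaded yet, so model_config is still None and the device_map update has to be skipped or deferred.

def set_input_parameters(self, parameters):
    requested_layers = [parameters.get("0_Layers", 0), parameters.get("1_Layers", 0)]
    if getattr(self, "model_config", None) is not None:
        self.model_config.device_map.layers = requested_layers
    else:
        # Hypothetical: remember the split and apply it once the config exists during load.
        self._pending_layer_split = requested_layers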

Expected Behavior:

The model parameters can be set at startup

Environment:

$ git remote -v
origin  https://github.com/0cc4m/KoboldAI.git (fetch)
origin  https://github.com/0cc4m/KoboldAI.git (push)
$ git status
On branch exllama
Your branch is up to date with 'origin/exllama'

  commit 973aea12ea079e9c5de1e418b848a0407da7eab7 (HEAD -> exllama, origin/exllama)
  Author: 0cc4m <[email protected]>
  Date:   Sun Jul 23 22:07:34 2023 +0200
  
      Only import big python modules for GPTQ once they get used

Additionally, the following change should be made in play.sh:

$ git diff play.sh
  diff --git a/play.sh b/play.sh
  index 8ce7b781..3e88ae28 100755
  --- a/play.sh
  +++ b/play.sh
  @@ -3,4 +3,4 @@ export PYTHONNOUSERSITE=1
   if [ ! -f "runtime/envs/koboldai/bin/python" ]; then
   ./install_requirements.sh cuda
   fi
  -bin/micromamba run -r runtime -n koboldai python aiserver.py $*
  +bin/micromamba run -r runtime -n koboldai python aiserver.py "$@"

So that you can pass in JSON as the model params with spaces between the KV pairs, as the help parameter instructs you:

$ ./play.sh --host --model airoboros-l2-70b-gpt4-2.0 --model_backend ExLlama --model_parameters help
...

INFO | main:general_startup:1395 - Running on Repo: https://github.com/0cc4m/KoboldAI.git Branch: exllama
MESSAGE | Welcome to KoboldAI!
MESSAGE | You have selected the following Model: airoboros-l2-70b-gpt4-2.0
ERROR | main:general_startup:1627 - Please pass through the parameters as a json like "{'[ID]': '[Value]', '[ID2]': '[Value]'}" using --model_parameters (required parameters shown below)
ERROR | main:general_startup:1628 - Parameters (ID: Default Value (Help Text)): 0_Layers: [None] (The number of layers to put on NVIDIA GeForce RTX 3090.)
1_Layers: [0] (The number of layers to put on NVIDIA GeForce RTX 3090.)
max_ctx: 2048 (The maximum context size the model supports)
compress_emb: 1 (If the model requires compressed embeddings, set them here)
ntk_alpha: 1 (NTK alpha value)

