
nerogar / onetrainer


OneTrainer is a one-stop solution for all your stable diffusion training needs.

License: GNU Affero General Public License v3.0

Python 99.06% Batchfile 0.21% Shell 0.61% Dockerfile 0.11%
fine-tuning lora stable-diffusion training

onetrainer's People

Contributors

allenbenz, aplio, calamdor, captin411, conanak99, dougbtv, finfanfin, float-trip, hameerabbasi, heasterian, janca, lolzen, lshqqytiger, mx, nerogar, orcinus, prog0111, sirtrippsalot, theforgotten69, vladmandic, xirvian


onetrainer's Issues

Unable to create backup, which crashes training.

Looks like a problem with slashes. \ /

I can't be the only one? The program mixes \ and / in many places. Everything else runs, but once I hit a backup epoch, boom.

Creating Backup G:/workspace\backup\2023-11-21_20-33-05
Traceback (most recent call last):
File "G:\OneTrainer\modules\trainer\GenericTrainer.py", line 329, in backup
self.model_saver.save(
File "G:\OneTrainer\modules\modelSaver\StableDiffusionLoRAModelSaver.py", line 107, in save
self.__save_internal(model, output_model_destination)
File "G:\OneTrainer\modules\modelSaver\StableDiffusionLoRAModelSaver.py", line 68, in __save_internal
torch.save(model.optimizer.state_dict(), os.path.join(destination, "optimizer", "optimizer.pt"))
File "G:\OneTrainer\venv\lib\site-packages\torch\serialization.py", line 618, in save
with _open_zipfile_writer(f) as opened_zipfile:
File "G:\OneTrainer\venv\lib\site-packages\torch\serialization.py", line 492, in _open_zipfile_writer
return container(name_or_buffer)
File "G:\OneTrainer\venv\lib\site-packages\torch\serialization.py", line 463, in __init__
super().__init__(torch._C.PyTorchFileWriter(self.name))
RuntimeError: Parent directory G: does not exist.
Could not save backup. Check your disk space!
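
The log line above shows a destination that mixes both separators (G:/workspace\backup\...). A minimal sketch of one possible workaround follows, assuming the failure is related to the mixed-separator path or to a parent directory that has not been created yet (neither is confirmed in this thread). prepare_save_path is a hypothetical helper, not part of OneTrainer.

```python
import ntpath
import os
import tempfile


def normalize_windows_path(path: str) -> str:
    """Collapse mixed '/' and '\\' separators using Windows path rules."""
    # ntpath applies Windows semantics on any OS, which keeps this demo portable.
    return ntpath.normpath(path)


def prepare_save_path(destination: str, *parts: str) -> str:
    """Hypothetical helper: build a save path and create its parent directory."""
    full_path = os.path.normpath(os.path.join(destination, *parts))
    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    return full_path


print(normalize_windows_path("G:/workspace\\backup\\2023-11-21_20-33-05"))

# Demonstrate the directory creation in a temporary location.
with tempfile.TemporaryDirectory() as tmp:
    target = prepare_save_path(tmp, "optimizer", "optimizer.pt")
    print(os.path.isdir(os.path.dirname(target)))
```

Normalizing once, where the path is first read from the config, would avoid having to patch every os.path.join call site.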

Unable to start training for SDXL LoRA

Traceback (most recent call last):
File "/tools/OneTrainer/scripts/train.py", line 33, in <module>
main()
File "/tools/OneTrainer/scripts/train.py", line 20, in main
trainer.start()
File "/tools/OneTrainer/modules/trainer/GenericTrainer.py", line 130, in start
self.data_loader = self.create_data_loader(
File "/tools/OneTrainer/modules/trainer/BaseTrainer.py", line 57, in create_data_loader
return create.create_data_loader(
File "/tools/OneTrainer/modules/util/create.py", line 246, in create_data_loader
return StableDiffusionXLFineTuneDataLoader(train_device, temp_device, args, model, train_progress)
File "/tools/OneTrainer/modules/dataLoader/StableDiffusionXLFineTuneDataLoader.py", line 18, in __init__
super(StableDiffusionXLFineTuneDataLoader, self).__init__(
File "/tools/OneTrainer/modules/dataLoader/StableDiffusionXLBaseDataLoader.py", line 34, in __init__
self.__ds = self.create_dataset(
File "/tools/OneTrainer/modules/dataLoader/StableDiffusionXLBaseDataLoader.py", line 364, in create_dataset
preparation_modules = self._preparation_modules(args, model)
File "/tools/OneTrainer/modules/dataLoader/StableDiffusionXLBaseDataLoader.py", line 202, in _preparation_modules
downscale_mask = ScaleImage(in_name='mask', out_name='latent_mask', factor=0.125)
NameError: name 'ScaleImage' is not defined. Did you mean: 'rescale_image'?

Changing ScaleImage to Downscale gets me past this point, but then I run into this error:

Traceback (most recent call last):
File "/tools/OneTrainer/scripts/train.py", line 33, in <module>
main()
File "/tools/OneTrainer/scripts/train.py", line 20, in main
trainer.start()
File "/tools/OneTrainer/modules/trainer/GenericTrainer.py", line 130, in start
self.data_loader = self.create_data_loader(
File "/tools/OneTrainer/modules/trainer/BaseTrainer.py", line 57, in create_data_loader
return create.create_data_loader(
File "/tools/OneTrainer/modules/util/create.py", line 246, in create_data_loader
return StableDiffusionXLFineTuneDataLoader(train_device, temp_device, args, model, train_progress)
File "/tools/OneTrainer/modules/dataLoader/StableDiffusionXLFineTuneDataLoader.py", line 18, in __init__
super(StableDiffusionXLFineTuneDataLoader, self).__init__(
File "/tools/OneTrainer/modules/dataLoader/StableDiffusionXLBaseDataLoader.py", line 34, in __init__
self.__ds = self.create_dataset(
File "/tools/OneTrainer/modules/dataLoader/StableDiffusionXLBaseDataLoader.py", line 364, in create_dataset
preparation_modules = self._preparation_modules(args, model)
File "/tools/OneTrainer/modules/dataLoader/StableDiffusionXLBaseDataLoader.py", line 206, in _preparation_modules
encode_prompt_1 = EncodeClipText(in_name='tokens_1', hidden_state_out_name='text_encoder_1_hidden_state', pooled_out_name=None, add_layer_norm=False, text_encoder=model.text_encoder_1, hidden_state_output_index=-(2+args.text_encoder_layer_skip))
TypeError: EncodeClipText.__init__() got an unexpected keyword argument 'add_layer_norm'

This occurs when attempting to train with or without touching TE1/2.
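
Both errors would be consistent with the OneTrainer code and the installed mgds package being at different revisions, although that is an assumption and not something confirmed here. As a rough sketch, the standard inspect module can show whether an installed callable accepts a given keyword before it is called. EncodeClipTextStandIn below is a stand-in written for this example, not the real mgds class.

```python
import inspect


def accepts_keyword(func, name: str) -> bool:
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params and params[name].kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


class EncodeClipTextStandIn:
    """Stand-in for illustration only; not the real mgds class."""

    def __init__(self, in_name, hidden_state_out_name, pooled_out_name, text_encoder):
        self.in_name = in_name


print(accepts_keyword(EncodeClipTextStandIn, "add_layer_norm"))
print(accepts_keyword(EncodeClipTextStandIn, "in_name"))
```

If the check fails for the real class, reinstalling the dependencies at the revisions pinned in requirements.txt is probably a better fix than editing call sites by hand.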

Latest install broken

Install stats...

creating venv in D:\OneTrainer\venv
activating venv D:\OneTrainer\venv
installing dependencies
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu118
Collecting diffusers
  Cloning https://github.com/huggingface/diffusers.git (to revision 7200985) to c:\users\jason\appdata\local\temp\pip-install-sn1sov7x\diffusers_fd2cb56825804a8da7f6ff7b3dd0b543
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/diffusers.git 'C:\Users\Jason\AppData\Local\Temp\pip-install-sn1sov7x\diffusers_fd2cb56825804a8da7f6ff7b3dd0b543'
  WARNING: Did not find branch or tag '7200985', assuming revision or ref.
  Running command git checkout -q 7200985
  Resolved https://github.com/huggingface/diffusers.git to commit 7200985
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting transformers
  Cloning https://github.com/huggingface/transformers.git (to revision 656e869) to c:\users\jason\appdata\local\temp\pip-install-sn1sov7x\transformers_5efb4222a95d478383f9a8a31e03f549
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git 'C:\Users\Jason\AppData\Local\Temp\pip-install-sn1sov7x\transformers_5efb4222a95d478383f9a8a31e03f549'
  WARNING: Did not find branch or tag '656e869', assuming revision or ref.
  Running command git checkout -q 656e869
  Resolved https://github.com/huggingface/transformers.git to commit 656e869
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting mgds
  Cloning https://github.com/Nerogar/mgds.git (to revision c7e3a7c) to c:\users\jason\appdata\local\temp\pip-install-sn1sov7x\mgds_47e51c8d55cf4768ad67eccc732769b0
  Running command git clone --filter=blob:none --quiet https://github.com/Nerogar/mgds.git 'C:\Users\Jason\AppData\Local\Temp\pip-install-sn1sov7x\mgds_47e51c8d55cf4768ad67eccc732769b0'
  WARNING: Did not find branch or tag 'c7e3a7c', assuming revision or ref.
  Running command git checkout -q c7e3a7c
  Resolved https://github.com/Nerogar/mgds.git to commit c7e3a7c
  Preparing metadata (setup.py) ... done
Collecting numpy==1.23.5
  Using cached numpy-1.23.5-cp310-cp310-win_amd64.whl (14.6 MB)
Collecting opencv-python==4.7.0.72
  Using cached opencv_python-4.7.0.72-cp37-abi3-win_amd64.whl (38.2 MB)
Collecting pillow==9.3.0
  Using cached https://download.pytorch.org/whl/Pillow-9.3.0-cp310-cp310-win_amd64.whl (2.5 MB)
Collecting tqdm==4.64.1
  Using cached https://download.pytorch.org/whl/tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Collecting PyYAML==6.0
  Using cached PyYAML-6.0-cp310-cp310-win_amd64.whl (151 kB)
Collecting pytorch-lightning==2.0.3
  Using cached pytorch_lightning-2.0.3-py3-none-any.whl (720 kB)
Collecting torch==2.0.1+cu118
  Using cached https://download.pytorch.org/whl/cu118/torch-2.0.1%2Bcu118-cp310-cp310-win_amd64.whl (2619.1 MB)
Collecting torchvision==0.15.2+cu118
  Using cached https://download.pytorch.org/whl/cu118/torchvision-0.15.2%2Bcu118-cp310-cp310-win_amd64.whl (4.9 MB)
Collecting accelerate==0.18.0
  Using cached accelerate-0.18.0-py3-none-any.whl (215 kB)
Collecting safetensors==0.3.1
  Using cached safetensors-0.3.1-cp310-cp310-win_amd64.whl (263 kB)
Collecting tensorboard==2.13.0
  Using cached tensorboard-2.13.0-py3-none-any.whl (5.6 MB)
Collecting omegaconf
  Using cached omegaconf-2.3.0-py3-none-any.whl (79 kB)
ERROR: Could not find a version that satisfies the requirement xformers==0.0.20.dev539 (from versions: 0.0.1, 0.0.2, 0.0.3, 0.0.4, 0.0.5, 0.0.6, 0.0.7, 0.0.8, 0.0.9, 0.0.10, 0.0.11, 0.0.12, 0.0.13, 0.0.16rc424, 0.0.16rc425, 0.0.16, 0.0.17rc481, 0.0.17rc482, 0.0.17, 0.0.18, 0.0.19, 0.0.20, 0.0.21.dev543, 0.0.21.dev544, 0.0.21.dev546, 0.0.21.dev547, 0.0.21.dev548)
ERROR: No matching distribution found for xformers==0.0.20.dev539

[notice] A new release of pip available: 22.2.2 -> 23.1.2
[notice] To update, run: D:\OneTrainer\venv\Scripts\python.exe -m pip install --upgrade pip

************
Install done
************
Press any key to continue . . .

Then when running start-ui.bat

D:\OneTrainer>start-ui
activating venv D:\OneTrainer\venv
Traceback (most recent call last):
  File "D:\OneTrainer\scripts\train_ui.py", line 6, in <module>
    from modules.ui.TrainUI import TrainUI
  File "D:\OneTrainer\modules\ui\TrainUI.py", line 8, in <module>
    import customtkinter as ctk
ModuleNotFoundError: No module named 'customtkinter'
Press any key to continue . . .
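
Note that the installer printed "Install done" even though pip had aborted on the xformers requirement, so nothing from the requirements list was actually installed. A small sketch of a post-install sanity check is shown below. missing_modules is a hypothetical helper, and the module names passed to it are only examples taken from the logs above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "customtkinter" is taken from the traceback above; "json" ships with Python.
print(missing_modules(["json", "customtkinter"]))
```

An empty list means every listed module can be located; anything else points at a dependency that did not install.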

[Enhancement] New features about text augmentation

Hello,

I think it would be good to have a preview of text augmentation, just like the one for image augmentation. There are also some potential augmentations for captions:

  • Randomly dropping a caption chunk with a given probability. This needs a list of strings to protect caption chunks that the user doesn't want dropped.

I also have a question about dataset shuffling. Does the mgds dataloader shuffle the order of the training data in each epoch? I think it should be normal to always shuffle the dataset.

Many thanks!
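
The caption-dropping idea requested above could look roughly like the sketch below. It assumes captions are comma-separated chunks; the function name and parameters are made up for this example and are not an existing OneTrainer or mgds API.

```python
import random


def drop_caption_chunks(caption, drop_probability, protected=(), rng=None, separator=","):
    """Randomly drop caption chunks, never dropping protected ones."""
    rng = rng or random.Random()
    protected_set = {chunk.strip() for chunk in protected}
    kept = []
    for chunk in caption.split(separator):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Keep protected chunks always; keep others with probability 1 - p.
        if chunk in protected_set or rng.random() >= drop_probability:
            kept.append(chunk)
    return (separator + " ").join(kept)


rng = random.Random(0)  # seeded only so repeated runs of the demo match
print(drop_caption_chunks("a man, red coat, outdoors, snow", 0.5, protected=["a man"], rng=rng))
```

Passing the random generator in explicitly makes the augmentation reproducible, which would also help if a preview of the augmented captions is added.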

Training Issue and Requirements issue

Hi all,

First off, always looking forward to a new method of training to try!

  1. There's an issue with a fresh install, whether using install.bat or installing manually with pip install -r requirements.txt: it fails to build the wheel on the line "git+https://github.com/huggingface/transformers.git@5bb4430#egg=transformers". Removing the tag so that the line is just "git+https://github.com/huggingface/transformers.git" allows it to install successfully.

  2. After installing with the method above, setting the options correctly as far as I know, and starting to train, I get the following error in the console. I haven't dug into it yet:

Traceback (most recent call last):
  File "E:\AIStuff\Software\OneTrainer\modules\ui\TrainUI.py", line 555, in training_thread_function
    trainer.start()
  File "E:\AIStuff\Software\OneTrainer\modules\trainer\GenericTrainer.py", line 95, in start
    self.data_loader = self.create_data_loader(
  File "E:\AIStuff\Software\OneTrainer\modules\trainer\BaseTrainer.py", line 55, in create_data_loader
    return create.create_data_loader(
  File "E:\AIStuff\Software\OneTrainer\modules\util\create.py", line 187, in create_data_loader
    return MgdsStableDiffusionFineTuneDataLoader(args, model, train_progress)
  File "E:\AIStuff\Software\OneTrainer\modules\dataLoader\MgdsStableDiffusionFineTuneDataLoader.py", line 14, in __init__
    super(MgdsStableDiffusionFineTuneDataLoader, self).__init__(args, model, train_progress)
  File "E:\AIStuff\Software\OneTrainer\modules\dataLoader\MgdsStableDiffusionBaseDataLoader.py", line 26, in __init__
    self.ds = self.create_dataset(
  File "E:\AIStuff\Software\OneTrainer\modules\dataLoader\MgdsStableDiffusionBaseDataLoader.py", line 337, in create_dataset
    return self._create_mgds(
  File "E:\AIStuff\Software\OneTrainer\modules\dataLoader\MgdsBaseDataLoader.py", line 23, in _create_mgds
    ds = MGDS(
  File "E:\AIStuff\Software\OneTrainer\venv\lib\site-packages\mgds\MGDS.py", line 351, in __init__
    self.loading_pipeline.start()
  File "E:\AIStuff\Software\OneTrainer\venv\lib\site-packages\mgds\MGDS.py", line 290, in start
    module.start()
  File "E:\AIStuff\Software\OneTrainer\venv\lib\site-packages\mgds\GenericDataLoaderModules.py", line 1076, in start
    self.__refresh_cache()
  File "E:\AIStuff\Software\OneTrainer\venv\lib\site-packages\mgds\GenericDataLoaderModules.py", line 1067, in __refresh_cache
    torch.save(split_item, os.path.join(cache_dir, str(index) + '.pt'))
  File "E:\AIStuff\Software\OneTrainer\venv\lib\site-packages\torch\serialization.py", line 440, in save
    with _open_zipfile_writer(f) as opened_zipfile:
  File "E:\AIStuff\Software\OneTrainer\venv\lib\site-packages\torch\serialization.py", line 315, in _open_zipfile_writer
    return container(name_or_buffer)
  File "E:\AIStuff\Software\OneTrainer\venv\lib\site-packages\torch\serialization.py", line 288, in __init__
    super().__init__(torch._C.PyTorchFileWriter(str(name)))
RuntimeError: Parent directory E: does not exist.

Considering it's running off the E drive itself, you can clearly see it exists :)
Looking for suggestions on this one, before I start digging.

Oh, and I'm not 100% sure, but the "debug directory" selection in the UI under General seems to open a file browser rather than a folder browser: it asks for a specific file, not a folder. I assume it's meant to be a folder browser.

Strange results in training, picture and settings attached

As per title, my resulting images are more or less all like:

image

prompt is: a man in a red coat

My settings:
{
"training_method": "LORA",
"model_type": "STABLE_DIFFUSION_15",
"debug_mode": false,
"debug_dir": "debug",
"workspace_dir": "C:/Users/one/Pictures/LoRa_Training/Loratry",
"cache_dir": "C:/Users/one/Pictures/LoRa_Training/Loratry/cache",
"tensorboard": true,
"base_model_name": "C:/AI/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned.ckpt",
"extra_model_name": "",
"weight_dtype": "FLOAT_32",
"output_dtype": "FLOAT_16",
"output_model_format": "SAFETENSORS",
"output_model_destination": "C:/Users/one/Pictures/LoRa_Training/Loratry/models/onetrainer/supa3d_oneT",
"concept_file_name": "training_concepts/concepts.json",
"circular_mask_generation": false,
"random_rotate_and_crop": true,
"aspect_ratio_bucketing": true,
"latent_caching": true,
"latent_caching_epochs": 1,
"optimizer": "ADAMW_8BIT",
"learning_rate_scheduler": "COSINE_WITH_RESTARTS",
"learning_rate": 0.0002,
"learning_rate_warmup_steps": 200,
"learning_rate_cycles": 10,
"weight_decay": 0.01,
"epochs": 100,
"batch_size": 1,
"gradient_accumulation_steps": 1,
"ema": "OFF",
"ema_decay": 0.999,
"ema_update_step_interval": 5,
"train_text_encoder": true,
"train_text_encoder_epochs": 30,
"text_encoder_learning_rate": 5e-05,
"text_encoder_layer_skip": 1,
"train_unet": true,
"train_unet_epochs": 100000,
"unet_learning_rate": 0.0002,
"offset_noise_weight": 0.05,
"rescale_noise_scheduler_to_zero_terminal_snr": false,
"force_v_prediction": false,
"force_epsilon_prediction": false,
"train_device": "cuda",
"temp_device": "cpu",
"train_dtype": "FLOAT_16",
"only_cache": false,
"resolution": 512,
"masked_training": false,
"unmasked_probability": 0.1,
"unmasked_weight": 0.1,
"normalize_masked_area_loss": false,
"max_noising_strength": 1.0,
"token_count": 1,
"initial_embedding_text": "*",
"lora_rank": 128,
"lora_alpha": 128.0,
"attention_mechanism": "XFORMERS",
"sample_definition_file_name": "training_samples/samples.json",
"sample_after": 2,
"sample_after_unit": "MINUTE",
"backup_after": 30,
"backup_after_unit": "MINUTE",
"backup_before_save": true
}

what could it be?

Specific training data naming requirements?

I am attempting to explore OneTrainer, but I am having trouble with the training process, as there appear to be some hidden rules on file names that break training. I have added a concept with the following settings:

name: gs
path: C:/Users/bneigher/Desktop/OneTrainerDemo/train/img
prompt_source: From image file name
include_sub_dires: true

training folder:
1.jpg, 1.txt, 2.jpg, 2.txt, etc.

When I run training, I see the following error:

PIL.UnidentifiedImageError: cannot identify image file 'C:/Users/bneigher/Desktop/OneTrainerDemo/train/img\\40_gs_xl\\._1.jpg'

Perhaps this is user error, but I noticed that this application appears to create hidden files during normal operation, which I would imagine presents some challenges if they are not cleaned up when necessary (the training dataset folder being one example).
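
For context, the ._ prefix is the naming pattern macOS uses for AppleDouble metadata files when data is copied to a non-Apple filesystem, so the file in the error may have come with the dataset and not from OneTrainer (an assumption; the thread does not settle this). Either way, a dataset scan can skip such files. The sketch below uses a hypothetical list_training_images helper and an illustrative extension list.

```python
import os
import tempfile

# Illustrative set; the real loader may support other formats.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def list_training_images(root):
    """Walk `root` and return image paths, skipping hidden files such as '._1.jpg'."""
    images = []
    for directory, _subdirs, files in os.walk(root):
        for name in sorted(files):
            if name.startswith("."):
                continue  # covers '._*' AppleDouble files and other dotfiles
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                images.append(os.path.join(directory, name))
    return images


with tempfile.TemporaryDirectory() as tmp:
    for name in ("1.jpg", "1.txt", "._1.jpg"):
        open(os.path.join(tmp, name), "wb").close()
    print([os.path.basename(p) for p in list_training_images(tmp)])
```

Deleting the ._ files from the dataset folder should have the same effect without any code change.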

"FileNotFoundError" when running train.bat

Seemingly installed fine. Ran the program and added a dataset folder to Workspace Directory, then specified an output checkpoint name on the Model tab. Training didn't get any error, but upon exporting the train.bat and running it, I get the following:

C:\Users\T\OneTrainer>python scripts/train.py --training-method="LORA" --model-type="STABLE_DIFFUSION_15" --debug-dir="debug" --workspace-dir="C:/Users/T/Docs/test" --cache-dir="workspace-cache/run" --tensorboard --base-model-name="runwayml/stable-diffusion-v1-5" --extra-model-name="" --weight-dtype="FLOAT_32" --output-dtype="FLOAT_32" --output-model-format="CKPT" --output-model-destination="models/lora.ckpt" --concept-file-name="training_concepts/concepts.json" --aspect-ratio-bucketing --latent-caching --latent-caching-epochs="1" --optimizer="ADAMW" --learning-rate-scheduler="CONSTANT" --learning-rate="0.0003" --learning-rate-warmup-steps="200" --learning-rate-cycles="1" --weight-decay="0.01" --epochs="100" --batch-size="4" --gradient-accumulation-steps="1" --ema="OFF" --ema-decay="0.999" --ema-update-step-interval="5" --train-text-encoder --train-text-encoder-epochs="30" --text-encoder-learning-rate="0.0003" --text-encoder-layer-skip="0" --train-unet --train-unet-epochs="100000" --unet-learning-rate="0.0003" --offset-noise-weight="0.05" --train-device="cuda" --temp-device="cpu" --train-dtype="FLOAT_16" --resolution="512" --unmasked-probability="0.1" --unmasked-weight="0.1" --max-noising-strength="1.0" --token-count="1" --initial-embedding-text="*" --lora-rank="16" --lora-alpha="1.0" --attention-mechanism="XFORMERS" --sample-definition-file-name="training_samples/samples.json" --sample-after="1" --sample-after-unit="MINUTE" --backup-after="10" --backup-after-unit="MINUTE" --backup-before-save
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton.language'
Traceback (most recent call last):
File "C:\Users\TOneTrainer\scripts\train.py", line 33, in <module>
main()
File "C:\Users\T\OneTrainer\scripts\train.py", line 18, in main
trainer = GenericTrainer(args, callbacks, commands)
File "C:\Users\T\OneTrainer\modules\trainer\GenericTrainer.py", line 64, in __init__
self.tensorboard_subprocess = subprocess.Popen(
File "C:\Users\T\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 969, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "C:\Users\T\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 1438, in _execute_child
hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

My Python install is functioning fine for Auto1111 webui.

UPDATE: Feeding the dataset into the Workspace Directory is not the correct way to do this, it seems. I pointed it to a blank directory instead, then created a "concept" and pointed that to the dataset directory.
This prompted another error, "IndexError: list index out of range", which I resolved by replacing the default Cache Directory with a blank folder on my drive.

Seems to be working now.
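
The traceback above fails inside subprocess.Popen while starting the TensorBoard subprocess, and it runs under the system-wide Python 3.10 install, not a venv. One plausible explanation, offered only as an assumption, is that the tensorboard executable was not on PATH because the exported command was run without activating the venv. A hedged sketch of checking for an executable before launching it follows; start_if_available is a hypothetical helper, not OneTrainer code.

```python
import shutil
import subprocess


def start_if_available(command):
    """Start `command` only if its executable can be found on PATH."""
    executable = shutil.which(command[0])
    if executable is None:
        print(f"'{command[0]}' not found on PATH; is the venv activated?")
        return None
    return subprocess.Popen([executable, *command[1:]])


# A deliberately missing tool name, so the demo never launches anything.
process = start_if_available(["hypothetical-missing-tool", "--version"])
print(process)
```

With a check like this, a missing tool produces a readable hint and no WinError 2 traceback.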

RuntimeError: "grid_sampler_2d_cuda" not implemented for 'BFloat16'

Hello, whenever I cache with train data type BF16, I get an error saying it's not implemented. As a workaround I cache in FP16 and then run in BF16, with seemingly no further issues as long as I preserve the cache. Workable but annoying.


Traceback (most recent call last):
File "D:\AI\OneTrainer\modules\ui\TrainUI.py", line 561, in training_thread_function
trainer.start()
File "D:\AI\OneTrainer\modules\trainer\GenericTrainer.py", line 107, in start
self.data_loader = self.create_data_loader(
File "D:\AI\OneTrainer\modules\trainer\BaseTrainer.py", line 55, in create_data_loader
return create.create_data_loader(
File "D:\AI\OneTrainer\modules\util\create.py", line 177, in create_data_loader
return MgdsStableDiffusionFineTuneDataLoader(args, model, train_progress)
File "D:\AI\OneTrainer\modules\dataLoader\MgdsStableDiffusionFineTuneDataLoader.py", line 14, in __init__
super(MgdsStableDiffusionFineTuneDataLoader, self).__init__(args, model, train_progress)
File "D:\AI\OneTrainer\modules\dataLoader\MgdsStableDiffusionBaseDataLoader.py", line 26, in __init__
self.ds = self.create_dataset(
File "D:\AI\OneTrainer\modules\dataLoader\MgdsStableDiffusionBaseDataLoader.py", line 337, in create_dataset
return self._create_mgds(
File "D:\AI\OneTrainer\modules\dataLoader\MgdsBaseDataLoader.py", line 23, in _create_mgds
ds = MGDS(
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\MGDS.py", line 357, in __init__
self.loading_pipeline.start()
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\MGDS.py", line 302, in start
module.start_next_epoch()
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\GenericDataLoaderModules.py", line 1083, in start_next_epoch
self.__refresh_cache()
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\GenericDataLoaderModules.py", line 1066, in __refresh_cache
split_item[name] = self.get_previous_item(name, index)
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\MGDS.py", line 51, in get_previous_item
item = module.get_item(index, item_name)
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\DiffusersDataLoaderModules.py", line 34, in get_item
image = self.get_previous_item(self.in_name, index)
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\MGDS.py", line 51, in get_previous_item
item = module.get_item(index, item_name)
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\GenericDataLoaderModules.py", line 394, in get_item
image = self.get_previous_item(self.image_in_name, index)
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\MGDS.py", line 51, in get_previous_item
item = module.get_item(index, item_name)
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\GenericDataLoaderModules.py", line 919, in get_item
previous_item = self.get_previous_item(name, index)
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\MGDS.py", line 51, in get_previous_item
item = module.get_item(index, item_name)
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\GenericDataLoaderModules.py", line 879, in get_item
previous_item = self.get_previous_item(name, index)
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\MGDS.py", line 51, in get_previous_item
item = module.get_item(index, item_name)
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\GenericDataLoaderModules.py", line 839, in get_item
previous_item = self.get_previous_item(name, index)
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\MGDS.py", line 51, in get_previous_item
item = module.get_item(index, item_name)
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\GenericDataLoaderModules.py", line 799, in get_item
previous_item = self.get_previous_item(name, index)
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\MGDS.py", line 51, in get_previous_item
item = module.get_item(index, item_name)
File "D:\AI\OneTrainer\venv\lib\site-packages\mgds\GenericDataLoaderModules.py", line 761, in get_item
previous_item = functional.rotate(previous_item, angle, interpolation=InterpolationMode.BILINEAR)
File "D:\AI\OneTrainer\venv\lib\site-packages\torchvision\transforms\functional.py", line 1140, in rotate
return F_t.rotate(img, matrix=matrix, interpolation=interpolation.value, expand=expand, fill=fill)
File "D:\AI\OneTrainer\venv\lib\site-packages\torchvision\transforms_functional_tensor.py", line 669, in rotate
return _apply_grid_transform(img, grid, interpolation, fill=fill)
File "D:\AI\OneTrainer\venv\lib\site-packages\torchvision\transforms_functional_tensor.py", line 560, in _apply_grid_transform
img = grid_sample(img, grid, mode=mode, padding_mode="zeros", align_corners=False)
File "D:\AI\OneTrainer\venv\lib\site-packages\torch\nn\functional.py", line 4244, in grid_sample
return torch.grid_sampler(input, grid, mode_enum, padding_mode_enum, align_corners)
RuntimeError: "grid_sampler_2d_cuda" not implemented for 'BFloat16'
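
The traceback ends in the rotate augmentation, which calls grid_sample on a BFloat16 tensor. A common pattern for this kind of gap is to run the unsupported operation in another dtype and cast the result back. The sketch below shows only the pattern, using a stand-in tensor class so that it runs without PyTorch; run_in_fallback_dtype, DemoTensor and demo_rotate are all made up for this example.

```python
def run_in_fallback_dtype(op, tensor, unsupported_dtypes, fallback_dtype, *args, **kwargs):
    """Run op(tensor, ...), casting around dtypes the op cannot handle."""
    original_dtype = tensor.dtype
    if original_dtype not in unsupported_dtypes:
        return op(tensor, *args, **kwargs)
    # Upcast for the operation, then restore the caller's dtype.
    result = op(tensor.to(fallback_dtype), *args, **kwargs)
    return result.to(original_dtype)


class DemoTensor:
    """Tiny stand-in with the same .dtype/.to() shape as a tensor; demo only."""

    def __init__(self, dtype):
        self.dtype = dtype

    def to(self, dtype):
        return DemoTensor(dtype)


def demo_rotate(tensor, angle):
    print(f"rotating by {angle} in {tensor.dtype}")
    return tensor


out = run_in_fallback_dtype(demo_rotate, DemoTensor("bfloat16"), {"bfloat16"}, "float32", 15)
print(out.dtype)
```

With real tensors the same helper could presumably be called with torch.bfloat16 and torch.float32 around the rotate call, but that has not been tested here.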

[Bug] IndexError: list assignment index out of range

Hello,

Thanks for providing this tool, which makes it easier for users to fine-tune SD models. It's cool to have!

However, when trying to start a test run, I hit the error IndexError: list assignment index out of range, as shown below:
image

I guess this issue is related to the dataset itself. I have a folder containing all the images, each with a caption .txt file of the same name. I created a concept in the Concepts tab using the "From image file name" mode (in fact I tried all 3 modes, but each failed with the same error).

Do you have any idea about how to fix this issue? Thanks!

Colab version

Hello, I would like to give OneTrainer a chance, but I only have a 3070 with 8 GB of VRAM.
Will I be able to train SDXL LoRAs, or only 1.5 models?
Also, is there a Colab version? Thanks

Issues with VAE Finetuning

Hi, I'm reporting some issues with Fine Tune VAE

I had to add to the arguments
enabled_in_name='settings.enable_random_circular_mask_shrink'

random_mask_rotate_crop = RandomMaskRotateCrop(mask_name='mask', additional_names=inputs, min_size=args.resolution,

enabled_in_name='concept.enable_random_flip'

random_flip = RandomFlip(names=inputs)

The ema property is not defined in the model. The error is thrown here, but the ema property is accessed in other places too, so as a workaround I had to add a line with self.model.ema=None after line 91:

if self.model.ema:
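
An alternative to assigning self.model.ema=None by hand would be to read the attribute defensively. This is only a sketch of that idea, with a stand-in model class; it is not how OneTrainer actually handles EMA.

```python
class ModelStandIn:
    """Stand-in for a model object that never defines `ema`."""


def get_ema(model):
    """Return the model's EMA object, or None when the attribute is missing."""
    return getattr(model, "ema", None)


model = ModelStandIn()
if get_ema(model):
    print("updating EMA weights")
else:
    print("no EMA configured, skipping")
```

Defining ema = None in the model's constructor would be the cleaner long-term fix, since every access site then works unchanged.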

Also, I have to enable Random Rotate and Crop and Aspect Ratio Bucketing, otherwise it throws an error. I also cannot sample images during training; it tries to open an image that hasn't been created:

Traceback (most recent call last):
  File "/mnt/D2/Neural/OneTrainer/scripts/train.py", line 33, in <module>
    main()
  File "/mnt/D2/Neural/OneTrainer/scripts/train.py", line 24, in main
    trainer.train()
  File "/mnt/D2/Neural/OneTrainer/modules/trainer/GenericTrainer.py", line 314, in train
    self.__execute_sample_during_training()
  File "/mnt/D2/Neural/OneTrainer/modules/trainer/GenericTrainer.py", line 143, in __execute_sample_during_training
    fun()
  File "/mnt/D2/Neural/OneTrainer/modules/trainer/GenericTrainer.py", line 301, in <lambda>
    lambda: self.__sample_during_training(train_progress)
  File "/mnt/D2/Neural/OneTrainer/modules/trainer/GenericTrainer.py", line 192, in __sample_during_training
    self.__sample_loop(train_progress, sample_definitions)
  File "/mnt/D2/Neural/OneTrainer/modules/trainer/GenericTrainer.py", line 166, in __sample_loop
    self.model_sampler.sample(
  File "/mnt/D2/Neural/OneTrainer/modules/modelSampler/StableDiffusionVaeSampler.py", line 35, in sample
    image = Image.open(prompt)
  File "/mnt/D2/Neural/OneTrainer/venv/lib/python3.10/site-packages/PIL/Image.py", line 3131, in open
    fp = builtins.open(filename, "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'Muscular anthro male'

After this, it finally starts training normally.

Install.bat not working.

After following the instructions (doing git clone), I tried to run install.bat but keep getting the same failure error:

activating venv C:\Users\silve\Documents\AI Art__Tools\General\Training\OneTrainer\venv
installing dependencies
The system cannot find the path specified.


Install done


Press any key to continue . . .

Running start-ui.bat also fails, which confirms that the install didn't work correctly. From my folder structure, it seems like the venv folder simply isn't being created. I have Python installed, and even reinstalled it to check.

resizing UI window

Hi, I've been using OT for a few weeks.
The Dataset tool UI window is currently not resizable, and the image tagger prompt box is hidden under the Windows taskbar when it launches.
Thanks :)

[Bug] Cannot open optimizer settings

Hello,

After updating to the latest version, I cannot click the "..." button next to the optimizer to set its parameters.

Error message:
"AttributeError: 'TrainingTab' object has no attribute 'tk'"

TypeError: StableDiffusionModelLoader.load()

Trying this out for the first time after being used to using kohya_ss GUI. Loaded up the sd1.5 LoRA config preset, added a concept (it correctly shows a preview image), added sampling settings, and then get this error after clicking Start Training:

Traceback (most recent call last):
  File "C:\Stuff\AI\OneTrainer\modules\ui\TrainUI.py", line 555, in training_thread_function
    trainer.start()
  File "C:\Stuff\AI\OneTrainer\modules\trainer\GenericTrainer.py", line 79, in start
    self.model = self.model_loader.load(
  File "C:\Stuff\AI\OneTrainer\modules\modelLoader\StableDiffusionLoRAModelLoader.py", line 110, in load
    model = base_model_loader.load(model_type, base_model_name, None)
TypeError: StableDiffusionModelLoader.load() missing 1 required positional argument: 'extra_model_name'

I tried pointing to my own local sd-1.5.ckpt model, same error.

Can't finetune sdxl with a 4090, out of memory

When I try to fine-tune the SDXL 0.9 model, I get out-of-memory errors. Of course I make sure it's a fresh boot and nothing is running in the background. I turned off caching and TensorBoard in the GUI settings, but it says it's caching anyway during the initial training.

I don't know whether that's what's eating up the VRAM at the beginning. The 4090 has 24 GB of VRAM and is supposedly able to fine-tune.

bug report after updating to the latest version

Hello,

After updating to the latest version using the update.bat script, I hit a bug:
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.1.0+cu121 with CUDA 1201 (you have 2.1.0+cu118)
Python 3.10.11 (you have 3.10.9)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
A matching Triton is not available, some optimizations will not be enabled.

This means I can no longer enable xformers attention. I think the cause is a version mismatch between CUDA/PyTorch and xformers.

Could you please check it? Many thanks!
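
The warning above reports that xformers was built for PyTorch 2.1.0+cu121 while 2.1.0+cu118 is installed, so the base versions match and only the CUDA build tag differs. A small sketch of comparing such version strings is shown below; the helper names are made up, and the version strings are copied from the warning.

```python
def split_local_version(version: str):
    """Split 'X.Y.Z+tag' into ('X.Y.Z', 'tag'); the tag is None when absent."""
    base, _, local = version.partition("+")
    return base, (local or None)


def same_build(installed: str, expected: str) -> bool:
    """True only when both the base version and the build tag match."""
    return split_local_version(installed) == split_local_version(expected)


print(split_local_version("2.1.0+cu118"))
print(same_build("2.1.0+cu118", "2.1.0+cu121"))
```

A check along these lines could run at startup to explain why xformers attention is unavailable before training begins.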

Error: missing 1 required positional argument: 'extra_model_name'

Hi there. I set up a dataset to test OneTrainer on a simple task, a SD 1.5 embedding. I used the GUI to set up everything based off the provided template and changed a couple details and the obvious things like the base model etc, added my dataset via the concepts tab and saved the whole thing as a preset so I can later work off of that.

When I ran the training with the GUI I received:
TypeError: StableDiffusionModelLoader.load() missing 1 required positional argument: 'extra_model_name'

Since I cannot see this parameter exposed anywhere in the GUI, I used the export-as-script function and manually edited the batch file. First I filled in the extra model name with the base model path again, then I tried a name for the model (I am not sure what this parameter really requires, heh), and finally I removed the extra model name parameter from the script entirely, hoping it would default to something that works.
All of these attempts, including removing the parameter altogether, led to the exact same error:
TypeError: StableDiffusionModelLoader.load() missing 1 required positional argument: 'extra_model_name'

I am completely out of ideas. What am I overlooking here? Send help!
Thanks :)

Any way to set optimizer arguments?

Is there any way to set optimizer arguments?
For example: safeguard_warmup=False, d_coef=2, use_bias_correction=True. These arguments are essential to the Prodigy optimizer.
Thank you.
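One way such arguments could be accepted is as a free-form "key=value" string that is parsed into keyword arguments. This is only a sketch of the idea: `parse_optimizer_args` is a hypothetical helper, not an existing OneTrainer option, and it assumes the values are simple Python literals without embedded commas.

```python
import ast


def parse_optimizer_args(text: str) -> dict:
    """Parse 'key=value, key=value' into a dict of Python literals."""
    kwargs = {}
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        key, sep, raw_value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        # literal_eval only accepts literals (numbers, booleans, strings, ...),
        # so arbitrary code in the value is rejected.
        kwargs[key.strip()] = ast.literal_eval(raw_value.strip())
    return kwargs


extra = parse_optimizer_args("safeguard_warmup=False, d_coef=2, use_bias_correction=True")
print(extra)
```

The resulting dict could then be passed as `**extra` to an optimizer constructor, provided that constructor accepts those keywords. Tuple values such as a pair of betas would need a smarter splitter than `split(",")`.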

Feedback after using OneTrainer

Hello,

It's me again. I have played with OneTrainer for a while since the issue I mentioned was fixed. While testing it, I found it quite good and easy to use.

Based on my experience, I'd like to share some feedback with you.

  1. Model Output Destination only works when Backup Before Save is enabled. If I disable it, I lose the final trained model, because the script won't save the final trained model to a backup (in my case, I back up every 20 epochs and train for 100 epochs in total). To save the final model, I need to set the epoch count to 101 so that the script saves the model at 100 epochs.
  2. An error occurs when running "start_ui.py": "A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named 'triton'"
  3. A 'Test' tab could be added so users can load their trained model and try it out.
  4. Once the "workspace" and "workspace-cache" folders exist, the script overwrites the results and cache inside them. I would prefer the script to create a new folder called "run_xxx" under "workspace" and "workspace-cache", so that each experiment gets its own run id. It could look at the existing folder ids in the workspace folder and add 1 to get the new id.
  5. After reading the description of latent caching, I still don't understand it clearly. Just out of curiosity, could you explain a little about what this feature actually does? My impression is that it caches some intermediate data of the training samples for the given number of latent caching epochs. E.g., if latent caching epochs is set to 10, it creates 10 different versions of the training data based on the data augmentation methods I enabled. During training, the script cycles through those 10 versions, so the diversity of training samples is increased. Am I right?
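The numbering scheme suggested in item 4 (find the highest existing run id and add 1) can be sketched as follows. `next_run_dir` and the `run_<n>` naming are illustrative assumptions taken from the suggestion, not existing OneTrainer behaviour.

```python
import re
import tempfile
from pathlib import Path


def next_run_dir(workspace: Path) -> Path:
    """Create and return workspace/run_<n>, where n is one past the highest existing id."""
    workspace.mkdir(parents=True, exist_ok=True)
    ids = []
    for child in workspace.iterdir():
        match = re.fullmatch(r"run_(\d+)", child.name)
        if child.is_dir() and match:
            ids.append(int(match.group(1)))
    run_dir = workspace / f"run_{max(ids, default=0) + 1}"
    run_dir.mkdir()
    return run_dir


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    first = next_run_dir(Path(tmp) / "workspace")
    second = next_run_dir(Path(tmp) / "workspace")
    print(first.name, second.name)
```

Two processes starting at the same moment could still pick the same id; the sketch does not try to handle that case.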

The other features look really nice! It's quite easy to play with! Thank you for delivering this cool tool!

KeyError: 'tokens_1' while training sdxl

arch based linux manjaro with 4090 and amd threadripper

I get this error when trying to train sdxl. I can train 1.5 fine.

enumerating sample paths: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1265.25it/s]
writing debug images for 'decoded_image': 0%| | 0/80 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/vhey/OneTrainer/modules/ui/TrainUI.py", line 555, in training_thread_function
trainer.start()
File "/home/vhey/OneTrainer/modules/trainer/GenericTrainer.py", line 103, in start
self.data_loader = self.create_data_loader(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vhey/OneTrainer/modules/trainer/BaseTrainer.py", line 55, in create_data_loader
return create.create_data_loader(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vhey/OneTrainer/modules/util/create.py", line 179, in create_data_loader
return MgdsStableDiffusionXLFineTuneDataLoader(args, model, train_progress)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vhey/OneTrainer/modules/dataLoader/MgdsStableDiffusionXLFineTuneDataLoader.py", line 14, in __init__
super(MgdsStableDiffusionXLFineTuneDataLoader, self).__init__(args, model, train_progress)
File "/home/vhey/OneTrainer/modules/dataLoader/MgdsStableDiffusionXLBaseDataLoader.py", line 26, in __init__
self.ds = self.create_dataset(
^^^^^^^^^^^^^^^^^^^^
File "/home/vhey/OneTrainer/modules/dataLoader/MgdsStableDiffusionXLBaseDataLoader.py", line 327, in create_dataset
return self._create_mgds(
^^^^^^^^^^^^^^^^^^
File "/home/vhey/OneTrainer/modules/dataLoader/MgdsBaseDataLoader.py", line 23, in _create_mgds
ds = MGDS(
^^^^^
File "/home/vhey/OneTrainer/venv/lib/python3.11/site-packages/mgds/MGDS.py", line 357, in __init__
self.loading_pipeline.start()
File "/home/vhey/OneTrainer/venv/lib/python3.11/site-packages/mgds/MGDS.py", line 302, in start
module.start_next_epoch()
File "/home/vhey/OneTrainer/venv/lib/python3.11/site-packages/mgds/DebugDataLoaderModules.py", line 38, in start_next_epoch
image_tensor = self.get_previous_item(self.image_in_name, index)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vhey/OneTrainer/venv/lib/python3.11/site-packages/mgds/MGDS.py", line 51, in get_previous_item
item = module.get_item(index, item_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vhey/OneTrainer/venv/lib/python3.11/site-packages/mgds/DebugDataLoaderModules.py", line 114, in get_item
latent_image = self.get_previous_item(self.in_name, index)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vhey/OneTrainer/venv/lib/python3.11/site-packages/mgds/MGDS.py", line 51, in get_previous_item
item = module.get_item(index, item_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vhey/OneTrainer/venv/lib/python3.11/site-packages/mgds/MGDS.py", line 206, in get_item
item[name] = self.get_previous_item(name, index)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vhey/OneTrainer/venv/lib/python3.11/site-packages/mgds/MGDS.py", line 51, in get_previous_item
item = module.get_item(index, item_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vhey/OneTrainer/venv/lib/python3.11/site-packages/mgds/DiffusersDataLoaderModules.py", line 115, in get_item
distribution = self.get_previous_item(self.in_name, index)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vhey/OneTrainer/venv/lib/python3.11/site-packages/mgds/MGDS.py", line 59, in get_previous_item
item = module.get_item(index, item_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/vhey/OneTrainer/venv/lib/python3.11/site-packages/mgds/GenericDataLoaderModules.py", line 1096, in get_item
item[name] = split_item[name]
~~~~~~~~~~^^^^^^
KeyError: 'tokens_1'

No such file or directory

Traceback (most recent call last):
File "E:\SD\ONETRIN\OneTrainer\modules\ui\TrainUI.py", line 476, in training_thread_function
trainer.train()
File "E:\SD\ONETRIN\OneTrainer\modules\trainer\GenericTrainer.py", line 206, in train
self.__sample_during_training(train_progress)
File "E:\SD\ONETRIN\OneTrainer\modules\trainer\GenericTrainer.py", line 120, in __sample_during_training
self.model_sampler.sample(
File "E:\SD\ONETRIN\OneTrainer\modules\modelSampler\StableDiffusionSampler.py", line 351, in sample
image.save(destination)
File "E:\SD\ONETRIN\OneTrainer\venv\lib\site-packages\PIL\Image.py", line 2350, in save
fp = builtins.open(filename, "w+b")
FileNotFoundError: [Errno 2] No such file or directory: 'E:/SD/Dataset\samples\2 - a photo gtk of girl \training-sample-0-0-0.png'
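The failing path above contains a directory name that ends in a space ("2 - a photo gtk of girl "), which Windows does not handle reliably. Whether that is the actual cause here is not confirmed by the report. As a sketch under that assumption, a sample directory name could be cleaned and created before saving. Both function names are hypothetical and do not come from OneTrainer.

```python
import tempfile
from pathlib import Path


def safe_component(name: str) -> str:
    """Make a string usable as a single directory name on Windows and Linux."""
    # Replace characters that Windows forbids in file names.
    cleaned = "".join("_" if ch in '<>:"/\\|?*' else ch for ch in name)
    # Windows does not reliably handle names ending in a space or a dot.
    cleaned = cleaned.rstrip(" .")
    return cleaned or "_"


def sample_path(samples_root: Path, prompt_label: str, file_name: str) -> Path:
    """Build the sample path and make sure its parent directory exists."""
    target_dir = samples_root / safe_component(prompt_label)
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / file_name


# Demonstration in a throwaway directory, using the names from the traceback.
with tempfile.TemporaryDirectory() as tmp:
    path = sample_path(Path(tmp) / "samples", "2 - a photo gtk of girl ", "training-sample-0-0-0.png")
    print(repr(path.parent.name))
```

Building the path with `pathlib` also avoids mixing `/` and `\` separators by hand.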

Unable to load local checkpoint during LoRA training

When training a LoRA I am unable to load my local sd1.5 checkpoint as a base model. This is the error after clicking Start Training with the Base Model set to C:/Stuff/AI/Stable Diffusion/models/Stable-diffusion/sd-v1-5.ckpt:

Traceback (most recent call last):
  File "C:\Stuff\AI\OneTrainer\modules\ui\TrainUI.py", line 555, in training_thread_function
    trainer.start()
  File "C:\Stuff\AI\OneTrainer\modules\trainer\GenericTrainer.py", line 79, in start
    self.model = self.model_loader.load(
  File "C:\Stuff\AI\OneTrainer\modules\modelLoader\StableDiffusionLoRAModelLoader.py", line 110, in load
    model = base_model_loader.load(model_type, weight_dtype, base_model_name, None)
  File "C:\Stuff\AI\OneTrainer\modules\modelLoader\StableDiffusionModelLoader.py", line 226, in load
    raise Exception("could not load model: " + base_model_name)
Exception: could not load model: C:/Stuff/AI/Stable Diffusion/models/Stable-diffusion/sd-v1-5.ckpt

If I instead use the default of runwayml/stable-diffusion-v1-5 then it works, but I don't want yet another duplicate model cluttering my already full hard drive.

My local sd1.5 model works fine for A1111 inference and kohya_ss training, so idk why it isn't loading here.

Error when 'Normalize Masked Area Loss' option is toggled on while 'Masked Training' is turned off

To reproduce:

  1. Choose SD1.5 LoRA preset
  2. Navigate to 'training' tab
  3. Setup normal parameters for LoRA training
  4. Validate that 'Masked Training' is turned off
  5. Turn on 'Normalize Masked Area Loss' option
  6. Start training
  7. Observe the following error is thrown:

Traceback (most recent call last):
File "M:\repos\OneTrainer\modules\ui\TrainUI.py", line 734, in __training_thread_function
trainer.train()
File "M:\repos\OneTrainer\modules\trainer\GenericTrainer.py", line 505, in train
loss = self.model_setup.calculate_loss(self.model, batch, model_output_data, self.args)
File "M:\repos\OneTrainer\modules\modelSetup\BaseStableDiffusionSetup.py", line 373, in calculate_loss
return self._diffusion_loss(
File "M:\repos\OneTrainer\modules\modelSetup\mixin\ModelSetupDiffusionLossMixin.py", line 76, in _diffusion_loss
clamped_mask = torch.clamp(batch['latent_mask'], args.unmasked_weight, 1)
KeyError: 'latent_mask'

My assumption is that the 'Normalize Masked Area Loss' option should be ignored when the 'Masked Training' option is turned off.
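The behaviour proposed in this report can be sketched as a small guard that is evaluated before the mask is read. The function and parameter names are hypothetical; only the `latent_mask` key is taken from the traceback.

```python
def use_mask_normalization(masked_training: bool,
                           normalize_masked_area_loss: bool,
                           batch: dict) -> bool:
    """Apply mask normalization only when a mask is actually available."""
    return (
        masked_training
        and normalize_masked_area_loss
        and "latent_mask" in batch
    )


# Settings from the reproduction steps: masked training off, normalization on.
batch_without_mask = {"latent_image": "..."}
print(use_mask_normalization(False, True, batch_without_mask))
```

With a guard like this, the loss code would skip the `batch['latent_mask']` lookup instead of raising a KeyError.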

Bitsandbytes error on launch

Hi, I installed everything today and got this error:

bin C:\AI\OneTrainer\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.so
C:\AI\OneTrainer\venv\lib\site-packages\bitsandbytes\cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
function 'cadam32bit_grad_fp32' not found

Training does not start and gives me an error about a CUDA GPU not being found.

Has anyone encountered the same problem?

Error during sampling

Traceback (most recent call last):
File "E:\Tools\ML\OneTrainer\modules\ui\TrainUI.py", line 556, in training_thread_function
trainer.train()
File "E:\Tools\ML\OneTrainer\modules\trainer\GenericTrainer.py", line 278, in train
self.__execute_sample_during_training()
File "E:\Tools\ML\OneTrainer\modules\trainer\GenericTrainer.py", line 124, in __execute_sample_during_training
fun()
File "E:\Tools\ML\OneTrainer\modules\trainer\GenericTrainer.py", line 268, in <lambda>
lambda: self.__sample_during_training(train_progress)
File "E:\Tools\ML\OneTrainer\modules\trainer\GenericTrainer.py", line 173, in __sample_during_training
self.__sample_loop(train_progress, sample_definitions)
File "E:\Tools\ML\OneTrainer\modules\trainer\GenericTrainer.py", line 147, in __sample_loop
self.model_sampler.sample(
File "E:\Tools\ML\OneTrainer\modules\modelSampler\StableDiffusionSampler.py", line 339, in sample
image = self.__sample_base(
File "E:\Tools\ML\OneTrainer\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "E:\Tools\ML\OneTrainer\modules\modelSampler\StableDiffusionSampler.py", line 149, in __sample_base
latent_image = noise_scheduler.step(
File "E:\Tools\ML\OneTrainer\venv\lib\site-packages\diffusers\schedulers\scheduling_pndm.py", line 257, in step
return self.step_plms(model_output=model_output, timestep=timestep, sample=sample, return_dict=return_dict)
File "E:\Tools\ML\OneTrainer\venv\lib\site-packages\diffusers\schedulers\scheduling_pndm.py", line 373, in step_plms
prev_sample = self._get_prev_sample(sample, timestep, prev_timestep, model_output)
File "E:\Tools\ML\OneTrainer\venv\lib\site-packages\diffusers\schedulers\scheduling_pndm.py", line 407, in _get_prev_sample
alpha_prod_t = self.alphas_cumprod[timestep]
IndexError: index 1001 is out of bounds for dimension 0 with size 1000

"optimizer got an empty parameter list"

I'm trying to get this to run on a headless Linux machine, so I'm running via CLI only.
I'm attempting to fine-tune SDXL.
I'm getting the following error:

[image: screenshot of the error]

I've tried passing both a Hugging Face path as the model_name and an actual file path to a local copy; same thing.
Additional bit of weirdness: "decoder-model-name" is a required parameter, even though it isn't used for SDXL training, only for Wuerstchen.

Here are my parameters:

python scripts/train.py --training-method FINE_TUNE --model-type STABLE_DIFFUSION_XL_10_BASE --workspace-dir /store/ml/onetrainer/run1 --cache-dir /store/ml/onetrainer/cache --tensorboard --base-model-name stabilityai/stable-diffusion-xl-base-1.0 --decoder-model-name stabilityai/stable-diffusion-xl-base-1.0 --weight-dtype FLOAT_16 --output-dtype FLOAT_16 --output-model-format SAFETENSORS --output-model-destination /store/ml/onetrainer/output/model1.safetensors --gradient-checkpointing --concept-file-name /store/ml/onetrainer/concepts.json --random-rotate-and-crop --aspect-ratio-bucketing --optimizer ADAMW --optimizer-weight-decay 0.01 --learning-rate 8e-7 --learning-rate-warmup-steps 500 --epochs 100 --batch-size 1 --ema CPU --ema-decay 0.9999 --train-dtype FLOAT_16 --resolution 1024 --attention-mechanism SDP --vae-weight-dtype FLOAT_32 --sample-definition-file-name /store/ml/onetrainer/samples.json --sample-after 250 --sample-after-unit STEP --sample-image-format PNG --save-after 1000 --save-after-unit STEP --backup-after 1000 --backup-after-unit STEP

Any clues?

Do the training scripts support multi-GPU training?

I checked the code, and I don't think it can train the model on multiple GPUs or multiple devices. Do you have any advice on supporting multi-device training? Because the whole training process is wrapped in the trainer class, I think it would be difficult to support multi-GPU training.

Disable dataset caching

I would like to skip the cache process before training because my disk space is limited. I want to access the files directly at every step of the training. Is there a switch to disable this feature?
@Nerogar

Lora training fails with latest update despite having alignprop disabled via UI/train.sh

Traceback (most recent call last):
File "/tools/OneTrainer/scripts/train.py", line 33, in <module>
main()
File "/tools/OneTrainer/scripts/train.py", line 24, in main
trainer.train()
File "/tools/OneTrainer/modules/trainer/GenericTrainer.py", line 505, in train
loss = self.model_setup.calculate_loss(self.model, batch, model_output_data, self.args)
File "/tools/OneTrainer/modules/modelSetup/BaseStableDiffusionXLSetup.py", line 379, in calculate_loss
return self._diffusion_loss(
File "/tools/OneTrainer/modules/modelSetup/mixin/ModelSetupDiffusionLossMixin.py", line 34, in _diffusion_loss
if data['loss_type'] == 'align_prop':
KeyError: 'loss_type'
train.txt

Error when training embedding on SDXL 1.0

Hello!
I receive this error in the console when trying to train an embedding. I tried every possible setting that could potentially cause issues, but this error persists unchanged:

Traceback (most recent call last):
File "xxx\OneTrainer\modules\ui\TrainUI.py", line 577, in training_thread_function
trainer.start()
File "xxx\OneTrainer\modules\trainer\GenericTrainer.py", line 91, in start
self.model = self.model_loader.load(
AttributeError: 'NoneType' object has no attribute 'load'

I just picked the SDXL 1.0 preset, switched to embedding, set up the paths to the workspace dirs, the training parameters, the concept, etc., and tried to start training.

Implementing support for PIXART-α Fine Tuning and DreamBooth

PIXART-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

https://pixart-alpha.github.io/

They have train scripts here : https://github.com/PixArt-alpha/PixArt-alpha/tree/master/train_scripts

This model is literally better than SDXL

Normally I make tutorials with Kohya for Stable Diffusion, but OneTrainer was also on my radar. I think OneTrainer can shine with new model training.

I made a full tutorial on PIXART for those who are wondering.

PIXART-α : First Open Source Rival to Midjourney - Better Than Stable Diffusion SDXL - Full Tutorial


SDXL: lora_te2_text_projection is not found in created LoRA modules.

I got the above error trying to load my recently created SDXL LoRA created with OneTrainer. I suppose it is possible that a setting change I made caused this, but it is not readily apparent. OneTrainer did not provide any error messages or anything to indicate that something was wrong during training.

No such file or directory: 'training_concepts/concepts.json'

Hello, when I am trying to run it via UI I am getting "No such file or directory: 'training_concepts/concepts.json'". Could you let me know what the concepts.json and samples.json have to contain so I can make them manually and run via CLI?

Issue with sampling

Hello everyone! For some reason, with mostly default settings, I'm seriously struggling to get it to sample during training. I get these errors:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument tensors in method wrapper_CUDA_cat)
Error during sampling, proceeding without samplin

LoRA Masked Training: KeyError: 'latent_mask'

When attempting to use the Masked Training option when training a LoRA I get this error:

activating venv C:\Stuff\AI\OneTrainer\venv
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'



You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling it only for use-cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
enumerating sample paths: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 71.28it/s]
caching resolutions: 100%|██████████████████████████████████████████████████████████| 27/27 [00:00<00:00, 27066.49it/s]
step:   0%|                                                                                      | 0/6 [00:00<?, ?it/s]
epoch:   0%|                                                                                   | 0/100 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Stuff\AI\OneTrainer\modules\ui\TrainUI.py", line 556, in training_thread_function
    trainer.train()
  File "C:\Stuff\AI\OneTrainer\modules\trainer\GenericTrainer.py", line 257, in train
    for epoch_step, batch in enumerate(tqdm(self.data_loader.dl, desc="step")):
  File "C:\Stuff\AI\OneTrainer\venv\lib\site-packages\tqdm\std.py", line 1195, in __iter__
    for obj in iterable:
  File "C:\Stuff\AI\OneTrainer\venv\lib\site-packages\torch\utils\data\dataloader.py", line 633, in __next__
    data = self._next_data()
  File "C:\Stuff\AI\OneTrainer\venv\lib\site-packages\torch\utils\data\dataloader.py", line 677, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "C:\Stuff\AI\OneTrainer\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Stuff\AI\OneTrainer\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Stuff\AI\OneTrainer\venv\lib\site-packages\mgds\MGDS.py", line 357, in __getitem__
    return self.loading_pipeline.get_item(index)
  File "C:\Stuff\AI\OneTrainer\venv\lib\site-packages\mgds\MGDS.py", line 322, in get_item
    return self.output_module.get_item(index)
  File "C:\Stuff\AI\OneTrainer\venv\lib\site-packages\mgds\MGDS.py", line 200, in get_item
    item[name] = self.get_previous_item(name, index)
  File "C:\Stuff\AI\OneTrainer\venv\lib\site-packages\mgds\MGDS.py", line 50, in get_previous_item
    item = module.get_item(index, item_name)
  File "C:\Stuff\AI\OneTrainer\venv\lib\site-packages\mgds\GenericDataLoaderModules.py", line 1224, in get_item
    item[name] = self.get_previous_item(name, index)
  File "C:\Stuff\AI\OneTrainer\venv\lib\site-packages\mgds\MGDS.py", line 50, in get_previous_item
    item = module.get_item(index, item_name)
  File "C:\Stuff\AI\OneTrainer\venv\lib\site-packages\mgds\DiffusersDataLoaderModules.py", line 115, in get_item
    distribution = self.get_previous_item(self.in_name, index)
  File "C:\Stuff\AI\OneTrainer\venv\lib\site-packages\mgds\MGDS.py", line 50, in get_previous_item
    item = module.get_item(index, item_name)
  File "C:\Stuff\AI\OneTrainer\venv\lib\site-packages\mgds\GenericDataLoaderModules.py", line 1093, in get_item
    item[name] = split_item[name]
KeyError: 'latent_mask'

This error does not happen when the Masked Training setting is turned off.

Here is a screenshot of what my dataset looks like:

[image: dataset screenshot]
I used Lyne to create a transparent attention mask around the subject, then had ChatGPT write a python script to rename and convert those transparent images to black+white so they could be used with OneTrainer.

Tutorials?

Are there any proper tutorials for this? I have absolutely NO CLUE what is going on. It is training, backing things up, making samples, and I have NO CLUE where any of it is going. It isn't going into the folders that I designated or even into the folders that OneTrainer has. I don't want a bunch of "hidden" files filling up my computer. What is going on?

How to use on Linux?

I followed the install instructions: git clone, venv, install requirements, then run start-ui.sh. It then creates a conda environment, ignoring the Python venv. None of the scripts work either; they fail looking for "modules.ui". Can we add instructions for installing on Linux, please? Thanks.

ModuleNotFoundError: No module named 'modules.ui'; 'modules' is not a package

Is there any threshold to limit the tokens?

Hello guys,

I'm wondering: if some captions are too long (exceeding the 77-token limit), is there a token limit threshold that cuts the captions during training?

It would be wonderful if there were no such token limit, so we could use longer captions to fine-tune the SD model.
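For illustration, cutting a caption at a fixed token limit usually looks like the sketch below. This is a generic example and does not describe what OneTrainer does. The 77-token limit comes from the question above, the limit is assumed to include one start and one end marker, and the marker ids in the demonstration are made-up placeholders.

```python
def truncate_tokens(token_ids: list[int], bos_id: int, eos_id: int,
                    max_length: int = 77) -> list[int]:
    """Keep at most max_length ids, reserving one slot each for BOS and EOS."""
    body = token_ids[: max_length - 2]
    return [bos_id, *body, eos_id]


# A fake caption of 100 token ids; ids 0 and 1 stand in for the BOS/EOS markers.
long_caption = list(range(1000, 1100))
truncated = truncate_tokens(long_caption, bos_id=0, eos_id=1)
print(len(truncated), truncated[:3], truncated[-2:])
```

Everything past the first 75 caption tokens is dropped in this scheme, which is why very long captions can lose their trailing details.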

Error on manual install: WARNING: Did not find branch or tag '9d7c08c'

I had issues with the bat install, as I think the Python path it was looking for was different.
Anyway, I just tried to install it manually and noticed this while it was installing the requirements:

WARNING: Did not find branch or tag '9d7c08c', assuming revision or ref.

I'm not sure what this means or if it's required but figured I'd say something.

Queue system?

Hi, is it possible to add a queue system? I often leave the PC on overnight and would like it to train several models during the night, not just the last one I set up. Or is it too difficult to implement?
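Until such a feature exists, a queue can be approximated outside the tool by running several command lines one after another. A minimal sketch, assuming each training run can be started from the command line (as with the `python scripts/train.py ...` invocation quoted in an earlier report); `run_queue` is a hypothetical helper and the jobs below are stand-ins.

```python
import subprocess
import sys


def run_queue(commands: list[list[str]]) -> list[int]:
    """Run each command to completion, in order, and collect the exit codes."""
    exit_codes = []
    for command in commands:
        # check=False: a failed job is recorded but does not stop the queue.
        result = subprocess.run(command, check=False)
        exit_codes.append(result.returncode)
    return exit_codes


# Stand-in jobs; a real queue would list training commands instead.
jobs = [
    [sys.executable, "-c", "print('job 1 done')"],
    [sys.executable, "-c", "import sys; sys.exit(3)"],
    [sys.executable, "-c", "print('job 3 done')"],
]
print(run_queue(jobs))
```

Because each job runs in its own process, GPU memory from one run is released before the next one starts.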
