ashleykleynhans / kohya-docker
Docker image for Kohya_ss Web UI
License: GNU General Public License v3.0
While running a RunPod container with this setup, I'm getting an "accelerate: not found" error. Any tips to debug?
Full logs below:
04:39:07-748602 INFO Start training LoRA Standard ...
04:39:07-750284 INFO Checking for duplicate image filenames in training data
directory...
04:39:07-752158 INFO Valid image folder names found in:
/workspace/organize/watches/img
04:39:07-753621 INFO Headless mode, skipping verification if model already
exist... if model already exist it will be
overwritten...
04:39:07-755396 INFO Folder 20_wxwatch watch: 24 images found
04:39:07-756743 INFO Folder 20_wxwatch watch: 480 steps
04:39:07-758027 INFO Total steps: 480
04:39:07-759290 INFO Train batch size: 2
04:39:07-759959 INFO Gradient accumulation steps: 1
04:39:07-760572 INFO Epoch: 10
04:39:07-761151 INFO Regulatization factor: 1
04:39:07-761755 INFO max_train_steps (480 / 2 / 1 * 10 * 1) = 2400
04:39:07-762504 INFO stop_text_encoder_training = 0
04:39:07-763121 INFO lr_warmup_steps = 240
04:39:07-763757 INFO Can't use LR warmup with LR Scheduler constant...
ignoring...
04:39:07-764482 INFO Saving training config to
/workspace/organize/watches/model/wxwatches_20240119-04
3907.json...
04:39:07-765433 INFO accelerate launch --num_cpu_threads_per_process=2
"./sdxl_train_network.py" --enable_bucket
--min_bucket_reso=256 --max_bucket_reso=2048
--pretrained_model_name_or_path="stabilityai/stable-dif
fusion-xl-base-1.0"
--train_data_dir="/workspace/organize/watches/img"
--resolution="1024,1024"
--output_dir="/workspace/organize/watches/model"
--logging_dir="/workspace/organize/watches/log"
--network_alpha="1" --save_model_as=safetensors
--network_module=networks.lora --text_encoder_lr=5e-05
--unet_lr=0.0001 --network_dim=8
--output_name="wxwatches"
--lr_scheduler_num_cycles="10" --no_half_vae
--learning_rate="5e-05" --lr_scheduler="constant"
--train_batch_size="2" --max_train_steps="2400"
--save_every_n_epochs="1" --mixed_precision="fp16"
--save_precision="fp16" --cache_latents
--cache_latents_to_disk --optimizer_type="Adafactor"
--optimizer_args scale_parameter=False
relative_step=False warmup_init=False
--max_grad_norm="1" --max_data_loader_n_workers="0"
--bucket_reso_steps=64 --xformers --bucket_no_upscale
--noise_offset=0.0
/bin/sh: 1: accelerate: not found
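A minimal diagnostic sketch for the error above, not a confirmed fix: the message comes from /bin/sh failing to find an accelerate executable on PATH. The helper below first repeats that PATH lookup and then checks the venv's bin directory. The function name find_accelerate is hypothetical, and the default venv path is an assumption taken from the "Fixing venv" line in the container start-up log quoted further down this page.

```python
import os
import shutil
from pathlib import Path
from typing import Optional

# Assumed venv location (from the "Fixing venv ... New Path" start-up log line);
# adjust it if your container uses a different path.
DEFAULT_VENV = Path("/workspace/kohya_ss/venv")


def find_accelerate(venv: Path = DEFAULT_VENV, path: Optional[str] = None) -> Optional[str]:
    """Return the path to an `accelerate` executable, or None if none is found."""
    # 1. Repeat the lookup the shell performs: search the directories on PATH.
    found = shutil.which("accelerate", path=path)
    if found:
        return found
    # 2. Fall back to the venv's bin directory, which is normally only on PATH
    #    once the venv has been activated.
    candidate = venv / "bin" / "accelerate"
    if candidate.is_file() and os.access(candidate, os.X_OK):
        return str(candidate)
    return None


if __name__ == "__main__":
    location = find_accelerate()
    if location is None:
        print("accelerate not found on PATH or in", DEFAULT_VENV)
    else:
        print("accelerate found at", location)
```

If the executable only turns up in the venv, the process that launched training was probably started without that venv on its PATH. If it is missing from both places, the venv itself may be incomplete.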
Hi,
I managed to start the Docker container, but I'm lost on what to do next. Can you advise how to run the UI to do the LoRA fine-tuning?
This is what I get when I try to do the log thing:
    return buttonbox(msg=msg,
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/easygui/boxes/button_box.py", line 95, in buttonbox
    bb = ButtonBox(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/easygui/boxes/button_box.py", line 147, in __init__
    self.ui = GUItk(msg, title, choices, images, default_choice, cancel_choice, self.callback_ui)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/easygui/boxes/button_box.py", line 263, in __init__
    self.boxRoot = tk.Tk()
  File "/usr/lib/python3.10/tkinter/__init__.py", line 2299, in __init__
    self.tk = _tkinter.create(screenName, baseName, className, interactive, wantobjects, useTk, sync, use)
_tkinter.TclError: no display name and no $DISPLAY environment variable
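The traceback ends with Tk refusing to open a window because $DISPLAY is not set, which is the usual state inside a container that has no X server. As a hedged sketch of how a script could detect this before calling anything that opens a Tk dialog (the function name has_display is hypothetical and is not part of kohya_ss or easygui):

```python
import os
from typing import Mapping


def has_display(env: Mapping[str, str] = os.environ) -> bool:
    """Return True if an X11 display is advertised via $DISPLAY.

    On Linux, tk.Tk() raises the TclError shown above when this variable
    is missing or empty.
    """
    return bool(env.get("DISPLAY"))


if __name__ == "__main__":
    if has_display():
        print("A display is available; Tk dialogs should be able to open.")
    else:
        print("No $DISPLAY set; skip GUI dialogs and type paths in manually.")
```

This only explains why the dialog fails. In a headless container the practical route is to avoid the button that opens a file dialog and enter the path by hand.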
I'm not joking, this worked less than a month ago; I had an SDXL LoRA running for 0.9 less than three weeks ago.
I installed the Docker container like so:
sudo docker run -d --name kohya --gpus all -v '/kohya/kohya-docker/workspace' -p 3000:3001 -p 8000:8000 -p 8888:8888 -p 2999:2999 ashleykza/kohya:latest
Also without the name:
sudo docker run -d --name kohya --gpus all -v '/kohya/kohya-docker/workspace' -p 3000:3001 -p 8000:8000 -p 8888:8888 -p 2999:2999 ashleykza/kohya:latest
I'm able to see RunPod, JupyterLab, and the RunPod Uploader, but the Kohya_ss web UI will not load. I am on localhost:3000. I tried from another device on my network and was unable to access the page. The logs just say "Container is READY!". It was working yesterday. I have recreated the container a few times.
Starting Jupyter Lab...
Jupyter Lab started
Starting RunPod Uploader...
RunPod Uploader started
Running pre-start script...
Template version: 24.0.6
Syncing kohya_ss to workspace, please wait...
Syncing Application Manager to workspace, please wait...
Fixing venv...
Fixing venv. Old Path: /kohya_ss/venv New Path: /workspace/kohya_ss/venv
Configuring accelerate...
Starting Kohya_ss Web UI
Kohya_ss started
Log file: /workspace/logs/kohya_ss.log
All services have been started
RUNPOD_PUBLIC_IP is not set. Skipping FileZilla configuration.
Updating rclone...
2024/04/27 22:49:13 NOTICE: rclone is up to date
Exporting environment variables...
Container is READY!
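One way to narrow this down, offered as a sketch rather than a diagnosis: probe each published host port from the docker run command above and see which ones accept a TCP connection. The helper name port_is_open is hypothetical, and the port list is taken from the -p flags in that command.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port is accepted within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Host-side ports from the `-p` flags in the docker run command above.
    for port in (3000, 8000, 8888, 2999):
        state = "open" if port_is_open("127.0.0.1", port) else "closed"
        print(f"localhost:{port} is {state}")
```

An accepted connection only shows that something on the host took the connection. Docker's port forwarding can accept it even when nothing is listening inside the container, so the Kohya_ss log named in the start-up output (/workspace/logs/kohya_ss.log) is still worth reading.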
I made all the folders and uploaded the data like I do on my local PC,
but I still get this error in Jupyter: "Image folder does not exist".
The stable-diffusion-docker image uses a different port for Kohya_ss than this image, so stopping Kohya_ss using application manager on this image uses the wrong port.
Need to find a solution to be able to use different ports for stable-diffusion-docker and kohya-docker.
Hi,
I used the prebuilt Docker image ashleykza/kohya:latest and it runs fine. However, when trying to train, I hit an issue that I believe is an incompatibility between CUDA 11.8 and my machine, which uses CUDA 11.6. Is the Dockerfile in this repo the same one that is used to build ashleykza/kohya:latest? I ask because I get the error "#0 324.1 E: Couldn't find any package by glob 'python3.10-venv'", and python3.10-venv is one of the packages to be installed. Thanks
Hey, would you mind running:
'rclone selfupdate'
Rclone needs to be on the latest version to work with Dropbox.
It would save us the trouble.
Thank you!
I would love cron to be added in order to schedule regular rclone transfers during training. Checkpoint files are huge, so it helps to start transferring them in parts regularly during training.