miccunifi / ladi-vton
[ACM MM 2023] - LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On
License: Other
SD-XL Inpaint:
Controlnet Inpaint:
Do I need to train to use these models, or is it enough for me to write a new pipeline? What should I do? Can you help me?
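For context, an off-the-shelf inpainting pipeline needs no extra training, though whether it matches try-on quality is a separate question. A minimal sketch assuming the public diffusers API; the checkpoint id and file names are placeholders, not part of this repo:

```python
import torch
from diffusers import AutoPipelineForInpainting
from PIL import Image

# Minimal off-the-shelf SD-XL inpainting pass (no training required).
# Checkpoint id and file names are assumptions, not from this repo.
pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("person.jpg").convert("RGB").resize((1024, 1024))
mask = Image.open("garment_mask.png").convert("L").resize((1024, 1024))

result = pipe(
    prompt="a person wearing a plain white t-shirt",
    image=image,
    mask_image=mask,
    num_inference_steps=30,
).images[0]
result.save("inpainted.png")
```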
The model trained on VITON-HD is amazing; it is definitely one of the best.
Is it just me, or are the results generated using DressCode nowhere near the VITON-HD model?
The model not being able to handle textual details on t-shirts is understandable.
But I have actually written a pre-processing pipeline in Colab and run inference on custom data, and even a small miss in the parsing generates a totally bad image. I am now getting perfect pre-processing, with the data size and positioning also right, but the DressCode model literally does not work with faces. All the faces in the final results are distorted, as attached below. Aren't the EMASC modules supposed to restore the face?
Is there anything that can be done to solve this issue, or is it not possible?
As stated in the caption, the figure you mention (Figure 7) is an ablation study on the Stable Diffusion VAE and the EMASC contribution.
The images in the figure are compressed and decompressed using the Stable Diffusion VAE without using the denoising network.
Alberto
Originally posted by @ABaldrati in #9 (comment)
We're running inference to recreate the paper results using the VITON-HD dataset (test_pairs.txt in our conda environment); however, the results appear to be slightly modified versions of the original clothing. It looks as if the image goes through a diffusion pass, but the target clothing is not applied.
Running on a Windows 11 PC with a 4090, following the default settings/commands provided.
Hi, thank you for your great work!
After running inference on the VITON-HD dataset with your released model, I got KID_p 0.0015 (1.08 in your paper) and KID_u 0.0018 (1.60 in your paper). Why is there such a big difference?
Thanks for any advice.
Hi, I found that a lot of information (such as logos and patterns) is lost in the results after warping. It seems that without refinement, using only TPS, more detail would be retained. Have the authors done any ablation experiments training with only TPS?
First of all, thank you for this amazing work! Do you plan to release the training code as well?
I want to use a pre-trained large model, but its input is generally square, while human-body images are generally rectangular. How do I process the image to meet the needs of the pre-trained model? Simply padding with blanks seems to make the whole image more sparse.
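One common approach is letterbox padding: pad the image to a square, remember the offsets, and crop back afterwards. A minimal sketch with Pillow; file names and the target size are placeholders:

```python
from PIL import Image

def pad_to_square(img: Image.Image, fill=(255, 255, 255)) -> Image.Image:
    """Letterbox a rectangular image onto a square canvas, centered."""
    side = max(img.size)
    canvas = Image.new("RGB", (side, side), fill)
    # Offsets center the original image; keep them to undo the padding later.
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    return canvas

square = pad_to_square(Image.open("person.jpg").convert("RGB"))
square = square.resize((512, 512))  # model input size is an assumption
```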
Thanks for your great work. I want to test your model trained on DressCode; however, I don't have the DressCode dataset, so I used vitonhd.py to load the VITON-HD data with the DressCode-pretrained model. Unluckily, the sleeves go beyond the original boundary. Can you give me some advice?
Another question: when I test your code with VITON-HD (vitonhd.py to load the data, model pretrained on VITON-HD), I find that the sleeves do not match my human parsing. Can I add depth/canny control to your diffusion model?
Thanks for any advice.
Hello, I'm a beginner in artificial intelligence. Your work is very good and I'm very interested in it, but while trying to reproduce it I always get an "out of memory" error when training the EMASC modules. Do you have any suggestions?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 23.69 GiB total capacity; 22.12 GiB already allocated; 103.25 MiB free; 22.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
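Common mitigations, hedged since the right one depends on the setup: lower the batch size, enable gradient checkpointing if the training script exposes it, and try the allocator hint that the error message itself suggests:

```python
import os

# Allocator hint from the error message; set it before the first CUDA
# allocation (or export it in the shell). 128 is an arbitrary starting
# value, not a recommendation from the authors.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```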
I would like to know how we can run inference on only one single image.
Can I use this project to try on tattoos? If yes, what do I need to do?
I have made a Colab pre-processing pipeline for LaDI-VTON that can run inference on custom data using the DressCode model.
Here is my Drive link: https://drive.google.com/drive/folders/19XL0kvTw6SoCCAOJY9FgvuQJ9M_JAZHt?usp=sharing
You will first need to make a copy of the drive in your Google Drive with the same name and use a GPU runtime on Colab.
I have made the pre-processing usable for the DressCode dataset.
Keep your input images in the /images folder and write the test pairs properly.
Then, after running ladi-vton_DressCode.ipynb, the input folder for inference will be generated automatically.
When running inference on custom data, LaDI-VTON messes up the faces, so I have made a refinement notebook using Google MediaPipe just for this purpose (a sketch of the idea follows at the end of this post).
The intermediate results after inference are in the results folder, and the final results after refinement will be in the final folder.
I have used this exact drive to generate some results, and it mostly works; there are some problems with a few specific garments.
I'm thinking of moving the drive into a GitHub repo after some time. If you have any doubts or suggestions, feel free to post them.
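For readers who want the gist without the notebook: the refinement idea amounts to detecting the face in the original photo and pasting those pixels back over the generated result. A rough sketch using MediaPipe's face detector; alignment and blending details are simplified:

```python
import cv2
import mediapipe as mp

def restore_face(original_bgr, generated_bgr):
    """Paste the detected face region from the original photo back into the
    generated try-on result. Assumes both images share the same size/pose."""
    detector = mp.solutions.face_detection.FaceDetection(
        model_selection=1, min_detection_confidence=0.5)
    res = detector.process(cv2.cvtColor(original_bgr, cv2.COLOR_BGR2RGB))
    if not res.detections:
        return generated_bgr
    h, w = original_bgr.shape[:2]
    box = res.detections[0].location_data.relative_bounding_box
    x = max(int(box.xmin * w), 0)
    y = max(int(box.ymin * h), 0)
    bw, bh = int(box.width * w), int(box.height * h)
    out = generated_bgr.copy()
    out[y:y + bh, x:x + bw] = original_bgr[y:y + bh, x:x + bw]
    return out
```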
Thank you for sharing your work. I'm now trying to train your module on 8 GPUs with 15 GB of memory each, but it shows an OOM error. Can you share your training GPU spec? How many GPUs did you use to train the VTO module?
Error Message:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 76: character maps to <undefined>
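That "charmap" error usually means a text file is being opened with Windows' default cp1252 codec instead of UTF-8. A hedged fix is to pass the encoding explicitly wherever the script opens text files; the path below is a placeholder:

```python
# Instead of open(path), which falls back to the locale codec (cp1252 on
# Windows), pass the encoding explicitly; the file name is a placeholder.
with open("test_pairs.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()
```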
Thanks to the authors for this influential work.
I would like to confirm whether the EMASC module is trained with a reconstruction loss, as the article mentions L1 and VGG losses. How is its input constructed: is it the model image I, or a mapping from I to \tilde{I}?
During the training of the enhanced Stable Diffusion pipeline, is the model image I used as input? I ask because a sampling operation on the model image I appears in Equations 3 and 4.
More generally, I am curious how the whole training process of diffusion-based models actually works. This is not limited to this work; I have similar confusion about other works and hope to get help from the authors here. My understanding of the diffusion mechanism is that it fits the distribution of the dataset, so what is the underlying principle by which it is applied to the try-on task? What is it fitting in a given task? In other words, similar to the second question: how is the ground-truth picture, i.e., the model picture I, used effectively?
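To make the mechanics concrete (a generic latent-diffusion sketch, not the authors' code): the ground-truth image I is encoded to a latent, that latent is noised, and the U-Net is trained to predict the noise given the conditioning; I enters training only through the noisy latent and the loss target. All module names below are placeholders:

```python
import torch
import torch.nn.functional as F

# Generic denoising training step for a latent inpainting-style model.
# `vae`, `unet`, `scheduler`, and the conditioning tensors are placeholders.
def training_step(vae, unet, scheduler, image_I, conditioning, text_emb):
    latents = vae.encode(image_I).latent_dist.sample() * 0.18215  # encode GT
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)  # forward diffusion on GT
    # Conditioning (mask, masked image, pose, warped garment) is concatenated
    # on the channel axis; the GT enters only via `noisy` and the target.
    model_in = torch.cat([noisy, conditioning], dim=1)
    pred = unet(model_in, t, encoder_hidden_states=text_emb).sample
    return F.mse_loss(pred, noise)  # fit the data distribution by denoising
```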
Hi, thank you for your nice work. I would like to ask how to obtain the text prompts for training; the VITON-HD dataset does not seem to provide them.
Hi! Congrats on the excellent paper!
Could you tell me how long an inference takes, and on which hardware?
Also, on which hardware did you train, and for how long?
Thanks!!
Hi, thank you for such nice work. I was wondering: have you tried your model on real-world data outside of the datasets mentioned in the paper, for both the target model and the garment?
Also, a question regarding training: did you merge those datasets for training, or did you train separate models for each dataset?
Hi,
I'm trying to train all the models with 1024x768 images. I managed to train TPS and EMASC at this shape with some code modifications, and training works well according to the metrics and visual results.
But it doesn't work at all for the inversion adapter and VTO: both trainings produce no loss reduction (close to constant with hard smoothing on wandb, and very oscillating without smoothing).
I also tested with the 512x384 shape, and it gives me the same results.
Is this an expected result?
I'm using the default parameters, except batch_size = 8 for VTO and batch_size = 1 for the inversion adapter, on a single A100 GPU. I assume a value greater than 1 could prevent this training issue, but my hardware doesn't allow a bigger one 😞
I tried reducing the learning rate, but it leads to the same issue.
Commands used to train the inversion adapter and VTO:
python src/train_inversion_adapter.py --dataset vitonhd --vitonhd_dataroot data/viton-hd/ --output_dir checkpoints/inverter_1024 --gradient_checkpointing --enable_xformers_memory_efficient_attention --use_clip_cloth_features --allow_tf32 --pretrained_model_name_or_path pretrained_models/stable-diffusion-2-inpainting/ --height 1024 --width 768 --train_batch_size 1 --test_batch_size 1
python src/train_vto.py --dataset vitonhd --vitonhd_dataroot data/viton-hd/ --output_dir checkpoints/vto_1024 --inversion_adapter_dir checkpoints/inverter_1024/ --gradient_checkpointing --enable_xformers_memory_efficient_attention --use_clip_cloth_features --height 1024 --width 768 --train_batch_size 8 --test_batch_size 8 --allow_tf32
Could you please help me resolve this problem?
Thanks for your clean work, by the way :)
How long until the training code is released?
To run the inference.py script, you must have at least the images in the "cloth", "image", "image-parse-v3", and "openpose_json" folders (I use the VITON-HD dataset). Everything is clear for the "cloth" and "image" folders, and I also learned how to obtain the "openpose_json" files. But I can't find how to get the images for "image-parse-v3". Help me, please.
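The image-parse-v3 maps are human-parsing segmentations. It is unclear which network produced VITON-HD's, but one publicly available substitute is a SegFormer fine-tuned for clothes parsing; the Hugging Face model id below is a hypothetical choice, not necessarily what the dataset used, and the resulting labels would still need remapping to VITON-HD's scheme:

```python
import torch
from PIL import Image
from transformers import AutoModelForSemanticSegmentation, SegformerImageProcessor

# Hypothetical parser choice; not necessarily what VITON-HD used.
model_id = "mattmdjaga/segformer_b2_clothes"
processor = SegformerImageProcessor.from_pretrained(model_id)
model = AutoModelForSemanticSegmentation.from_pretrained(model_id)

image = Image.open("person.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, num_labels, H/4, W/4)
# Upsample to the original resolution and take the per-pixel class.
up = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False)
parse = up.argmax(dim=1)[0]  # label map; remap labels to VITON-HD's scheme
```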
Hey, great project!
How can I run inference on a single image (not included in the training set)? (A minimal sketch follows below.)
Does the team have plans for a Google Colab?
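A hedged recipe, based on how the repo's dataloaders read test_pairs.txt: place one person image and one garment in the expected test folders and write a one-line pairs file, then run inference with batch size 1. File names below are placeholders:

```python
# Minimal single-pair setup (file names are placeholders). The loader reads
# "<person image> <garment image>" pairs from test_pairs.txt, so one line
# means one try-on. Keep the matching files in the dataset's test folders:
# <dataroot>/test/{image, cloth, image-parse-v3, openpose_json}.
with open("zalando-hd-resized/test_pairs.txt", "w", encoding="utf-8") as f:
    f.write("00001_00.jpg 00002_00.jpg\n")
```

Then run src/inference.py with --batch_size 1 as usual.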
Congrats on your work! In the paper, you mentioned that:
we propose to extend the kernel channels of the first convolutional layer by adding zero initialized weights to match the new input channel dimension
Will you also fine-tune the first convolutional layer, or the Stable Diffusion model, during training to accommodate the channel change? (A generic sketch of the zero-init extension appears below.)
BTW, will the code be released before the end of June?
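The quoted trick is easy to sketch generically: copy the pretrained kernel into the first channels of a wider convolution and zero-initialize the rest, so the extended inputs contribute nothing at step zero. The channel count below is an assumption for illustration:

```python
import torch
import torch.nn as nn

def extend_conv_in(old: nn.Conv2d, new_in_channels: int) -> nn.Conv2d:
    """Widen a conv layer's input channels; the extra kernel weights start at
    zero, so the pretrained behavior is unchanged at initialization."""
    new = nn.Conv2d(new_in_channels, old.out_channels,
                    old.kernel_size, old.stride, old.padding)
    with torch.no_grad():
        new.weight.zero_()
        new.weight[:, :old.in_channels].copy_(old.weight)
        new.bias.copy_(old.bias)
    return new

# e.g. unet.conv_in = extend_conv_in(unet.conv_in, 31)  # 31 is illustrative
```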
I followed all the instructions this project provides, and I get this error:
ValueError: torch.cuda.is_available() should be True but is False. xformers' memory efficient attention is only available for GPU
After that, I removed the --enable_xformers_memory_efficient_attention argument from my command, and the error changed to this:
...src/inference.py", line 226, in main
generator = torch.Generator("cuda").manual_seed(args.seed)
^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device type CUDA is not supported for torch.Generator() api.
I searched for that error and found the PyTorch MPS documentation, so I changed the code at src/inference.py:226 to:
generator = torch.Generator("mps").manual_seed(args.seed)
Maybe I've been trying something wrong, because that didn't work either. I was going to try without the GPU, but I haven't yet. Is there a way to disable CUDA/the GPU? (A device-fallback sketch follows after my environment details.)
Thank you
My environment:
Macbook Pro 14-inch, 2021
chip: Apple M1 Pro
os: 13.6 (22G120)
python: 3.11
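On Apple Silicon there is no CUDA, and torch.Generator support for MPS varies by PyTorch version, so a hedged workaround is to pick the device dynamically and keep the RNG on the CPU, which every build supports:

```python
import torch

# Pick the best available device; a CPU generator works everywhere and only
# affects where the initial noise is sampled, not where the model runs.
device = ("cuda" if torch.cuda.is_available()
          else "mps" if torch.backends.mps.is_available()
          else "cpu")
generator = torch.Generator("cpu").manual_seed(1234)  # seed is a placeholder
```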
Can I try on pants, and if so, how can I do it?
Hello! First, I want to congratulate you on these amazing results. I also wanted to ask when the training code will be available. Additionally, is this similar to fine-tuning the Stable Diffusion model, but on multiple concepts?
The Encoder uses get_down_block here with the parameter attn_num_head_channels, but get_down_block doesn't have such a parameter in newer versions of the diffusers library.
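One hedged way to paper over the rename without pinning diffusers is to dispatch on the installed signature; the rename targets below are assumptions, so check your diffusers changelog (and note that newer releases moved this module under diffusers.models.unets):

```python
import inspect
from diffusers.models.unet_2d_blocks import get_down_block  # older import path

def get_down_block_compat(head_channels, **kwargs):
    """Forward the attention-head argument under whichever keyword the
    installed diffusers version accepts."""
    params = inspect.signature(get_down_block).parameters
    for name in ("attn_num_head_channels", "attention_head_dim",
                 "num_attention_heads"):
        if name in params:
            kwargs[name] = head_channels
            break
    return get_down_block(**kwargs)
```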
Hi, I've been attempting to replicate the results you demonstrated in Figure 7 of your paper. However, the outcome is not as presented in the paper: specifically, the pattern on the T-shirt is not being reproduced.
Here is the same garment, which I found in zalando-hd:
Here is the result when I run your code.
Could you possibly assist or provide any guidance to address this issue? Thanks in advance.
Hi, thank you for your great work!
I was trying to write training code and do some training, but I was confused by "We first train the EMASC modules, the textual-inversion adapter, and the warping component. Then, we freeze all the weights of all modules except for the textual inversion adapter and train the proposed enhanced Stable Diffusion pipeline" in Section 4.2. Should I first freeze the other weights, including the U-Net, and train only the textual-inversion adapter, or should I freeze the other weights and train the textual-inversion adapter and the U-Net together?
I found that only the cloth, image, openpose-json, and image-parse-v3 data are needed when running inference on the VITON-HD dataset.
If I don't provide cloth-mask, image-parse-agnostic-v3.2, and so on, will that have any impact on the inference results?
Thank you.
Thanks for your great work. I had a hard time building the environment on Windows, and I noticed your yml file was built for Linux. Could you upload a yml file for Windows? Thanks in advance.
Thanks for your project! How can I use it to try on bottoms (pants, skirts) or dresses? I tested on the VITON-HD dataset and got only tops.
Hi,
Appreciate the great work and contribution.
I tested the ladi-vton model on a large number of images from the VITON-HD dataset. I am sharing below some that do not work properly.
What could be the reasons for these problems, and would fine-tuning or training with a larger number of sleeve/sleeveless combinations resolve this issue?
Hi, thanks for sharing your great work!
I'm very interested in exploring the application of LDMs to virtual try-on and was inspired by your work, but I'm confused by the second and third rows of Tab. 4 in your paper.
I notice the performance doesn't drop noticeably with empty strings (row 1) or textual elements (row 2). How can I obtain the textual elements? Maybe by passing the garment images directly through the VE to the U-Net?
Moreover, why does the performance drop dramatically with f_theta, even much worse than with empty strings?
Looking forward to your reply! Thank you again!
Hi,
Thanks for your great work and congrats on your acceptance.
May I know when you will release the training code?
What projects (neural networks) should I use to produce the image-parse-v3 images and openpose_json files? I want to use my own model images, but if I understand the project correctly, this requires their images in these formats.
I encountered the above situation when I ran the code.
Hi,
When are you planning to release the code?
I was just wondering when the training code is going to be released. Thank you in advance.
Hi, your work is wonderful! Here are some questions.
I noticed that declaring val_pipe in the training code as an instance of StableDiffusionTryOnePipeline occupies a very large amount of GPU memory, and inference.py itself also occupies a large amount of GPU memory when running. It would be much better, memory-wise, to replace the VAE that takes intermediate features with the original VAE. Have you noticed this? May I ask which GPU you use when running inference.py? (A hedged memory-saving sketch follows below.)
Thank you!
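An aside on the memory question: if StableDiffusionTryOnePipeline inherits diffusers' standard pipeline machinery, the usual memory-saving switches may apply; whether they work on this custom pipeline is an assumption:

```python
# Standard diffusers memory savers; whether the custom try-on pipeline
# exposes them is an assumption (`pipe` is your loaded pipeline).
pipe.enable_attention_slicing()  # lower peak attention memory, slightly slower
pipe.enable_vae_slicing()        # decode the batch one image at a time
```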
Hi folks,
I tried running inference on a single image taken from DressCode, with all the pre-processed data coming from the original source data itself plus minor tweaks, and I am getting unexpected results from the custom inference.
Even when I do the pre-processing myself, the results are similar. I have attached input and output images for reference (the pose map has 18 channels, so I couldn't visualize it properly here).
Can anyone help me here?
I'm getting this error when trying to run on the DressCode dataset. Can anyone help me with this issue? What am I doing wrong?
Here is the command that I used:
python src/inference.py --dataset dresscode --dresscode_dataroot ./DressCode/DressCode --output_dir ./results --test_order unpaired
I've added prints in the dataset.py code where the error was happening, and I got this:
By running the command:
python src/inference.py --dataset vitonhd --vitonhd_dataroot zalando-hd-resized --output_dir output --test_order paired --batch_size 1 --mixed_precision fp16
I found it does not work well with textual elements or letters, as in these bad cases:
That phenomenon is not mentioned in your paper. Is there any way to fix it?