miccunifi / ladi-vton
[ACM MM 2023] - LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On
License: Other
SD-XL Inpaint:
Controlnet Inpaint:
Do I need to train to use these models, or is it enough for me to write a new pipeline? What should I do? Can you help me?
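For context, an off-the-shelf inpainting pipeline needs no extra training, though whether it matches try-on quality is a separate question. A minimal sketch assuming the public diffusers API; the checkpoint id and file names are placeholders, not part of this repo:

```python
import torch
from diffusers import AutoPipelineForInpainting
from PIL import Image

# Minimal off-the-shelf SD-XL inpainting pass (no training required).
# Checkpoint id and file names are assumptions, not from this repo.
pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("person.jpg").convert("RGB").resize((1024, 1024))
mask = Image.open("garment_mask.png").convert("L").resize((1024, 1024))

result = pipe(
    prompt="a person wearing a plain white t-shirt",
    image=image,
    mask_image=mask,
    num_inference_steps=30,
).images[0]
result.save("inpainted.png")
```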
The model trained on VITON-HD is amazing; it is definitely one of the best.
Is it just me, or are the results generated using DressCode nowhere near the VITON-HD model?
The model not being able to handle textual details on t-shirts is understandable.
But I have actually written a pre-processing pipeline in Colab and run inference on custom data, and even a small miss in the parsing generates a totally bad image. I am now getting perfect pre-processing, with the data size and positioning also right, but the DressCode model literally does not work with faces. All the faces in the final results are distorted, as attached below. Aren't the EMASC modules supposed to restore the face?
Is there anything that can be done to solve this issue, or is it not possible?
As stated in the caption, the figure you mention (Figure 7) is an ablation study on the Stable Diffusion VAE and the EMASC contribution.
The images in the figure are compressed and decompressed using the Stable Diffusion VAE without using the denoising network.
Alberto
Originally posted by @ABaldrati in #9 (comment)
We're running inference to recreate the paper results using the VITON-HD dataset (test_pairs.txt in our conda environment); however, the results appear to be slightly modified versions of the original clothing. It looks as if the image goes through a diffusion pass, but the target clothing is not applied.
Running on a Windows 11 PC with a 4090, following the default settings/commands provided.
Hi, thank you for your great work!
After running inference on the VITON-HD dataset with your released model, I got KID_p 0.0015 (1.08 in your paper) and KID_u 0.0018 (1.60 in your paper). Why is there such a big difference?
Thanks for any advice.
Hi, I found that a lot of information (such as logos and patterns) is lost in the results after warping. It seems that without refinement, using only TPS, more detail would be retained. Have the authors done any ablation experiments training with only TPS?
First of all, thank you for this amazing work! Do you plan to release the training code as well?
I want to use a pre-trained large model, but its input is generally square, while human-body images are generally rectangular. How do I process the image to meet the needs of the pre-trained model? Simply padding with blanks seems to make the whole image more sparse.
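One common approach is letterbox padding: pad the image to a square, remember the offsets, and crop back afterwards. A minimal sketch with Pillow; file names and the target size are placeholders:

```python
from PIL import Image

def pad_to_square(img: Image.Image, fill=(255, 255, 255)) -> Image.Image:
    """Letterbox a rectangular image onto a square canvas, centered."""
    side = max(img.size)
    canvas = Image.new("RGB", (side, side), fill)
    # Offsets center the original image; keep them to undo the padding later.
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    return canvas

square = pad_to_square(Image.open("person.jpg").convert("RGB"))
square = square.resize((512, 512))  # model input size is an assumption
```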
Thanks for your great work. I want to test your model trained on DressCode; however, I don't have the DressCode dataset, so I used vitonhd.py to load the VITON-HD data with the DressCode-pretrained model. Unluckily, the sleeves go beyond the original boundary. Can you give me some advice?
Another question: when I test your code with VITON-HD (vitonhd.py to load the data, model pretrained on VITON-HD), I find that the sleeves do not match my human parsing. Can I add depth/canny control to your diffusion model?
Thanks for any advice.
Hello, I'm a beginner in artificial intelligence. Your work is very good and I'm very interested in it, but while trying to reproduce it I always get an "out of memory" error when training the EMASC modules. Do you have any suggestions?
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 23.69 GiB total capacity; 22.12 GiB already allocated; 103.25 MiB free; 22.16 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
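Common mitigations, hedged since the right one depends on the setup: lower the batch size, enable gradient checkpointing if the training script exposes it, and try the allocator hint that the error message itself suggests:

```python
import os

# Allocator hint from the error message; set it before the first CUDA
# allocation (or export it in the shell). 128 is an arbitrary starting
# value, not a recommendation from the authors.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
```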
I would like to know how we can run inference on only one single image.
Can I use this project to try on tattoos? If yes, what do I need to do?
I have made a Colab pre-processing pipeline for LaDI-VTON that can run inference on custom data using the DressCode model.
Here is my Drive link: https://drive.google.com/drive/folders/19XL0kvTw6SoCCAOJY9FgvuQJ9M_JAZHt?usp=sharing
You will first need to make a copy of the drive in your Google Drive with the same name and use a GPU runtime on Colab.
I have made the pre-processing usable for the DressCode dataset.
Keep your input images in the /images folder and write the test pairs properly.
Then, after running ladi-vton_DressCode.ipynb, the input folder for inference will be generated automatically.
When running inference on custom data, LaDI-VTON messes up the faces, so I have made a refinement notebook using Google MediaPipe just for this purpose (a sketch of the idea follows at the end of this post).
The intermediate results after inference are in the results folder, and the final results after refinement will be in the final folder.
I have used this exact drive to generate some results, and it mostly works; there are some problems with a few specific garments.
I'm thinking of moving the drive into a GitHub repo after some time. If you have any doubts or suggestions, feel free to post them.
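For readers who want the gist without the notebook: the refinement idea amounts to detecting the face in the original photo and pasting those pixels back over the generated result. A rough sketch using MediaPipe's face detector; alignment and blending details are simplified:

```python
import cv2
import mediapipe as mp

def restore_face(original_bgr, generated_bgr):
    """Paste the detected face region from the original photo back into the
    generated try-on result. Assumes both images share the same size/pose."""
    detector = mp.solutions.face_detection.FaceDetection(
        model_selection=1, min_detection_confidence=0.5)
    res = detector.process(cv2.cvtColor(original_bgr, cv2.COLOR_BGR2RGB))
    if not res.detections:
        return generated_bgr
    h, w = original_bgr.shape[:2]
    box = res.detections[0].location_data.relative_bounding_box
    x = max(int(box.xmin * w), 0)
    y = max(int(box.ymin * h), 0)
    bw, bh = int(box.width * w), int(box.height * h)
    out = generated_bgr.copy()
    out[y:y + bh, x:x + bw] = original_bgr[y:y + bh, x:x + bw]
    return out
```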
Thank you for sharing your work. I'm now trying to train your module on 8 GPUs with 15 GB of memory each, but it shows an OOM error. Can you share your training GPU spec? How many GPUs did you use to train the VTO module?
Error Message:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 76: character maps to <undefined>
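That "charmap" error usually means a text file is being opened with Windows' default cp1252 codec instead of UTF-8. A hedged fix is to pass the encoding explicitly wherever the script opens text files; the path below is a placeholder:

```python
# Instead of open(path), which falls back to the locale codec (cp1252 on
# Windows), pass the encoding explicitly; the file name is a placeholder.
with open("test_pairs.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()
```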
Thanks to the authors for this influential work.
I would like to confirm whether the EMASC module is trained with a reconstruction loss, as the article mentions L1 and VGG losses. How is its input constructed: is it the model image I, or a mapping from I to \tilde{I}?
During the training of the enhanced Stable Diffusion pipeline, is the model image I used as input? I ask because a sampling operation on the model image I appears in Equations 3 and 4.
More generally, I am curious how the whole training process of diffusion-based models actually works. This is not limited to this work; I have similar confusion about other works and hope to get help from the authors here. My understanding of the diffusion mechanism is that it fits the distribution of the dataset, so what is the underlying principle by which it is applied to the try-on task? What is it fitting in a given task? In other words, similar to the second question: how is the ground-truth picture, i.e., the model picture I, used effectively?
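To make the mechanics concrete (a generic latent-diffusion sketch, not the authors' code): the ground-truth image I is encoded to a latent, that latent is noised, and the U-Net is trained to predict the noise given the conditioning; I enters training only through the noisy latent and the loss target. All module names below are placeholders:

```python
import torch
import torch.nn.functional as F

# Generic denoising training step for a latent inpainting-style model.
# `vae`, `unet`, `scheduler`, and the conditioning tensors are placeholders.
def training_step(vae, unet, scheduler, image_I, conditioning, text_emb):
    latents = vae.encode(image_I).latent_dist.sample() * 0.18215  # encode GT
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)  # forward diffusion on GT
    # Conditioning (mask, masked image, pose, warped garment) is concatenated
    # on the channel axis; the GT enters only via `noisy` and the target.
    model_in = torch.cat([noisy, conditioning], dim=1)
    pred = unet(model_in, t, encoder_hidden_states=text_emb).sample
    return F.mse_loss(pred, noise)  # fit the data distribution by denoising
```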
Hi, thank you for your nice work. I would like to ask how to obtain the text prompts for training; the VITON-HD dataset does not seem to provide them.
Hi! Congrats on the excellent paper!
Could you tell me how long an inference takes, and on which hardware?
Also, on which hardware did you train, and for how long?
Thanks!!
Hi, thank you for such nice work. I was wondering: have you tried your model on real-world data outside of the datasets mentioned in the paper, for both the target model and the garment?
Also, a question regarding training: did you merge those datasets for training, or did you train separate models for each dataset?
Hi,
I'm trying to train all the models with 1024x768 images. I managed to train TPS and EMASC at this shape with some code modifications, and training works well according to the metrics and visual results.
But it doesn't work at all for the inversion adapter and VTO: both trainings produce no loss reduction (close to constant with hard smoothing on wandb, and very oscillating without smoothing).
I also tested with the 512x384 shape, and it gives me the same results.
Is this an expected result?
I'm using the default parameters, except batch_size = 8 for VTO and batch_size = 1 for the inversion adapter, on a single A100 GPU. I assume a value greater than 1 could prevent this training issue, but my hardware doesn't allow a bigger one 😞
I tried reducing the learning rate, but it leads to the same issue.
Commands used to train the inversion adapter and VTO:
python src/train_inversion_adapter.py --dataset vitonhd --vitonhd_dataroot data/viton-hd/ --output_dir checkpoints/inverter_1024 --gradient_checkpointing --enable_xformers_memory_efficient_attention --use_clip_cloth_features --allow_tf32 --pretrained_model_name_or_path pretrained_models/stable-diffusion-2-inpainting/ --height 1024 --width 768 --train_batch_size 1 --test_batch_size 1
python src/train_vto.py --dataset vitonhd --vitonhd_dataroot data/viton-hd/ --output_dir checkpoints/vto_1024 --inversion_adapter_dir checkpoints/inverter_1024/ --gradient_checkpointing --enable_xformers_memory_efficient_attention --use_clip_cloth_features --height 1024 --width 768 --train_batch_size 8 --test_batch_size 8 --allow_tf32
Could you please help me resolve this problem?
Thanks for your clean work, by the way :)
How long until the training code is released?
To run the inference.py script, you must have at least the images in the "cloth", "image", "image-parse-v3", and "openpose_json" folders (I use the VITON-HD dataset). Everything is clear for the "cloth" and "image" folders, and I also learned how to obtain the "openpose_json" files. But I can't find how to get the images for "image-parse-v3". Help me, please.
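The image-parse-v3 maps are human-parsing segmentations. It is unclear which network produced VITON-HD's, but one publicly available substitute is a SegFormer fine-tuned for clothes parsing; the Hugging Face model id below is a hypothetical choice, not necessarily what the dataset used, and the resulting labels would still need remapping to VITON-HD's scheme:

```python
import torch
from PIL import Image
from transformers import AutoModelForSemanticSegmentation, SegformerImageProcessor

# Hypothetical parser choice; not necessarily what VITON-HD used.
model_id = "mattmdjaga/segformer_b2_clothes"
processor = SegformerImageProcessor.from_pretrained(model_id)
model = AutoModelForSemanticSegmentation.from_pretrained(model_id)

image = Image.open("person.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, num_labels, H/4, W/4)
# Upsample to the original resolution and take the per-pixel class.
up = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False)
parse = up.argmax(dim=1)[0]  # label map; remap labels to VITON-HD's scheme
```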
Hey, great project!
How can I run inference on a single image (not included in the training set)? (A minimal sketch follows below.)
Does the team have plans for a Google Colab?
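A hedged recipe, based on how the repo's dataloaders read test_pairs.txt: place one person image and one garment in the expected test folders and write a one-line pairs file, then run inference with batch size 1. File names below are placeholders:

```python
# Minimal single-pair setup (file names are placeholders). The loader reads
# "<person image> <garment image>" pairs from test_pairs.txt, so one line
# means one try-on. Keep the matching files in the dataset's test folders:
# <dataroot>/test/{image, cloth, image-parse-v3, openpose_json}.
with open("zalando-hd-resized/test_pairs.txt", "w", encoding="utf-8") as f:
    f.write("00001_00.jpg 00002_00.jpg\n")
```

Then run src/inference.py with --batch_size 1 as usual.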
Congrats on your work! In the paper, you mentioned that:
we propose to extend the kernel channels of the first convolutional layer by adding zero initialized weights to match the new input channel dimension
Will you also fine-tune the first convolutional layer, or the Stable Diffusion model, during training to accommodate the channel change? (A generic sketch of the zero-init extension appears below.)
BTW, will the code be released before the end of June?
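The quoted trick is easy to sketch generically: copy the pretrained kernel into the first channels of a wider convolution and zero-initialize the rest, so the extended inputs contribute nothing at step zero. The channel count below is an assumption for illustration:

```python
import torch
import torch.nn as nn

def extend_conv_in(old: nn.Conv2d, new_in_channels: int) -> nn.Conv2d:
    """Widen a conv layer's input channels; the extra kernel weights start at
    zero, so the pretrained behavior is unchanged at initialization."""
    new = nn.Conv2d(new_in_channels, old.out_channels,
                    old.kernel_size, old.stride, old.padding)
    with torch.no_grad():
        new.weight.zero_()
        new.weight[:, :old.in_channels].copy_(old.weight)
        new.bias.copy_(old.bias)
    return new

# e.g. unet.conv_in = extend_conv_in(unet.conv_in, 31)  # 31 is illustrative
```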
I followed all the instructions this project provides, and I get this error:
ValueError: torch.cuda.is_available() should be True but is False. xformers' memory efficient attention is only available for GPU
After that, I removed the --enable_xformers_memory_efficient_attention argument from my command, and the error changed to this:
...src/inference.py", line 226, in main
generator = torch.Generator("cuda").manual_seed(args.seed)
^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Device type CUDA is not supported for torch.Generator() api.
I searched for that error and found the PyTorch MPS documentation, so I changed the code at src/inference.py:226 to:
generator = torch.Generator("mps").manual_seed(args.seed)
Maybe I've been trying something wrong, because that didn't work either. I was going to try without the GPU, but I haven't yet. Is there a way to disable CUDA/the GPU? (A device-fallback sketch follows after my environment details.)
Thank you
My environment:
Macbook Pro 14-inch, 2021
chip: Apple M1 Pro
os: 13.6 (22G120)
python: 3.11
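On Apple Silicon there is no CUDA, and torch.Generator support for MPS varies by PyTorch version, so a hedged workaround is to pick the device dynamically and keep the RNG on the CPU, which every build supports:

```python
import torch

# Pick the best available device; a CPU generator works everywhere and only
# affects where the initial noise is sampled, not where the model runs.
device = ("cuda" if torch.cuda.is_available()
          else "mps" if torch.backends.mps.is_available()
          else "cpu")
generator = torch.Generator("cpu").manual_seed(1234)  # seed is a placeholder
```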
Can I try on pants, and if so, how can I do it?
Hello! First, I want to congratulate you on these amazing results. I also wanted to ask when the training code will be available. Additionally, is this similar to fine-tuning the Stable Diffusion model, but on multiple concepts?
The Encoder uses get_down_block here with the parameter attn_num_head_channels, but get_down_block doesn't have such a parameter in newer versions of the diffusers library.
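One hedged way to paper over the rename without pinning diffusers is to dispatch on the installed signature; the rename targets below are assumptions, so check your diffusers changelog (and note that newer releases moved this module under diffusers.models.unets):

```python
import inspect
from diffusers.models.unet_2d_blocks import get_down_block  # older import path

def get_down_block_compat(head_channels, **kwargs):
    """Forward the attention-head argument under whichever keyword the
    installed diffusers version accepts."""
    params = inspect.signature(get_down_block).parameters
    for name in ("attn_num_head_channels", "attention_head_dim",
                 "num_attention_heads"):
        if name in params:
            kwargs[name] = head_channels
            break
    return get_down_block(**kwargs)
```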
Hi, I've been attempting to replicate the results you demonstrated in Figure 7 of your paper. However, the outcome is not as presented in the paper: specifically, the pattern on the T-shirt is not being reproduced.
Here is the same garment, which I found in zalando-hd:
Here is the result when I run your code.
Could you possibly assist or provide any guidance to address this issue? Thanks in advance.
Hi, thank you for your great work!
I was trying to write training code and do some training, but I was confused by "We first train the EMASC modules, the textual-inversion adapter, and the warping component. Then, we freeze all the weights of all modules except for the textual inversion adapter and train the proposed enhanced Stable Diffusion pipeline" in Section 4.2. Should I first freeze the other weights, including the U-Net, and train only the textual-inversion adapter, or should I freeze the other weights and train the textual-inversion adapter and the U-Net together?
I found that only the cloth, image, openpose-json, and image-parse-v3 data are needed when running inference on the VITON-HD dataset.
If I don't provide cloth-mask, image-parse-agnostic-v3.2, and so on, will that have any impact on the inference results?
Thank you.
Thanks for your great work. I had a hard time building the environment on Windows, and I noticed your yml file was built for Linux. Could you upload a yml file for Windows? Thanks in advance.
Thanks for your project! How can I use it to try on bottoms (pants, skirts) or dresses? I tested on the VITON-HD dataset and got only tops.
Hi,
Appreciate the great work and contribution.
I tested the ladi-vton model on a large number of images from the VITON-HD dataset. I am sharing below some that do not work properly.
What could be the reasons for these problems, and would fine-tuning or training with a larger number of sleeve/sleeveless combinations resolve this issue?
Hi, thanks for sharing your great work!
I'm very interested in exploring the application of LDMs to virtual try-on and was inspired by your work, but I'm confused by the second and third rows of Tab. 4 in your paper.
I notice the performance doesn't drop noticeably with empty strings (row 1) or textual elements (row 2). How can I obtain the textual elements? Maybe by passing the garment images directly through the VE to the U-Net?
Moreover, why does the performance drop dramatically with f_theta, even much worse than with empty strings?
Looking forward to your reply! Thank you again!
Hi,
Thanks for your great work and congrats on your acceptance.
May I know when you will release the training code?
What projects (neural networks) should I use to produce the image-parse-v3 images and openpose_json files? I want to use my own model images, but if I understand the project correctly, this requires their images in these formats.
I encountered the above situation when I ran the code.
Hi,
When are you planning to release the code?
I was just wondering when the training code is going to be released. Thank you in advance.
Hi, your work is wonderful! Here are some questions.
I noticed that declaring val_pipe in the training code as an instance of StableDiffusionTryOnePipeline occupies a very large amount of GPU memory, and inference.py itself also occupies a large amount of GPU memory when running. It would be much better, memory-wise, to replace the VAE that takes intermediate features with the original VAE. Have you noticed this? May I ask which GPU you use when running inference.py? (A hedged memory-saving sketch follows below.)
Thank you!
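An aside on the memory question: if StableDiffusionTryOnePipeline inherits diffusers' standard pipeline machinery, the usual memory-saving switches may apply; whether they work on this custom pipeline is an assumption:

```python
# Standard diffusers memory savers; whether the custom try-on pipeline
# exposes them is an assumption (`pipe` is your loaded pipeline).
pipe.enable_attention_slicing()  # lower peak attention memory, slightly slower
pipe.enable_vae_slicing()        # decode the batch one image at a time
```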
Hi folks,
I tried running inference on a single image taken from DressCode, with all the pre-processed data coming from the original source data itself plus minor tweaks, and I am getting unexpected results from the custom inference.
Even when I do the pre-processing myself, the results are similar. I have attached input and output images for reference (the pose map has 18 channels, so I couldn't visualize it properly here).
Can anyone help me here?
I'm getting this error when trying to run on the DressCode dataset. Can anyone help me with this issue? What am I doing wrong?
Here is the command that I used:
python src/inference.py --dataset dresscode --dresscode_dataroot ./DressCode/DressCode --output_dir ./results --test_order unpaired
I've added prints in the dataset.py code where the error was happening, and I got this:
By running the command:
python src/inference.py --dataset vitonhd --vitonhd_dataroot zalando-hd-resized --output_dir output --test_order paired --batch_size 1 --mixed_precision fp16
I found it does not work well with textual elements or letters, as in these bad cases:
That phenomenon is not mentioned in your paper. Is there any way to fix it?