
showlab / tune-a-video


[ICCV 2023] Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

Home Page: https://tuneavideo.github.io

License: Apache License 2.0

Python 88.29% Jupyter Notebook 11.71%

tune-a-video's Introduction

Tune-A-Video

This repository is the official implementation of Tune-A-Video.

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

Project Website arXiv Hugging Face Spaces Open In Colab


Given a video-text pair as input, our method, Tune-A-Video, fine-tunes a pre-trained text-to-image diffusion model for text-to-video generation.

News

🚨 Announcing LOVEU-TGVE: A CVPR competition for AI-based video editing! Submissions due Jun 5. Don't miss out! 🤩

  • [02/22/2023] Improved consistency using DDIM inversion.
  • [02/08/2023] Colab demo released!
  • [02/03/2023] Pre-trained Tune-A-Video models are available on Hugging Face Library!
  • [01/28/2023] New Feature: tune a video on personalized DreamBooth models.
  • [01/28/2023] Code released!

Setup

Requirements

pip install -r requirements.txt

Installing xformers is highly recommended for more efficiency and speed on GPUs. To enable xformers, set enable_xformers_memory_efficient_attention=True (default).

Weights

[Stable Diffusion] Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. The pre-trained Stable Diffusion models can be downloaded from Hugging Face (e.g., Stable Diffusion v1-4, v2-1). You can also use fine-tuned Stable Diffusion models trained on different styles (e.g., Modern Disney, Anything V4.0, Redshift, etc.).

[DreamBooth] DreamBooth is a method to personalize text-to-image models like Stable Diffusion given just a few images (3~5 images) of a subject. Tuning a video on DreamBooth models allows personalized text-to-video generation of a specific subject. There are some public DreamBooth models available on Hugging Face (e.g., mr-potato-head). You can also train your own DreamBooth model following this training example.

Usage

Training

To fine-tune the text-to-image diffusion models for text-to-video generation, run this command:

accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml"

Note: Tuning a 24-frame video usually takes 300~500 steps, about 10~15 minutes using one A100 GPU. Reduce n_sample_frames if your GPU memory is limited.
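If you need to lower n_sample_frames (or tweak other settings) without editing the shipped YAML by hand, one option is to load and modify the config with OmegaConf, which the training script itself uses to parse configs. A minimal sketch, assuming the key lives under a train_data section (check configs/man-skiing.yaml for the actual layout):

# Minimal sketch: derive a lower-memory config from the shipped one.
# The exact key path of n_sample_frames is an assumption -- verify it against
# configs/man-skiing.yaml before relying on this.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/man-skiing.yaml")
print(OmegaConf.to_yaml(cfg))                 # inspect the full config
cfg.train_data.n_sample_frames = 12           # assumed key path; adjust if needed
OmegaConf.save(cfg, "configs/man-skiing-12frames.yaml")

The derived file can then be passed to accelerate launch train_tuneavideo.py --config="configs/man-skiing-12frames.yaml".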

Inference

Once the training is done, run inference:

from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid
import torch

pretrained_model_path = "./checkpoints/stable-diffusion-v1-4"
my_model_path = "./outputs/man-skiing"
unet = UNet3DConditionModel.from_pretrained(my_model_path, subfolder='unet', torch_dtype=torch.float16).to('cuda')
pipe = TuneAVideoPipeline.from_pretrained(pretrained_model_path, unet=unet, torch_dtype=torch.float16).to("cuda")
pipe.enable_xformers_memory_efficient_attention()
pipe.enable_vae_slicing()

prompt = "spider man is skiing"
ddim_inv_latent = torch.load(f"{my_model_path}/inv_latents/ddim_latent-500.pt").to(torch.float16)
video = pipe(prompt, latents=ddim_inv_latent, video_length=24, height=512, width=512, num_inference_steps=50, guidance_scale=12.5).videos

save_videos_grid(video, f"./{prompt}.gif")

Results

Pretrained T2I (Stable Diffusion)

Input Video: "A man is skiing"
Output Videos: "Spider Man is skiing on the beach, cartoon style"; "Wonder Woman, wearing a cowboy hat, is skiing"; "A man, wearing pink clothes, is skiing at sunset"

Input Video: "A rabbit is eating a watermelon on the table"
Output Videos: "A rabbit is eating a watermelon on the table"; "A cat with sunglasses is eating a watermelon on the beach"; "A puppy is eating a cheeseburger on the table, comic style"

Input Video: "A jeep car is moving on the road"
Output Videos: "A Porsche car is moving on the beach"; "A car is moving on the road, cartoon style"; "A car is moving on the snow"

Input Video: "A man is dribbling a basketball"
Output Videos: "James Bond is dribbling a basketball on the beach"; "An astronaut is dribbling a basketball, cartoon style"; "A lego man in a black suit is dribbling a basketball"

Pretrained T2I (personalized DreamBooth)

Input Video: "A bear is playing guitar"
Output Videos: "1girl is playing guitar, white hair, medium hair, cat ears, closed eyes, cute, scarf, jacket, outdoors, streets"; "1boy is playing guitar, bishounen, casual, indoors, sitting, coffee shop, bokeh"; "1girl is playing guitar, red hair, long hair, beautiful eyes, looking at viewer, cute, dress, beach, sea"

Input Video: "A bear is playing guitar"
Output Videos: "A rabbit is playing guitar, modern disney style"; "A handsome prince is playing guitar, modern disney style"; "A magic princess with sunglasses is playing guitar on the stage, modern disney style"

Input Video: "A bear is playing guitar"
Output Videos: "Mr Potato Head, made of lego, is playing guitar on the snow"; "Mr Potato Head, wearing sunglasses, is playing guitar on the beach"; "Mr Potato Head is playing guitar in the starry night, Van Gogh style"

Citation

If you make use of our work, please cite our paper.

@inproceedings{wu2023tune,
  title={Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation},
  author={Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Shi, Yufei and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={7623--7633},
  year={2023}
}

Shoutouts

tune-a-video's People

Contributors

hysts, stanlei52, zhangjiewu


tune-a-video's Issues

attention_mask is None during training

In the paper, it is mentioned that the ST-Attn layer only considers the previous frame and the first frame, but in the code it seems that, during training, the attention_mask for the SC-Attn layer is None. Could you explain why this setting is not the same as in the paper?
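For context, the sparse-causal design means each frame's self-attention keys and values are built only from the first frame and the immediately previous frame. One common way to implement that is to gather the key/value tokens of those two frames directly, in which case no explicit attention_mask is needed to obtain the behavior described in the paper. A minimal sketch of that key/value construction (illustrative only, not the repository's exact code):

import torch
from einops import rearrange

def sparse_causal_kv(key: torch.Tensor, value: torch.Tensor, video_length: int):
    # key/value: (batch * frames, tokens, dim), as produced by a per-frame attention layer
    key = rearrange(key, "(b f) n d -> b f n d", f=video_length)
    value = rearrange(value, "(b f) n d -> b f n d", f=video_length)

    first = [0] * video_length                            # every frame sees frame 0
    prev = [max(i - 1, 0) for i in range(video_length)]   # ... and its previous frame

    key = torch.cat([key[:, first], key[:, prev]], dim=2)
    value = torch.cat([value[:, first], value[:, prev]], dim=2)

    key = rearrange(key, "b f n d -> (b f) n d")
    value = rearrange(value, "b f n d -> (b f) n d")
    return key, value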

FileNotFoundError: [WinError 3] The system cannot find the path specified: ''

Traceback (most recent call last):
  File "C:\Users\coron\Desktop\Tune-A-Video\inference.py", line 15, in <module>
    save_videos_grid(video, f"{prompt}.gif")
  File "C:\Users\coron\Desktop\Tune-A-Video\tuneavideo\util.py", line 21, in save_videos_grid
    os.makedirs(os.path.dirname(path), exist_ok=True)
  File "C:\Users\coron\AppData\Local\Programs\Python\Python310\lib\os.py", line 225, in makedirs
    mkdir(name, mode)
FileNotFoundError: [WinError 3] The system cannot find the path specified: ''


The script I executed, inference.py:

from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid
import torch

model_id = "outputs/man-surfing_lr3e-5_seed33/2023-01-30T23-03-19"
unet = UNet3DConditionModel.from_pretrained(model_id, subfolder='unet', torch_dtype=torch.float16).to('cuda')
unet.enable_xformers_memory_efficient_attention()
pipe = TuneAVideoPipeline.from_pretrained("checkpoints/stable-diffusion-v1-4", unet=unet, torch_dtype=torch.float16).to("cuda")

torch.cuda.manual_seed(0)
prompt = "a panda is surfing"
video = pipe(prompt, video_length=8, height=512, width=512, num_inference_steps=50, guidance_scale=7.5).videos

save_videos_grid(video, f"{prompt}.gif")

I don't know if this is relevant, but here is Tune-A-Video\tuneavideo\util.py:

import os
import imageio
import numpy as np
import torch
import torchvision

from einops import rearrange

def save_videos_grid(videos: torch.Tensor, path: str, rescale=False, n_rows=4, fps=3):
    videos = rearrange(videos, "b c t h w -> t b c h w")
    outputs = []
    for x in videos:
        x = torchvision.utils.make_grid(x, nrow=n_rows)
        x = x.transpose(0, 1).transpose(1, 2).squeeze(-1)
        if rescale:
            x = (x + 1.0) / 2.0  # -1,1 -> 0,1
        x = (x * 255).numpy().astype(np.uint8)
        outputs.append(x)

    os.makedirs(os.path.dirname(path), exist_ok=True)
    imageio.mimsave(path, outputs, fps=fps)
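For reference, the failure above happens because os.path.dirname(path) returns an empty string when path is a bare filename such as "a panda is surfing.gif", so os.makedirs("") raises WinError 3. A minimal workaround (a sketch, not an official fix) is to prefix the output path with "./", as in the README example, or to guard the makedirs call:

import os

def ensure_parent_dir(path: str) -> None:
    # os.path.dirname returns '' for a bare filename; only create a directory
    # when the path actually contains a parent component.
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)

# Alternatively, save_videos_grid(video, f"./{prompt}.gif") works as-is,
# because os.path.dirname("./x.gif") == "."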

Error no file named scheduler_config.json found

I downloaded the stable-diffusion-v1-4 checkpoints "sd-v1-4-full-ema.ckpt" and "sd-v1-4.ckpt" and put them into the folder "./checkpoints/stable-diffusion-v1-4". However, when I run accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml", I get the following error:

Traceback (most recent call last):
  File "/home/ubuntu/project/Tune-A-Video/train_tuneavideo.py", line 367, in <module>
    main(**OmegaConf.load(args.config))
  File "/home/ubuntu/project/Tune-A-Video/train_tuneavideo.py", line 105, in main
    noise_scheduler = DDPMScheduler.from_pretrained(pretrained_model_path, subfolder="scheduler")
  File "/opt/conda/envs/makevideo/lib/python3.9/site-packages/diffusers/schedulers/scheduling_utils.py", line 118, in from_pretrained
    config, kwargs = cls.load_config(
  File "/opt/conda/envs/makevideo/lib/python3.9/site-packages/diffusers/configuration_utils.py", line 320, in load_config
    raise EnvironmentError(
OSError: Error no file named scheduler_config.json found in directory ./checkpoints/stable-diffusion-v1-4.

Traceback (most recent call last):
  File "/opt/conda/envs/makevideo/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/makevideo/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/opt/conda/envs/makevideo/lib/python3.9/site-packages/accelerate/commands/launch.py", line 915, in launch_command
    simple_launcher(args)
  File "/opt/conda/envs/makevideo/lib/python3.9/site-packages/accelerate/commands/launch.py", line 578, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/envs/makevideo/bin/python3.9', 'train_tuneavideo.py', '--config=configs/man-skiing.yaml']' returned non-zero exit status 1.

Where can I get scheduler_config.json?
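For context: sd-v1-4.ckpt and sd-v1-4-full-ema.ckpt are single-file checkpoints, while the training script expects the diffusers-format directory layout (unet/, vae/, text_encoder/, tokenizer/, scheduler/scheduler_config.json, ...). One way to fetch that layout, assuming a reasonably recent huggingface_hub and that the model license has been accepted on the Hub, is a snapshot download (sketch):

# Sketch: download the diffusers-format Stable Diffusion v1-4 repository
# (with unet/, vae/, scheduler/, etc.) instead of the single .ckpt files.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="CompVis/stable-diffusion-v1-4",
    local_dir="./checkpoints/stable-diffusion-v1-4",
)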

Help: how can I solve this?

(video) D:\pythonProject\video>accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml"
The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 0
--num_machines was set to a value of 1
--mixed_precision was set to a value of 'no'
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
04/11/2023 13:50:51 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cpu

Mixed precision type: fp16

Traceback (most recent call last):
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\site-packages\diffusers\configuration_utils.py", line 326, in load_config
    config_file = hf_hub_download(
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\site-packages\huggingface_hub\utils\_validators.py", line 112, in _inner_fn
    validate_repo_id(arg_value)
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\site-packages\huggingface_hub\utils\_validators.py", line 160, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': './checkpoints/stable-diffusion-v1-4'. Use repo_type argument if needed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_tuneavideo.py", line 367, in <module>
    main(**OmegaConf.load(args.config))
  File "train_tuneavideo.py", line 105, in main
    noise_scheduler = DDPMScheduler.from_pretrained(pretrained_model_path, subfolder="scheduler")
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\site-packages\diffusers\schedulers\scheduling_utils.py", line 118, in from_pretrained
    config, kwargs = cls.load_config(
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\site-packages\diffusers\configuration_utils.py", line 363, in load_config
    raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like ./checkpoints/stable-diffusion-v1-4 is not the path to a directory containing a scheduler_config.json file.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/diffusers/installation#offline-mode'.

Traceback (most recent call last):
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\site-packages\accelerate\commands\launch.py", line 923, in launch_command
    simple_launcher(args)
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\site-packages\accelerate\commands\launch.py", line 579, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\YFT-IT-002\anaconda3\envs\video\python.exe', 'train_tuneavideo.py', '--config=configs/man-skiing.yaml']' returned non-zero exit status 1.

Model compatibility.

I tried about four different SD models and none of them worked, but it works with the standard SD 1.4 model. Any help? How can I use different custom SD models?

Diffuser issue

I have installed the dev version of diffusers from Hugging Face and I get this error: "ModuleNotFoundError: No module named 'diffusers.modeling_utils'"

How should I solve it?

Model download error: how can I solve it?

We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like ./checkpoints/stable-diffusion-v1-4 is not the path to a directory containing a scheduler_config.json file.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/diffusers/installation#offline-mode'.

The address above is reachable from my machine, but the error above still occurs. How can I solve it?

ValueError: CrossAttnDownBlock2D does not exist.

Code:

from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid
import torch

model_id = "CompVis/stable-diffusion-v1-4"
unet = UNet3DConditionModel.from_pretrained(model_id, subfolder='unet', torch_dtype=torch.float16).to('cuda')
pipe = TuneAVideoPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", unet=unet, torch_dtype=torch.float16).to("cuda")

prompt = "a panda is surfing"
video = pipe(prompt, video_length=8, height=512, width=512, num_inference_steps=50, guidance_scale=7.5).videos

save_videos_grid(video, f"{prompt}.gif")

Output:

ValueError: CrossAttnDownBlock2D does not exist.
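For reference, the README's inference snippet loads the 3D UNet from the output directory of a finished Tune-A-Video run, not from the original Stable Diffusion repository: the stock 2D UNet config lists CrossAttnDownBlock2D blocks that UNet3DConditionModel does not define (the training script inflates the 2D weights via from_pretrained_2d instead, as a traceback further down this page shows). A sketch of the intended loading pattern, with placeholder paths:

# Sketch: the 3D UNet comes from a finished Tune-A-Video training run, while the
# base Stable Diffusion weights supply the rest of the pipeline. Paths are placeholders.
from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
import torch

base_model = "CompVis/stable-diffusion-v1-4"   # 2D text-to-image weights
finetuned = "./outputs/man-surfing"            # output_dir of a training run (placeholder)

unet = UNet3DConditionModel.from_pretrained(finetuned, subfolder="unet", torch_dtype=torch.float16).to("cuda")
pipe = TuneAVideoPipeline.from_pretrained(base_model, unet=unet, torch_dtype=torch.float16).to("cuda")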

CUDA error: invalid argument

I have run into some problems when running the code. I think there may be something wrong with "accelerator.backward(loss)" at line 294 of train_tuneavideo.py. I would appreciate your valuable opinions. Thank you.

Question about the results

I followed your README without modifying any of your code. Here are the final results, which are not that promising. What's more, I find that sample-100, sample-200, and sample-300 did not change much. Can you share how to improve them?

(attached sample: sample-300)

Improved Consistency using DDIM inversion (?)

Hi, thank you so much for this impressive work!

I noticed there was an updated 'News' entry about 'Improved consistency using DDIM inversion'.
Can you explain a bit more about this update? What I understood is: before, DDPM inversion (DDPM forward and reverse); after, DDIM inversion (DDIM forward and reverse).
Am I right?
Also, is the DDIM sampler then used in both fine-tuning and inference?

Thank you again for nice works.

In the training step, do you shuffle clips?

Hi, thank you first for amazing works.

I have a few questions about your training process.
(1) Did you fix the number of frames (clips) at 24? In every config file, the clip length is consistently 24.
Does that imply that any number bigger or smaller than 24 doesn't perform as well as 24?

(2) In the training step, do you shuffle the order of frames (clips)? I have a feeling that it is not proper to shuffle the frames, because the frame-related attention parts also learn the order of frames.

Thank you again.

Support Stable Diffusion 2

Your work is great! I have easily generated anime with my model, Cool Japan Diffusion 1.x.

I would like to generate anime with Stable Diffusion 2.1, because I am developing Cool Japan Diffusion 2.x, which is based on Stable Diffusion 2.1.
Do you plan to support Stable Diffusion 2.1?

ask for help

RuntimeError: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 23.70 GiB total capacity; 8.31 GiB already allocated; 254.06 MiB free; 8.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

My server environment is shown in the attached screenshot.
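A generic starting point, following the allocator hint in the error message itself (the value below is illustrative, not a repository recommendation), combined with reducing n_sample_frames as suggested in the README:

import os

# Must be set before the first CUDA allocation (e.g. at the very top of the training
# script, or exported in the shell before `accelerate launch`).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"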

Attention weight in CrossAttnDownBlock3D is not trained?

Hi, Congrats on this awesome work!

One question: for this line, it seems the gradient is not going through.

I'm not familiar with torch.utils.checkpoint.checkpoint. Is it something that recomputes the forward pass during backward while running the original forward under no_grad()?

However, even so, I didn't observe any weight change after one iteration of training. Specifically, unet.down_blocks[0].attentions[0].transformer_blocks[0].attn1.to_q.weight is not changed.

Is this correct, or am I missing something?
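For what it's worth, torch.utils.checkpoint.checkpoint does not block gradients to the wrapped module's parameters: it avoids storing intermediate activations in the forward pass and re-runs the forward during backward to recompute them. A tiny self-contained check (generic PyTorch, not this repository's code):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Checkpointing re-runs `block`'s forward during backward; its weight still gets a gradient.
block = nn.Linear(8, 8)
x = torch.randn(2, 8, requires_grad=True)   # if no input requires grad, PyTorch emits the
                                            # "None of the inputs have requires_grad=True" warning
y = checkpoint(block, x)
y.sum().backward()
print(block.weight.grad is not None)        # True

Whether a specific weight such as attn1.to_q actually changes after an optimizer step also depends on whether it is included in the set of trainable parameters passed to the optimizer.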

iKUN

My comment is: Lou Chu Ji Jiao.

Pretrained model download

I can only download the .ckpt weight files, and I get:
raise EnvironmentError(f"It looks like the config file at '{config_file}' is not a valid JSON file.")
OSError: It looks like the config file at './checkpoints/stable-diffusion-v1-4/sd-v1-4.ckpt' is not a valid JSON file.

Python version

Thanks for the great work. May I know which version of Python you are using?

Colab

Is it possible that you could make your code available as a Google Colab? I find that the most accessible interface.

About Dataset

There are 1024 video samples mentioned in the paper. If it is convenient, can you tell me where the dataset can be downloaded?

More flexible/powerful attention mechanism for longer videos?

The proposed SC-Attention is very computationally efficient, but it intuitively seems insufficient when generating slightly longer videos. For example, attending only to the previous frame is not sufficient for establishing the direction of movement of an object.
I guess this works in the provided examples because the videos only have 8 frames, so the first frame combined with the previous frame might be sufficient for that. But in longer videos, the first frame of the video starts to become somewhat irrelevant.

Have you considered or tried other attention masks, such as local attention (sliding window) or other non-quadratic sparse attention mechanisms? Any thoughts on this?

Thanks and congrats for the great work!
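As an illustration of the kind of sliding-window masking the question above refers to, here is a minimal sketch of a local, causal frame-attention mask (not something the repository currently implements):

import torch

def local_frame_mask(video_length: int, window: int) -> torch.Tensor:
    # True where frame i may attend to frame j: causal (j <= i) and within
    # a sliding window of `window` previous frames (i - j <= window).
    i = torch.arange(video_length)[:, None]
    j = torch.arange(video_length)[None, :]
    return (j <= i) & (i - j <= window)   # boolean mask of shape (video_length, video_length)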

OSError: Error

Hello,

When i run on colab i get this error :

OSError: Error no file named diffusion_pytorch_model.bin found in directory

Do you have any tutorial on how to use the Colab?

Best Regards

video prediction

Hello, your work is very attractive. Can this method be used for a video frame prediction task?

How to train with video size 256*256

Hi there! Due to limited GPU memory, the training process triggers OOM, so I switched to training with video size 256*256 and batch_size=1. However, it leads to an error like this:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss.
You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

I'm not quite sure what to do with this. Any advice on this issue? Thanks!
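One generic workaround that the message itself suggests is enabling find_unused_parameters in DDP; with accelerate this is typically passed through a DistributedDataParallelKwargs handler (a sketch of the general pattern, not necessarily how this repository sets up its Accelerator):

from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs

# Forward find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel.
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])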

What is the license for this code?

Hi, I'm thinking of making a gradio demo app for this repo and I'd like to know the code license of this repo. Could you add a LICENSE file?

I get an error when I try to train a DreamBooth model

Thank you for the great work.
I have no trouble training with a normal model, but I'm having trouble training a DreamBooth model. Mr Potato Head doesn't work either, so I'd like to identify the cause.


$ accelerate launch train_tuneavideo.py --config="configs/mr-potato-head.yaml"
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
02/01/2023 10:24:30 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16

{'variance_type', 'prediction_type', 'clip_sample'} was not found in config. Values will be initialized to default values.
{'use_linear_projection', 'resnet_time_scale_shift', 'num_class_embeds', 'class_embed_type', 'mid_block_type', 'only_cross_attention', 'dual_cross_attention', 'upcast_attention'} was not found in config. Values will be initialized to default values.
{'prediction_type', 'clip_sample'} was not found in config. Values will be initialized to default values.
/home/ubuntu/Tune-A-Video/tuneavideo/pipelines/pipeline_tuneavideo.py:82: FutureWarning: The configuration file of this scheduler: DDIMScheduler {
"_class_name": "DDIMScheduler",
"_diffusers_version": "0.12.1",
"beta_end": 0.012,
"beta_schedule": "scaled_linear",
"beta_start": 0.00085,
"clip_sample": true,
"num_train_timesteps": 1000,
"prediction_type": "epsilon",
"set_alpha_to_one": false,
"skip_prk_steps": true,
"steps_offset": 1,
"trained_betas": null
}
has not set the configuration clip_sample. clip_sample should be set to False in the configuration file. Please make sure to update the config accordingly as not setting clip_sample in the config might lead to incorrect results in future versions. If you have downloaded this checkpoint from the Hugging Face Hub, it would be very nice if you could open a Pull request for the scheduler/scheduler_config.json file
deprecate("clip_sample not set", "1.0.0", deprecation_message, standard_warn=False)
02/01/2023 10:24:42 - INFO - main - ***** Running training *****
02/01/2023 10:24:42 - INFO - main - Num examples = 1
02/01/2023 10:24:42 - INFO - main - Num Epochs = 500
02/01/2023 10:24:42 - INFO - main - Instantaneous batch size per device = 1
02/01/2023 10:24:42 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 1
02/01/2023 10:24:42 - INFO - main - Gradient Accumulation steps = 1
02/01/2023 10:24:42 - INFO - main - Total optimization steps = 500
Steps: 0%| | 0/500 [00:00<?, ?it/s]/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Traceback (most recent call last):
File "/home/ubuntu/Tune-A-Video/train_tuneavideo.py", line 352, in
main(**OmegaConf.load(args.config))
File "/home/ubuntu/Tune-A-Video/train_tuneavideo.py", line 284, in main
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/accelerate/utils/operations.py", line 490, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/home/ubuntu/Tune-A-Video/tuneavideo/models/unet.py", line 364, in forward
sample, res_samples = downsample_block(
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/Tune-A-Video/tuneavideo/models/unet_blocks.py", line 301, in forward
hidden_states = torch.utils.checkpoint.checkpoint(
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/home/ubuntu/Tune-A-Video/tuneavideo/models/unet_blocks.py", line 294, in custom_forward
return module(*inputs, return_dict=return_dict)
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/Tune-A-Video/tuneavideo/models/attention.py", line 111, in forward
hidden_states = block(
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/Tune-A-Video/tuneavideo/models/attention.py", line 243, in forward
hidden_states = self.attn1(norm_hidden_states, attention_mask=attention_mask, video_length=video_length) + hidden_states
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/Tune-A-Video/tuneavideo/models/attention.py", line 283, in forward
query = self.reshape_heads_to_batch_dim(query)
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1269, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'SparseCausalAttention' object has no attribute 'reshape_heads_to_batch_dim'


This environment is as follows

Tesla V100(32GB)
Python 3.10.9
torch 1.13.1
torchaudio 0.13.1
torchtext 0.14.1
torchvision 0.14.1
transformers 4.26.0

Also, when I tried to train Tune-A-Video with a model I trained myself using the Diffusers examples, I got a different error.

$ accelerate launch train_tuneavideo.py --config="configs/man-surfing.yaml"
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
02/01/2023 10:07:58 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16

{'variance_type'} was not found in config. Values will be initialized to default values.
Traceback (most recent call last):
File "/home/ubuntu/Tune-A-Video/train_tuneavideo.py", line 352, in
main(**OmegaConf.load(args.config))
File "/home/ubuntu/Tune-A-Video/train_tuneavideo.py", line 107, in main
unet = UNet3DConditionModel.from_pretrained_2d(pretrained_model_path, subfolder="unet")
File "/home/ubuntu/Tune-A-Video/tuneavideo/models/unet.py", line 440, in from_pretrained_2d
model = cls.from_config(config)
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 210, in from_config
model = cls(**init_dict)
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 567, in inner_init
init(self, *args, **init_kwargs)
File "/home/ubuntu/Tune-A-Video/tuneavideo/models/unet.py", line 158, in init
raise ValueError(f"unknown mid_block_type : {mid_block_type}")
ValueError: unknown mid_block_type : UNetMidBlock2DCrossAttn
Traceback (most recent call last):
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/bin/accelerate", line 8, in
sys.exit(main())
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
simple_launcher(args)
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/bin/python', 'train_tuneavideo.py', '--config=configs/man-surfing.yaml']' returned non-zero exit status 1.


Any hint would be appreciated.

Totally bad results, cannot reproduce!

Why can't I reproduce your results? Actually, the results are quite bad...
The only difference is that I have not installed the xformers package, and I reduced n_sample_frames to 12.

set_use_memory_efficient_attention_xformers()

Hello. Thank you for sharing this great work!

When I run this project, I encountered the following error.

TypeError: set_use_memory_efficient_attention_xformers() takes 2 positional arguments but 3 were given

It seems to be an error from diffusers, but I'm asking for help in this project since I installed the latest version of diffusers (0.13.0), and the previous version (0.11.1) also raises the error.

Any help would be appreciated. Thank you!

Pose Control Implementation

Hello,
I was wondering how exactly you guys managed to perform "pose control" with Tune-A-Video? To my knowledge, the process hasn't been outlined in the Tune-A-Video paper.

