
showlab / tune-a-video


[ICCV 2023] Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

Home Page: https://tuneavideo.github.io

License: Apache License 2.0

Python 88.29% Jupyter Notebook 11.71%

tune-a-video's Introduction

Tune-A-Video

This repository is the official implementation of Tune-A-Video.

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

Project Website arXiv Hugging Face Spaces Open In Colab


Given a video-text pair as input, our method, Tune-A-Video, fine-tunes a pre-trained text-to-image diffusion model for text-to-video generation.

News

🚨 Announcing LOVEU-TGVE: A CVPR competition for AI-based video editing! Submissions due Jun 5. Don't miss out! 🤩

  • [02/22/2023] Improved consistency using DDIM inversion.
  • [02/08/2023] Colab demo released!
  • [02/03/2023] Pre-trained Tune-A-Video models are available on Hugging Face Library!
  • [01/28/2023] New Feature: tune a video on personalized DreamBooth models.
  • [01/28/2023] Code released!

Setup

Requirements

pip install -r requirements.txt

Installing xformers is highly recommended for more efficiency and speed on GPUs. To enable xformers, set enable_xformers_memory_efficient_attention=True (default).

Weights

[Stable Diffusion] Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. The pre-trained Stable Diffusion models can be downloaded from Hugging Face (e.g., Stable Diffusion v1-4, v2-1). You can also use fine-tuned Stable Diffusion models trained on different styles (e.g., Modern Disney, Anything V4.0, Redshift, etc.).

[DreamBooth] DreamBooth is a method to personalize text-to-image models like Stable Diffusion given just a few images (3~5 images) of a subject. Tuning a video on DreamBooth models allows personalized text-to-video generation of a specific subject. There are some public DreamBooth models available on Hugging Face (e.g., mr-potato-head). You can also train your own DreamBooth model following this training example.

Usage

Training

To fine-tune the text-to-image diffusion models for text-to-video generation, run this command:

accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml"

Note: Tuning a 24-frame video usually takes 300~500 steps, about 10~15 minutes using one A100 GPU. Reduce n_sample_frames if your GPU memory is limited.
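If you need to lower n_sample_frames (or tweak other settings) without editing the shipped YAML by hand, one option is to load and modify the config with OmegaConf, which the training script itself uses to parse configs. A minimal sketch, assuming the key lives under a train_data section (check configs/man-skiing.yaml for the actual layout):

# Minimal sketch: derive a lower-memory config from the shipped one.
# The exact key path of n_sample_frames is an assumption -- verify it against
# configs/man-skiing.yaml before relying on this.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/man-skiing.yaml")
print(OmegaConf.to_yaml(cfg))                 # inspect the full config
cfg.train_data.n_sample_frames = 12           # assumed key path; adjust if needed
OmegaConf.save(cfg, "configs/man-skiing-12frames.yaml")

The derived file can then be passed to accelerate launch train_tuneavideo.py --config="configs/man-skiing-12frames.yaml".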

Inference

Once the training is done, run inference:

from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid
import torch

pretrained_model_path = "./checkpoints/stable-diffusion-v1-4"
my_model_path = "./outputs/man-skiing"
unet = UNet3DConditionModel.from_pretrained(my_model_path, subfolder='unet', torch_dtype=torch.float16).to('cuda')
pipe = TuneAVideoPipeline.from_pretrained(pretrained_model_path, unet=unet, torch_dtype=torch.float16).to("cuda")
pipe.enable_xformers_memory_efficient_attention()
pipe.enable_vae_slicing()

prompt = "spider man is skiing"
ddim_inv_latent = torch.load(f"{my_model_path}/inv_latents/ddim_latent-500.pt").to(torch.float16)
video = pipe(prompt, latents=ddim_inv_latent, video_length=24, height=512, width=512, num_inference_steps=50, guidance_scale=12.5).videos

save_videos_grid(video, f"./{prompt}.gif")

Results

Pretrained T2I (Stable Diffusion)

Input Video: "A man is skiing"
Output Videos: "Spider Man is skiing on the beach, cartoon style"; "Wonder Woman, wearing a cowboy hat, is skiing"; "A man, wearing pink clothes, is skiing at sunset"

Input Video: "A rabbit is eating a watermelon on the table"
Output Videos: "A rabbit is eating a watermelon on the table"; "A cat with sunglasses is eating a watermelon on the beach"; "A puppy is eating a cheeseburger on the table, comic style"

Input Video: "A jeep car is moving on the road"
Output Videos: "A Porsche car is moving on the beach"; "A car is moving on the road, cartoon style"; "A car is moving on the snow"

Input Video: "A man is dribbling a basketball"
Output Videos: "James Bond is dribbling a basketball on the beach"; "An astronaut is dribbling a basketball, cartoon style"; "A lego man in a black suit is dribbling a basketball"

Pretrained T2I (personalized DreamBooth)

Input Video: "A bear is playing guitar"
Output Videos: "1girl is playing guitar, white hair, medium hair, cat ears, closed eyes, cute, scarf, jacket, outdoors, streets"; "1boy is playing guitar, bishounen, casual, indoors, sitting, coffee shop, bokeh"; "1girl is playing guitar, red hair, long hair, beautiful eyes, looking at viewer, cute, dress, beach, sea"

Input Video: "A bear is playing guitar"
Output Videos: "A rabbit is playing guitar, modern disney style"; "A handsome prince is playing guitar, modern disney style"; "A magic princess with sunglasses is playing guitar on the stage, modern disney style"

Input Video: "A bear is playing guitar"
Output Videos: "Mr Potato Head, made of lego, is playing guitar on the snow"; "Mr Potato Head, wearing sunglasses, is playing guitar on the beach"; "Mr Potato Head is playing guitar in the starry night, Van Gogh style"

Citation

If you make use of our work, please cite our paper.

@inproceedings{wu2023tune,
  title={Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation},
  author={Wu, Jay Zhangjie and Ge, Yixiao and Wang, Xintao and Lei, Stan Weixian and Gu, Yuchao and Shi, Yufei and Hsu, Wynne and Shan, Ying and Qie, Xiaohu and Shou, Mike Zheng},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={7623--7633},
  year={2023}
}

Shoutouts

tune-a-video's People

Contributors

hysts, stanlei52, zhangjiewu


tune-a-video's Issues

attention_mask is None during training

In the paper, it is mentioned that the ST-Attn layer only considers the previous frame and the first frame, but in the code it seems that, during training, the attention_mask for the SC-Attn layer is None. Could you explain why this setting is not the same as in the paper?
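For context, the sparse-causal design means each frame's self-attention keys and values are built only from the first frame and the immediately previous frame. One common way to implement that is to gather the key/value tokens of those two frames directly, in which case no explicit attention_mask is needed to obtain the behavior described in the paper. A minimal sketch of that key/value construction (illustrative only, not the repository's exact code):

import torch
from einops import rearrange

def sparse_causal_kv(key: torch.Tensor, value: torch.Tensor, video_length: int):
    # key/value: (batch * frames, tokens, dim), as produced by a per-frame attention layer
    key = rearrange(key, "(b f) n d -> b f n d", f=video_length)
    value = rearrange(value, "(b f) n d -> b f n d", f=video_length)

    first = [0] * video_length                            # every frame sees frame 0
    prev = [max(i - 1, 0) for i in range(video_length)]   # ... and its previous frame

    key = torch.cat([key[:, first], key[:, prev]], dim=2)
    value = torch.cat([value[:, first], value[:, prev]], dim=2)

    key = rearrange(key, "b f n d -> (b f) n d")
    value = rearrange(value, "b f n d -> (b f) n d")
    return key, value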

FileNotFoundError: [WinError 3] The system cannot find the path specified: ''

Traceback (most recent call last):
  File "C:\Users\coron\Desktop\Tune-A-Video\inference.py", line 15, in <module>
    save_videos_grid(video, f"{prompt}.gif")
  File "C:\Users\coron\Desktop\Tune-A-Video\tuneavideo\util.py", line 21, in save_videos_grid
    os.makedirs(os.path.dirname(path), exist_ok=True)
  File "C:\Users\coron\AppData\Local\Programs\Python\Python310\lib\os.py", line 225, in makedirs
    mkdir(name, mode)
FileNotFoundError: [WinError 3] The system cannot find the path specified: ''


The script I executed, inference.py:

from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid
import torch

model_id = "outputs/man-surfing_lr3e-5_seed33/2023-01-30T23-03-19"
unet = UNet3DConditionModel.from_pretrained(model_id, subfolder='unet', torch_dtype=torch.float16).to('cuda')
unet.enable_xformers_memory_efficient_attention()
pipe = TuneAVideoPipeline.from_pretrained("checkpoints/stable-diffusion-v1-4", unet=unet, torch_dtype=torch.float16).to("cuda")

torch.cuda.manual_seed(0)
prompt = "a panda is surfing"
video = pipe(prompt, video_length=8, height=512, width=512, num_inference_steps=50, guidance_scale=7.5).videos

save_videos_grid(video, f"{prompt}.gif")

I don't know if this is relevant, but here is Tune-A-Video\tuneavideo\util.py:

import os
import imageio
import numpy as np
import torch
import torchvision

from einops import rearrange

def save_videos_grid(videos: torch.Tensor, path: str, rescale=False, n_rows=4, fps=3):
    videos = rearrange(videos, "b c t h w -> t b c h w")
    outputs = []
    for x in videos:
        x = torchvision.utils.make_grid(x, nrow=n_rows)
        x = x.transpose(0, 1).transpose(1, 2).squeeze(-1)
        if rescale:
            x = (x + 1.0) / 2.0  # -1,1 -> 0,1
        x = (x * 255).numpy().astype(np.uint8)
        outputs.append(x)

    os.makedirs(os.path.dirname(path), exist_ok=True)
    imageio.mimsave(path, outputs, fps=fps)
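For reference, the failure above happens because os.path.dirname(path) returns an empty string when path is a bare filename such as "a panda is surfing.gif", so os.makedirs("") raises WinError 3. A minimal workaround (a sketch, not an official fix) is to prefix the output path with "./", as in the README example, or to guard the makedirs call:

import os

def ensure_parent_dir(path: str) -> None:
    # os.path.dirname returns '' for a bare filename; only create a directory
    # when the path actually contains a parent component.
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)

# Alternatively, save_videos_grid(video, f"./{prompt}.gif") works as-is,
# because os.path.dirname("./x.gif") == "."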

Error no file named scheduler_config.json found

I downloaded the stable-diffusion-v1-4 checkpoints "sd-v1-4-full-ema.ckpt" and "sd-v1-4.ckpt" and put them into the folder "./checkpoints/stable-diffusion-v1-4". However, when I run accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml", I get the following error:

Traceback (most recent call last):
  File "/home/ubuntu/project/Tune-A-Video/train_tuneavideo.py", line 367, in <module>
    main(**OmegaConf.load(args.config))
  File "/home/ubuntu/project/Tune-A-Video/train_tuneavideo.py", line 105, in main
    noise_scheduler = DDPMScheduler.from_pretrained(pretrained_model_path, subfolder="scheduler")
  File "/opt/conda/envs/makevideo/lib/python3.9/site-packages/diffusers/schedulers/scheduling_utils.py", line 118, in from_pretrained
    config, kwargs = cls.load_config(
  File "/opt/conda/envs/makevideo/lib/python3.9/site-packages/diffusers/configuration_utils.py", line 320, in load_config
    raise EnvironmentError(
OSError: Error no file named scheduler_config.json found in directory ./checkpoints/stable-diffusion-v1-4.

Traceback (most recent call last):
  File "/opt/conda/envs/makevideo/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/makevideo/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/opt/conda/envs/makevideo/lib/python3.9/site-packages/accelerate/commands/launch.py", line 915, in launch_command
    simple_launcher(args)
  File "/opt/conda/envs/makevideo/lib/python3.9/site-packages/accelerate/commands/launch.py", line 578, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/envs/makevideo/bin/python3.9', 'train_tuneavideo.py', '--config=configs/man-skiing.yaml']' returned non-zero exit status 1.

Where can I get scheduler_config.json?
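For context: sd-v1-4.ckpt and sd-v1-4-full-ema.ckpt are single-file checkpoints, while the training script expects the diffusers-format directory layout (unet/, vae/, text_encoder/, tokenizer/, scheduler/scheduler_config.json, ...). One way to fetch that layout, assuming a reasonably recent huggingface_hub and that the model license has been accepted on the Hub, is a snapshot download (sketch):

# Sketch: download the diffusers-format Stable Diffusion v1-4 repository
# (with unet/, vae/, scheduler/, etc.) instead of the single .ckpt files.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="CompVis/stable-diffusion-v1-4",
    local_dir="./checkpoints/stable-diffusion-v1-4",
)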

Help: how can I solve this?

(video) D:\pythonProject\video>accelerate launch train_tuneavideo.py --config="configs/man-skiing.yaml"
The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 0
--num_machines was set to a value of 1
--mixed_precision was set to a value of 'no'
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
04/11/2023 13:50:51 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cpu

Mixed precision type: fp16

Traceback (most recent call last):
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\site-packages\diffusers\configuration_utils.py", line 326, in load_config
    config_file = hf_hub_download(
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\site-packages\huggingface_hub\utils\_validators.py", line 112, in _inner_fn
    validate_repo_id(arg_value)
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\site-packages\huggingface_hub\utils\_validators.py", line 160, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': './checkpoints/stable-diffusion-v1-4'. Use repo_type argument if needed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_tuneavideo.py", line 367, in <module>
    main(**OmegaConf.load(args.config))
  File "train_tuneavideo.py", line 105, in main
    noise_scheduler = DDPMScheduler.from_pretrained(pretrained_model_path, subfolder="scheduler")
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\site-packages\diffusers\schedulers\scheduling_utils.py", line 118, in from_pretrained
    config, kwargs = cls.load_config(
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\site-packages\diffusers\configuration_utils.py", line 363, in load_config
    raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like ./checkpoints/stable-diffusion-v1-4 is not the path to a directory containing a scheduler_config.json file.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/diffusers/installation#offline-mode'.

Traceback (most recent call last):
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\site-packages\accelerate\commands\launch.py", line 923, in launch_command
    simple_launcher(args)
  File "C:\Users\YFT-IT-002\anaconda3\envs\video\lib\site-packages\accelerate\commands\launch.py", line 579, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\YFT-IT-002\anaconda3\envs\video\python.exe', 'train_tuneavideo.py', '--config=configs/man-skiing.yaml']' returned non-zero exit status 1.

Model compatibility.

I tried about four different SD models and none of them worked, but it works with the standard SD 1.4 model. Any help? How can I use different custom SD models?

Diffuser issue

I have installed the dev version of diffusers from Hugging Face and I get this error: "ModuleNotFoundError: No module named 'diffusers.modeling_utils'"

How should I solve it?

Model download error: how can I solve it?

We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like ./checkpoints/stable-diffusion-v1-4 is not the path to a directory containing a scheduler_config.json file.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/diffusers/installation#offline-mode'.

The address above is reachable from my machine, but the error above still occurs. How can I solve it?

ValueError: CrossAttnDownBlock2D does not exist.

Code:

from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
from tuneavideo.util import save_videos_grid
import torch

model_id = "CompVis/stable-diffusion-v1-4"
unet = UNet3DConditionModel.from_pretrained(model_id, subfolder='unet', torch_dtype=torch.float16).to('cuda')
pipe = TuneAVideoPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", unet=unet, torch_dtype=torch.float16).to("cuda")

prompt = "a panda is surfing"
video = pipe(prompt, video_length=8, height=512, width=512, num_inference_steps=50, guidance_scale=7.5).videos

save_videos_grid(video, f"{prompt}.gif")

Output:

ValueError: CrossAttnDownBlock2D does not exist.
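For reference, the README's inference snippet loads the 3D UNet from the output directory of a finished Tune-A-Video run, not from the original Stable Diffusion repository: the stock 2D UNet config lists CrossAttnDownBlock2D blocks that UNet3DConditionModel does not define (the training script inflates the 2D weights via from_pretrained_2d instead, as a traceback further down this page shows). A sketch of the intended loading pattern, with placeholder paths:

# Sketch: the 3D UNet comes from a finished Tune-A-Video training run, while the
# base Stable Diffusion weights supply the rest of the pipeline. Paths are placeholders.
from tuneavideo.pipelines.pipeline_tuneavideo import TuneAVideoPipeline
from tuneavideo.models.unet import UNet3DConditionModel
import torch

base_model = "CompVis/stable-diffusion-v1-4"   # 2D text-to-image weights
finetuned = "./outputs/man-surfing"            # output_dir of a training run (placeholder)

unet = UNet3DConditionModel.from_pretrained(finetuned, subfolder="unet", torch_dtype=torch.float16).to("cuda")
pipe = TuneAVideoPipeline.from_pretrained(base_model, unet=unet, torch_dtype=torch.float16).to("cuda")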

CUDA error: invalid argument

I have run into some problems when running the code. I think there may be something wrong with "accelerator.backward(loss)" at line 294 of train_tuneavideo.py. I would appreciate your valuable opinions. Thank you.

Question about the results

I followed your README without modifying any of your code. Here are the final results, which are not that promising. What's more, I find that sample-100, sample-200, and sample-300 did not change much. Can you share how to improve them?

(attached sample: sample-300)

Improved Consistency using DDIM inversion (?)

Hi, thank you so much for this impressive work!

I noticed there was an updated 'News' entry about 'Improved consistency using DDIM inversion'.
Can you explain a bit more about this update? What I understood is: before, DDPM inversion (DDPM forward and reverse); after, DDIM inversion (DDIM forward and reverse).
Am I right?
Also, is the DDIM sampler then used in both fine-tuning and inference?

Thank you again for nice works.

In the training step, do you shuffle clips?

Hi, thank you first for amazing works.

I have a few questions about your training process.
(1) Did you fix the number of frames (clips) at 24? In every config file, the clip length is consistently 24.
Does that imply that any number bigger or smaller than 24 doesn't perform as well as 24?

(2) In the training step, do you shuffle the order of frames (clips)? I have a feeling that it is not proper to shuffle the frames, because the frame-related attention parts also learn the order of frames.

Thank you again.

Support Stable Diffusion 2

Your work is great! I have easily generated anime with my model, Cool Japan Diffusion 1.x.

I would like to generate anime with Stable Diffusion 2.1, because I am developing Cool Japan Diffusion 2.x, which is based on Stable Diffusion 2.1.
Do you plan to support Stable Diffusion 2.1?

ask for help

RuntimeError: CUDA out of memory. Tried to allocate 4.00 GiB (GPU 0; 23.70 GiB total capacity; 8.31 GiB already allocated; 254.06 MiB free; 8.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

My server environment is shown in the attached screenshot.
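A generic starting point, following the allocator hint in the error message itself (the value below is illustrative, not a repository recommendation), combined with reducing n_sample_frames as suggested in the README:

import os

# Must be set before the first CUDA allocation (e.g. at the very top of the training
# script, or exported in the shell before `accelerate launch`).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"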

Attention weight in CrossAttnDownBlock3D is not trained?

Hi, Congrats on this awesome work!

One question: for this line, it seems the gradient is not going through.

I'm not familiar with torch.utils.checkpoint.checkpoint. Is it something that recomputes the forward pass during backward while running the original forward under no_grad()?

However, even so, I didn't observe any weight change after one iteration of training. Specifically, unet.down_blocks[0].attentions[0].transformer_blocks[0].attn1.to_q.weight is not changed.

Is this correct, or am I missing something?
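For what it's worth, torch.utils.checkpoint.checkpoint does not block gradients to the wrapped module's parameters: it avoids storing intermediate activations in the forward pass and re-runs the forward during backward to recompute them. A tiny self-contained check (generic PyTorch, not this repository's code):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Checkpointing re-runs `block`'s forward during backward; its weight still gets a gradient.
block = nn.Linear(8, 8)
x = torch.randn(2, 8, requires_grad=True)   # if no input requires grad, PyTorch emits the
                                            # "None of the inputs have requires_grad=True" warning
y = checkpoint(block, x)
y.sum().backward()
print(block.weight.grad is not None)        # True

Whether a specific weight such as attn1.to_q actually changes after an optimizer step also depends on whether it is included in the set of trainable parameters passed to the optimizer.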

iKUN

My comment is: Lou Chu Ji Jiao.

Pretrained model download

I can only download the .ckpt weight files, and I get:
raise EnvironmentError(f"It looks like the config file at '{config_file}' is not a valid JSON file.")
OSError: It looks like the config file at './checkpoints/stable-diffusion-v1-4/sd-v1-4.ckpt' is not a valid JSON file.

Python version

Thanks for the great work. May I know which version of Python you are using?

Colab

Is it possible that you could make your code available as a Google Colab? I find that the most accessible interface.

About Dataset

There are 1024 video samples mentioned in the paper. If it is convenient, can you tell me where the dataset can be downloaded?

More flexible/powerful attention mechanism for longer videos?

The proposed SC-Attention is very computationally efficient, but it intuitively seems insufficient when generating slightly longer videos. For example, attending only to the previous frame is not sufficient for establishing the direction of movement of an object.
I guess this works in the provided examples because the videos only have 8 frames, so the first frame combined with the previous frame might be sufficient for that. But in longer videos, the first frame of the video starts to become somewhat irrelevant.

Have you considered or tried other attention masks, such as local attention (sliding window) or other non-quadratic sparse attention mechanisms? Any thoughts on this?

Thanks and congrats for the great work!
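As an illustration of the kind of sliding-window masking the question above refers to, here is a minimal sketch of a local, causal frame-attention mask (not something the repository currently implements):

import torch

def local_frame_mask(video_length: int, window: int) -> torch.Tensor:
    # True where frame i may attend to frame j: causal (j <= i) and within
    # a sliding window of `window` previous frames (i - j <= window).
    i = torch.arange(video_length)[:, None]
    j = torch.arange(video_length)[None, :]
    return (j <= i) & (i - j <= window)   # boolean mask of shape (video_length, video_length)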

OSError: Error

Hello,

When i run on colab i get this error :

OSError: Error no file named diffusion_pytorch_model.bin found in directory

Do you have any tutorial on how to use the Colab?

Best Regards

video prediction

Hello, your work is very attractive. Can this method be used for a video frame prediction task?

How to train with video size 256*256

Hi there! Due to limited GPU memory, the training process triggers OOM, so I switched to training with video size 256*256 and batch_size=1. However, it leads to an error like this:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss.
You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

I'm not quite sure what to do with this. Any advice on this issue? Thanks!
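One generic workaround that the message itself suggests is enabling find_unused_parameters in DDP; with accelerate this is typically passed through a DistributedDataParallelKwargs handler (a sketch of the general pattern, not necessarily how this repository sets up its Accelerator):

from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs

# Forward find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel.
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])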

What is the license for this code?

Hi, I'm thinking of making a gradio demo app for this repo and I'd like to know the code license of this repo. Could you add a LICENSE file?

I get an error when I try to train a DreamBooth model

Thank you for the great work.
I have no trouble training with a normal model, but I'm having trouble training a DreamBooth model. Mr Potato Head doesn't work either, so I'd like to identify the cause.


$ accelerate launch train_tuneavideo.py --config="configs/mr-potato-head.yaml"
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
02/01/2023 10:24:30 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16

{'variance_type', 'prediction_type', 'clip_sample'} was not found in config. Values will be initialized to default values.
{'use_linear_projection', 'resnet_time_scale_shift', 'num_class_embeds', 'class_embed_type', 'mid_block_type', 'only_cross_attention', 'dual_cross_attention', 'upcast_attention'} was not found in config. Values will be initialized to default values.
{'prediction_type', 'clip_sample'} was not found in config. Values will be initialized to default values.
/home/ubuntu/Tune-A-Video/tuneavideo/pipelines/pipeline_tuneavideo.py:82: FutureWarning: The configuration file of this scheduler: DDIMScheduler {
"_class_name": "DDIMScheduler",
"_diffusers_version": "0.12.1",
"beta_end": 0.012,
"beta_schedule": "scaled_linear",
"beta_start": 0.00085,
"clip_sample": true,
"num_train_timesteps": 1000,
"prediction_type": "epsilon",
"set_alpha_to_one": false,
"skip_prk_steps": true,
"steps_offset": 1,
"trained_betas": null
}
has not set the configuration clip_sample. clip_sample should be set to False in the configuration file. Please make sure to update the config accordingly as not setting clip_sample in the config might lead to incorrect results in future versions. If you have downloaded this checkpoint from the Hugging Face Hub, it would be very nice if you could open a Pull request for the scheduler/scheduler_config.json file
deprecate("clip_sample not set", "1.0.0", deprecation_message, standard_warn=False)
02/01/2023 10:24:42 - INFO - main - ***** Running training *****
02/01/2023 10:24:42 - INFO - main - Num examples = 1
02/01/2023 10:24:42 - INFO - main - Num Epochs = 500
02/01/2023 10:24:42 - INFO - main - Instantaneous batch size per device = 1
02/01/2023 10:24:42 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 1
02/01/2023 10:24:42 - INFO - main - Gradient Accumulation steps = 1
02/01/2023 10:24:42 - INFO - main - Total optimization steps = 500
Steps: 0%| | 0/500 [00:00<?, ?it/s]/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Traceback (most recent call last):
File "/home/ubuntu/Tune-A-Video/train_tuneavideo.py", line 352, in
main(**OmegaConf.load(args.config))
File "/home/ubuntu/Tune-A-Video/train_tuneavideo.py", line 284, in main
model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/accelerate/utils/operations.py", line 490, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/home/ubuntu/Tune-A-Video/tuneavideo/models/unet.py", line 364, in forward
sample, res_samples = downsample_block(
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/Tune-A-Video/tuneavideo/models/unet_blocks.py", line 301, in forward
hidden_states = torch.utils.checkpoint.checkpoint(
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
outputs = run_function(*args)
File "/home/ubuntu/Tune-A-Video/tuneavideo/models/unet_blocks.py", line 294, in custom_forward
return module(*inputs, return_dict=return_dict)
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/Tune-A-Video/tuneavideo/models/attention.py", line 111, in forward
hidden_states = block(
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/Tune-A-Video/tuneavideo/models/attention.py", line 243, in forward
hidden_states = self.attn1(norm_hidden_states, attention_mask=attention_mask, video_length=video_length) + hidden_states
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/Tune-A-Video/tuneavideo/models/attention.py", line 283, in forward
query = self.reshape_heads_to_batch_dim(query)
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1269, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'SparseCausalAttention' object has no attribute 'reshape_heads_to_batch_dim'


This environment is as follows

Tesla V100(32GB)
Python 3.10.9
torch 1.13.1
torchaudio 0.13.1
torchtext 0.14.1
torchvision 0.14.1
transformers 4.26.0

Also, when I tried to train Tune-A-Video with a model I trained myself using the Diffusers examples, I got a different error.

$ accelerate launch train_tuneavideo.py --config="configs/man-surfing.yaml"
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
02/01/2023 10:07:58 - INFO - main - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16

{'variance_type'} was not found in config. Values will be initialized to default values.
Traceback (most recent call last):
File "/home/ubuntu/Tune-A-Video/train_tuneavideo.py", line 352, in
main(**OmegaConf.load(args.config))
File "/home/ubuntu/Tune-A-Video/train_tuneavideo.py", line 107, in main
unet = UNet3DConditionModel.from_pretrained_2d(pretrained_model_path, subfolder="unet")
File "/home/ubuntu/Tune-A-Video/tuneavideo/models/unet.py", line 440, in from_pretrained_2d
model = cls.from_config(config)
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 210, in from_config
model = cls(**init_dict)
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 567, in inner_init
init(self, *args, **init_kwargs)
File "/home/ubuntu/Tune-A-Video/tuneavideo/models/unet.py", line 158, in init
raise ValueError(f"unknown mid_block_type : {mid_block_type}")
ValueError: unknown mid_block_type : UNetMidBlock2DCrossAttn
Traceback (most recent call last):
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/bin/accelerate", line 8, in
sys.exit(main())
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
simple_launcher(args)
File "/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/lib/python3.10/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/.pyenv/versions/anaconda3-2022.05/envs/ldm310/bin/python', 'train_tuneavideo.py', '--config=configs/man-surfing.yaml']' returned non-zero exit status 1.


Any hint would be appreciated.

Totally bad results, cannot reproduce!

Why can't I reproduce your results? Actually, the results are quite bad...
The only difference is that I have not installed the xformers package, and I reduced n_sample_frames to 12.

set_use_memory_efficient_attention_xformers()

Hello. Thank you for sharing this great work!

When I run this project, I encountered the following error.

TypeError: set_use_memory_efficient_attention_xformers() takes 2 positional arguments but 3 were given

It seems to be an error from diffusers, but I'm asking for help in this project since I installed the latest version of diffusers (0.13.0), and the previous version (0.11.1) also raises the error.

Any help would be appreciated. Thank you!

Pose Control Implementation

Hello,
I was wondering how exactly you guys managed to perform "pose control" with Tune-A-Video? To my knowledge, the process hasn't been outlined in the Tune-A-Video paper.

