
lvdm's Introduction

Hi there 👋

Hello! I am Yingqing He. Nice to meet you!
👨‍💻 I am currently a PhD student at HKUST. My research focuses on text-to-video generation and multimodal generation.
📫 How to reach me: [email protected]
📣 Our lab is hiring engineering-oriented research assistants (RA). If you would like to apply, feel free to reach out with your CV!
🧐 Other projects:

  • [CVPR 2024] Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners Github
  • [ECCV 2024] Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation Github
  • Awesome Diffusion Models in High-Resolution Synthesis Github

lvdm's People

Contributors

yingqinghe


lvdm's Issues

The quantitative results of reproduced generated samples and real samples differ from those in the official LVDM paper.

As instructed in the official LVDM code, we downloaded and used the pretrained weights, then extracted the generated npz files as detailed in the README.md at the official link (https://github.com/YingqingHe/LVDM). Following this procedure, we ran the evaluation and obtained the results below. (For the real samples, we randomly selected 2048 clips from the training split, following the StyleGAN-V protocol.)

However, for the SKY dataset we obtained an FVD of 318.66 and a KVD of 21.89, and for the TaiChi dataset an FVD of 285.76 and a KVD of 31.62. (In the LVDM paper, the SKY dataset has an FVD of 95.2 ± 2.3 and a KVD of 3.9 ± 0.1, and the TaiChi dataset an FVD of 99.0 ± 2.6 and a KVD of 15.3 ± 0.9.)

Could you point out anything in this procedure that commonly goes wrong when reproducing these results? Additionally, could you share the configuration details for the 2048 real samples you used?
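
For concreteness, the random selection we used was along these lines (a minimal sketch, not the official StyleGAN-V script; load_video_frames and CLIP_LEN are stand-ins):

    import random
    import numpy as np

    NUM_REAL, CLIP_LEN = 2048, 16  # CLIP_LEN is an assumed clip length, not from the paper

    def sample_real_clips(video_paths, load_video_frames, seed=0):
        # load_video_frames is a stand-in decoder returning a (T, H, W, C) uint8 array
        rng = random.Random(seed)
        clips = []
        for _ in range(NUM_REAL):
            frames = load_video_frames(rng.choice(video_paths))
            start = rng.randint(0, len(frames) - CLIP_LEN)
            clips.append(frames[start:start + CLIP_LEN])
        return np.stack(clips)  # (2048, CLIP_LEN, H, W, C), written to npz for FVD/KVD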

Error during training the AE

When running bash shellscripts/train_lvdm_videoae.sh I get the following error:

    Traceback (most recent call last):
      File "main.py", line 250, in <module>
        model = instantiate_from_config(config.model)
      File "/home/paperspace/metaphysic/projects/marija/LVDM/lvdm/utils/common_utils.py", line 41, in instantiate_from_config
        return get_obj_from_str(config["target"])(**config.get("params", dict()))
    TypeError: __init__() got an unexpected keyword argument 'first_stage_config'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "main.py", line 514, in <module>
        if trainer.global_rank == 0:
    NameError: name 'trainer' is not defined

Any advice?
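
For context, the helper in the traceback follows the usual latent-diffusion pattern, roughly:

    import importlib

    def get_obj_from_str(string):
        module, cls = string.rsplit(".", 1)
        return getattr(importlib.import_module(module), cls)

    def instantiate_from_config(config):
        # Instantiates config["target"] with config["params"] as keyword arguments,
        # so any params key the target's __init__ does not accept raises this TypeError.
        return get_obj_from_str(config["target"])(**config.get("params", dict()))

So the AE config's model.params contains first_stage_config, but model.target names a class whose __init__ does not accept it; my guess (unconfirmed) is that target and params in the shipped AE config are out of sync. The follow-up NameError about trainer just looks like the cleanup handler running before the trainer was created.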

Total batch size of LVDM

Hello, thank you for sharing this great work. Is there a recommended total batch size for training on the datasets you used? Many thanks for your answer!
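
(For other readers: since training runs under a PyTorch Lightning trainer with DDP, the total batch size is the product below; the numbers are hypothetical, not the paper's settings.)

    batch_size_per_gpu = 2       # data.params.batch_size in the config (hypothetical value)
    num_gpus = 8                 # number of GPUs used for DDP (hypothetical value)
    accumulate_grad_batches = 2  # Lightning gradient-accumulation setting (hypothetical value)
    total_batch_size = batch_size_per_gpu * num_gpus * accumulate_grad_batches  # 32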

Sky_lapse dataset structure

The Sky Time-lapse download from Google Drive keeps getting interrupted for me; I suspect size/speed limits.
Could someone please share the directory structure of the dataset and the format of ucfTrainTestlist/classInd.txt?
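
For reference, if the list files mirror the standard UCF-101 split files (my assumption based on the name), they look like this:

    # ucfTrainTestlist/classInd.txt: one "<class_index> <class_name>" per line
    1 ApplyEyeMakeup
    2 ApplyLipstick

    # ucfTrainTestlist/trainlist01.txt: "<class_name>/<video_file> <class_index>"
    ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1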

Different frame rates in WebVid

Hi, thanks for open sourcing your great work! These are really impressive results you showed.

I couldn't find the data pipeline for T2V in the current code, so I was hoping I could ask a question here.

How did you handle the different frame rates across videos (24 fps, 30 fps, 60 fps) in the WebVid dataset? If you just sample frames directly, clips of the same frame count cover different time spans; does that affect training? Did you normalize the frame rate as a preprocessing step, or does the model take the frame rate as a conditioning input?

Thanks in advance.
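
In case it helps others: one common preprocessing choice is to resample every video to a fixed target fps by striding over the decoded frames. A minimal sketch (the target fps and frame list are assumptions, not the authors' pipeline):

    def resample_to_fps(frames, src_fps, target_fps=8):
        # Pick frame indices so every clip plays at target_fps regardless of
        # the source rate, giving all videos a uniform temporal scale.
        stride = src_fps / target_fps
        n_out = int(len(frames) / stride)
        return [frames[min(round(i * stride), len(frames) - 1)] for i in range(n_out)]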

code release

Hello, your work is very inspiring. How long until the code is made public? Looking forward to your reply.

what is the function of self.resume_new_epoch

LVDM/lvdm/models/ddpm3d.py

Lines 394 to 405 in 2fb64c2

    if (self.current_epoch == 0) or self.resume_new_epoch == 0:
        # first epoch of this run (fresh start or just resumed): reset the timers
        self.epoch_start_time = time.time()
        self.current_epoch_time = 0
        self.total_time = 0
        self.epoch_time_avg = 0
    else:
        # later epochs: accumulate per-epoch wall-clock time and its running average
        self.current_epoch_time = time.time() - self.epoch_start_time
        self.epoch_start_time = time.time()
        self.total_time += self.current_epoch_time
        self.epoch_time_avg = self.total_time / self.current_epoch
    self.resume_new_epoch += 1  # counts epochs completed since this process started
    epoch_avg_loss = torch.stack([x['loss'] for x in outputs]).mean()

GPU memory usage during training

Hi, thank you for sharing this great work. I'm trying to train LVDM on UCF-101 for unconditional generation and I observed odd GPU memory usage during training. With batch size 2, nvidia-smi shows roughly 73,000 MiB in use during training; when I increased the batch size to 32, usage dropped to ~35,000 MiB. I tried to debug the code, and it seems the UNet consumes a large amount of memory when the batch size is small (memory grows from 8 GiB to ~73 GiB between line 626 and line 634 in lvdm/models/modules/openaimodel3d.py). I wonder if you have any insights on this issue. Thanks!
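
For anyone wanting to reproduce the measurement, the numbers above were bracketed roughly like this (a minimal sketch, not code from the repo):

    import torch

    def report_mem(tag):
        # nvidia-smi reports the caching allocator's reserved pool, which can be
        # much larger than what live tensors actually occupy (memory_allocated).
        alloc = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"{tag}: allocated={alloc:.1f} GiB, reserved={reserved:.1f} GiB")

    # e.g. placed before and after the suspect block in openaimodel3d.py:
    # report_mem("before block"); h = block(h, emb, context); report_mem("after block")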

Request for UCF pretrained weight

Thank you for the interesting work.

I'm currently in the process of training LVDM on the UCF dataset, but it seems difficult to reproduce the results.

Would it be possible for you to share the pretrained weights for the UCF Dataset?

Code Release

Hello, thank you for the wonderful and interesting work on bringing latent diffusion to videos. Do you plan to release the full code along with the results, and if so, what is your expected timeframe for the same? Thanks.

about prediction process

When training the model in prediction mode, the function add_frames_condition builds a mask of shape [b, c=1, t=4, h, w] with mask = 1 at t = 0 and mask = 0 at t = 1, 2, 3, just like the figure in your paper.


So mask.all() should be False, mask.all() == 0 should then be True, and is_uncond should end up True...?

So the conditional frames are never built in the next step...

about function all
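
A quick check of the Tensor.all() behavior in question (shapes assumed from the prediction setup above):

    import torch

    # prediction-mode mask: frame t=0 is conditioned (1), frames t=1..3 are not (0)
    mask = torch.zeros(2, 1, 4, 8, 8)
    mask[:, :, 0] = 1

    print(mask.all())       # tensor(False) -- not every element is 1
    print(mask.all() == 0)  # tensor(True)  -- so is_uncond evaluates to True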

LVDM/lvdm/models/ddpm3d.py

Lines 1341 to 1373 in 2fb64c2

    def add_frames_condition(self, input, c):
        device = input.device
        input_shape = input.shape
        # get a mask indicating condition positions
        if self.rand_temporal_mask:
            mask = random_temporal_masking(input_shape, self.p_interp, self.p_pred, device, n_prevs=self.n_prevs, interp_frame_stride=self.interp_frame_stride)
        else:
            raise NotImplementedError
        is_uncond = (mask.all() == 0)
        # get a random timestep for condition
        if self.noisy_cond:
            assert(self.max_noise_level is not None)
            s = torch.randint(0, self.max_noise_level, (input_shape[0],), device=device).long()
            if self.uncondnoS and is_uncond:
                s = None
        else:
            s = None
        # make conditional frames
        if not is_uncond:
            cond_idx = []
            for i in range(mask.shape[2]):
                if mask[:, :, i, :, :].all() == 1:
                    cond_idx.append(i)
            cond_frames = input[:, :, cond_idx, :, :]  # for both prediction & interpolation
            if self.noisy_cond:
                noise = torch.randn_like(cond_frames)
                cond_frames = self.q_sample(x_start=cond_frames, t=s, noise=noise)
        else:
            cond_frames = None
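
If the intended semantics are "unconditional only when no frame is conditioned at all", the check would presumably need to test for the absence of any conditioned position, e.g. (my assumption, not a confirmed fix):

    is_uncond = not mask.any()  # assumed intent: True only if no position is conditioned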
