
lvdm's Introduction

Hi there 👋

Hello! I am Yingqing He. Nice to meet you!
👨‍💻 I am currently a PhD student at HKUST. My research focuses on text-to-video generation and multimodal generation.
📫 How to reach me: [email protected]
📣 Our lab is hiring engineering-oriented research assistants (RA). If you would like to apply, feel free to reach out with your CV!
🧐 Other projects:

  • [CVPR 2024] Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners Github
  • [ECCV 2024] Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation Github
  • Awesome Diffusion Models in High-Resolution Synthesis Github

lvdm's People

Contributors

yingqinghe


lvdm's Issues

The quantitative results of reproduced generated samples and real samples differ from those in the official LVDM paper.

As instructed in the official LVDM code, we downloaded and used the pretrained weights, then extracted the generated npz files as detailed in the README.md at the official link (https://github.com/YingqingHe/LVDM). Following this procedure, we ran the evaluation and obtained the results below. (For the real samples, we randomly selected 2048 clips from the training split, following the StyleGAN-V protocol.)

However, for the SKY dataset we obtained an FVD of 318.66 and a KVD of 21.89, and for the TaiChi dataset an FVD of 285.76 and a KVD of 31.62. (In the LVDM paper, the SKY dataset has an FVD of 95.2 ± 2.3 and a KVD of 3.9 ± 0.1, and the TaiChi dataset an FVD of 99.0 ± 2.6 and a KVD of 15.3 ± 0.9.)

Could you point out anything in this procedure that commonly goes wrong when reproducing these results? Additionally, could you share the configuration details for the 2048 real samples you used?
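
For concreteness, the random selection we used was along these lines (a minimal sketch, not the official StyleGAN-V script; load_video_frames and CLIP_LEN are stand-ins):

    import random
    import numpy as np

    NUM_REAL, CLIP_LEN = 2048, 16  # CLIP_LEN is an assumed clip length, not from the paper

    def sample_real_clips(video_paths, load_video_frames, seed=0):
        # load_video_frames is a stand-in decoder returning a (T, H, W, C) uint8 array
        rng = random.Random(seed)
        clips = []
        for _ in range(NUM_REAL):
            frames = load_video_frames(rng.choice(video_paths))
            start = rng.randint(0, len(frames) - CLIP_LEN)
            clips.append(frames[start:start + CLIP_LEN])
        return np.stack(clips)  # (2048, CLIP_LEN, H, W, C), written to npz for FVD/KVD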

Error during training the AE

When running bash shellscripts/train_lvdm_videoae.sh I get the following error:

    Traceback (most recent call last):
      File "main.py", line 250, in <module>
        model = instantiate_from_config(config.model)
      File "/home/paperspace/metaphysic/projects/marija/LVDM/lvdm/utils/common_utils.py", line 41, in instantiate_from_config
        return get_obj_from_str(config["target"])(**config.get("params", dict()))
    TypeError: __init__() got an unexpected keyword argument 'first_stage_config'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "main.py", line 514, in <module>
        if trainer.global_rank == 0:
    NameError: name 'trainer' is not defined

Any advice?
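
For context, the helper in the traceback follows the usual latent-diffusion pattern, roughly:

    import importlib

    def get_obj_from_str(string):
        module, cls = string.rsplit(".", 1)
        return getattr(importlib.import_module(module), cls)

    def instantiate_from_config(config):
        # Instantiates config["target"] with config["params"] as keyword arguments,
        # so any params key the target's __init__ does not accept raises this TypeError.
        return get_obj_from_str(config["target"])(**config.get("params", dict()))

So the AE config's model.params contains first_stage_config, but model.target names a class whose __init__ does not accept it; my guess (unconfirmed) is that target and params in the shipped AE config are out of sync. The follow-up NameError about trainer just looks like the cleanup handler running before the trainer was created.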

Total batch size of LVDM

Hello, thank you for sharing this great work. Is there a recommended total batch size for training on the datasets you used? Many thanks for your answer!
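
(For other readers: since training runs under a PyTorch Lightning trainer with DDP, the total batch size is the product below; the numbers are hypothetical, not the paper's settings.)

    batch_size_per_gpu = 2       # data.params.batch_size in the config (hypothetical value)
    num_gpus = 8                 # number of GPUs used for DDP (hypothetical value)
    accumulate_grad_batches = 2  # Lightning gradient-accumulation setting (hypothetical value)
    total_batch_size = batch_size_per_gpu * num_gpus * accumulate_grad_batches  # 32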

Sky_lapse dataset structure

The Sky Time-lapse download from Google Drive keeps getting interrupted for me; I suspect size/speed limits.
Could someone please share the directory structure of the dataset and the format of ucfTrainTestlist/classInd.txt?
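
For reference, if the list files mirror the standard UCF-101 split files (my assumption based on the name), they look like this:

    # ucfTrainTestlist/classInd.txt: one "<class_index> <class_name>" per line
    1 ApplyEyeMakeup
    2 ApplyLipstick

    # ucfTrainTestlist/trainlist01.txt: "<class_name>/<video_file> <class_index>"
    ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1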

Different frame rates in WebVid

Hi, thanks for open sourcing your great work! These are really impressive results you showed.

I couldn't find the data pipeline for T2V in the current code, so I was hoping I could ask a question here.

How did you handle the different frame rates across videos (24 fps, 30 fps, 60 fps) in the WebVid dataset? If you just sample frames directly, clips of the same frame count cover different time spans; does that affect training? Did you normalize the frame rate as a preprocessing step, or does the model take the frame rate as a conditioning input?

Thanks in advance.
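
In case it helps others: one common preprocessing choice is to resample every video to a fixed target fps by striding over the decoded frames. A minimal sketch (the target fps and frame list are assumptions, not the authors' pipeline):

    def resample_to_fps(frames, src_fps, target_fps=8):
        # Pick frame indices so every clip plays at target_fps regardless of
        # the source rate, giving all videos a uniform temporal scale.
        stride = src_fps / target_fps
        n_out = int(len(frames) / stride)
        return [frames[min(round(i * stride), len(frames) - 1)] for i in range(n_out)]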

code release

Hello, your work is very inspiring. How long until the code is made public? Looking forward to your reply.

what is the function of self.resume_new_epoch

LVDM/lvdm/models/ddpm3d.py

Lines 394 to 405 in 2fb64c2

    if (self.current_epoch == 0) or self.resume_new_epoch == 0:
        # first epoch of this run (fresh start or just resumed): reset the timers
        self.epoch_start_time = time.time()
        self.current_epoch_time = 0
        self.total_time = 0
        self.epoch_time_avg = 0
    else:
        # later epochs: accumulate per-epoch wall-clock time and its running average
        self.current_epoch_time = time.time() - self.epoch_start_time
        self.epoch_start_time = time.time()
        self.total_time += self.current_epoch_time
        self.epoch_time_avg = self.total_time / self.current_epoch
    self.resume_new_epoch += 1  # counts epochs completed since this process started
    epoch_avg_loss = torch.stack([x['loss'] for x in outputs]).mean()

GPU memory usage during training

Hi, thank you for sharing this great work. I'm trying to train LVDM on UCF-101 for unconditional generation and I observed odd GPU memory usage during training. With batch size 2, nvidia-smi shows roughly 73,000 MiB in use during training; when I increased the batch size to 32, usage dropped to ~35,000 MiB. I tried to debug the code, and it seems the UNet consumes a large amount of memory when the batch size is small (memory grows from 8 GiB to ~73 GiB between line 626 and line 634 in lvdm/models/modules/openaimodel3d.py). I wonder if you have any insights on this issue. Thanks!
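
For anyone wanting to reproduce the measurement, the numbers above were bracketed roughly like this (a minimal sketch, not code from the repo):

    import torch

    def report_mem(tag):
        # nvidia-smi reports the caching allocator's reserved pool, which can be
        # much larger than what live tensors actually occupy (memory_allocated).
        alloc = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"{tag}: allocated={alloc:.1f} GiB, reserved={reserved:.1f} GiB")

    # e.g. placed before and after the suspect block in openaimodel3d.py:
    # report_mem("before block"); h = block(h, emb, context); report_mem("after block")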

Request for UCF pretrained weight

Thank you for the interesting work.

I'm currently in the process of training LVDM on the UCF dataset, but it seems difficult to reproduce the results.

Would it be possible for you to share the pretrained weights for the UCF Dataset?

Code Release

Hello, thank you for the wonderful and interesting work on bringing latent diffusion to videos. Do you plan to release the full code along with the results, and if so, what is your expected timeframe for the same? Thanks.

about prediction process

When training the model in prediction mode, the function add_frames_condition builds a mask of shape [b, c=1, t=4, h, w] with mask = 1 at t = 0 and mask = 0 at t = 1, 2, 3, just like the figure in your paper.


So mask.all() should be False, mask.all() == 0 should then be True, and is_uncond should end up True...?

So the conditional frames are never built in the next step...

about function all
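
A quick check of the Tensor.all() behavior in question (shapes assumed from the prediction setup above):

    import torch

    # prediction-mode mask: frame t=0 is conditioned (1), frames t=1..3 are not (0)
    mask = torch.zeros(2, 1, 4, 8, 8)
    mask[:, :, 0] = 1

    print(mask.all())       # tensor(False) -- not every element is 1
    print(mask.all() == 0)  # tensor(True)  -- so is_uncond evaluates to True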

LVDM/lvdm/models/ddpm3d.py

Lines 1341 to 1373 in 2fb64c2

    def add_frames_condition(self, input, c):
        device = input.device
        input_shape = input.shape
        # get a mask indicating condition positions
        if self.rand_temporal_mask:
            mask = random_temporal_masking(input_shape, self.p_interp, self.p_pred, device, n_prevs=self.n_prevs, interp_frame_stride=self.interp_frame_stride)
        else:
            raise NotImplementedError
        is_uncond = (mask.all() == 0)
        # get a random timestep for condition
        if self.noisy_cond:
            assert(self.max_noise_level is not None)
            s = torch.randint(0, self.max_noise_level, (input_shape[0],), device=device).long()
            if self.uncondnoS and is_uncond:
                s = None
        else:
            s = None
        # make conditional frames
        if not is_uncond:
            cond_idx = []
            for i in range(mask.shape[2]):
                if mask[:, :, i, :, :].all() == 1:
                    cond_idx.append(i)
            cond_frames = input[:, :, cond_idx, :, :]  # for both prediction & interpolation
            if self.noisy_cond:
                noise = torch.randn_like(cond_frames)
                cond_frames = self.q_sample(x_start=cond_frames, t=s, noise=noise)
        else:
            cond_frames = None
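
If the intended semantics are "unconditional only when no frame is conditioned at all", the check would presumably need to test for the absence of any conditioned position, e.g. (my assumption, not a confirmed fix):

    is_uncond = not mask.any()  # assumed intent: True only if no position is conditioned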
