sihyun-yu / digan

Official PyTorch implementation of Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks (ICLR 2022).

Home Page: https://sihyun.me/digan/

Languages: Shell 0.07%, Python 89.62%, C++ 3.17%, CUDA 7.14%
Topics: video-generation, gan, implicit-neural-representation, inr


digan's People

Contributors: jihoontack, sihyun-yu


digan's Issues

How to perform space extrapolation?

I trained DIGAN on a dataset at 128x128 resolution and now intend to generate output at 256x256 resolution. However, when I load the pretrained model, img_resolution is fixed at 128x128. I have tried changing the output resolution in multiple places, but have been unable to do so. Any help on this would be appreciated.
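For context, since DIGAN's generator is an implicit neural representation over space-time coordinates, spatial extrapolation in principle means querying the synthesis network on a denser coordinate grid than the one implied by the stored img_resolution. The sketch below only illustrates that idea with a hypothetical helper; DIGAN's actual synthesis code builds its grid internally:

import torch

def make_coord_grid(resolution):
    # Dense (x, y) coordinates in [-1, 1]; an INR can be queried at any density.
    axis = torch.linspace(-1.0, 1.0, resolution)
    ys, xs = torch.meshgrid(axis, axis, indexing='ij')
    return torch.stack([xs, ys], dim=-1).reshape(-1, 2)

coords_train = make_coord_grid(128)  # grid density the model was trained on
coords_extra = make_coord_grid(256)  # denser grid for spatial extrapolation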

About the GPU requirement

Dear authors,

Hello! First of all, thank you for your inspiring work!

I encountered an issue with multi-GPU training on 8 V100-16GB GPUs. When distributing the models across GPUs,

if rank == 0:
    print(f'Distributing across {num_gpus} GPUs...')
ddp_modules = dict()
for name, module in [('G_mapping', G.mapping), ('G_synthesis', G.synthesis), ('D', D), (None, G_ema), ('augment_pipe', augment_pipe)]:
    if rank == 0:
        print("[Distributing] Module {} ...".format(name))

    # Wrap only modules that actually have parameters; DDP construction also
    # broadcasts the module's initial state from rank 0 to the other ranks.
    if (num_gpus > 1) and (module is not None) and len(list(module.parameters())) != 0:
        # DDP requires at least one parameter with requires_grad=True at wrap
        # time, so gradients are toggled on for wrapping and off afterwards.
        module.requires_grad_(True)
        module = torch.nn.parallel.DistributedDataParallel(module, device_ids=[device], broadcast_buffers=False,
                                                           find_unused_parameters=False)
        module.requires_grad_(False)

    if rank == 0:
        print("[Distributed] Module {}".format(name))

    # G_ema is listed with name None: it is wrapped only for the state
    # broadcast, and the DDP wrapper itself is not kept.
    if name is not None:
        ddp_modules[name] = module

the process failed on the first module, G_mapping, reporting:

[Distributing] Module G_mapping ...
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1640811806235/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.  

The GPU memory consumption status is as follows:

wangyuhan-8-v100         Sat Mar  5 12:28:13 2022  460.73.01
[0] Tesla V100-SXM2-16GB | 36'C,  22 % | 15415 / 16160 MB | yuhan:python/31701(1283M) yuhan:python/31696(6905M) yuhan:python/31699(1151M) yuhan:python/31700(1241M) yuhan:python/31697(1283M) yuhan:python/31698(1283M) yuhan:python/31702(1175M) yuhan:python/31703(1099M)
[1] Tesla V100-SXM2-16GB | 37'C,   0 % |  2022 / 16160 MB | yuhan:python/31697(2019M)
[2] Tesla V100-SXM2-16GB | 38'C,   0 % |  2022 / 16160 MB | yuhan:python/31698(2019M)
[3] Tesla V100-SXM2-16GB | 39'C,   0 % |  2014 / 16160 MB | yuhan:python/31699(2011M)
[4] Tesla V100-SXM2-16GB | 35'C,   0 % |  2014 / 16160 MB | yuhan:python/31700(2011M)
[5] Tesla V100-SXM2-16GB | 35'C,   0 % |  2022 / 16160 MB | yuhan:python/31701(2019M)
[6] Tesla V100-SXM2-16GB | 36'C,   0 % |  2014 / 16160 MB | yuhan:python/31702(2011M)
[7] Tesla V100-SXM2-16GB | 37'C,   0 % |  2014 / 16160 MB | yuhan:python/31703(2011M)

I am not very familiar with this, but it seems GPU 0 is running out of memory. I wonder whether that is the reason behind the ncclUnhandledCudaError.
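(For anyone hitting similar failures: NCCL's standard debug variables give more detail on such errors. A minimal Python sketch, to be run before torch.distributed initializes the process group:)

import os
# Standard NCCL environment variables; set before the process group is created.
os.environ['NCCL_DEBUG'] = 'INFO'
os.environ['NCCL_DEBUG_SUBSYS'] = 'ALL'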

Could you please help me figure out what caused this error? Does your implementation work on 16GB V100 GPUs?

Thank you very much.

Error on training from scratch

Hi, I am getting an error when running the training script with the command provided in the README.md:

python src/infra/launch.py hydra.run.dir=. +experiment_name=exp01 +dataset.name=kinetics 

which fails as follows:

"self._image_fnames = sorted(fname for fname in self._all_fnames if self._file_ext(fname) in PIL.Image.EXTENSION)
TypeError: _file_ext() missing 1 required positional argument: 'fname'"

I have the Kinetics data processed according to the steps in the prepare_data folder.

Evaluation details about UCF-101 (split information)

Thanks for your great work!

I have some questions about the FVD calculation on the UCF-101 dataset.

As noted in the paper, there are two different experiments for UCF-101: the (train) and the (train + test) split.

My questions are:

  1. Does this codebase use the (train) split? (I.e., does it use the (train) split as training data and the (test) split for the real statistics in FVD?)
  2. When you use the (train + test) split, do you also compute the real FVD statistics from the (train + test) data?

dataset - ImageFolderDataset

Thanks for sharing your great work. I found the following issues in dataset.py:

  1. Path for the Kinetics and Sky datasets: before line #579, there should be:

if 'kinetics' in self._path or 'KINETICS' in self._path or 'SKY' in self._path:
    if train:
        dir_path = os.path.join(self._path, 'train')
    else:
        dir_path = os.path.join(self._path, 'val')

and line #579 should be changed to:
self._all_fnames = {os.path.relpath(os.path.join(root, fname), start=dir_path) for root, _dirs, files in os.walk(dir_path) for fname in files}

Otherwise, it won't work for the Kinetics and Sky datasets.

  2. def _get_zipfile() is not defined in the code, but it is used in lines #582 and #607. The following lines can be added after line #598:
     def _get_zipfile(self):
         assert self._type == 'zip'
         if self._zipfile is None:
             self._zipfile = zipfile.ZipFile(self._path)
         return self._zipfile
  3. In line #599, def _file_ext(fname) should be changed to def _file_ext(self, fname).
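For concreteness, the corrected method would look as follows (the body here is assumed to match the usual StyleGAN2-style helper; the essential change is the added self):

     def _file_ext(self, fname):
         # Now an instance method, so self._file_ext(fname) resolves correctly.
         return os.path.splitext(fname)[1].lower()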

modulated_conv2d in ToRGBLayer

Hi,

Thank you for your work.

I was looking at your code and I noticed that in ToRGBLayer, the modulated_conv2d function is used to generate the RGB frames. Does this mean that the network is not fully implicit but contains convolutions in the last layer, or did I miss something?

x = modulated_conv2d(x=x, weight=self.weight, styles=styles, demodulate=False, fused_modconv=fused_modconv)

Thank you for your help!
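For context on the question above: if the ToRGB weight uses a 1x1 kernel, as in the upstream StyleGAN2 backbone this code builds on, modulated_conv2d reduces to a per-pixel linear map over channels, i.e. it is still applied coordinate-wise. A toy check of that equivalence in plain PyTorch (hypothetical shapes, no modulation, illustrative only):

import torch

x = torch.randn(2, 8, 4, 4)   # [batch, channels, H, W]
w = torch.randn(3, 8, 1, 1)   # 1x1 kernel, 3 output channels

conv_out = torch.nn.functional.conv2d(x, w)
# The same map applied independently at every pixel, as a linear layer over channels.
linear_out = torch.einsum('bchw,oc->bohw', x, w[:, :, 0, 0])
assert torch.allclose(conv_out, linear_out, atol=1e-5)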

training issues

Hi,
I cannot download the i3d_pretrained_400.pt file.
Could you provide it for me?
Thanks!

ModuleNotFoundError: No module named 'torchsde'

Hi,

Thank you for your work. When I run generate_videos.py on the pretrained checkpoints, it gives the following error:

Loading networks from "../digan/pretrained/ucf-101-train-test.pkl"...
Traceback (most recent call last):
  File "src/scripts/generate_videos.py", line 59, in <module>
    generate_videos()
  File "/mnt/home/v_jiangshihao/miniconda3/envs/digan/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/mnt/home/v_jiangshihao/miniconda3/envs/digan/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/mnt/home/v_jiangshihao/miniconda3/envs/digan/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/mnt/home/v_jiangshihao/miniconda3/envs/digan/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/mnt/home/v_jiangshihao/miniconda3/envs/digan/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "src/scripts/generate_videos.py", line 38, in generate_videos
    G = legacy.load_network_pkl(f)['G_ema'].to(device).eval() # type: ignore
  File "/mnt/home/v_jiangshihao/digan_new/src/legacy.py", line 21, in load_network_pkl
    data = _LegacyUnpickler(f).load()
  File "/mnt/home/v_jiangshihao/digan_new/src/torch_utils/persistence.py", line 190, in _reconstruct_persistent_obj
    module = _src_to_module(meta.module_src)
  File "/mnt/home/v_jiangshihao/digan_new/src/torch_utils/persistence.py", line 226, in _src_to_module
    exec(src, module.__dict__) # pylint: disable=exec-used
  File "<string>", line 14, in <module>
ModuleNotFoundError: No module named 'torchsde'

Do you know what the cause of that is? Thanks for your help!
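(For reference: torchsde is a published PyPI package, so installing it into the environment, e.g. with pip install torchsde, should presumably resolve the import error.)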

MoCoGAN-HD comparison

Hi,
you compare to MoCoGAN-HD on Taichi, although they do not report results on this dataset in their paper. I assume you used their repo to train on Taichi. Could you please share the checkpoint you used, since I am trying to compare against both of your works?

Also, can you share how you did the time extrapolation, i.e., how did you adjust Ts?

Zip dataset

Hi,

thanks for your work! I want to use your repo with a .zip dataset; however, I get the following error:

File "/DIGAN/training/dataset.py", line 538, in init
classes, class_to_idx = find_classes(path)
File "/DIGAN/training/dataset.py", line 68, in find_classes
classes = [d for d in os.listdir(dir) if os.path.isdir(os.path.join(dir, d))]
NotADirectoryError: [Errno 20] Not a directory: '/DIGAN/data/dataset.zip'
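For reference, the upstream StyleGAN2-ADA dataset distinguishes directory and zip inputs before walking the path; a minimal sketch of that kind of check (illustrative only, not DIGAN's actual code):

import os
import zipfile

def resolve_dataset_type(path):
    # Directories are walked on disk; zip archives are read member by member.
    if os.path.isdir(path):
        return 'dir'
    if os.path.splitext(path)[1].lower() == '.zip' and zipfile.is_zipfile(path):
        return 'zip'
    raise IOError(f'Path must be a directory or a zip archive: {path}')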

Also, I wanted to ask whether I can somehow combine your model with the FVD evaluation of StyleGAN-V. Could you maybe integrate their evaluation protocol into your pipeline on the fly during training? I am having trouble doing that myself, and I think their protocol uses a better FVD implementation.

Error: --data: [Errno 2] No such file or directory: '/data/UCF-101/train'

Hello! Thank you for making this project open source.
When I deployed the project, I downloaded the UCF-101 dataset and placed it in /data/UCF101/train, but when I ran the training code:
python src/infra/launch.py hydra.run.dir=. +experiment_name=test +dataset.name=UCF-101
I got an error:
Error: --data: [Errno 2] No such file or directory: '/data/UCF-101/train'
Is there something wrong?

About FVD computing

Hello! I have two questions about FVD computing.

  • frechet_video_distance.py forces the generated sequence to be of length 16:
fake = torch.cat([rearrange(
    G(z, c, timesteps=16, noise_mode='const')[0].clamp(-1, 1).cpu(),
    '(b t) c h w -> b t h w c', t=16) for z, c in zip(grid_z, grid_c)])

If I want to train DIGAN with clip lengths of 32 or 128, what should I do for FVD computation? (See the sketch after this list.)

  • How long does it take to compute FVD each time? Following your training setting, this process takes 30 minutes each time and runs every 400k images, which means the total time spent on FVD calculation is 0.5 hrs * 25000 / 400 ≈ 31 hrs. This is far more time-consuming than StyleGAN2's FID computation. I want to confirm: was this the case when you trained your own model?
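Regarding the first question, one possible parameterization of the snippet above (a sketch only: timesteps replaces the hard-coded 16, and the real-video statistics would need to use the same clip length):

from einops import rearrange
import torch

timesteps = 32  # e.g. 32 or 128, assuming G was trained for that clip length
fake = torch.cat([rearrange(
    G(z, c, timesteps=timesteps, noise_mode='const')[0].clamp(-1, 1).cpu(),
    '(b t) c h w -> b t h w c', t=timesteps) for z, c in zip(grid_z, grid_c)])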

Thank you very much!

README file

Following the README file, I failed to run the project. Here are some suggestions and questions:

  1. The README should at least state that the project only supports Linux.
  2. I just want to train the model, but the guide is too brief. First, what does "<EXP_NAME>" mean? Is it just a temporary name, so any text is fine? Second, how do I change the training options? Can I run the project successfully without any changes? Third, where should the data directory be placed? data/UCF-101?
  3. It seems launch.py is just a wrapper around train.py; since some of the default settings may not be available to everyone, why not also provide direct train.py instructions?
  4. I have looked through the project; the code quality is good, but the README is really a disaster.

Training Error (CUDA error: CUBLAS_STATUS_EXECUTION_FAILED)

Hi, thanks for your great work. I am planning to train your model on a custom dataset, but I encounter the following error:

""""CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)"""""

I tried multiple ways to work around this, such as reducing the batch size to 1, reducing the number of GPUs to 1, and reducing the image resolution to 64x64. I am training on NVIDIA Titan Xp GPUs with 12GB of memory. No luck so far!

Can you help me resolve this issue?
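(A general CUDA debugging step, not specific to this repo: forcing synchronous kernel launches makes the failing call appear at its actual call site in the stack trace.)

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # must be set before CUDA is initialized
import torch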
