

MoCoGAN-HD

(AFHQ, VoxCeleb)

PyTorch implementation of our method for high-resolution (e.g., 1024x1024) and cross-domain video synthesis.
A Good Image Generator Is What You Need for High-Resolution Video Synthesis
Yu Tian¹, Jian Ren², Menglei Chai², Kyle Olszewski², Xi Peng³, Dimitris N. Metaxas¹, Sergey Tulyakov²
¹Rutgers University, ²Snap Inc., ³University of Delaware
In ICLR 2021, Spotlight.

Pre-trained Image Generator & Video Datasets

In-domain Video Synthesis

UCF-101: image generator, video data, motion generator
FaceForensics: image generator, video data, motion generator
Sky-Timelapse: image generator, video data, motion generator

Cross-domain Video Synthesis

(FFHQ, VoxCeleb): FFHQ image generator, VoxCeleb, motion generator
(AFHQ, VoxCeleb): AFHQ image generator, VoxCeleb, motion generator
(Anime, VoxCeleb): Anime image generator, VoxCeleb, motion generator
(FFHQ-1024, VoxCeleb): FFHQ-1024 image generator, VoxCeleb, motion generator
(LSUN-Church, TLVDB): LSUN-Church image generator, TLVDB

The pre-computed PCA statistics are saved here.

Training

Organise the video dataset as follows:

Video dataset
|-- video1
    |-- img_0000.png
    |-- img_0001.png
    |-- img_0002.png
    |-- ...
|-- video2
    |-- img_0000.png
    |-- img_0001.png
    |-- img_0002.png
    |-- ...
|-- video3
    |-- img_0000.png
    |-- img_0001.png
    |-- img_0002.png
    |-- ...
|-- ...
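
If your raw data consists of video files rather than frames, a minimal sketch for extracting frames into the layout above with ffmpeg might look like the following (the input path and clip name are placeholders, and this is not the repository's own preprocessing):

mkdir -p /path/to/video_dataset/video1
ffmpeg -i /path/to/raw_clips/video1.mp4 -start_number 0 \
  /path/to/video_dataset/video1/img_%04d.png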

In-domain Video Synthesis

UCF-101

Collect the PCA components from a pre-trained image generator.

python get_stats_pca.py --batchSize 4000 \
  --save_pca_path pca_stats/ucf_101 \
  --pca_iterations 250 \
  --latent_dimension 512 \
  --img_g_weights /path/to/ucf_101_image_generator \
  --style_gan_size 256 \
  --gpu 0

Train the model

python -W ignore train.py --name ucf_101 \
  --time_step 2 \
  --lr 0.0001 \
  --save_pca_path pca_stats/ucf_101 \
  --latent_dimension 512 \
  --dataroot /path/to/ucf_101 \
  --checkpoints_dir checkpoints/ucf_101 \
  --img_g_weights /path/to/ucf_101_image_generator \
  --multiprocessing_distributed --world_size 1 --rank 0 \
  --batchSize 16 \
  --workers 8 \
  --style_gan_size 256 \
  --total_epoch 100

Inference

python -W ignore evaluate.py  \
  --save_pca_path pca_stats/ucf_101 \
  --latent_dimension 512 \
  --style_gan_size 256 \
  --img_g_weights /path/to/ucf_101_image_generator \
  --load_pretrain_path /path/to/checkpoints \
  --load_pretrain_epoch the_epoch_for_testing (should be >= 0) \
  --results results/ucf_101 \
  --num_test_videos 10

FaceForensics

Collect the PCA components from a pre-trained image generator.

sh script/faceforensics/run_get_stats_pca.sh

Train the model

sh script/faceforensics/run_train.sh

Inference

sh script/faceforensics/run_evaluate.sh

Sky-Timelapse

Collect the PCA components from a pre-trained image generator.

sh script/sky_timelapse/run_get_stats_pca.sh

Train the model

sh script/sky_timelapse/run_train.sh

Inference

sh script/sky_timelapse/run_evaluate.sh

Cross-domain Video Synthesis

(FFHQ, VoxCeleb)

Collect the PCA components from a pre-trained image generator.

python get_stats_pca.py --batchSize 4000 \
  --save_pca_path pca_stats/ffhq_256 \
  --pca_iterations 250 \
  --latent_dimension 512 \
  --img_g_weights /path/to/ffhq_image_generator \
  --style_gan_size 256 \
  --gpu 0

Train the model

python -W ignore train.py --name ffhq_256-voxel \
  --time_step 2 \
  --lr 0.0001 \
  --save_pca_path pca_stats/ffhq_256 \
  --latent_dimension 512 \
  --dataroot /path/to/voxel_dataset \
  --checkpoints_dir checkpoints \
  --img_g_weights /path/to/ffhq_image_generator \
  --multiprocessing_distributed --world_size 1 --rank 0 \
  --batchSize 16 \
  --workers 8 \
  --style_gan_size 256 \
  --total_epoch 25 \
  --cross_domain

Inference

python -W ignore evaluate.py  \
  --save_pca_path pca_stats/ffhq_256 \
  --latent_dimension 512 \
  --style_gan_size 256 \
  --img_g_weights /path/to/ffhq_image_generator \
  --load_pretrain_path /path/to/checkpoints \
  --load_pretrain_epoch the_epoch_for_testing (should be >= 0) \
  --results results/ffhq_256 \
  --num_test_videos 10

(FFHQ-1024, VoxCeleb)

Collect the PCA components from a pre-trained image generator.

sh script/ffhq-vox/run_get_stats_pca_1024.sh

Train the model

sh script/ffhq-vox/run_train_1024.sh

Inference

sh script/ffhq-vox/run_evaluate_1024.sh

(AFHQ, VoxCeleb)

Collect the PCA components from a pre-trained image generator.

sh script/afhq-vox/run_get_stats_pca.sh

Train the model

sh script/afhq-vox/run_train.sh

Inference

sh script/afhq-vox/run_evaluate.sh

(Anime, VoxCeleb)

Collect the PCA components from a pre-trained image generator.

sh script/anime-vox/run_get_stats_pca.sh

Train the model

sh script/anime-vox/run_train.sh

Inference

sh script/anime-vox/run_evaluate.sh

(LSUN-Church, TLVDB)

Collect the PCA components from a pre-trained image generator.

sh script/lsun_church-tlvdb/run_get_stats_pca.sh

Train the model

sh script/lsun_church-tlvdb/run_train.sh

Inference

sh script/lsun_church-tlvdb/run_evaluate.sh

Fine-tuning

If you wish to resume interrupted training or fine-tune a pre-trained model, run the following (UCF-101 is used as an example):

python -W ignore train.py --name ucf_101 \
  --time_step 2 \
  --lr 0.0001 \
  --save_pca_path pca_stats/ucf_101 \
  --latent_dimension 512 \
  --dataroot /path/to/ucf_101 \
  --checkpoints_dir checkpoints \
  --img_g_weights /path/to/ucf_101_image_generator \
  --multiprocessing_distributed --world_size 1 --rank 0 \
  --batchSize 16 \
  --workers 8 \
  --style_gan_size 256 \
  --total_epoch 100 \
  --load_pretrain_path /path/to/checkpoints \
  --load_pretrain_epoch 0

Training Control With Options

--w_residual controls the step size of the motion residual; default 0.2; we recommend values <= 0.5
--n_pca number of PCA basis vectors used in the motion residual calculation; default 384 (out of the 512-dimensional StyleGAN2 w space); we recommend >= 256
--q_len size of the queue that stores logits for the contrastive loss; default 4,096
--video_frame_size spatial size of video frames for training; all synthesized video clips are down-sampled to this size before being fed to the video discriminator; default 128; a larger size may lead to better motion modeling
--cross_domain enable for cross-domain video synthesis; default False
--w_match weight of the feature matching loss; default 1.0; a larger value improves content matching
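
For illustration only (the option values below are placeholders, not settings recommended by the authors), several of these options can be combined in a single cross-domain training command:

python -W ignore train.py --name ffhq_256-voxel \
  --time_step 2 \
  --lr 0.0001 \
  --save_pca_path pca_stats/ffhq_256 \
  --latent_dimension 512 \
  --dataroot /path/to/voxel_dataset \
  --checkpoints_dir checkpoints \
  --img_g_weights /path/to/ffhq_image_generator \
  --multiprocessing_distributed --world_size 1 --rank 0 \
  --batchSize 16 \
  --workers 8 \
  --style_gan_size 256 \
  --total_epoch 25 \
  --cross_domain \
  --w_residual 0.3 \
  --n_pca 384 \
  --q_len 4096 \
  --video_frame_size 256 \
  --w_match 1.0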

Long Sequence Generation

LSTM Unrolling

At inference time, you can generate longer sequences by LSTM unrolling with --n_frames_G:

python -W ignore evaluate.py  \
  --save_pca_path pca_stats/ffhq_256 \
  --latent_dimension 512 \
  --style_gan_size 256 \
  --img_g_weights /path/to/ffhq_image_generator \
  --load_pretrain_path /path/to/checkpoints \
  --load_pretrain_epoch 0 \
  --n_frames_G 32

Interpolation

At inference time, you can also generate longer sequences by interpolation with --interpolation:

python -W ignore evaluate.py  \
  --save_pca_path pca_stats/ffhq_256 \
  --latent_dimension 512 \
  --style_gan_size 256 \
  --img_g_weights /path/to/ffhq_image_generator \
  --load_pretrain_path /path/to/checkpoints \
  --load_pretrain_epoch 0 \
  --interpolation

Examples of Generated Videos

UCF-101

FaceForensics

Sky Timelapse

(FFHQ, VoxCeleb)

(FFHQ-1024, VoxCeleb)

(Anime, VoxCeleb)

(LSUN-Church, TLVDB)

Citation

If you use this code in your work, please cite our paper:

@inproceedings{tian2021a,
  title={A Good Image Generator Is What You Need for High-Resolution Video Synthesis},
  author={Yu Tian and Jian Ren and Menglei Chai and Kyle Olszewski and Xi Peng and Dimitris N. Metaxas and Sergey Tulyakov},
  booktitle={International Conference on Learning Representations},
  year={2021},
  url={https://openreview.net/forum?id=6puCSjH3hwA}
}

Acknowledgments

This code borrows from the StyleGAN2 image generator, the BigGAN discriminator, and the PatchGAN discriminator.


mocogan-hd's Issues

Question about the cross-domain video discriminator

Hi, thanks for your great work!

I have a question about the cross-domain video discriminator.

According to your paper, you can learn to synthesize video content from one dataset A (such as Anime-Face) while the motion comes from another dataset B (such as VoxCeleb). In this mode, I think the video discriminator will first learn to distinguish anime content from real-person content rather than to distinguish meaningful motions. How do you ensure that the video discriminator is helpful during training in this mode?

Incorrect link for the image generator checkpoint on FaceForensics

Hi! Thank you for the project and the codebase! I noticed that for some datasets, links to the pretrained models do not work: e.g. the image generator link on FaceForensics leads to https://github.com/snap-research/MoCoGAN-HD/blob/main/pretrained_models/faceforensics-fid10.9920-snapshot-008765.pt, which does not exist (same for (Anime, VoxCeleb) and (AFHQ, VoxCeleb) cross-domain image generators). Could you please provide a link for the pretrained image generator on FaceForensics?

Question about Inception score evaluation

Hello
Thank you for your great work! I read the paper carefully.

I wonder how to calculate the inception score of UCF-101 in detail.
I read that you follow the tgan paper for evaluating the inception scores and use the C3D network for getting the predictions.

Which weights did you use for the C3D network? Did you train it from scratch?
If not, could you let me know which weights you used for the C3D network and how to use it?

In detail, in this paper, the size of the generated videos (UCF-101) is (224, 224). But the pre-trained C3D network at this link (https://github.com/rezoo/tgan2/releases/download/v1.0/conv3d_deepnetA_ucf.npz) was not trained with a (224, 224) configuration. How did you resize and normalize the frames?

I would be very grateful if you could reply.
Thanks.

Inference issue using pre-trained models

Hi,
Great Work!
I was using the pre-trained model for inference on the Sky-Timelapse and UCF-101 datasets. However, in both cases, gray videos are generated. I have not made any changes to the code, and there are no errors or warnings. Did you face a similar issue?

video-gen_19_5_noise.mp4

Usage of UCF-101 dataset

Hi, thank you for sharing the code of your elegant work!

I have a question about the experimental setup for the UCF-101 experiments.
Did you use the "train" split of UCF-101 or the whole dataset without a split?

Thank you in advance!

Sincerely,
Sihyun

Question about the FVD evaluation

Hi,

First of all, thank you for your great work!

As I read your paper,
I understand that the FVD is calculated from 2048 videos at 128x128 resolution on the UCF-101 dataset.

To evaluate your model on UCF-101, I randomly sampled 2048 real videos (random clips of 16 consecutive frames) and resized them to 128x128 resolution.
Then, I calculated the FVD between the sampled real and fake videos.

As a result, I got 625.87, which is a little lower than the distance you reported.
I suspect there is some difference in how the real video samples are built compared to your implementation, or that FVD fluctuates considerably due to the randomness of sampling.

Could you share the detailed FVD evaluation process for the UCF-101 and FaceForensics datasets?

Thanks,

README should be updated

Some paths in the README and the current scripts don't match (screenshot attached in the original issue).

The commands should be run like this:
sh script/ffhq-vox/run_evaluate_1024.sh and sh script/ffhq-vox/run_get_stats_pca_1024.sh

Augmentation for training?

Hello,

As I saw in issue #5 (specifically, the comment below), I understand that DiffAugment is applied when training on the UCF-101 dataset.

Is DiffAugment also applied to the FaceForensics dataset?
Similar to UCF-101, which has only a small number of samples per class,
FaceForensics has only 704 training videos, and I think this is not enough data to train GANs.

Hi @sihyun-yu, have you tried to use the augmentation from this work?

The FID was calculated during training from StyleGAN2.

Originally posted by @alanspike in #5 (comment)

Thanks,

How to train on a custom dataset?

I have a custom dataset of face videos from the How2Sign dataset. I have the dataset in the format required by this repository. What are the steps for training on a custom dataset?

Hyperparameters to train StyleGANv2 on UCF-101

Hello! Thanks again for providing the implementation.
I am trying to retrain an "unconditional" image generator from scratch on the UCF-101 dataset using StyleGANv2, as you suggested.
Did you use specific hyperparameters to train such a model to reach the reported FID?
If so, can you share those hyperparameters?

Thanks in advance!

Sincerely,
Sihyun

Cannot run get_stats_pca.py

Hi,

I was able to run the get_stats_pca.py script using the pretrained image generator models provided by you. I was installing a few more packages into my conda environment when it stopped running altogether.

I have tried uninstalling and reinstalling the conda environment using the requirements.txt provided in the repository. This is my command: python get_stats_pca.py --batchSize 4000 --save_pca_path pca_stats/ucf_101 --pca_iterations 250 --latent_dimension 512 --img_g_weights pretrained_checkpoints/ucf-256-fid41.6761-snapshot-006935.pt --style_gan_size 256 --gpu 0

The process just hangs forever, the GPU memory usage goes from 0 MB to 3 MB, and nothing else happens. I don't know what I could have done wrong; it was working before. As an additional step, I also set up the repository from scratch.

Any idea what might have happened?

Question about the way you finetune the generator

Dear authors,

I want to ask a question about how you fine-tune the generator. Taking FaceForensics as an example, did you use all cropped frames as the fine-tuning dataset, or only several frames per identity?

Thanks a lot.

Did you cut first seconds of the FaceForensics dataset?

Hi! FaceForensics contains "video starting" artifacts in the first ~0.5 seconds of many of its videos (see the gif), which might produce corresponding training artifacts. Did you remove them?

Here are random samples from FFS, cut to the first 0.5 seconds:

real_ffs_128_unstable_1s

Also, did you account for them in any way when computing FVD?

Did you use any truncation or curation for the released samples?

Hi! Could you please tell us whether you used any truncation for the content or motion codes, or curated the samples, for these generations: https://github.com/snap-research/MoCoGAN-HD#faceforensics-1 ? I used your pretrained checkpoint, PCA stats, and the pretrained G to generate samples with --n_frames_G=32 and without spatial noise. The results seem to be of lower quality compared to the ones shown in your README.md. Here are the samples I got (sorry for the external link; GitHub for some reason does not want to upload the gif even though it is less than 10 MB):

https://i.imgur.com/1QRibnD.mp4

For example, the motion diversity is not that good, i.e. the heads do not "speak". Could you tell me why there is such a difference?
