ajbrock / biggan-pytorch

The author's officially unofficial PyTorch BigGAN implementation.

License: MIT License

Python 95.46% Shell 2.37% MATLAB 2.17%
biggan pytorch deep-learning neural-networks gans dogball

biggan-pytorch's Introduction

BigGAN-PyTorch

The author's officially unofficial PyTorch BigGAN implementation.

Dogball? Dogball!

This repo contains code for 4-8 GPU training of BigGANs from Large Scale GAN Training for High Fidelity Natural Image Synthesis by Andrew Brock, Jeff Donahue, and Karen Simonyan.

This code is by Andy Brock and Alex Andonian.

How To Use This Code

You will need:

  • PyTorch, version 1.0.1
  • tqdm, numpy, scipy, and h5py
  • The ImageNet training set

First, you may optionally prepare a pre-processed HDF5 version of your target dataset for faster I/O. Whether or not you do, you will then need to compute the Inception moments used to calculate FID. Both of these can be done by modifying and running

sh scripts/utils/prepare_data.sh

By default, this assumes your ImageNet training set has been downloaded into the folder data in this directory, and it will prepare the cached HDF5 at 128x128 pixel resolution.
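If you want to sanity-check the cached HDF5 (or feed it into your own pipeline), it can be read with h5py. The sketch below is a minimal example under the assumption that the file stores the images and labels as datasets named imgs and labels; the path follows the default ILSVRC128.hdf5 naming, so adjust it if yours differs.

```python
import h5py
import numpy as np

# Minimal sketch: inspect the cached HDF5 produced by the data-prep step.
# Assumes datasets named 'imgs' and 'labels'; adjust the names/path if your file differs.
with h5py.File('data/ILSVRC128.hdf5', 'r') as f:
    print(list(f.keys()))              # e.g. ['imgs', 'labels']
    imgs, labels = f['imgs'], f['labels']
    print(imgs.shape, imgs.dtype)      # expected (N, 3, 128, 128), uint8
    batch = np.asarray(imgs[:64])      # read the first 64 images into memory
    print(batch.min(), batch.max(), labels[:10])
```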

In the scripts folder, there are multiple bash scripts which will train BigGANs with different batch sizes. This code assumes you do not have access to a full TPU pod, and accordingly spoofs mega-batches by using gradient accumulation (averaging grads over multiple minibatches, and only taking an optimizer step after N accumulations). By default, the launch_BigGAN_bs256x8.sh script trains a full-sized BigGAN model with a batch size of 256 and 8 gradient accumulations, for a total batch size of 2048. On 8xV100 with full-precision training (no Tensor cores), this script takes 15 days to train to 150k iterations.
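For readers unfamiliar with the trick, the following is a minimal sketch of the general gradient-accumulation pattern (an illustration, not this repo's actual training loop): gradients from N minibatches are averaged before a single optimizer step, so the effective batch size is the minibatch size times N (e.g. 256 x 8 = 2048).

```python
import torch

# Generic gradient-accumulation pattern (illustrative only, not the repo's train loop).
# Effective batch size = minibatch size * num_accumulations, e.g. 256 * 8 = 2048.
def accumulated_step(model, loss_fn, batches, optimizer, num_accumulations=8):
    optimizer.zero_grad()
    for _, (x, y) in zip(range(num_accumulations), batches):
        loss = loss_fn(model(x), y)
        # Scale so the summed gradients equal the average over the accumulated minibatches.
        (loss / num_accumulations).backward()
    optimizer.step()  # one parameter update per num_accumulations minibatches
```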

You will first need to figure out the maximum batch size your setup can support. The pre-trained models provided here were trained on 8xV100 (16GB VRAM each) which can support slightly more than the BS256 used by default. Once you've determined this, you should modify the script so that the batch size times the number of gradient accumulations is equal to your desired total batch size (BigGAN defaults to 2048).

Note also that this script uses the --load_in_mem arg, which loads the entire (~64GB) I128.hdf5 file into RAM for faster data loading. If you don't have enough RAM to support this (probably 96GB+), remove this argument.

Metrics and Sampling

I believe I can fly!

During training, this script will output logs with training metrics and test metrics, will save multiple copies (2 most recent and 5 highest-scoring) of the model weights/optimizer params, and will produce samples and interpolations every time it saves weights. The logs folder contains scripts to process these logs and plot the results using MATLAB (sorry not sorry).

After training, one can use sample.py to produce additional samples and interpolations, test with different truncation values, batch sizes, number of standing stat accumulations, etc. See the sample_BigGAN_bs256x8.sh script for an example.

By default, everything is saved to weights/samples/logs/data folders which are assumed to be in the same folder as this repo. You can point all of these to a different base folder using the --base_root argument, or pick specific locations for each of these with their respective arguments (e.g. --logs_root).

We include scripts to run BigGAN-deep, but we have not fully trained a model using them, so consider them untested. Additionally, we include scripts to run a model on CIFAR, and to run SA-GAN (with EMA) and SN-GAN on ImageNet. The SA-GAN code assumes you have 4xTitanX (or equivalent in terms of GPU RAM) and will run with a batch size of 128 and 2 gradient accumulations.

An Important Note on Inception Metrics

This repo uses the PyTorch in-built inception network to calculate IS and FID. These scores are different from the scores you would get using the official TF inception code, and are only for monitoring purposes! Run sample.py on your model, with the --sample_npz argument, then run inception_tf13 to calculate the actual TensorFlow IS. Note that you will need to have TensorFlow 1.3 or earlier installed, as TF1.4+ breaks the original IS code.

Pretrained models

PyTorch Inception Score and FID

We include two pretrained model checkpoints (with G, D, the EMA copy of G, the optimizers, and the state dict):

  • The main checkpoint is for a BigGAN trained on ImageNet at 128x128, using BS256 and 8 gradient accumulations, taken just before collapse, with a TF Inception Score of 97.35 +/- 1.79: LINK
  • An earlier checkpoint of the first model (100k G iters), at high performance but well before collapse, which may be easier to fine-tune: LINK

Pretrained models for Places-365 coming soon.

This repo also contains scripts for porting the original TFHub BigGAN Generator weights to PyTorch. See the scripts in the TFHub folder for more details.

Fine-tuning, Using Your Own Dataset, or Making New Training Functions

That's deep, man

If you wish to resume interrupted training or fine-tune a pre-trained model, run the same launch script but with the --resume argument added. Experiment names are automatically generated from the configuration, but can be overridden using the --experiment_name arg (for example, if you wish to fine-tune a model using modified optimizer settings).

To prep your own dataset, you will need to add it to datasets.py and modify the convenience dicts in utils.py (dset_dict, imsize_dict, root_dict, nclass_dict, classes_per_sheet_dict) to have the appropriate metadata for your dataset. Repeat the process in prepare_data.sh (optionally produce an HDF5 preprocessed copy, and calculate the Inception Moments for FID).
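As a concrete illustration, adding a hypothetical 256x256, 5-class dataset called 'mydata' might involve entries like the following in utils.py; the names, resolution, and class count are placeholders for your own dataset.

```python
# utils.py (sketch): add one entry per convenience dict for a hypothetical
# dataset 'mydata' backed by an ImageFolder under the data root.
dset_dict.update({'mydata': dset.ImageFolder})     # how the dataset is loaded
imsize_dict.update({'mydata': 256})                # native resolution
root_dict.update({'mydata': 'mydata_imgs'})        # folder (or .hdf5 file) under the data root
nclass_dict.update({'mydata': 5})                  # number of classes
classes_per_sheet_dict.update({'mydata': 5})       # classes shown per sample sheet
```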

By default, the training script will save the top 5 best checkpoints as measured by Inception Score. For datasets other than ImageNet, Inception Score can be a very poor measure of quality, so you will likely want to use --which_best FID instead.

To use your own training function (e.g. to train a BigVAE), either modify train_fns.GAN_training_function or define a new train fn and hook it in after the if config['which_train_fn'] == 'GAN': line in train.py.
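A rough sketch of what that might look like; the MyVAE names below are hypothetical, and only the config['which_train_fn'] dispatch mirrors the existing code.

```python
# train_fns.py (sketch): a new training-function factory alongside GAN_training_function.
def MyVAE_training_function(model, optimizer, config):
    def train(x, y):
        optimizer.zero_grad()
        loss = model(x, y)          # whatever objective your model returns
        loss.backward()
        optimizer.step()
        return {'loss': float(loss.item())}
    return train

# train.py (sketch): dispatch on the config, next to the existing GAN branch.
# if config['which_train_fn'] == 'GAN':
#     train = train_fns.GAN_training_function(...)   # existing branch
# elif config['which_train_fn'] == 'MyVAE':
#     train = train_fns.MyVAE_training_function(model, optimizer, config)
```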

Neat Stuff

  • We include the full training and metrics logs here for reference. I've found that one of the hardest things about re-implementing a paper can be checking if the logs line up early in training, especially if training takes multiple weeks. Hopefully these will be helpful for future work.
  • We include an accelerated FID calculation: the original scipy version can require upwards of 10 minutes to compute the matrix sqrt, whereas this version uses an accelerated PyTorch routine to compute it in under a second (a sketch of the idea follows after this list).
  • We include an accelerated, low-memory consumption ortho reg implementation.
  • By default, we only compute the top singular value (the spectral norm), but this code supports computing more SVs through the --num_G_SVs argument.
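Here is a minimal sketch of the idea behind the accelerated matrix sqrt (a Newton-Schulz iteration run on the GPU) and how it plugs into FID; this illustrates the technique and is not the repo's exact implementation.

```python
import torch

def sqrt_newton_schulz(a, num_iters=50, eps=1e-12):
    """Approximate matrix square root via Newton-Schulz iteration.
    Works for matrices with nonnegative real eigenvalues, e.g. products
    of covariance matrices as used in FID."""
    dim = a.shape[0]
    norm_a = a.norm() + eps                    # Frobenius norm, used to normalize
    y = a / norm_a
    i = torch.eye(dim, dtype=a.dtype, device=a.device)
    z = torch.eye(dim, dtype=a.dtype, device=a.device)
    for _ in range(num_iters):
        t = 0.5 * (3.0 * i - z @ y)
        y = y @ t                              # y -> a^(1/2) / sqrt(norm_a)
        z = t @ z                              # z -> a^(-1/2) * sqrt(norm_a)
    return y * torch.sqrt(norm_a)

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """FID between two Gaussians (mu, sigma), using the Newton-Schulz sqrt."""
    covmean = sqrt_newton_schulz(sigma1 @ sigma2)
    diff = mu1 - mu2
    return (diff @ diff + torch.trace(sigma1 + sigma2 - 2.0 * covmean)).item()
```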

Key Differences Between This Code And The Original BigGAN

  • We use the optimizer settings from SA-GAN (G_lr=1e-4, D_lr=4e-4, num_D_steps=1, as opposed to BigGAN's G_lr=5e-5, D_lr=2e-4, num_D_steps=2). While slightly less performant, this was the first corner we cut to bring training times down.
  • By default, we do not use Cross-Replica BatchNorm (AKA Synced BatchNorm). The two variants we tried (a custom, naive one and the one included in this repo) have slightly different gradients (albeit identical forward passes) from the built-in BatchNorm, which appear to be sufficient to cripple training.
  • Gradient accumulation means that we update the SV estimates and the BN statistics 8 times more frequently. This means that the BN stats are much closer to standing stats, and that the singular value estimates tend to be more accurate. Because of this, we measure metrics by default with G in test mode (using the BatchNorm running stat estimates instead of computing standing stats as in the paper). We do still support standing stats (see the sample.sh scripts). This could also conceivably result in gradients from the earlier accumulations being stale, but in practice this does not appear to be a problem.
  • The currently provided pretrained models were not trained with orthogonal regularization. Training without ortho reg seems to increase the probability that models will not be amenable to truncation, but it looks like this particular model got a winning ticket. Regardless, we provide two highly optimized (fast and minimal memory consumption) ortho reg implementations which directly compute the ortho reg gradients; a sketch of the idea follows below.
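The following is a minimal sketch of that idea: the gradient of the off-diagonal orthogonal penalty is computed directly and added to each weight's .grad, rather than building the penalty into the loss graph. It is an illustration under the stated form of the regularizer, not the repo's exact utils code.

```python
import torch

def apply_ortho_reg(model, strength=1e-4):
    """Add the gradient of an off-diagonal orthogonal regularizer,
    strength * || (W W^T) * (1 - I) ||_F^2 over each weight's flattened filters,
    directly to param.grad (constant factors folded into strength)."""
    with torch.no_grad():
        for param in model.parameters():
            if param.dim() < 2 or param.grad is None:
                continue
            w = param.view(param.shape[0], -1)       # one row per filter
            gram = w @ w.t()                         # filter Gram matrix
            mask = 1.0 - torch.eye(w.shape[0], device=w.device, dtype=w.dtype)
            grad = (gram * mask) @ w                 # gradient direction of the masked penalty
            param.grad.add_(strength * grad.view(param.shape))

# Usage: call after loss.backward() and before optimizer.step().
```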

A Note On The Design Of This Repo

This code is designed from the ground up to serve as an extensible, hackable base for further research code. We've put a lot of thought into making sure the abstractions are the right thickness for research--not so thick as to be impenetrable, but not so thin as to be useless. The key idea is that if you want to experiment with a SOTA setup and make some modification (try out your own new loss function, architecture, self-attention block, etc) you should be able to easily do so just by dropping your code in one or two places, without having to worry about the rest of the codebase. Things like the use of self.which_conv and functools.partial in the BigGAN.py model definition were put together with this in mind, as was the design of the Spectral Norm class inheritance.

With that said, this is a somewhat large codebase for a single project. While we tried to be thorough with the comments, if there's something you think could be more clear, better written, or better refactored, please feel free to raise an issue or a pull request.

Feature Requests

Want to work on or improve this code? There are a couple things this repo would benefit from, but which don't yet work.

  • Synchronized BatchNorm (AKA Cross-Replica BatchNorm). We tried out two variants of this, but for some unknown reason it crippled training each time. We have not tried the apex SyncBN as my school's servers are on ancient NVIDIA drivers that don't support it--apex would probably be a good place to start.
  • Mixed precision training and making use of Tensor cores. This repo includes a naive mixed-precision Adam implementation which works early in training but leads to early collapse, and doesn't do anything to activate Tensor cores (it just reduces memory consumption). As above, integrating apex into this code and employing its mixed-precision training techniques to take advantage of Tensor cores and reduce memory consumption could yield substantial speed gains.

Misc Notes

See this directory for ImageNet labels.

If you use this code, please cite

@inproceedings{brock2018large,
  title={Large Scale {GAN} Training for High Fidelity Natural Image Synthesis},
  author={Andrew Brock and Jeff Donahue and Karen Simonyan},
  booktitle={International Conference on Learning Representations},
  year={2019},
  url={https://openreview.net/forum?id=B1xsqj09Fm},
}

Acknowledgments

Thanks to Google for the generous cloud credit donations.

SyncBN by Jiayuan Mao and Tete Xiao.

Progress bar originally from Jan Schlüter.

Test metrics logger from VoxNet.

PyTorch implementation of cov from Modar M. Alfadly.

PyTorch fast Matrix Sqrt for FID from Tsung-Yu Lin and Subhransu Maji.

TensorFlow Inception Score code from OpenAI's Improved-GAN.

biggan-pytorch's People

Contributors

ajbrock, alexandonian, cclauss, darthsuogles, ishengfang, jeffling, mikeshatch


biggan-pytorch's Issues

z_dim for 512x512 model

The paper mentions in Appendix B that the 512x512 model should use a 160-dimensional z. But both in this codebase and in TFHub, a 128-dimensional z space is used (which is smaller than the 140-dimensional z space for the 256x256 model). Which is correct? If the code is correct, should the paper be updated, and why does the 512x512 model use a smaller z space than the 256x256 model?

MemoryError when calculate IS

I want to train BigGAN on the Places365 dataset and have rewritten the dataloader. Building the HDF5 file works fine. However, when I run calculate_inception_moments.py, it hits a MemoryError. The log is below:

Traceback (most recent call last):
  File "calculate_inception_moments.py", line 93, in <module>
    main()
  File "calculate_inception_moments.py", line 89, in main
    run(config)
  File "calculate_inception_moments.py", line 68, in run
    pool, logits, labels = [np.concatenate(item, 0) for item in [pool, logits, labels]]
  File "calculate_inception_moments.py", line 68, in <listcomp>
    pool, logits, labels = [np.concatenate(item, 0) for item in [pool, logits, labels]]
MemoryError
Do you know how to solve the problem? Looking forward to your reply.
Thanks

FID is nan

Hi guys, thank you for the amazing work!

I am having trouble with FID. During training, the log often shows FID as nan. It does not happen every time, but it happens more often than not.

So I wonder in which cases FID can be nan? And what can I do to prevent it?

Thank you very much!

Why channel drop?

I was checking other repository (before checking this one), and then found a strange channel drop trick.
huggingface/pytorch-pretrained-BigGAN#9

I can see you also use it here:

# Drop channels in x if necessary
if self.in_channels != self.out_channels:
    x = x[:, :self.out_channels]

Could you explain why you do this? It seems strange to train with more channels than necessary and then drop them at inference time. Does this trick somehow help training?

Undefined name 'self' in layers.py

flake8 testing of https://github.com/ajbrock/BigGAN-PyTorch on Python 3.7.1

$ flake8 . --count --select=E9,F63,F72,F82 --show-source --statistics

./layers.py:261:14: F821 undefined name 'self'
  if 'ch' in self.norm_style:
             ^
./layers.py:262:14: F821 undefined name 'self'
    ch = int(self.norm_style.split('_')[-1])
             ^
./layers.py:265:17: F821 undefined name 'self'
  elif 'grp' in self.norm_style:
                ^
./layers.py:266:18: F821 undefined name 'self'
    groups = int(self.norm_style.split('_')[-1])
                 ^
./utils.py:1005:35: F632 use ==/!= to compare str, bytes, and int literals
  'Gattn%s' % config['G_attn'] if config['G_attn'] is not '0' else None,
                                  ^
./utils.py:1006:35: F632 use ==/!= to compare str, bytes, and int literals
  'Dattn%s' % config['D_attn'] if config['D_attn'] is not '0' else None,
                                  ^
./train_fns.py:165:28: F821 undefined name 'z_'
                           z_, y_, config['n_classes'],
                           ^
./train_fns.py:165:32: F821 undefined name 'y_'
                           z_, y_, config['n_classes'],
                               ^
./sync_batchnorm/batchnorm_reimpl.py:15:1: F822 undefined name 'BatchNormReimpl' in __all__
__all__ = ['BatchNormReimpl']
^
2     F632 use ==/!= to compare str, bytes, and int literals
6     F821 undefined name 'self'
1     F822 undefined name 'BatchNormReimpl' in __all__
9

E901,E999,F821,F822,F823 are the "showstopper" flake8 issues that can halt the runtime with a SyntaxError, NameError, etc. These 5 are different from most other flake8 issues, which are merely "style violations" -- useful for readability but they do not affect runtime safety.

  • F821: undefined name name
  • F822: undefined name name in __all__
  • F823: local variable name referenced before assignment
  • E901: SyntaxError or IndentationError
  • E999: SyntaxError -- failed to compile a file into an Abstract Syntax Tree

Question regarding the MultiEpochSampler

I've been reading the codebase, and the MultiEpochSampler looks odd to me. I don't know if I understand it correctly, but it seems that in utils.get_data_loaders:
train_set is a vanilla data loader that reads through all the data once, namely one epoch of data.
sampler is a MultiEpochSampler whose length is len(train_set) * num_epoch.
During training, a nested loop iterates through sampler num_epoch times. In total, the model is trained on num_epoch^2 * len(train_set) images, and the "real" number of epochs is actually num_epoch^2.

I'm wondering if it's on purpose or a bug. Thank you very much.

Why drop_last of DataLoader is disabled

I cut num_workers to 0 due to lack of RAM and ran BigGAN_bs256x8.sh, and ended up with the error below (see attached screenshot).
I noticed that it was processing the last batch of the first epoch, so I dug into the dataloader code and found that drop_last is disabled when use_multiepoch_sampler is enabled. Could that cause the tuple error I am seeing?

Some guidelines for better GPU performance.

Hi,

I am trying to train on a custom dataset using your algorithm.
I increase the batch size up to the point where my script doesn't break.
I am running a script very similar to launch_BigGAN_ch64_bs256x8.sh.
I see that the memory allocated is twice as much as what is actually used (~5GB utilised, ~10GB allocated).
Also, after step 1000 the model broke because it tried to allocate more memory (I guess for prediction, but why is that?).

I use 4 Titans with 11GB each, and although you say the opposite, I would appreciate any suggestions on how I can use this already really powerful system to train your model.

Thanks a lot for your code and in advance for your time!

Places365

Hi!
I'd like to ask about pretrained models for Places365. The README says that Places365 pretrained models are coming soon. Is there still a plan to make them available? I believe I wouldn't be the only one who would find them useful, so it would be great if you could release them.

unsupervised GAN without class as conditional input?

Hi ajbrock,

Thanks a lot for sharing this great repo! I'm trying to train the model on CIFAR-10 first, but I found that the CIFAR-10 training setup is similar to the ImageNet one, in that it requires a latent vector z and a class label y as inputs. In most papers, however, CIFAR-10 is mainly used for unsupervised GANs, which only take a latent vector z. Is there a way to train the model on CIFAR-10 in an unsupervised manner?

By the way, when I directly run the CIFAR-10 training script, the log shows that the FID score sometimes turns to NaN. What's the reason behind that?

Thank you very much!

Are SAGAN and SNGAN scripts tested?

I have tried to train SAGAN and SNGAN using the provided scripts, and I find that performance grows very slowly; e.g., at 10k iterations they both have an IS of only around 1.00 and an FID around 300, and sometimes the FID becomes nan.
I wonder if these two scripts have been tested, as the README notes that the BigGAN-deep scripts are untested but does not mention these two.

Sync Batchnorm

PyTorch recently released an official SyncBatchnorm implementation. It requires a specific setup where we use torch.nn.parallel.DistributedDataParallel(...) instead of nn.DataParallel(...) and launch a separate process for each GPU.

I wrote a small step-by-step here: https://github.com/dougsouza/pytorch-sync-batchnorm-example.

In my experiments SyncBatchnorm worked well. Also, using torch.nn.parallel.DistributedDataParallel(...) with one process per GPU provides a huge speedup in training. The gain from adding more GPUs is almost linear, and it performs a lot faster than nn.DataParallel(...). I believe you could reduce training time drastically by switching to torch.nn.parallel.DistributedDataParallel(...).

BTW, thanks for this implementation!
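For reference, the conversion described above looks roughly like the sketch below on recent PyTorch versions (a minimal single-node example, assuming one process per GPU launched via torch.distributed.launch or torchrun; it is not part of this repo).

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp_model(model):
    # Assumes one process per GPU, launched so that LOCAL_RANK and the
    # usual MASTER_ADDR/MASTER_PORT environment variables are already set.
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(local_rank)
    # Swap every BatchNorm layer for a synchronized version.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])
```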

Training results (IS and FID) are not as good as yours with the same training process

Hi ajbrock,
I was running the training code on ImageNet using the default script launch_BigGAN_bs256x8.sh. It has finished 134k iterations; here is the log file.
Screen Shot 2019-05-16 at 9 33 58 AM

Compared with the log file that you released, I got worse results. I kept all the parameters the same as your default settings. The training is on 8xV100. Do you have any suggestions for making it better? Or what should I check in order to get results similar to yours?

Thanks a lot!

Performance on small datasets?

Hi,
Does anyone know how BigGAN performs on small datasets, e.g. the birds dataset (Caltech-UCSD Birds-200-2011)? In such a dataset, we only have about 50 images per class.
I have tried to run BigGAN on the birds dataset, but the results don't look very good (worse than baseline models). Since BigGAN is a large model, does it need a large dataset to perform well?

training at 512x512

Hello, I am trying to train a BigGAN on a custom dataset whose resolution is 512x512.
I edited one of the provided scripts to launch the training, but I got a key error when the code builds the discriminator. I noticed that there is no discriminator architecture for 512 resolution. Could you provide a discriminator architecture for that resolution?

Thanks!

Use my own dataset

If I use my own dataset, do I need to train an inception_v3 on my own dataset when computing the initial measurements?

Any difference between Attention and SAGAN?

https://github.com/heykeetae/Self-Attention-GAN/blob/master/sagan_models.py
In this code

class Self_Attn(nn.Module):
    """ Self attention Layer"""
    def __init__(self,in_dim,activation):
        super(Self_Attn,self).__init__()
        self.chanel_in = in_dim
        self.activation = activation
        
        self.query_conv = nn.Conv2d(in_channels = in_dim , out_channels = in_dim//8 , kernel_size= 1)
        self.key_conv = nn.Conv2d(in_channels = in_dim , out_channels = in_dim//8 , kernel_size= 1)
        self.value_conv = nn.Conv2d(in_channels = in_dim , out_channels = in_dim , kernel_size= 1)
        self.gamma = nn.Parameter(torch.zeros(1))

        self.softmax  = nn.Softmax(dim=-1) #
    def forward(self,x):
        """
            inputs :
                x : input feature maps( B X C X W X H)
            returns :
                out : self attention value + input feature 
                attention: B X N X N (N is Width*Height)
        """
        m_batchsize,C,width ,height = x.size()
        proj_query  = self.query_conv(x).view(m_batchsize,-1,width*height).permute(0,2,1) # B X CX(N)
        proj_key =  self.key_conv(x).view(m_batchsize,-1,width*height) # B X C x (*W*H)
        energy =  torch.bmm(proj_query,proj_key) # transpose check
        attention = self.softmax(energy) # BX (N) X (N) 
        proj_value = self.value_conv(x).view(m_batchsize,-1,width*height) # B X C X N

        out = torch.bmm(proj_value,attention.permute(0,2,1) )
        out = out.view(m_batchsize,C,width,height)
        
        out = self.gamma*out + x
        return out,attention
# A non-local block as used in SA-GAN
# Note that the implementation as described in the paper is largely incorrect;
# refer to the released code for the actual implementation.

Which implementation do you mean is largely incorrect, and which is the actual one? Thanks

Runtime Error when saving model

Hello, I'm having the following run-time error when saving my model at the first model save point. Any ideas or help would be excellent. Thank you.

RuntimeError: The size of tensor a (25) must match the size of tensor b (50) at non-singleton dimension 0

More context:

Saving weights to weights/BigGAN_C100_seed1_Gch64_Dch64_bs50_nDs4_Glr2.0e-04_Dlr2.0e-04_Gnlrelu_Dnlrelu_GinitN02_DinitN02_ema/copy0...
Traceback (most recent call last):
  File "train.py", line 227, in <module>
    main()
  File "train.py", line 224, in main
    run(config)
  File "train.py", line 206, in run
    state_dict, config, experiment_name)
  File "/workspace/BigGAN-PyTorch/train_fns.py", line 140, in save_and_sample
    z_=z_)
  File "/workspace/BigGAN-PyTorch/utils.py", line 895, in sample_sheet
    o = nn.parallel.data_parallel(G, (z_[:classes_per_sheet], G.shared(y)))
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 207, in data_parallel
    outputs = parallel_apply(replicas, inputs, module_kwargs, used_device_ids)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/BigGAN-PyTorch/BigGAN.py", line 248, in forward
    h = block(h, ys[index])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/BigGAN-PyTorch/layers.py", line 399, in forward
    h = self.activation(self.bn1(x, y))
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/BigGAN-PyTorch/layers.py", line 325, in forward
    return out * gain + bias

RuntimeError: The size of tensor a (1000) must match the size of tensor b (10) at non-singleton dimension 0

Hi, I want to fine-tune BigGAN on my own dataset, which has 10 classes. I have modified num_classes and made sure that the checkpoint can be loaded into the model, but when I run it, it gives me this error. I cannot figure out what other config option I may have left out. Do I need to modify the number of classes of the inception_v3 model?

1/488 ( 0.00%) Traceback (most recent call last):
  File "../train.py", line 228, in <module>
    main()
  File "../train.py", line 224, in main
    run(config)
  File "../train.py", line 184, in run
    metrics = train(x, y)
  File "/cache/code/BigGAN-PyTorch/train_fns.py", line 58, in train
    D.optim.step()
  File "/root/miniconda3/lib/python3.6/site-packages/torch/optim/adam.py", line 93, in step
    exp_avg.mul_(beta1).add_(1 - beta1, grad)
RuntimeError: The size of tensor a (1000) must match the size of tensor b (10) at non-singleton dimension 0

How to understand the result given by discriminator?

Hi, I want to use the discriminator alone.

Following is my code:

import torch
from BigGAN import Discriminator
import utils
from utils import Distribution

def load():
    path ='pre-trained/138k/'
    d_state_dict = torch.load(path + 'D.pth')
    D = Discriminator(D_ch=96, skip_init=True)
    D.load_state_dict(d_state_dict)
    return D

if __name__ == '__main__':
    D = load()
    D.eval()
    D.cuda()
    x = ...  # x is an image
    x = x.to('cuda')
    y_ = Distribution(torch.zeros(1, requires_grad=False))
    y_.init_distribution('categorical', num_categories=1000)
    y_ = y_.to('cuda', torch.int64, non_blocking=False, copy=False)
    y_.sample_()
    print(D(x, y_[:1]))

I have three questions:

  1. Is my code correct? Especially the preparation of y_.
  2. How should I preprocess x? I use ImageNet val 2012.
  3. How should I understand the output of D? I find the value can be positive or negative, and I would like to know which range means the Discriminator thinks the input is fake or real.

Thanks for your time.
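For context on question 3: BigGAN-style models are trained with a hinge loss, so D outputs an unbounded score rather than a probability; during training, more positive scores correspond to "real" and more negative scores to "fake". Below is a minimal sketch of the hinge objectives commonly used for SA-GAN/BigGAN-style training, for interpretation only; check the repo's losses module for the exact code used here.

```python
import torch
import torch.nn.functional as F

# Hinge losses as commonly used in SA-GAN/BigGAN-style training (illustrative sketch).
def loss_hinge_dis(dis_fake, dis_real):
    # D is pushed to score real samples >= +1 and fake samples <= -1.
    loss_real = torch.mean(F.relu(1.0 - dis_real))
    loss_fake = torch.mean(F.relu(1.0 + dis_fake))
    return loss_real, loss_fake

def loss_hinge_gen(dis_fake):
    # G is pushed to make D's score on fakes as large (i.e. as "real") as possible.
    return -torch.mean(dis_fake)
```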

groupnorm function in layers.py

Hi, when I read the code in layers.py, I found that in L322, the input is x and self.normstyle, but there is no self.normstyle in the previous definition. Should it be self.norm_style? In the definition of groupnorm function in L259, the norm_style could have the format of "ch_32" or "grp_16", but if this norm_style is the self.norm_style in L322, it could only be a choice of "bn", "ln", "in", or "gn". Therefore, I feel quite confused about the function here. Could you give some more explanations? Thanks!

No space left on device

I'm running sh scripts/utils/prepare_data.sh and getting the following error

ubuntu@ip-172-31-13-86:~/BigGAN-PyTorch$ sh scripts/utils/prepare_data.sh
{'dataset': 'I128', 'data_root': 'data', 'batch_size': 128, 'num_workers': 4, 'chunk_size': 100, 'compression': False}
Using dataset root location data/ImageNet
Data will not be augmented...
Generating Index file I128_imgs.npz...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 7.39it/s]
Starting to load I128 into an HDF5 file with chunk size 100 and compression None...
0%| | 0/783 [00:00<?, ?it/s]Producing dataset of len 100163
Image chunks chosen as (100, 3, 128, 128)
Label chunks chosen as (100,)
83%|████████████████▉ | 653/783 [00:24<00:04, 26.39it/s]
Traceback (most recent call last):
  File "make_hdf5.py", line 110, in <module>
    main()
  File "make_hdf5.py", line 107, in main
    run(config)
  File "make_hdf5.py", line 97, in run
    f['imgs'][-x.shape[0]:] = x
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/h5py/_hl/dataset.py", line 632, in __setitem__
    self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 221, in h5py.h5d.DatasetID.write
  File "h5py/_proxy.pyx", line 132, in h5py._proxy.dset_rw
  File "h5py/_proxy.pyx", line 93, in h5py._proxy.H5PY_H5Dwrite
OSError: Can't prepare for writing data (file write failed: time = Sun Jul 21 20:37:20 2019, filename = 'data/ILSVRC128.hdf5', file descriptor = 22, errno = 28, error message = 'No space left on device', buf = 0x55da1f79c868, total write size = 735016, bytes this sub-write = 735016, bytes actually written = 18446744073709551615, offset = 4123893760)

{'dataset': 'I128_hdf5', 'data_root': 'data', 'batch_size': 64, 'parallel': False, 'augment': False, 'num_workers': 8, 'shuffle': False, 'seed': 0}
Using dataset root location data/ILSVRC128.hdf5
Downloading: "https://download.pytorch.org/models/inception_v3_google-1a9a5a14.pth" to /home/ubuntu/.cache/torch/checkpoints/inception_v3_google-1a9a5a14.pth
0%| | 16384/108857766 [00:00<00:02, 49977801.26it/s]
Traceback (most recent call last):
  File "calculate_inception_moments.py", line 91, in <module>
    main()
  File "calculate_inception_moments.py", line 87, in main
    run(config)
  File "calculate_inception_moments.py", line 55, in run
    net = inception_utils.load_inception_net(parallel=config['parallel'])
  File "/home/ubuntu/BigGAN-PyTorch/inception_utils.py", line 262, in load_inception_net
    inception_model = inception_v3(pretrained=True, transform_input=False)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torchvision/models/inception.py", line 45, in inception_v3
    progress=progress)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/hub.py", line 433, in load_state_dict_from_url
    _download_url_to_file(url, cached_file, hash_prefix, progress=progress)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/hub.py", line 367, in _download_url_to_file
    f.write(buffer)
  File "/home/ubuntu/anaconda3/lib/python3.6/tempfile.py", line 485, in func_wrapper
    return func(*args, **kwargs)
OSError: [Errno 28] No space left on device

What do I need to change in order to solve this?

Singular value clamping code?

I am looking for the code that clamps the singular values of the weight matrices, i.e. the regularization described on page 6 of the arxiv paper, but I can't seem to find it in the training loop. Does anyone know where it is?

Thanks

How to test on only one GPU?

Hi, we are a group of students reproducing this BigGAN model for our coursework. One question: we only have a single GPU on Colab, and we are wondering how to modify the model for that. BTW, we are also trying to use another dataset, and there are some problems there too. Hope to get your reply; we really appreciate it :).

Why train_transform wasn't applied in utils.get_data_loaders()

Hi,
I am building my own model on top of this code and am trying to speed up I/O, so I am using an HDF5 file as the input for my custom dataset. But in the get_data_loaders function in utils, it seems that 'train_transform' is not applied when using HDF5, and I could not find appropriate image preprocessing/transformation applied anywhere upstream or downstream in the data-feeding pipeline. Would you care to explain why 'train_transform' is not applied when using the HDF5 dataset format? Or if it is applied, would you point out exactly where? Thanks very much!

Looking forward to an official TPU version of the BigGAN code

Hi ajbrock,
Thanks for open-sourcing the BigGAN code, which helps a lot with further exploration of BigGAN. Given the training time involved, we hope to train BigGAN on TPU, so we used a TensorFlow implementation that looks close to yours (https://github.com/Octavian-ai/BigGAN-TPU-TensorFlow), but many of our attempts turned out badly. Could you release an official TPU version of the BigGAN code?

Sincerely

config:
Imagenet2012
tf-1.12.2
GCP v3-8 pod
training for up to 300k steps

Some failed results: sample images at 100k, 180k, and 300k steps (attached).

Error encountered in parallel training

Hello, I was training with my own dataset, which has 3 categories and 10K images in each. I use the launch_BigGAN_bs256x8.sh script.
The training exits with the following error message:

157/157 ( 99.36%) (TE/ET1k: 72:21 / 391:26) Traceback (most recent call last):_real : +0.914, D_loss_fake : +0.899
  File "train.py", line 227, in <module>
    main()
  File "train.py", line 224, in main
    run(config)
  File "train.py", line 184, in run
    metrics = train(x, y)
  File "/share/vision/BigGAN-PyTorch/train_fns.py", line 42, in train
    split_D=config['split_D'])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 7 on device 7.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 2 required positional arguments: 'z' and 'gy'

It looks like something is wrong with data-parallel mode. I tried with PyTorch 1.0.1 and 1.2.0; the error stays the same.
Could you take a look at it, please?

Using grayscale input images instead of RGB.

Hello @ajbrock! Thank you so much for making your model available for others to use. I'm trying to re-purpose it at the moment for a research project.

I have a two-fold issue: one piece is data-related, the other architecture-related.

  1. I am trying to use a dataset of .png grayscale images produced by an analogue-to-digital converter. The image dimensions are 512x512 and there is only 1 class. I have made the following modifications in order to get the dataset loaded: (larcv is the dataset name)

In utils.py

# Convenience dicts
dset_dict = {'larcv_png': dset.ImageFolder, 'larcv_hdf5': dset.ILSVRC_HDF5,
             'I32': dset.ImageFolder, 'I64': dset.ImageFolder,
             'I128': dset.ImageFolder, 'I256': dset.ImageFolder,
             'I32_hdf5': dset.ILSVRC_HDF5, 'I64_hdf5': dset.ILSVRC_HDF5,
             'I128_hdf5': dset.ILSVRC_HDF5, 'I256_hdf5': dset.ILSVRC_HDF5,
             'C10': dset.CIFAR10, 'C100': dset.CIFAR100}
imsize_dict = {'larcv_png': 512, 'larcv_hdf5': 512,
               'I32': 32, 'I32_hdf5': 32,
               'I64': 64, 'I64_hdf5': 64,
               'I128': 128,
               'I128_hdf5': 128,
               'I256': 256, 'I256_hdf5': 256,
               'C10': 32, 'C100': 32}
root_dict = {'larcv_png': 'larcv_png', 'larcv_hdf5': 'ILSVRC512.hdf5',
             'I32': 'ImageNet', 'I32_hdf5': 'ILSVRC32.hdf5',
             'I64': 'ImageNet', 'I64_hdf5': 'ILSVRC64.hdf5',
             'I128': 'ImageNet', 'I128_hdf5': 'ILSVRC128.hdf5',
             'I256': 'ImageNet', 'I256_hdf5': 'ILSVRC256.hdf5',
             'C10': 'cifar', 'C100': 'cifar'}
nclass_dict = {'larcv_png': 1, 'larcv_hdf5': 1,
               'I32': 1000, 'I32_hdf5': 1000,
               'I64': 1000, 'I64_hdf5': 1000,
               'I128': 1000, 'I128_hdf5': 1000,
               'I256': 1000, 'I256_hdf5': 1000,
               'C10': 10, 'C100': 100}
# Number of classes to put per sample sheet
classes_per_sheet_dict = {'larcv_png': 1, 'larcv_hdf5': 1,
                          'I32': 50, 'I32_hdf5': 50,
                          'I64': 50, 'I64_hdf5': 50,
                          'I128': 20, 'I128_hdf5': 20,
                          'I256': 20, 'I256_hdf5': 20,
                          'C10': 10, 'C100': 100}

The dataset does serialize and load successfully, but when I check the dimensions of the images inside the ILSVRC_HDF5 class in datasets.py using img.shape, the dimensions show as [3, 512, 512].

This leads to a size-mismatch in the forward function of G_D at the line:
D_input = torch.cat([G_z, x], 0) if x is not None else G_z where G_z.shape = [4, 1, 512, 512] and x.shape = [4, 3, 512, 512]

  2. I've made the following changes to the D_arch dictionary in order to accommodate the 512x512 images:
  arch[512] = {'in_channels' :  [1] + [ch*item for item in [1, 2, 4, 8, 8, 16, 16]],
               'out_channels' : [item * ch for item in [1, 2, 4, 4, 8, 8, 16, 16]],
               'downsample' : [True] * 7 + [False],
               'resolution' : [512, 256, 128, 64, 32, 16, 8, 4],
               'attention' : {2**i: 2**i in [int(item) for item in attention.split('_')]
                              for i in range(2,10)}}

I have also modified the last layer of the Generator to output 1-channel images:

    # output layer: batchnorm-relu-conv.
    # Consider using a non-spectral conv here
    self.output_layer = nn.Sequential(layers.bn(self.arch['out_channels'][-1],
                                                cross_replica=self.cross_replica,
                                                mybn=self.mybn),
                                    self.activation,
                                    self.which_conv(self.arch['out_channels'][-1], 1))

My questions are:

  • How can I get the images to load with only 1 channel?
  • Are the architecture modifications I've made appropriate?

Thank you so much.

Fine-tuning

Hi,
Has anyone fine-tuned using the provided model (source: ImageNet, target: tench (n01440764))? I use the same parameters to train. Before training, the result is as follows:
tench

After one batch (note that I set the loss to 0 by multiplying it by 0.):

tench1

Does anyone have the same problem?

Minimal working example for sampling from pre-trained BigGAN?

Hi ajbrock,
I am so excited that you released the Pytorch version of BigGAN. I am trying to sample some results. Could you provide a minimal working example for sampling from pre-trained BigGAN? @airalcorn2 and I wrote a piece of code for sampling, but the results look bad.
Here is our sample code.

import functools
import numpy as np
import torch
import utils

from PIL import Image

parser = utils.prepare_parser()
parser = utils.add_sample_parser(parser)
config = vars(parser.parse_args())

# update config (see train.py for explanation)
config["resolution"] = utils.imsize_dict[config["dataset"]]
config["n_classes"] = utils.nclass_dict[config["dataset"]]
config["G_activation"] = utils.activation_dict[config["G_nl"]]
config["D_activation"] = utils.activation_dict[config["D_nl"]]
config = utils.update_config_roots(config)
config["skip_init"] = True
config["no_optim"] = True
device = "cuda:7"

# Seed RNG
utils.seed_rng(config["seed"])

# Setup cudnn.benchmark for free speed
torch.backends.cudnn.benchmark = True

# Import the model--this line allows us to dynamically select different files.
model = __import__(config["model"])
experiment_name = utils.name_from_config(config)
G = model.Generator(**config).to(device)
utils.count_parameters(G)

# Load weights
G.load_state_dict(torch.load("/mnt/raid/qi/biggan_weighs/G_optim.pth"), strict=False)

# Update batch size setting used for G
G_batch_size = max(config["G_batch_size"], config["batch_size"])
(z_, y_) = utils.prepare_z_y(
    G_batch_size,
    G.dim_z,
    config["n_classes"],
    device=device,
    fp16=config["G_fp16"],
    z_var=config["z_var"],
)

G.eval()

# Sample function
sample = functools.partial(utils.sample, G=G, z_=z_, y_=y_, config=config)

with torch.no_grad():
    z_.sample_()
    y_.sample_()
    image_tensors = G(z_, G.shared(y_))


for i in range(len(image_tensors)):
    image_array = image_tensors[i].permute(1, 2, 0).detach().cpu().numpy()
    image_array = np.uint8(255 * (1 + image_array) / 2)
    Image.fromarray(image_array).save(f"./test_images/{i}.png")

Here is one of our results.
59

Thanks a lot.

Output images have holes in them

I have been training my own models with 256x256 outputs, using my own dataset, with 1 class only. The training is working as expected, but I keep noticing the outputs from the GAN have holes in them. See example below.

Initially I thought it is a matter of training for longer, or rather that the training data is not very clean. But when I tried the same with a different cleaner dataset (for a different class), I still noticed a hole in all the outputs, even after training for longer.

I am using 4 GPUs, with a batch size of 40 and --num_G_accumulations 4 --num_D_accumulations 4.

I am wondering if anyone ran in the same issue? and what could be the problem?

I included an example below: you can see the hole in the center of the image.

image

Resuming training from given 100k checkpoint collapses

Training on a single V100 GPU, resuming from the checkpoint at 100,000 iterations linked to from the README, consistently collapses straight away.

Screenshot of script to resume training from checkpoint: https://drive.google.com/open?id=1ik16IVE9G8l7dCnpCjQStVslkE9Ltg19

The differences I can see besides using just one GPU and so having a smaller batch size and more accumulations are:

  • not using parallel
  • using 0 workers
  • not using multiepoch sampler
  • not using the EMA for evaluation (to show collapse)

I'm very confused as to why this is not working - none of those changes should make a difference, as far as I can tell - any suggestions greatly appreciated!

GIF showing collapse over about 130 iterations, sampling every 5: https://giphy.com/gifs/ifMnHURSReyjyE3IfC/html5

Logs: https://drive.google.com/drive/folders/11Far9osGQ2KslKY4Gyio8AHotBGXTKI6?usp=sharing

A mismatch problem on fine-tune

Hi, I tried to fine-tune the pre-trained model on my dataset, but the following error was raised, which probably means a mismatch in the number of classes between ImageNet (1000) and my dataset (2). Is there any suggestion for fine-tuning with a different number of classes? Thanks.

RuntimeError: Error(s) in loading state_dict for Generator:
size mismatch for shared.weight: copying a param with shape torch.Size([1000, 128]) from checkpoint, the shape in current model is torch.Size([2, 128]).

How should I conditionally generate samples?

I found that the categorical distribution label is sampled randomly during sampling. How can I condition on a specific label to generate samples?

Thanks for the awesome code!

OSError

Hi, I ran into this error: unable to open file: name = 'data/ILSVRC128.hdf5', error message = 'No such file or directory', flags = 0, o_flags = 0.

Do you have any plan to release the Discriminator weights for 256 and 512 models?

Hi ajbrock,
Do you have any plan to release the discriminator weights for the 256 or 512 models? I tried to use the discriminator from the 128 model and I believe it is a conditional discriminator. Is there any way to convert it to an unconditional discriminator? Please correct me if I have misunderstood anything.
Thanks so much!

Question about truncation trick in Big GAN.

I think the PyTorch implementation differs slightly from the TensorFlow Hub one w.r.t. the truncation trick. Here is the situation: when I set truncation to 1, the two models (the tf model vs. the pth model converted from it) produce the same results. However, this no longer holds when truncation is less than 1.

According to the code at https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/biggan_generation_with_tf_hub.ipynb, the truncation factor not only acts on the input latent vector, but is also passed into the module via a placeholder. In contrast, in this PyTorch version the truncation factor only affects the input noise z in sample.py.

I have also found that the following two settings give the same outputs:
(1) With Tensorflow Hub,

z = 0.02 * truncnorm.rvs(-2, 2, size=(1, dim_z), random_state=np.random.RandomState(0))
truncation = 1.0

(2) With PyTorch,

z = 0.02 * truncnorm.rvs(-2, 2, size=(1, dim_z), random_state=np.random.RandomState(0))

But the following two settings give different syntheses:
(3) With Tensorflow Hub,

z = 0.02 * truncnorm.rvs(-2, 2, size=(1, dim_z), random_state=np.random.RandomState(0))
truncation = 0.02

(4) With PyTorch,

z = 0.02 * truncnorm.rvs(-2, 2, size=(1, dim_z), random_state=np.random.RandomState(0))

Would you please help me figure this out? I want to know how the official BigGAN applies the truncation trick beyond truncating the latent code. BTW, I experimented on the 512x512 BigGAN model (https://tfhub.dev/deepmind/biggan-512/2).

Running out of memory

Hi,

I'm running ./scripts/launch_BigGAN_bs256x8.sh and getting the following error:

RuntimeError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 7.44 GiB total capacity; 5.47 GiB already allocated; 487.56 MiB free; 1.07 GiB cached)
@ellismarte

Is there something I can configure so that that much memory doesn't get used and I don't run out of memory?
