Comments (14)
Hi Qi,
There can be a substantial amount of variance in the time to convergence for a model (I only had time to train one with this codebase as I don't have unfettered access to that kind of compute) so it's not surprising that yours might need longer to converge/collapse--it appears to still be training.
I'd say let it run and see what IS/FID it gets to when it explodes and dies. This would also be a helpful datapoint for this repo to start getting a better sense of the variance in #itrs required =); if you wouldn't mind posting the full logfile (e.g. in a pastebin), I can take a look and check for any anomalies.
from biggan-pytorch.
Hi ajbrock,
I appreciate your reply. I will continue the training and see what happens.
Of course, I'd be glad to share my full training log files on Google Drive (log files). If you could help check them, that would be awesome.
BTW, there was a shutdown during my training process because our GPUs lost power. Could that be a reason I didn't get the same performance as yours at similar iterations? Also, could you release a script for training a 256x256 model (I plan to do that), or are there any special considerations for training at 256x256?
Many thanks to you!
Hmm, looking at your logs against (), this does look like it's well outside the variance I would expect. My models were also trained on 8x V100s, so I don't think hardware is the issue here. When I trained my models, I had to pause training around 18-20 times (I'm only allowed to run jobs for 24 hours at a time, so across 2+ weeks you can imagine how many times I had to relaunch!), so I'm pretty confident in the checkpointing, but there could be something that doesn't stack up on different systems.
Keeping on with debugging:
- What version, exactly, of PyTorch, CUDA, and cuDNN are you using?
- Can you post a screenshot of both the script you used to launch the training process, and the script you used to resume?
The main possibility that currently comes to mind is that something might be up with your copy of ImageNet, but that would be somewhat odd (the main thing would be to check that you have the full 1.2M image dataset and confirm that all class folders have ~600-1300 images in them).
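A quick way to run that per-class check is to count the files under each class folder. A minimal sketch (the tiny fake directory tree here is only for illustration; in real use you would point `class_image_counts`, a hypothetical helper name, at your actual ILSVRC2012 train directory):

```python
import os
import tempfile

def class_image_counts(root):
    """Count files in each class subfolder of an ImageNet-style directory."""
    return {d: len(os.listdir(os.path.join(root, d)))
            for d in sorted(os.listdir(root))
            if os.path.isdir(os.path.join(root, d))}

# Build a tiny fake tree just to demonstrate the check.
root = tempfile.mkdtemp()
for cls, n in [('n01440764', 3), ('n01443537', 5)]:
    os.makedirs(os.path.join(root, cls))
    for i in range(n):
        open(os.path.join(root, cls, f'{i}.JPEG'), 'w').close()

counts = class_image_counts(root)
print(counts)                 # {'n01440764': 3, 'n01443537': 5}
print(min(counts.values()))   # flag any class far below the expected ~600-1300
```

On a real ImageNet copy you'd expect 1000 classes and a minimum count in the several hundreds; anything much lower points at a truncated download.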
As to 256x256, it should be straightforward to just prepare the 256x256 dataset (or not use the hdf5) and run with `I256` or `I256_hdf5` as the dataset, though be aware that on only 8x V100s you may need to turn down the batch size and turn up the number of accumulations to keep memory costs palatable.
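The trade-off above is just arithmetic: the per-step batch size times the number of gradient accumulations gives the effective batch size, so at 256x256 you can shrink one and grow the other. A small sketch (the 256x8 pairing matches the repo's example 128x128 launch script, if memory serves; the other pairs are illustrative assumptions, not tested configs):

```python
# Effective batch size = per-step batch size x number of gradient accumulations.
target_effective = 2048  # assumption: the 128x128 example's 256 x 8

configs = [
    (256, 8),   # example 128x128 launch (believed default in the repo's scripts)
    (128, 16),  # halve per-step batch, double accumulations for 256x256
    (64, 32),   # tighter memory budgets still
]
for batch_size, num_accumulations in configs:
    effective = batch_size * num_accumulations
    assert effective == target_effective  # same effective batch in every case
    print(batch_size, num_accumulations, effective)
```

Note that accumulation trades memory for wall-clock time; gradient-wise the effective batch is preserved, though per-micro-batch BatchNorm statistics still see only the smaller batch.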
Hi Andy,
Thanks for helping me check the log files. My system setup is as follows:
PyTorch: 1.0.1
CUDA: 9.0
cuDNN: 7.1.2
Python: 3.6.8
I used the same scripts to run the code as yours, except that I removed the `--load_in_mem` arg. Here are the screenshots of the scripts I used (launch the program, resume the program). I also checked the training set (ILSVRC2012) I used; every class has at least 730 images.
BTW, I have stopped that training process since there was no sign of it getting better. I'm redoing it and hope it works well this time. I'll take advantage of your released log files and check mine against them every day to make sure the training process works as expected.
Here is a plot of my second training run and a comparison with your logs and my previous log. It looks a little better than last time but still not as good as yours. I hope it improves over the next few days.
Best
Qi
That plot looks to me like EMA isn't kicking in, or that the batchnorm stats being used alongside the EMA are stale. Can you pull the FID and IS stats for the model you've trained out to 134k iterations using the following settings?
- The model with EMA weights, in training mode (and the biggest per-GPU batch size you can manage)
- The model with non-EMA weights, in training mode
- The model with non-EMA weights, in test mode
You should be able to do this easily by modifying the sampling script.
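For context on why the EMA comparison matters: the EMA weights are just a running average of G's weights, updated each step as `ema <- decay * ema + (1 - decay) * w`, and if that update never runs (or runs against a stale copy) the EMA samples will look nothing like training-mode samples. A toy numpy sketch of the update, illustrative only and not the repo's code (the repo applies this per-parameter, typically starting the EMA from a copy of G's weights):

```python
import numpy as np

def ema_update(ema_w, w, decay=0.9999):
    """One EMA step: ema <- decay * ema + (1 - decay) * w."""
    return decay * ema_w + (1.0 - decay) * w

# Toy illustration: raw "weights" are noisy samples around 1.0; the EMA,
# initialized from the first raw value, tracks the mean far more tightly
# than any single raw sample does.
rng = np.random.default_rng(0)
w = 1.0 + 0.5 * rng.standard_normal()
w_ema = w
for step in range(50000):
    w = 1.0 + 0.5 * rng.standard_normal()
    w_ema = ema_update(w_ema, w)

print(abs(w_ema - 1.0))  # small; a single raw sample is typically ~0.4 away
```

If the EMA copy in a broken run simply mirrors the raw weights (decay effectively 0) or never updates (decay effectively 1 from initialization), the EMA-vs-raw IS/FID comparison requested above would show it immediately.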
Thanks Andy!
I've run your sampling script on the model that I had trained up to 147K iterations to get IS and FID scores.
Here is the full log.
In short, I called your sampling script with `G_eval_mode=False/True` and `use_ema=False/True` to produce four different FID/IS sets, summarized in the table below. These numbers are at `noise_variance = 1.0`.
| | Training mode with EMA | Training mode without EMA | Test mode without EMA | Test mode with EMA |
|---|---|---|---|---|
| IS | 62.866 +/- 1.269 | 23.864 +/- 0.505 | 24.270 +/- 0.544 | 57.521 +/- 1.852 |
| FID | 18.8229 | 41.8047 | 42.1819 | 20.8808 |
I believe I've been plotting the FID/IS for the EMA weights in test mode.
FYI, here is a plot of my 2nd training attempt (simply re-running the same script on the same hardware, 8x V100).
The 2nd attempt (red) appears better than the 1st attempt (green) but still not as good as yours (blue).
Could you help me check it? Thank you very much for your insights!
Best
Qi
Hi Andy!
I also ran the experiment without changing any parameters, but my result was not as good as yours: the IS value was only 51 when I had trained for 140,000 steps.
Hi feifei-Liu,
Thanks, this is helpful! Can you also post a link to your training logs, and the script you launched with?
I'm working on tracking down any possible differences or bugs that might be leading to this.
@qilimk This still looks as though the EMA isn't kicking in somehow, but I'm digging in to see if I can spot any bugs or differences that might be causing this. One additional request: can you load the very first image from your `I128_hdf5` as a `.npy` file and upload it as well? I want to make sure the preprocessing for my version of the dataset is the same.
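For anyone wanting to run the same check, a hedged sketch of the extraction. It assumes the HDF5 file stores images under an `imgs` key, which I believe matches what the repo's `make_hdf5.py` writes, but verify against your own file with `list(f.keys())` first. The demo builds a tiny stand-in file rather than touching the real `ILSVRC128.hdf5`:

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in for I128_hdf5/ILSVRC128.hdf5, with the assumed 'imgs' key and
# CHW uint8 layout; substitute the real path for the actual check.
tmp = tempfile.mkdtemp()
h5_path = os.path.join(tmp, 'demo.hdf5')
with h5py.File(h5_path, 'w') as f:
    data = np.arange(2 * 3 * 4 * 4, dtype=np.uint8).reshape(2, 3, 4, 4)
    f.create_dataset('imgs', data=data)

# Load the very first image and save it as .npy for sharing.
with h5py.File(h5_path, 'r') as f:
    first = f['imgs'][0]

npy_path = os.path.join(tmp, 'first_image.npy')
np.save(npy_path, first)
roundtrip = np.load(npy_path)
print(roundtrip.shape, roundtrip.dtype)   # (3, 4, 4) uint8
```

Comparing the resulting array (dtype, value range, channel order) against the reference copy is a direct way to rule out preprocessing mismatches.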
Sorry, I was running on a cluster for the first time and the results on the machine were not saved after termination. I only saved the results of the first 50,000 steps, and my FID values are of no reference value, because I had previously run experiments on 100 categories and did not modify the i128_inception_moments.npz file. I didn't change the script you provided; my experiment ran on 8x M40 GPUs, but I don't think the GPUs are the problem.

Based on my previous observations, like qilimk's, my results began to diverge from yours after 20,000 steps. I also tried training with the CIFAR script you provided, and my IS value only reached 8.28; I don't know whether that result is normal, and the generated pictures are not very good either. I agree with you that there may be something wrong with the EMA. In addition, I wonder if we could run more experiments on the CIFAR dataset to see whether it is a network problem. I also noticed that the training set images are not being augmented.
Hi @ajbrock ,
I loaded the first image from my `I128_hdf5` and saved it as a `.npy` file.
I also found a note in the PyTorch documentation that `upsample` may induce nondeterministic behavior when using CUDA. Do you think that's why I couldn't reproduce the results?
Thanks a lot!
Hi @feifei-Liu, I also trained the model on CIFAR-10 and got IS=8.32, FID=6.45. What's your FID score on CIFAR-10? It is really hard to compare against other papers on CIFAR-10 performance, since other papers use an unsupervised training setting while here we use a cGAN setting for training.
Hi @qilimk , similar to your experiments, I also found a gap in the performance between training mode and testing mode (G_eval_mode=true/false). Do you know why there is such a gap and what mode should we use to report the score officially?
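One likely source of that gap is BatchNorm: in training mode the generator normalizes with the current batch's statistics, while in test mode it normalizes with running averages accumulated during training, so if those running estimates are stale or mismatched, eval-mode outputs shift. A toy numpy illustration of the two modes (not the repo's code; the "stale" running statistics here are deliberately wrong to exaggerate the effect):

```python
import numpy as np

def batchnorm(x, mean, var, eps=1e-5):
    """Normalize x with the given per-feature mean and variance."""
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = 3.0 + 2.0 * rng.standard_normal((64, 8))   # a batch of activations

# Training mode: normalize with the batch's own statistics.
train_out = batchnorm(x, x.mean(0), x.var(0))

# Eval mode: normalize with running statistics; here they are stale
# (as if accumulated much earlier in training), so outputs shift.
stale_mean, stale_var = np.zeros(8), np.ones(8)
eval_out = batchnorm(x, stale_mean, stale_var)

print(train_out.mean(), train_out.std())   # ~0, ~1
print(eval_out.mean(), eval_out.std())     # shifted: ~3, ~2
```

The closer the running estimates are to the true activation statistics, the smaller this gap; this is consistent with the table above, where training mode (fresh batch stats) scores better than test mode for the same weights.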
Hi @ajbrock,
I have been working on running the BigGAN code on another image dataset, which I created. The dataset consists of two classes of 256x256 images. I had to retune the training parameters because I am using a single Titan XP (~12 GB RAM). First, I read your suggestions about previous memory issues. Then, I applied the following modifications to the training arguments:
- remove the 'load_in_mem' argument
- decrease the batch size
- increase the number of accumulations (num_G_accumulations and num_D_accumulations) to compensate for the small batch size
- decrease the channel multipliers (G_ch and D_ch) while still keeping them relatively high
- add the '--which_best FID' argument
- other small modifications shown in the attached script

You can compare your parameters with mine here. Throughout training, my IS and FID values were terrible: the results improved slowly and then got worse, as you can see in the log file. Could you please help me figure out what's going wrong?
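As a sanity check on the batch-size/accumulation trade described above: for a loss that averages over samples, averaging the gradients of k micro-batches reproduces the full-batch gradient exactly, so a smaller per-step batch with more accumulations is in expectation a memory trade, not a quality one (BatchNorm statistics aside). A hypothetical least-squares example, unrelated to the repo's actual loss:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((32, 4))   # toy inputs
y = rng.standard_normal(32)        # toy targets
w = rng.standard_normal(4)         # toy weights

def grad(Xb, yb, w):
    """Gradient of 0.5 * mean((Xb @ w - yb)^2) w.r.t. w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient in one shot.
full = grad(X, y, w)

# Accumulate over 4 micro-batches of 8, then average.
acc = np.zeros_like(w)
for Xb, yb in zip(np.split(X, 4), np.split(y, 4)):
    acc += grad(Xb, yb, w)
acc /= 4

print(np.allclose(full, acc))   # True
```

In other words, the parameter choices above (smaller batch, more accumulations) shouldn't by themselves explain poor IS/FID; issues like the two-class setup or the channel reduction are more likely suspects.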
Hi guys, how did you handle the issues above? Were you able to reproduce the released results?