
Comments (14)

ajbrock commented on May 28, 2024

Hi Qi,

There can be a substantial amount of variance in the time to convergence for a model (I only had time to train one with this codebase as I don't have unfettered access to that kind of compute) so it's not surprising that yours might need longer to converge/collapse--it appears to still be training.

I'd say let it run and see what IS/FID it gets to when it explodes and dies. This would also be a helpful datapoint for this repo to start getting a better sense of the variance in #itrs required =); if you wouldn't mind posting the full logfile (e.g. in a pastebin) I can take a look at it and check for any anomalies.

qilimk commented on May 28, 2024

Hi ajbrock,
Thanks for your reply. I will continue the training to see what happens.
Of course, I would like to share my whole training log files on Google Drive (log files). It would be awesome if you could help check them.

BTW, there was a shutdown during my training process because our GPUs lost power. Could that be a reason I didn't get the same performance as yours at similar iterations? Also, is it possible for you to release a script for training a 256x256 model (I plan to do that), or are there any special notes for training a 256x256 model?

Many thanks to you!

ajbrock commented on May 28, 2024

Hmm, looking at your logs against mine (image), this does look like it's well outside the variance I would expect. My models were also trained on 8xV100, so I don't think hardware is the issue here. When I trained my models, I had to pause training around 18-20 times (I'm only allowed to run jobs for 24 hours at a time, so across 2+ weeks you can imagine how many times I had to relaunch!), so I'm pretty confident in the checkpointing, but there could be something that doesn't stack up on different systems.

Keeping on with debugging:

  1. What version, exactly, of PyTorch, CUDA, and cuDNN are you using?

  2. Can you post a screenshot of both the script you used to launch the training process, and the script you used to resume?

The main possibility that currently comes to mind is that something might be up with your copy of ImageNet, but that would be somewhat odd (the main thing would be to check that you have the full 1.2M image dataset and confirm that all class folders have ~600-1300 images in them).
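
A quick way to sanity-check that (the dataset root below is just a placeholder for wherever your copy lives) would be something like:

```python
# Quick sanity check: count images per class folder in an ImageNet-style tree.
# The root path is a placeholder; point it at your own copy.
import os

root = "data/ImageNet/train"
counts = {}
for cls in sorted(os.listdir(root)):
    cls_dir = os.path.join(root, cls)
    if os.path.isdir(cls_dir):
        counts[cls] = len(os.listdir(cls_dir))

print("classes:", len(counts))
print("total images:", sum(counts.values()))
print("min/max per class:", min(counts.values()), max(counts.values()))
```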

As to 256x256, it should be straightforward to just prepare the 256x256 dataset (or not use the hdf5) and run with I256 or I256_hdf5 as the dataset, though be aware that on only 8xV100 you may need to turn down the batch size and turn up the # of accumulations to keep memory costs palatable.
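
Roughly, the goal is to keep the effective batch (per-step batch x accumulations) where you want it; a toy illustration with example numbers, not a recommendation for your exact setup:

```python
# Illustrative only: gradient accumulation keeps the effective batch size up
# when the per-step batch has to shrink for memory reasons.
per_step_batch = 256          # e.g. what fits at 128x128 on 8xV100
accumulations = 8
print("effective batch:", per_step_batch * accumulations)        # 2048

# At 256x256 each image costs roughly 4x the activation memory, so e.g.:
per_step_batch_256 = 64       # smaller per-step batch...
accumulations_256 = 32        # ...more accumulations, same effective batch
print("effective batch:", per_step_batch_256 * accumulations_256)  # 2048
```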

qilimk commented on May 28, 2024

Hi Andy,
Thanks for helping me check the log files. My system settings are as follows:

pytorch: 1.0.1
cuda: 9.0
cudnn: 7.1.2
python: 3.6.8

I used the same scripts as yours to run the code, except for removing the --load_in_mem arg. Here are the screenshots of the scripts I used (launch the program, resume the program). I also checked the training set (ILSVRC2012) that I used; every class has at least 730 images.

BTW, I have stopped that training process since there was no sign it was getting better. I am redoing it and hope it works well this time. I will take advantage of your released log files and check mine every day to make sure the training process works as expected.
Here is the plot of my second training run and the comparison with your logs and my previous log. It looks a little better than last time but is still not as good as yours. I hope it improves over the next few days.
[plot: IS (mean) of the new training run vs. reference, 2019-05-21]

Best
Qi

ajbrock commented on May 28, 2024

That plot looks to me like EMA isn't kicking in, or that the batchnorm stats being used alongside the EMA are stale. Can you pull the FID and IS stats for the model you've trained out to 134k iterations using the following settings?

  1. The model with EMA weights, in training mode (and the biggest per-GPU batch size you can manage)

  2. The model with non-EMA weights, in training mode

  3. The model with non-EMA weights, in test mode

You should be able to do this easily by modifying the sampling script.
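
For example, a rough sketch of sweeping those settings, assuming the sampling script exposes --use_ema and --G_eval_mode switches (the other arguments below are placeholders):

```python
# Rough sketch: run the sampling/eval script under the different EMA / eval-mode
# combinations. Flag names follow the --use_ema / --G_eval_mode conventions
# discussed in this thread; the remaining args are placeholders.
import itertools
import subprocess

base_cmd = ["python", "sample.py", "--dataset", "I128_hdf5", "--batch_size", "64"]

for use_ema, eval_mode in itertools.product([True, False], [True, False]):
    cmd = list(base_cmd)
    if use_ema:
        cmd.append("--use_ema")
    if eval_mode:
        cmd.append("--G_eval_mode")
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```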

qilimk commented on May 28, 2024

Thanks Andy!

I've run your sampling script on the model that I had trained up to 147K iterations to get IS and FID scores.

Here is the full log.
In short, I called your sampling script with G_eval_mode=False/True and use_ema=False/True to produce 4 different FID/IS sets as summarized in the table below.
These numbers are at noise_variance = 1.0.

       Training mode with EMA   Training mode without EMA   Test mode without EMA   Test mode with EMA
IS     62.866 +/- 1.269         23.864 +/- 0.505            24.270 +/- 0.544        57.521 +/- 1.852
FID    18.8229                  41.8047                     42.1819                 20.8808

I believe I've been plotting the FID/IS for the EMA weights in test mode.

FYI. Here is a plot of my 2nd training attempt (simply re-running the same script on the same hardware 8 x V-100).
The 2nd attempt (red) appears better than the 1st attempt (green) but still not as good as yours (blue).
[plot: IS (mean) comparison of 2nd attempt (red), 1st attempt (green), and reference (blue), 2019-05-27]

Could you help me check it? Thank you very much for your insights!

Best
Qi

feifei-Liu commented on May 28, 2024

Hi Andy!
I also ran the experiment and didn't change any parameters, but my result was not as good as yours: the IS value was only 51 when I had trained for 140,000 steps.

ajbrock commented on May 28, 2024

Hi feifei-Liu,

Thanks, this is helpful! Can you also post a link to your training logs, and the script you launched with?
I'm working on tracking down any possible differences or bugs that might be leading to this.

@qilimk This still looks as though the EMA isn't kicking in somehow, but I'm digging in to see if I can spot any bugs or differences that might be causing this. One additional request: can you load the very first image from your I128_hdf5 as a .npy file and upload it as well? I want to make sure the preprocessing for my version of the dataset is the same.
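
Something along these lines should do it; the file path and the 'imgs' dataset key below are assumptions about how the HDF5 was built, so adjust them to your setup:

```python
# Minimal sketch: dump the first image of the I128 HDF5 file to a .npy file.
# The file path and the 'imgs' dataset key are assumptions; adjust to match
# how your HDF5 was actually built.
import h5py
import numpy as np

with h5py.File("data/ILSVRC128.hdf5", "r") as f:
    print(list(f.keys()))             # see which datasets the file contains
    first_img = f["imgs"][0]
    print(first_img.shape, first_img.dtype)

np.save("first_I128_image.npy", first_img)
```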

feifei-Liu commented on May 28, 2024

Sorry, I was running on a cluster for the first time and the results on the machine were not saved after termination. I only saved the results of the first 50,000 steps, and my FID values are of no reference value because I had previously run experiments with 100 categories and did not modify the i128_inception_moments.npz file. I didn't change the script you provided; my experiment ran on 8 M40 GPUs, but I don't think the GPUs should be a problem. Based on my previous observations, like qilimk, my results began to diverge from yours after 20,000 steps. I also tried training with the CIFAR script you provided, and my IS value only reached 8.28; I don't know whether this result is normal or not, and the generated images are not very good either. I agree with you that there may be something wrong with the EMA. In addition, I wonder whether we could run more experiments on the CIFAR dataset to see if it is a network problem. I also noticed that the training images do not seem to use augmentation.
[attached image]

qilimk commented on May 28, 2024

Hi @ajbrock,
I loaded the first image from my I128_hdf5 and saved it as a .npy file.

I also found a note in the PyTorch documentation that upsampling may induce nondeterministic behavior when using CUDA. Do you think that could be why I couldn't reproduce the results?

[screenshot: PyTorch documentation note on nondeterministic CUDA behavior, 2019-06-13]
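
For reference, the usual reproducibility switches in PyTorch look roughly like the sketch below; note that even with these set, the documentation warns that some CUDA ops (such as the upsampling backward pass) can remain nondeterministic:

```python
# Sketch of the usual reproducibility switches in PyTorch; even with these,
# some CUDA ops (e.g. the backward pass of upsampling) are documented as
# nondeterministic, so runs may still differ slightly.
import random
import numpy as np
import torch

seed = 0
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```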

Thanks a lot!

songyh10 commented on May 28, 2024

Hi @feifei-Liu, I also trained the model on CIFAR-10 and got IS=8.32, FID=6.45. What's your FID score on CIFAR-10? It is really hard to compare performance on CIFAR-10 with other papers, as other papers use an unsupervised training setting, while here we use a cGAN setting for training.

songyh10 commented on May 28, 2024

Hi @qilimk, similar to your experiments, I also found a gap in performance between training mode and testing mode (G_eval_mode=true/false). Do you know why there is such a gap, and which mode we should use to report the score officially?

damlasena commented on May 28, 2024

Hi @ajbrock,

I have been working on running the BigGAN code on another image dataset that I created. The dataset consists of two classes of 256x256 images. I tried to retune the training parameters because I am using a single Titan XP (~12 GB RAM). First, I read your suggestions about previous memory issues. Then, I applied the following modifications to the training arguments:

  1. remove the 'load_in_mem' argument
  2. decrease the batch size
  3. increase the number of accumulations (num_G_accumulations and num_D_accumulations) to overcome the small batch size problem
  4. decrease the channel multipliers (G_ch and D_ch) but still try to keep them at high values
  5. add the '--which_best FID' argument
  6. small modifications shown in the attached script

You can compare your parameters with mine here. During the whole training, my IS and FID values were terrible. The results were improving slowly but then got worse, as you can see in the log file. Could you please help me solve the problem?
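
To make the above concrete, here is a rough sketch of the kind of argument overrides I mean; the flag names are the ones used in this thread, and the values are purely illustrative:

```python
# Purely illustrative sketch of single-GPU (~12 GB) overrides relative to a
# stock launch script; flag names follow this thread, values are examples only.
small_gpu_overrides = {
    "--batch_size": 16,             # decreased batch size
    "--num_G_accumulations": 8,     # more accumulations to compensate
    "--num_D_accumulations": 8,
    "--G_ch": 48,                   # reduced channel multipliers, still fairly wide
    "--D_ch": 48,
    "--which_best": "FID",          # checkpoint selection by FID
    # '--load_in_mem' removed so the dataset is not held in RAM
}

effective_g_batch = small_gpu_overrides["--batch_size"] * small_gpu_overrides["--num_G_accumulations"]
print("effective G batch size:", effective_g_batch)   # 16 * 8 = 128
```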

AI-Science-Tech commented on May 28, 2024

Hi guys, how did you handle the above issues? Were you able to reproduce the released results?
