Hello, I am trying to resume training using the pretrained weights on the CelebA-6


Error when resuming training on Diffusion-StyleGAN2 for CelebA-64 dataset about diffusion-gan HOT 12 CLOSED

zhendong-wang commented on July 20, 2024

Error when resuming training on Diffusion-StyleGAN2 for CelebA-64 dataset

from diffusion-gan.

Comments (12)

aaaapineapple commented on July 20, 2024 1

@starsong98 May I ask how you resolved it?


menyouhua1 commented on July 20, 2024 1

@aaaapineapple Hello, have you solved the problem?


Zhendong-Wang commented on July 20, 2024

Hi starsong98, I just tried this on my end. The checkpoint works well and can be resumed for further training. Have you made sure that your CelebA dataset is at 64x64 resolution?

Here is the command that I used:
python train.py --outdir=training-runs --data=/home/zdwang/datasets/celeba64.zip --gpus=2 --cfg stl10 --kimg 100000 --ts_dist uniform --resume pretrained/diffusion-stylegan2-celeba64.pkl


starsong98 commented on July 20, 2024

I checked just now, it is indeed 64 x 64 resolution.
I think the model architecture just does not match, but I could be wrong.
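
For anyone who wants to double-check the resolution of a zipped dataset, here is a minimal sketch that uses only the Python standard library. It assumes the archive stores its images as PNG files, which may not hold for every dataset; the helper names `png_size` and `resolution_counts` are made up for this illustration and are not part of the diffusion-gan repository.

```python
import struct
import zipfile
from collections import Counter

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(header: bytes):
    """Return (width, height) parsed from the first 24 bytes of a PNG file."""
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # The IHDR chunk stores width and height as big-endian 32-bit integers.
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def resolution_counts(zip_path):
    """Count how many PNG images in the archive have each (width, height)."""
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Assumption: images are stored as .png; other files are skipped.
            if name.lower().endswith(".png"):
                with archive.open(name) as f:
                    counts[png_size(f.read(24))] += 1
    return counts
```

Calling `resolution_counts` on the path to the dataset zip should return a single entry for (64, 64) if every image really is 64x64; any other key points to images with a different size.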


starsong98 commented on July 20, 2024

Thank you! Your command did the trick. Looks like it was some of the other arguments instead that were interfering.

But now the "Evaluating metrics..." part is taking really long for some reason. Could it be because I do not have Tensorboard installed in that environment?


Zhendong-Wang commented on July 20, 2024

Hmm, TensorBoard should not affect the evaluation time. The PyTorch version can sometimes matter. The first evaluation usually takes longer because the evaluation stage has to be set up, but later ones will be faster.


starsong98 commented on July 20, 2024

I seem to have gotten a sort of time out error message:

Traceback (most recent call last):
  File "train.py", line 533, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "train.py", line 528, in main
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "Documents/ai504/project1/Diffusion-GAN-main/diffusion-stylegan2/train.py", line 359, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "Documents/ai504/project1/Diffusion-GAN-main/diffusion-stylegan2/training/training_loop.py", line 437, in training_loop
    value = phase.start_event.elapsed_time(phase.end_event)
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/torch/cuda/streams.py", line 177, in elapsed_time
    return super(Event, self).elapsed_time(end_event)
RuntimeError: Both events must be recorded before calculating elapsed time.

Does this mean that it took too long to compute the metrics?


Zhendong-Wang commented on July 20, 2024

No, this is because your --kimg is smaller than the number of kimg that has already been trained. You should be able to see how many images have been trained in the first line of output after training starts. kimg is the total number of images to be trained, counted in thousands.

The current resume logic keeps the number of images already trained. You can modify the code so that it forgets this count; --kimg will then be the number of images to train further.
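
To make the arithmetic concrete, here is a small sketch of the two interpretations of --kimg described above. It is only an illustration: the function `remaining_kimg` and its `forget_progress` flag are hypothetical and do not exist in the repository, and the numbers in the comments are made-up values, not the real counter stored in the CelebA-64 checkpoint.

```python
def remaining_kimg(total_kimg: int, trained_kimg: int, forget_progress: bool = False) -> int:
    """Return how many more thousands of images would be trained after resuming.

    total_kimg      -- the value passed as --kimg
    trained_kimg    -- progress restored from the checkpoint (illustrative)
    forget_progress -- hypothetical switch: treat --kimg as additional training
    """
    if forget_progress:
        # Progress is reset, so the full --kimg budget is trained again.
        return total_kimg
    # Progress is kept, so only the difference is left; never negative.
    return max(total_kimg - trained_kimg, 0)


# Illustrative values only:
# remaining_kimg(25000, 50000)                        leaves nothing to train
# remaining_kimg(100000, 50000)                       leaves 50000 kimg
# remaining_kimg(25000, 50000, forget_progress=True)  trains 25000 more kimg
```

With the default behaviour, the simplest fix is to pass a --kimg larger than the count printed in the first line of output after resuming, as in the command earlier in this thread.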


starsong98 commented on July 20, 2024

I see. Thank you.


aaaapineapple commented on July 20, 2024

No, this is because your --kimg is smaller than the number of kimg that has already been trained. You should be able to see how many images have been trained in the first line of output after training starts. kimg is the total number of images to be trained, counted in thousands.

The current resume logic keeps the number of images already trained. You can modify the code so that it forgets this count; --kimg will then be the number of images to train further.

Hello!
Should I change the value of kimg, or do I need to do some other processing?


Jiangshouyu1 commented on July 20, 2024

@menyouhua1 Hello, I ran into the same problem. Have you solved it?


Jiangshouyu1 commented on July 20, 2024

@aaaapineapple Hello, I ran into the same problem. Have you solved it?

