Comments (12)

aaaapineapple commented on July 20, 2024

@starsong98 May I ask how you resolved it?

from diffusion-gan.

menyouhua1 commented on July 20, 2024

@aaaapineapple Hello, have you solved the problem?

Zhendong-Wang commented on July 20, 2024

Hi starsong98, I just tried this on my end. The checkpoint works well and can be resumed for further training. Did you make sure that your CelebA dataset is at 64x64 resolution?

Here is the command that I used:
python train.py --outdir=training-runs --data=/home/zdwang/datasets/celeba64.zip --gpus=2 --cfg stl10 --kimg 100000 --ts_dist uniform --resume pretrained/diffusion-stylegan2-celeba64.pkl
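
A quick way to double-check the resolution, as a minimal sketch rather than anything from the repository: it assumes the dataset zip stores PNG files (an assumption; adjust if yours holds another format) and reads the width and height from each PNG header using only the standard library. The helper names `png_size` and `zip_png_sizes` are made up for illustration.

```python
import struct
import zipfile
from typing import Set, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(header: bytes) -> Tuple[int, int]:
    """Read (width, height) from the first 24 bytes of a PNG file."""
    # Layout: 8-byte signature, 4-byte chunk length, b"IHDR",
    # then width and height as big-endian 32-bit integers.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def zip_png_sizes(zip_path) -> Set[Tuple[int, int]]:
    """Collect the distinct (width, height) pairs of all PNGs in a zip archive."""
    sizes = set()
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".png"):
                with zf.open(name) as f:
                    sizes.add(png_size(f.read(24)))
    return sizes
```

For a correctly prepared dataset, `zip_png_sizes("celeba64.zip")` should return only `{(64, 64)}`; more than one entry means the archive mixes resolutions.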

starsong98 commented on July 20, 2024

I just checked; it is indeed 64x64 resolution.
I think the model architecture just does not match, but I could be wrong.

starsong98 commented on July 20, 2024

Thank you! Your command did the trick. It looks like some of my other arguments were interfering.

But now the "Evaluating metrics..." part is taking a really long time for some reason. Could it be because I do not have TensorBoard installed in that environment?

Zhendong-Wang commented on July 20, 2024

Hmm, TensorBoard should not affect the evaluation time. The PyTorch version can sometimes matter. The first evaluation usually takes longer because it has to set up the evaluation stage, but later ones will be faster.

starsong98 commented on July 20, 2024

I seem to have gotten some kind of timeout error:

Traceback (most recent call last):
  File "train.py", line 533, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "train.py", line 528, in main
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "Documents/ai504/project1/Diffusion-GAN-main/diffusion-stylegan2/train.py", line 359, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "Documents/ai504/project1/Diffusion-GAN-main/diffusion-stylegan2/training/training_loop.py", line 437, in training_loop
    value = phase.start_event.elapsed_time(phase.end_event)
  File "anaconda3/envs/diffusiongan2/lib/python3.7/site-packages/torch/cuda/streams.py", line 177, in elapsed_time
    return super(Event, self).elapsed_time(end_event)
RuntimeError: Both events must be recorded before calculating elapsed time.

Does this mean that it took too long to compute the metrics?

Zhendong-Wang commented on July 20, 2024

No, this is because your --kimg is smaller than the number of kimg the checkpoint has already been trained for. You can see how many images have been trained in the first line of output after training starts. --kimg is the total number of images to train on, in thousands.

The current resume logic keeps the count of images already trained. You can modify it to discard that count; --kimg would then be the number of images trained after resuming.
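
To make the arithmetic concrete, here is a minimal sketch, not the repository's code. It assumes, as described above, that --kimg is the total budget in thousands of images and that the resumed checkpoint carries a count of images already trained. The function names and the example numbers are hypothetical.

```python
def kimg_left(total_kimg: float, resumed_nimg: int) -> float:
    """Return how many thousand images remain in the training budget."""
    remaining_images = total_kimg * 1000 - resumed_nimg
    return max(remaining_images, 0) / 1000


def check_resume_budget(total_kimg: float, resumed_nimg: int) -> None:
    """Fail early if --kimg does not exceed what the checkpoint has already seen."""
    if kimg_left(total_kimg, resumed_nimg) == 0:
        raise ValueError(
            f"--kimg={total_kimg} is not larger than the "
            f"{resumed_nimg / 1000:g} kimg already trained; raise --kimg "
            "or reset the resumed image count."
        )


# Hypothetical checkpoint that has already seen 25 million images.
print(kimg_left(100000, 25_000_000))  # 75000.0 kimg still to train
print(kimg_left(1000, 25_000_000))    # 0.0, nothing left to train
```

With a check like this, a too-small --kimg is reported up front instead of surfacing later as an unrelated-looking error.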

starsong98 commented on July 20, 2024

I see. Thank you.

aaaapineapple commented on July 20, 2024

> No, this is because your --kimg is smaller than the number of kimg the checkpoint has already been trained for. You can see how many images have been trained in the first line of output after training starts. --kimg is the total number of images to train on, in thousands.
>
> The current resume logic keeps the count of images already trained. You can modify it to discard that count; --kimg would then be the number of images trained after resuming.

Hello!
Should I just change the value of --kimg, or do I need to do some other processing?

Jiangshouyu1 commented on July 20, 2024

@menyouhua1 Hello, I have run into the same problem. Have you solved it?

Jiangshouyu1 commented on July 20, 2024

@aaaapineapple Hello, I have run into the same problem. Have you solved it?
