
snap-research / cat

[CVPR 2021] Teachers Do More Than Teach: Compressing Image-to-Image Models (CAT)

Home Page: https://dejqk.github.io/GAN_CAT/

License: Other

Languages: Python 95.45%, Shell 4.55%
Topics: deep-learning, compression, image-to-image, cyclegan, pix2pix, gaugan, pytorch, gan

cat's People

Contributors: alanspike


cat's Issues

Question on input dimensions of GKA

Hello, what is the intended input dimension for the GKA similarity?
It seems that .view(X.size(0), -1) gives us n x hwc, but for the batch-wise/spatial-wise similarity in the paper, nhw x c is used.
Both variants seem reasonable (at least both give 1 for the similarity between a tensor and itself), but their behaviour is very different; see the sketch below.
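For concreteness, here is a minimal sketch of the two reshapes under a plain linear kernel alignment; this illustrates the question and is not necessarily the exact GKA implementation in the repo:

import torch

def linear_ka(X, Y):
    # Linear (centered) kernel alignment between two 2-D feature matrices.
    X = X - X.mean(0, keepdim=True)
    Y = Y - Y.mean(0, keepdim=True)
    num = (Y.t() @ X).norm(p='fro') ** 2
    den = (X.t() @ X).norm(p='fro') * (Y.t() @ Y).norm(p='fro')
    return num / den

n, c, h, w = 4, 8, 16, 16
X = torch.randn(n, c, h, w)
Xb = X.view(n, -1)                         # batch-wise rows: n x (c*h*w)
Xs = X.permute(0, 2, 3, 1).reshape(-1, c)  # spatial-wise rows: (n*h*w) x c
print(linear_ka(Xb, Xb), linear_ka(Xs, Xs))  # both print 1.0 for a tensor with itself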

I don't see `--teacher_netG` or `--student_netG` options in distill_options.

Hello,
I was running the image-to-image translation notebook example, and when I tried to train I got the error below.

--teacher_netG inception_9blocks --student_netG inception_9blocks \
                                   ^
SyntaxError: invalid syntax

and this is the training block:

NUM_EPOCHS = 50 

!python distill.py --dataroot ./database/ukiyoe2photo \
--log_dir logs/cycle_gan/ukiyoe2photo/inception/student/2p6B \
--restore_teacher_G_path ukiyo_teacher_iter68000_net_G_B.pth \
--restore_pretrained_G_path ukiyo_teacher_iter68000_net_G_B.pth \
--restore_D_path ukiyo_teacher_iter68000_net_D_B.pth \
--real_stat_path ukiyo_A.npz \
--dataset_mode unaligned \
--distiller inception \
--gan_mode lsgan \
--nepochs NUM_EPOCHS --nepochs_decay NUM_EPOCHS \ 
--netG inception_9blocks \
--pretrained_ngf 64 --teacher_ngf 64 --student_ngf 20 \
--ndf 64 \
--num_threads 2 \
--eval_batch_size 2 \
--batch_size 10 \
--gpu_ids 0 \
--norm_affine \
--norm_affine_D \
--channels_reduction_factor 6 \
--kernel_sizes 1 3 5 \
--lambda_distill 2.8 \
--lambda_recon 1000 \
--prune_cin_lb 16 \
--target_flops 2.6e9 \
--distill_G_loss_type ka \
--netD multi_scale \
--crop_size 512,256 \
--preprocess resize_and_crop \
--load_size 600 \
--save_epoch_freq 1 \
--save_latest_freq 500 \
--direction BtoA \
--norm_student batch \
--padding_type_student zero \
--norm_affine_student \
--norm_track_running_stats_student \

I changed it to --netG inception_9blocks, but it still shows the same error.
I checked the script files in the repo and all of them still use --teacher_netG.
Could you let me know how to make it work?
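A hedged note for anyone hitting this: the SyntaxError is most likely not about the flag itself. In a notebook, a stray space after a line-continuation backslash (note the trailing "\ " on the --nepochs line above) ends the ! shell command early, so IPython parses the remaining lines as Python and fails on the first one. Also, NUM_EPOCHS is a Python variable and needs {} interpolation inside a ! command. Something like:

NUM_EPOCHS = 50
!python distill.py --dataroot ./database/ukiyoe2photo \
  --nepochs {NUM_EPOCHS} --nepochs_decay {NUM_EPOCHS} \
  --teacher_netG inception_9blocks --student_netG inception_9blocks

(with the remaining flags as in the original cell, and no whitespace after any backslash)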

Errors when trying to resume distilling.

  1. I have the image pairs for training ready in the /local_path dir.
    After successfully training the teacher with the command line:
!python train.py --dataroot /local_path \
  --model pix2pix \
  --log_dir /local_path/logs/teacher \
  --netG inception_9blocks \
  --lambda_recon 10 \
  --nepochs 500 --nepochs_decay 1000 \
  --norm batch \
  --norm_affine \
  --norm_affine_D \
  --norm_track_running_stats \
  --channels_reduction_factor 6 \
  --preprocess none \
  --kernel_sizes 1 3 5 \
  --save_epoch_freq 50 --save_latest_freq 20000 \
  --direction AtoB \
  --real_stat_path /local_path/out_stat.npz
  2. I got a folder full of model checkpoints in /local_path/checkpoints. I was able to resume the training session with something like:
!python train.py --dataroot /local_path \
  --model pix2pix \
  --log_dir /local_path/logs/teacher \
  --netG inception_9blocks \
  --lambda_recon 10 \
  --nepochs 0 --nepochs_decay 750 \
  --norm batch \
  --norm_affine \
  --norm_affine_D \
  --norm_track_running_stats \
  --channels_reduction_factor 6 \
  --preprocess none \
  --kernel_sizes 1 3 5 \
  --save_epoch_freq 50 --save_latest_freq 20000 \
  --direction AtoB \
  --real_stat_path /local_path/out_stat.npz \
  --epoch_base 750 \
  --iter_base 300001 \
  --restore_G_path /local_path/logs/teacher/checkpoints/latest_net_G.pth \
  --restore_D_path /local_path/logs/teacher/checkpoints/latest_net_D.pth

After training, the results in the eval/(it_number)/fake folder are acceptable.

  3. Then, I was able to run the distiller with the command:
 !python distill.py --dataroot /local_path \
  --distiller inception \
  --log_dir /local_path/logs/student \
  --restore_teacher_G_path /local_path/logs/teacher/checkpoints/best_net_G.pth \
  --restore_pretrained_G_path /local_path/logs/teacher/checkpoints/best_net_G.pth \
  --restore_D_path /local_path/logs/teacher/checkpoints/best_net_D.pth \
  --real_stat_path /local_path/out_stat.npz \
  --nepochs 500 --nepochs_decay 750 \
  --save_latest_freq 25000 --save_epoch_freq 25 \
  --teacher_netG inception_9blocks --student_netG inception_9blocks \
  --pretrained_ngf 64 --teacher_ngf 64 --student_ngf 24 \
  --eval_batch_size 2 \
  --gpu_ids 0 \
  --norm batch \
  --norm_affine \
  --norm_affine_D \
  --norm_track_running_stats \
  --channels_reduction_factor 6 \
  --kernel_sizes 1 3 5 \
  --direction AtoB \
  --lambda_distill 2.0 \
  --prune_cin_lb 16 \
  --target_flops 2.6e9 \
  --distill_G_loss_type ka

I had to stop the session before it finished; again, different checkpoint models were saved in the folder /local_path/logs/student/checkpoints, including .pth files for G, D, optim-0, optim-1, A-0, A-1, A-2 and A-3.
Progress looks OK in the local_path/logs/student/eval folder.

  4. I tried to resume distilling with the command line:
!python distill.py --dataroot /local_path \
  --distiller inception \
  --log_dir /local_path/logs/student \
  --restore_teacher_G_path /local_path/logs/teacher/checkpoints/best_net_G.pth \
  --restore_pretrained_G_path /local_path/logs/student/checkpoints/latest_net_G.pth \
  --restore_D_path /local_path/logs/student/checkpoints/latest_net_D.pth \
  --restore_student_G_path /local_path/logs/student/checkpoints/latest_net_G.pth \
  --pretrained_student_G_path /local_path/logs/student/checkpoints/latest_net_G.pth \
  --restore_A_path /local_path/logs/student/checkpoints/latest_net_A \
  --restore_O_path /local_path/logs/student/checkpoints/latest_optim \
  --real_stat_path /local_path/out_stat.npz \
  --nepochs 0 --nepochs_decay 325 \
  --save_latest_freq 25000 --save_epoch_freq 25 \
  --teacher_netG inception_9blocks --student_netG inception_9blocks \
  --pretrained_ngf 64 --teacher_ngf 64 --student_ngf 24 \
  --eval_batch_size 2 \
  --gpu_ids 0 \
  --norm batch \
  --norm_affine \
  --norm_affine_D \
  --norm_track_running_stats \
  --channels_reduction_factor 6 \
  --kernel_sizes 1 3 5 \
  --direction AtoB \
  --lambda_distill 2.0 \
  --prune_cin_lb 16 \
  --target_flops 2.6e9 \
  --distill_G_loss_type ka \
  --epoch_base 925 \
  --iter_base 370000

But now I get this error:

Load network at /local_path/logs/student/checkpoints/latest_net_G.pth
Traceback (most recent call last):
  File "distill.py", line 13, in <module>
    trainer = Trainer('distill')
  File "/content/CAT/trainer.py", line 80, in __init__
    model.setup(opt)
  File "/content/CAT/distillers/base_inception_distiller.py", line 260, in setup
    self.load_networks(verbose)
  File "/content/CAT/distillers/inception_distiller.py", line 203, in load_networks
    super(InceptionDistiller, self).load_networks()
  File "/content/CAT/distillers/base_inception_distiller.py", line 368, in load_networks
    self.opt.restore_student_G_path, verbose)
  File "/content/CAT/utils/util.py", line 139, in load_network
    net.load_state_dict(weights)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for InceptionGenerator:
	Missing key(s) in state_dict: "down_sampling.1.bias", "down_sampling.2.weight", "down_sampling.2.bias", "down_sampling.2.running_mean", "down_sampling.2.running_var", "down_sampling.2.num_batches_tracked", "down_sampling.4.bias", "down_sampling.5.weight", "down_sampling.5.bias", "down_sampling.5.running_mean", "down_sampling.5.running_var", "down_sampling.5.num_batches_tracked", "down_sampling.7.bias"

... lots of other missing layers, then

	size mismatch for down_sampling.1.weight: copying a param with shape torch.Size([16, 3, 7, 7]) from checkpoint, the shape in current model is torch.Size([24, 3, 7, 7]).
	size mismatch for down_sampling.4.weight: copying a param with shape torch.Size([16, 16, 3, 3]) from checkpoint, the shape in current model is torch.Size([48, 24, 3, 3]).
	size mismatch for down_sampling.7.weight: copying a param with shape torch.Size([210, 16, 3, 3]) from checkpoint, the shape in current model is torch.Size([96, 48, 3, 3]).

... lots of other size mismatches.

It seems to me that there is a mismatch between the network that is created internally and the checkpoint being used to fill it with the previously trained weights. I am not sure if this is a bug or if something is wrong in the command line I am using to resume.

Any help will be appreciated.
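For anyone debugging a mismatch like this, a quick way to see which architecture a checkpoint actually encodes is to print its layer shapes and compare them with the freshly built student:

import torch
ckpt = torch.load('/local_path/logs/student/checkpoints/latest_net_G.pth', map_location='cpu')
for k, v in ckpt.items():
    print(k, tuple(v.shape))

The shapes in the error (16 channels saved vs. 24 expected) suggest the saved student had already been pruned below --student_ngf 24, so the resume path would need to rebuild the pruned channel configuration before load_state_dict; that reading is inferred from the traceback, not a confirmed fix.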

distilling freezes with large dimension

Hello,

Since my last CycleGAN model was too blurry, I brought in a new dataset at 1024x1024.
Now I am testing the pipeline and found that distilling freezes with my specified dimensions.

  --preprocess resize_and_crop \
  --load_size 1024 \
  --crop_size 1024,1024 \

It didn't work with the none preprocess option either:

 --preprocess none

This is the last log output I can see; distilling doesn't proceed past this point.

...
features.8.res_ops.0.1.1.weight
features.8.res_ops.1.1.1.weight
features.8.res_ops.2.1.1.weight
features.8.dw_ops.0.0.1.weight
features.8.dw_ops.1.0.1.weight
features.8.dw_ops.2.0.1.weight
scale range: [0.9916747808456421, 1.014320731163025]

I've tested with some other options:

  1. without those options (i.e., with the defaults)
  2. with a size of 500:
  --preprocess resize_and_crop \
  --load_size 500 \
  --crop_size 500,500 \

and distilling worked in both cases.

I wonder if it froze because the size is too large, or if I misunderstood something about the sizing (a rough note on why large sizes can be expensive follows the command below).
Also, I saw the options use 256x256 as the default size, and the other tutorial scripts use the same sizing; I wonder if I still need to use this size for high-resolution face-changing filters. I am training face filters that make people smile.

Just in case, these are the full options I used for distill.py:

!python distill.py --dataroot database/face2smile \
  --dataset_mode unaligned \
  --distiller inception \
  --gan_mode lsgan \
  --log_dir logs/cycle_gan/face2smile/inception/student/2p6B \
  --restore_teacher_G_path logs/cycle_gan/face2smile/inception/teacher/checkpoints/best_A_net_G_A.pth \
  --restore_pretrained_G_path logs/cycle_gan/face2smile/inception/teacher/checkpoints/best_A_net_G_A.pth \
  --restore_D_path logs/cycle_gan/face2smile/inception/teacher/checkpoints/best_A_net_D_A.pth \
  --real_stat_path real_stat/face2smile_B.npz \
  --teacher_netG inception_9blocks --student_netG inception_9blocks \
  --pretrained_ngf 64 --teacher_ngf 64 --student_ngf 20 \
  --ndf 64 \
  --num_threads 80 \
  --eval_batch_size 4 \
  --batch_size 80 \
  --gpu_ids 0,1,2,3 \
  --norm_affine \
  --norm_affine_D \
  --channels_reduction_factor 6 \
  --kernel_sizes 1 3 5 \
  --lambda_distill 1.0 \
  --lambda_recon 5 \
  --prune_cin_lb 16 \
  --target_flops 2.6e9 \
  --distill_G_loss_type ka \
  --save_epoch_freq 1 \
  --save_latest_freq 500 \
  --norm_student batch \
  --padding_type_student zero \
  --norm_affine_student \
  --norm_track_running_stats_student \
  --preprocess resize_and_crop \
  --load_size 1024 \
  --crop_size 500,500 \
  --nepochs 1 --nepochs_decay 0
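A hedged aside on the freeze: if the distillation similarity is computed spatial-wise (rows of nhw x c) and the implementation materializes the nhw x nhw Gram matrix, memory and compute grow with the square of n*h*w, so a 1024 input can be astronomically more expensive than a 500 one. A back-of-envelope, assuming bottleneck features downsampled 4x:

n, h, w = 80, 1024 // 4, 1024 // 4   # batch 80 at 1024px, 4x downsampling
rows = n * h * w                      # ~5.2M rows in the nhw x c matrix
print(rows ** 2 * 4 / 1e12, 'TB')     # an nhw x nhw float32 Gram matrix: ~110 TB

Whether this is the actual cause depends on how the loss is implemented; it is just one thing to rule out (e.g. by lowering --batch_size at 1024).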

restore options and recommended epoch numbers

Hello,

Thanks to your help, I am training a CycleGAN with face images for a Snap lens.
I found some restore options in the tutorial Jupyter notebook,

--restore_teacher_G_path ukiyo_teacher_iter68000_net_G_B.pth \
--restore_pretrained_G_path ukiyo_teacher_iter68000_net_G_B.pth \
--restore_D_path ukiyo_teacher_iter68000_net_D_B.pth \

but when I looked for them in the options folder, I could only find the --restore_O_path option in train_options.py,
so my questions are:

  1. What are the right options for resuming training/distilling?
--epoch_base XXX \ # last epoch + 1?
--iter_base XXX \ # last iteration + 1?
--nepochs XXX \ # total goal epochs?
--nepochs_decay XXX \ #total goal epochs with lr decay?

#and are these options below only for distilling?
--restore_teacher_G_path xxx.pth \
--restore_pretrained_G_path xxx.pth \
--restore_D_path xxx.pth
  2. nepochs/nepochs_decay recommendation:
    When I check the scripts folder, I see --nepochs 500 --nepochs_decay 500.
    I am training with 4 GPUs, but it still takes quite a while.
    I wonder how many epochs are recommended for desirable results.
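For what it's worth, the per-network restore flags that do exist for cycle_gan training (they appear in the options dump and commands elsewhere in these issues) are --restore_G_A_path, --restore_D_A_path, --restore_G_B_path and --restore_D_B_path, used roughly like:

!python train.py --dataroot database/face2smile \
  --model cycle_gan \
  --restore_G_A_path .../checkpoints/140_net_G_A.pth \
  --restore_D_A_path .../checkpoints/140_net_D_A.pth \
  --restore_G_B_path .../checkpoints/140_net_G_B.pth \
  --restore_D_B_path .../checkpoints/140_net_D_B.pth \
  --epoch_base 141

The --restore_teacher_G_path / --restore_pretrained_G_path / --restore_D_path flags belong to distill.py rather than train.py, which would explain why they do not show up in train_options.py.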

Key Error in training pix2pix with unaligned data

I have been trying to train a pix2pix model on the ukiyoe2photo dataset as given in this notebook by SnapML. The create_eval_dataloader function gives a KeyError with the following log.

Traceback (most recent call last):
  File "train.py", line 14, in <module>
    trainer.start()
  File "/content/CAT/trainer.py", line 328, in start
    (epoch, total_iter))
  File "/content/CAT/trainer.py", line 272, in evaluate
    metrics = self.model.evaluate_model(iter)
  File "/content/CAT/models/pix2pix_model.py", line 507, in evaluate_model
    self.set_input(data_i)
  File "/content/CAT/models/pix2pix_model.py", line 439, in set_input
    self.real_A = input['A' if AtoB else 'B'].to(self.device)
KeyError: 'B'

It seems the data produced by the eval_dataloader contains only two keys, 'A' and 'A_paths'; apparently it has no 'B' or 'B_paths' fields.
It would be great if you could help me with it.

Custom dataset

Hello, may I ask whether it is possible to train the compression model on my own dataset? If so, how should the .npz file be generated? Thank you.

`get_real_stat.py` for custom dataset

Hello,

I am testing CycleGAN following the README.
It seems I need .npz files for training, and I see this command:

!python get_real_stat.py \
--dataroot database/horse2zebra \
--output_path real_stat/horse2zebra_A.npz \
--direction AtoB

and when I look at the --dataroot flag, it requires trainA, trainB, valA, valB, plus train, val, etc. folders.

    parser.add_argument(
        '--dataroot',
        required=True,
        help=
        'path to images (should have subfolders trainA, trainB, valA, valB, train, val, etc)'
    )

I see CycleGAN usually requires 4 folders, like trainA, trainB, testA (valA), testB (valB), but I am not sure what the train, val, etc. folders are for.
Could you explain how to set up those folders? Or is there a way to train without .npz files?
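For reference, a plausible layout matching that help string; the assumption (based on the dataset modes used in these issues) is that trainA/trainB/valA/valB serve the unaligned (CycleGAN-style) loaders, while train/val hold paired images for the aligned (pix2pix-style) loaders:

database/horse2zebra/
  trainA/   trainB/    # unpaired training images, one folder per domain
  valA/     valB/      # unpaired validation images (sometimes named testA/testB)
database/some_aligned_set/
  train/    val/       # paired A|B images for --dataset_mode aligned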

`restore` options to resume `distill.py`

Hello,

I got the following error when exporting to ONNX.

Traceback (most recent call last):
  File "/home/ubuntu/CAT/onnx_export.py", line 13, in <module>
    exporter = Exporter()
  File "/home/ubuntu/CAT/onnx_exporter.py", line 59, in __init__
    model.netG_student.load_state_dict(
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for InceptionGenerator:
	size mismatch for down_sampling.7.weight: copying a param with shape torch.Size([234, 16, 3, 3]) from checkpoint, the shape in current model is torch.Size([230, 16, 3, 3]).
	size mismatch for down_sampling.7.bias: copying a param with shape torch.Size([234]) from checkpoint, the shape in current model is torch.Size([230]).
...
and many more size mismatches

This is strange, since I checked that this worked before and actually exported models.
Let me share my commands for distilling and exporting.

!python distill.py --dataroot database/face2smile \
  --dataset_mode unaligned \
  --distiller inception \
  --gan_mode lsgan \
  --log_dir logs/cycle_gan/face2smile/inception/student/2p6B \
  --restore_teacher_G_path logs/cycle_gan/face2smile/inception/teacher/checkpoints/best_B_net_G_A.pth \
  --restore_pretrained_G_path logs/cycle_gan/face2smile/inception/teacher/checkpoints/best_B_net_G_A.pth \
  --restore_D_path logs/cycle_gan/face2smile/inception/teacher/checkpoints/best_B_net_D_A.pth \
  --real_stat_path real_stat/face2smile_B.npz \
  --nepochs 500 --nepochs_decay 500 \
  --teacher_netG inception_9blocks --student_netG inception_9blocks \
  --pretrained_ngf 64 --teacher_ngf 64 --student_ngf 20 \
  --ndf 64 \
  --num_threads 80 \
  --eval_batch_size 4 \
  --batch_size 80 \
  --gpu_ids 0,1,2,3 \
  --norm_affine \
  --norm_affine_D \
  --channels_reduction_factor 6 \
  --kernel_sizes 1 3 5 \
  --lambda_distill 1.0 \
  --lambda_recon 5 \
  --prune_cin_lb 16 \
  --target_flops 2.6e9 \
  --distill_G_loss_type ka \
  --save_epoch_freq 1 \
  --save_latest_freq 500 \
  --norm_student batch \
  --padding_type_student zero \
  --norm_affine_student \
  --norm_track_running_stats_student
!python3 onnx_export.py --dataroot database/face2smile \
  --log_dir onnx_files/cycle_gan/face2smile/inception/student/2p6B \
  --restore_teacher_G_path logs/cycle_gan/face2smile/inception/teacher/checkpoints/best_A_net_G_A.pth \
  --pretrained_student_G_path logs/cycle_gan/face2smile/inception/student/2p6B/checkpoints/best_net_G.pth \
  --real_stat_path real_stat/face2smile_B.npz \
   --dataset_mode unaligned \
  --pretrained_ngf 64 --teacher_ngf 64 --student_ngf 20 \
  --gpu_ids 0 \
  --norm_affine \
  --channels_reduction_factor 6 \
  --kernel_sizes 1 3 5 \
  --prune_cin_lb 16 \
  --target_flops 2.6e9 \
  --ndf 64 \
  --batch_size 8 \
  --eval_batch_size 2 \
  --num_threads 8 \
  --norm_affine_D \
  --teacher_netG inception_9blocks --student_netG inception_9blocks \
  --distiller inception \
  --gan_mode lsgan \
  --norm_student batch \
  --padding_type_student zero \
  --norm_affine_student \
  --norm_track_running_stats_student

AssertionError: Invalid device id

Hello, I encountered errors in model training. What should I do? Do you have any suggestions? Thank you very much!
"AssertionError: Invalid device id"
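This assertion typically comes from torch.nn.DataParallel being handed a GPU id that does not exist on the machine; since the failing command is not shown here, a hedged first check is to compare --gpu_ids against what PyTorch can actually see:

import torch
print(torch.cuda.is_available())   # False means no usable CUDA device at all
print(torch.cuda.device_count())   # every id in --gpu_ids must be < this value
# e.g. on a single-GPU machine, pass --gpu_ids 0 (not 0,1,2,3)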

Error when distilling CycleGan

Hi again,
I am now trying to use CAT to train and distill a CycleGAN transformation.
The training worked well, but I hit a problem when starting to distill with:

 !python distill.py --dataroot /local_path/ \
  --dataset_mode unaligned \
  --distiller inception \
  --gan_mode lsgan \
  --log_dir /local_path/student \
  --restore_pretrained_G_path /local_path/teacher/checkpoints/best_A_net_G_A.pth \
  --restore_teacher_G_path /local_path/teacher/checkpoints/best_A_net_G_A.pth \
  --restore_D_path /local_path/teacher/checkpoints/best_A_net_D_A.pth \
  --real_stat_path /local_path/out_stat_B.npz \
  --nepochs 500 --nepochs_decay 500 \
  --save_latest_freq 25000 --save_epoch_freq 25 \
  --teacher_netG inception_9blocks --student_netG inception_9blocks \
  --pretrained_ngf 64 --teacher_ngf 64 --student_ngf 24 \
  --eval_batch_size 2 \
  --gpu_ids 0 \
  --norm batch \
  --norm_affine \
  --norm_affine_D \
  --norm_track_running_stats \
  --channels_reduction_factor 6 \
  --kernel_sizes 1 3 5 \
  --lambda_distill 1.0 \
  --prune_cin_lb 16 \
  --target_flops 6.6e9 \
  --distill_G_loss_type ka

But I keep getting a RuntimeError (similar to the one fixed in issue #11):

Load network at /local_path/teacher/checkpoints/best_A_net_G_A.pth
Traceback (most recent call last):
  File "distill.py", line 13, in <module>
    trainer = Trainer('distill')
  File "/content/drive/MyDrive/CAT/trainer.py", line 80, in __init__
    model.setup(opt)
  File "/content/drive/MyDrive/CAT/distillers/base_inception_distiller.py", line 260, in setup
    self.load_networks(verbose)
  File "/content/drive/MyDrive/CAT/distillers/inception_distiller.py", line 192, in load_networks
    super(InceptionDistiller, self).load_networks(prune_continue)
  File "/content/drive/MyDrive/CAT/distillers/base_inception_distiller.py", line 362, in load_networks
    verbose)
  File "/content/drive/MyDrive/CAT/utils/util.py", line 139, in load_network
    net.load_state_dict(weights)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for InceptionGenerator:
	Missing key(s) in state_dict: "down_sampling.2.running_mean", "down_sampling.2.running_var",...

Lots of other missing keys.

Please let me know if there is something wrong or missing in my command line. (BTW, the teacher training produced 4 generator networks; which one should I use to distill the A-to-B generator?)
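A hedged observation on the missing keys: running_mean/running_var are buffers that only exist when a norm layer tracks running statistics, so "Missing key(s) ... running_mean" usually means the checkpoint was saved from a model built with different --norm / --norm_track_running_stats settings than the one being rebuilt for distillation. A quick PyTorch illustration:

import torch.nn as nn
print(nn.BatchNorm2d(8).state_dict().keys())
# odict_keys(['weight', 'bias', 'running_mean', 'running_var', 'num_batches_tracked'])
print(nn.InstanceNorm2d(8).state_dict().keys())
# odict_keys([]) -- no running stats (or affine params) by default

If that is the cause here, matching the teacher's original norm settings in the distill command should fix it; this is inferred from the error shape, not verified against the repo.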

Module Error in pix2pix distillation

I have built a fully functional pix2pix model using custom code. I loaded this pre-trained model I made, but I am not able to compress it due to the following error.

RuntimeError: Error(s) in loading state_dict for InceptionGenerator:
Missing key(s) in state_dict: "down_sampling.1.weight", "down_sampling.2.weight", "down_sampling.2.bias", .....
Unexpected key(s) in state_dict: "final_conv.weight", "final_conv.bias", "encoders.0.conv.weight","encoders.1.bn.bias", "encoders.1.bn.running_mean",......

I just wanted to know whether the distillation code works only for the architecture trained using the code in models/pix2pix_model.py and trainer.py. The code I used to train the pix2pix GAN is here.

custom CycleGAN training epochs and data preparation

Hello,

I trained a CycleGAN on the CelebA dataset.
My goal is to make non-smiling faces smile.
I checked this dataset with another CycleGAN repo, and it worked pretty well with 300 epochs.
So I trained the CAT CycleGAN with --nepochs 160 --nepochs_decay 160 for both training and distilling.
I read that other scripts, such as horse2zebra, use --nepochs 500 --nepochs_decay 500, so I guess my training epochs were too few, but I'd like to ask your opinion first.
[result images attached]

I can only see a slight sign that the CycleGAN blurred the male's mouth, and another concern is that the output is too blurry, as if low resolution.
So my questions are:

Do you think that is due to the small number of epochs I trained?

Let me share the commands I used for training.
I also wonder if my get_real_stat command is wrong:

!python get_real_stat.py --dataroot database/face2smile/valB --output_path real_stat/face2smile_B.npz --direction AtoB --dataset_mode single
!python get_real_stat.py --dataroot database/face2smile/valA --output_path real_stat/face2smile_A.npz --direction BtoA --dataset_mode single
!python train.py --dataroot database/face2smile \
  --model cycle_gan \
  --log_dir logs/cycle_gan/face2smile/inception/teacher \
  --netG inception_9blocks \
  --real_stat_A_path real_stat/face2smile_A.npz \
  --real_stat_B_path real_stat/face2smile_B.npz \
  --batch_size 16 \
  --restore_G_A_path logs/cycle_gan/face2smile/inception/teacher/checkpoints/140_net_G_A.pth \
  --restore_D_A_path logs/cycle_gan/face2smile/inception/teacher/checkpoints/140_net_D_A.pth \
  --restore_G_B_path logs/cycle_gan/face2smile/inception/teacher/checkpoints/140_net_G_B.pth \
  --restore_D_B_path logs/cycle_gan/face2smile/inception/teacher/checkpoints/140_net_D_B.pth \
  --epoch_base 141 \
  --nepochs 10 --nepochs_decay 140 \
  --num_threads 16 \
  --gpu_ids 0,1,2,3 \
  --norm_affine \
  --norm_affine_D \
  --channels_reduction_factor 6 \
  --kernel_sizes 1 3 5
!python distill.py --dataroot database/face2smile \
  --log_dir logs/cycle_gan/face2smile/inception/student/2p6B \
  --restore_teacher_G_path logs/cycle_gan/face2smile/inception/teacher/checkpoints/best_A_net_G_A.pth \
  --restore_pretrained_G_path logs/cycle_gan/face2smile/inception/teacher/checkpoints/best_A_net_G_A.pth \
  --restore_D_path logs/cycle_gan/face2smile/inception/teacher/checkpoints/best_A_net_D_A.pth \
  --real_stat_path real_stat/face2smile_B.npz \
  --dataset_mode unaligned \
  --distiller inception \
  --gan_mode lsgan \
  --nepochs 160 \
  --nepochs_decay 160 \
  --teacher_netG inception_9blocks --student_netG inception_9blocks \
  --pretrained_ngf 64 --teacher_ngf 64 --student_ngf 20 \
  --ndf 64 \
  --num_threads 2 \
  --eval_batch_size 2 \
  --batch_size 4 \
  --gpu_ids 0 \
  --norm_affine \
  --norm_affine_D \
  --channels_reduction_factor 6 \
  --kernel_sizes 1 3 5 \
  --lambda_distill 1.0 \
  --lambda_recon 5 \
  --prune_cin_lb 16 \
  --target_flops 2.6e9 \
  --distill_G_loss_type ka \
  --crop_size 512,256 \
  --preprocess resize_and_crop \
  --load_size 600 \
  --save_epoch_freq 1 \
  --save_latest_freq 500 \
  --direction BtoA \
  --norm_student batch \
  --padding_type_student zero \
  --norm_affine_student \
  --norm_track_running_stats_student

So now I am planning to train more, as below. Do you think I could get working results with these?

 --epoch_base 261 \ #best_net_epoch
 --nepochs 240 --nepochs_decay 500 \

About the blurring.

My dataset images are 178x218 and centered on the head (sample image attached), and I see the model's training input dimension is 256x256.
Do I need to do some data preparation for that, like upscaling the training data?
I saw these distilling options for horse2zebra:

  --crop_size 512,256 \
  --preprocess resize_and_crop \
  --load_size 600 \

In my case, I probably want the face-crop texture only, so I am thinking of using near-default options like the ones below.

  --crop_size 286 \
  --preprocess resize_and_crop \
  --load_size 256, 256 \

Do you think this is the right approach?

Sorry for so many questions; I wanted to make my model work, so I listed all the questions that might be related.
Please let me know if you have any opinions on how to make the model work better.

Best,
Youjin

Distilling doesn't work as expected.

Hello,

Since my last question (#24), I have tried 512x512 training for both the teacher and student models.
I found that the teacher model works fine at 512x512, but student training is not working.
I wonder if I can get some hints as to why.

Teacher fake image: [attached]
Student fake image (274/1000 epochs): [attached]

training options

!python train.py --dataroot database/face2smile \
  --model cycle_gan \
  --log_dir logs/cycle_gan/face2smile/teacher_512 \
  --netG inception_9blocks \
  --real_stat_A_path real_stat_512/face2smile_A.npz \
  --real_stat_B_path real_stat_512/face2smile_B.npz \
  --batch_size 4 \
  --num_threads 32 \
  --gpu_ids 0,1,2,3 \
  --norm_affine \
  --norm_affine_D \
  --channels_reduction_factor 6 \
  --kernel_sizes 1 3 5 \
  --save_latest_freq 10000 --save_epoch_freq 5 \
  --epoch_base 176 --iter_base 223395 \
  --nepochs 324 --nepochs_decay 500 \
  --preprocess scale_width --load_size 512
!python distill.py --dataroot database/face2smile \
  --dataset_mode unaligned \
  --distiller inception \
  --gan_mode lsgan \
  --log_dir logs/cycle_gan/face2smile/student_512 \
  --restore_teacher_G_path logs/cycle_gan/face2smile/teacher_512/checkpoints/170_net_G_A.pth \
  --restore_pretrained_G_path logs/cycle_gan/face2smile/teacher_512/checkpoints/170_net_G_A.pth \
  --restore_D_path logs/cycle_gan/face2smile/teacher_512/checkpoints/170_net_D_A.pth \
  --real_stat_path real_stat_512/face2smile_B.npz \
  --teacher_netG inception_9blocks --student_netG inception_9blocks \
  --pretrained_ngf 64 --teacher_ngf 64 --student_ngf 20 \
  --ndf 64 \
  --num_threads 32 \
  --eval_batch_size 4 \
  --batch_size 32 \
  --gpu_ids 0,1,2,3 \
  --norm_affine \
  --norm_affine_D \
  --channels_reduction_factor 6 \
  --kernel_sizes 1 3 5 \
  --lambda_distill 1.0 \
  --lambda_recon 5 \
  --prune_cin_lb 16 \
  --target_flops 2.6e9 \
  --distill_G_loss_type ka \
  --preprocess scale_width --load_size 512 \
  --save_epoch_freq 2 --save_latest_freq 1000 \
  --nepochs 500 --nepochs_decay 500 \
  --norm_student batch \
  --padding_type_student zero \
  --norm_affine_student \
  --norm_track_running_stats_student

get_real_stat.py for custom dataset

I am working on a pix2pix model and ran this code to get the real-stat .npz files, with the folder structure specified in get_real_stat.py:

!python get_real_stat.py \
--dataroot database/sketch/trainA \
--dataset_mode single \
--output_path real_stat/sketch_stats_train_A.npz \
--direction BtoA \
--phase train
!python get_real_stat.py \
--dataroot database/sketch/trainB \
--dataset_mode single \
--output_path real_stat/sketch_stats_train_B.npz \
--direction BtoA \
--phase train

I am facing this error:

Traceback (most recent call last):
  File "get_real_stat.py", line 146, in <module>
    main(opt)
  File "get_real_stat.py", line 40, in main
    tensors = torch.cat(tensors, dim=0)
RuntimeError: Sizes of tensors must match except in dimension 0. Got 256 and 560 in dimension 2 (The offending index is 909)


Should I edit the dataloader for this, or is it something else? Any help is appreciated.
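The error itself is generic torch.cat behaviour: every non-batch dimension must match, so a dataset with mixed image sizes will fail exactly like this unless the preprocessing resizes or crops everything to a common size. A minimal reproduction:

import torch
a = torch.randn(1, 3, 256, 256)
b = torch.randn(1, 3, 560, 560)
torch.cat([a, b], dim=0)  # RuntimeError: Sizes of tensors must match except in dimension 0

One hedged fix, rather than editing the dataloader, would be passing a --preprocess / --load_size setting that normalizes the image sizes before they are stacked.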

ONNX Issue

Hi, I tried to run this model on iOS (using Core ML) and it works fine, but when I ran it on Android with ONNX, the inference time went up to 180 ms, whereas it runs at around 20-30 ms on iOS.

I tried to look into the ONNX export and found there are some issues.

Has anyone faced the same problem, and how do you fix it?

CUDA out of memory

Update: I wonder if the repo could be updated to use torch.amp / mixed precision.

Hello,

I'd like to ask whether, with my dataset and machine, it is normal to see out-of-memory errors, or whether I might have a configuration issue.

RuntimeError: CUDA out of memory. Tried to allocate 260.00 MiB (GPU 0; 15.78 GiB total capacity; 14.02 GiB already allocated; 198.19 MiB free; 14.13 GiB reserved in total by PyTorch)

I am using an AWS p3.8xlarge (4 Tesla V100s), trying to train a CycleGAN with 5055 images at 1024x1024 resolution.

I checked that a resized 512x512 dataset works with batch size 4, but at 1024x1024 even batch size 1 doesn't work.

I think we need a p4d.24xlarge for this project, but it's hard to get that instance due to a lack of zone capacity.

Possible things to try:
- reduce the size of the dataset (but I think 5055 images is still small for training; my colleague thinks the model loads the whole dataset at once, and that's the reason we would need to reduce it)
- find a memory leak?

any comments or hints are appreciated.
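On the torch.amp idea above: a minimal sketch of what mixed precision looks like in a PyTorch training loop. This is generic PyTorch 1.x API, not the repo's trainer; the model and data here are placeholders:

import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1).cuda()    # stand-in for the generator
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):                                   # stand-in for the data loader
    x = torch.randn(1, 3, 1024, 1024, device='cuda')
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                   # forward pass runs in fp16 where safe
        loss = (model(x) - x).abs().mean()
    scaler.scale(loss).backward()                     # scaled backward to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()

Autocast can substantially reduce activation memory, which is the dominant cost at 1024x1024, though it would not by itself guarantee batch size 1 fits.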

below is the log for reference.

train.py --dataroot database/face2smile \
>   --model cycle_gan \
>   --log_dir logs/cycle_gan/face2smile/teacher_1080 \
>   --netG inception_9blocks \
>   --real_stat_A_path real_stat_1080/face2smile_A.npz \
>   --real_stat_B_path real_stat_1080/face2smile_B.npz \
>   --batch_size 1 \
>   --num_threads 1 \
>   --gpu_ids 0,1,2,3 \
>   --norm_affine \
>   --norm_affine_D \
>   --channels_reduction_factor 6 \
>   --kernel_sizes 1 3 5 \
>   --save_latest_freq 10000 --save_epoch_freq 5 \
>   --nepochs 1 --nepochs_decay 0 \
>   --preprocess none
----------------- Options ---------------
                active_fn: nn.ReLU                       
              active_fn_D: nn.LeakyReLU                  
             aspect_ratio: 1.0                           
               batch_size: 4                             	[default: 1]
                    beta1: 0.5                           
                 channels: None                          
channels_reduction_factor: 6                             	[default: 1]
          cityscapes_path: database/cityscapes-origin    
                crop_size: 256, 256                      
                 dataroot: database/face2smile           	[default: None]
             dataset_mode: unaligned                     
                direction: AtoB                          
          display_winsize: 256                           
                 drn_path: drn-d-105_ms_cityscapes.pth   
             dropout_rate: 0                             
               epoch_base: 1                             
          eval_batch_size: 1                             
                 gan_mode: lsgan                         
                  gpu_ids: 0,1,2,3                       	[default: 0]
                init_gain: 0.02                          
                init_type: normal                        
                 input_nc: 3                             
                  isTrain: True                          	[default: None]
                iter_base: 1                             
             kernel_sizes: [1, 3, 5]                     	[default: [3, 5, 7]]
                 lambda_A: 10.0                          
                 lambda_B: 10.0                          
          lambda_identity: 0.5                           
           load_in_memory: False                         
                load_size: 286                           
                  log_dir: logs/cycle_gan/face2smile/teacher_1080	[default: logs]
                       lr: 0.0002                        
           lr_decay_iters: 50                            
                lr_policy: linear                        
         max_dataset_size: -1                            
                    model: cycle_gan                     	[default: pix2pix]
     moving_average_decay: 0.0                           
moving_average_decay_adjust: False                         
moving_average_decay_base_batch: 32                            
               n_layers_D: 3                             
                      ndf: 64                            
                  nepochs: 1                             	[default: 100]
            nepochs_decay: 0                             	[default: 100]
                     netD: n_layers                      
                     netG: inception_9blocks             
                      ngf: 64                            
                  no_flip: False                         
                     norm: instance                      
              norm_affine: True                          	[default: False]
            norm_affine_D: True                          	[default: False]
             norm_epsilon: 1e-05                         
            norm_momentum: 0.1                           
             norm_student: instance                      
 norm_track_running_stats: False                         
              num_threads: 32                            	[default: 4]
                output_nc: 3                             
             padding_type: reflect                       
                    phase: train                         
                pool_size: 50                            
               preprocess: none                          	[default: resize_and_crop]
               print_freq: 100                           
         real_stat_A_path: real_stat_1080/face2smile_A.npz	[default: None]
         real_stat_B_path: real_stat_1080/face2smile_B.npz	[default: None]
         restore_D_A_path: None                          
         restore_D_B_path: None                          
         restore_G_A_path: None                          
         restore_G_B_path: None                          
           restore_O_path: None                          
          save_epoch_freq: 5                             	[default: 20]
         save_latest_freq: 10000                         	[default: 20000]
                     seed: 233                           
           serial_batches: False                         
               table_path: datasets/table.txt            
          tensorboard_dir: None                          
----------------- End -------------------
train.py --dataroot database/face2smile --model cycle_gan --log_dir logs/cycle_gan/face2smile/teacher_1080 --netG inception_9blocks --real_stat_A_path real_stat_1080/face2smile_A.npz --real_stat_B_path real_stat_1080/face2smile_B.npz --batch_size 4 --num_threads 32 --gpu_ids 0,1,2,3 --norm_affine --norm_affine_D --channels_reduction_factor 6 --kernel_sizes 1 3 5 --save_latest_freq 10000 --save_epoch_freq 5 --nepochs 1 --nepochs_decay 0 --preprocess none
dataset [UnalignedDataset] was created
The number of training images = 5055
data shape is: channel=3, height=1024, width=1024.
initialize network with normal
initialize network with normal
initialize network with normal
initialize network with normal
dataset [SingleDataset] was created
dataset [SingleDataset] was created
/home/ubuntu/.local/lib/python3.9/site-packages/torchvision/models/inception.py:80: FutureWarning: The default weight initialization of inception_v3 will be changed in future releases of torchvision. If you wish to keep the old behavior (which leads to long initialization times due to scipy/scipy#11299), please set init_weights=True.
  warnings.warn('The default weight initialization of inception_v3 will be changed in future releases of '
model [CycleGANModel] was created
---------- Networks initialized -------------
DataParallel(
  (module): InceptionGenerator(
    (down_sampling): Sequential(
      (0): ReflectionPad2d((3, 3, 3, 3))
      (1): Conv2d(3, 64, kernel_size=(7, 7), stride=(1, 1))
      (2): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (3): ReLU(inplace=True)
      (4): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (5): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (6): ReLU(inplace=True)
      (7): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (8): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (9): ReLU(inplace=True)
    )
    (features): Sequential(
      (0): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (1): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (2): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (3): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (4): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (5): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (6): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (7): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (8): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
    )
    (up_sampling): Sequential(
      (0): ConvTranspose2d(256, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
      (1): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (2): ReLU(inplace=True)
      (3): ConvTranspose2d(128, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
      (4): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (5): ReLU(inplace=True)
      (6): ReflectionPad2d((3, 3, 3, 3))
      (7): Conv2d(64, 3, kernel_size=(7, 7), stride=(1, 1))
      (8): Tanh()
    )
  )
)
[Network G_A] Total number of parameters : 8.154 M
DataParallel(
  (module): InceptionGenerator(
    (down_sampling): Sequential(
      (0): ReflectionPad2d((3, 3, 3, 3))
      (1): Conv2d(3, 64, kernel_size=(7, 7), stride=(1, 1))
      (2): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (3): ReLU(inplace=True)
      (4): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (5): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (6): ReLU(inplace=True)
      (7): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (8): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (9): ReLU(inplace=True)
    )
    (features): Sequential(
      (0): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (1): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (2): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (3): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (4): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (5): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (6): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (7): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (8): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
    )
    (up_sampling): Sequential(
      (0): ConvTranspose2d(256, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
      (1): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (2): ReLU(inplace=True)
      (3): ConvTranspose2d(128, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
      (4): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (5): ReLU(inplace=True)
      (6): ReflectionPad2d((3, 3, 3, 3))
      (7): Conv2d(64, 3, kernel_size=(7, 7), stride=(1, 1))
      (8): Tanh()
    )
  )
)
[Network G_B] Total number of parameters : 8.154 M
DataParallel(
  (module): NLayerDiscriminator(
    (model): Sequential(
      (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (1): LeakyReLU(negative_slope=0.2, inplace=True)
      (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (3): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (4): LeakyReLU(negative_slope=0.2, inplace=True)
      (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (6): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (7): LeakyReLU(negative_slope=0.2, inplace=True)
      (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
      (9): InstanceNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (10): LeakyReLU(negative_slope=0.2, inplace=True)
      (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
    )
  )
)
[Network D_A] Total number of parameters : 2.767 M
DataParallel(
  (module): NLayerDiscriminator(
    (model): Sequential(
      (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (1): LeakyReLU(negative_slope=0.2, inplace=True)
      (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (3): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (4): LeakyReLU(negative_slope=0.2, inplace=True)
      (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (6): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (7): LeakyReLU(negative_slope=0.2, inplace=True)
      (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
      (9): InstanceNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (10): LeakyReLU(negative_slope=0.2, inplace=True)
      (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
    )
  )
)
[Network D_B] Total number of parameters : 2.767 M
-----------------------------------------------
start_epoch: 1
end_epoch: 1
total_iter: 1
current memory allocated: 265.4296875
max memory allocated: 265.4296875
cached memory: 276.0
will set input data
Traceback (most recent call last):
  File "/data/CAT/train.py", line 14, in <module>
    trainer.start()
  File "/data/CAT/trainer.py", line 159, in start
    model.optimize_parameters(total_iter)
  File "/data/CAT/models/cycle_gan_model.py", line 295, in optimize_parameters
    self.forward()
  File "/data/CAT/models/cycle_gan_model.py", line 235, in forward
    self.rec_A = self.netG_B(self.fake_B)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/CAT/models/modules/inception_architecture/inception_generator.py", line 141, in forward
    res = self.up_sampling(res)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/padding.py", line 173, in forward
    return F.pad(input, self.padding, 'reflect')
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 4014, in _pad
    return torch._C._nn.reflection_pad2d(input, pad)
RuntimeError: CUDA out of memory. Tried to allocate 260.00 MiB (GPU 0; 15.78 GiB total capacity; 14.02 GiB already allocated; 198.19 MiB free; 14.13 GiB reserved in total by PyTorch)

training question

Hello again,

I am training CycleGAN with ffhq dataset, trying to make a filter that makes the user smile.
I checked my recent training results, but they weren't satisfying.
So I'd like to ask for your advice on improving the model's performance.

Initially, I trained with 2500 unaligned pairs of smiling/non-smiling 1024x1024 photos and got fid_B: 88.244 with --nepochs 500 --nepochs_decay 500.
While training the teacher model, I found the best model at epoch 181/1000. I kept training until the 841st epoch but couldn't find a better one.
For distilling, I got fid: 87.976 at epoch 270 with the options --nepochs 500 --nepochs_decay 500.

[training-curve screenshots attached]

So I added 2500 more unaligned pairs to the dataset and am training again now.
I am eager to make this filter look great; any advice is appreciated.

Also, one more question: is it right to resume from latest_net rather than from the best_net I have?
With the same learning-rate options, I'd guess that resuming from the best net (or 180_net) would just repeat what I trained in the last run.
Please let me know if my assumption is correct.
