
Problem in reproducing FID score · ml-gmpi (9 comments, closed)

apple commented on August 20, 2024

from ml-gmpi.

Comments (9)

Xiaoming-Zhao commented on August 20, 2024

I guess the ffhq256x256.zip file format is the same as the preprocessed one.

I am not sure about this one, as I have never tried the link you referred to.

I trained with this data and checked the training score.

This is weird. Do you mind sharing the hardware setup? E.g., the number of GPUs, etc. For GAN's training, the batch size matters a lot. If possible, please use the same batch size as specified in the paper.

I really want to reference your research, but the dataset problem is not easy for me...

I am sorry, and I definitely understand that this is a headache. BTW, I recently encountered a tool, rclone, that can interact with GDrive well (on a remote server, etc.). Please refer to the documentation for how to set it up. The only part that needs some effort is setting up a Google Cloud API, which is not hard if you follow its instructions.

Hope these help.


parkjh688 commented on August 20, 2024

Hi, @Xiaoming-Zhao!

I'm facing a similar issue. I trained my model with the following configuration: 256: {'batch_size': 64, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 16, 'gen_lr': 0.002, 'disc_lr': 0.002}, using the ffhq256x256 dataset from Kaggle. However, my FID score keeps increasing after the first 1000 steps, as shown in the graph below:

[figure: FID curve]

In the original paper, the model was trained with a batch size of 64, so I also tried that, but the FID score still increased. Previously, when I trained with a batch size of 8, the FID score was around 20 and remained stable until the end of 5000 steps. However, this time, even though I started with a lower FID score of 18, it kept increasing after around 1000 steps.

I'm wondering if using the ffhq256x256 data from Kaggle instead of real 256-sized data could be causing overfitting, since I believe the Kaggle set is slightly smaller. Are there any other possible reasons for this behavior?

Thanks!


Xiaoming-Zhao commented on August 20, 2024

Hi @parkjh688, I need some more information if possible.

In the original paper, the model was trained with a batch size of 64, so I also tried that

How many GPUs did you use to train GMPI? The reason I am asking is that the batch_size specified in the curriculums.py is for batch size per GPU. The batch size of 64 stated in the paper comes from 8 (per GPU) x 8 (#GPUs) = 64.
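The arithmetic above can be sketched in a few lines (the helper name here is illustrative, not an actual GMPI function):

```python
# Illustrative helper (not part of GMPI): the total batch size under
# data parallelism is the per-GPU batch size times the number of GPUs.
def effective_batch_size(batch_size_per_gpu: int, num_gpus: int) -> int:
    return batch_size_per_gpu * num_gpus

# The paper's setting: 8 images per GPU across 8 GPUs.
assert effective_batch_size(8, 8) == 64
# An equivalent alternative: 16 images per GPU across 4 GPUs.
assert effective_batch_size(16, 4) == 64
```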

Previously, when I trained with a batch size of 8, the FID score was around 20 and remained stable until the end of 5000 steps.

What is the dataset you used for this training? And how many GPUs were you using?

I'm wondering if using ffhq256x256 data from Kaggle instead of real 256-sized data could be causing overfitting.

One caveat I could see is that the Kaggle dataset uses a different resizing method from the one the pre-trained StyleGAN2 uses. Specifically:
a. The Kaggle one uses bicubic as specified on the webpage.
b. The StyleGAN2 is trained on images downscaled with lanczos (see here).

So maybe you want to process the dataset following the official instructions to double-check.
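The two filters differ in their interpolation kernels: bicubic uses a piecewise cubic polynomial, while Lanczos uses a windowed sinc. Below is a minimal sketch of the textbook Lanczos-3 kernel (this is the standard definition, not GMPI or StyleGAN2 code):

```python
import math

def lanczos_kernel(x: float, a: int = 3) -> float:
    """Lanczos window: sinc(x) * sinc(x / a) for |x| < a, else 0."""
    if x == 0.0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)

# The kernel is 1 at the sample point and ~0 at other integer offsets,
# so already-aligned pixels pass through (almost) unchanged.
assert abs(lanczos_kernel(0.0) - 1.0) < 1e-12
assert abs(lanczos_kernel(1.0)) < 1e-12
```

In Pillow, for example, `img.resize((256, 256), Image.LANCZOS)` versus `Image.BICUBIC` selects between these two kernel families, which is why the same source images can yield measurably different 256x256 sets.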

I trained my model with the following configuration: 256: {'batch_size': 64, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 16, 'gen_lr': 0.002, 'disc_lr': 0.002}

I noticed that you have batch_split of 16. This means that you will split the 64 images into 16 mini-batches and accumulate gradients 16 times. Theoretically, this should be fine. However, I would recommend using only one single forward pass if possible to avoid any hidden issues.
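The batch_split mechanics can be checked with a toy example: accumulating per-split gradients, weighted by each split's share of the batch, reproduces the full-batch gradient exactly. This mirrors what gradient accumulation does in spirit (plain Python, not GMPI code):

```python
# Toy check of what batch_split does: accumulate per-split gradients,
# weighted by each split's share of the batch, and compare with the
# gradient of one full-batch pass.

def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full_grad = grad_mse(w, xs, ys)

# Split the batch of 4 into 2 mini-batches (batch_split = 2).
splits = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
accum_grad = sum(
    grad_mse(w, sx, sy) * len(sx) / len(xs) for sx, sy in splits
)

assert abs(full_grad - accum_grad) < 1e-12  # identical up to float error
```

The equivalence is exact in theory; hidden issues in practice usually come from batch-dependent layers (e.g., batch-norm statistics computed per mini-batch) rather than from the accumulation itself.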

Hope these help.


parkjh688 commented on August 20, 2024

The reason I am asking is that the batch_size specified in the curriculums.py is for batch size per GPU. The batch size of 64 stated in the paper comes from 8 (per GPU) x 8 (#GPUs) = 64.

Oh, I didn't know that. I used 6 GPUs, and my configuration in curriculums.py was 256: {'batch_size': 64, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 16, 'gen_lr': 0.002, 'disc_lr': 0.002}. So if I want to train the model with a batch size of 64, I should train with 4 GPUs and set the configuration to 256: {'batch_size': 16, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 1, 'gen_lr': 0.002, 'disc_lr': 0.002}, since 16 (per GPU) x 4 (#GPUs) = 64.

What is the dataset you used for this training? And how many GPUs were you using?

I used this Kaggle dataset.

Thanks!


Xiaoming-Zhao commented on August 20, 2024

Got it. So the Kaggle dataset is indeed able to reproduce the FID, based on your previous statement:

Previously, when I trained with a batch size of 8, the FID score was around 20 and remained stable until the end of 5000 steps.

Then I would recommend reducing batch_split to see whether it is the culprit for the odd FID curve you showed. As I mentioned, theoretically batch_split = 16 should be fine, but I am not sure whether there are hidden issues there.

Hope this helps.


howtowhy commented on August 20, 2024

Hello, thank you for your detailed help.
I downloaded the 1024 images and preprocessed them to 256 with the script.
I used the following options with 8 GPUs:

"res_dict": {
    256: {'batch_size': 64, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 16, 'gen_lr': 0.002, 'disc_lr': 0.002},
    512: {'batch_size': 4, 'num_steps': 32, 'img_size': 512, 'tex_size': 512, 'batch_split': 1, 'gen_lr': 0.002, 'disc_lr': 0.002},
    1024: {'batch_size': 4, 'num_steps': 32, 'img_size': 1024, 'tex_size': 1024, 'batch_split': 2, 'gen_lr': 0.002, 'disc_lr': 0.002},
},

"res_dict_learnable_param": {
    256: {'batch_size': 64, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 16, 'gen_lr': 0.002, 'disc_lr': 0.002},
    512: {'batch_size': 4, 'num_steps': 32, 'img_size': 512, 'tex_size': 512, 'batch_split': 2, 'gen_lr': 0.002, 'disc_lr': 0.002},
    1024: {'batch_size': 4, 'num_steps': 32, 'img_size': 1024, 'tex_size': 1024, 'batch_split': 2, 'gen_lr': 0.002, 'disc_lr': 0.002},
},

But the FID goes up, and the result looks like this.
Could you advise on this situation?

[figure: FID curve]


Xiaoming-Zhao commented on August 20, 2024

Do you mind trying the default configuration:

"res_dict": {
    256: {'batch_size': 8, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 1, 'gen_lr': 0.002, 'disc_lr': 0.002},
    512: {'batch_size': 4, 'num_steps': 32, 'img_size': 512, 'tex_size': 512, 'batch_split': 1, 'gen_lr': 0.002, 'disc_lr': 0.002},
    1024: {'batch_size': 4, 'num_steps': 32, 'img_size': 1024, 'tex_size': 1024, 'batch_split': 2, 'gen_lr': 0.002, 'disc_lr': 0.002},
},

See the discussion above in this issue. Essentially:

  1. We use batch_size of 8 as it is batch size per GPU.
    a. One caveat is that a larger batch size does not always mean better results. Though a larger batch size could contribute to the generator's learning, it could also provide the discriminator with more power to break the balance between the generator and the discriminator.
    b. I have not tried a batch size of 64 x 8 = 512, so I am not sure whether it would work.
  2. Maybe reduce batch_split from 16 to 1 (or 2) if your GPU memory allows it. Theoretically, batch_split = 16 should be fine, but I am not sure whether there are hidden issues there.

Hope these help.


howtowhy commented on August 20, 2024

Hello! Thank you for your kind help.
I ran the script with 8 GPUs and the batch size you suggested:

"res_dict": {
    256: {'batch_size': 8, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 1, 'gen_lr': 0.002, 'disc_lr': 0.002},
    512: {'batch_size': 4, 'num_steps': 32, 'img_size': 512, 'tex_size': 512, 'batch_split': 1, 'gen_lr': 0.002, 'disc_lr': 0.002},
    1024: {'batch_size': 4, 'num_steps': 32, 'img_size': 1024, 'tex_size': 1024, 'batch_split': 2, 'gen_lr': 0.002, 'disc_lr': 0.002},
},

"res_dict_learnable_param": {
    256: {'batch_size': 8, 'num_steps': 32, 'img_size': 256, 'tex_size': 256, 'batch_split': 1, 'gen_lr': 0.002, 'disc_lr': 0.002},
    512: {'batch_size': 4, 'num_steps': 32, 'img_size': 512, 'tex_size': 512, 'batch_split': 2, 'gen_lr': 0.002, 'disc_lr': 0.002},
    1024: {'batch_size': 4, 'num_steps': 32, 'img_size': 1024, 'tex_size': 1024, 'batch_split': 2, 'gen_lr': 0.002, 'disc_lr': 0.002},
},

The FID score at 256 was 18.92-21.52 (with one peak perturbation).
But the paper reports an FID of 11.4.
Is there anything I missed?
Thank you for your quick help so much.

[figure: FID curve]


Xiaoming-Zhao commented on August 20, 2024

This curve looks reasonable to me. I am not sure about the peak, but I guess it may be due to some randomness.

Regarding the FID: the score depends heavily on the number of images used to compute it. The more images you use, the more likely you are to obtain a lower score.

However, FID with many images is costly to compute. Therefore, during training, we use a small number of images to get a sense of the FID trend:

During the full evaluation, we use 50k fake and real images, as stated in the paper; this follows the StyleGAN papers:

N_IMGS=50000
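To see why the sample count matters: FID is the Fréchet distance between Gaussians fitted to feature statistics, and statistics fitted on fewer samples are noisier, which tends to inflate the score. A 1-D toy sketch of this effect (illustrative only; real FID uses high-dimensional Inception features):

```python
import math
import random

def frechet_1d(mu1, s1, mu2, s2):
    """Frechet distance between 1-D Gaussians N(mu1, s1^2) and N(mu2, s2^2)."""
    return (mu1 - mu2) ** 2 + s1 ** 2 + s2 ** 2 - 2 * s1 * s2

def fitted_stats(samples):
    """Mean and (biased) standard deviation fitted to a sample."""
    mu = sum(samples) / len(samples)
    var = sum((x - mu) ** 2 for x in samples) / len(samples)
    return mu, math.sqrt(var)

random.seed(0)
pop = [random.gauss(0.0, 1.0) for _ in range(50_000)]

# Both "real" and "fake" sets come from the same distribution, so the
# true distance is 0 -- yet small samples report a nonzero score purely
# from estimation noise in the fitted mean and standard deviation.
for n in (100, 1_000, 50_000):
    m1, s1 = fitted_stats(pop[:n])
    m2, s2 = fitted_stats(pop[-n:])
    print(n, frechet_1d(m1, s1, m2, s2))
```

The noise-driven gap typically shrinks toward 0 as n grows, which is why scores computed on a small monitoring set during training are not directly comparable with the 50k-image numbers in the paper.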

Hope this resolves your confusion.

