Comments (6)
Valuable question. I can support your points with similar but slightly different results I got for the DeiT-small model:
- 80.0% with batch size 1024 (128 x 8; the default is 256 x 4), no fluctuations.
- 79.0% with batch size 2048 (256 x 8). Accuracy dropped around the 5th epoch and recovered later, but it did not reach a better result in the end.
Hi @chunfuchen and @changlin31,
Thanks for your questions and observations,
In my experiments I observed that transformers are more sensitive than convnets to variations of the hyperparameters, but in my case small variations of the parameters did not lead to very large performance discrepancies.
For the Base model I did not observe any divergence of the validation accuracy in the first epochs, but I do not validate every epoch, so I probably just didn't see it.
I'm not sure validation accuracy is a good metric for measuring the divergence. Do you see the same behavior with the training loss?
Concerning your tests with different batch sizes:
Maybe our scaling strategy (similar to what is done with convnets) is not optimal; you may need to try another learning rate.
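For reference, as far as I can tell the rule applied in DeiT's main.py is a simple linear one (lr scaled by total batch size / 512); a minimal sketch, with variable names that are mine rather than the repo's:

```python
# Minimal sketch of the linear lr-scaling rule (as I understand DeiT's main.py;
# variable names are illustrative, not copied from the repo).
def scale_lr(base_lr: float, batch_size_per_gpu: int, num_gpus: int) -> float:
    total_batch_size = batch_size_per_gpu * num_gpus
    return base_lr * total_batch_size / 512.0

# Assuming the default base lr of 5e-4, the peak lr becomes:
#   total batch  512 -> 5.0e-4
#   total batch 1024 -> 1.0e-3
#   total batch 2048 -> 2.0e-3
for bs in (512, 1024, 2048):
    print(bs, scale_lr(5e-4, bs, 1))
```

Under this rule the only thing that changes with the batch size is the peak lr, so a square-root scaling, or simply a lower peak lr, would be one way to try another learning rate.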
Concerning the number of warmup epochs, I think that going from 5 to 10 is not a small variation. In that case it is probably necessary to adapt the lr and the total number of epochs as well.
I hope this answers your questions; don't hesitate to ask if you have any others.
Best,
Hugo
@TouvronHugo Thanks for your reply.
Regarding the batch size, I will test different learning-rate scalings.
Regarding the warmup epochs, since I saw this (near) divergence, I thought the curve might be smoother if I simply warmed up a little longer to avoid this behavior. With the longer warmup it no longer diverges, but the final performance is not as good.
The training loss has the same trend. Here are the plots:
Furthermore, for the DeiT-small models, I also tried 2k and 4k batch sizes. The default setting diverges (which is reasonable, since the learning rate is increased), so I changed the warmup epochs to 15 and 30 respectively, and both runs reach ~80.0% accuracy.
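For completeness, DeiT builds its scheduler through timm's create_scheduler; the warmup change above maps roughly onto the underlying cosine scheduler like this (a simplified sketch with parameter names as I recall them from timm 0.3.x, not the exact training code):

```python
import torch
from timm.scheduler import CosineLRScheduler  # as exported by timm 0.3.x

# Toy model/optimizer just to make the sketch self-contained.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3)  # e.g. lr already scaled for a 2k batch

epochs = 300
scheduler = CosineLRScheduler(
    optimizer,
    t_initial=epochs,
    lr_min=1e-5,          # DeiT's default min lr, if I remember correctly
    warmup_t=15,          # warmup epochs, raised from the default 5 for the 2k-batch run
    warmup_lr_init=1e-6,  # DeiT's default warmup lr, if I remember correctly
)

for epoch in range(epochs):
    # ... one epoch of training ...
    scheduler.step(epoch + 1)  # timm schedulers are stepped with the epoch index
```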
@TouvronHugo
May I ask another question? Do you see the same thing happen when using the proposed distillation approach?
Great work, and thanks for sharing the code!
I am trying to re-train the DeiT-base model but I encountered some issues. May I ask for your insights?
I can reproduce the reported 81.8% with all default settings; however, the performance degrades a lot if I change two very minor hyperparameters:
- Change the batch size to 512 (the default is 1024); the learning rate is automatically scaled by your code.
- Keep the batch size at 1024 but increase the warmup epochs to 10 (the default is 5).
Here is the test accuracy over epochs:
The orange line is the default setting. (81.8%)
The blue line is batch size 512. (78.8%)
The green line uses 10 warmup epochs. (79.2%)
Zoomed in on the first 50 epochs:
For the default setting, it seems that the model is about to diverge around the 6th epoch, but it recovers later and eventually achieves a pretty good result (81.8%).
However, when using the smaller batch size or warming up for 5 additional epochs, the performance degrades by ~3%. I wonder whether you observe the same trend, and do you have any insights into why the two small changes I made affect the result so much?
My env:
pytorch 1.7, timm 0.3.2, torchvision 0.8
Thanks.
I met with the same problem.
Hi @chunfuchen,
Good question,
Distillation stabilizes the training; I think it is probably less sensitive to these changes.
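For context, the hard-label distillation from the DeiT paper trains the distillation token against the teacher's argmax prediction while the class token is trained against the true label; a minimal sketch (tensor names are hypothetical, not taken from the repo):

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits: torch.Tensor,
                           dist_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           targets: torch.Tensor) -> torch.Tensor:
    """Hard distillation as described in the DeiT paper: half cross-entropy
    against the true labels (class token), half against the teacher's hard
    predictions (distillation token)."""
    teacher_labels = teacher_logits.argmax(dim=-1)            # teacher's hard decision
    loss_cls = F.cross_entropy(cls_logits, targets)           # class token vs. ground truth
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)  # distillation token vs. teacher
    return 0.5 * loss_cls + 0.5 * loss_dist
```

The teacher's predictions give the model a second target to fit, which is consistent with the observation above that distillation stabilizes training.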
Best,
Hugo
Related Issues (20)
- What are the hyperparameters for DeiT-III (epoch 400 or 600)?
- The ablation experiment of DeiT
- how to implement cosub training use deit-III
- how to implement cosub training use deit-III
- DeiT depth 24 (CaiT - TABLE 1)
- ImageNet21K data preparation for pre-training
- batch_size flag
- Code for cosub
- How to launch a training of CAIT models ?
- TracerWarning
- Hi,Why can't I find deit_tiny_distilled_patch16_224 in hubconf
- Checkpoints of IN21K pretrained deit III
- ViT-B Training for DeiT
- Slow Training
- random.seed(seed) in line 205 is commented
- Inclusion of Transformers Need Registers
- Training
- Question about different seeds per gpu with DDP
- Gradient accumulation code
- Will you be releasing the accuracy of the official deit III framework trained tiny version on IN1k?