
Comments (6)

changlin31 commented on August 15, 2024

Valuable question. I can support your observations with similar, though slightly different, results for the DeiT-Small model:

80.0% with batch size 1024 (128 × 8; the default is 256 × 4), with no fluctuations.

79.0% with batch size 2048 (256 × 8). Accuracy dropped around the 5th epoch and recovered later, but the run did not reach a better final result.


TouvronHugo commented on August 15, 2024

Hi @chunfuchen and @changlin31,

Thanks for your questions and observations.

In my experiments I observed that transformers are more sensitive than convnets to variations in the hyperparameters, but in my case small variations did not lead to very large performance discrepancies.

For the Base model I did not observe any divergence of the validation accuracy in the first epochs, but I do not validate every epoch, so I probably just missed it.
I am also not sure that validation accuracy is a good metric for measuring divergence. Do you see the same behavior with the training loss?

Concerning your tests with different batch sizes:
our scaling strategy (similar to what is done with convnets) may not be optimal, and you may need to try a different learning rate.
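For context, a minimal sketch of the two scaling rules being discussed. The linear rule (relative to a reference batch size of 512) matches what the DeiT training script applies; the square-root variant is one common alternative worth trying at larger batches. The constants are illustrative:

```python
# Sketch of learning-rate scaling with the effective (global) batch size.
# The linear rule mirrors what DeiT's main.py does (scaling the base lr
# by batch_size / 512); the sqrt rule is a gentler, commonly used
# alternative that may be more stable at 2k-4k batch sizes.

BASE_LR = 5e-4    # DeiT's default base learning rate
REF_BATCH = 512   # reference batch size for the scaling rules

def linear_scaled_lr(global_batch_size: int) -> float:
    """Linear scaling: lr grows proportionally with batch size."""
    return BASE_LR * global_batch_size / REF_BATCH

def sqrt_scaled_lr(global_batch_size: int) -> float:
    """Square-root scaling: lr grows more slowly with batch size."""
    return BASE_LR * (global_batch_size / REF_BATCH) ** 0.5

for bs in (512, 1024, 2048, 4096):
    print(f"batch {bs}: linear lr {linear_scaled_lr(bs):.2e}, "
          f"sqrt lr {sqrt_scaled_lr(bs):.2e}")
```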

Concerning the number of warmup epochs, I think going from 5 to 10 is not a small variation; in that case it is probably necessary to adapt the learning rate and the total number of epochs as well.

I hope I have answered your questions; do not hesitate if you have any others.

Best,

Hugo


chunfuchen commented on August 15, 2024

@TouvronHugo Thanks for your reply.

Regarding the batch size, I will test different learning-rate scalings.

Regarding the warmup epochs: since I observed this near-divergence, I thought the curve might be smoother if I simply warmed up a little longer. With a longer warmup the training no longer diverges, but the final performance is worse.

The training loss has the same trend. Here are the plots:

[Plot: deit-base-train-loss — training loss over all epochs]

[Plot: deit-base-train-loss-zoom-in — the first 50 epochs]

Furthermore, for the DeiT-Small model I also tried 2k and 4k batch sizes. The default settings diverge there (which is reasonable, since the learning rate is scaled up), so I increased the warmup epochs to 15 and 30 respectively, and both runs reach ~80.0% accuracy.
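For reference, a minimal sketch (assuming timm 0.3.2, the version mentioned later in the thread) of a cosine schedule with configurable warmup, roughly what the DeiT recipe builds through timm; the epoch counts mirror the experiment above and the other values are illustrative:

```python
# Sketch: cosine decay with a linear warmup via timm's scheduler.
# Lengthening warmup_t from 5 to 15 (2k batch) or 30 (4k batch) is the
# change described above.
import torch
from timm.scheduler import CosineLRScheduler

model = torch.nn.Linear(10, 10)  # placeholder model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3)

scheduler = CosineLRScheduler(
    optimizer,
    t_initial=300,        # total number of training epochs
    lr_min=1e-5,          # floor of the cosine decay
    warmup_t=15,          # warmup epochs (the DeiT default is 5)
    warmup_lr_init=1e-6,  # learning rate at the start of warmup
)

for epoch in range(300):
    scheduler.step(epoch)  # timm schedulers step once per epoch
    # ... train one epoch ...
```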


chunfuchen commented on August 15, 2024

@TouvronHugo
May I ask another question: do you see the same behavior when using the proposed distillation approach?


cxxgtxy commented on August 15, 2024

> Great work, and thanks for sharing the code!
>
> I am trying to re-train the DeiT-Base model but encountered some issues. May I ask for your insights?
>
> I can reproduce the reported 81.8% with all default settings; however, the performance degrades a lot if I change two seemingly minor hyperparameters:
>
>   1. Change the batch size to 512 (the default is 1024); the learning rate is automatically scaled by your code.
>   2. Keep the batch size at 1024 but increase the warmup epochs to 10 (the default is 5).
>
> Here is the test accuracy over epochs:
>
> The orange line is the default setting. (81.8%)
> The blue line is batch size 512. (78.8%)
> The green line is 10 warmup epochs. (79.2%)
>
> [Plot: deit-base — testing accuracy curve]
>
> [Plot: zoom-in — the first 50 epochs]
>
> For the default setting, the model seems about to diverge around the 6th epoch, but it recovers and eventually achieves a good result (81.8%).
> However, with the smaller batch size or the 5 additional warmup epochs, the performance degrades by ~3%.
>
> I wonder whether you observe the same trend, and whether you have any insight into why these two small changes have such a large effect.
>
> My env: pytorch 1.7, timm 0.3.2, torchvision 0.8
>
> Thanks.

I met with the same problem.


TouvronHugo commented on August 15, 2024

Hi @chunfuchen,
Good question.
Distillation stabilizes the training, so I think it is probably less sensitive.
Best,
Hugo
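For readers following along, a minimal sketch of the hard-label distillation objective described in the DeiT paper, where the class-token head learns from the true labels and the distillation-token head from the teacher's hard predictions, with equal weights. The function and tensor names are illustrative, not from the repo:

```python
# Sketch of DeiT-style hard distillation:
# 0.5 * CE(class head, labels) + 0.5 * CE(distillation head, teacher argmax).
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits: torch.Tensor,
                           dist_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           targets: torch.Tensor) -> torch.Tensor:
    """cls_logits / dist_logits come from the student's two heads."""
    teacher_labels = teacher_logits.argmax(dim=1)  # hard teacher decision
    return (0.5 * F.cross_entropy(cls_logits, targets)
            + 0.5 * F.cross_entropy(dist_logits, teacher_labels))
```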
