Comments (6)
Valuable question. I can support your points with similar but slightly different results I got for the DeiT-small model:
- 80.0% with batch size 1024 (128 x 8; the default is 256 x 4), no fluctuations.
- 79.0% with batch size 2048 (256 x 8). Accuracy dropped around the 5th epoch and recovered later, but it did not reach a better result in the end.
Hi @chunfuchen and @changlin31,
Thanks for your questions and observations,
In my experiments I observed that transformers are more sensitive than convnets to variations of the hyperparameters, but in my case small variations of the parameters did not lead to very large performance discrepancies.
For the Base model I did not observe any divergence of the validation accuracy in the first epochs, but I do not validate every epoch, so I probably just didn't see it.
I'm not sure validation accuracy is a good metric for measuring the divergence. Do you see the same behavior with the training loss?
Concerning your tests with different batch sizes:
Maybe our scaling strategy (similar to what is done with convnets) is not optimal; you may need to try another learning rate.
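For reference, as far as I can tell the rule applied in DeiT's main.py is a simple linear one (lr scaled by total batch size / 512); a minimal sketch, with variable names that are mine rather than the repo's:

```python
# Minimal sketch of the linear lr-scaling rule (as I understand DeiT's main.py;
# variable names are illustrative, not copied from the repo).
def scale_lr(base_lr: float, batch_size_per_gpu: int, num_gpus: int) -> float:
    total_batch_size = batch_size_per_gpu * num_gpus
    return base_lr * total_batch_size / 512.0

# Assuming the default base lr of 5e-4, the peak lr becomes:
#   total batch  512 -> 5.0e-4
#   total batch 1024 -> 1.0e-3
#   total batch 2048 -> 2.0e-3
for bs in (512, 1024, 2048):
    print(bs, scale_lr(5e-4, bs, 1))
```

Under this rule the only thing that changes with the batch size is the peak lr, so a square-root scaling, or simply a lower peak lr, would be one way to try another learning rate.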
Concerning the number of warmup epochs, I think that going from 5 to 10 is not a small variation. In that case it is probably necessary to adapt the lr and the total number of epochs as well.
I hope this answers your questions; don't hesitate to ask if you have any others.
Best,
Hugo
@TouvronHugo Thanks for your reply.
Regarding the batch size, I will test different learning-rate scalings.
Regarding the warmup epochs, since I saw this (near) divergence, I thought the curve might be smoother if I simply warmed up a little longer to avoid this behavior. With the longer warmup it no longer diverges, but the final performance is not as good.
The training loss has the same trend. Here are the plots:
Furthermore, for the DeiT-small models, I also tried 2k and 4k batch sizes. The default setting diverges (which is reasonable, since the learning rate is increased), so I changed the warmup epochs to 15 and 30 respectively, and both runs reach ~80.0% accuracy.
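For completeness, DeiT builds its scheduler through timm's create_scheduler; the warmup change above maps roughly onto the underlying cosine scheduler like this (a simplified sketch with parameter names as I recall them from timm 0.3.x, not the exact training code):

```python
import torch
from timm.scheduler import CosineLRScheduler  # as exported by timm 0.3.x

# Toy model/optimizer just to make the sketch self-contained.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3)  # e.g. lr already scaled for a 2k batch

epochs = 300
scheduler = CosineLRScheduler(
    optimizer,
    t_initial=epochs,
    lr_min=1e-5,          # DeiT's default min lr, if I remember correctly
    warmup_t=15,          # warmup epochs, raised from the default 5 for the 2k-batch run
    warmup_lr_init=1e-6,  # DeiT's default warmup lr, if I remember correctly
)

for epoch in range(epochs):
    # ... one epoch of training ...
    scheduler.step(epoch + 1)  # timm schedulers are stepped with the epoch index
```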
@TouvronHugo
May I ask another question? Do you see the same thing happen when using the proposed distillation approach?
Great work, and thanks for sharing the code!
I am trying to re-train the DeiT-base model but I encountered some issues. May I ask for your insights?
I can reproduce the reported 81.8% with all default settings; however, the performance degrades a lot if I change two very minor hyperparameters:
- Change the batch size to 512 (the default is 1024); the learning rate is automatically scaled by your code.
- Keep the batch size at 1024 but increase the warmup epochs to 10 (the default is 5).
Here is the test accuracy over epochs:
The orange line is the default setting. (81.8%)
The blue line is batch size 512. (78.8%)
The green line uses 10 warmup epochs. (79.2%)
Zoomed in on the first 50 epochs:
For the default setting, it seems that the model is about to diverge around the 6th epoch, but it recovers later and eventually achieves a pretty good result (81.8%).
However, when using the smaller batch size or warming up for 5 additional epochs, the performance degrades by ~3%. I wonder whether you observe the same trend, and do you have any insights into why the two small changes I made affect the result so much?
My env:
pytorch 1.7, timm 0.3.2, torchvision 0.8
Thanks.
I met with the same problem.
Hi @chunfuchen,
Good question,
Distillation stabilizes the training; I think it is probably less sensitive to these changes.
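For context, the hard-label distillation from the DeiT paper trains the distillation token against the teacher's argmax prediction while the class token is trained against the true label; a minimal sketch (tensor names are hypothetical, not taken from the repo):

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits: torch.Tensor,
                           dist_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           targets: torch.Tensor) -> torch.Tensor:
    """Hard distillation as described in the DeiT paper: half cross-entropy
    against the true labels (class token), half against the teacher's hard
    predictions (distillation token)."""
    teacher_labels = teacher_logits.argmax(dim=-1)            # teacher's hard decision
    loss_cls = F.cross_entropy(cls_logits, targets)           # class token vs. ground truth
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)  # distillation token vs. teacher
    return 0.5 * loss_cls + 0.5 * loss_dist
```

The teacher's predictions give the model a second target to fit, which is consistent with the observation above that distillation stabilizes training.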
Best,
Hugo
Related Issues (20)
- What are the hyperparameters for DeiT-III (epoch 400 or 600)?
- The ablation experiment of DeiT
- how to implement cosub training use deit-III
- how to implement cosub training use deit-III
- DeiT depth 24 (CaiT - TABLE 1)
- ImageNet21K data preparation for pre-training
- batch_size flag
- Code for cosub
- How to launch a training of CAIT models ?
- TracerWarning
- Hi,Why can't I find deit_tiny_distilled_patch16_224 in hubconf
- Checkpoints of IN21K pretrained deit III
- ViT-B Training for DeiT
- Slow Training
- random.seed(seed) in line 205 is commented
- Inclusion of Transformers Need Registers
- Training
- Question about different seeds per gpu with DDP
- Gradient accumulation code
- Will you be releasing the accuracy of the official deit III framework trained tiny version on IN1k?