Comments (8)
As stated in the notebook:
We employ a fixed budget of 200 epochs and reduce the learning rate by a factor of 10 after 150 epochs.
That doesn't explain to me why ALL models make such incredibly huge improvements in a single epoch. To be honest, it just looks wrong to me.
Well, no offense, but I think this is basic knowledge in the field of machine learning, and we don't really need to have a discussion about it here.
You may refer to this video by Andrew Ng to gain quick insight into lr decay, or just search for "learning rate decay" on Google; there are already many great posts introducing this technique.
It is broadly used in many machine learning papers and projects nowadays.
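For concreteness, here is a minimal PyTorch sketch of this kind of step decay, matching the schedule quoted above; the model, initial lr, and momentum are placeholders, not the repo's exact settings:

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import MultiStepLR

model = nn.Linear(10, 10)  # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Fixed budget of 200 epochs; divide the lr by 10 after epoch 150.
scheduler = MultiStepLR(optimizer, milestones=[150], gamma=0.1)

for epoch in range(200):
    # ... one full pass over the training set goes here ...
    scheduler.step()  # lr drops from 0.1 to 0.01 at epoch 150
```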
@Luolc I am aware of learning rate decay. That is exactly why I find it extremely weird that all approaches show a huge improvement at exactly epoch 150.
Doesn't this indicate a bad initial learning rate (converging to a local optimum)?
Usually, when a huge improvement is suddenly made, it indicates that the optimization before it was perhaps useless, or don't you agree?
I just wanted to warn you that such a huge jump relatively late in the optimization seems very odd, and I was hoping there was an explanation for it other than a bad initial learning rate.
Thanks.
OK, I get what you mean.
Regarding the initial lr: for each optimizer, we conducted a grid search to find the best hyperparameters. For each independent setting, we tested 3-5 times. Indeed, hundreds of runs were done before we arrived at the final visualization. I am sure that we set the best lr we could find (at least the best in the grid). More details can be found in the experiment section of the paper.
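To illustrate the kind of search this describes, a minimal sketch; the grid values and the train_and_eval routine are made up for illustration, not the paper's exact ones:

```python
import itertools
import random

def train_and_eval(lr: float, seed: int) -> float:
    """Hypothetical stand-in for one full training run; returns test accuracy."""
    random.seed(seed)
    return random.uniform(0.90, 0.95)  # dummy value, for illustration only

lrs = [1e-1, 1e-2, 1e-3]  # illustrative lr grid
seeds = [0, 1, 2]         # 3 repeated runs per setting

results = {}
for lr, seed in itertools.product(lrs, seeds):
    results.setdefault(lr, []).append(train_and_eval(lr, seed))

# Keep the lr with the best mean accuracy across the repeats.
best_lr = max(results, key=lambda lr: sum(results[lr]) / len(results[lr]))
print("best lr:", best_lr)
```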
As mentioned in the demo, the training code is heavily based on this broadly used code base for testing deep CNNs on CIFAR-10. As our best result for SGD even achieves a higher number than the one reported in the original repo (~0.4%), I think the training was successful.
Usually, when a huge improvement is suddenly made, it indicates that the optimization before it was perhaps useless, or don't you agree?
I don't think it's appropriate to say whether it was useful or useless. If you refer to Figure 6(a) in the paper, the learning curves of SGD with other initial lrs are even much worse than what we see in the notebook. So could we say this is the least useless one we could find?
There might be a better decay strategy, such as making the decay happen earlier. I totally agree, but that is not what we are concerned with here. What we need is to guarantee that the same decay strategy is applied to all the optimizers to make the comparison fair, rather than to find the best decay strategy.
Finally, I don't think it is a huge jump or odd behavior. I've seen many similar figures in plenty of papers, for example in SWATS.
@kootenpv Perhaps you do not realize that the model parameters are updated many times within one epoch.
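(For example, with CIFAR-10's 50,000 training images and a typical batch size of 128, that is ceil(50000/128) = 391 parameter updates per epoch; the batch size here is an assumption, not necessarily the repo's setting.)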
@siaimes I understand that, obviously, but why would there be such a steep change at exactly the 150th epoch? It looks to me like something is just wrong (bad parameters before the 150th). @Luolc's explanation makes more sense: these settings simply turn out to be "less than optimal" for this particular dataset.
@kootenpv I've recently done some more toy experiments on CIFAR-10 and have gained deeper insight now.
FYI: we can apply the lr decay earlier, at ~epoch 75, and achieve similar results after ~epoch 100.
Decaying at epoch 150 is not the best setting in terms of time cost, but it does not affect the final results. Since the purpose of the paper is not to find SoTA results, this is fine as long as fairness among the different optimizers is preserved.
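In terms of the scheduler sketch above, that experiment only moves the milestone; for example (same placeholder model and initial lr as before):

```python
from torch import nn, optim
from torch.optim.lr_scheduler import MultiStepLR

def make_run(decay_epoch: int):
    """Placeholder model/optimizer pair with a single-milestone step decay."""
    model = nn.Linear(10, 10)                    # placeholder model
    opt = optim.SGD(model.parameters(), lr=0.1)  # placeholder initial lr
    return opt, MultiStepLR(opt, milestones=[decay_epoch], gamma=0.1)

baseline = make_run(decay_epoch=150)  # schedule used in the notebook
early = make_run(decay_epoch=75)      # earlier-decay variant described above
```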
P.S. I'd like to close this issue if there are no further doubts.