Comments (2)
Hi Huaijun,
Thanks for the question. You are right that the best finetuning hyperparameters (HPs) usually differ from the ones used for pretraining, because the datasets and batch sizes differ. How best to transfer hyperparameters during finetuning is still ongoing work, largely because regularization matters so much in that regime. You might be able to transfer finetuning HPs by using two pretrained mup models of different sizes, but in our experience this doesn't work as well as it does for pretraining, for the reason mentioned above.
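As a rough illustration of the two-model idea above, here is a minimal sketch in plain Python. It is not code from the mup library. The `finetune_and_eval` callable, the model names, and the candidate learning rates are all hypothetical placeholders: the callable stands in for whatever routine finetunes a given pretrained checkpoint at a given learning rate and returns a validation loss. The sketch sweeps the learning rate on the smaller model and reuses the winner on the larger one.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical signature: (model_name, learning_rate) -> validation loss.
FinetuneFn = Callable[[str, float], float]


def pick_best_lr(
    finetune_and_eval: FinetuneFn,
    model_name: str,
    candidate_lrs: Sequence[float],
) -> float:
    """Finetune once per candidate learning rate; return the one with the lowest loss."""
    losses: Dict[float, float] = {
        lr: finetune_and_eval(model_name, lr) for lr in candidate_lrs
    }
    return min(losses, key=losses.get)


def transfer_finetuning_lr(
    finetune_and_eval: FinetuneFn,
    small_model: str,
    large_model: str,
    candidate_lrs: Sequence[float],
) -> Tuple[float, float]:
    """Sweep on the small model, then run the large model once at the chosen rate.

    Returns (chosen learning rate, validation loss of the large model).
    """
    best_lr = pick_best_lr(finetune_and_eval, small_model, candidate_lrs)
    large_loss = finetune_and_eval(large_model, best_lr)
    return best_lr, large_loss
```

Given the caveat above that transfer is less reliable for finetuning than for pretraining, it is probably worth checking one or two neighbouring learning rates on the larger model as well, rather than trusting the transferred value outright.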
Hope this helps!
from mup.
Many thanks for your explanation!