Comments (12)
Hi!
- This sounds like a precision issue or a bug in your code. You shouldn't have to adjust the warmup ratio if you are training for the same number of steps. Are the batches ordered in the same way in these runs?
- It doesn't matter, since you can still model an arbitrary probability. Zero init can be helpful, since it gets rid of the effect of the initial GP.
Hope this helps!
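The zero-init point above can be sketched numerically: with a zero-initialized readout, the initial logit is exactly 0, so the predicted probability is 0.5 for every input, whereas a randomly initialized readout produces an arbitrary input-dependent probability (the "initial GP"). This is a minimal stdlib-only illustration; the width and feature vector are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)
width = 256
# Hypothetical penultimate-layer activations for one input.
features = [rng.gauss(0, 1) for _ in range(width)]

# Standard init: the random readout behaves like a GP draw at init, so the
# initial probability is an arbitrary function of the input.
w_random = [rng.gauss(0, 1) / math.sqrt(width) for _ in range(width)]
p_random = sigmoid(sum(f * w for f, w in zip(features, w_random)))

# Zero init: the logit is exactly 0, so the initial probability is 0.5 for
# every input -- the effect of the initial GP is gone.
w_zero = [0.0] * width
p_zero = sigmoid(sum(f * w for f, w in zip(features, w_zero)))
print(p_zero)  # 0.5
```

In `mup` this would correspond to zeroing the `MuReadout` layer's weights after `set_base_shapes`, but the snippet above only demonstrates the probabilistic effect.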
Thanks for your reply!
Yes, the batches are ordered in the same way. The models are trained in full float32. A "silly" question: did you mean that if a model converges in small width, it is guaranteed to converge in larger width if mu-transferred correctly?
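One way to make "the batches are ordered in the same way" concrete is to seed the batch sampler so two runs of different widths see identical data order, which keeps their loss curves comparable step by step. This is a hypothetical stdlib sketch, not the actual training code from the thread.

```python
import random

def batch_order(seed, n_batches):
    # Hypothetical shuffled sampler: the same seed always yields the
    # same permutation of batch indices.
    order = list(range(n_batches))
    random.Random(seed).shuffle(order)
    return order

# Two runs with the same seed visit batches in the same order, so a
# width-comparison between them is apples-to-apples.
assert batch_order(42, 8) == batch_order(42, 8)
print(batch_order(42, 8))
```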
Yes, the larger model should never do worse than the small model in muP, especially if the entire training procedure (number of steps, etc.) is fixed.
I see, thanks. Here are some results I got yesterday for training loss @ 10k steps, tuning lr + output_mult + initializer_range. It seems that "the wider, the better" holds true as long as the model does not diverge.
I'll double-check my code, but please also let me know if anything looks weird in this table!
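For context, a sweep over those three hyperparameters at a small proxy width (the usual muTransfer workflow) amounts to evaluating a Cartesian product of candidate values. The grid values below are hypothetical placeholders, not the ones used in the thread.

```python
import itertools

# Hypothetical tuning grid mirroring the comment above:
# lr, output_mult, initializer_range, tuned at the proxy width.
grid = {
    "lr": [1e-3, 3e-3, 1e-2],
    "output_mult": [1.0, 4.0, 16.0],
    "initializer_range": [0.01, 0.02, 0.04],
}

def sweep(grid):
    # Yield one config dict per point in the Cartesian product.
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(sweep(grid))
print(len(configs))  # 27 configs to train at the proxy width
```

The best config from the proxy sweep would then be reused unchanged at the larger widths.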
No, the model was trained in float32.
Yes, it was exactly a GPT-2 with large depth, and batch_size = 24 (too small?).
@TevenLeScao has been working on GPT-2 with Megatron-deepspeed as well. @TevenLeScao does anything here ring any bells for you for what the issue could be?
@thegregyang I found the problem: the default lr_scheduler configuration of my codebase broke the re-grouping of the optimizer's parameter groups.
A further question: if I intend to load an existing checkpoint before starting, does it make sense to re-normalize the parameters of matrix-like tensors to be zero-mean? Did you do experiments dealing with such situations? (I saw in your paper that non-Gaussian initialization also slows down the process.)
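The scheduler failure mode described above is worth spelling out: muP assigns different learning rates to different parameter groups (e.g. matrix-like weights get a width-dependent lr), so a scheduler must scale each group's lr multiplicatively rather than overwrite all groups with a single value. A minimal sketch, with hypothetical group names and base lrs:

```python
# Hypothetical muP-style parameter groups: the matrix-like group's lr is
# scaled down relative to the vector-like group's.
groups = [
    {"name": "matrix_like", "lr": 1e-3 / 4},
    {"name": "vector_like", "lr": 1e-3},
]

def scale_schedule(groups, factor):
    # Safe: scales every group's lr by the same factor, preserving the
    # relative ratios that muP set up.
    for g in groups:
        g["lr"] *= factor

def overwrite_schedule(groups, lr):
    # Broken: collapses all groups to one lr, destroying the regrouping
    # (the failure mode reported in the comment above).
    for g in groups:
        g["lr"] = lr

scale_schedule(groups, 0.5)
print(groups[1]["lr"] / groups[0]["lr"])  # ratio preserved: 4.0
```

In PyTorch terms, a scheduler should mutate `optimizer.param_groups[i]["lr"]` per group, as the built-in `torch.optim.lr_scheduler` classes do; custom schedulers that rebuild the optimizer or broadcast a single lr will silently undo muP's per-group settings.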
OK, I'm glad you solved your issue. We didn't do any experiments modifying pretrained networks. In general, I don't expect our understanding of training from scratch to carry over unmodified to training from a checkpoint (one trained without muP). So, in summary, I unfortunately have no advice for your situation.
I see. Thank you!