
Comments (12) on microsoft/mup issue #40: Questions on learning schedule and binary classification

edwardjhu commented on August 24, 2024

Hi!

  1. This sounds like a precision issue or a bug in your code. You shouldn't have to adjust the warmup ratio if you are training for the same number of steps. Are the batches ordered in the same way in these runs?

  2. It doesn't matter since you can still model an arbitrary probability. Zero init can be helpful since it gets rid of the effect of the initial GP.

Hope this helps!
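To make the zero-init in point 2 concrete, here is a minimal sketch (an illustration only: the sizes are made up, and it assumes the `MuReadout` output layer from this package is used as the binary-classification head):

```python
import torch.nn as nn
from mup import MuReadout  # muP-aware output layer from this package

# Hypothetical binary-classification head: one logit fed to BCEWithLogitsLoss.
d_model = 256
head = MuReadout(d_model, 1)

# "Zero init" of the readout: the model's initial output is exactly 0
# (probability 0.5 after the sigmoid), which removes the effect of the
# random initial function ("the initial GP") on early training.
nn.init.zeros_(head.weight)
if head.bias is not None:
    nn.init.zeros_(head.bias)

# As usual for muP, set_base_shapes(...) still has to be called on the full
# model before training; if I recall the API correctly, MuReadout also has a
# readout_zero_init option that zero-initializes the weight for you.
```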


FlamingHorizon commented on August 24, 2024

Thanks for your reply!

Yes, the batches are ordered in the same way. The models are trained in full float32. A "silly" question: did you mean that if a model converges at a small width, it is guaranteed to converge at a larger width if mu-transferred correctly?


thegregyang commented on August 24, 2024

Yes, the larger model should never do worse than the small model in muP, especially if the entire training procedure (number of steps, etc.) is fixed.

FlamingHorizon commented on August 24, 2024

I see, thanks. Below are some results I got yesterday for training loss @ 10k steps, tuning lr + output_mult + initializer_range. It seems that "the wider, the better" holds as long as the model does not diverge.

I'll double-check my code, but please also let me know if anything looks weird in this table!

[image: table of training loss @ 10k steps across widths and lr / output_mult / initializer_range settings]
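As an aside on the double check: one handy diagnostic for a muP setup like this is the coordinate check that ships with this package. Below is a rough, self-contained sketch; a toy MLP stands in for the actual GPT-2, and the `get_coord_data` / `plot_coord_data` usage follows the mup README, so the exact signatures may differ from the installed version.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from mup import MuReadout, set_base_shapes
from mup.coord_check import get_coord_data, plot_coord_data

def make_model(width):
    # Toy stand-in for the real network: the hidden width is the "infinite" dim.
    return nn.Sequential(
        nn.Linear(32, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        MuReadout(width, 10),
    )

def lazy_model(width):
    # set_base_shapes(model, base, delta=...) returns the model with muP
    # shape information attached; base and delta only need to differ in width.
    return lambda: set_base_shapes(make_model(width), make_model(8),
                                   delta=make_model(16))

models = {w: lazy_model(w) for w in (64, 128, 256, 512)}

# A few batches of random data are enough for the check.
data = TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,)))
loader = DataLoader(data, batch_size=32)

# Train each width for a few steps and record per-layer activation sizes.
df = get_coord_data(models, loader)

# In a correct muP setup the curves stay roughly flat as width grows;
# coordinates that blow up or vanish with width point to a bug.
plot_coord_data(df, save_to='coord_check.png')
```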


thegregyang commented on August 24, 2024


FlamingHorizon commented on August 24, 2024

No, the model was trained in float32.


thegregyang commented on August 24, 2024


FlamingHorizon commented on August 24, 2024

Yes, it was exactly a GPT-2 with large depth, and batch_size = 24 (too small?)


thegregyang commented on August 24, 2024

@TevenLeScao has been working on GPT-2 with Megatron-DeepSpeed as well. @TevenLeScao, does anything here ring any bells as to what the issue could be?


FlamingHorizon commented on August 24, 2024

@thegregyang I found the problem: the default lr_scheduler configuration in my codebase broke the re-grouping of the optimizer's parameter groups.
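For anyone hitting the same thing: as I understand it, MuAdam/MuSGD split the parameters into several groups whose learning rates are scaled differently with width, so an LR schedule has to be applied as a per-group multiplier rather than by overwriting every group with one global value. A hedged sketch of the safe pattern (toy model and made-up warmup numbers; assumes MuAdam from this package and PyTorch's LambdaLR):

```python
import torch.nn as nn
from torch.optim.lr_scheduler import LambdaLR
from mup import MuAdam, set_base_shapes

# Tiny stand-in model; set_base_shapes must run before MuAdam so that the
# optimizer can read the width information off the parameters.
model = set_base_shapes(nn.Linear(256, 10),
                        nn.Linear(8, 10), delta=nn.Linear(16, 10))
opt = MuAdam(model.parameters(), lr=3e-4)  # creates width-scaled param groups

warmup_steps, total_steps = 1000, 10000

def schedule(step):
    # Linear warmup, then linear decay, expressed as a multiplier so every
    # param group keeps its own muP-scaled base learning rate.
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

sched = LambdaLR(opt, lr_lambda=schedule)

# The pattern that breaks muP's regrouping is overwriting the groups, e.g.
#   for g in opt.param_groups: g['lr'] = base_lr * schedule(step)
# which assigns one lr to every group and erases the per-group scaling.
```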

A further question: if I intend to load an existing checkpoint before starting, does it make sense to re-normalize the parameters of matrix-like tensors to be zero-mean? Did you run experiments that deal with such situations? (I saw in your paper that non-Gaussian initialization also slows down the process.)
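Purely to illustrate the operation being asked about (not a recommendation, see the reply below), the re-normalization would be something like:

```python
import torch

@torch.no_grad()
def recenter_matrix_params(model: torch.nn.Module) -> None:
    # Subtract each tensor's mean so matrix-like (ndim >= 2) parameters in
    # the loaded checkpoint become zero-mean; vector-like parameters
    # (biases, layer norms) are left untouched.
    for p in model.parameters():
        if p.dim() >= 2:
            p.sub_(p.mean())
```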


thegregyang commented on August 24, 2024

OK, I'm glad you solved your issue. We didn't do any experiments modifying pretrained networks. In general, I don't expect our understanding of training from scratch to carry over unmodified to training from a checkpoint (one trained without muP). So, in summary, I unfortunately have no advice for your situation.

FlamingHorizon commented on August 24, 2024

I see. Thank you!

