
Comments (12) on microsoft/mup issue #40: Questions on learning schedule and binary classification

edwardjhu commented on August 24, 2024

Hi!

  1. This sounds like a precision issue or a bug in your code. You shouldn't have to adjust the warmup ratio if you are training for the same number of steps. Are the batches ordered in the same way in these runs?

  2. It doesn't matter since you can still model an arbitrary probability. Zero init can be helpful since it gets rid of the effect of the initial GP.

Hope this helps!
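To make the zero-init in point 2 concrete, here is a minimal sketch (an illustration only: the sizes are made up, and it assumes the `MuReadout` output layer from this package is used as the binary-classification head):

```python
import torch.nn as nn
from mup import MuReadout  # muP-aware output layer from this package

# Hypothetical binary-classification head: one logit fed to BCEWithLogitsLoss.
d_model = 256
head = MuReadout(d_model, 1)

# "Zero init" of the readout: the model's initial output is exactly 0
# (probability 0.5 after the sigmoid), which removes the effect of the
# random initial function ("the initial GP") on early training.
nn.init.zeros_(head.weight)
if head.bias is not None:
    nn.init.zeros_(head.bias)

# As usual for muP, set_base_shapes(...) still has to be called on the full
# model before training; if I recall the API correctly, MuReadout also has a
# readout_zero_init option that zero-initializes the weight for you.
```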


FlamingHorizon commented on August 24, 2024

Thanks for your reply!

Yes, the batches are ordered in the same way. The models are trained in full float32. A "silly" question: did you mean that if a model converges at a small width, it is guaranteed to converge at a larger width if mu-transferred correctly?


thegregyang commented on August 24, 2024

Yes, the larger model should never do worse than the small model in muP, especially if the entire training procedure (number of steps, etc.) is fixed.

FlamingHorizon commented on August 24, 2024

I see, thanks. Below are some results I got yesterday for training loss @ 10k steps, tuning lr + output_mult + initializer_range. It seems that "the wider, the better" holds as long as the model does not diverge.

I'll double-check my code, but please also let me know if anything looks weird in this table!

[image: table of training loss @ 10k steps across widths and lr / output_mult / initializer_range settings]
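As an aside on the double check: one handy diagnostic for a muP setup like this is the coordinate check that ships with this package. Below is a rough, self-contained sketch; a toy MLP stands in for the actual GPT-2, and the `get_coord_data` / `plot_coord_data` usage follows the mup README, so the exact signatures may differ from the installed version.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from mup import MuReadout, set_base_shapes
from mup.coord_check import get_coord_data, plot_coord_data

def make_model(width):
    # Toy stand-in for the real network: the hidden width is the "infinite" dim.
    return nn.Sequential(
        nn.Linear(32, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        MuReadout(width, 10),
    )

def lazy_model(width):
    # set_base_shapes(model, base, delta=...) returns the model with muP
    # shape information attached; base and delta only need to differ in width.
    return lambda: set_base_shapes(make_model(width), make_model(8),
                                   delta=make_model(16))

models = {w: lazy_model(w) for w in (64, 128, 256, 512)}

# A few batches of random data are enough for the check.
data = TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,)))
loader = DataLoader(data, batch_size=32)

# Train each width for a few steps and record per-layer activation sizes.
df = get_coord_data(models, loader)

# In a correct muP setup the curves stay roughly flat as width grows;
# coordinates that blow up or vanish with width point to a bug.
plot_coord_data(df, save_to='coord_check.png')
```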


thegregyang commented on August 24, 2024


FlamingHorizon commented on August 24, 2024

No, the model was trained in float32.


thegregyang commented on August 24, 2024


FlamingHorizon commented on August 24, 2024

Yes, it was exactly a GPT-2 with large depth, and batch_size = 24 (too small?)


thegregyang commented on August 24, 2024

@TevenLeScao has been working on GPT-2 with Megatron-DeepSpeed as well. @TevenLeScao, does anything here ring any bells as to what the issue could be?


FlamingHorizon commented on August 24, 2024

@thegregyang I found the problem: the default lr_scheduler configuration in my codebase broke the re-grouping of the optimizer's parameter groups.
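For anyone hitting the same thing: as I understand it, MuAdam/MuSGD split the parameters into several groups whose learning rates are scaled differently with width, so an LR schedule has to be applied as a per-group multiplier rather than by overwriting every group with one global value. A hedged sketch of the safe pattern (toy model and made-up warmup numbers; assumes MuAdam from this package and PyTorch's LambdaLR):

```python
import torch.nn as nn
from torch.optim.lr_scheduler import LambdaLR
from mup import MuAdam, set_base_shapes

# Tiny stand-in model; set_base_shapes must run before MuAdam so that the
# optimizer can read the width information off the parameters.
model = set_base_shapes(nn.Linear(256, 10),
                        nn.Linear(8, 10), delta=nn.Linear(16, 10))
opt = MuAdam(model.parameters(), lr=3e-4)  # creates width-scaled param groups

warmup_steps, total_steps = 1000, 10000

def schedule(step):
    # Linear warmup, then linear decay, expressed as a multiplier so every
    # param group keeps its own muP-scaled base learning rate.
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

sched = LambdaLR(opt, lr_lambda=schedule)

# The pattern that breaks muP's regrouping is overwriting the groups, e.g.
#   for g in opt.param_groups: g['lr'] = base_lr * schedule(step)
# which assigns one lr to every group and erases the per-group scaling.
```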

A further question: if I intend to load an existing checkpoint before starting, does it make sense to re-normalize the parameters of matrix-like tensors to be zero-mean? Did you run experiments that deal with such situations? (I saw in your paper that non-Gaussian initialization also slows down the process.)
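Purely to illustrate the operation being asked about (not a recommendation, see the reply below), the re-normalization would be something like:

```python
import torch

@torch.no_grad()
def recenter_matrix_params(model: torch.nn.Module) -> None:
    # Subtract each tensor's mean so matrix-like (ndim >= 2) parameters in
    # the loaded checkpoint become zero-mean; vector-like parameters
    # (biases, layer norms) are left untouched.
    for p in model.parameters():
        if p.dim() >= 2:
            p.sub_(p.mean())
```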


thegregyang commented on August 24, 2024

OK, I'm glad you solved your issue. We didn't do any experiments modifying pretrained networks. In general, I don't expect our understanding of training from scratch to carry over unmodified to training from a checkpoint (one trained without muP). So, in summary, I unfortunately have no advice for your situation.

FlamingHorizon commented on August 24, 2024

I see. Thank you!

