Comments (7)

williamFalcon avatar williamFalcon commented on May 6, 2024 1

@AS-researcher6 merged to master. Should be correct now, please verify.

from lightning.

AS-researcher6 avatar AS-researcher6 commented on May 6, 2024 1

All good. Thanks for the fixes!
Working on a SLURM cluster myself so may submit pull requests for expanded Trainer functionality and bring up more things in the future.

williamFalcon avatar williamFalcon commented on May 6, 2024

@AS-researcher6 good find, I see the bug. The division is currently done after the step is applied, but it should be done on line 926.

Current order:

  1. clip
  2. step
  3. zero_grad
  4. divide accumulated loss by nb accumulated batches

Correct order:

  1. divide accumulated loss by nb accumulated batches
  2. clip
  3. step
  4. zero_grad

Submitting the change.
Mind sanity checking PR #88?

AS-researcher6 avatar AS-researcher6 commented on May 6, 2024

Gladly! Not sure if you're done fixing things but I don't see the commit where the loss is divided by self.accumulate_grad_batches before the loss.backward(). Otherwise the original version where self.batch_loss_value += loss.item() is good as long as the self.batch_loss_value is not averaged after.
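
To illustrate the point about not averaging afterwards, here is a minimal plain-Python sketch (no PyTorch). The function `logged_loss` and the sample loss values are made up for illustration and are not from the Lightning code. It assumes each micro-batch loss has already been divided by the number of accumulated batches before being added to the running value.

```python
def logged_loss(micro_batch_losses, n_acc, average_again):
    """Accumulate already-scaled losses, optionally dividing a second time."""
    running = 0.0
    for loss in micro_batch_losses:
        # Stand-in for: loss = loss / accumulate_grad_batches, then += loss.item()
        running += loss / n_acc
    if average_again:
        # The extra averaging step that should NOT happen in this setup.
        running /= n_acc
    return running


losses = [2.0, 4.0, 6.0, 8.0]  # hypothetical per-micro-batch losses
n_acc = len(losses)

correct = logged_loss(losses, n_acc, average_again=False)
double_averaged = logged_loss(losses, n_acc, average_again=True)

print(correct)          # mean of the four losses
print(double_averaged)  # too small by a factor of n_acc
```

With the loss scaled up front, the running sum is already the mean, so a second division under-reports it by a factor of N.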

williamFalcon avatar williamFalcon commented on May 6, 2024

look at the PR #88
line 928

AS-researcher6 avatar AS-researcher6 commented on May 6, 2024

Sorry, I don't think it's quite right yet. Accounting for multiple batches needs to happen in the actual loss itself before backpropagation. Otherwise, it's like N accumulated additions of loss.backward() rather than N accumulated additions of (1/N * loss).backward().
After line 898 and before loss.backward() in the if statement on line 903, there needs to be a line that's loss = loss / self.accumulate_grad_batches. Line 922 should still be self.batch_loss_value += loss.item(). Line 928 can be removed entirely, since the averaging has already been accounted for.
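
A minimal sketch of why the scaling has to happen before backpropagation, written in plain Python with a hand-computed gradient rather than PyTorch. The toy model (prediction = w * x with mean squared error), the helper names, and the data are all assumptions for illustration. Adding up gradients across micro-batches stands in for repeated loss.backward() calls, and the micro-batches are assumed to be of equal size.

```python
def batch_grad(w, batch):
    """d/dw of mean((w * x - y)^2) over one micro-batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulate_grads(w, micro_batches, scale):
    """Sum gradients over micro-batches, as repeated backward() calls would.

    `scale` multiplies each micro-batch loss (and therefore its gradient)
    before it is accumulated.
    """
    total = 0.0
    for batch in micro_batches:
        total += scale * batch_grad(w, batch)
    return total


w = 0.5
micro_batches = [
    [(1.0, 2.0), (2.0, 3.0)],
    [(3.0, 5.0), (4.0, 9.0)],
]
n_acc = len(micro_batches)

# Reference: gradient of the mean loss over one large batch.
full_batch = [pair for batch in micro_batches for pair in batch]
reference_grad = batch_grad(w, full_batch)

scaled_grad = accumulate_grads(w, micro_batches, 1.0 / n_acc)  # (1/N * loss).backward()
unscaled_grad = accumulate_grads(w, micro_batches, 1.0)        # loss.backward()

print(reference_grad, scaled_grad, unscaled_grad)
```

Under these assumptions the scaled accumulation matches the large-batch gradient, while the unscaled one is N times larger. Dividing a logged loss value after the optimizer step cannot undo that.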

williamFalcon avatar williamFalcon commented on May 6, 2024

@AS-researcher6 I see. Updated. I was following the wrong approach before.

How about now?
