About resume checkpoint about latte HOT 2 OPEN

kaiw7 commented on September 1, 2024
About resume checkpoint

from latte.

Comments (2)

maxin-cn commented on September 1, 2024

Hi, I noticed that there may be some issues when resuming from the latest checkpoint. In the training script, only the 'EMA' checkpoint is saved each time; the 'model' checkpoint is not. If the running job is interrupted, the 'EMA' checkpoint from the most recent training steps is loaded to initialize model.state_dict. I am not sure this is correct, because generally the 'model' checkpoint should be loaded into model.state_dict. In addition, when the script is re-run to resume training, the 'EMA' is again initialized from a random model state_dict. Should the 'EMA' instead be initialized from the 'EMA' checkpoint of the most recent training steps?

So, I would like to know whether there are differences between a full training run (without interruption) and a run resumed after an interruption. Many thanks.

Hi, the EMA is not initialized randomly; it is synchronized with the parameters of the model. Please see here.
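For readers without the linked code at hand, the synchronization can be pictured with a small framework-free sketch. Everything here is an assumption for illustration: `update_ema`, the plain-dict "parameters", and the decay values are hypothetical stand-ins, not the actual training script. The idea is that an EMA update with a decay of 0 copies the current model parameters into the EMA, so the EMA starts from whatever the model holds at that moment.

```python
def update_ema(ema_params, model_params, decay=0.9999):
    """Blend the model parameters into the EMA copy, in place."""
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value


# Toy stand-ins for model.state_dict() and the EMA copy (hypothetical values).
model_params = {"w": 0.5, "b": -1.0}
ema_params = {"w": 0.0, "b": 0.0}

# With decay=0 the EMA becomes an exact copy of the current model parameters.
update_ema(ema_params, model_params, decay=0.0)
```

Under this reading, if the model was loaded from a checkpoint before the synchronization runs, the EMA would start from those loaded values rather than from random ones.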


kaiw7 commented on September 1, 2024


Sorry, I mean that the EMA does not load the saved EMA checkpoint.

resume_from_checkpoint is still a TODO and is not complete yet, so check it carefully if you want to use it.
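As a rough way to check the question raised in this thread, here is a minimal toy sketch that compares an uninterrupted run with two resume strategies. It is not the repository's code: the scalar "model", the SGD-with-momentum step, the in-memory checkpoint, and all function names (`train_step`, `resume_full`, `resume_ema_only`) are hypothetical. `resume_ema_only` mimics the behaviour described above, where only the EMA weights are available, they are loaded into the model, and the EMA is then re-synchronized from the model.

```python
import copy


def new_state():
    """Fresh training state: model, EMA copy, optimizer state, step counter."""
    model = {"w": 1.0}
    return {"model": model, "ema": dict(model), "opt": {"velocity": 0.0}, "step": 0}


def train_step(state, grad, lr=0.1, momentum=0.9, decay=0.99):
    """One toy SGD-with-momentum step followed by an EMA update."""
    state["opt"]["velocity"] = momentum * state["opt"]["velocity"] + grad
    state["model"]["w"] -= lr * state["opt"]["velocity"]
    state["ema"]["w"] = decay * state["ema"]["w"] + (1.0 - decay) * state["model"]["w"]
    state["step"] += 1


def resume_full(checkpoint):
    """Restore model, EMA, optimizer state, and step counter."""
    return copy.deepcopy(checkpoint)


def resume_ema_only(checkpoint):
    """Load EMA weights into the model, re-sync the EMA, reset the optimizer."""
    model = dict(checkpoint["ema"])
    return {"model": model, "ema": dict(model), "opt": {"velocity": 0.0},
            "step": checkpoint["step"]}


grads = [0.5, -0.2, 0.3, 0.1, -0.4, 0.2]  # fixed toy gradients

# Reference: train without interruption.
ref = new_state()
for g in grads:
    train_step(ref, g)

# Interrupted run: checkpoint after three steps.
run = new_state()
for g in grads[:3]:
    train_step(run, g)
checkpoint = copy.deepcopy(run)

# Resume with the full state, then with EMA weights only.
full = resume_full(checkpoint)
for g in grads[full["step"]:]:
    train_step(full, g)

partial = resume_ema_only(checkpoint)
for g in grads[partial["step"]:]:
    train_step(partial, g)
```

In this toy setting the full restore reproduces the uninterrupted run, while the EMA-only resume ends at different weights because the model weights and optimizer state were lost. In a PyTorch script, the analogous step would presumably be to store `model.state_dict()`, the EMA `state_dict()`, the optimizer `state_dict()`, and the step count together with `torch.save`, and to restore each of them on resume.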

