
Comments (17)

ericspod commented on August 20, 2024

@vfdev-5 Beyond that, I would say Engine in its current form is great; I can't think of anything to add for this issue of ours. Thanks for the help.


Nic-Ma commented on August 20, 2024

Hi @ericspod,

I have already updated the PR according to @vfdev-5's sample code, with this patch:
2ce4a21

import time  # needed by the callbacks below

def started_callback(self, engine):
    # record the overall training start time on the engine state
    engine.state.train_time = {'TRAIN_START': time.time()}

def completed_callback(self, engine):
    engine.state.train_time['TRAIN_END'] = time.time()

def epoch_started_callback(self, engine):
    engine.state.train_time['EPOCH_START'] = time.time()

......

def exception_raised_callback(self, engine, e):
    # make sure an end time is recorded even if training fails
    engine.state.train_time['TRAIN_END'] = time.time()
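
For context, a minimal sketch of how callbacks like these attach to an Ignite engine; the TimerHandler name and the trainer here are illustrative only, not the actual PR code:

from ignite.engine import Events

timer_handler = TimerHandler()  # hypothetical object exposing the callbacks above

trainer.add_event_handler(Events.STARTED, timer_handler.started_callback)
trainer.add_event_handler(Events.EPOCH_STARTED, timer_handler.epoch_started_callback)
trainer.add_event_handler(Events.COMPLETED, timer_handler.completed_callback)
trainer.add_event_handler(Events.EXCEPTION_RAISED, timer_handler.exception_raised_callback)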

Hi @vfdev-5,

Thanks very much for your quick response and detailed sample code!
I think we will only use official Ignite code (not including contrib) for the current stage, as it's stable.
I also plan to unify TensorBoard, ScreenPrinter, etc. into a stats_handler category later.

Could you guys please help review the PR again?
Thanks in advance.


Nic-Ma commented on August 20, 2024

Hi @wyli and @yanchengnv,

If we don't develop our own workflows and instead use Ignite directly, it already contains a simple implementation:
from ignite.handlers import Checkpoint, DiskSaver
You can check this example: https://github.com/pytorch/ignite/blob/master/examples/mnist/mnist_save_resume_engine.py
Does it satisfy our requirements?
Thanks.
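
For reference, a minimal self-contained sketch of that usage (the toy model and training step are just placeholders):

import torch
from ignite.engine import Engine, Events
from ignite.handlers import Checkpoint, DiskSaver

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def train_step(engine, batch):
    optimizer.zero_grad()
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()
    return loss.item()

trainer = Engine(train_step)

# keep only the two most recent checkpoints of the model and optimizer
to_save = {'model': model, 'optimizer': optimizer}
handler = Checkpoint(to_save, DiskSaver('./checkpoints', create_dir=True), n_saved=2)
trainer.add_event_handler(Events.EPOCH_COMPLETED, handler)

trainer.run([torch.randn(4, 10) for _ in range(8)], max_epochs=3)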


ericspod commented on August 20, 2024

I looked at that code and considered using it, but there are a few behavioural things I didn't like so much. It will save a certain number of checkpoints but then delete old ones as it goes, which wasn't ideal for my own use case. It also doesn't save the networks using TorchScript to produce portable objects that can be loaded independently of the code base. I wanted that behaviour to be sure I had reusable networks that were robust to code base changes, so I have my own implementation I can port over.
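
For illustration only (not my actual implementation), a rough sketch of the kind of TorchScript saving meant here, assuming a net and trainer already exist and the network is compatible with TorchScript:

import torch
from ignite.engine import Events

def save_scripted_net(engine):
    # produce a portable module that can be loaded without the original code base
    scripted = torch.jit.script(net)
    scripted.save(f"net_epoch_{engine.state.epoch}.pt")

trainer.add_event_handler(Events.EPOCH_COMPLETED, save_scripted_net)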


vfdev-5 commented on August 20, 2024

Sorry for jumping into this conversation.

It will save a certain number of checkpoints but then delete old ones as it goes, this wasn't ideal for my own use case

@ericspod you would like to save all checkpoints without removing any? Maybe I can provide such an option to Checkpoint.

It also doesn't save the networks using Torchscript to produce portable objects that can be loaded independently of the code base.

I see, this also makes sense. We previously had a save_as_state_dict argument which stored the object itself instead of its state_dict, but we changed the default behavior to storing the state_dict, as recommended here.
But I can think about how to put this back into Ignite, saving either the object or a scripted model...


ericspod commented on August 20, 2024

@vfdev-5 Saving a checkpoint at the end of every epoch and not throwing any out is one thing that would be nice. The TorchScript output was just something that suited my workflow, so it would be nice as an option too; I'm sure a lot of people's networks aren't compatible with TorchScript's foibles, so it should be optional. What I was also working on, and didn't get around to finishing, was a SessionSaver class that did these things but also saved out the engine's state so that it could be restored at a later time and restarted from where it left off. This would mean storing some of the state object, omitting large tensors but keeping everything else, and then having a restore method on the SessionSaver. This is maybe more involved than what you were planning to have in Ignite.
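
Roughly, the idea is something like this unfinished sketch (not the real class; the field list is just an example):

import torch

class SessionSaver:
    # state members small enough to persist; large tensors such as `output` are skipped
    STATE_FIELDS = ('epoch', 'iteration', 'max_epochs', 'epoch_length')

    def __init__(self, path):
        self.path = path

    def save(self, engine, net):
        session = {k: getattr(engine.state, k) for k in self.STATE_FIELDS}
        session['net'] = net.state_dict()
        torch.save(session, self.path)

    def restore(self, engine, net):
        session = torch.load(self.path)
        net.load_state_dict(session.pop('net'))
        for k, v in session.items():
            setattr(engine.state, k, v)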


vfdev-5 commented on August 20, 2024

@ericspod thanks for the details!

SessionSaver class that did these things but also saved out the engine's state so that it could be restored at a later time and restarted from where it left off

I recently finished a PR on a similar thing: saving and restoring training. Please see here and one of the examples. This option is not yet in a stable release, but it is in the nightly releases. Please tell me whether this could work for your project...


ericspod commented on August 20, 2024

@vfdev-5 That looks like it would be what I want; it works by assuming that the Engine objects are picklable. It could clash with things that aren't amenable to that, as noted about data loaders, or if large tensors get stored in State. The TorchScript storing can be done as an event instead. Thanks for pointing it out!


vfdev-5 commented on August 20, 2024

@ericspod just a small comment on "it works by assuming that the Engine objects are picklable": Engine now has state_dict and load_state_dict, and they perform serialization/deserialization as in PyTorch's nn.Module. The state_dict contains only the basic things needed to restore the Engine: "seed", "epoch_length", "max_epochs" and "iteration". And Checkpoint stores only objects that have a state_dict method.

IMO, Engine is not really responsible for the user's dataloader, so if the data provider is fully random it would probably be impossible to reproduce exactly the same dataflow several times.
When resuming from an epoch or iteration, Engine tries to a) skip indices for a torch DataLoader, or b) otherwise iterate over the samples until the necessary iteration is reached (modulo epoch length).
To work with data streams, I think it is up to the user to provide some random state "synchronization"... (e.g. epoch-wise)
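
Concretely, a small sketch (assuming a trainer, model and optimizer that were checkpointed together with Checkpoint, i.e. with 'trainer' included in to_save; the checkpoint filename is illustrative):

import torch
from ignite.handlers import Checkpoint

# after some training, the Engine serializes only the basics described above, e.g.:
# {'seed': 12, 'epoch_length': 8, 'max_epochs': 3, 'iteration': 24}
print(trainer.state_dict())

# because Engine has state_dict/load_state_dict, it can be restored later to resume training:
to_load = {'model': model, 'optimizer': optimizer, 'trainer': trainer}
checkpoint = torch.load('./checkpoints/checkpoint_24.pt')
Checkpoint.load_objects(to_load=to_load, checkpoint=checkpoint)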

All that being said, if you could provide an example of what could clash, maybe we can think about how to improve the library...


ericspod commented on August 20, 2024

@vfdev-5 OK, that definitely helps explain what's going on. I think you're right that Engine isn't responsible for the data pipeline; since it only chooses certain things to retain, it would skip members of State like output. I had been storing input and output tensors as members of State for each training iteration, which is perhaps feature abuse. If I were to subtype Engine, I suppose I would add the members of State I did want to save to _state_dict_all_req_keys.
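
As a rough sketch of that idea (relying on Engine's _state_dict_all_req_keys internals behaving as described above; train_info is just an illustrative extra field):

from ignite.engine import Engine, Events

class ExtendedEngine(Engine):
    # also include a custom State member when serializing the engine
    _state_dict_all_req_keys = Engine._state_dict_all_req_keys + ('train_info',)

trainer = ExtendedEngine(lambda engine, batch: None)

@trainer.on(Events.STARTED)
def init_extra_state(engine):
    # the extra member must exist on State before state_dict() is called
    engine.state.train_info = {}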


vfdev-5 commented on August 20, 2024

@ericspod thanks for such great feedback :)


Nic-Ma commented on August 20, 2024

Hi @ericspod and @vfdev-5,

Thanks for your detailed discussion.
I developed a PR based on Ignite checkpoint for this task: #32
Could you please help review it when you are available?
I used engine.state.metrics to store timer_handler data.
In order to store richer shareable training/validation data (especially for StatsLogger), I think we can discuss how to extend engine.state in later development.
Thanks.


ericspod commented on August 20, 2024

@Nic-Ma I would say that timer_handler and other data items should be members of engine.state; metrics should be exclusively for metric values, as Ignite uses it.


Nic-Ma commented on August 20, 2024

@ericspod I agree with you about engine.state.metrics.
That's why I said we need to discuss how to extend engine.state to store more shareable data.
I just put the timer data in metrics for this first discussion.
Do you have any suggestions? I want to add two fields: train_info and val_info.
Thanks.


vfdev-5 commented on August 20, 2024

@Nic-Ma there is almost nothing to do to add new fields to engine.state:

from ignite.engine import Events

# 1) initialize them at the beginning: `Events.STARTED`
@trainer.on(Events.STARTED)
def init_state_fields(engine):
    engine.state.train_info = {}
    engine.state.val_info = {}

# 2) Add values in your handlers

def somewhere_in_handler(engine):
    # `elapsed` stands for whatever value the handler has computed
    engine.state.train_info['time'] = elapsed

Just for your information, I'm also planning to add the following class to Ignite: pytorch/ignite#63 (comment) (associated issue: pytorch/ignite#589).
Let me know whether it could be useful for you. Thanks!


vfdev-5 commented on August 20, 2024

@Nic-Ma you are welcome!

Could you guys please help review the PR again?

I think I'll be of little help, as I would have done this a bit differently, without introducing a TrainHandler interface or coding CheckpointHandler as a callback interface. I would code it as a helper method that sets up Ignite's Checkpoint, something similar to here. But again, this is my point of view, which may not hold for some of your reasons...

PS: if you would like to perform a segmentation task, there is an example of training a network on Pascal VOC12 in a distributed + apex configuration, with logging to MLflow/Polyaxon + TensorBoard and a Python configuration system (which may look strange at the beginning but has a lot of flexibility); it may help you simplify building some parts of your toolkit.


Nic-Ma commented on August 20, 2024

Hi @ericspod and @wyli,

According to our latest MVP plan, I have committed another PR, #35, to add a simple checkpoint handler.
Could you please help review it again?
I feel it's the only remaining task for the MVP demo program.
Thanks.
