Standardized approach for job checkpoint/restart/continuation about atomate2 HOT 12 OPEN

materialsproject commented on June 12, 2024

Standardized approach for job checkpoint/restart/continuation

from atomate2.

Comments (12)

jmmshn commented on June 12, 2024

Yeah, I think we should have a quick chat about this as a group next week?

mjwen commented on June 12, 2024

I feel a big part is how to formalize it such that multiple jobflow jobs can be regarded as one, and later builders do not need to worry about how to stitch them together.
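One way to picture "multiple jobs regarded as one" is a thin wrapper that concatenates the per-job histories, so a builder only ever sees a single logical calculation. This is a minimal sketch under stated assumptions: `Segment` and `LogicalCalculation` are hypothetical names for illustration, not jobflow or atomate2 classes, and the energies are made-up numbers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """One physical job (hypothetical record, not a jobflow/atomate2 class)."""

    job_id: str
    ionic_energies: List[float]  # energy after each ionic step of this job
    converged: bool


@dataclass
class LogicalCalculation:
    """Several segments presented to downstream code as one calculation."""

    segments: List[Segment] = field(default_factory=list)

    @property
    def ionic_energies(self) -> List[float]:
        # Concatenate the per-segment histories in submission order.
        return [e for seg in self.segments for e in seg.ionic_energies]

    @property
    def converged(self) -> bool:
        # Only the last segment decides whether the chain as a whole finished.
        return bool(self.segments) and self.segments[-1].converged

    @property
    def final_energy(self) -> Optional[float]:
        energies = self.ionic_energies
        return energies[-1] if energies else None


calc = LogicalCalculation(
    [
        Segment("job-1", [-10.0, -10.5], converged=False),  # hit walltime
        Segment("job-2", [-10.7, -10.8], converged=True),   # continuation
    ]
)
```

A builder written against `LogicalCalculation` would not need to know how many continuations were required.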

jmmshn commented on June 12, 2024

> In this case, this would be a continuation that could not use the CONTCAR, correct?

In my experience, calculations rarely time out before completing any ionic step. Defect calculations are particularly bad because they:

- Need HSE, which sometimes requires a very large number of electronic steps to get the first converged wavefunction
- Need many ionic steps, because the ionic relaxation tends to be pretty extreme

On most reasonable clusters you'll actually get quite a few ionic steps in before walltime.

jmmshn commented on June 12, 2024

I also posted this on the forum but I don't think that shows up here:
https://matsci.org/t/atomate2-restart-for-long-running-hse-jobs/42345

mkhorton commented on June 12, 2024

Ok, I wanted to recap two possible approaches. Caveat that some of this is a carry over from FireWorks discussions, so may not be appropriate for jobflow.

(Numbering the following to make it easier to reference in replies.)

1. Support for an individual job to enter/exit a checkpointed state
   - An advantage may be that this communicates the state of the job more clearly, and avoids a situation whereby a jobflow could become excessively long with multiple continuations.
2. Standardized interface for dynamic addition of continuation jobs
   - An advantage here is that it is conceptually simpler, and perhaps fits better with the jobflow model.

For both approaches, there needs to be a standardized way to (a) initiate a checkpoint, (b) verify that the request to checkpoint has completed successfully, and (c) continue from the checkpoint. For (a), the approach we trialed previously was listening for a SIGUSR1 that warns of an approaching walltime, since this is supported by several HPC systems.

Prior art:

3. FireWorks checkpoint PR materialsproject/fireworks#423 (not merged, but functionally complete up until entering the checkpoint state)
4. A Custodian approach which is code-specific, which allows a job to automatically continue if it detects the presence of checkpoint files in the launch directory, but was not integrated into FireWorks

Some subtleties to think about:

5. Do we need to distinguish between a checkpointed state where it is required that the continuation happens on the same HPC system, e.g. due to large files present that cannot or might be cumbersome to include in the jobstore? (This might favor approach 1.) And, conversely, those continuations (e.g. storing a CONTCAR) that could be portable between HPC systems?
6. Do we need to make any special considerations for tools like MANA in the design, which may be available in the future?
7. Do we need to make any special consideration for HPC systems which have flex queues, and have mechanisms to auto re-submit a job if it does not complete?

Questions 6 + 7 I think are likely not relevant here, but I mention for completeness.

My own view here is that standardizing on a pattern and documenting that is more important than the specific approach taken, and that it is very important we get this right. Workshop takeaways were varied, but essentially we're not the only people having this issue, and it's a priority.

jmmshn commented on June 12, 2024

@mkhorton so I think there are two different problems to solve:

1. Copying CONTCAR to POSCAR when restarting relaxation runs.
2. Checkpointing of the actual VASP calculation.

The majority of the compute time I have wasted over the last couple of years has been on long-running relaxation jobs, where restarting a failed relaxation always starts from the initial structure. So I think all relaxation jobs that did not finish can be considered to be in a "checkpointed" state, even though they don't engage with any formal checkpointing system other than the CONTCAR file.
Solving 1. would fix the problem with lengthy relaxations but would not really help MD runs (unless they can be stitched together?).
But that would basically solve all of my problems now.
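Problem 1 can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not atomate2 or Custodian code: `prepare_relaxation_restart` is a hypothetical helper, the `POSCAR.orig` backup name is an assumption, and an empty CONTCAR is treated as unusable.

```python
import shutil
from pathlib import Path


def prepare_relaxation_restart(run_dir):
    """Copy CONTCAR over POSCAR so a rerun continues from the last structure.

    Returns True if a restart structure was installed, False if the run
    should simply start again from the existing POSCAR.
    """
    run_dir = Path(run_dir)
    contcar = run_dir / "CONTCAR"
    poscar = run_dir / "POSCAR"

    # A missing or empty CONTCAR gives us nothing to continue from.
    if not contcar.is_file() or contcar.stat().st_size == 0:
        return False

    # Keep the very first input structure for provenance (assumed naming);
    # later restarts must not overwrite that backup.
    backup = run_dir / "POSCAR.orig"
    if poscar.is_file() and not backup.exists():
        shutil.copy2(poscar, backup)

    shutil.copy2(contcar, poscar)
    return True
```

Calling this before every (re)launch is idempotent in the sense that the original input structure is preserved once and the newest CONTCAR always wins.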

mkhorton commented on June 12, 2024

I hear you, if we just want to concern ourselves with contcars/relaxations, it's a much easier problem to solve (questions of wasted compute due to badly-progressing optimizations aside), and dynamic addition of additional jobs seems the way to go. But I think the question remains of how to formalize this, what pattern to adopt, etc?
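To make the "dynamic addition of jobs" pattern concrete without committing to any particular API, the control flow can be sketched in plain Python. Everything here is hypothetical: `run_with_continuations` stands in for whatever mechanism the workflow engine uses to append a follow-up job, and `fake_relax` is a toy segment. The cap on continuations addresses the concern about flows growing excessively long.

```python
def run_with_continuations(run_segment, state, max_continuations=5):
    """Run a segment, then keep adding continuations until finished or capped.

    run_segment(state) must return (new_state, finished).
    Returns (final_state, n_segments, finished).
    """
    n_segments = 0
    finished = False
    # One initial run plus at most max_continuations follow-up runs.
    while n_segments <= max_continuations:
        state, finished = run_segment(state)
        n_segments += 1
        if finished:
            break
    return state, n_segments, finished


def fake_relax(step):
    """Toy segment: advances 4 ionic steps per run, converges at step 10."""
    step = min(step + 4, 10)
    return step, step >= 10


final_state, n_segments, finished = run_with_continuations(fake_relax, 0)
```

In a real workflow each loop iteration would be a separate job, so the open question of how to present the chain as a single result still applies.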

jmmshn commented on June 12, 2024

> I feel a big part is how to formalize it such that multiple jobflow jobs can be regarded as one, and later builders do not need to worry about how to stitch them together.

This also gets complicated by the fact that (I assume) most "restarts" will be initiated from lpad rerun_fws, so maybe we just need a restart flag, like ISTART in VASP, that dictates how the calculation should behave given the different data available in the directory. This can be something added to all the VASP makers.

For example:

- For relaxation, you would just copy CONTCAR over to POSCAR if it is available (some additional consideration for what we call INCAR.orig might be needed, but maybe not, since we have already put so much effort into making all the builders provenance agnostic).
- For long-running MD simulations, you might want to store the previous history in the task somewhere and just start from the most recent structure.
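A restart flag of this kind could look roughly like the following. This is a sketch only: `RestartMode` and `starting_structure_file` are hypothetical names, not existing maker options, and the three modes are an assumed design rather than anything agreed in this thread.

```python
from enum import Enum


class RestartMode(Enum):
    """Hypothetical restart flag for a VASP maker."""

    NEVER = "never"                # always start from the original POSCAR
    IF_AVAILABLE = "if_available"  # continue from CONTCAR when one exists
    REQUIRE = "require"            # refuse to run without something to continue


def starting_structure_file(files_present, mode):
    """Decide which file in the run directory provides the starting structure."""
    has_contcar = "CONTCAR" in files_present
    if mode is RestartMode.NEVER:
        return "POSCAR"
    if has_contcar:
        return "CONTCAR"
    if mode is RestartMode.REQUIRE:
        raise FileNotFoundError("restart required but no CONTCAR found")
    # IF_AVAILABLE with nothing to continue from: fall back to a fresh start.
    return "POSCAR"
```

Keeping the decision separate from the file copying makes it straightforward to add further cases later, such as MD runs that continue from the most recent structure while keeping the earlier history.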

These are just suggestions, and this is clearly a tough problem. From experience, though, this is the thing that turns some defects people off using Atomate (since they are all running HSE supercell calculations), so I'm super interested in fixing this.

mkhorton commented on June 12, 2024

> I feel a big part is how to formalize it such that multiple jobflow jobs can be regarded as one, and later builders do not need to worry about how to stitch them together.

I think this is an important point too -- e.g., if there's a relaxation that does not converge, does this still get entered into a database? There are both pros/cons to having unconverged (or even failed) entries in a database but it makes the scheme and builds more complicated. I know we have partial support for this already. The alternative with an explicit checkpointed state is that these jobs do not get parsed.

mkhorton commented on June 12, 2024

> This also gets complicated by the fact that (I assume) most "restarts" will be initialized from lpad rerun_fws so maybe we just need a restart flag like ISTART in VASP that dictates how the calculation should behave given different available data in the directory.

We're looking for a jobflow-based solution first and foremost, but for the discussion of how this integrates with FireWorks, I would suggest we would want a different command than just rerun_fws, e.g. we would want continuations to happen automatically without explicit user intervention.

mkhorton commented on June 12, 2024

> this is the thing that turns some defects people off of using Atomate (since they are all running HSE supercell calculations)

In this case, this would be a continuation that could not use the CONTCAR, correct? E.g., it would specifically need to checkpoint via STOPCAR mid-electronic relaxation. Therefore we encounter the issues described in my point 5 above?
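For reference, VASP reads a STOPCAR file from the run directory: `LSTOP = .TRUE.` asks it to stop after the current ionic step, and `LABORT = .TRUE.` asks it to stop after the current electronic step. A minimal sketch of issuing that request follows; `request_vasp_stop` is a hypothetical helper, and whether the files left behind after a mid-electronic stop form a usable checkpoint is exactly the open question above.

```python
from pathlib import Path


def request_vasp_stop(run_dir, mid_electronic=False):
    """Write a STOPCAR asking a running VASP job to stop gracefully.

    mid_electronic=False -> LSTOP: stop after the current ionic step.
    mid_electronic=True  -> LABORT: stop after the current electronic step.
    Returns the path of the STOPCAR that was written.
    """
    tag = "LABORT" if mid_electronic else "LSTOP"
    stopcar = Path(run_dir) / "STOPCAR"
    stopcar.write_text(f"{tag} = .TRUE.\n")
    return stopcar
```

Such a helper could be called from a walltime-warning handler, with the choice of tag depending on whether any ionic step is expected to finish in the remaining time.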

davidwaroquiers commented on June 12, 2024

> Yeah, I think we should have a quick chat about this as a group next week?

Hi @mkhorton @utf @jmmshn @mjwen, I'd be happy to participate in the discussion about this. If possible, do not hesitate to contact me by email to set up a meeting.

David
