Comments (12)
Yeah, I think we should have a quick chat about this as a group next week?
from atomate2.
I feel a big part is how to formalize it such that multiple jobflow jobs can be regarded as one, and later builders do not need to worry about how to stitch them together.
In this case, this would be a continuation that could not use the CONTCAR, correct?
So in my experience calculations rarely time out without completing any ionic step. Defect calculations are really bad because they:
- Need HSE -> which sometimes requires many many electronic steps to get the first converged wavefunction
- Needs to take many ionic steps because the ionic relaxation tends to be pretty extreme
On most reasonable clusters you'll actually get quite a few ionic steps in before walltime.
I also posted this on the forum but I don't think that shows up here:
https://matsci.org/t/atomate2-restart-for-long-running-hse-jobs/42345
Ok, I wanted to recap two possible approaches. Caveat that some of this is a carry over from FireWorks discussions, so may not be appropriate for jobflow.
(Numbering the following to make it easier to reference in replies.)
1. Support for an individual job to enter/exit a checkpointed state.
2. Advantage: this perhaps more clearly communicates the state of the job, and avoids a situation in which a flow could become excessively long with multiple continuations.
3. Standardized interface for dynamic addition of continuation jobs.
4. Advantage: this is conceptually simpler, and perhaps fits better with the jobflow model.
For both approaches, there needs to be a standardized way to initiate a checkpoint (e.g., the approach we trialed previously was listening for a SIGUSR1 which would warn of an approaching walltime, since this is supported by several HPC systems), a way to then verify that the request to checkpoint has completed successfully, and to continue from the checkpoint.
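The SIGUSR1 approach mentioned above can be sketched in a few lines of Python. This is a minimal, stdlib-only illustration (the class and attribute names are hypothetical, not atomate2 API); note that a real handler should only do async-signal-safe work, such as setting a flag that the main loop polls:

```python
import os
import signal

class WalltimeGuard:
    """Listens for SIGUSR1 (which some schedulers send shortly before
    walltime) and records that a checkpoint was requested."""

    def __init__(self):
        self.checkpoint_requested = False
        # Install the handler for SIGUSR1; the previous handler is replaced.
        signal.signal(signal.SIGUSR1, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler trivial: just set a flag for the main loop.
        self.checkpoint_requested = True

guard = WalltimeGuard()
# Simulate the scheduler's warning signal by signalling ourselves:
os.kill(os.getpid(), signal.SIGUSR1)
print(guard.checkpoint_requested)  # True
```

The calculation loop (or a Custodian handler) would then check `guard.checkpoint_requested` between steps and initiate the checkpoint.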
Prior art:
- FireWorks checkpoint PR materialsproject/fireworks#423 (not merged, but functionally complete up until entering the checkpoint state)
- A Custodian approach which is code-specific, which allows a job to automatically continue if it detects the presence of checkpoint files in the launch directory, but was not integrated into FireWorks
Some subtleties to think about:
5. Do we need to distinguish between a checkpointed state where the continuation must happen on the same HPC system (e.g. due to large files that cannot be, or would be cumbersome to, include in the job store; this might favor approach 1) and, conversely, continuations (e.g. storing a CONTCAR) that could be portable between HPC systems?
6. Do we need to make any special considerations in the design for tools like MANA, which may be available in the future?
7. Do we need to make any special consideration for HPC systems which have flex queues and mechanisms to automatically re-submit a job if it does not complete?
Questions 6 + 7 I think are likely not relevant here, but I mention them for completeness.
My own view here is that standardizing on a pattern and documenting that is more important than the specific approach taken, and that it is very important we get this right. Workshop takeaways were varied, but essentially we're not the only people having this issue, and it's a priority.
@mkhorton so I think there are two different problems to solve:
- copying CONTCAR to POSCAR when restarting relaxation runs
- checkpointing of the actual VASP calculation
The majority of the compute time I have wasted over the last couple of years has been on long-running relaxation jobs, where restarting a failed relaxation always starts from the initial structure. So I think all relaxation jobs that did not finish can be considered to be in a "checkpointed" state, even though they don't engage with any formal checkpointing system other than the CONTCAR file.
Solving 1. would fix the problem with lengthy relaxations but would not really help MD runs (unless they can be stitched together?).
But that would basically solve all of my problems now.
I hear you, if we just want to concern ourselves with contcars/relaxations, it's a much easier problem to solve (questions of wasted compute due to badly-progressing optimizations aside), and dynamic addition of additional jobs seems the way to go. But I think the question remains of how to formalize this, what pattern to adopt, etc?
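To make the "dynamic addition of continuation jobs" option concrete, here is a stdlib-only toy of the pattern (all names hypothetical; in jobflow itself this would presumably be expressed by a job returning a response that appends a follow-up job): an unconverged relaxation hands its last geometry to a fresh job, analogous to seeding a restart from the CONTCAR.

```python
def relax(structure, budget):
    """Pretend relaxation: each run moves `structure` (a stand-in for a
    geometry) toward 0 by at most `budget` steps before "walltime"."""
    final = max(structure - budget, 0)
    return {"structure": final, "converged": final == 0}

def run_with_continuations(structure, budget, max_restarts=10):
    """Keep spawning continuation jobs, each seeded from the previous
    job's final structure, until convergence or the restart cap."""
    for attempt in range(max_restarts + 1):
        result = relax(structure, budget)
        if result["converged"]:
            return result, attempt
        structure = result["structure"]  # continue from the last geometry
    raise RuntimeError("exceeded restart budget without converging")

result, restarts = run_with_continuations(structure=25, budget=10)
print(result["converged"], restarts)  # True 2
```

The open design question in the thread is exactly where this loop lives: inside a single job (approach 1) or as dynamically appended jobs in the flow (approach 2).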
I feel a big part is how to formalize it such that multiple jobflow jobs can be regarded as one, and later builders do not need to worry about how to stitch them together.
This also gets complicated by the fact that (I assume) most "restarts" will be initialized from lpad rerun_wfs, so maybe we just need a restart flag like ISTART in VASP that dictates how the calculation should behave given the data available in the directory. This could be added to all the VASP makers.
For example:
- For relaxations, you would just copy CONTCAR over POSCAR if it is available (some additional consideration for what we call INCAR.orig might be needed ... but maybe not, since we have already put so much effort into making all the builders provenance-agnostic).
- For long-running MD simulations, you might want to store the previous history somewhere in the task and just start from the most recent structure.
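A minimal sketch of the relaxation case above, assuming a hypothetical helper that a maker's restart flag could call before launching VASP (the function name is illustrative, not existing atomate2 API):

```python
import shutil
from pathlib import Path

def prepare_restart(directory) -> bool:
    """If a non-empty CONTCAR exists in `directory`, copy it over POSCAR
    so the relaxation continues from the last completed ionic step.
    Returns True when a restart geometry was staged, False for a fresh
    start (VASP writes an empty CONTCAR early on, so size is checked)."""
    directory = Path(directory)
    contcar = directory / "CONTCAR"
    if contcar.is_file() and contcar.stat().st_size > 0:
        shutil.copyfile(contcar, directory / "POSCAR")
        return True
    return False
```

The same flag-driven dispatch could branch to MD-specific behavior (keep the history, seed from the last frame) instead.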
Just some suggestions; this is clearly a tough problem. But from experience, this is the thing that turns some defects people off of using atomate (since they are all running HSE supercell calculations), so I'm super interested in fixing this.
I feel a big part is how to formalize it such that multiple jobflow jobs can be regarded as one, and later builders do not need to worry about how to stitch them together.
I think this is an important point too -- e.g., if there's a relaxation that does not converge, does this still get entered into a database? There are both pros/cons to having unconverged (or even failed) entries in a database but it makes the scheme and builds more complicated. I know we have partial support for this already. The alternative with an explicit checkpointed state is that these jobs do not get parsed.
This also gets complicated by the fact that (I assume) most "restarts" will be initialized from lpad rerun_wfs so maybe we just need a restart flag like ISTART in VASP that dictates how the calculation should behave given different available data in the directory.
We're looking for a jobflow-based solution first and foremost, but for the discussion of how this integrates with FireWorks, I would suggest we would want a different command than just rerun_fws, e.g. we would want continuations to happen automatically without explicit user intervention.
this is the thing that turns some defects people off of using Atomate (since they are all running HSE supercell calculations)
In this case, this would be a continuation that could not use the CONTCAR, correct? E.g. it would specifically need to checkpoint via STOPCAR mid-electronic relaxation. Therefore we encounter the issues described in my point 5 above?
Yeah, I think we should have a quick chat about this as a group next week?
Hi @mkhorton @utf @jmmshn @mjwen, I'd be happy to participate in the discussion about this. If that's possible, do not hesitate to contact me by mail to set up a meeting.
David