
custodian's Issues

Better EDWAV error fix

The current fix for the EDWAV error is to change the smearing scheme to ISMEAR = 0, which is not a very reliable fix. The origin of this error appears to be a bug in VASP when it is compiled with the default -O2 level of optimization in some versions of the Intel compiler. Changing the optimization level to -O1 appears to reliably fix the issue.

I suggest adding a warning or an optional switch to an -O1-compiled version of VASP to fix the bug, instead of changing the smearing. For what it's worth, EDWAV errors only show up when running ALGO = A or ALGO = D, which can't be run with the common ISMEAR = -5 scheme anyway.

Small changes needed in converge_example.py

The key of "filename" in setting dict should change to "file", otherwise will raise ValueError("Unrecoginized format...").

if job_number < 2 and not converged:
                settings = [
                    {"dict": "INCAR",
                     "action": {"_set": {"ISTART": 1}}},
                    {"filename": "CONTCAR",
                     "action": {"_file_copy": {"dest": "POSCAR"}}}]

            #switch to RMM-DIIS once we are near the
            #local minimum (assumed after 2 runs of CG)
            else:
                settings = [
                    {"dict": "INCAR",
                     "action": {"_set": {"ISTART": 1, "IBRION": 1}}},
                    {"filename": "CONTCAR",
                     "action": {"_file_copy": {"dest": "POSCAR"}}}]

The above code should be changed to:

if job_number < 2 and not converged:
                settings = [
                    {"dict": "INCAR",
                     "action": {"_set": {"ISTART": 1}}},
                    {"file": "CONTCAR",
                     "action": {"_file_copy": {"dest": "POSCAR"}}}]

            #switch to RMM-DIIS once we are near the
            #local minimum (assumed after 2 runs of CG)
            else:
                settings = [
                    {"dict": "INCAR",
                     "action": {"_set": {"ISTART": 1, "IBRION": 1}}},
                    {"file": "CONTCAR",
                     "action": {"_file_copy": {"dest": "POSCAR"}}}]

[VASP] real_optlay assumes LREAL = Auto when this is not necessarily the case

Currently, the real_optlay INCAR swap implicitly assumes the user has set LREAL = Auto, but there is no guarantee this is the case. We wouldn't want LREAL = False to be switched by accident. Also, we probably shouldn't be switching to LREAL = True for large systems regardless -- that goes against the recommendations in the VASP manual.

Will be closed once #182 is merged.
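
A minimal sketch of the kind of guard this implies (illustration only, not the actual handler code):

from pymatgen.io.vasp.inputs import Incar

incar = Incar.from_file("INCAR")

# Only propose the real-space swap if the user actually asked for LREAL = Auto;
# an explicit LREAL = False (or True) should be left untouched.
actions = []
if str(incar.get("LREAL", "False")).lower() in ("auto", "a"):
    actions.append({"dict": "INCAR", "action": {"_set": {"LREAL": False}}})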

VaspErrorHandler not handling IBZKPT: tetrahedron method fails

For me, VaspErrorHandler doesn't trigger on "IBZKPT: tetrahedron method fails for NKPT<4. NKPT = 3", even though it's supposed to handle this:

error_msgs = {
    "tet": [
        "Tetrahedron method fails",
        "Fatal error detecting k-mesh",
        "Fatal error: unable to match k-point",
        "Routine TETIRR needs special values",
        "Tetrahedron method fails (number of k-points < 4)",
        "BZINTS",
    ],

with open(self.output_filename, "r") as f:
    for line in f:
        l = line.strip()
        for err, msgs in VaspErrorHandler.error_msgs.items():
            if err in self.errors_subset_to_catch:
                for msg in msgs:
                    if l.find(msg) != -1:
                        # this checks if we want to run a charged
                        # computation (e.g., defects) if yes we don't
                        # want to kill it because there is a change in
                        # e-density (brmix error)
                        if err == "brmix" and "NELECT" in incar:
                            continue
                        self.errors.add(err)
                        error_msgs.add(msg)
for msg in error_msgs:

if self.errors.intersection(["tet", "dentet"]):
    if vi["INCAR"].get("KSPACING"):
        # decrease KSPACING by 20% in each direction (approximately double no. of kpoints)
        actions.append(
            {
                "dict": "INCAR",
                "action": {"_set": {"KSPACING": vi["INCAR"].get("KSPACING") * 0.8}},
            }
        )

Could the reason be the difference in case between the two messages?

  • Tetrahedron method fails
  • tetrahedron method fails
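
If case really is the culprit, one fix would be a case-insensitive comparison; a minimal sketch, reusing the variable names from the loop quoted above (not the actual custodian code):

# Compare both sides in lower case so "Tetrahedron method fails" also matches
# "tetrahedron method fails" in the VASP output.
stripped = line.strip().lower()
for msg in msgs:
    if msg.lower() in stripped:
        self.errors.add(err)
        error_msgs.add(msg)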

Feature request: Nonsensical/non-existent parameter check for INCAR files

VASP by default does not have a way of checking whether the user entered the name of an INCAR parameter correctly, and it will therefore still run even if there are nonsensical parameters in the file. For example, if one were to set a value for "NBAND" instead of "NBANDS" in the INCAR, VASP will simply ignore the former. Another example: if someone sets METAGGA = Scan in version 5.4.1, where this METAGGA option does not exist, nothing will happen. This can make troubleshooting confusing if a user gets results that are not aligned with their expectations and doesn't know why.

I don't think custodian has such checks yet, but would it be possible to implement it?
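
A minimal sketch of such a check, assuming a whitelist of valid tag names (the known_incar_tags set below is a stand-in; as far as I know custodian does not ship such a list today):

from pymatgen.io.vasp.inputs import Incar

# Stand-in whitelist; a real implementation would need the full list of valid
# INCAR tags, possibly per VASP version.
known_incar_tags = {"ENCUT", "ISMEAR", "SIGMA", "NBANDS", "IBRION", "METAGGA"}

incar = Incar.from_file("INCAR")
unknown = [key for key in incar if key.upper() not in known_incar_tags]
if unknown:
    print("Possibly misspelled or unsupported INCAR tags: %s" % unknown)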

Yaml logs

How about adding an option to write logs to custodian.yaml instead of custodian.json? It's more readable IMO and fits more log entries onto one screen. YAML is presumably slower to write, but that won't be the bottleneck.
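
A rough sketch of what the opt-in could look like (assumes PyYAML; run_log here is a stand-in for the list of dicts that currently goes to custodian.json):

import yaml

# Stand-in for the list of dicts that currently goes to custodian.json.
run_log = [{"job": "VaspJob", "corrections": [], "nonzero_return_code": False}]

with open("custodian.yaml", "w") as f:
    yaml.safe_dump(run_log, f, default_flow_style=False)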

Monitor capability of handlers for VaspJob is broken

Jobs experiencing an error (e.g. aliasing) will run to completion before custodian corrects them (and then restarts them). This seems to be caused by p.communicate() in VaspJob.run(). Is there any reason this is necessary? I'm not sure what bug it was supposedly fixing.

Failed to terminate an outdated calculation while creating a new one at the same time (within a single firework).

System

  • Custodian version: 0.8.8
  • Python version: 2.7.11
  • OS version: Red Hat Enterprise Linux Server release 6.3 (Santiago)
  • MPenv version: sjtu branch
  • pymatgen version: 3.5.0 (develop mode)
  • Fireworks version: 1.2.7 (develop mode)
  • MPWorks version: 0.1dev0.1
  • VASP version: 5.4.1.05Feb16 (also tried 5.3.3)

Summary

  1. A VASP calculation with custodian failed to terminate an outdated job even though the new job had already been created.
  2. At most 5 jobs are created belonging to a single firework.

After detecting an error, the job should stop, the INCAR should be modified, and a new calculation should be resubmitted. But right now the old calculation is not terminated even though the new calculation with the modified INCAR is submitted, which leads to overloading the compute nodes.

I have no idea how to fix this, but I would like to know which part of the code is relevant to "terminate_func".

Example code and the error message are shown in #25.

VASP called by custodian is slower than directly called

System

  • Custodian version: 2020.4.27
  • Python version: 3.6
  • OS version: RedHat 8.1

Summary

The university migrated from Ubuntu to RedHat, and I ran into a problem where VASP is slower when I call it from custodian compared to calling it directly with "mpirun -np 16 vasp_std". Are there any possible reasons for this kind of weird behaviour?

Example code

from custodian.custodian import Custodian
from custodian.vasp.handlers import VaspErrorHandler, \
    UnconvergedErrorHandler
from custodian.vasp.jobs import VaspJob
import os

vasp = 'vasp_std'
node = os.environ['NCPUS']
vasp_cmd = ['mpirun', "-np", str(node), vasp]
handlers = [VaspErrorHandler()]
jobs = VaspJob(vasp_cmd, auto_npar=False, auto_gamma=False)
c = Custodian(handlers, [jobs], max_errors=10)
c.run()

With the same input files, OUTCARs after one hour running:

  1. Custodian - Only the first iteration finished
First call to EWALD:  gamma=   0.147
 Maximum number of real-space cells 5x 5x 1
 Maximum number of reciprocal cells 2x 2x 7

    FEWALD:  cpu time    0.2542: real time    0.2548


--------------------------------------- Iteration      1(   1)  ---------------------------------------


    POTLOK:  cpu time    0.2242: real time    0.2320
    SETDIJ:  cpu time    0.2767: real time    0.2774

  2. VASP called directly - Iteration 28 finished

  energy without entropy =     -126.53388816  energy(sigma->0) =     -126.64889039


--------------------------------------------------------------------------------------------------------




--------------------------------------- Iteration      1(  28)  ---------------------------------------


    POTLOK:  cpu time    0.2173: real time    0.2239
    SETDIJ:  cpu time    0.0101: real time    0.0102

Any suggestions? Thanks.

installation error

Unable to install custodian on NERSC.

I got an error like this:

$ python setup.py develop
Traceback (most recent call last):
  File "setup.py", line 11, in <module>
    long_desc = f.read()
  File "/global/u1/m/mliu/ml_web/virtenv_ml_web/lib/python2.7/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3323: ordinal not in range(128)
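
For reference, the usual fix for this kind of UnicodeDecodeError under Python 2 is to read the long description with an explicit UTF-8 codec; a minimal sketch (the README filename is an assumption):

import io

# Read the long description as UTF-8 instead of relying on the ASCII default.
with io.open("README.rst", encoding="utf-8") as f:
    long_desc = f.read()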

Include CONTCAR in VASP_BACKUP_FILES


Summary

  • The list of VASP_BACKUP_FILES doesn't include the CONTCAR
  • Some of the custodian error corrections claim they are trying to compensate for bad ionic step updates (I am investigating the PotimHandler in particular)
  • It would be nice to be able to directly verify that a bad ionic update happened by comparing what the CONTCAR looked like vs. the POSCAR. Otherwise I can't tell whether the error detection's claim of a bad ionic update is correct.
  • More generally, the CONTCAR will help me debug where the structure started and ended in the run, and it doesn't take much space.

Suggested solution (if known)

VASP_BACKUP_FILES = {"CUSTODIAN", "INCAR", "KPOINTS", "POSCAR", "OUTCAR", "OSZICAR",
                     "vasprun.xml", "vasp.out", "std_err.txt"}

JSONSerializable in monty?

Should JSONSerializable be moved to monty? It looks more general than custodian and seems to fit in with "MSONable"

Possible improvements: control number of errors and exception raised

Dear all, I opened this issue not to signal a problem, but rather to propose a couple of improvements to the current custodian implementation. These ideas emerged since we would like to use custodian as a base for error handling in a new project and we think that these functionalities would help us in our use cases.

Number of errors

The first modification concerns the introduction of finer control over the number of errors before stopping the job. In particular, I think it would be nice to have the possibility to decide the maximum number of times a correction from a single error handler should be applied.

I can imagine a series of cases where this could be useful:

  • a correction for an error message that is only meaningful if applied once. Applying the same correction again would be useless, so the job should stop immediately.
  • in the case of a restart, one may want to control exactly how many times a calculation should be restarted before giving an error. This could be handled with the total number of errors, but in general it is not possible to predict how many other errors may also occur.
  • the VASP CheckpointHandler docs suggest that this handler should be used alone and with a high value for max_errors. However, I suppose the limitation of being the single error handler could be lifted if it were possible to control the maximum number of tolerated corrections for each of the other handlers.

Of course something like this could still be implemented at the level of each single error handler, but that would likely lead to a lot of redundant code, while it seems a sufficiently general feature to belong in the core implementation.
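
For illustration, a minimal sketch of the idea (the attribute and method names here are assumptions, not necessarily those used in the implementation linked below):

class ErrorHandler:
    # None means "no per-handler limit"; a subclass or user could set an int.
    max_num_corrections = None
    # Whether hitting the limit should abort the run or just skip this handler.
    raise_on_max = False

    def __init__(self):
        self.n_applied_corrections = 0

    def correct(self):
        raise NotImplementedError

    def apply_correction(self):
        if (self.max_num_corrections is not None
                and self.n_applied_corrections >= self.max_num_corrections):
            if self.raise_on_max:
                raise RuntimeError("Maximum number of corrections reached for "
                                   + type(self).__name__)
            return None  # skip: Custodian would treat this as "not corrected"
        self.n_applied_corrections += 1
        return self.correct()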

I have prepared a rough implementation of this modification here: gpetretto@3ff38fc

Such a modification should have no impact on existing code, as by default everything should keep working as before. The only possible problem would be if someone has implemented handlers outside the official custodian implementation with names clashing with the new ones introduced here, but this seems a relatively remote possibility.

Here is a snippet of code demonstrating possible use cases: https://gist.github.com/gpetretto/b14a1addf53a8e99893fca24228e1d4e

More informative exceptions

I suppose that more often than not Custodian runs inside some other Python code (e.g. fireworks), but, as far as I can see, what happens inside custodian is not really transparent to the outer layers. To give a more specific example, we would like to take advantage of the exact final state of the error handler/validator that caused the program to stop, in order to take further actions. This may involve a dynamical evolution of the workflow, like adding an intermediate calculation step before rerunning the current job.
I would like to stress that I am not suggesting introducing some kind of integration with other Python packages (be it fireworks or anything else), but just letting out more information when errors happen, for whoever may be interested in having it.

It seems that currently some information can be extracted from the Custodian instance (from run_log, total_errors and errors_current_job), but this implies a bit of investigation, and maybe some guessing about what really happened inside custodian (e.g. is there any way one could know which validator made the calculation fail, aside from parsing the text of the error message?).
Here is one possible modification of the current behavior that would already allow one to directly receive some useful information:
gpetretto@9597595
https://gist.github.com/gpetretto/13f8f00994d62844cadeae812559bfb9
This takes advantage of the validator attribute in CustodianError, which is set but seems never to be used. Also, since the new error class CustodianRuntimeError subclasses RuntimeError, this should preserve the current behavior almost entirely, the only difference being the name of the exception in the message. In addition, it will provide extra information about what happened inside custodian.

However, if a bit more freedom is allowed, I would prefer introducing a more detailed hierarchy of CustodianError exceptions and letting these exceptions emerge directly from Custodian, so that one could catch and analyze each case differently and in more detail. For example, something like this:

RuntimeError
 +-- CustodianError
      +-- ValidationError
      +-- HandlerError
      +-- ReturnCodeError
      +-- MaxErrorsError
      +-- MaxErrorsPerJobError

Notice that I suggest keeping it a subclass of RuntimeError to ensure that any existing code that relies on catching RuntimeError will keep working.
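
As a sketch, the hierarchy above could look like this (the constructor signature and extra attributes are assumptions, not the final design):

class CustodianError(RuntimeError):
    """Base class, kept as a RuntimeError subclass for backwards compatibility."""

    def __init__(self, message, raises=True, handler=None):
        super(CustodianError, self).__init__(message)
        self.raises = raises
        self.handler = handler


class ValidationError(CustodianError):
    """Raised when a Validator fails at the end of a job."""


class HandlerError(CustodianError):
    """Raised when an ErrorHandler itself fails while checking or correcting."""


class ReturnCodeError(CustodianError):
    """Raised on a nonzero return code with terminate_on_nonzero_returncode."""


class MaxErrorsError(CustodianError):
    """Raised when the total number of errors exceeds max_errors."""


class MaxErrorsPerJobError(CustodianError):
    """Raised when a single job exceeds its per-job error limit."""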

Here is a possible implementation:
gpetretto@d2e0105
https://gist.github.com/gpetretto/45506abee33f68b103639399b2c77064

I would be interested in hearing comments and knowing whether these kinds of modifications would be of interest to you. If so, I can clean up the implementations and open pull requests.

Thanks

v1.0.0 not compatible with pmg4.0

System

  • Custodian version: 1.0.0
  • Python version: 3.5.1 & 2.7
  • OS version: OS X 10.11.5 & Ubuntu (Circle CI)

Summary

A few imports of deprecated pymatgen methods were NOT removed in the 1.0.0 release.

Example code

>>> from custodian.vasp.jobs import VaspJob

Error message

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/zhideng/.pyenv/versions/3.5.1/Python.framework/Versions/3.5/lib/python3.5/site-packages/custodian/vasp/jobs.py", line 26, in <module>
    from pymatgen.io.smart import read_structure
ImportError: No module named 'pymatgen.io.smart'

Suggested solution (if known)

Release a new stable version.

half_kpts option and float values of kpoints

System

  • Custodian version: master
  • Python version: 2.7

Summary

  • I am running custodian with half_kpts option for Si. The job fails almost immediately
  • The downstream error looks like this: "/global/project/projectdirs/m2439/matmethods_test/pymatgen/pymatgen/io/vasp/inputs.py\", line 1154, in from_string\n kpts = [int(i) for i in lines[3].split()]\nValueError: invalid literal for int() with base 10: '4.0'\n"
  • This is caused by the KPOINTS file written by custodian which looks like this
Automatic kpoint scheme
0
Gamma
4.0 4.0 4.0

Note the floating points. Note also that the original KPOINTS file was 8 8 8, no floats. So it is custodian that is adding the floating point.
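
A minimal sketch of the integer rounding the half_kpts logic presumably needs (illustration only, not the actual custodian code); halving 8 8 8 should give the integers 4 4 4, never the floats 4.0 4.0 4.0 that break Kpoints parsing:

from pymatgen.io.vasp.inputs import Kpoints

orig = Kpoints.from_file("KPOINTS")
# Halve each subdivision and force the result back to an int (minimum 1).
halved = tuple(max(1, int(round(k / 2.0))) for k in orig.kpts[0])
Kpoints.gamma_automatic(kpts=halved).write_file("KPOINTS")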

Error message

See also this custodian.json fragment which shows it overriding the KPOINTS with float:

[
    {
        "corrections": [], 
        "job": {
            "settings_override": [
                {
                    "action": {
                        "_set": {
                            "comment": "Automatic kpoint scheme", 
                            "usershift": [
                                0, 
                                0, 
                                0
                            ], 
                            "labels": null, 
                            "tet_number": 0, 
                            "tet_connections": null, 
                            "@module": "pymatgen.io.vasp.inputs", 
                            "nkpoints": 0, 
                            "coord_type": null, 
                            "kpts_weights": null, 
                            "@class": "Kpoints", 
                            "tet_weight": 0, 
                            "kpoints": [
                                [
                                    4.0, 
                                    4.0, 
                                    4.0
                                ]
                            ], 
                            "generation_style": "Gamma"
                        }
                    }, 
                    "dict": "KPOINTS"
                }
            ], 

Let me know if you need more info

Two tests broken by recent TravisCI changes in vasp/tests/test_jobs.py

When updating TravisCI build parameters to allow the OpenBabel-dependent tests to run, two tests in VASP's test_jobs.py began to fail despite being seemingly unrelated to the changes. More specifically, test_setup in both VaspJobTest and VaspNEBJobTest had problems related to multiprocessing.cpu_count(). I commented the failing sections out and would appreciate it if someone with more VASP experience could fix them. Thanks!

Unfreezef and Restart option for FEFF calculation

The current FEFF non-convergence error correction relies on adjusting the maximum number of iterations and the convergence accelerator factor. An alternative option, the RESTART card, could be implemented for more robust error correction. If RESTART is specified, FEFF starts the SCF calculation of the potentials from an existing "pot.bin" file; this way one can continue an earlier SCF calculation and save SCF potential calculation time.

Custodian should support validators

As per Will's suggestion: implement a scheme for checkers that do not actually correct anything, but instead validate something at the end of a job.
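
A minimal sketch of what such a validator interface could look like (the class and method names are illustrative assumptions, not a settled API):

class Validator:
    """Like an ErrorHandler, but only checked once, after the job finishes."""

    def check(self):
        """Return True if the job output is invalid."""
        raise NotImplementedError


class VasprunXMLValidator(Validator):
    """Example: fail the job if vasprun.xml is missing or unparsable."""

    def check(self):
        try:
            from pymatgen.io.vasp import Vasprun
            Vasprun("vasprun.xml")
        except Exception:
            return True
        return False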

[VASP] Add handler for PSMAXN warnings and associated failures

Summary

  • A VASP failure associated with the PSMAXN warning is not recognized as an error by Custodian, triggering a misleading ValidationError

Details

With certain combinations of ENCUT, LREAL, and pseudopotentials, VASP issues the warning

WARNING: PSMAXN for non-local potential too small

In some cases vasp still runs successfully, in other cases it will fail, e.g.

WARNING: PSMAXN for non-local potential too small
 LDA part: xc-table for Pade appr. of Perdew
 POSCAR, INCAR and KPOINTS ok, starting setup
 REAL_OPT: internal ERROR:         -32         -32         -32           0
 VASP aborting ...
 REAL_OPT: internal ERROR:         -32         -32         -32           0
 VASP aborting ...
 REAL_OPT: internal ERROR:         -32         -32         -32           0
 VASP aborting ...
 REAL_OPT: internal ERROR:         -32         -32         -32           0
...

It appears that Custodian does not recognize the above type of failure as an error. As a result, _run_job() will attempt to validate the output of the calculation (which never ran in the first place and therefore never generated output) and raise a ValidationError.

Traceback (most recent call last):
  File "/global/u2/r/rsking84/.conda/envs/cms/code/fireworks/fireworks/core/rocket.py", line 262, in run
    m_action = t.run_task(my_spec)
  File "/global/u2/r/rsking84/.conda/envs/cms/code/atomate/atomate/vasp/firetasks/run_calc.py", line 211, in run_task
    c.run()
  File "/global/u2/r/rsking84/.conda/envs/cms/code/custodian/custodian/custodian.py", line 378, in run
    self._run_job(job_n, job)
  File "/global/u2/r/rsking84/.conda/envs/cms/code/custodian/custodian/custodian.py", line 502, in _run_job
    raise ValidationError(s, True, v)
custodian.custodian.ValidationError: Validation failed: VasprunXMLValidator

The ValidationError is very difficult to troubleshoot without running vasp manually. In this situation, the contents of vasp.out are empty and std_err.txt contains only

srun: fatal: Can not execute vasp_std

Suggested solution

Information about the PSMAXN warning and associated failures is scarce, but there appear to be several possible fixes:

  1. Set LREAL=FALSE (expand the basis set in reciprocal space instead of real space)
  2. Sort the pseudopotentials such that the one with the highest ENMAX appears first in the POTCAR
  3. Lower the ENCUT value

I have had the most success with Option 1.

Option 2 has not solved the issue for me and is only applicable if the user does not specify ENCUT in the INCAR file (I think; see the docs). It is also not clear whether this fix is still relevant to the latest versions of VASP.

I'm not yet familiar enough with the architecture of Custodian to know the best way to address this, but it seems to me that, at a minimum, an error handler to catch this type of failure would be valuable. Even better would be to modify LREAL to FALSE on the fly.
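
A minimal sketch of such a handler (the message strings and the LREAL fix come from this issue; the class itself is hypothetical, not an existing custodian handler):

from custodian.custodian import ErrorHandler


class PsmaxnErrorHandler(ErrorHandler):
    """Catch PSMAXN-related aborts and switch to reciprocal-space projection."""

    is_monitor = True

    def __init__(self, output_filename="vasp.out"):
        self.output_filename = output_filename

    def check(self):
        with open(self.output_filename) as f:
            text = f.read()
        # The warning alone is sometimes harmless; only flag an error when the
        # REAL_OPT abort is also present.
        return ("PSMAXN for non-local potential too small" in text
                and "REAL_OPT: internal ERROR" in text)

    def correct(self):
        return {"errors": ["psmaxn"],
                "actions": [{"dict": "INCAR",
                             "action": {"_set": {"LREAL": False}}}]}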

Further reading on troubleshooting the VASP PSMAXN warnings:

https://cms.mpi.univie.ac.at/vasp-forum/viewtopic.php?f=3&t=8370

the reason most probably is that you join 2 potentials with very different cutoff, with the POTCAR with the SMALL cutoff (U) being the first in the list. This potentials is used to determine PSMAXN.
please
1) switch the 2 atoms in POSCAR and POTCAR (ie give the atoms such that those with the hardest potentials are first
2) OR use O_s (soft O, low cutoff)

https://cms.mpi.univie.ac.at/vasp-forum/viewtopic.php?t=14811

The warning means that PSMAXN is too small for the required cutoff energy (ENMAX) the first of the atoms given in POTCAR.
Either use a harder potential or decrease ENMAX.

Solved it by setting LREAL=FALSE

https://www.researchgate.net/post/Relaxation_in_metal_using_vasp2

"PSMAXN for non-local potential too small"
Try lowering your ENCUT parameter (how large is it, and what are the defaults in your POTCAR?), this error indicates that you go out of bounds for an array related to the potential, which is related to the cutoff energy.

http://materials.duke.edu/AFLOW/README_AFLOW.TXT

PSMAXN
PSMAXN errors. By default aflow tries to go around PSMAXN warnings by restarting VASP with reducingly
lower ENMAX until everything is set. This can be done by tuning the INCAR schemes.

Example not working - serialisation issue?

Hi,

I tried the example from your website but was not successful, so I looked into the unit tests and updated the example correspondingly, but still I have some issues with the serialisation.
ExampleJob:

class ExampleJob(Job):

    def __init__(self, jobid, params=None):
        if params is None:
            params = {"initial": 0, "total": 0}
        self.jobid = jobid
        self.params = params

    def setup(self):
        self.params["initial"] = 0
        self.params["total"] = 0

    def run(self):
        sequence = [random.uniform(0, 1) for i in range(100)]
        self.params["total"] = self.params["initial"] + sum(sequence)

    def postprocess(self):
        pass

    @property
    def name(self):
        return "ExampleJob{}".format(self.jobid)

Error Handler:

class ExampleHandler(ErrorHandler):

    def __init__(self, params):
        self.params = params

    def check(self):
        return self.params["total"] < 50

    def correct(self):
        self.params["initial"] += 1
        return {"errors": "total < 50", "actions": "increment by 1"}

This works:

njobs = 100
params = {"initial": 0, "total": 0}
c = Custodian([ExampleHandler(params)],
              [ExampleJob(i, params) for i in range(njobs)],
               max_errors=njobs)
output = c.run()

This does not:

njobs = 100
c = Custodian([ExampleHandler({"initial": 0, "total": 0})],
              [ExampleJob(i, {"initial": 0, "total": 0}) for i in range(njobs)],
               max_errors=njobs)
output = c.run()

Custodian does not print output on jobs that are killed

When running VASP with custodian (using fireworks), custodian errors are printed to the run directory after execution is completed. This is an issue, however, in cases when lots of errors result in a job hitting walltime or when a job gets pre-empted.

Perhaps the custodian log could be updated/written after each correction? This would preserve the information even if the job is killed suddenly.

Failed to terminate an outdated job while creating a new one at the same time.

System

  • Custodian version: 0.8.8
  • Python version: 2.7.11
  • OS version: Red Hat Enterprise Linux Server release 6.3 (Santiago)
  • MPenv version: sjtu branch
  • pymatgen version: 3.5.0 (develop mode)
  • Fireworks version: 1.2.7 (develop mode)
  • MPWorks version: 0.1dev0.1
  • VASP version: 5.4.1.05Feb16 (also tried 5.3.3)

Summary

  1. A VASP calculation with custodian failed to terminate an outdated job even though the new job had already been created.
  2. At most 5 jobs are created belonging to a single firework.

Example code

from mpworks.submission.submission_mongo import SubmissionMongoAdapter
from pymatgen.matproj.snl import StructureNL
from pymatgen.core.structure import Structure
from pymatgen.transformations.standard_transformations import SubstitutionTransformation

# set submission db
sma         = SubmissionMongoAdapter.auto_load()
# read structure from POSCAR file.
bs_struc    = Structure.from_file("POSCAR")
# some substitution, then bs_struc -> re_struc
snl         = StructureNL(re_struc, 'KeLiu <[email protected]>')
sma.submit_snl(snl, '[email protected]', parameters=None)

then go_submissions and qlaunch

qlaunch -r rapidfire --nlaunches infinite -m 4 --sleep 100 -b 10000

Error message

fireworks: Cu1_La1_Te2--GGA_optimize_structure_(2x)
Cu1_La1_Te2--GGA_opt-11690.error

INFO:custodian.custodian:Run started at 2016-04-20 14:24:05.227027 in /lustre/home/umjzhh-1/launcher/layered_material/ycute2/substitution_01stRun/block_2016-04-18-11-18-03-912778/launcher_2016-04-20-01-37-05-453043.
INFO:custodian.custodian:Custodian running on Python version 2.7.11 |Continuum Analytics, Inc.| (default, Dec  6 2015, 18:08:32)  [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 1. Errors thus far = 0.
INFO:root:Running mpirun -n 32 vasp
INFO:custodian.custodian:Terminating job
INFO:root:Backing up run to error.1.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af3687fce10>, u'errors': [u'too_few_bands'], u'actions': [{u'action': {u'_set': {u'NBANDS': 28}}, u'dict': u'INCAR'}]}
INFO:root:Backing up run to error.2.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.PositiveEnergyErrorHandler object at 0x2af3687fcf90>, u'errors': [u'Positive energy'], u'actions': [{u'action': {u'_set': {u'ALGO': u'Normal'}}, u'dict': u'INCAR'}]}
INFO:custodian.custodian:VaspJob.run has completed. Checking remaining handlers
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 2. Errors thus far = 2.
INFO:root:Running mpirun -n 32 vasp
INFO:custodian.custodian:Terminating job
INFO:root:Backing up run to error.3.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af3687fce10>, u'errors': [u'eddrmm'], u'actions': [{u'action': {u'_set': {u'POTIM': 0.25}}, u'dict': u'INCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'CHGCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'WAVECAR'}]}
INFO:custodian.custodian:VaspJob.run has completed. Checking remaining handlers
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 3. Errors thus far = 3.
INFO:root:Running mpirun -n 32 vasp
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 19
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 23
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 14
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 17
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 8
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 11
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 21
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 12
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 15
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 22
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 7
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 10
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 16
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 18
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 29
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 13
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 20
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 25
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 31
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 27
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 24
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 28
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 26
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 30
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 9
INFO:custodian.custodian:Terminating job
INFO:root:Backing up run to error.4.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af3687fce10>, u'errors': [u'zpotrf', u'eddrmm'], u'actions': [{u'action': {u'_set': {u'ISYM': 0, u'POTIM': 0.125}}, u'dict': u'INCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'CHGCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'WAVECAR'}, {u'action': {u'_set': {u'POTIM': 0.125}}, u'dict': u'INCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'CHGCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'WAVECAR'}]}
INFO:custodian.custodian:VaspJob.run has completed. Checking remaining handlers
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 4. Errors thus far = 4.
INFO:root:Running mpirun -n 32 vasp
INFO:custodian.custodian:Terminating job
INFO:root:Backing up run to error.5.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af3687fce10>, u'errors': [u'subspacematrix'], u'actions': [{u'action': {u'_set': {u'LREAL': False}}, u'dict': u'INCAR'}]}
INFO:custodian.custodian:VaspJob.run has completed. Checking remaining handlers
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 5. Errors thus far = 5.
INFO:root:Running mpirun -n 32 vasp
INFO:custodian.custodian:Terminating job
INFO:root:Backing up run to error.6.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af3687fce10>, u'errors': [u'subspacematrix'], u'actions': [{u'action': {u'_set': {u'LREAL': False}}, u'dict': u'INCAR'}]}
INFO:custodian.custodian:VaspJob.run has completed. Checking remaining handlers
INFO:custodian.custodian:Max errors reached.
ERROR:custodian.custodian:MaxErrors
INFO:custodian.custodian:Logging to custodian.json...
INFO:custodian.custodian:Run ended at 2016-04-20 15:47:38.863389.
INFO:custodian.custodian:Run completed. Total time taken = 1:23:33.636362.
Traceback (most recent call last):
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/core/rocket.py", line 213, in run
    m_action = t.run_task(my_spec)
  File "/lustre/home/umjzhh-1/kl_me2/codes/MPWorks/mpworks/firetasks/custodian_task.py", line 115, in run_task
    custodian_out = c.run()
  File "/lustre/home/umjzhh-1/kl_me2/codes/custodian/custodian/custodian.py", line 221, in run
    .format(self.total_errors, ex))
RuntimeError: 6 errors reached: (CustodianError(...), u'MaxErrors'). Exited...
INFO:rocket.launcher:Rocket finished

the relevant SLURM status:

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
11690        Cu1_La1_T+        cpu acct-umjz+         32  COMPLETED      0:0
11690.batch       batch            acct-umjz+         32  COMPLETED      0:0
11690.0       pmi_proxy            acct-umjz+          2  COMPLETED      0:0
11690.1       pmi_proxy            acct-umjz+          2     FAILED      7:0
11690.2       pmi_proxy            acct-umjz+          2     FAILED      7:0
11690.3       pmi_proxy            acct-umjz+          2     FAILED      7:0
11690.4       pmi_proxy            acct-umjz+          2     FAILED      7:0

Suggested solution (if known)

  • I have no idea, but I would like to know which part of the code is relevant to "terminate_func".


Check POSCAR under input folder

The VaspCustodianTask function from fireworks_vasp.tasks seems to check whether there is a POSCAR in the run folder or not.

This becomes a problem when I run it for NEB input sets, where all POSCARs are written in the 00~0x subfolders and there is no POSCAR in the top-level input folder.

VaspCustodianTask(vasp_cmd=['aprun', '-n', '48', '/global/homes/r/rongzq/bin/vasp_5.3.3/vasp_hopper/vasp', '>&', 'stdout'], handlers='all', custodian_params=custodian_params)

Potential issue with MaxForceErrorHandler

I believe that there is a problem with the way the MaxForceErrorHandler is implemented.
The correct() method only acts on the EDIFFG key, leaving EDIFF unchanged. This might be fine if this handler is called only once and if there is a large enough factor between EDIFF and EDIFFG. However, I ended up in a situation in which this handler was applied multiple times, causing the value of EDIFFG to go below that of EDIFF, which is clearly a bad configuration.

In general I would say it would be preferable to keep the same ratio between EDIFF and EDIFFG, i.e. EDIFF could be multiplied by 0.5 as well. If this seems unsuitable, at the very least it should be checked that EDIFFG does not go below EDIFF.
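
A minimal sketch of the ratio-preserving update suggested here (illustration only, not the current handler code):

from pymatgen.io.vasp.inputs import Incar

incar = Incar.from_file("INCAR")
# Halve both tolerances together so their ratio, and hence their relative
# ordering, is preserved.
ediff = incar.get("EDIFF", 1e-4)
ediffg = incar.get("EDIFFG", ediff * 10)
actions = [{"dict": "INCAR",
            "action": {"_set": {"EDIFF": ediff * 0.5, "EDIFFG": ediffg * 0.5}}}]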

Custodian Monitor Not Working with terminate_on_nonzero_returncode=True

System

  • Custodian version: master branch, revision af15bda and 5f0e8b
  • Python version: 3.5.2
  • OS version: Cray XC40

Summary

This test compares two revisions of the master branch: 1) "5f0e8b9", which is after adding the return-code exception-raising code, and 2) "af15bda", which is before adding the return-code logic. The "run_vasp" command is used to perform the test. No workflows were involved; only a simple SLURM script and the custodian package were used.

The most important message is from "run.log" file.

  1. With revision "5f0e8b9":

After "{'errors': ['eddrmm']...", the custodian says "Job return code is 137. Terminating...".

  1. With revision "af15bda":

After "After "{'errors': ['eddrmm']...", the custodian says "Starting job no. 1 (VaspJob) attempt no. 2". There are other issues which are specific to my environment and are already solved, which I don't think will affect the conclusions. The bottom line is that with the old code is trying to fix the error while the new code exits prematurely.

Example code

module load vasp/5.4.1
run_vasp -c "srun -n 32 vasp_std" static

Error message

"Job return code is 137. Terminating..." vs "Starting job no. 1 (VaspJob) attempt no. 2".

Suggested solution (if known)

  • When both Monitors and terminate_on_nonzero_returncode=True are requested, disable one of them.
  • Document that the Monitor won't work if terminate_on_nonzero_returncode is set to True; otherwise, users will have the illusion that they still have Custodian Monitors to fix VASP errors.

Files

cus_5f0e8b9_after_rc.tar.gz

cus_af15bda_before_rc.tar.gz

Updating subspacematrix error handling with VASP

In the custodian/vasp/handlers.py script, there is an error-handling method for the "subspacematrix" error (i.e. "WARNING: Sub-Space-Matrix is not hermitian in DAV"). As it stands right now, custodian tries to change LREAL when this error occurs. I have found that in the systems I have studied, this error often occurs when PREC is not set to Accurate. I have attached an example set of input files. It runs with PREC = Accurate but hits the subspace error after 7-10 SCF iterations when PREC is not set (i.e. it is at the default value of Normal). It may therefore be desirable to add a PREC switch for this error message. Note that this may depend on the particular VASP build; I've tested this on VASP 5.4.1 for the record.
INCAR.txt
KPOINTS.txt
POSCAR.txt
POTCAR.txt
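
A minimal sketch of the proposed PREC switch (hypothetical action, intended alongside rather than instead of the existing LREAL change):

errors = {"subspacematrix"}  # as detected from the VASP output
actions = []
if "subspacematrix" in errors:
    # Try the stricter precision first; the existing LREAL swap could follow
    # if this does not resolve the warning.
    actions.append({"dict": "INCAR", "action": {"_set": {"PREC": "Accurate"}}})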

Temporary modifications to input sets

Sometimes conjugate gradient doesn't work well for ionic relaxation in VASP. In some cases (mostly with many degrees of freedom, and hard + soft vibrational modes), I have found that the velocity-quench algorithm (surprisingly) works better (even though damped MD doesn't help). The simple solution would be to switch the algo with the UnconvergedErrorHandler. However, I think it may be more effective to switch the algo back to the original (usually conjugate gradient) after the 1st relaxation completes.

Currently, this is difficult within the Custodian framework, as it is a state machine. My proposition is to allow error handlers to prepend actions to a list that runs during job post-processing.

Does anyone have alternative suggestions, or think that this is outside the scope of what Custodian should be able to handle?

Check returncode to raise CustodianError() leading some trouble


System

  • Custodian version: master
  • Python version: 2.7
  • OS version: CentOS release 6.8

Summary

  • the default terminate_on_nonzero_returncode is True.
  • _run_job will raise a CustodianError() if the job returns a nonzero return code (in L395).
  • something goes wrong in custodian.py L319~L323; it seems like total_errors is missing, but I am not sure.

Example code

terminated sub SLURM job with MPWorks

Error message

$ more Eu1_S2_Sn1--GGA_opti-294848.error
INFO:custodian.custodian:Run started at 2016-12-19 21:03:28.011467 in /lustre/home/umjzhh-1/launcher/block_2016-10-12-18-29-00-023399/launcher_2016-12-19-13-03-16-629901.
INFO:custodian.custodian:Custodian running on Python version 2.7.12 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:42:40)  [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
INFO:custodian.custodian:Hostname: node077, Cluster: unknown
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 1. Errors thus far = 0.
INFO:root:Running srun -v -n 32 vasp
INFO:custodian.custodian:Terminating job
INFO:root:Terminate the job step using scancel --signal=KILL 294848.0
INFO:root:Backing up run to error.1.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af91dd3d0d0>, u'errors': [u'brmix', u'eddrmm'], u'actions': [{u'action': {u'_set': {u'ISTART': 1}}, u'dict': u'INCAR'}, {u'action': {u'_set':
{u'POTIM': 0.25}}, u'dict': u'INCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'CHGCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'WAVECAR'}]}
INFO:custodian.custodian:Job return code is 137. Terminating...
ERROR:custodian.custodian:Job return code is 137. Terminating...
INFO:custodian.custodian:Logging to custodian.json...
INFO:custodian.custodian:Run ended at 2016-12-19 21:13:39.441558.
INFO:custodian.custodian:Run completed. Total time taken = 0:10:11.430091.
Traceback (most recent call last):
  File "/lustre/home/umjzhh-1/keliu/envs/mpworks/codes/fireworks/fireworks/core/rocket.py", line 224, in run
    m_action = t.run_task(my_spec)
  File "/lustre/home/umjzhh-1/keliu/envs/mpworks/codes/mpworks/mpworks/firetasks/custodian_task.py", line 136, in run_task
    custodian_out = c.run()
  File "/lustre/home/umjzhh-1/keliu/envs/mpworks/codes/custodian/custodian/custodian.py", line 323, in run
    .format(self.total_errors, ex))
RuntimeError: 1 errors reached: (CustodianError(...), u'Job return code is 137. Terminating...'). Exited...
INFO:rocket.launcher:Rocket finished

Suggested solution (if known)

  • I just commented out L395, and everything went back to normal.

Files

<If input files are needed for the error, please copy and paste them here.>

<contents of file 1>

ALGO=All method for correcting unconverged electronic loops

System

  • Custodian version: master

Summary

  • If a job has not reached electronic convergence (UnconvergedErrorHandler), the fix procedure in custodian is:
                new_settings = {"ISTART": 1,
                                "ALGO": "Normal",
                                "NELMDL": -6,
                                "BMIX": 0.001,
                                "AMIX_MAG": 0.8,
                                "BMIX_MAG": 0.001} 

Recently, I've had pretty good luck with just setting ALGO = All when a job is unconverged, rather than making all these changes (this is in fact how the UnconvergedErrorHandler already deals with SCAN failures, but I think it's more generally applicable than that). If it were just up to my qualitative assessment, I would first try ALGO = All; if that still doesn't work, then try making all the various changes above.

Downside: I don't have any hard facts to support my case.

Perhaps @shyamd or @montoyjh or @mkhorton or @tschaume can confirm or refute this.

If we do want to switch to trying ALGO=All first, I'd be happy to do the implementation. So it's just a matter of deciding.

ERROR: no process found

Dear All,

I'm using custodian with Python 3.6.8 on a Linux machine. The queue system is PBS.

When I launch the PBS script, instead of running the VASP binary directly, I run a Python script that initializes a Custodian object and launches the job, with 'vasp_cmd' being the command that I usually use to run VASP directly in a PBS script.

At first it runs, but then I get the following in the error log.

ERROR:custodian.custodian:
{ 'actions': [ { 'action': { '_set': { 'POTIM': 0.30000000000000004}},
'dict': 'INCAR'}],
'errors': [ 'brions'],
'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2acf80726b00>}
/software/bin/vasp: no process found

The file error1.tar.gz contains an empty std_err file. The OSZICAR file shows completely normal behavior but is cut off in the middle of the 3rd ionic step.

I don't know why Custodian runs well at first, then decides to stop the ionic relaxation in the middle of an SCF iteration, and then cannot find the binary when rerunning.

Does anybody know what's going on?

Best regards,

Oier.

CIC Energigune

VASP fix: ZHEGV error occurs when small # of atoms and too many cores

Currently, when the ZHEGV error appears in VASP, Custodian switches ALGO to All. This is the right approach, but for small systems (e.g. elementals with only 1 or 2 atoms/cell), ZHEGV often appears because too many cores are requested. With supercomputers having more and more cores per node these days, this comes up more frequently. The solution is just to decrease the number of cores for the job, but rather than trying to play around with that, Custodian could issue a warning to the user if len(structure) is below some cutoff. It would require some testing to figure out a good value; my gut feeling is perhaps 5 atoms/cell.
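
A minimal sketch of the suggested warning (the 5 atoms/cell cutoff is the gut-feeling value from this issue, not a tested threshold):

import warnings

from pymatgen.core import Structure

structure = Structure.from_file("POSCAR")
if len(structure) < 5:
    warnings.warn("Very small cell (%d atoms): ZHEGV failures here are often "
                  "caused by requesting too many cores; consider reducing the "
                  "core count." % len(structure))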

Intent to add remote logging capabilities

We're having a lot of different users' VASP jobs fail for various reasons, and we'd like the capability to aggregate Custodian errors so that we can more easily tell whether some errors are occurring more often than expected (that is, whether they reflect an underlying weakness in our input sets or error handlers, rather than user error).

I'd like to propose we integrate Sentry support via their Python client.

By default, this will result in no change to Custodian except for an additional optional dependency. However, if you set a SENTRY_DSN environment variable, we can set it up to log and aggregate errors automatically.
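
A minimal sketch of the opt-in behavior (assumes the sentry-sdk package; sentry_sdk.init and the SENTRY_DSN convention come from Sentry's client, not from existing custodian code):

import os

import sentry_sdk

# Only activate remote logging if the user has opted in via the environment.
dsn = os.environ.get("SENTRY_DSN")
if dsn:
    sentry_sdk.init(dsn=dsn)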

If there are no objections to this, I will go ahead and implement.

I think a similar addition to log stack traces in FireWorks when jobs fizzle might also be useful.

docstring of "StoppedRunHandler" looks incorrect

The docstring of "StoppedRunHandler" looks to be copied from a version of the CheckpointHandler docstring. I think it needs to be updated to pertain to StoppedRunHandler. I would do it myself, but I never used these handlers so can't comment intelligently on what they do.

If VASP does not even start running custodian doesn't raise error


System

  • Custodian version: 1.0.3 (master)
  • Python version: 2.7.12

Summary

  • Custodian doesn't raise an error if VASP doesn't run at all, for whatever reason.
    • Example of why this matters: when files like OUTCAR are copied from the previous step, then even if this step doesn't run, it looks from OUTCAR and vasprun.xml as if the job has already finished!
  • The problem might be in the Custodian._run_job() method: after "p = job.run()", custodian just assumes that VASP started running, while it might not even start due to an error like the one below.

Error message

(this example is on NERSC in std_err.txt):

srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: error: Unable to confirm allocation for job 3178405: Socket timed out on send/recv operation
srun: Check SLURM_JOB_ID environment variable for expired or invalid job.

but this may happen due to various reasons regardless of what this error message says.

Suggested solution (if known)

I don't know exactly how VASP exits when it can't even start, but the machine itself raises various errors like the one copied here. A check for this could be added before p = job.run().

Some errors are not detected in VASP6

System

  • Custodian version: latest master branch as of 2020-11-01
  • Python 3.8
  • VASP 6.1.1

Summary

VaspErrorHandler fails to detect certain error categories in VASP 6 that were detected properly in VASP 5. This appears to be a result of formatting changes associated with the error messages. My understanding is that custodian reads stdout line-by-line, but some of the error messages have been reformatted in such a way that they now break across multiple lines and hence are not caught.

See example below for the point_group error. I have also observed this for inv_rot_mat message, and I suspect it could affect other handlers as well.

Error message

VASP6:

 -----------------------------------------------------------------------------
|                                                                             |
|     EEEEEEE  RRRRRR   RRRRRR   OOOOOOO  RRRRRR      ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     EEEEE    RRRRRR   RRRRRR   O     O  RRRRRR       #       #       #      |
|     E        R   R    R   R    O     O  R   R                               |
|     E        R    R   R    R   O     O  R    R      ###     ###     ###     |
|     EEEEEEE  R     R  R     R  OOOOOOO  R     R     ###     ###     ###     |
|                                                                             |
|     VERY BAD NEWS! internal error in subroutine IBZKPT: Error: point        |
|     group operation missing 2                                               |
|                                                                             |
|       ---->  I REFUSE TO CONTINUE WITH THIS SICK JOB ... BYE!!! <----       |
|                                                                             |
 -----------------------------------------------------------------------------

VASP5:

 VERY BAD NEWS! internal error in subroutine IBZKPT:
 Error: point group operation missing       2

The point_group error looks for the entire message on a single line:

"point_group": ["Error: point group operation missing"],

Suggested solution (if known)

It's not clear to me what the best solution is. Ideally we would move away from reading one line at a time and instead detect the entire string, regardless of where a line break occurs, but that may be difficult in practice. Alternatively, perhaps we need to shorten the error messages that are detected so that custodian will catch them in both VASP5 and VASP6.
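
As an illustration of the second option, a shortened search string that stays on a single line in both the VASP 5 and VASP 6 output (hypothetical entry, not the current table):

error_msgs = {
    # "Error: point" and "group operation missing 2" end up on different lines
    # in the VASP 6 banner, so match only the part that stays on one line.
    "point_group": ["group operation missing"],
}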

Fix DriftErrorHandler or remove it from default handlers

So drift in forces is obviously important to catch, since it means that you might be misled into thinking you are more force converged than you actually are. However, it is not a "failure" in the traditional sense: the calculation can still finish and be otherwise accurate. One could imagine a warning in, say, the atomate drone if ingesting a calculation where drift > desired force convergence, and warn there, instead of causing outright failures.

Currently, the DriftErrorHandler fails to actually fix drift issues, and instead causes a large quantity of jobs to fail since it's in the default handler group... I've seen 1.4k issues in the last month.

Tagging @shyamd since I know it's your creation -- do you have any suggestions to fix drift issues, beyond the strategies in the handler?

ImportError for VASP jobs using pymatgen>=2019.9.12

System

  • Custodian version: 2019.8.24
  • Pymatgen version: 2019.9.16
  • Python version: 3.7
  • OS version: RHEL7

Summary

  • Custodian job for VASP fails with ImportError when trying to import VaspInput from pymatgen (2019.9.16)

Error message

Traceback (most recent call last):
  File "/home/users/stamma58/.conda/envs/custodian/bin/cstdn", line 10, in <module>
    sys.exit(main())
  File "/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/custodian/cli/cstdn.py", line 113, in main
    args.func(args)
  File "/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/custodian/cli/cstdn.py", line 82, in run
    c = Custodian.from_spec(d)
  File "/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/custodian/custodian.py", line 322, in from_spec
    cls_ = load_class(d["jb"])
  File "/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/custodian/custodian.py", line 302, in load_class
    mod = __import__(modname, globals(), locals(), [classname], 0)
  File "/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/custodian/vasp/jobs.py", line 13, in <module>
    from pymatgen.io.vasp import VaspInput, Incar, Poscar, Outcar, Kpoints, Vasprun
ImportError: cannot import name 'VaspInput' from 'pymatgen.io.vasp' (/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/pymatgen/io/vasp/__init__.py)

Suggested solution (if known)

Starting from pymatgen version 2019.9.12 VaspInput is no longer available from the pymatgen.io.vasp namespace. Suggested solution:

  • Change the import from "from pymatgen.io.vasp import VaspInput" to "from pymatgen.io.vasp.inputs import VaspInput"

[Bug] Skipping over monitors at the end of job.

System

  • Custodian version: https://github.com/nwinner/custodian branch cp2k. Issue does not relate to this branch, but I am technically using handlers that exist in my fork, rather than already exist in the main repo.
  • Python version: 3.7

Summary

Consider a handler that checks for convergence. It checks whether convergence has not been reached while the job is running, and so it is a monitor. It can happen, however, that the program finishes (because max steps were reached) and exits before custodian cycles around to do another check of the monitors. Custodian will then proceed to do the final check.

It appears that, the way custodian is written, if any error was found and stored in the bool variable has_error, then instead of running a final check with all handlers, only those with is_monitor=False will be checked. Because of this, my convergence handler was not included in the final check, so it was bypassed and the job was marked as completed because a different handler addressed the problem that set has_error.

Maybe I'm not understanding the logic of the code, but this does not seem to be the way it should function. Clarifications would be appreciated if this is intended. I'm not sure about a proposed solution because there could be collisions where running every handler at the end causes a monitor to be run twice.

Example code

This code snippet is taken from custodian.py:448

            # While the job is running, we use the handlers that are
            # monitors to monitor the job.
            if isinstance(p, subprocess.Popen):
                if self.monitors:
                    n = 0
                    while True:
                        n += 1
                        time.sleep(self.polling_time_step)
                        if p.poll() is not None:
                            break
                        terminate = self.terminate_func or p.terminate
                        if n % self.monitor_freq == 0:
                            has_error = self._do_check(self.monitors,
                                                       terminate)
                        if terminate is not None and terminate != p.terminate:
                            time.sleep(self.polling_time_step)
                else:
                    p.wait()
                    if self.terminate_func is not None and \
                            self.terminate_func != p.terminate:
                        self.terminate_func()
                        time.sleep(self.polling_time_step)

                zero_return_code = p.returncode == 0

            logger.info("{}.run has completed. "
                        "Checking remaining handlers".format(job.name))
            # Check for errors again, since in some cases non-monitor
            # handlers fix the problems detected by monitors
            # if an error has been found, not all handlers need to run
            if has_error:
                self._do_check([h for h in self.handlers
                                if not h.is_monitor])
            else:
                has_error = self._do_check(self.handlers)
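
One possible adjustment, sketched below purely as an illustration (not a vetted patch), would be to give the monitors one last pass in the error branch as well; whether that double-checking is acceptable is exactly the collision concern mentioned above.

# Hypothetical variant of the final check quoted above (illustration only):
# run the non-monitor handlers first, then give the monitors one last look so
# that e.g. a convergence monitor is not skipped just because another handler
# already set has_error during the run.
if has_error:
    has_error = self._do_check(
        [h for h in self.handlers if not h.is_monitor]) or has_error
    has_error = self._do_check(self.monitors) or has_error
else:
    has_error = self._do_check(self.handlers)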

non-ideal error message for max_errors hit

System

  • master branch (including latest fix to returncode), Py27, Linux

Summary

  • In my specific run, the same error comes up twice in a row (EDDRM)
  • The second time it happens, custodian says that it hits max_errors (I think this part is OK) and raises a custodian error to exit.
  • What happens next is that this results in a non-zero return code, which intercepts that message and is then used to raise a "nonzero returncode error".
  • Thus the final output message about a nonzero return code is less helpful for debugging runs than the original text about max errors being hit.

Error message

The stack trace I get back is:

Traceback (most recent call last):
  File "/projects/matqm/matmethods_env/codes/fireworks/fireworks/core/rocket.py", line 224, in run
    m_action = t.run_task(my_spec)
  File "/projects/matqm/matmethods_env/codes/atomate/atomate/vasp/firetasks/run_calc.py", line 167, in run_task
    c.run()
  File "/projects/matqm/matmethods_env/codes/custodian/custodian/custodian.py", line 323, in run
    .format(self.total_errors, ex))
RuntimeError: 1 errors reached: (CustodianError(...), u'Job return code is 1. Terminating...'). Exited...

You can see that it's difficult to know from above that max_errors was reached and that is why we are exiting. You can figure it out though if you look at custodian.py line 323.

Files

The run is located in:
/projects/ps-matqm/prod_runs/block_2016-12-20-23-00-16-536064/launcher_2016-12-20-23-00-35-095234/launcher_2016-12-21-09-28-04-031609

Suggested solution (if known)

  • Actually on first glance I am not even sure why this is happening. As far as I can tell, when line 323 throws an exception, the line of code about return code validation should never even run.
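
If the goal is simply a clearer exit message, one hypothetical approach (the standalone function below is illustrative only and is not actual custodian code) is to fold the last handler error into the max-errors exception text when exiting:

# Hypothetical sketch only -- this helper does not exist in custodian.
def raise_if_max_errors(total_errors, max_errors, last_error):
    if total_errors >= max_errors:
        raise RuntimeError(
            "max_errors ({}) reached; last error was: {!r}. Exiting...".format(
                max_errors, last_error))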

improvement to max errors with multi-job

System

  • Custodian version: master

Summary

  • If I have a 2-step structure relaxation, and the first relaxation hits MAX_ERRORs, the 2nd job still seems to perform some of the maintenance steps of "copying files to relax1 extension".
  • This makes it more difficult to realize that the 2nd job never actually ran. e.g., the directory contains both "OUTCAR.relax1" and "OUTCAR". Usually, that means that OUTCAR is for the 2nd job and OUTCAR.relax1 is for the first job, and the second job failed. In this case, the two files are identical since the first job hit max errors.
  • see uploaded file error.tar.gz (CHGCAR/WAVECAR/PROCAR removed) for an example where the first relaxation failed due to max errors

Files

error.tar.gz

file_delete of CHGCAR/WAVECAR and nscf runs

System

Most recent custodian run on NERSC

Summary

  • Many VASP custodian handlers include a _file_delete action for CHGCAR and WAVECAR as part of their operation, following the VASP "recommendation"
  • However, non-SCF runs (where the charge density is held fixed) require these files to be present. If they are deleted and VASP is restarted, the job will immediately fail.

Error message

 reading WAVECAR
 WARNING: chargedensity file is incomplete
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10

Suggested solution (if known)

Handlers that delete the CHGCAR/WAVECAR should first check whether ICHARG > 10. If it is, then do not perform the deletion of CHGCAR/WAVECAR.
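
A minimal sketch of that guard, assuming the handler reads the INCAR via pymatgen and queues deletions using the usual custodian file-action dict format:

from pymatgen.io.vasp.inputs import Incar

# Sketch: only queue the CHGCAR/WAVECAR deletion when the run does not rely on
# a fixed charge density (ICHARG > 10 means the charge density is kept fixed).
incar = Incar.from_file("INCAR")
actions = []
if incar.get("ICHARG", 0) <= 10:
    actions.append({"file": "CHGCAR", "action": {"_file_delete": {"mode": "actual"}}})
    actions.append({"file": "WAVECAR", "action": {"_file_delete": {"mode": "actual"}}})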

Files

An example run directory of this error will be available for the next couple of weeks if needed. Let me know.

ISTART in run_vasp

What's the use case for setting this in run_vasp? It causes NELMDL issues if LWAVE = False. I think the VASP default would be better.

Can't use VaspErrorHandler in IBRION=-1 or ISIF=5,6,7 runs

"rot_matrix" and "pricel" errors currently use a PerturbStructureTransformation (which is incompatible with these modes).
In my experience, "rot_matrix" is usually caused by difficulty with finding symmetry in even monkhorst grids, and is fixed by using an odd grid (though for some reason ISYM=0 doesn't fix it).

I think "pricel" should be fixed by setting ISYM=0.

Is there any objection to me making these changes?

[VASP] NonConvergingErrorHandler for meta-GGAs/hybrids

Currently, the SCF algorithm ladder in NonConvergingErrorHandler is tuned for GGAs and is not appropriate for meta-GGAs or hybrids. For instance, meta-GGAs should rarely be used with Algo = VeryFast or Fast (Algo = All is generally recommended and is the default in MPScanRelaxSet). For hybrids, Algo = Fast or VeryFast should never be used according to the VASP manual, as hybrids don't support these algorithms even though no warning is printed: https://www.vasp.at/wiki/index.php/LHFCALC.
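
As an illustration only (the ladders below are placeholders, not the ones proposed in #179), the handler could pick its ladder based on the functional tags in the INCAR:

from pymatgen.io.vasp.inputs import Incar

# Sketch: choose an SCF-algorithm ladder depending on the functional in use.
incar = Incar.from_file("INCAR")
if incar.get("LHFCALC", False) or incar.get("METAGGA"):
    # hybrids and meta-GGAs: avoid Fast/VeryFast entirely
    algo_ladder = ["Normal", "All", "Damped"]
else:
    algo_ladder = ["VeryFast", "Fast", "Normal", "All"]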

Will be closed when #179 is merged.

Terminate Function Seems Not Working Properly With SLURM

This issue only matters when using VASP monitors and only when running jobs in a SLURM environment. It seems to originate from the Popen.terminate() Python API. If a job terminates normally, everything works properly. However, once a VASP job is killed by Popen.terminate(), which happens when a monitor finds an error, no new VASP job will be able to run. The error message is "Unable to create job step: Job/step already completing or completed". The power of custodian is completely lost in this situation.

How do you know whether you are affected? If the success rate of your recent calculations has degraded, you are probably already affected.

I reproduced the error using a simple Python script without custodian and asked NERSC staff to fix it. They confirmed the issue and promised to investigate, but after two weeks I have not received any update. As a precaution, I think we should consider the possibility that they will not be able to fix it in the short term.

I have a workaround for this issue in my fork. The point is to use "scancel" to kill the VASP job; this releases the resources successfully while Popen.terminate() does not. It has been working fine for the last two weeks. However, it makes the custodian code more complicated and environment-dependent, so I am not sure whether it is a good idea to merge this piece of code into the main repository. A simplified sketch of the idea follows.
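
A minimal sketch of a scancel-based terminate function, assuming a SLURM batch environment (the step-id handling is simplified here for illustration and is not the exact code in the fork):

import os
import subprocess

def slurm_terminate(step_id="0"):
    """Cancel the running srun step via scancel so SLURM releases the resources.

    Sketch only: assumes SLURM_JOB_ID is set in the environment; the step id
    defaults to "0" purely for illustration and should be tracked by the caller.
    """
    job_id = os.environ.get("SLURM_JOB_ID")
    if job_id:
        subprocess.call(["scancel", "{}.{}".format(job_id, step_id)])

# e.g. passed to Custodian as its terminate_func so monitors use it instead of
# Popen.terminate():
# c = Custodian(handlers, jobs, terminate_func=slurm_terminate)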
