materialsproject / custodian Goto Github PK

View Code? Open in Web Editor NEW

130.0 130.0 102.0 97.82 MB

A simple, robust and flexible just-in-time job management framework in Python.

License: MIT License

Python 100.00%

custodian's People

Contributors

Stargazers

Watchers

Forkers

navnidhirajput davidwaroquiers image-tester matk86 xingfenghe cnk118 montoyjh cci-smoketests dcossey014 kimh aykol tallakahath xhqu1981 altynbekm mjgrav2001 traviskemper pombredanne willtu albalu dbroberg mkhorton captaindasheng qiwen77 mbercx mason-datamaterials mahdidavari czhengsci emberx ad1v7 hipeter sivonxay zbwang jackmaninov jkglasbrenner samblau utf gpetretto wenqiuhao mrauha jmmshn yimingchen-eng ciflord lingtikong rees-c nwinner janklinux mattmcdermott munrojm astamminger srshivani fieldplay acrutt rkingsbury cooperhp dixitmudit hitliaomq stephen-quiton jageo jic198 ab5424 sayred1 michaelwolloch jacksund orionarcher molmd htz1992213 lishchgithub hongzhentian malik-ust naik-aakash ganfisher tllu sohaibumr yyx5048 jialay jiha6 haidi-ustc yury-lysogorskiy rdguha1995 zuhalcakir kshumilov trellixvulnteam meihaojie ligerzero-ai bit-part-young hpatel1567 matthewkuner fyalcin utksi xivh r1j1t esoteric-ephemera zezhong-zhang evelynmitchell lamdalamda nancysnals zulissimeta andrew-s-rosen nhew1994

custodian's Issues

Better EDWAV error fix

The current fix for the EDWAV error is to change the smearing scheme to ISMEAR = 0, which is not a very reliable fix. The origin of this error appears to be a bug in VASP when it is compiled with the default -O2 level of optimization in some versions of the intel compiler. Changing the optimization level to -O1 appears to reliably fix the issue.

I suggest adding a warning or optional switch to an O1-version of VASP to fix the bug, instead of changing the smearing. For what its worth, EDWAV errors only show up when running ALGO = A or ALGO = D, which can't be run with the common ISMEAR = -5 scheme anyways.

Small changes should in converge_example.py

The key of "filename" in setting dict should change to "file", otherwise will raise ValueError("Unrecoginized format...").

if job_number < 2 and not converged:
                settings = [
                    {"dict": "INCAR",
                     "action": {"_set": {"ISTART": 1}}},
                    {"filename": "CONTCAR",
                     "action": {"_file_copy": {"dest": "POSCAR"}}}]

            #switch to RMM-DIIS once we are near the
            #local minimum (assumed after 2 runs of CG)
            else:
                settings = [
                    {"dict": "INCAR",
                     "action": {"_set": {"ISTART": 1, "IBRION": 1}}},
                    {"filename": "CONTCAR",
                     "action": {"_file_copy": {"dest": "POSCAR"}}}]

Above code should change to:

if job_number < 2 and not converged:
                settings = [
                    {"dict": "INCAR",
                     "action": {"_set": {"ISTART": 1}}},
                    {"file": "CONTCAR",
                     "action": {"_file_copy": {"dest": "POSCAR"}}}]

            #switch to RMM-DIIS once we are near the
            #local minimum (assumed after 2 runs of CG)
            else:
                settings = [
                    {"dict": "INCAR",
                     "action": {"_set": {"ISTART": 1, "IBRION": 1}}},
                    {"file": "CONTCAR",
                     "action": {"_file_copy": {"dest": "POSCAR"}}}]

[VASP] real_optlay assumes LREAL = Auto when this is not necessarily the case

Currently, the real_optlay INCAR swap implicitly assumes the user has set LREAL = Auto, but there is no guarantee this is the case. We wouldn't want LREAL = False to be switched by accident. Also, we probably shouldn't be switching to LREAL = True for large systems regardless -- that goes against the recommendations in the VASP manual.

Will be closed once #182 is merged.

VaspErrorHandler not handling IBZKPT: tetrahedron method fails

For me, VaspErrorHandler doesn't trigger on IBZKPT: tetrahedron method fails for NKPT<4. NKPT = 3 even though it's supposed to handle this:

custodian/custodian/vasp/handlers.py

Lines 63 to 72 in d3cb039

    
           error_msgs = { 
        
               "tet": [ 
        
                   "Tetrahedron method fails", 
        
                   "Fatal error detecting k-mesh", 
        
                   "Fatal error: unable to match k-point", 
        
                   "Routine TETIRR needs special values", 
        
                   "Tetrahedron method fails (number of k-points < 4)", 
        
                   "BZINTS", 
        
               ],

custodian/custodian/vasp/handlers.py

Lines 151 to 166 in d3cb039

    
           with open(self.output_filename, "r") as f: 
        
               for line in f: 
        
                   l = line.strip() 
        
                   for err, msgs in VaspErrorHandler.error_msgs.items(): 
        
                       if err in self.errors_subset_to_catch: 
        
                           for msg in msgs: 
        
                               if l.find(msg) != -1: 
        
                                   # this checks if we want to run a charged 
        
                                   # computation (e.g., defects) if yes we don't 
        
                                   # want to kill it because there is a change in 
        
                                   # e-density (brmix error) 
        
                                   if err == "brmix" and "NELECT" in incar: 
        
                                       continue 
        
                                   self.errors.add(err) 
        
                                   error_msgs.add(msg) 
        
           for msg in error_msgs:

custodian/custodian/vasp/handlers.py

Lines 178 to 186 in d3cb039

    
           if self.errors.intersection(["tet", "dentet"]): 
        
               if vi["INCAR"].get("KSPACING"): 
        
                   # decrease KSPACING by 20% in each direction (approximately double no. of kpoints) 
        
                   actions.append( 
        
                       { 
        
                           "dict": "INCAR", 
        
                           "action": {"_set": {"KSPACING": vi["INCAR"].get("KSPACING") * 0.8}}, 
        
                       } 
        
                   )

Could the reason be the different cases?

Tetrahedron method fails
tetrahedron method fails

Feature request: Nonsensical/non-existent parameter check for INCAR files

VASP by default does not have a way of checking if the user entered in the name of an INCAR parameter correctly and therefore will still run even if there are nonsensical parameters in the file. e.g. If one were to set a value for "NBAND" instead of "NBANDS" in the INCAR, VASP will ignore any the former. Another example is if someone sets METAGGA=Scan in ver. 5.4.1 even though this metagga parameter does not exist in this version, nothing will happen. This can make troubleshooting confusing sometimes if a user gets results that are not aligned with their expectations and don't know why.

I don't think custodian has such checks yet, but would it be possible to implement it?

Yaml logs

How about adding an option to write logs to custodian.yaml instead of custodian.json? More readable IMO and fits more logs onto one screen. Yaml is slower to write presumably but won't be the bottleneck.

Monitor capability of handlers for VaspJob is broken

Jobs experiencing an error (e.g. aliasing) will complete before custodian corrects them (and then restarts them). Seems to be caused by p.communicate() in VaspJob.run(). Is there any reason this is necessary? I'm not sure what bug it was supposedly fixing

why isn't "copy_magmom" defaulted to True in double_relaxation_run()?

Custodian contains a built-in function to generate double relaxation vasp jobs. Is there a reason to not default the copy_magmom to True in this case?

Similar for full_opt_run (although that is not used by Materials Project)

Failed to terminate an outdated calculation but create a new one at same time (within a single firework).

System

Custodian version: 0.8.8
Python version: 2.7.11
OS version: Red Hat Enterprise Linux Server release 6.3 (Santiago)
MPenv version: sjtu branch
pymatgen version: 3.5.0 (develop model)
Fireworks version: 1.2.7 (develop model)
MPWorks version: 0.1dev0.1
VASP version: 5.4.1.05Feb16 (also tried 5.3.3)

Summary

vasp calculation with custodian failed to terminate an calculation outdated job when the now job has already created.
to create 5 jobs at most belonging to a single firework.

After detecting error, the job shall stop and incar shall be modified and new calculation shall be resubmitted.But right now old calculations were not terminated though new calculation with modified incar submitted, which leads to overloading computer nodes.

I have no idea, but I want know which is the relevant code of "terminate_func"?

Example code and Error message is showed in #25

VASP called by custodian is slower than directly called

System

Custodian version: 2020.4.27
Python version: 3.6
OS version: RedHat 8.1

Summary

The univerisity have a migration from Ubuntu or Linux to RedHat, and I met a problem that VASP is slower if I call it from custodian, compared to call it directly by "mpirun -np 16 vasp_std". Is there any possible reasons for this kind of weird behaviour?

Example code

from custodian.custodian import Custodian
from custodian.vasp.handlers import VaspErrorHandler, \
    UnconvergedErrorHandler
from custodian.vasp.jobs import VaspJob
import os

vasp = 'vasp_std'
node = os.environ['NCPUS']
vasp_cmd = ['mpirun',"-np", str(node), vasp] # 
handlers = [VaspErrorHandler()]
jobs = VaspJob(vasp_cmd, auto_npar=False, auto_gamma=False )
c = Custodian(handlers, [jobs], max_errors=10)
c.run()

With the same input files, OUTCARs after one hour running:

Custodian - Only the first iteration finished

First call to EWALD:  gamma=   0.147
 Maximum number of real-space cells 5x 5x 1
 Maximum number of reciprocal cells 2x 2x 7

    FEWALD:  cpu time    0.2542: real time    0.2548


--------------------------------------- Iteration      1(   1)  ---------------------------------------


    POTLOK:  cpu time    0.2242: real time    0.2320
    SETDIJ:  cpu time    0.2767: real time    0.2774

VASP called directly- Iteration 28 finished


  energy without entropy =     -126.53388816  energy(sigma->0) =     -126.64889039


--------------------------------------------------------------------------------------------------------




--------------------------------------- Iteration      1(  28)  ---------------------------------------


    POTLOK:  cpu time    0.2173: real time    0.2239
    SETDIJ:  cpu time    0.0101: real time    0.0102

Any suggestions? Thanks.

installation error

unable to install it on nersc

Got error like this.

$ python setup.py develop
Traceback (most recent call last):
File "setup.py", line 11, in
long_desc = f.read()
File "/global/u1/m/mliu/ml_web/virtenv_ml_web/lib/python2.7/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3323: ordinal not in range(128)

Include CONTCAR in VASP_BACKUP_FILES

When reporting bugs/issues, please supply the following information. If this
is a feature request, please simply state the requested feature.

Summary

The list of VASP_BACKUP_FILES doesn't include the CONTCAR
Some of the custodian error corrections claim they are trying to compensate for bad ionic step updates (I am investigating the PotimHandler in particular)
It would be nice to be able to directly verify that a bad ionic update happened by looking at the what the CONTCAR looked like vs the POSCAR. Otherwise I can't tell if the error detection's claim of bad ionic update is correct or not.
More generally, the CONTCAR will help me debug where the structure started and ended in the run and doesn't take much space.

System

Custodian version: 1.0.0
Python version: 3.5.1 & 2.7
OS version: OS X 10.11.5 & Ubuntu (Circle CI)

Summary

A few deprecated methods in pymatgen NOT removed in 1.0.0 release.

Example code

>>> from custodian.vasp.jobs import VaspJob

Error message

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/zhideng/.pyenv/versions/3.5.1/Python.framework/Versions/3.5/lib/python3.5/site-packages/custodian/vasp/jobs.py", line 26, in <module>
    from pymatgen.io.smart import read_structure
ImportError: No module named 'pymatgen.io.smart'

System

Custodian version: master
Python version: 2.7

Summary

I am running custodian with half_kpts option for Si. The job fails almost immediately
The downstream error looks like this: "/global/project/projectdirs/m2439/matmethods_test/pymatgen/pymatgen/io/vasp/inputs.py\", line 1154, in from_string\n kpts = [int(i) for i in lines[3].split()]\nValueError: invalid literal for int() with base 10: '4.0'\n"
This is caused by the KPOINTS file written by custodian which looks like this

Automatic kpoint scheme
0
Gamma
4.0 4.0 4.0

Note the floating points. Note also that the original KPOINTS file was 8 8 8, no floats. So it is custodian that is adding the floating point.

Error message

See also this custodian.json fragment which shows it overriding the KPOINTS with float:

[
    {
        "corrections": [], 
        "job": {
            "settings_override": [
                {
                    "action": {
                        "_set": {
                            "comment": "Automatic kpoint scheme", 
                            "usershift": [
                                0, 
                                0, 
                                0
                            ], 
                            "labels": null, 
                            "tet_number": 0, 
                            "tet_connections": null, 
                            "@module": "pymatgen.io.vasp.inputs", 
                            "nkpoints": 0, 
                            "coord_type": null, 
                            "kpts_weights": null, 
                            "@class": "Kpoints", 
                            "tet_weight": 0, 
                            "kpoints": [
                                [
                                    4.0, 
                                    4.0, 
                                    4.0
                                ]
                            ], 
                            "generation_style": "Gamma"
                        }
                    }, 
                    "dict": "KPOINTS"
                }
            ],

Let me know if you need more info

Two tests broken by recent TravisCI changes in vasp/tests/test_jobs.py

When updating TravisCI build parameters to allow for OpenBabel-dependent tests to run, two tests in Vasp's test_jobs.py began to fail despite being seemingly unrelated to the changes. More specifically, test_setup in both VaspJobTest and VaspNEBJobTest had problems related to multiprocessing.cpu_count(). I commented the failing sections out and would appreciate if someone with more Vasp experience could fix them. Thanks!

Unfreezef and Restart option for FEFF calculation

Current FEFF nonconvergence error correction relies on the adjusting of the maximum iterations and convergence accelerator factor. An alternative option, i.e. RESTART card, could be implemented for more robust error correction. If RESTART is specified, FEFF will start the SCF calculation of the potentials from an existing "pot.bin" file, this way one can continue an earlier SCF calculation and saves SCF potential calculation time.

[VASP] [Bug] Second posmap error does not change SYMPREC

When the VASP "posmap" error occurs twice, SYMPREC should be changed to 1e-4. However, it stays 1e-6 because the error counter was never updated. A test was also not written for the second SYMPREC error handler.

Edit: Will be closed when #183 is merged.

Custodian should support validators

As per Will's suggestion. To implement scheme for checkers that do not actually correct anything, but validates something at the end of a job.

[VASP] Add handler for PSMAXN warnings and associated failures

Summary

VASP failure associated with PSMAXN warning is not recognized as an error by Custodian, triggering misleading ValidationError

Details

With certain combinations of ENCUT, LREAL, and pseudopotentials, VASP issues the warning

WARNING: PSMAXN for non-local potential too small

In some cases vasp still runs successfully, in other cases it will fail, e.g.

WARNING: PSMAXN for non-local potential too small
 LDA part: xc-table for Pade appr. of Perdew
 POSCAR, INCAR and KPOINTS ok, starting setup
 REAL_OPT: internal ERROR:         -32         -32         -32           0
 VASP aborting ...
 REAL_OPT: internal ERROR:         -32         -32         -32           0
 VASP aborting ...
 REAL_OPT: internal ERROR:         -32         -32         -32           0
 VASP aborting ...
 REAL_OPT: internal ERROR:         -32         -32         -32           0
...

It appears that Custodian does not recognize the above type of failure as an error. As a result, _run_job() will attempt to validate the output of the calculation (which never ran in the first place and therefore never generated output) and raise a ValidationError .

Traceback (most recent call last):
  File "/global/u2/r/rsking84/.conda/envs/cms/code/fireworks/fireworks/core/rocket.py", line 262, in run
    m_action = t.run_task(my_spec)
  File "/global/u2/r/rsking84/.conda/envs/cms/code/atomate/atomate/vasp/firetasks/run_calc.py", line 211, in run_task
    c.run()
  File "/global/u2/r/rsking84/.conda/envs/cms/code/custodian/custodian/custodian.py", line 378, in run
    self._run_job(job_n, job)
  File "/global/u2/r/rsking84/.conda/envs/cms/code/custodian/custodian/custodian.py", line 502, in _run_job
    raise ValidationError(s, True, v)
custodian.custodian.ValidationError: Validation failed: VasprunXMLValidator

The ValidationError is very difficult to troubleshoot without running vasp manually. In this situation, the contents of vasp.out are empty and std_err.txt contains only

srun: fatal: Can not execute vasp_std

System

Custodian version: 0.8.8
Python version: 2.7.11
OS version: Red Hat Enterprise Linux Server release 6.3 (Santiago)
MPenv version: sjtu branch
pymatgen version: 3.5.0 (develop model)
Fireworks version: 1.2.7 (develop model)
MPWorks version: 0.1dev0.1
VASP version: 5.4.1.05Feb16 (also tried 5.3.3)

Summary

vasp calculation with custodian failed to terminate an outdated job when the now job has already created.
to create 5 jobs at most belonging to a single firework.

Example code

from mpworks.submission.submission_mongo import SubmissionMongoAdapter
from pymatgen.matproj.snl import StructureNL
from pymatgen.core.structure import Structure
from pymatgen.transformations.standard_transformations import SubstitutionTransformation

# set submission db
sma         = SubmissionMongoAdapter.auto_load()
# read structure from POSCAR file.
bs_struc    = Structure.from_file("POSCAR")
# some substitution, then bs_struc→re_struc
snl         = StructureNL(re_struc, 'KeLiu <[email protected]>')
sma.submit_snl(snl, '[email protected]', parameters=None)

then go_submissions and qlaunch

qlaunch -r rapidfire --nlaunches infinite -m 4 --sleep 100 -b 10000

Error message

fireworks: Cu1_La1_Te2--GGA_optimize_structure_(2x)
Cu1_La1_Te2--GGA_opt-11690.error

INFO:custodian.custodian:Run started at 2016-04-20 14:24:05.227027 in /lustre/home/umjzhh-1/launcher/layered_material/ycute2/substitution_01stRun/block_2016-04-18-11-18-03-912778/launcher_2016-04-20-01-37-05-453043.
INFO:custodian.custodian:Custodian running on Python version 2.7.11 |Continuum Analytics, Inc.| (default, Dec  6 2015, 18:08:32)  [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 1. Errors thus far = 0.
INFO:root:Running mpirun -n 32 vasp
INFO:custodian.custodian:Terminating job
INFO:root:Backing up run to error.1.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af3687fce10>, u'errors': [u'too_few_bands'], u'actions': [{u'action': {u'_set': {u'NBANDS': 28}}, u'dict': u'INCAR'}]}
INFO:root:Backing up run to error.2.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.PositiveEnergyErrorHandler object at 0x2af3687fcf90>, u'errors': [u'Positive energy'], u'actions': [{u'action': {u'_set': {u'ALGO': u'Normal'}}, u'dict': u'INCAR'}]}
INFO:custodian.custodian:VaspJob.run has completed. Checking remaining handlers
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 2. Errors thus far = 2.
INFO:root:Running mpirun -n 32 vasp
INFO:custodian.custodian:Terminating job
INFO:root:Backing up run to error.3.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af3687fce10>, u'errors': [u'eddrmm'], u'actions': [{u'action': {u'_set': {u'POTIM': 0.25}}, u'dict': u'INCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'CHGCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'WAVECAR'}]}
INFO:custodian.custodian:VaspJob.run has completed. Checking remaining handlers
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 3. Errors thus far = 3.
INFO:root:Running mpirun -n 32 vasp
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 19
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 23
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 14
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 17
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 8
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 11
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 21
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 12
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 15
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 22
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 7
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 10
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 16
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 18
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 29
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 13
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 20
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 25
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 31
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 27
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 24
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 28
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 26
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 30
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 9
INFO:custodian.custodian:Terminating job
INFO:root:Backing up run to error.4.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af3687fce10>, u'errors': [u'zpotrf', u'eddrmm'], u'actions': [{u'action': {u'_set': {u'ISYM': 0, u'POTIM': 0.125}}, u'dict': u'INCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'CHGCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'WAVECAR'}, {u'action': {u'_set': {u'POTIM': 0.125}}, u'dict': u'INCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'CHGCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'WAVECAR'}]}
INFO:custodian.custodian:VaspJob.run has completed. Checking remaining handlers
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 4. Errors thus far = 4.
INFO:root:Running mpirun -n 32 vasp
INFO:custodian.custodian:Terminating job
INFO:root:Backing up run to error.5.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af3687fce10>, u'errors': [u'subspacematrix'], u'actions': [{u'action': {u'_set': {u'LREAL': False}}, u'dict': u'INCAR'}]}
INFO:custodian.custodian:VaspJob.run has completed. Checking remaining handlers
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 5. Errors thus far = 5.
INFO:root:Running mpirun -n 32 vasp
INFO:custodian.custodian:Terminating job
INFO:root:Backing up run to error.6.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af3687fce10>, u'errors': [u'subspacematrix'], u'actions': [{u'action': {u'_set': {u'LREAL': False}}, u'dict': u'INCAR'}]}
INFO:custodian.custodian:VaspJob.run has completed. Checking remaining handlers
INFO:custodian.custodian:Max errors reached.
ERROR:custodian.custodian:MaxErrors
INFO:custodian.custodian:Logging to custodian.json...
INFO:custodian.custodian:Run ended at 2016-04-20 15:47:38.863389.
INFO:custodian.custodian:Run completed. Total time taken = 1:23:33.636362.
Traceback (most recent call last):
  File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/core/rocket.py", line 213, in run
    m_action = t.run_task(my_spec)
  File "/lustre/home/umjzhh-1/kl_me2/codes/MPWorks/mpworks/firetasks/custodian_task.py", line 115, in run_task
    custodian_out = c.run()
  File "/lustre/home/umjzhh-1/kl_me2/codes/custodian/custodian/custodian.py", line 221, in run
    .format(self.total_errors, ex))
RuntimeError: 6 errors reached: (CustodianError(...), u'MaxErrors'). Exited...
INFO:rocket.launcher:Rocket finished

the relevant SLURM status:

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
11690        Cu1_La1_T+        cpu acct-umjz+         32  COMPLETED      0:0
11690.batch       batch            acct-umjz+         32  COMPLETED      0:0
11690.0       pmi_proxy            acct-umjz+          2  COMPLETED      0:0
11690.1       pmi_proxy            acct-umjz+          2     FAILED      7:0
11690.2       pmi_proxy            acct-umjz+          2     FAILED      7:0
11690.3       pmi_proxy            acct-umjz+          2     FAILED      7:0
11690.4       pmi_proxy            acct-umjz+          2     FAILED      7:0

Files

Check POSCAR under input folder

The VaspCustodianTask function from fireworks_vasp.tasks seems to be checking if there is a poscar under the folder or not.

This becomes a problem when I ran it for the NEB input sets, where all poscars are written in 00~0x sub folders. And there is no poscar under the input file-folder.

VaspCustodianTask(vasp_cmd=['aprun', '-n', '48', '/global/homes/r/rongzq/bin/vasp_5.3.3/vasp_hopper/vasp', '>&', 'stdout'], handlers='all', custodian_params=custodian_params)

Potential issue with MaxForceErrorHandler

I believe that there is a problem with the way the MaxForceErrorHandler is implemented.
The correct method only acts on the EDIFFG key, leaving the EDIFF unchanged. This might be fine if this handler is called only once and if there is a large enough factor between EDIFF and EDIFFG. However, for example I ended up in a situation in which this handler was applied multiple times, causing the value of EDIFFG to go below that of EDIFF, which is clearly a bad configuration.

In general I would say that it would preferable to keep the same ratio between EDIFF and EDIFFG. i.e. EDIFF could be multiplied by 0.5 as well. If this seems unsuitable, at the very least it should be checked that EDIFFG does not go below EDIFF.

Custodian Monitor Not Working with terminate_on_nonzero_returncode=True

System

Custodian version: master branch, revision af15bda and 5f0e8b
Python version: 3.5.2
OS version: Cray XC40

Summary

This test compares two revisions in the master branch: 1) "5f0e8b9" which is after adding the return code exception raising code, 2) "af15bda" which before adding return code logic. The "run_vasp" is command is used to perform the test. No workflows messed up with the test, only a simple SLURM script and custodian package were involved.

The most important message is from "run.log" file.

With revision "5f0e8b9":

After "{'errors': ['eddrmm']...", the custodian says "Job return code is 137. Terminating...".

With revision "af15bda":

After "After "{'errors': ['eddrmm']...", the custodian says "Starting job no. 1 (VaspJob) attempt no. 2". There are other issues which are specific to my environment and are already solved, which I don't think will affect the conclusions. The bottom line is that with the old code is trying to fix the error while the new code exits prematurely.

Example code

module load vasp/5.4.1
run_vasp -c "srun -n 32 vasp_std" static

Error message

"Job return code is 137. Terminating..." vs "Starting job no. 1 (VaspJob) attempt no. 2".

Files

cus_5f0e8b9_after_rc.tar.gz

cus_af15bda_before_rc.tar.gz

Updating subspacematrix error handling with VASP

In the custodian/vasp/handlers.py script, there is an error-handling method for the "subspacematrix" error (i.e. "WARNING: Sub-Space-Matrix is not hermitian in DAV"). As it stands right now, the custodian file tries to change LREAL when this error occurs. I have found that in systems I have studied, this error often occurs when PREC is not set to Accurate. I have attached an example set of input files. It runs with PREC = Accurate but has the subspace error after 7-10 SCF iterations when PREC is not set (i.e. it's the default value of Normal). It therefore may be desirable to add a PREC switch for this error message. Note that this may be dependent on the particular VASP build. I've tested this on VASP 5.4.1 for the record.
INCAR.txt
KPOINTS.txt
POSCAR.txt
POTCAR.txt

Temporary modifications to input sets

Sometimes conjugate gradient doesn't work well for ionic relaxation in VASP. In some cases (mostly with high degrees of freedom, and hard+soft vibrational modes), I have found that the velocity quench algorithm (surprisingly) works better (even though damped MD doesn't help). The simple solution would be to switch the algo with the UnconvergedError handler. However, I think it may be be more effective to switch the algo back to the original (usually conjugate gradient) after the 1st relaxation completes.

Currently, this is difficult within the Custodian framework, as it is a state machine. My proposition is to allow error handlers to prepend actions to a list that runs during job post-processing.

Does anyone have alternative suggestions, or think that this is outside the scope of what Custodian should be able to handle?

Check returncode to raise CustodianError() leading some trouble

When reporting bugs/issues, please supply the following information. If this
is a feature request, please simply state the requested feature.

System

Custodian version: master
Python version: 2.7
OS version: CentOS release 6.8

Summary

the default terminate_on_nonzero_returncode is True.
_run_job will raise a CustodianError() if returns a nonzero return code (in L395).
something wrong gets in custodian.py L319~L323, it seems like missing total_errors, i am not sure.

Example code

terminated sub slrum job with MPWorks

Error message

$ more Eu1_S2_Sn1--GGA_opti-294848.error
INFO:custodian.custodian:Run started at 2016-12-19 21:03:28.011467 in /lustre/home/umjzhh-1/launcher/block_2016-10-12-18-29-00-023399/launcher_2016-12-19-13-03-16-629901.
INFO:custodian.custodian:Custodian running on Python version 2.7.12 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:42:40)  [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
INFO:custodian.custodian:Hostname: node077, Cluster: unknown
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 1. Errors thus far = 0.
INFO:root:Running srun -v -n 32 vasp
INFO:custodian.custodian:Terminating job
INFO:root:Terminate the job step using scancel --signal=KILL 294848.0
INFO:root:Backing up run to error.1.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af91dd3d0d0>, u'errors': [u'brmix', u'eddrmm'], u'actions': [{u'action': {u'_set': {u'ISTART': 1}}, u'dict': u'INCAR'}, {u'action': {u'_set':
{u'POTIM': 0.25}}, u'dict': u'INCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'CHGCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'WAVECAR'}]}
INFO:custodian.custodian:Job return code is 137. Terminating...
ERROR:custodian.custodian:Job return code is 137. Terminating...
INFO:custodian.custodian:Logging to custodian.json...
INFO:custodian.custodian:Run ended at 2016-12-19 21:13:39.441558.
INFO:custodian.custodian:Run completed. Total time taken = 0:10:11.430091.
Traceback (most recent call last):
  File "/lustre/home/umjzhh-1/keliu/envs/mpworks/codes/fireworks/fireworks/core/rocket.py", line 224, in run
    m_action = t.run_task(my_spec)
  File "/lustre/home/umjzhh-1/keliu/envs/mpworks/codes/mpworks/mpworks/firetasks/custodian_task.py", line 136, in run_task
    custodian_out = c.run()
  File "/lustre/home/umjzhh-1/keliu/envs/mpworks/codes/custodian/custodian/custodian.py", line 323, in run
    .format(self.total_errors, ex))
RuntimeError: 1 errors reached: (CustodianError(...), u'Job return code is 137. Terminating...'). Exited...
INFO:rocket.launcher:Rocket finished

Files

<contents of file 1>

Add error handlers to check for chgcar, aeccar integrity

See atomate PR hackingmaterials/atomate#276

Needs further investigation; @jmmshn could you attach some sample bad outputs? Suspect it may be an mpi-related issue where lines are getting transposed.

ALGO=All method for correcting unconverged electronic loops

System

Custodian version: master

Summary

If a job has not reached electronic converged (UnconvergedErrorHandler), the fix procedure in custodian is:

                new_settings = {"ISTART": 1,
                                "ALGO": "Normal",
                                "NELMDL": -6,
                                "BMIX": 0.001,
                                "AMIX_MAG": 0.8,
                                "BMIX_MAG": 0.001}

Recently, I've had pretty good luck with just setting ALGO=All when a job is unconverged rather than making all these changes (This is in fact how UnconvergedHandler already deals with SCAN failures, but I think it's more general than that). If it were just up to my qualitative assessment, I would first try ALGO=All; if that still doesn't work, to try making all the various changes above.

Downside: I don't have any hard facts to support my case.

Perhaps @shyamd or @montoyjh or @mkhorton or @tschaume can support or deny.

If we do want to switch to trying ALGO=All first, I'd be happy to do the implementation. So it's just a matter of deciding.

ERROR: no process found

Dear All,

Im using custodian, with python 3.6.8 in a a linux machine. The queue system is PBS.

When I launch the PBS script, instead of running the VASP binary, I run a python script instead that initializes a Custodian object and launches the job with 'vasp_cmd' being the command that I usually use to directly run VASP in a PBS script.

At first it runs, but then I get the following in the error log.

ERROR:custodian.custodian:
{ 'actions': [ { 'action': { '_set': { 'POTIM': 0.30000000000000004}},
'dict': 'INCAR'}],
'errors': [ 'brions'],
'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2acf80726b00>}
/software/bin/vasp: no process found

The file error1.tar.gz contains an empty std_err file. The OSZICAR file contains a completely normal behavior but its cut in the middle of the 3rd ionic step.

I dont know why Custodian runs well at first, and then decides to stop the ionic relaxation in the middle of a scf iteration, and when rerunning does not find the binary.

Does anybody know whats going on?

Best regards,

Oier.

CIC Energigune

VASP fix: ZHEGV error occurs when small # of atoms and too many cores

Currently, when the ZHEGV error appears in VASP, Custodian switches ALGO to All. This is the right approach, but for small systems (e.g. elementals with only 1 or 2 atoms/cell), oftentimes ZHEGV appears because too many cores are requested. With supercomputers having more and more cores per node these days, this comes up more frequently. The solution is just to decrease the number of cores for the job, but rather than trying to play around with that, Custodian could issue a warning to the user if len(structure) is below some cutoff. It would require some testing to figure out what a good value for this might be. My gut feeling is perhaps 5 atoms/cell.

Intent to add remote logging capabilities

We've having a lot of different users have VASP jobs fail for various reasons, and we'd like the capability to be able to aggregate Custodian errors so that we can more easily tell if there are errors that are occurring more often than would be expected (that is, that they reflect an underlying weakness in our input sets or error handlers, rather than user error).

I'd like to propose we integrate Sentry support via their Python client.

By default, this will result in no change to Custodian by default, except an additional optional dependency. However, if you set a SENTRY_DSN environment variable, we can set it up to log and aggregate errors automatically.

If there are no objections to this, I will go ahead and implement.

I think a similar addition to log stack traces in FireWorks when jobs fizzle might also be useful.

docstring of "StoppedRunHandler" looks incorrect

The docstring of "StoppedRunHandler" looks to be copied from a version of the CheckpointHandler docstring. I think it needs to be updated to pertain to StoppedRunHandler. I would do it myself, but I never used these handlers so can't comment intelligently on what they do.

If VASP does not even start running custodian doesn't raise error

When reporting bugs/issues, please supply the following information. If this
is a feature request, please simply state the requested feature.

System

Custodian version: 1.0.3 (master)
Python version: 2.7.12

Summary

Custodian doesn't raise error if VASP doesn't even run for any reason.
** example of importance: when files like OUTCAR are copied from the previous step, even if this step doesn't run, from OUTCAR and vasprun.xml, it looks like the job is already finished!
The problem might be in Custodian._run_job() method: after "p = job.run()" custodian just assumes that VASP started running while it might not even start due to an error like this.

Error message

(this example is on NERSC in std_err.txt):

srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: error: Unable to confirm allocation for job 3178405: Socket timed out on send/recv operation
srun: Check SLURM_JOB_ID environment variable for expired or invalid job.

but this may happen due to various reasons regardless of what this error message says.

System

Custodian version: latest master branch as of 2020-11-01
Python 3.8
VASP 6.1.1

Summary

VaspErrorHandler fails to detect certain error categories in VASP 6 that were detected properly in VASP 5. This appears to be a result of formatting changes associated with the error messages. My understanding is that custodian reads stdout line-by-line, but some of the error messages have been reformatted in such a way that they now break across multiple lines and hence are not caught.

See example below for the point_group error. I have also observed this for inv_rot_mat message, and I suspect it could affect other handlers as well.

Error message

VASP6:

 -----------------------------------------------------------------------------
|                                                                             |
|     EEEEEEE  RRRRRR   RRRRRR   OOOOOOO  RRRRRR      ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     E        R     R  R     R  O     O  R     R     ###     ###     ###     |
|     EEEEE    RRRRRR   RRRRRR   O     O  RRRRRR       #       #       #      |
|     E        R   R    R   R    O     O  R   R                               |
|     E        R    R   R    R   O     O  R    R      ###     ###     ###     |
|     EEEEEEE  R     R  R     R  OOOOOOO  R     R     ###     ###     ###     |
|                                                                             |
|     VERY BAD NEWS! internal error in subroutine IBZKPT: Error: point        |
|     group operation missing 2                                               |
|                                                                             |
|       ---->  I REFUSE TO CONTINUE WITH THIS SICK JOB ... BYE!!! <----       |
|                                                                             |
 -----------------------------------------------------------------------------

VASP5:

 VERY BAD NEWS! internal error in subroutine IBZKPT:
 Error: point group operation missing       2

The point_group error looks for the entire message on a single line:

"point_group": ["Error: point group operation missing"],

System

Custodian version: 2019.8.24
Pymatgen version: 2019.9.16
Python version: 3.7
OS version: RHEL7

Summary

Custodian job for VASP fails with ImportError when trying to import VaspInput from pymatgen (2019.9.16)

Error message

Traceback (most recent call last):
  File "/home/users/stamma58/.conda/envs/custodian/bin/cstdn", line 10, in <module>
    sys.exit(main())
  File "/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/custodian/cli/cstdn.py", line 113, in main
    args.func(args)
  File "/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/custodian/cli/cstdn.py", line 82, in run
    c = Custodian.from_spec(d)
  File "/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/custodian/custodian.py", line 322, in from_spec
    cls_ = load_class(d["jb"])
  File "/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/custodian/custodian.py", line 302, in load_class
    mod = __import__(modname, globals(), locals(), [classname], 0)
  File "/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/custodian/vasp/jobs.py", line 13, in <module>
    from pymatgen.io.vasp import VaspInput, Incar, Poscar, Outcar, Kpoints, Vasprun
ImportError: cannot import name 'VaspInput' from 'pymatgen.io.vasp' (/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/pymatgen/io/vasp/__init__.py)

System

Custodian version: https://github.com/nwinner/custodian branch cp2k. Issue does not relate to this branch, but I am technically using handlers that exist in my fork, rather than already exist in the main repo.
Python version: 3.7

Summary

Consider a handler that checks for convergence. It checks to see if convergence is not reached while the job is running, and so it is a monitor. It can happen, however, that the program finishes because max steps reached and exists before custodian cycles to do another check of the monitors. It will proceed to do the final check

It appears that the way that custodian is written, if any error was found and stored in the bool variable has_error, then instead of running a final check with all handlers, only ones with is_monitor=False will be checked. Because of this, my convergence handler was not included in the final check, and so it was bypassed and the job was marked as completed because a different handler addressed the problem that caused has_error.

Maybe I'm not understanding the logic of the code, but this does not seem to be the way it should function. Clarifications would be appreciated if this is intended. I'm not sure about a proposed solution because there could be collisions where running every handler at the end causes a monitor to be run twice.

Example code

This code snippet is taken from custodian.py:448

            # While the job is running, we use the handlers that are
            # monitors to monitor the job.
            if isinstance(p, subprocess.Popen):
                if self.monitors:
                    n = 0
                    while True:
                        n += 1
                        time.sleep(self.polling_time_step)
                        if p.poll() is not None:
                            break
                        terminate = self.terminate_func or p.terminate
                        if n % self.monitor_freq == 0:
                            has_error = self._do_check(self.monitors,
                                                       terminate)
                        if terminate is not None and terminate != p.terminate:
                            time.sleep(self.polling_time_step)
                else:
                    p.wait()
                    if self.terminate_func is not None and \
                            self.terminate_func != p.terminate:
                        self.terminate_func()
                        time.sleep(self.polling_time_step)

                zero_return_code = p.returncode == 0

            logger.info("{}.run has completed. "
                        "Checking remaining handlers".format(job.name))
            # Check for errors again, since in some cases non-monitor
            # handlers fix the problems detected by monitors
            # if an error has been found, not all handlers need to run
            if has_error:
                self._do_check([h for h in self.handlers
                                if not h.is_monitor])
            else:
                has_error = self._do_check(self.handlers)

non-ideal error message for max_errors hit

System

master branch (including latest fix to returncode), Py27, Linux

Summary

In my specific run, the same error comes up twice in a row (EDDRM)
The second time it happens, custodian says that it hits max_errors (I think this part is OK) and raises a custodian error to exit.
What happens next is that this results in a non-zero return code, which intercepts that message and then uses it raise a "nonzero returncode error".
Thus the final output message of nonzero return code is less helpful to debugging runs than the original text of max errors hit.

Error message

The stack trace I get back is:

Traceback (most recent call last):\n  File \"/projects/matqm/matmethods_env/codes/fireworks/fireworks/core/rocket.py\", line 224, in run\n    m_action = t.run_task(my_spec)\n  File \"/projects/matqm/matmethods_env/codes/atomate/atomate/vasp/firetasks/run_calc.py\", line 167, in run_task\n    c.run()\n  File \"/projects/matqm/matmethods_env/codes/custodian/custodian/custodian.py\", line 323, in run\n    .format(self.total_errors, ex))\nRuntimeError: 1 errors reached: (CustodianError(...), u'Job return code is 1. Terminating...'). Exited...\n

You can see that it's difficult to know from above that max_errors was reached and that is why we are exiting. You can figure it out though if you look at custodian.py line 323.

Files

The run is located in :
/projects/ps-matqm/prod_runs/block_2016-12-20-23-00-16-536064/launcher_2016-12-20-23-00-35-095234/launcher_2016-12-21-09-28-04-031609

System

Custodian version: master

Summary

If I have a 2-step structure relaxation, and the first relaxation hits MAX_ERRORs, the 2nd job still seems to perform some of the maintenance steps of "copying files to relax1 extension".
This makes it more difficult to realize that the 2nd job never actually ran. e.g., the directory contains both "OUTCAR.relax1" and "OUTCAR". Usually, that means that OUTCAR is for the 2nd job and OUTCAR.relax1 is for the first job, and the second job failed. In this case, the two files are identical since the first job hit max errors.
see uploaded file error.tar.gz (CHGCAR/WAVECAR/PROCAR removed) for an example where the first relaxation failed due to max errors

Files

error.tar.gz

Add convergence handler for meta-GGAs to use LMIXTAU

Try LMIXTAU=True for convergence issues when meta-GGAs are used. Tagging @shyamd in connection to our SCAN work. Further investigation required, but maybe even worth setting by default in the SCAN set?

https://cms.mpi.univie.ac.at/wiki/index.php/LMIXTAU

file_delete of CHGCAR/WAVECAR and nscf runs

System

Most recent custodian run on NERSC

Summary

Many VASP custodian handlers have a _file_delete action of CHGCAR and WAVECAR as part of their operation as part of the VASP "recommendation"
However, non-SCF runs (where the CHGCAR is held fixed) requires these files to be present. If they are deleted and VASP is restarted, the job will immediately fail.

Error message

 reading WAVECAR
 WARNING: chargedensity file is incomplete
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10
 ERROR: charge density could not be read from file CHGCAR for ICHARG>10

Files

An example run directory of this error will be available for the next couple of weeks if needed. Let me know.

ISTART in run_vasp

What's the use case for setting this in run_vasp? It causes NELMDL issues if LWAVE=False. I think the vasp default would be better.

Can't use VaspErrorHandler in IBRION=-1 or ISIF=5,6,7 runs

"rot_matrix" and "pricel" errors currently use a PerturbStructureTransformation (which is incompatible with these modes).
In my experience, "rot_matrix" is usually caused by difficulty with finding symmetry in even monkhorst grids, and is fixed by using an odd grid (though for some reason ISYM=0 doesn't fix it).

I think "pricel" should be fixed by setting ISYM=0.

Is there any objection to me making these changes?

[VASP] NonConvergingErrorHandler for meta-GGAs/hybrids

Currently, the SCF algorithm ladder in NonConvergingErrorHandler is tuned for GGAs and is not appropriate for meta-GGAs or hybrids. For instance, meta-GGAs should rarely be used with Algo = VeryFast or Fast (Algo = All is generally recommended and is the default in MPScanRelaxSet). For hybrids, Algo = Fast or VeryFast should never be used according to the VASP manual, as hybrids don't support these algorithms even though no warning is printed: https://www.vasp.at/wiki/index.php/LHFCALC.

Will be closed when #179 is merged.

Terminate Function Seems Not Working Properly With SLURM

This issue only matters when using Vasp Monitors and only when running jobs in a SLURM environmental. It seems to originate from the Popen.terminate() python API. If a job terminates normally, all the function works properly. However, once a VASP job is killed by Popen.terminate(), which would be true if the Monitor finds an error, new VASP job will never be able to run. The error message is "Unable to create job step: Job/step already completing or completed". The power of custodian is fully prohibited in this situation.

How do you know whether you are affected? If the success rate of recent calculation degrades, probably you are already affected.

I reproduced the error using a simple python script without custodian and asked NERSC staff to fix it. They confirmed the issue and promised to investigate it. However, it has already been two weeks I didn't get any update. As a precaution, I think we should consider the possibility they are not able fix it in short term.

I have an alternative to walk-around this issue in my fork. The point is to use "scancel" to kill VASP job. This method releases the resource successfully while Popen.terminate() doesn't. It seems to be working fine for the last two weeks. However, it makes the custodian code more complicated and environment dependent. I am not sure whether it is a good idea to merge this piece of code to the main repository.


	error_msgs = {
	"tet": [
	"Tetrahedron method fails",
	"Fatal error detecting k-mesh",
	"Fatal error: unable to match k-point",
	"Routine TETIRR needs special values",
	"Tetrahedron method fails (number of k-points < 4)",
	"BZINTS",
	],

	with open(self.output_filename, "r") as f:
	for line in f:
	l = line.strip()
	for err, msgs in VaspErrorHandler.error_msgs.items():
	if err in self.errors_subset_to_catch:
	for msg in msgs:
	if l.find(msg) != -1:
	# this checks if we want to run a charged
	# computation (e.g., defects) if yes we don't
	# want to kill it because there is a change in
	# e-density (brmix error)
	if err == "brmix" and "NELECT" in incar:
	continue
	self.errors.add(err)
	error_msgs.add(msg)
	for msg in error_msgs:

	if self.errors.intersection(["tet", "dentet"]):
	if vi["INCAR"].get("KSPACING"):
	# decrease KSPACING by 20% in each direction (approximately double no. of kpoints)
	actions.append(
	{
	"dict": "INCAR",
	"action": {"_set": {"KSPACING": vi["INCAR"].get("KSPACING") * 0.8}},
	}
	)

materialsproject / custodian Goto Github PK

custodian's People

Contributors

Stargazers

Watchers

Forkers

custodian's Issues

System

Summary

Example code and Error message is showed in #25

System

Summary

Example code

Summary

Suggested solution (if known)

Number of errors

More informative exceptions

System

Summary

Example code

Error message

Suggested solution (if known)

System

Summary

Error message

Summary

Details

Suggested solution

System

Summary

Example code

Error message

Suggested solution (if known)

Files

System

Summary

Example code

Error message

Suggested solution (if known)

Files

System

Summary

Example code

Error message

Suggested solution (if known)

Files

System

Summary

System

Summary

Error message

Suggested solution (if known)

System

Summary

Error message

Suggested solution (if known)

System

Summary

Error message

Suggested solution (if known)

System

Summary

Example code

System

Summary

Error message

Files

Suggested solution (if known)

System

Summary

Files

System

Summary

Error message

Suggested solution (if known)

Files

Recommend Projects

Recommend Topics

Recommend Org