materialsproject / custodian
A simple, robust and flexible just-in-time job management framework in Python.
License: MIT License
The current fix for the EDWAV error is to change the smearing scheme to ISMEAR = 0, which is not a very reliable fix. The origin of this error appears to be a bug in VASP when it is compiled with the default -O2 optimization level in some versions of the Intel compiler. Changing the optimization level to -O1 appears to reliably fix the issue.
I suggest adding a warning, or an optional switch to an -O1 build of VASP, to fix the bug instead of changing the smearing. For what it's worth, EDWAV errors only show up when running ALGO = A or ALGO = D, which can't be run with the common ISMEAR = -5 scheme anyway.
The key "filename" in the settings dict should be changed to "file"; otherwise it will raise ValueError("Unrecognized format...").
if job_number < 2 and not converged:
    settings = [
        {"dict": "INCAR",
         "action": {"_set": {"ISTART": 1}}},
        {"filename": "CONTCAR",
         "action": {"_file_copy": {"dest": "POSCAR"}}}]
# switch to RMM-DIIS once we are near the
# local minimum (assumed after 2 runs of CG)
else:
    settings = [
        {"dict": "INCAR",
         "action": {"_set": {"ISTART": 1, "IBRION": 1}}},
        {"filename": "CONTCAR",
         "action": {"_file_copy": {"dest": "POSCAR"}}}]
The above code should be changed to:
if job_number < 2 and not converged:
    settings = [
        {"dict": "INCAR",
         "action": {"_set": {"ISTART": 1}}},
        {"file": "CONTCAR",
         "action": {"_file_copy": {"dest": "POSCAR"}}}]
# switch to RMM-DIIS once we are near the
# local minimum (assumed after 2 runs of CG)
else:
    settings = [
        {"dict": "INCAR",
         "action": {"_set": {"ISTART": 1, "IBRION": 1}}},
        {"file": "CONTCAR",
         "action": {"_file_copy": {"dest": "POSCAR"}}}]
Currently, the real_optlay INCAR swap implicitly assumes the user has set LREAL = Auto, but there is no guarantee this is the case. We wouldn't want LREAL = False to be switched by accident. Also, we probably shouldn't be switching to LREAL = True for large systems regardless -- that goes against the recommendations in the VASP manual.
Will be closed once #182 is merged.
For me, VaspErrorHandler doesn't trigger on

IBZKPT: tetrahedron method fails for NKPT<4. NKPT = 3

even though it's supposed to handle this:
custodian/custodian/vasp/handlers.py
Lines 63 to 72 in d3cb039
custodian/custodian/vasp/handlers.py
Lines 151 to 166 in d3cb039
custodian/custodian/vasp/handlers.py
Lines 178 to 186 in d3cb039
Could the reason be the difference in capitalization?

Tetrahedron method fails

vs.

tetrahedron method fails
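If the capitalization mismatch is indeed the cause, a case-insensitive match would cover both variants. A minimal sketch of such a fix (hypothetical, not the actual custodian handler code):

```python
# Hypothetical sketch: match handler error strings case-insensitively so that
# both "Tetrahedron method fails" and "tetrahedron method fails" are caught.
error_msgs = {"tet": ["tetrahedron method fails"]}

def errors_in_line(line):
    """Return the set of error keys whose messages appear in `line`, ignoring case."""
    lowered = line.lower()
    return {err for err, msgs in error_msgs.items()
            if any(msg.lower() in lowered for msg in msgs)}
```

This would make the check robust to whichever casing a given VASP build prints.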
VASP by default has no way of checking whether the user entered the name of an INCAR parameter correctly, and will therefore still run even if there are nonsensical parameters in the file. For example, if one sets a value for "NBAND" instead of "NBANDS" in the INCAR, VASP will simply ignore it. Another example: if someone sets METAGGA = Scan in version 5.4.1, even though this METAGGA option does not exist in that version, nothing will happen. This can make troubleshooting confusing if a user gets results that are not aligned with their expectations and doesn't know why.
I don't think custodian has such checks yet, but would it be possible to implement them?
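Such a check could be as simple as comparing INCAR keys against a list of known tags. A rough sketch, where KNOWN_INCAR_TAGS is an abbreviated, illustrative whitelist rather than anything custodian ships:

```python
# Illustrative whitelist; a real implementation would need the full VASP tag
# list, possibly keyed by VASP version (e.g. METAGGA=Scan is unavailable in 5.4.1).
KNOWN_INCAR_TAGS = {"NBANDS", "ISMEAR", "SIGMA", "ENCUT", "EDIFF", "IBRION", "METAGGA"}

def find_unknown_tags(incar):
    """Return INCAR keys that VASP would silently ignore."""
    return {k for k in incar if k.upper() not in KNOWN_INCAR_TAGS}
```

Running this before job submission would surface typos like "NBAND" that VASP itself never reports.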
How about adding an option to write logs to custodian.yaml instead of custodian.json? More readable IMO, and fits more logs onto one screen. YAML is presumably slower to write, but that won't be the bottleneck.
Jobs experiencing an error (e.g. aliasing) will run to completion before custodian corrects them (and then restarts them). This seems to be caused by p.communicate() in VaspJob.run(). Is there any reason this is necessary? I'm not sure what bug it was supposed to fix.
Custodian contains a built-in function to generate double-relaxation VASP jobs. Is there a reason not to default copy_magmom to True in this case?
Similarly for full_opt_run (although that is not used by the Materials Project).
After an error is detected, the job should be terminated, the INCAR modified, and a new calculation resubmitted. Right now, however, the old calculation is not terminated even though a new calculation with the modified INCAR is submitted, which leads to overloading the compute nodes.
Could someone point me to the relevant code for "terminate_func"?
My university migrated from Ubuntu Linux to RedHat, and I find that VASP is slower when I call it from custodian than when I call it directly with "mpirun -np 16 vasp_std". Are there any possible reasons for this kind of weird behaviour?
from custodian.custodian import Custodian
from custodian.vasp.handlers import VaspErrorHandler, \
    UnconvergedErrorHandler
from custodian.vasp.jobs import VaspJob
import os

vasp = 'vasp_std'
node = os.environ['NCPUS']
vasp_cmd = ['mpirun', "-np", str(node), vasp]
handlers = [VaspErrorHandler()]
jobs = VaspJob(vasp_cmd, auto_npar=False, auto_gamma=False)
c = Custodian(handlers, [jobs], max_errors=10)
c.run()
With the same input files, here are the OUTCARs after one hour of running:
First call to EWALD: gamma= 0.147
Maximum number of real-space cells 5x 5x 1
Maximum number of reciprocal cells 2x 2x 7
FEWALD: cpu time 0.2542: real time 0.2548
--------------------------------------- Iteration 1( 1) ---------------------------------------
POTLOK: cpu time 0.2242: real time 0.2320
SETDIJ: cpu time 0.2767: real time 0.2774
energy without entropy = -126.53388816 energy(sigma->0) = -126.64889039
--------------------------------------------------------------------------------------------------------
--------------------------------------- Iteration 1( 28) ---------------------------------------
POTLOK: cpu time 0.2173: real time 0.2239
SETDIJ: cpu time 0.0101: real time 0.0102
Any suggestions? Thanks.
Unable to install custodian on NERSC. I get an error like this:
$ python setup.py develop
Traceback (most recent call last):
  File "setup.py", line 11, in <module>
    long_desc = f.read()
  File "/global/u1/m/mliu/ml_web/virtenv_ml_web/lib/python2.7/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3323: ordinal not in range(128)
When reporting bugs/issues, please supply the following information. If this
is a feature request, please simply state the requested feature.
VASP_BACKUP_FILES = {"CUSTODIAN", "INCAR", "KPOINTS", "POSCAR", "OUTCAR", "OSZICAR",
                     "vasprun.xml", "vasp.out", "std_err.txt"}
Should JSONSerializable be moved to monty? It looks more general than custodian and seems to fit in with "MSONable"
Dear all, I opened this issue not to signal a problem, but rather to propose a couple of improvements to the current custodian implementation. These ideas emerged since we would like to use custodian as a base for error handling in a new project and we think that these functionalities would help us in our use cases.
The first modification concerns the introduction of finer control over the number of errors before stopping the job. In particular, I think it would be nice to be able to decide the maximum number of times a correction from a single error handler should be applied.
I can imagine a series of cases where this could be useful:
CheckpointHandler suggests that this handler should be used alone and with a high value for max_errors. However, I suppose that the limitation of being the single error handler could be lifted if it were possible to control the maximum number of tolerated corrections for each of the other handlers. Of course, something like this could still be implemented at the level of each single error handler, but that would likely lead to a lot of redundant code, while it seems a sufficiently general feature to belong in the core implementation.
I have prepared a rough implementation of this modification here: gpetretto@3ff38fc
Such a modification should have no impact on existing code, as by default everything should keep working as before. The only possible problem would be if someone has implemented handlers outside the official custodian implementation with names clashing with the new ones introduced here, but that seems a relatively remote possibility.
Here is a snippet of code demonstrating possible use cases: https://gist.github.com/gpetretto/b14a1addf53a8e99893fca24228e1d4e
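The core of the idea can be sketched in a few lines; the attribute and method names below are illustrative, not those in the linked commit:

```python
# Sketch: each handler carries its own cap on applied corrections; Custodian's
# main loop would skip (or abort on) a handler that has exhausted its cap.
class CappedHandlerMixin:
    max_num_corrections = None  # None means no per-handler limit

    def __init__(self):
        self.n_applied_corrections = 0

    def correction_allowed(self):
        """True while this handler may still apply corrections."""
        return (self.max_num_corrections is None
                or self.n_applied_corrections < self.max_num_corrections)

    def record_correction(self):
        self.n_applied_corrections += 1
```

Keeping the counter on the handler itself means the core loop needs only a single extra check, rather than each handler re-implementing its own bookkeeping.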
I suppose that more often than not Custodian runs inside some other Python code (e.g. fireworks), but, as far as I can see, what happens inside custodian is not really visible to the outer layers. To give a more specific example, we would like to take advantage of the exact final state of the error handler/validator that caused the program to stop, in order to take further actions. This may involve a dynamical evolution of the workflow, like adding an intermediate-step calculation before rerunning the current job.
I would like to stress that I am not suggesting introducing some kind of integration with other Python packages (be it fireworks or anything else), but just letting out more information when errors happen, for whoever may be interested in having it.
It seems that currently some information can be extracted from the Custodian instance (from log_run, total_errors and errors_current_job), but this implies a bit of investigation, and maybe some guessing about what really happened inside custodian (e.g. is there any way to know which validator made the calculation fail, aside from parsing the text of the error message?).
Here is one possible modification of the current behavior that would already allow one to directly receive some useful information:
gpetretto@9597595
https://gist.github.com/gpetretto/13f8f00994d62844cadeae812559bfb9
This takes advantage of the validator attribute in CustodianError, which is set but seemingly never used. Also, since the new error class CustodianRuntimeError subclasses RuntimeError, this should preserve the current behavior almost entirely, with the only difference being the name of the exception in the message. In addition, it will provide extra information about what happened inside custodian.
However, if a bit more freedom is allowed, I would prefer introducing a more detailed hierarchy of CustodianError subclasses and letting these exceptions emerge directly from Custodian, so that one could catch and analyze each case separately in more detail. For example, something like this:
RuntimeError
 +-- CustodianError
      +-- ValidationError
      +-- HandlerError
      +-- ReturnCodeError
      +-- MaxErrorsError
      +-- MaxErrorsPerJobError
Notice that I suggest keeping it a subclass of RuntimeError to ensure that any existing code that relies on catching RuntimeError will keep working.
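A minimal sketch of that hierarchy (class bodies here are illustrative; only the subclass relationships matter):

```python
class CustodianError(RuntimeError):
    """Base class; subclassing RuntimeError keeps existing `except RuntimeError` code working."""

class ValidationError(CustodianError):
    """Raised when a validator fails at the end of a job."""

class HandlerError(CustodianError):
    """Raised when an error handler itself fails."""

class ReturnCodeError(CustodianError):
    """Raised when the job exits with a non-zero return code."""

class MaxErrorsError(CustodianError):
    """Raised when the total error count exceeds max_errors."""

class MaxErrorsPerJobError(CustodianError):
    """Raised when a single job exceeds its per-job error limit."""
```

Outer code could then distinguish, say, a validation failure from a return-code failure with ordinary `except` clauses instead of parsing message text.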
Here is a possible implementation:
gpetretto@d2e0105
https://gist.github.com/gpetretto/45506abee33f68b103639399b2c77064
I would be interested in hearing comments and knowing whether these kinds of modifications could be of interest to you. If so, I can clean up the implementations and open pull requests.
Thanks
A few deprecated methods in pymatgen were NOT removed in the 1.0.0 release.
>>> from custodian.vasp.jobs import VaspJob
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/zhideng/.pyenv/versions/3.5.1/Python.framework/Versions/3.5/lib/python3.5/site-packages/custodian/vasp/jobs.py", line 26, in <module>
    from pymatgen.io.smart import read_structure
ImportError: No module named 'pymatgen.io.smart'
Release a new stable version.
File "/global/project/projectdirs/m2439/matmethods_test/pymatgen/pymatgen/io/vasp/inputs.py", line 1154, in from_string
    kpts = [int(i) for i in lines[3].split()]
ValueError: invalid literal for int() with base 10: '4.0'
Automatic kpoint scheme
0
Gamma
4.0 4.0 4.0
Note the floating-point values. Note also that the original KPOINTS file was 8 8 8, with no floats. So it is custodian that is introducing the floating point.
See also this custodian.json fragment which shows it overriding the KPOINTS with float:
[
  {
    "corrections": [],
    "job": {
      "settings_override": [
        {
          "action": {
            "_set": {
              "comment": "Automatic kpoint scheme",
              "usershift": [0, 0, 0],
              "labels": null,
              "tet_number": 0,
              "tet_connections": null,
              "@module": "pymatgen.io.vasp.inputs",
              "nkpoints": 0,
              "coord_type": null,
              "kpts_weights": null,
              "@class": "Kpoints",
              "tet_weight": 0,
              "kpoints": [[4.0, 4.0, 4.0]],
              "generation_style": "Gamma"
            }
          },
          "dict": "KPOINTS"
        }
      ],
Let me know if you need more info
When updating TravisCI build parameters to allow for OpenBabel-dependent tests to run, two tests in Vasp's test_jobs.py began to fail despite being seemingly unrelated to the changes. More specifically, test_setup in both VaspJobTest and VaspNEBJobTest had problems related to multiprocessing.cpu_count(). I commented the failing sections out and would appreciate if someone with more Vasp experience could fix them. Thanks!
The current FEFF nonconvergence error correction relies on adjusting the maximum iterations and the convergence-accelerator factor. An alternative option, the RESTART card, could be implemented for more robust error correction. If RESTART is specified, FEFF will start the SCF calculation of the potentials from an existing "pot.bin" file; this way one can continue an earlier SCF calculation and save SCF potential calculation time.
When the VASP "posmap" error occurs twice, SYMPREC should be changed to 1e-4. However, it stays at 1e-6 because the error counter was never updated. A test was also never written for the second SYMPREC error handler.
Edit: Will be closed when #183 is merged.
As per Will's suggestion: implement a scheme for checkers that do not actually correct anything, but instead validate something at the end of a job.
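The scheme might look something like this: a base class that mirrors the handler API minus correct(). The concrete example below is purely illustrative:

```python
import abc
import os

class Validator(abc.ABC):
    """A checker run at the end of a job; unlike an ErrorHandler, it has no correct()."""

    @abc.abstractmethod
    def check(self):
        """Return True if validation FAILED."""

class FileExistsValidator(Validator):
    """Illustrative example: fail the run if an expected output file is missing."""

    def __init__(self, filename):
        self.filename = filename

    def check(self):
        return not os.path.exists(self.filename)
```

Custodian's run loop would then call each validator's check() once after the job finishes and raise if any returns True.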
With certain combinations of ENCUT, LREAL, and pseudopotentials, VASP issues the warning
WARNING: PSMAXN for non-local potential too small
In some cases VASP still runs successfully; in other cases it will fail, e.g.
WARNING: PSMAXN for non-local potential too small
LDA part: xc-table for Pade appr. of Perdew
POSCAR, INCAR and KPOINTS ok, starting setup
REAL_OPT: internal ERROR: -32 -32 -32 0
VASP aborting ...
REAL_OPT: internal ERROR: -32 -32 -32 0
VASP aborting ...
REAL_OPT: internal ERROR: -32 -32 -32 0
VASP aborting ...
REAL_OPT: internal ERROR: -32 -32 -32 0
...
It appears that Custodian does not recognize the above type of failure as an error. As a result, _run_job() will attempt to validate the output of the calculation (which never ran in the first place and therefore never generated output) and raise a ValidationError.
Traceback (most recent call last):
File "/global/u2/r/rsking84/.conda/envs/cms/code/fireworks/fireworks/core/rocket.py", line 262, in run
m_action = t.run_task(my_spec)
File "/global/u2/r/rsking84/.conda/envs/cms/code/atomate/atomate/vasp/firetasks/run_calc.py", line 211, in run_task
c.run()
File "/global/u2/r/rsking84/.conda/envs/cms/code/custodian/custodian/custodian.py", line 378, in run
self._run_job(job_n, job)
File "/global/u2/r/rsking84/.conda/envs/cms/code/custodian/custodian/custodian.py", line 502, in _run_job
raise ValidationError(s, True, v)
custodian.custodian.ValidationError: Validation failed: VasprunXMLValidator
The ValidationError is very difficult to troubleshoot without running VASP manually. In this situation, the contents of vasp.out are empty and std_err.txt contains only

srun: fatal: Can not execute vasp_std
Information about the PSMAXN warning and associated failures is scarce, but there appear to be several possible fixes:
I have had the most success with Option 1.
Option 2 has not solved the issue for me and is only applicable if the user does not specify ENCUT in the INCAR file (I think; see the docs). It is also not clear whether this fix is still relevant to the latest versions of VASP.
I'm not yet familiar enough with the architecture of Custodian to know the best way to address this, but it seems to me that, at a minimum, an error handler to catch this type of failure would be valuable. Even better would be to modify LREAL to False on the fly.
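At a minimum, such a handler could look for the REAL_OPT/PSMAXN signatures in the output and propose LREAL = False. A rough sketch (the function names are hypothetical; only the action dict format mimics custodian's modder-style corrections):

```python
def psmaxn_failure_detected(output_text):
    """Return True if the output shows the REAL_OPT/PSMAXN failure mode."""
    return ("REAL_OPT: internal ERROR" in output_text
            or "WARNING: PSMAXN for non-local potential too small" in output_text)

def psmaxn_correction():
    """Proposed fix from above: switch off real-space projection."""
    return {"errors": ["psmaxn"],
            "actions": [{"dict": "INCAR", "action": {"_set": {"LREAL": False}}}]}
```

This would at least convert the silent "job never ran" case into an actionable, logged correction instead of a bare ValidationError.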
Further reading on troubleshooting the VASP PSMAXN warnings:
https://cms.mpi.univie.ac.at/vasp-forum/viewtopic.php?f=3&t=8370
the reason most probably is that you join 2 potentials with very different cutoff, with the POTCAR with the SMALL cutoff (U) being the first in the list. This potentials is used to determine PSMAXN.
please
1) switch the 2 atoms in POSCAR and POTCAR (ie give the atoms such that those with the hardest potentials are first
2) OR use O_s (soft O, low cutoff)
https://cms.mpi.univie.ac.at/vasp-forum/viewtopic.php?t=14811
The warning means that PSMAXN is too small for the required cutoff energy (ENMAX) the first of the atoms given in POTCAR.
Either use a harder potential or decrease ENMAX.
Solved it by setting LREAL=FALSE
https://www.researchgate.net/post/Relaxation_in_metal_using_vasp2
"PSMAXN for non-local potential too small"
Try lowering your ENCUT parameter (how large is it, and what are the defaults in your POTCAR?), this error indicates that you go out of bounds for an array related to the potential, which is related to the cutoff energy.
http://materials.duke.edu/AFLOW/README_AFLOW.TXT)
PSMAXN
PSMAXN errors. By default aflow tries to go around PSMAXN warnings by restarting VASP with reducingly
lower ENMAX until everything is set. This can be done by tuning the INCAR schemes.
The latest version is still 1.1.0 for Mac OS on the matsci channel.
Hi,
I tried the example from your website but was not successful, so I looked into the unit tests and updated the example accordingly, but I still have some issues with the serialisation.
ExampleJob:

import random
from custodian.custodian import Job

class ExampleJob(Job):

    def __init__(self, jobid, params=None):
        if params is None:
            params = {"initial": 0, "total": 0}
        self.jobid = jobid
        self.params = params

    def setup(self):
        self.params["initial"] = 0
        self.params["total"] = 0

    def run(self):
        sequence = [random.uniform(0, 1) for i in range(100)]
        self.params["total"] = self.params["initial"] + sum(sequence)

    def postprocess(self):
        pass

    @property
    def name(self):
        return "ExampleJob{}".format(self.jobid)
Error handler:

from custodian.custodian import ErrorHandler

class ExampleHandler(ErrorHandler):

    def __init__(self, params):
        self.params = params

    def check(self):
        return self.params["total"] < 50

    def correct(self):
        self.params["initial"] += 1
        return {"errors": "total < 50", "actions": "increment by 1"}
This works:

njobs = 100
params = {"initial": 0, "total": 0}
c = Custodian([ExampleHandler(params)],
              [ExampleJob(i, params) for i in range(njobs)],
              max_errors=njobs)
output = c.run()
This does not:

njobs = 100
c = Custodian([ExampleHandler({"initial": 0, "total": 0})],
              [ExampleJob(i, {"initial": 0, "total": 0}) for i in range(njobs)],
              max_errors=njobs)
output = c.run()
When running VASP with custodian (using fireworks), custodian errors are printed to the run directory only after execution has completed. This is an issue, however, when lots of errors cause a job to hit its walltime, or when a job gets pre-empted.
Perhaps the custodian log could be updated/written after each correction? This would preserve the information even if the job is killed suddenly.
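The change could be as small as flushing the run log to disk inside the correction loop. A sketch (the file name follows the existing custodian.json convention; the helper function itself is illustrative):

```python
import json

def flush_run_log(run_log, path="custodian.json"):
    """Write the current run log to disk so it survives a sudden kill."""
    with open(path, "w") as f:
        json.dump(run_log, f, indent=2)

# After each correction, the main loop would append and immediately flush:
run_log = [{"job": "VaspJob", "corrections": []}]
run_log[-1]["corrections"].append({"errors": ["eddrmm"], "actions": []})
flush_run_log(run_log, "custodian_demo.json")
```

The extra write per correction is negligible next to a VASP run, and a pre-empted job would leave behind a log of every correction applied so far.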
from mpworks.submission.submission_mongo import SubmissionMongoAdapter
from pymatgen.matproj.snl import StructureNL
from pymatgen.core.structure import Structure
from pymatgen.transformations.standard_transformations import SubstitutionTransformation

# set submission db
sma = SubmissionMongoAdapter.auto_load()

# read structure from POSCAR file
bs_struc = Structure.from_file("POSCAR")

# some substitution, turning bs_struc into re_struc
snl = StructureNL(re_struc, 'KeLiu <[email protected]>')
sma.submit_snl(snl, '[email protected]', parameters=None)

then go_submissions and qlaunch:

qlaunch -r rapidfire --nlaunches infinite -m 4 --sleep 100 -b 10000
fireworks: Cu1_La1_Te2--GGA_optimize_structure_(2x)
Cu1_La1_Te2--GGA_opt-11690.error
INFO:custodian.custodian:Run started at 2016-04-20 14:24:05.227027 in /lustre/home/umjzhh-1/launcher/layered_material/ycute2/substitution_01stRun/block_2016-04-18-11-18-03-912778/launcher_2016-04-20-01-37-05-453043.
INFO:custodian.custodian:Custodian running on Python version 2.7.11 |Continuum Analytics, Inc.| (default, Dec 6 2015, 18:08:32) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 1. Errors thus far = 0.
INFO:root:Running mpirun -n 32 vasp
INFO:custodian.custodian:Terminating job
INFO:root:Backing up run to error.1.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af3687fce10>, u'errors': [u'too_few_bands'], u'actions': [{u'action': {u'_set': {u'NBANDS': 28}}, u'dict': u'INCAR'}]}
INFO:root:Backing up run to error.2.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.PositiveEnergyErrorHandler object at 0x2af3687fcf90>, u'errors': [u'Positive energy'], u'actions': [{u'action': {u'_set': {u'ALGO': u'Normal'}}, u'dict': u'INCAR'}]}
INFO:custodian.custodian:VaspJob.run has completed. Checking remaining handlers
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 2. Errors thus far = 2.
INFO:root:Running mpirun -n 32 vasp
INFO:custodian.custodian:Terminating job
INFO:root:Backing up run to error.3.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af3687fce10>, u'errors': [u'eddrmm'], u'actions': [{u'action': {u'_set': {u'POTIM': 0.25}}, u'dict': u'INCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'CHGCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'WAVECAR'}]}
INFO:custodian.custodian:VaspJob.run has completed. Checking remaining handlers
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 3. Errors thus far = 3.
INFO:root:Running mpirun -n 32 vasp
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 19
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 23
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 14
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 17
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 8
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 11
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 21
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 12
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 15
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 22
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 4
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 6
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 7
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 10
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 16
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 18
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 29
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 13
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 20
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 25
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 5
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 31
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 27
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 24
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 28
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 26
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 30
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 9
INFO:custodian.custodian:Terminating job
INFO:root:Backing up run to error.4.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af3687fce10>, u'errors': [u'zpotrf', u'eddrmm'], u'actions': [{u'action': {u'_set': {u'ISYM': 0, u'POTIM': 0.125}}, u'dict': u'INCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'CHGCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'WAVECAR'}, {u'action': {u'_set': {u'POTIM': 0.125}}, u'dict': u'INCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'CHGCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'WAVECAR'}]}
INFO:custodian.custodian:VaspJob.run has completed. Checking remaining handlers
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 4. Errors thus far = 4.
INFO:root:Running mpirun -n 32 vasp
INFO:custodian.custodian:Terminating job
INFO:root:Backing up run to error.5.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af3687fce10>, u'errors': [u'subspacematrix'], u'actions': [{u'action': {u'_set': {u'LREAL': False}}, u'dict': u'INCAR'}]}
INFO:custodian.custodian:VaspJob.run has completed. Checking remaining handlers
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 5. Errors thus far = 5.
INFO:root:Running mpirun -n 32 vasp
INFO:custodian.custodian:Terminating job
INFO:root:Backing up run to error.6.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af3687fce10>, u'errors': [u'subspacematrix'], u'actions': [{u'action': {u'_set': {u'LREAL': False}}, u'dict': u'INCAR'}]}
INFO:custodian.custodian:VaspJob.run has completed. Checking remaining handlers
INFO:custodian.custodian:Max errors reached.
ERROR:custodian.custodian:MaxErrors
INFO:custodian.custodian:Logging to custodian.json...
INFO:custodian.custodian:Run ended at 2016-04-20 15:47:38.863389.
INFO:custodian.custodian:Run completed. Total time taken = 1:23:33.636362.
Traceback (most recent call last):
File "/lustre/home/umjzhh-1/kl_me2/codes/fireworks/fireworks/core/rocket.py", line 213, in run
m_action = t.run_task(my_spec)
File "/lustre/home/umjzhh-1/kl_me2/codes/MPWorks/mpworks/firetasks/custodian_task.py", line 115, in run_task
custodian_out = c.run()
File "/lustre/home/umjzhh-1/kl_me2/codes/custodian/custodian/custodian.py", line 221, in run
.format(self.total_errors, ex))
RuntimeError: 6 errors reached: (CustodianError(...), u'MaxErrors'). Exited...
INFO:rocket.launcher:Rocket finished
the relevant SLURM status:
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
11690 Cu1_La1_T+ cpu acct-umjz+ 32 COMPLETED 0:0
11690.batch batch acct-umjz+ 32 COMPLETED 0:0
11690.0 pmi_proxy acct-umjz+ 2 COMPLETED 0:0
11690.1 pmi_proxy acct-umjz+ 2 FAILED 7:0
11690.2 pmi_proxy acct-umjz+ 2 FAILED 7:0
11690.3 pmi_proxy acct-umjz+ 2 FAILED 7:0
11690.4 pmi_proxy acct-umjz+ 2 FAILED 7:0
The VaspCustodianTask function from fireworks_vasp.tasks seems to check whether there is a POSCAR in the folder.
This becomes a problem when I run it for NEB input sets, where all the POSCARs are written in the 00~0x subfolders and there is no POSCAR in the top-level input folder.
VaspCustodianTask(vasp_cmd=['aprun', '-n', '48', '/global/homes/r/rongzq/bin/vasp_5.3.3/vasp_hopper/vasp', '>&', 'stdout'], handlers='all', custodian_params=custodian_params)
I believe that there is a problem with the way the MaxForceErrorHandler is implemented.
The correct method only acts on the EDIFFG key, leaving EDIFF unchanged. This might be fine if the handler is called only once and if there is a large enough factor between EDIFF and EDIFFG. However, I ended up in a situation in which this handler was applied multiple times, causing the value of EDIFFG to go below that of EDIFF, which is clearly a bad configuration.
In general, I would say it would be preferable to keep the same ratio between EDIFF and EDIFFG, i.e. EDIFF could be multiplied by 0.5 as well. If that seems unsuitable, at the very least it should be checked that EDIFFG does not go below EDIFF.
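The suggested behavior can be sketched as follows (illustrative, not the actual handler code):

```python
def tighten_convergence(ediff, ediffg, factor=0.5):
    """Scale both tolerances together so the EDIFF/EDIFFG ratio is preserved
    and repeated corrections cannot push EDIFFG below EDIFF."""
    return ediff * factor, ediffg * factor
```

Because both values are scaled by the same factor, applying the correction any number of times leaves their ratio, and hence their ordering, unchanged.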
This test compares two revisions in the master branch: 1) "5f0e8b9", which is after adding the return-code exception-raising code, and 2) "af15bda", which is before adding the return-code logic. The run_vasp command is used to perform the test. No workflows were involved, only a simple SLURM script and the custodian package.
The most important messages are in the "run.log" file.
With the new code, after "{'errors': ['eddrmm']...", custodian says "Job return code is 137. Terminating...".
With the old code, after "{'errors': ['eddrmm']...", custodian says "Starting job no. 1 (VaspJob) attempt no. 2". There are other issues which are specific to my environment and are already solved, which I don't think affect the conclusions. The bottom line is that the old code tries to fix the error while the new code exits prematurely.
module load vasp/5.4.1
run_vasp -c "srun -n 32 vasp_std" static
"Job return code is 137. Terminating..." vs "Starting job no. 1 (VaspJob) attempt no. 2".
In the custodian/vasp/handlers.py script, there is an error-handling method for the "subspacematrix" error (i.e. "WARNING: Sub-Space-Matrix is not hermitian in DAV"). As it stands, custodian tries to change LREAL when this error occurs. I have found that in the systems I have studied, this error often occurs when PREC is not set to Accurate. I have attached an example set of input files. It runs with PREC = Accurate but hits the subspace error after 7-10 SCF iterations when PREC is not set (i.e. at its default value of Normal). It may therefore be desirable to add a PREC switch for this error message. Note that this may depend on the particular VASP build; I've tested this on VASP 5.4.1, for the record.
INCAR.txt
KPOINTS.txt
POSCAR.txt
POTCAR.txt
Sometimes conjugate gradient doesn't work well for ionic relaxation in VASP. In some cases (mostly with high degrees of freedom and hard+soft vibrational modes), I have found that the velocity-quench algorithm (surprisingly) works better, even though damped MD doesn't help. The simple solution would be to switch the algorithm with the UnconvergedErrorHandler. However, I think it may be more effective to switch the algorithm back to the original (usually conjugate gradient) after the first relaxation completes.
Currently, this is difficult within the Custodian framework, as it is a state machine. My proposal is to allow error handlers to prepend actions to a list that runs during job post-processing.
Does anyone have alternative suggestions, or think that this is outside the scope of what Custodian should handle?
Terminated sub-SLURM job with MPWorks:

$ more Eu1_S2_Sn1--GGA_opti-294848.error
INFO:custodian.custodian:Run started at 2016-12-19 21:03:28.011467 in /lustre/home/umjzhh-1/launcher/block_2016-10-12-18-29-00-023399/launcher_2016-12-19-13-03-16-629901.
INFO:custodian.custodian:Custodian running on Python version 2.7.12 |Continuum Analytics, Inc.| (default, Jul 2 2016, 17:42:40) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
INFO:custodian.custodian:Hostname: node077, Cluster: unknown
INFO:custodian.custodian:Starting job no. 1 (VaspJob) attempt no. 1. Errors thus far = 0.
INFO:root:Running srun -v -n 32 vasp
INFO:custodian.custodian:Terminating job
INFO:root:Terminate the job step using scancel --signal=KILL 294848.0
INFO:root:Backing up run to error.1.tar.gz.
ERROR:custodian.custodian:{u'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2af91dd3d0d0>, u'errors': [u'brmix', u'eddrmm'], u'actions': [{u'action': {u'_set': {u'ISTART': 1}}, u'dict': u'INCAR'}, {u'action': {u'_set':
{u'POTIM': 0.25}}, u'dict': u'INCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'CHGCAR'}, {u'action': {u'_file_delete': {u'mode': u'actual'}}, u'file': u'WAVECAR'}]}
INFO:custodian.custodian:Job return code is 137. Terminating...
ERROR:custodian.custodian:Job return code is 137. Terminating...
INFO:custodian.custodian:Logging to custodian.json...
INFO:custodian.custodian:Run ended at 2016-12-19 21:13:39.441558.
INFO:custodian.custodian:Run completed. Total time taken = 0:10:11.430091.
Traceback (most recent call last):
File "/lustre/home/umjzhh-1/keliu/envs/mpworks/codes/fireworks/fireworks/core/rocket.py", line 224, in run
m_action = t.run_task(my_spec)
File "/lustre/home/umjzhh-1/keliu/envs/mpworks/codes/mpworks/mpworks/firetasks/custodian_task.py", line 136, in run_task
custodian_out = c.run()
File "/lustre/home/umjzhh-1/keliu/envs/mpworks/codes/custodian/custodian/custodian.py", line 323, in run
.format(self.total_errors, ex))
RuntimeError: 1 errors reached: (CustodianError(...), u'Job return code is 137. Terminating...'). Exited...
INFO:rocket.launcher:Rocket finished
See atomate PR hackingmaterials/atomate#276
Needs further investigation; @jmmshn could you attach some sample bad outputs? Suspect it may be an mpi-related issue where lines are getting transposed.
new_settings = {"ISTART": 1,
"ALGO": "Normal",
"NELMDL": -6,
"BMIX": 0.001,
"AMIX_MAG": 0.8,
"BMIX_MAG": 0.001}
Recently, I've had pretty good luck with just setting ALGO = All when a job is unconverged, rather than making all these changes (this is in fact how UnconvergedErrorHandler already deals with SCAN failures, but I think it's more general than that). If it were just up to my qualitative assessment, I would first try ALGO = All; if that still doesn't work, try making all the various changes above.
Downside: I don't have any hard facts to support my case.
Perhaps @shyamd or @montoyjh or @mkhorton or @tschaume can support or deny.
If we do want to switch to trying ALGO=All first, I'd be happy to do the implementation. So it's just a matter of deciding.
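A minimal sketch of what the "ALGO = All first" ladder could look like, assuming the handler tracks how many corrections it has already applied (the function name and structure are illustrative, not the actual UnconvergedErrorHandler API):

```python
def unconverged_actions(attempt):
    """Illustrative correction ladder: try ALGO = All on the first attempt,
    then fall back to the fuller set of INCAR changes quoted above."""
    if attempt == 0:
        return [{"dict": "INCAR", "action": {"_set": {"ALGO": "All"}}}]
    return [{"dict": "INCAR",
             "action": {"_set": {"ISTART": 1, "ALGO": "Normal",
                                 "NELMDL": -6, "BMIX": 0.001,
                                 "AMIX_MAG": 0.8, "BMIX_MAG": 0.001}}}]
```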
Dear All,
I'm using custodian with Python 3.6.8 on a Linux machine. The queue system is PBS.
When I launch the PBS script, instead of running the VASP binary directly, I run a Python script that initializes a Custodian object and launches the job, with 'vasp_cmd' being the command that I usually use to run VASP directly in a PBS script.
At first it runs, but then I get the following in the error log.
ERROR:custodian.custodian:
{ 'actions': [ { 'action': { '_set': { 'POTIM': 0.30000000000000004}},
'dict': 'INCAR'}],
'errors': [ 'brions'],
'handler': <custodian.vasp.handlers.VaspErrorHandler object at 0x2acf80726b00>}
/software/bin/vasp: no process found
The file error1.tar.gz contains an empty std_err file. The OSZICAR file shows completely normal behavior but is cut off in the middle of the 3rd ionic step.
I don't know why Custodian runs well at first, then decides to stop the ionic relaxation in the middle of an SCF iteration, and on rerunning cannot find the binary.
Does anybody know what's going on?
Best regards,
Oier.
CIC Energigune
Currently, when the ZHEGV error appears in VASP, Custodian switches ALGO to All. This is the right approach, but for small systems (e.g. elementals with only 1 or 2 atoms/cell), ZHEGV often appears because too many cores are requested. With supercomputers having more and more cores per node these days, this comes up more frequently. The solution is just to decrease the number of cores for the job, but rather than trying to play around with that, Custodian could issue a warning to the user if len(structure) is below some cutoff. It would require some testing to figure out what a good value might be. My gut feeling is perhaps 5 atoms/cell.
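A sketch of what such a warning might look like. Both the cell-size cutoff and the cores-per-atom heuristic here are guesses for illustration, not tested values:

```python
import warnings

def warn_if_overparallelized(n_atoms, n_cores, cutoff=5):
    """Hypothetical check: warn when a very small cell is run on many cores,
    a common trigger for ZHEGV failures. The cutoff (5 atoms/cell) and the
    cores-per-atom threshold are illustrative guesses, not tested values."""
    if n_atoms < cutoff and n_cores > n_atoms * 8:
        warnings.warn(
            f"Only {n_atoms} atoms/cell on {n_cores} cores; ZHEGV errors "
            "may indicate over-parallelization. Consider reducing the "
            "core count.")
```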
We've been having a lot of different users' VASP jobs fail for various reasons, and we'd like the capability to aggregate Custodian errors so that we can more easily tell whether some errors are occurring more often than would be expected (that is, whether they reflect an underlying weakness in our input sets or error handlers, rather than user error).
I'd like to propose we integrate Sentry support via their Python client.
By default, this will result in no change to Custodian, except an additional optional dependency. However, if you set a SENTRY_DSN environment variable, we can set it up to log and aggregate errors automatically.
If there are no objections to this, I will go ahead and implement.
I think a similar addition to log stack traces in FireWorks when jobs fizzle might also be useful.
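The opt-in behaviour described above could be sketched roughly as follows (the function name is illustrative; only the `SENTRY_DSN` convention and `sentry_sdk.init` come from the Sentry Python client):

```python
import os

def maybe_init_sentry():
    """Sketch of the proposed opt-in integration: only activate Sentry when
    the user has set SENTRY_DSN, so default behaviour is unchanged."""
    dsn = os.environ.get("SENTRY_DSN")
    if not dsn:
        return False
    import sentry_sdk  # optional dependency, imported only when needed
    sentry_sdk.init(dsn=dsn)
    return True
```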
The docstring of StoppedRunHandler appears to be copied from a version of the CheckpointHandler docstring. I think it needs to be updated to pertain to StoppedRunHandler. I would do it myself, but I have never used these handlers, so I can't comment intelligently on what they do.
(this example is on NERSC in std_err.txt):
srun: error: slurm_receive_msg: Socket timed out on send/recv operation
srun: error: Unable to confirm allocation for job 3178405: Socket timed out on send/recv operation
srun: Check SLURM_JOB_ID environment variable for expired or invalid job.
but this may happen for various reasons regardless of what this error message says.
I don't know exactly how VASP exits when it can't even start, but the machine itself raises various errors like the one copied here. A check for this could be added before p = job.run()
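One way such a pre-flight check could look. This is a hypothetical sketch: the function name is made up, and it assumes a SLURM environment where `squeue` is on the path (it is only invoked when `SLURM_JOB_ID` is set):

```python
import os
import subprocess

def slurm_allocation_ok():
    """Hypothetical pre-flight check before job.run(): confirm the SLURM
    allocation is still valid by querying squeue for our own job id."""
    job_id = os.environ.get("SLURM_JOB_ID")
    if job_id is None:
        return True  # not running under SLURM; nothing to check
    result = subprocess.run(["squeue", "-h", "-j", job_id],
                            capture_output=True, text=True)
    return result.returncode == 0 and job_id in result.stdout
```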
VaspErrorHandler fails to detect certain error categories in VASP 6 that were detected properly in VASP 5. This appears to be a result of formatting changes in the error messages. My understanding is that custodian reads stdout line by line, but some of the error messages have been reformatted in such a way that they now break across multiple lines and hence are not caught.
See the example below for the point_group error. I have also observed this for the inv_rot_mat message, and I suspect it could affect other handlers as well.
VASP6:
-----------------------------------------------------------------------------
| |
| EEEEEEE RRRRRR RRRRRR OOOOOOO RRRRRR ### ### ### |
| E R R R R O O R R ### ### ### |
| E R R R R O O R R ### ### ### |
| EEEEE RRRRRR RRRRRR O O RRRRRR # # # |
| E R R R R O O R R |
| E R R R R O O R R ### ### ### |
| EEEEEEE R R R R OOOOOOO R R ### ### ### |
| |
| VERY BAD NEWS! internal error in subroutine IBZKPT: Error: point |
| group operation missing 2 |
| |
| ----> I REFUSE TO CONTINUE WITH THIS SICK JOB ... BYE!!! <---- |
| |
-----------------------------------------------------------------------------
VASP5:
VERY BAD NEWS! internal error in subroutine IBZKPT:
Error: point group operation missing 2
The point_group error looks for the entire message on a single line:
"point_group": ["Error: point group operation missing"],
It's not clear to me what the best solution is. Ideally we would move away from reading one line at a time and instead detect the entire string, regardless of where a line break occurs, but that may be difficult in practice. Alternatively, perhaps we need to shorten the error messages that are detected so that custodian will catch them in both VASP5 and VASP6.
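A sketch of the "detect the entire string regardless of line breaks" option, assuming the handler can buffer the output text rather than matching one line at a time (the names here are illustrative, not custodian's actual internals):

```python
import re

# Hypothetical whitespace-tolerant matching: strip the decorative "|" banner
# characters and collapse all whitespace, so a message that VASP 6 splits
# across lines still matches the VASP 5 single-line pattern.
PATTERNS = {
    "point_group": re.compile(r"point\s+group\s+operation\s+missing"),
}

def find_errors(output_text):
    # normalize: remove banner pipes, then join all whitespace runs
    text = " ".join(output_text.replace("|", " ").split())
    return {name for name, pat in PATTERNS.items() if pat.search(text)}
```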
So drift in the forces is obviously important to catch, since it means that you might be misled into thinking you are more force-converged than you actually are. However, it is not a "failure" in the traditional sense: the calculation can still finish and be otherwise accurate. One could imagine warning in, say, the atomate drone when ingesting a calculation where the drift exceeds the desired force convergence, instead of causing outright failures.
Currently, the DriftErrorHandler fails to actually fix drift issues, and instead causes a large quantity of jobs to fail since it's in the default handler group; I've seen 1.4k failures in the last month.
Tagging @shyamd since I know it's your creation -- do you have any suggestions to fix drift issues, beyond the strategies in the handler?
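The warn-instead-of-fail check suggested above could be as simple as comparing the drift magnitude against the requested force convergence. A minimal sketch (the function name and signature are hypothetical; it assumes a force-based EDIFFG, i.e. a negative value):

```python
def drift_exceeds_tol(drift_vec, ediffg):
    """Sketch: flag when the largest drift component exceeds the requested
    force convergence |EDIFFG|. Only meaningful for force-based convergence
    (EDIFFG < 0); an energy-based criterion returns False."""
    mag = max(abs(c) for c in drift_vec)
    return ediffg < 0 and mag > abs(ediffg)
```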
ImportError when trying to import VaspInput from pymatgen (2019.9.16)
Traceback (most recent call last):
File "/home/users/stamma58/.conda/envs/custodian/bin/cstdn", line 10, in <module>
sys.exit(main())
File "/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/custodian/cli/cstdn.py", line 113, in main
args.func(args)
File "/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/custodian/cli/cstdn.py", line 82, in run
c = Custodian.from_spec(d)
File "/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/custodian/custodian.py", line 322, in from_spec
cls_ = load_class(d["jb"])
File "/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/custodian/custodian.py", line 302, in load_class
mod = __import__(modname, globals(), locals(), [classname], 0)
File "/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/custodian/vasp/jobs.py", line 13, in <module>
from pymatgen.io.vasp import VaspInput, Incar, Poscar, Outcar, Kpoints, Vasprun
ImportError: cannot import name 'VaspInput' from 'pymatgen.io.vasp' (/home/users/stamma58/.conda/envs/custodian/lib/python3.7/site-packages/pymatgen/io/vasp/__init__.py)
Starting from pymatgen version 2019.9.12, VaspInput is no longer available from the pymatgen.io.vasp namespace. Suggested solution: change
from pymatgen.io.vasp import VaspInput
to
from pymatgen.io.vasp.inputs import VaspInput
Consider a handler that checks for convergence. It checks whether convergence has been reached while the job is running, so it is a monitor. It can happen, however, that the program finishes because the maximum number of steps is reached, and exits before custodian cycles around to check the monitors again. Custodian will then proceed to the final check.
It appears that, the way custodian is written, if any error was found and stored in the bool variable has_error, then instead of running a final check with all handlers, only those with is_monitor=False will be checked. Because of this, my convergence handler was not included in the final check, so it was bypassed and the job was marked as completed because a different handler had addressed the problem that set has_error.
Maybe I'm not understanding the logic of the code, but this does not seem to be the way it should function. Clarifications would be appreciated if this is intended. I'm not sure about a proposed solution, because there could be collisions where running every handler at the end causes a monitor to be run twice.
This code snippet is taken from custodian.py:448
# While the job is running, we use the handlers that are
# monitors to monitor the job.
if isinstance(p, subprocess.Popen):
    if self.monitors:
        n = 0
        while True:
            n += 1
            time.sleep(self.polling_time_step)
            if p.poll() is not None:
                break
            terminate = self.terminate_func or p.terminate
            if n % self.monitor_freq == 0:
                has_error = self._do_check(self.monitors,
                                           terminate)
            if terminate is not None and terminate != p.terminate:
                time.sleep(self.polling_time_step)
    else:
        p.wait()
        if self.terminate_func is not None and \
                self.terminate_func != p.terminate:
            self.terminate_func()
            time.sleep(self.polling_time_step)
    zero_return_code = p.returncode == 0

logger.info("{}.run has completed. "
            "Checking remaining handlers".format(job.name))
# Check for errors again, since in some cases non-monitor
# handlers fix the problems detected by monitors
# if an error has been found, not all handlers need to run
if has_error:
    self._do_check([h for h in self.handlers
                    if not h.is_monitor])
else:
    has_error = self._do_check(self.handlers)
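One possible behaviour the reporter may have expected could be sketched as follows. This is purely hypothetical and not the actual custodian API: the idea is to always include monitors in the final pass, but skip any that already fired during the run, avoiding the double-run collision mentioned above:

```python
def final_check_handlers(handlers, fired_monitors):
    """Hypothetical selection for the final check: include every handler,
    but skip monitors that already fired during the run so they are not
    run twice."""
    return [h for h in handlers
            if not (h.is_monitor and h in fired_monitors)]
```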
The stack trace I get back is:
Traceback (most recent call last):
  File "/projects/matqm/matmethods_env/codes/fireworks/fireworks/core/rocket.py", line 224, in run
    m_action = t.run_task(my_spec)
  File "/projects/matqm/matmethods_env/codes/atomate/atomate/vasp/firetasks/run_calc.py", line 167, in run_task
    c.run()
  File "/projects/matqm/matmethods_env/codes/custodian/custodian/custodian.py", line 323, in run
    .format(self.total_errors, ex))
RuntimeError: 1 errors reached: (CustodianError(...), u'Job return code is 1. Terminating...'). Exited...
You can see that it's difficult to know from above that max_errors was reached and that is why we are exiting. You can figure it out though if you look at custodian.py line 323.
The run is located in :
/projects/ps-matqm/prod_runs/block_2016-12-20-23-00-16-536064/launcher_2016-12-20-23-00-35-095234/launcher_2016-12-21-09-28-04-031609
Try LMIXTAU=True for convergence issues when meta-GGAs are used. Tagging @shyamd in connection to our SCAN work. Further investigation required, but maybe even worth setting by default in the SCAN set?
Most recent custodian run on NERSC
reading WAVECAR
WARNING: chargedensity file is incomplete
ERROR: charge density could not be read from file CHGCAR for ICHARG>10
(the ERROR line above is repeated many more times)
Handlers that delete the CHGCAR/WAVECAR should first check whether ICHARG > 10. If it is, then do not perform the deletion of CHGCAR/WAVECAR.
An example run directory of this error will be available for the next couple of weeks if needed. Let me know.
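The proposed guard only needs to read ICHARG out of the INCAR before any deletion. A self-contained sketch (the function name is illustrative; a real implementation would presumably use pymatgen's Incar parser instead of this minimal text parse):

```python
def safe_to_delete_chgcar(incar_text):
    """Return False when ICHARG > 10, i.e. the run reads the charge density
    from CHGCAR and deleting it would trigger the errors shown above."""
    for line in incar_text.splitlines():
        if line.strip().upper().startswith("ICHARG"):
            try:
                return int(line.split("=")[1].split()[0]) < 10
            except (IndexError, ValueError):
                pass
    return True  # ICHARG unset: the VASP default does not require CHGCAR
```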
What's the use case for setting this in run_vasp? It causes NELMDL issues if LWAVE=False. I think the VASP default would be better.
"rot_matrix" and "pricel" errors currently use a PerturbStructureTransformation (which is incompatible with these modes).
In my experience, "rot_matrix" is usually caused by difficulty finding symmetry with even Monkhorst-Pack grids, and is fixed by using an odd grid (though for some reason ISYM=0 doesn't fix it).
I think "pricel" should be fixed by setting ISYM=0.
Is there any objection to me making these changes?
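The two proposed fixes are small. The "pricel" case is a one-line INCAR change in the dict-modder syntax; for "rot_matrix", bumping each even k-mesh subdivision to the next odd number could look like this hypothetical helper:

```python
# "pricel": set ISYM = 0 (dict-modder syntax, as used elsewhere in custodian)
pricel_actions = [{"dict": "INCAR", "action": {"_set": {"ISYM": 0}}}]

def make_odd(kmesh):
    """Hypothetical helper for "rot_matrix": bump each even k-point
    subdivision up to the next odd number, per the suggestion above."""
    return [k if k % 2 else k + 1 for k in kmesh]
```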
Currently, the SCF algorithm ladder in NonConvergingErrorHandler is tuned for GGAs and is not appropriate for meta-GGAs or hybrids. For instance, meta-GGAs should rarely be used with ALGO = VeryFast or Fast (ALGO = All is generally recommended and is the default in MPScanRelaxSet). For hybrids, ALGO = Fast or VeryFast should never be used according to the VASP manual, as hybrids don't support these algorithms even though no warning is printed: https://www.vasp.at/wiki/index.php/LHFCALC.
Will be closed when #179 is merged.
This issue only matters when using VASP monitors, and only when running jobs in a SLURM environment. It seems to originate from the Popen.terminate() Python API. If a job terminates normally, everything works properly. However, once a VASP job is killed by Popen.terminate(), which happens whenever a monitor finds an error, no new VASP job will be able to run. The error message is "Unable to create job step: Job/step already completing or completed". The power of custodian is completely blocked in this situation.
How do you know whether you are affected? If the success rate of your recent calculations has degraded, you are probably already affected.
I reproduced the error using a simple Python script without custodian and asked NERSC staff to fix it. They confirmed the issue and promised to investigate it. However, it has already been two weeks and I have not gotten any update. As a precaution, I think we should consider the possibility that they are not able to fix it in the short term.
I have an alternative to work around this issue in my fork. The point is to use "scancel" to kill the VASP job. This method releases the resources successfully, while Popen.terminate() doesn't. It has been working fine for the last two weeks. However, it makes the custodian code more complicated and environment-dependent. I am not sure whether it is a good idea to merge this piece of code into the main repository.
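The workaround could look roughly like this. A hypothetical sketch, not the fork's actual code; it mirrors the scancel invocation visible in the log earlier in this thread and only activates when a SLURM job id is present:

```python
import os
import subprocess

def scancel_terminate():
    """Hypothetical replacement for Popen.terminate() under SLURM: cancel
    the running job step via scancel so SLURM releases its resources.
    Returns the cancelled job id, or None when not running under SLURM."""
    job_id = os.environ.get("SLURM_JOB_ID")
    if job_id is None:
        return None
    subprocess.run(["scancel", "--signal=KILL", job_id], check=False)
    return job_id
```

The environment dependence the reporter worries about is visible here: outside SLURM the function is a no-op, so a caller would still need Popen.terminate() as a fallback.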