molssi / qcschema


A Schema for Quantum Chemistry

Home Page: http://molssi-qc-schema.readthedocs.io/en/latest/index.html#

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 98.89%, Makefile 1.11%
Topics: quantum-chemistry, schema

qcschema's People

Contributors

avirshup, bennybp, berquist, cryos, dgasmith, e-kwsm, ghutchis, jonathonmisiewicz, loriab, mattwelborn, sjrl, tovrstra, wadejong


qcschema's Issues

Basis issue orderings

Overall, we'd like to address the ordering of the molecular orbitals, particularly how the Cartesian order is formatted. The Molden format can be found here. However, there seems to be a disconnect among the codes in emulating the same ordering, and there is no way to extend the handwritten code in a suitable manner. So, to start off the discussion, an outline of the current status is summarized here. If anything has been misinterpreted or is missing, please feel free to add!

While Molden and GAMESS seem to follow the convention, Andy Simmonett has pointed out some inconsistencies via the FCHK file for Psi4. Ed Valeev's libint code also contains code to support basis function ordering, with some minor inconsistencies relative to the Molden format.

Cartesian coordinates

  • Currently, for the s through g shells, we should be expecting the following from the Molden format:
| Shell | Molden format (expected) | Andy's work (FCHK/Psi4) | Ed's work (GAMESS/libint) |
|---|---|---|---|
| S | 0 | 0 | 0 |
| P | X, Y, Z | Z, X, Y | X, Y, Z |
| D | XX, YY, ZZ, XY, XZ, YZ | XX, XY, XZ, YY, YZ, ZZ | XX, YY, ZZ, XY, YZ, XZ |
| F | XXX, YYY, ZZZ, XXY, XXZ, XZZ, XYY, YZZ, YYZ, XYZ | XXX, XXY, XXZ, XYY, XYZ, XZZ, YYY, YYZ, YZZ, ZZZ | XXX, YYY, ZZZ, XXY, XXZ, XYY, YYZ, XZZ, YZZ, XYZ |
| G | XXXX, YYYY, ZZZZ, XXXY, XXXZ, XYYY, YYYZ, XZZZ, YZZZ, XXYY, XXZZ, YYZZ, XXYZ, XYYZ, XYZZ | XXXX, XXXY, XXXZ, XXYY, XXYZ, XXZZ, XYYY, XYYZ, XYZZ, XZZZ, YYYY, YYYZ, YYZZ, YZZZ, ZZZZ | XXXX, YYYY, ZZZZ, XXXY, XXXZ, XYYY, YYYZ, XZZZ, YZZZ, XXYY, XXZZ, YYZZ, XXYZ, XYYZ, XYZZ |

A more organized layout can be seen in this screenshot:
(Screenshot: "Screen Shot 2018-04-23 at 1.47.30 PM"; image not reproduced here.)

  • Part of what contributes to the non-standardization of the Cartesian formatting is the lexical nature of the various codes. Angular momentum is denoted as "l" in the FCHK work and "am" in libint's work. In Molden, the Cartesian format is requested by the number of functions and the label denoting the orbital type; for example, [6D] requests all six d-orbital Cartesian functions (xx, yy, zz, xy, xz, yz).

  • The row order in which the Cartesian functions are emitted also needs to be discussed.

Iterator example

Once we've agreed on what the ordering is, the next effort would be developing a small piece of code that iterates out the desired ordering for every shell. Daniel Smith (@dgasmith) wrote this:

def row_cartesian_order(L):
    # Generate the "row" (lexicographic) Cartesian ordering for a shell of
    # total angular momentum L.  Each yield is (index, L, m, n); the exponents
    # of the Cartesian function x^l y^m z^n satisfy l = L - m - n.
    idx = -1
    for i in range(L + 1):
        l = L - i          # x exponent
        for j in range(i + 1):
            m = i - j      # y exponent
            n = j          # z exponent
            idx += 1
            yield (idx, L, m, n)

and the iterator will print out the Cartesian orders as:

This is the order for s orbitals:
(0, 0, 0, 0)
This is the order for p orbitals:
(0, 1, 0, 0)
(1, 1, 1, 0)
(2, 1, 0, 1)
This is the order for d orbitals:
(0, 2, 0, 0)
(1, 2, 1, 0)
(2, 2, 0, 1)
(3, 2, 2, 0)
(4, 2, 1, 1)
(5, 2, 0, 2)
This is the order for f orbitals:
(0, 3, 0, 0)
(1, 3, 1, 0)
(2, 3, 0, 1)
(3, 3, 2, 0)
(4, 3, 1, 1)
(5, 3, 0, 2)
(6, 3, 3, 0)
(7, 3, 2, 1)
(8, 3, 1, 2)
(9, 3, 0, 3)
This is the order for g orbitals:
(0, 4, 0, 0)
(1, 4, 1, 0)
(2, 4, 0, 1)
(3, 4, 2, 0)
(4, 4, 1, 1)
(5, 4, 0, 2)
(6, 4, 3, 0)
(7, 4, 2, 1)
(8, 4, 1, 2)
(9, 4, 0, 3)
(10, 4, 4, 0)
(11, 4, 3, 1)
(12, 4, 2, 2)
(13, 4, 1, 3)
(14, 4, 0, 4)
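For reference, a minimal driver along these lines reproduces the listing above; the shell-letter labels in the printout are my own convenience, not part of any proposal:

for L, letter in enumerate("spdfg"):
    print(f"This is the order for {letter} orbitals:")
    for entry in row_cartesian_order(L):
        print(entry)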

Spherical coordinates

In addition to the Cartesian orbital formats, we should pin down what the standard format for the spherical coordinates should be. Currently, the Molden format has this structure for orbitals:

| Shell | Molden format (expected)* | NWChem |
|---|---|---|
| S | 0 | 0 |
| P | 0, +1, -1 | -1, 0, +1 |
| D | 0, +1, -1, +2, -2 | -2, -1, 0, +1, +2 |
| F | 0, +1, -1, +2, -2, +3, -3 | -3, -2, -1, 0, +1, +2, +3 |
| G | 0, +1, -1, +2, -2, +3, -3, +4, -4 | -4, -3, -2, -1, 0, +1, +2, +3, +4 |

*Psi4 and libint share Molden's format for spherical coordinates

Edits:
04/27/2018: added the spherical orderings for the programs noted by @wadejong and @dgasmith, which brings NWChem into the mix.
04/23/2018: minor edits to make the table reader-friendly; linked Daniel in the iterator section.

Suggestion: support for YAML file format

YAML files, like JSON files, translate directly into hierarchies of key/value pairs. However, YAML offers superior human readability compared to JSON. Why not allow the option of YAML-style inputs and outputs? This may be preferred by users who choose to interact directly with input and output files.
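To make the comparison concrete, here is a minimal sketch (assuming PyYAML is installed; the toy input mirrors examples elsewhere in this thread and is not a schema proposal) showing that the same hierarchy round-trips between JSON and YAML:

import json

import yaml  # PyYAML, assumed available

# A toy QCSchema-style input, purely illustrative.
job = {
    "schema_name": "qcschema_input",
    "schema_version": 1,
    "molecule": {"symbols": ["He"], "geometry": [0.0, 0.0, 0.0]},
    "model": {"method": "SCF", "basis": "6-31g"},
    "driver": "energy",
}

# YAML reads back to the identical dictionary, so supporting it is mostly an
# I/O-layer choice rather than a change to the schema itself.
yaml_text = yaml.safe_dump(job, sort_keys=False)
assert yaml.safe_load(yaml_text) == json.loads(json.dumps(job))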

ordering of lists in Molecule schema

If the rules are:

  • order of atoms in topology schema is absolute and may not be reshuffled and
  • lists are inherently ordered

then array fields like "masses" are fine because the atoms are in order.

What of bonds and fragments, then?

"connectivity": {
"description": "A list describing bonds within a molecule. Each element is a (atom1, atom2, order) tuple.",
"type": "array",
"items": {
"type": "array",
"minItems": 3,
"maxItems": 3,
"items": {
"type": "number",
"minimum": 0,
"maximum": 5,
}
}
},
"fragments": {
"description":
"(nfr, -1) list of indices (0-indexed) grouping atoms into molecular fragments within the topology.",
"type": "array",
"items": {
"type": "array",
"items": {
"type": "number",
"multipleOf": 1.0
}
}
},

I very much recognize that fragmenters and n-body drivers will have their own systems for defining and indexing fragments (and that fragment_multiplicities and fragment_charges must follow along), and that those must not be disturbed. But is there any reason not to require this field to be sorted (e.g., [[5, 0], [4, 1, 3], [2]]) for ease of comparison? The same (and, imo, stronger) case applies to sorting the "connectivity" field.

Maybe beyond QC there's a good reason to leave these free-ordered? Should two JSON molecules whose schemas differ only by [[5, 0], [4, 1, 3], [2]] vs. [[2], [5, 0], [4, 1, 3]] resolve to the same hash?
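As a thought experiment, a canonicalization pass before hashing could look like the sketch below (one possible convention, not necessarily the rule being proposed; function names are mine). Sorting fragments by their lowest atom index while leaving the inner order alone happens to reproduce the [[5, 0], [4, 1, 3], [2]] example above:

def canonical_connectivity(connectivity):
    # Store each bond as (min_atom, max_atom, order) and sort bonds
    # lexicographically, so equivalent molecules compare (and hash) equal.
    return sorted((min(a, b), max(a, b), order) for a, b, order in connectivity)


def canonical_fragments(fragments):
    # Sort fragments by their lowest atom index, preserving the atom order
    # inside each fragment; fragment_charges/fragment_multiplicities would be
    # permuted with the same key.
    return sorted(fragments, key=min)


print(canonical_fragments([[2], [5, 0], [4, 1, 3]]))  # [[5, 0], [4, 1, 3], [2]]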

Versioning Issues

General questions about the versioning from the workshop discussion:

  1. How frequently do you want to update the schema? (6 months)
  2. What is the minimum amount of time a PR should be up? (week)
  3. How many collaborators must approve of the schema before it is pulled in? (3)
  4. How are contentious issues resolved? (?)

Chemical identity information for non-QM packages

Those of us in the MD community would very much like to be able to take output from QM packages directly into the MD engines and chemistry toolkits we use. However, these typically require what I'll call the "chemical identity" of the molecule, since (without QM) we can't infer this simply from the elements/number of electrons.

To that end, I'd like to see how receptive people would be to including in the schema the necessary information, such as formal charges (on atoms), connectivity, and bond orders. Presumably this wouldn't be particularly helpful to people staying in the QM world, but for us it would save a whole bunch of intermediate steps and/or a need to know what molecule is contained in the JSON before we can do anything with it.

Alternatively, providing an isomeric SMILES of the molecule or similar could also work. Basically, we just need some way of knowing what molecule (and charge state) it contains without having to "do chemistry" on the file to determine that.

If people are receptive to this idea I can open up a PR to add this to the requirements.md.

To be slightly more specific, I am also proposing broadening the concept of topology to also include bonding information and/or chemical identity (beyond just the coordinates and elements for the molecule).
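For illustration only, the kind of payload I have in mind might look like the sketch below; connectivity follows the existing field discussed elsewhere in this repo, while formal_charges and identifiers are hypothetical names, not existing QCSchema fields:

molecule_with_identity = {
    "symbols": ["O", "H", "H"],
    "geometry": [0.0, 0.0, 0.0, 0.0, -0.757, 0.587, 0.0, 0.757, 0.587],  # toy coordinates
    "connectivity": [[0, 1, 1], [0, 2, 1]],  # (atom1, atom2, bond order)
    "formal_charges": [0, 0, 0],             # hypothetical per-atom formal charges
    "identifiers": {"smiles": "O"},          # hypothetical; isomeric SMILES fallback
}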

C-compatible QCSchema implementation

I'm planning to create a QCSchema implementation that can be used more easily from compiled languages. But before I set out to reinvent the wheel, does anyone know of an existing C-compatible QCSchema implementation?

Units and conversion factors

I have implemented in Jmol a general Java class for handling QC-JSON units. It is working smoothly in my first tries at QC-JSON writer and reader.

See org.qcshema

I suggest that this be the start of a Java library.

Who else is interested in working on this with me?

Request wavefunction data returns

A key component of the schema which we have not hit on too much so far is the return of orbitals/densities/eigenvalues for visualization and passing data between programs. I would like to push the discussion of the return types and storage of these quantities off to a separate topic (there should be one soon discussing ordering and the like).

These "wavefunction" returns would be isolated to anything of the size of the basis set or larger. Browsing around it seems like the following quantities are useful to return:

  • Orbitals
  • Densities
  • Eigenvalues
  • Overlap matrix
  • Is there anything else crucial for a first pass?

Would a proposed structure like the following work?

{
    "return_wavefunction_data": {
        "orbitals": True,
        "density": True,
        "eigenvalues": True,
        "overlap": True
    }
}

with a similar return structure.

Questions:

  • Should these go in the keywords argument or present a new top level option?
  • Is it sufficient to return the AO matrices only for now and consider spatial symmetry at a later date?
  • Should the output live in the properties field, which is currently restricted to single numbers and small arrays?
  • Output keys should be able to handle alpha/beta, perhaps as orbitals_alpha?

move "schema_*" fields into molecule schema

It's not a major problem, but I do hit some odd constructions fairly often because the identifying info for a molecule (schema_name and schema_version) lives a level higher than the molecule data.

For example, the structure below has two layers of 'molecule' so that to_schema and from_schema can act directly on findif['molecule']. One has to findif.update(mol.to_schema()) rather than assign in order to add the QCSchema Mol, or extract findif['molecule'] plus the two schema_* fields above it to pull the molecule back out.

I recognize that for sanity purposes we want as few schemas as possible in the wild. But I wonder if that's a false saving in Molecule's case. Other experiences?

I suggest {'schema_name': 'qc_mol_schema', 'schema_version': 2} be added within the Mol schema and that Mol be independent of the upper job schema.

findif = 
{'displacement_space': 'CdSALC',
 'displacements': {'0: -1': {'geometry': [-1.0035217817285502,
                                          -1.5308084989341915e-17,
                                          0.0,
                                          1.0035217817285502,
                                          1.5308084989341915e-17,
                                          0.0]},
                   '0: 1': {'geometry': [-0.9964782182714499,
                                         -1.5308084989341915e-17,
                                         0.0,
                                         0.9964782182714499,
                                         1.5308084989341915e-17,
                                         0.0]}},
 'molecule': {'molecule': {'atom_labels': ['', ''],
                           'atomic_numbers': [1, 1],
                           'fix_com': False,
                           'fix_orientation': False,
                           'fragment_charges': [0.0],
                           'fragment_multiplicities': [1],
                           'fragments': [[0, 1]],
                           'geometry': [-1.0,
                                        -1.5308084989341915e-17,
                                        0.0,
                                        1.0,
                                        1.5308084989341915e-17,
                                        0.0],
                           'mass_numbers': [1, 1],
                           'masses': [1.00782503223, 1.00782503223],
                           'molecular_charge': 0.0,
                           'molecular_multiplicity': 1,
                           'name': 'H2',
                           'provenance': {'creator': 'QCElemental',
                                          'routine': 'qcelemental.molparse.from_string',
                                          'version': 'v0.1.3'},
                           'real': [True, True],
                           'symbols': ['H', 'H']},
              'schema_name': 'qc_schema_input',
              'schema_version': 1},
 'project_rotations': True,
 'project_translations': True}
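Concretely, the proposal would make the inner block self-describing, along these lines (a sketch, trimmed to a few fields):

molecule = {
    'schema_name': 'qc_mol_schema',
    'schema_version': 2,
    'symbols': ['H', 'H'],
    'geometry': [-1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    'molecular_charge': 0.0,
    'molecular_multiplicity': 1,
}

findif['molecule'] = molecule  # plain assignment; no extra wrapping layer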

Geometry unit conversion factor

posting for @andysim

Instead of providing the units, it may make sense to provide conversion factors to atomic units because they can vary fairly significantly between packages.

Providing an input_units_to_au field kills both different units and different physconst conversions with one stone. It also helps universal printing labels like Geometry (in Bohr * 1.00000000):.

May also consider OpenMM's units solution: https://github.com/pandegroup/openmm/blob/master/wrappers/python/simtk/unit/unit_definitions.py
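A minimal sketch of what that could look like on a molecule block (only the input_units_to_au name comes from the suggestion above; the numbers are illustrative):

mol = {
    # Geometry as written by the producing program, e.g. in Angstroms.
    "geometry": [0.0, 0.0, 0.0, 0.0, 0.0, 0.9572],
    # That program's own Angstrom-to-Bohr factor; consumers multiply to get a.u.
    "input_units_to_au": 1.8897261254578281,
}

geometry_bohr = [x * mol["input_units_to_au"] for x in mol["geometry"]]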

Obligations of Partner Codes

Should there be lists of light to full obligations for partner codes, both on the JSON-producing and JSON-consuming side?

for example,

  • outputs QM portion of Molecule into JSON (QC program)
  • can run a geometry optimization from a JSON input (QC program)
  • tests its JSON input/output alongside usual testing scheme (ALL)
  • plots orbitals from JSON schema basis sets and densities (Viz program)
  • returns JSON vibrational modes from JSON Hessian (vibrational analysis program)

molecule extensions for zmat and efp

This is a continuation from a split on #44

Without undermining the agreed-upon Cartesian exchange format for Mols, there are other input formats and other non-QM molecule domains out there. In particular, these can interact with the main Cartesian QM domain.

In Psi4 we've rewritten things so that all molecule parsing, basis set attaching, and molecule exchange is in (a close relative of) QCSchema up to the point at which it hits our internal C++ class. That class supports ZMat internal storage, so rather than drop that widespread functionality, we need a way of transporting the ZMat info to the constructor, hence the very generic geom_unsettled and variables fields. Psi4 has no intention of using the ZMat extension as an output format. That is, (in: Cart, ZMat) --> (out: Cart) is and remains the plan. This is possible because all programs accept Cartesians as input.

ZMatrix -- required fields are geom_unsettled and symbols

        "psi4:geom_unsettled": {
            "description": "(nat, )  all-string Cartesian and/or zmat anchor and value contents.",
            "type": "array",
            "items": {
                "type": "string"
            }
        },
        "psi4:variables": {
            "description": "(nvar, 2) pairs of variables (str) and values (float). May be incomplete.",
            "type": "array",
            "items": {
                "type": "array",
                "items": {
                    "type": "string",
                }
            }
        }
import numpy as np

zmat_schema_example = {
    'symbols': np.array(['H', 'O', 'O', 'H']),
    'geom_unsettled': [[], ['1', '0.95'], ['2', '1.40', '1', 'A'], ['3', '0.95', '2', 'A', '1', '120.0']],
    'variables': [['A', 105.0]],
}
mixed_zmat_cartesian_example = {
    'geom_unsettled': [..., ['-2.509000000000', '-0.794637665924', '0.000000000000'], ['1', 'CC', '3', '30', '2', 'A2']],
    ...
}

Programming-wise, effective fragment potentials are very useful for both complicating and clarifying dictionary-like system exchange between programs. In EFP, the full Cartesian geometry is only available through calls to an EFP library with fragment files. Instead, parsing only supplies the EFP fragment files and orientation hints. Most importantly, the EFP and QM domains are connected (because the orientation needs to be frozen) and are best parsed simultaneously. The output format is pure xyzabc hints.

EFP -- required fields are fragment_files and geom_hints. I usually require hint_types, too, but that can be read off from the length of the arrays in geom_hints. It's just a question of whether this should be generic enough for other, perhaps overlapping-length, hint types.

        "psi4:fragment_files": {
            "description": "(nfr, ) lowercased names of efp fragment files (no path info).",
            "type": "array",
            "items": {
                "type": "string"
            }
        },
        "psi4:hint_types": {
            "description": "(nfr, ) type of fragment orientation hint.",
            "type": "string",
            "enum": ["xyzabc", "points"]
        },
        "psi4:geom_hints": {
            "description": "(nfr, ) inner lists have length 6 (xyzabc; to orient the center) or
                            9 (points; to orient the first three atoms) of the EFP fragment.",
            "type": "array",
            "items": {
                "type": "array",
                "items": {
                    "type": "number"
                }
            }
        },
mixed_qm_efp_schema = {
    'qm': {
        'geom': np.array([0., 0., 0.118720, -0.753299, 0.0, -0.474880, 0.753299, 0.0, -0.474880]),
        'symbols': np.array(['O', 'H', 'H']),
        'fix_com': True,
        'fix_orientation': True,
        'fix_symmetry': 'c1',
    },
    'efp': {
        'fragment_files': ['h2o', 'ammonia'],
        'geom_hints': [[-2.12417561, 1.22597097, -0.95332054, -2.902133, -4.5481863, -1.953647],
                       [0.98792, 1.87681, 2.85174, 1.68798, 1.18856, 3.09517, 1.45873, 2.55904, 2.27226]],
        'hint_types': ['xyzabc', 'points'],
        'fix_com': True,
        'fix_orientation': True,
        'fix_symmetry': 'c1',
    }
}

We have from_string parsing and validation tech for all three domains (QM Cart, ZMat, EFP) that has been working soundly with these extensions for many months and that others are free to use.

Reserved Fields

You may want to create a list of reserved fields for likely expansion and to prevent people from trying to be clever by claiming multiplicity with the definition 2S because that's more convenient for them than the MolSSI molecular_multiplicity of 2S+1 (I did not look up the proper field names).

Multiple conformations in a single file?

As a force field developer and quantum chemistry user, I often find myself working with collections of structures (conformations) and associated energies. This could be useful for torsion drives in 1D and 2D, as well as reaction energies / minimum energy paths. Often I am also interested in running the same quantum chemistry method on the whole set of conformations. Thus, I think it would be very helpful if the schema could support this.

Version 1

I have started implementing this schema into several QM packages and external geometry optimizers to get a better handle on what we can currently do, what the issues are, and what the (many) additional fields are that we need to add.

So that other software can have something stable to complete the handshake, I would like to mint a "Version 1" of the schema this month. While particularly lacking on the visualization and atomic orbital quantities, the record of energies and gradients is still quite useful for many applications.

Are there objections to this, or fields that we must add before a "beta" release?

Suggestion: rename variables to results

It is probably just me but variables is an odd name for results of a calculation. These results are by definition not variable once you receive the output of a calculation.

Additional tensorial properties: pair with QCEl#241

This is the sister-issue to QCEl#241. If the properties are to be added to the schema, it looks like we should solicit opinions from this repo as well (if I read #68 correctly).

We're looking to add some tensorial properties, namely polarizabilities and other (linear) response tensors such as the Rosenfeld mixed magnetic- and electric-dipole tensor. This is somewhat related to #70 in that these tensors are often redundant, but also unique in that the details of the perturbing field (order, operator(s), frequency) should also be considered. The openrsp model is promising, but perhaps more complicated than we need (for now). Any feedback is appreciated.
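Purely to have something concrete to react to, one possible shape for such an entry is sketched below; every field name and value here is my own assumption, not something drawn from QCEl#241 or openrsp:

response_property = {
    "name": "dipole-dipole polarizability",
    "operators": ["electric dipole", "electric dipole"],  # one per perturbation
    "frequencies": [0.0, 0.0],                            # a.u.; 0.0 means static
    "method": "CCSD",
    "shape": [3, 3],
    # Row-major, redundant (full) storage; values are placeholders.
    "values": [10.1, 0.0, 0.0,
               0.0, 10.1, 0.0,
               0.0, 0.0, 10.1],
}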

Wavefunction data

A request that we often see is for wavefunction data. Now that the basis set specification is beginning to conclude, it would be a good time to begin discussing the layout. Below is an example of a new top-level field where the basis data and wavefunction data are provided.

Decisions to make here:

  1. Should there be a new top level field for wavefunction-like data and what should it be called? Another option would be to place this data in the properties top level field, but this combines output that requires the full basis set specification and data that are simple variables.
  2. Where should the basis information be located? There is some discussion that this could (eventually) be a JSON-LD like approach to a basis library (https://www.basissetexchange.org). However, until the basis information is uniform between programs this is not currently possible.
  3. What are the fields that we should list?
  4. Can we assume that for RHF wave functions, only alpha quantities are present?
  5. Should we tackle matrix quantities that have symmetry (e.g., Fock matrices) to reduce the footprint by a factor of ~2?

Remaining issues to work through:

  • Orbital ordering (#12)
  • Large array representations (#15)
  • How does the input change to request these quantities?

Example:

{
    # Schema headers
    'schema_name': 'qcschema_output',
    'schema_version': 1,

    # Minimal Input
    'molecule': {
        'symbols': ['He'],
        'geometry': [0.0, 0.0, 0.0]
    },
    'model': {
        'method': 'SCF',
        'basis': '6-31g'
    },
    'driver': 'energy',
    'keywords': {},

    # Standard output data
    'raw_output': None,
    'success': True,
    'provenance': {
        'creator': 'Psi4',
        'version': '1.3.2',
        'routine': 'psi4.json.run_json'
    },
    'return_result': -2.8551790335918543,
    'properties': {
        'calcinfo_nbasis': 2,
        'calcinfo_nmo': 2,
        'calcinfo_nalpha': 1,
        'calcinfo_nbeta': 1,
        'calcinfo_natom': 1,
        'return_energy': -2.8551790335918543,
        'nuclear_repulsion_energy': 0.0,
        'scf_one_electron_energy': -3.882050835689409,
        'scf_two_electron_energy': 1.0268718020975545,
        'scf_dipole_moment': [0.0, 0.0, 0.0],
        'scf_iterations': 4,
        'scf_total_energy': -2.8551790335918543
    },
    'wavefunction': {
        
        # Begin with a basis
        'basis': {
            'revision_description': 'Data from Gaussian 09/GAMESS',
            'elements': {
                '2': {
                    'electron_shells': [{
                        'function_type': 'gto',
                        'region': 'valence',
                        'angular_momentum': [0],
                        'exponents': ['0.3842163400E+02', '0.5778030000E+01', '0.1241774000E+01'],
                        'coefficients': [['0.4013973935E-01', '0.2612460970E+00', '0.7931846246E+00']]
                    }, {
                        'function_type': 'gto',
                        'region': 'valence',
                        'angular_momentum': [0],
                        'exponents': ['0.2979640000E+00'],
                        'coefficients': [['1.0000000']]
                    }],
                    'references': [{
                        'reference_description': '31G Split-valence basis set for H,He',
                        'reference_keys': ['gaussian09e01']
                    }]
                }
            },
            'version': '1',
            'function_types': ['gto', 'gto_cartesian'],
            'names': ['6-31G'],
            'flags': [],
            'family': 'pople',
            'description': '6-31G valence double-zeta',
            'role': 'orbital',
            'auxiliaries': {},
            'name': '6-31G'
        },
        
        # The individual matrices
        'orbitals_a': [0.5920657102503524, 1.1498260600756343, 0.5136020636861139, -1.1869518498493001],
        'orbitals_b': [0.5920657102503524, 1.1498260600756343, 0.5136020636861139, -1.1869518498493001],
        'density_a': [0.3505418052542542, 0.30408617062236576, 0.30408617062236576, 0.26378707982263505],
        'density_b': [0.3505418052542542, 0.30408617062236576, 0.30408617062236576, 0.26378707982263505],
        'fock_a': [-0.549237187834466, -1.000373967977106, -1.000373967977106, -0.4292228146057008],
        'fock_b': [-0.549237187834466, -1.000373967977106, -1.000373967977106, -0.4292228146057008]
    }
}

QC-JSON working prototype -- Jmol

I have a working prototype in Jmol now. Very simple to test.

Just:

  1. download and unzip the [Jmol latest release](https://sourceforge.net/projects/jmol/files/latest/download?source=files)
  2. start Jmol.jar and open a file...console
  3. load some Jmol-readable calculation output into Jmol using the load command or by drag-dropping into the application.
  4. issue WRITE xxx.qcjson.

A directory of prototypes produced this way is now in the jmol-data qcjson directory.

It is designed just to cover the area that I will be interested in for Jmol. Specifically:

Anyway, it is working. The purpose of this is not to force an issue, just to explore what is needed.

And one of the things I found I needed was an indication of the orbital normalization mode; maybe this is not needed in the final business, but I suggest we have that, as it would allow direct conversion from legacy output such as we have here.

Bob

  • overall format:

["magic number/version",
{metadata block},
{job block},
{job block},
....
]

  • basic job/step hierarchy

{job block} ==
{
"metadata":{....},
"steps":[{step block}, {step block},...],
"mo_bases":{
basis id: {basis block},
basis id: {basis block},
....
}
}

{step block} ==
{
"metadata":{....},
"topology": {topology block},
"vibrations":[{vibration block}, {vibration block},..},
"molecular_orbitals":{molecular orbital block},
}

{topology block} ==
{
"atoms":{atom block}
}

{atom block} ==
{
"coords_units":["angstroms",1.88972613],
"coords":[ x1, y1, z1, x2, y2, z2,....],
"symbol":["_RLE_",...run-length-encoded data...],
"atom_number":["_RLE_",...run-length-encoded data...]
}

{molecular orbital block} ==
{
"orbitals":[{orbital block},{orbital block}...],
"__jmol_calculation_type":critical jmol data",
"basis_id":basis id,
"__jmol_normalized":boolean,
"orbitals_energy_units":["?","?"]
}

{basis block} ==
{
"gaussians":[GTO gaussians array of arrays],
"shells":[GTO shells array of arrays],
"slaters":[STO slaters array]
}

etc. (sorry, I know that is not complete...)

add schema fields to molecule

The molecule schema carries its identifying information ("schema_name" and "schema_version") a level higher than the main molecule data. This leads to odd constructions like the one below, so that one can to_schema and from_schema directly on the findif['molecule'] field rather than having to update (findif.update(mol.to_schema())) or extract from the big dict.

I agree with minimizing the number of schema versions running around for everyone's sanity. But I wonder if keeping Mol in the main job QCSchema is a consolidation too far. Other experiences?

As an alternative, I propose that molecule itself support {'schema_name': 'qc_mol_schema', 'schema_version': 2} alongside its data.

findif =
{'displacement_space': 'CdSALC',
 'displacements': {'0: -1': {'geometry': [-1.0035217817285502,
                                          -1.5308084989341915e-17,
                                          0.0,
                                          1.0035217817285502,
                                          1.5308084989341915e-17,
                                          0.0]},
                   '0: 1': {'geometry': [-0.9964782182714499,
                                         -1.5308084989341915e-17,
                                         0.0,
                                         0.9964782182714499,
                                         1.5308084989341915e-17,
                                         0.0]}},
 'molecule': {'molecule': {'atom_labels': ['', ''],
                           'atomic_numbers': [1, 1],
                           'fix_com': False,
                           'fix_orientation': False,
                           'fragment_charges': [0.0],
                           'fragment_multiplicities': [1],
                           'fragments': [[0, 1]],
                           'geometry': [-1.0,
                                        -1.5308084989341915e-17,
                                        0.0,
                                        1.0,
                                        1.5308084989341915e-17,
                                        0.0],
                           'mass_numbers': [1, 1],
                           'masses': [1.00782503223, 1.00782503223],
                           'molecular_charge': 0.0,
                           'molecular_multiplicity': 1,
                           'name': 'H2',
                           'provenance': {'creator': 'QCElemental',
                                          'routine': 'qcelemental.molparse.from_string',
                                          'version': 'v0.1.3'},
                           'real': [True, True],
                           'symbols': ['H', 'H']},
              'schema_name': 'qc_schema_input',
              'schema_version': 1},
 'project_rotations': True,
 'project_translations': True}

CSI JSON-LD

Chemical Semantics, Inc. (CSI) has developed a JSON-LD format for computational chemistry data. The example file can be found here. A JSON-LD file is still a valid JSON file, but it defines special keywords to construct documents that can alias the definition of the name of a key-value pair, the data type of the value, URIs, and other unique identifiers. Consequently, the JSON-LD specification can store data/metadata with semantic meaning in a JSON format. A JSON-LD file can easily be converted to an RDF file. At our web portal, the JSON-LD file associated with each publication can be downloaded.


Add item to "existing JSON efforts"

The Materials Project software stack (including atomate for workflow generation, fireworks for running workflows/workflow management, pymatgen for analysis) makes heavy use of JSON -- any class that subclasses MSONable has a JSON representation. This includes Structure (periodic crystals) and Molecule classes in pymatgen, as well as the workflows themselves and calculation outputs.

I'm not sure if this is relevant to the current effort, since atomate is primarily used for inorganic materials at present, and is quite general and not quantum chemistry-specific, but I thought it'd be worth adding to the list in case it's of interest to anyone here.

Starter FAQ

Answers by DGAS

  1. Will the json be validated before it reaches me?
    No, but we will provide a library.

  2. Is MolSSI the validation gatekeeper?
    Yes

  3. Can I add extra fields if my software piece needs internal extensions?
    Absolutely (any non-taken field is valid)

  4. Will this be broad enough that we can actually abandon [annoying-but-widespread-file-format] files like xyz, molden, mol2?
    Hopefully

  5. Are there libraries for writing this in [language]?
    No, we will only supply libraries for Python/C++.

Detailed molecular basis set specification (output of a calculation, not input)

Rough idea

The current schema does not yet have a standardized method for specifying the Gaussian (or Slater) AO basis that was used in a calculation for a given molecule. Just to be clear: I'm looking for a way to fully specify a basis set in the output of a calculation, after the calculation has been carried out. This is different from specifying a basis set in the input for a calculation, because in the output you want to know how to interpret quantities expressed in the AO basis. For that you need more information, namely:

  • Order of the basis functions within one shell.
  • Normalization and sign conventions for the basis functions.
  • The center for each shell of basis functions.

Because basis functions are not necessarily centered on nuclei, an array with centers must be included as well.

I think we can mostly JSON-ize the molden format (see [GTO], [STO] and [MO] sections of http://www.cmbi.ru.nl/molden/molden_format.html) plus some improvements:

  • Make the ordering of the basis functions within one shell (and pure versus Cartesian) explicit as follows:

    {"conventions": {
      "d": ["xx", "yy", "zz", "xy", "xz", "yz"],
      "f": ["f0c", "f1c", "f1s", "f2c", "f2s", "-f3s", "-f3c"]
    }}

    Cartesian functions always match the x*y*z* regular expression. For pure functions, the string consists of four parts: sign (optional), angular momentum (letter code), magnetic quantum number (absolute value), s or c for sine- or cosine-like. The optional sign is needed because some codes (e.g. ORCA) have unusual sign conventions, different from most other QC codes. The line for the f-functions in the example is how ORCA orders basis functions (and flips signs) in AO arrays in a Molden file. (This is different from what the Molden program actually expects.) To fix the meaning of the strings for the pure functions completely, we should write out the mathematical form in the JSON schema.

  • Allow very high angular momenta, exhausting the whole alphabet (except j, see Wikipedia, more authoritative reference always welcome): ["s", "p", "d", "f", "g", "h", "i", "k", "l", "m", "n", "o", "q", "r", "t", "u", "v", "w", "x", "y", "z", "a", "b", "c", "e"].

  • Assume contractions are already normalized, i.e. do not assume that the program reading the contraction coefficients will fix the normalization for you. Instead, the program writing out the contractions should take care of that. (The Molden format assumes contractions need to be normalized upon reading, which is not suitable for all cases.)

  • Support (very) generalized contractions, i.e. also compatible with basis sets from CP2K. See https://github.com/cp2k/cp2k/blob/master/cp2k/data/BASIS_MOLOPT (It does not get more general than that.) For every shell we should have a list of angular momenta, e.g. ["s", "p"], ["s", "s"] or ["s", "s", "s", "p", "p", "d"]. The last example is something you could encounter in a molopt basis set.

  • Keep a list of centers for the shells, which can be different from atomic positions, e.g. when using ghost atoms or doing other funny things.

  • Include pseudopotential specification? (I'm not an expert.)

  • Details for STO basis functions need to be worked out. (I'm not an expert on that either.)

Example for NH3

{"orbital_basis": {
  "type": "gto",
  "conventions": {
    "p": ["x", "y", "z"],
    "d": ["d0c", "d1c", "d1s", "d2c", "d2s"],
  },
  "shells": [
    [0, 8, ["s", "s"],
     [0.9046E+4, 0.1357E+4, 0.3093E+3, 0.8773E+2, 0.2856E+2, 0.1021E+2, 0.3838E+1, 0.7466],
     [0.6996174134E-3, 0.538605463E-2, 0.2739102119E-1, 0.103150592, 0.2785706633, 0.4482948495, 0.2780859284, 0.1543156123E-1],
     [-0.304990096E-3, -0.2408026379E-2, -0.1194444873E-1, -0.489259929E-1, -0.1344727247, -0.3151125777, -0.2428578325, 0.1094382207E+1]
    ],
    [0, 1, ["s"], [0.2248], [1.0]],
    [0, 1, ["s"], [0.6124E-1], [1.0]],
    [0, 3, ["p"],
     [0.1355E+2, 0.2917E+1, 0.7973],
     [0.5890567677E-1, 0.3204611067, 0.7530420618]],
    [0, 1, ["p"], [0.2185], [1.0]],
    [0, 1, ["p"], [0.5611E-1], [1.0]],
    [0, 1, ["d"], [0.817], [1.0]],
    [0, 1, ["d"], [0.23], [1.0]],
    [1, 3, ["s"],
     [0.1301E+2, 0.1962E+1, 0.4446],
     [0.3349872639E-1, 0.2348008012, 0.8136829579]],
    [1, 1, ["s"], [0.122], [1.0]],
    [1, 1, ["s"], [0.2974E-1], [1.0]],
    [1, 1, ["p"], [0.727], [1.0]],
    [1, 1, ["p"], [0.141], [1.0]],
    [2, 3, ["s"],
     [0.1301E+2, 0.1962E+1, 0.4446],
     [0.3349872639E-1, 0.2348008012, 0.8136829579]],
    [2, 1, ["s"], [0.122], [1.0]],
    [2, 1, ["s"], [0.2974E-1], [1.0]],
    [2, 1, ["p"], [0.727], [1.0]],
    [2, 1, ["p"], [0.141], [1.0]],
    [3, 3, ["s"],
     [0.1301E+2, 0.1962E+1, 0.4446],
     [0.3349872639E-1, 0.2348008012, 0.8136829579]],
    [3, 1, ["s"], [0.122], [1.0]],
    [3, 1, ["s"], [0.2974E-1], [1.0]],
    [3, 1, ["p"], [0.727], [1.0]],
    [3, 1, ["p"], [0.141], [1.0]],
  ],
  "centers": [
    [-0.0140883131, 0.0845903925, 0.1037711513],
    [1.4952113836, 0.0214187375, 0.0445603623],
    [-0.5919457779, -1.6621666211, 0.5350312215],
    [-0.7075176488, 0.4654193413, -2.0214243076]
  ]
}}

Details

The shells field is a list, where each item represents one generalized contraction, stored as a list with the following items (a small parsing sketch follows the list):

  • center index (counting from zero?)
  • the number of primitives
  • the angular momenta for the generalized contraction
  • the Gaussian exponents
  • one or more lists of contraction coefficients. (More are present in case of generalized contractions.)
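A minimal parsing sketch for one item of that layout, using one of the p shells from the NH3 example (the helper name is mine):

def unpack_shell(shell):
    # Layout per the list above:
    # [center, nprim, angular momenta, exponents, coeffs_0, coeffs_1, ...]
    center, nprim, angmoms, exponents, *coefficient_sets = shell
    assert len(exponents) == nprim
    assert len(coefficient_sets) >= 1  # more than one for generalized contractions
    return center, nprim, angmoms, exponents, coefficient_sets


center, nprim, angmoms, exps, coefs = unpack_shell(
    [0, 3, ["p"],
     [0.1355E+2, 0.2917E+1, 0.7973],
     [0.5890567677E-1, 0.3204611067, 0.7530420618]])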

To do

  • Units
  • Support different normalization conventions. Furthermore, Cartesian functions have different normalization constants within one shell, which causes some ambiguity.

Keeping QCSchema in sync with QCElemental

QCSchema lags behind what's actually implemented AND DOCUMENTED in QCElemental. The QCSchema docs are where people look for info, so this creates a misleading impression about what QCSchema is/does in practice.

Some possible solutions:

  1. QCSchema/docs is always what you get from running .schema()/autodoc on the myriad models in QCElemental
  2. QCElemental models represent schema in development, and from time to time we concretize its state into release versions of QCSchema.
  3. QCElemental models are not allowed to change until QCSchema changes.

My favorite option is 1. I'm okay with 2. I think 3 is a bad idea.

@dgasmith @bennybp

How to represent large arrays

Large arrays (e.g. density matrices when using a lot of basis functions) may cause some efficiency issues. At the moment all arrays are represented as a list of (lists of ...) numbers. This has some limitations:

  1. Writing a double precision number as text in JSON gives about +150% overhead. This may be solved to some extent with BSON or HDF5, provided arrays can be recognized easily before they are written to the file.

  2. Reading a long list of data, then detecting that it is an array and then converting it to an array is slow and very annoying in some programming languages. It would be very convenient to know in advance that the list is an array with a given shape and data type.

Both aspects of this issue can be fixed by representing arrays as follows, and by handling such data appropriately when reading and writing a BSON or HDF5 file.

{"density matrix": {
  "shape": [10, 10],
  "type": "float64",
  "array": [1.234123, ...]
}}

It would even be possible to extend the JSON format to encode the array with base64 encoding (only +33% overhead). I know that such practical details are not really a part of the JSON schema. However, it is relevant to define the schema such that arrays can be read and written efficiently.
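A sketch of a writer/reader pair for such an entry with NumPy, including the base64 variant mentioned above (key names follow the example; the base64 branch assumes little-endian float64):

import base64

import numpy as np


def encode_array(arr, use_base64=False):
    # Encode as the {"shape", "type", "array"} object above; optionally store
    # the raw little-endian float64 bytes as base64 instead of a list.
    data = (base64.b64encode(arr.astype("<f8").tobytes()).decode("ascii")
            if use_base64 else arr.ravel().tolist())
    return {"shape": list(arr.shape), "type": str(arr.dtype), "array": data}


def decode_array(obj):
    data = obj["array"]
    if isinstance(data, str):  # base64 branch
        flat = np.frombuffer(base64.b64decode(data), dtype="<f8")
    else:
        flat = np.array(data, dtype=obj["type"])
    return flat.reshape(obj["shape"])


dm = np.random.rand(10, 10)
assert np.allclose(decode_array(encode_array(dm, use_base64=True)), dm)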

Multi-method properties

In the course of a given QM computation multiple properties (particularly one-electron) could be constructed so that a single field may become ambiguous. A good example is a CCSD density computation which may form SCF, MP2, and CCSD densities. For a quantity like dipole_moments a program may build a set of moments for each density.

A possibility is to have keys for each scf_dipole_moments, mp2_dipole_moments, and ccsd_dipole_moments. Another possibility is to build a dipole moment definition object:

property: {
  type: "dipole moment",
  method: "SCF",
  value: ...
}

which holds the properties for each method. (Let me know if I misunderstood this @langner.)

Brought up by @langner in #37.

Charges (AKA populations)

Add atomic charges/populations to the schema. The field needs to carry information about both the charge method and the electronic structure method. One possibility is to use our current convention and name fields like scf_lowdin_charges, scf_mulliken_charges, etc.

Another would be to keep the method in the name, but then have a list of charge methods, e.g.

{
  "scf_charges": [
    {"charges": [1,2,3,4],
      "charge_method": "lowdin"
    },
    {"charges": [5,6,7,8],
      "charge_method": "mulliken"
    }
  ]
}

Or finally, it could just be a great big list:

{
  "charges": [
    {"charges": [1,2,3,4],
      "charge_method": "lowdin",
      "method": "scf"
    },
    {"charges": [5,6,7,8],
      "charge_method": "mulliken",
      "method": "mp2 relaxed"
    }
  ]
}

multipole storage

No decision is necessary for the dipole, because all 3 elements are unique, but for quadrupoles and higher one has to choose between compact storage with a defined order (e.g., xx, xy, xz, yy, yz, zz) and the full representation (e.g., 9-element quadrupole storage). The former saves space but requires more management, which is hard to impose in a schema as a data layout. I propose that higher multipoles be stored in full. For 64-poles, this full storage is a redundant 729 elements (28 unique). Any concerns or objections?
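For the quadrupole, the two layouts and the conversion between them would look roughly like the sketch below (the compact order xx, xy, xz, yy, yz, zz is the one named above; helper names are mine):

import numpy as np


def quadrupole_full_from_compact(compact):
    # Expand the 6 unique elements into redundant 3x3 (9-element) storage.
    xx, xy, xz, yy, yz, zz = compact
    return np.array([[xx, xy, xz],
                     [xy, yy, yz],
                     [xz, yz, zz]])


def quadrupole_compact_from_full(full):
    # Read the unique upper-triangle elements back out, in xx, xy, xz, yy, yz, zz order.
    i, j = np.triu_indices(3)
    return np.asarray(full)[i, j]


q = quadrupole_full_from_compact([1.0, 0.1, 0.2, 2.0, 0.3, 3.0])
assert list(quadrupole_compact_from_full(q)) == [1.0, 0.1, 0.2, 2.0, 0.3, 3.0]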
