Comments (10)
I think the schema can only ever go so far, and we see similar issues with other formats, e.g. referring to an index that is out of range, or a unique id that doesn't exist. Schema gets you so far, then you document expected use. I think order should absolutely matter, as the other options end up blowing up the need for unique ids, which also may not exist except as a security blanket of sorts.
I would go (c) all the way, and at some point would like to take a stab at a deeper validation of files once I am done with our implementation of the spec for Avogadro. Atom reordering, index reordering is tough. How would you hash even if things are ordered, remove all space, and ensure same numerical accuracy somehow? Hashing of JSON is tough as object order is not guaranteed, especially considering all the parsers, numerical accuracy, etc.
from qcschema.
I don't quite understand the problem. If the ordering is unconstrained but fully specified, isn't it possible to re-order whenever needed, for hashing or whatever else?
from qcschema.
Is there a place that collects these "un-written rules"? Maybe formalizing these conventions a little would help with topics like these, where there is some expectation of use but it doesn't feel like a schema should enforce it outright.
from qcschema.
I guess, on the flip side, is there a reason to require ordering here, and who does the enforcing, since JSON Schema specs do not enforce it? I am hesitant to enforce requirements above and beyond what the base schema provides, as it increases implementation details for adopters.
I think molecule hashing is a question for QCFractal itself and not particularly specific to the Schema itself.
from qcschema.
Taking a step back, here are three choices. Hope this makes more sense.
For the examples, consider a H2/He system where "QC" is a quantum chemistry program like psi that can't handle non-contiguous fragments and "FQC" is a fancy program that can.
- (a) Fragment order doesn't matter to mol identity, fragments sorted for Schema
  - this is what I was pushing for in original post, with the assumption that (a) vs. (b) was the choice. this (a) has some aesthetic advantages, but they're not pressing. happy to give up on (a).
  - easy at-a-glance comparison of schema instances
  - no need for schema processors to reorder frags
  - visualizers will always show fragment A or bond B0 as same color
  - examples
    - [[0, 1], [2]]: Schema happy. QC happy. FQC happy. Both: {A: H2, B: He}
    - [[2], [0, 1]]: Schema violated (though unenforceable to schemavalidator) as frags unsorted
- (b) Fragment order doesn't matter to mol identity, Fragment order unconstrained for Schema
  - I think this is what the schema is representing now, where fragment order matters only to correlate fragments across the fragments, fragment_multiplicities, and fragment_charges fields. I think this'll work for a long time, as most QC programs aren't reporting per-fragment data (e.g., F-SAPT). As soon as programs do report per-frag data, this (b) may become problematic for the same reason that atom reshuffling is already recognized as problematic. Hence, my new question is (b) vs. (c).
  - examples
    - [[0, 1], [2]]: Schema happy. QC happy. FQC happy. Both: {A: H2, B: He}
    - [[2], [0, 1]]: Schema happy. QC can sort fragments into [[0, 1], [2]] {A: H2; B: He}. FQC happy {A: He; B: H2}. Conflict!
- (c) Fragment order does matter to mol identity
  - new rule that like atom order, schema processors must not reorder frags
  - visualizers will always show fragment A or bond B0 as same color
  - examples
    - [[0, 1], [2]]: Schema happy. QC happy. FQC happy. Both: {A: H2, B: He}
    - -- diff sys --
    - [[2], [0, 1]]: Schema happy. QC stop as atoms out of order. FQC happy {A: He, B: H2}
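To make the (b) conflict concrete, here is a minimal sketch, assuming the molecule is a plain dict carrying the symbols, fragments, fragment_charges, and fragment_multiplicities fields. The helpers sort_fragments and label_fragments are hypothetical names for illustration, not part of QCSchema.

```python
from string import ascii_uppercase


def sort_fragments(mol):
    """Return a copy of mol with fragments sorted by their lowest atom index.

    Hypothetical helper: the per-fragment fields are permuted together so
    they stay correlated with the fragments list.
    """
    frags = mol["fragments"]
    order = sorted(range(len(frags)), key=lambda i: min(frags[i]))
    out = dict(mol)
    for key in ("fragments", "fragment_charges", "fragment_multiplicities"):
        if key in mol:
            out[key] = [mol[key][i] for i in order]
    return out


def label_fragments(mol):
    """Label fragments A, B, ... in the order they appear in the list."""
    symbols = mol["symbols"]
    return {
        ascii_uppercase[n]: "".join(symbols[i] for i in frag)
        for n, frag in enumerate(mol["fragments"])
    }


# The H2/He system from the examples, with He listed as the first fragment.
mol = {
    "symbols": ["H", "H", "He"],
    "fragments": [[2], [0, 1]],
    "fragment_charges": [0.0, 0.0],
    "fragment_multiplicities": [1, 1],
}

fqc_labels = label_fragments(mol)                 # program that keeps the order
qc_labels = label_fragments(sort_fragments(mol))  # program that sorts first
```

Under (b) both programs are within their rights, yet fqc_labels calls He fragment A while qc_labels calls H2 fragment A, so any per-fragment result keyed by position would be misattributed. Under (c) the sort step would simply be disallowed.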
from qcschema.
Currently for QCArchive we use (c), where order does matter to the most strict molecular identity for the reasons mentioned above (there are less strict identities for which this does not matter). I do agree that (a) is the best scenario.
If we do a) the main question to me is how do we enforce this? At the moment we can say use json-schema draft4 validators or higher and everything is great. Stepping off json-schema validators we will need to say draft4 + extra stuff which starts to get complex. Saying that this is the "recommended" way of organizing fragments will likely cause odd conflicts in the future. Not sure I have a good answer here.
I do recommend trying to separate identity and the schema. QCA has multiple identity tags depending on the required precision, and they are fairly specific to the QCA framework. It would be great to specify a generic identity here, but I believe it would require much more input and work than the schema is currently receiving.
from qcschema.
I agree that "draft4 + extra stuff" is to be avoided. I guess I consider that not much of the schema beyond structure is enforced now. There's nothing to prevent geometry from being (2 * nat,) or atomic number from being -3 or fragments from 1-indexing atoms or symbols from being all-caps or two atoms from being co-spatial. All those checks are part of the validation step in code (the extension of which sparked this issue), and I figured enforcement of (a) would be similar.
EDIT: That is, there's extra spec in the "description" subfield, and (a)/(b)/(c) would be similar.
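As a rough illustration of the kind of in-code validation step described above, here is a minimal sketch. It assumes a molecule dict with symbols, a flat geometry list, and optional atomic_numbers and fragments fields; check_molecule and its tol parameter are hypothetical, not the actual validation code.

```python
import math


def check_molecule(mol, tol=1.0e-8):
    """Collect problems that a purely structural schema validator misses.

    Hypothetical helper; returns a list of human-readable messages.
    """
    problems = []
    symbols = mol["symbols"]
    geom = mol["geometry"]
    nat = len(symbols)

    # geometry must hold three Cartesian components per atom
    if len(geom) != 3 * nat:
        problems.append(f"geometry has {len(geom)} values, expected {3 * nat}")

    for z in mol.get("atomic_numbers", []):
        if z < 0:
            problems.append(f"negative atomic number {z}")

    # 1-indexed fragments show up as an index equal to nat
    for frag in mol.get("fragments", []):
        for idx in frag:
            if not 0 <= idx < nat:
                problems.append(f"fragment index {idx} outside 0..{nat - 1}")

    for sym in symbols:
        if sym != sym.capitalize():
            problems.append(f"symbol {sym!r} is not title-case")

    # pairwise distance check, only meaningful if the geometry length is right
    if len(geom) == 3 * nat:
        xyz = [geom[3 * i: 3 * i + 3] for i in range(nat)]
        for i in range(nat):
            for j in range(i + 1, nat):
                if math.dist(xyz[i], xyz[j]) < tol:
                    problems.append(f"atoms {i} and {j} are co-spatial")

    return problems
```

A fragment-ordering rule, whichever of (a)/(b)/(c) is chosen, could be one more check of this sort rather than something pushed into the JSON Schema itself.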
I think (c) is best for future-proofing. (a) takes advantage of "fragmentN" being a dummy variable of sorts, but I can imagine it scaring nbody writers. (b) is what I think the Schema intends now, and it looks dangerous.
(Trying to keep identity and schema separated ...) The schema contains info on what the consumer can do to the molecule (e.g., if fix_com=True, then COM mustn't shift). There's also un-written info like consumer must not attach this molecule to results containing per-atom arrays (like gradients) if atoms have been reordered in the calculation thereof (recall that I guessed wrongly on this interpretation). Wherever that un-written rule goes, I just think that the analogous rule about fragments (diff btwn (b) and (c)) should also be clarified.
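The per-atom rule can be sketched as follows, assuming a program reordered atoms internally and reports a flat gradient of length 3 * nat in its own order. The function name and the perm convention (new atom k is input atom perm[k]) are assumptions made for this illustration.

```python
def gradient_to_input_order(gradient, perm):
    """Map a flat gradient from the program's atom order back to input order.

    perm[k] is the input-order index of the atom the program placed at
    position k (a hypothetical convention for this sketch).
    """
    nat = len(perm)
    if len(gradient) != 3 * nat:
        raise ValueError("gradient length must be 3 * nat")
    out = [0.0] * (3 * nat)
    for new_idx, old_idx in enumerate(perm):
        out[3 * old_idx: 3 * old_idx + 3] = gradient[3 * new_idx: 3 * new_idx + 3]
    return out


# Program moved He (input atom 2) to the front: its order is He, H, H.
perm = [2, 0, 1]
program_gradient = [9.0, 9.0, 9.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
input_gradient = gradient_to_input_order(program_gradient, perm)
```

Only input_gradient may be attached to the original molecule; attaching program_gradient would silently assign He's forces to the first hydrogen. The same bookkeeping would be needed for per-fragment results if fragments may be reordered.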
from qcschema.
@cryos Thanks for the input. Do you have examples where schemas document expected use? As @loriab implied, we are already doing that with the (2 * nat, ) language. It would be best if we could represent this in a better way.
As for hashing JSON molecule representations, it is certainly doable but not particularly transferable outside of a single piece of code, which is why it needs to remain isolated outside the schema. Numerical accuracy is the fun one, where you have to pin rounding rules and (my favorite) flip zeros so that they are all negative/positive, and also potentially orient systems to a common frame. Feel free to ping me if you want to spin off that discussion.
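A minimal sketch of what such a hash could look like, assuming rounding to a fixed number of decimals, positive zeros, and sorted JSON keys. This is not the QCArchive implementation; molecule_hash, canonical_float, and the digits default are hypothetical, and orientation to a common frame is left out.

```python
import hashlib
import json


def canonical_float(x, digits=8):
    """Round to a fixed number of decimals and flip -0.0 to 0.0."""
    r = round(float(x), digits)
    return 0.0 if r == 0 else r


def molecule_hash(mol, digits=8):
    """Hash a molecule dict in a way that ignores key order and float noise.

    Atom and fragment order are kept as given, so reordering either one
    changes the hash, as in option (c).
    """
    payload = {
        "symbols": list(mol["symbols"]),
        "geometry": [canonical_float(x, digits) for x in mol["geometry"]],
        "fragments": [list(frag) for frag in mol.get("fragments", [])],
    }
    # sort_keys and fixed separators pin the serialization within Python
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

The result depends on Python's float formatting as well as on the chosen rounding, which is one reason a hash like this does not transfer easily between codes.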
from qcschema.
I don't have one off the top of my head, I will try to take a better look when I have time to come up for air - so many deadlines these next few days...
from qcschema.
It sounds like we don't know where a schema guru would go looking for "use guidelines". But is my impression that we're settling toward (c) correct? Any reason not to plop 'em into molecule.py (both atom-ordering and frag-ordering guidance) until "correct" location identified?
Back to 'connectivity' (in https://github.com/MolSSI/QCSchema/blob/master/qcschema/dev/molecule.py#L81-L94), I withdraw my first-post wish to constrain to sorted (equiv. of (a)). Do we agree that the schema's intent for bonds is (b)? (I think anyone using QC data simply expects bond/angle/etc. ordering to differ between programs and geometries.) And am I misreading the connectivity snippet or is the bond between atoms 6 & 7 going to fail schema validation?
from qcschema.