
Comments (10)

cryos commented on June 20, 2024

I think the schema can only ever go so far, and we see similar issues with other formats, e.g. referring to an index that is out of range, or a unique id that doesn't exist. A schema gets you so far; beyond that, you document expected use. I think order should absolutely matter, as the other options end up creating a need for unique ids, which may also not exist and would serve only as a security blanket of sorts.

I would go (c) all the way, and at some point would like to take a stab at a deeper validation of files once I am done with our implementation of the spec for Avogadro. Atom reordering and index reordering are tough. Even if things are ordered, how would you hash: remove all whitespace and somehow ensure the same numerical accuracy? Hashing of JSON is tough as object order is not guaranteed, especially considering all the parsers, numerical accuracy, etc.

from qcschema.

langner commented on June 20, 2024

I don't quite understand the problem. If the ordering is unconstrained but fully specified, isn't it possible to re-order whenever needed, for hashing or whatever else?
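To make the re-ordering idea concrete, here is a minimal sketch, not anything from the schema itself. It assumes a molecule is a plain dict with the `fragments`, `fragment_charges`, and `fragment_multiplicities` fields, that fragments are non-empty, and that sorting by lowest atom index is an acceptable canonical order. The function name `sort_fragments` and the He+/H2 values are made up for illustration.

```python
def sort_fragments(mol):
    """Return a copy of mol with fragments sorted by their lowest atom index.

    The per-fragment arrays are permuted with the same ordering, so entry i
    of each array still describes fragment i. Assumes non-empty fragments.
    """
    frags = mol["fragments"]
    order = sorted(range(len(frags)), key=lambda i: min(frags[i]))

    out = dict(mol)  # shallow copy; the input dict is left untouched
    out["fragments"] = [list(frags[i]) for i in order]
    for key in ("fragment_charges", "fragment_multiplicities"):
        if key in mol:
            out[key] = [mol[key][i] for i in order]
    return out


# Illustrative input only: He+ listed before H2.
mol = {
    "symbols": ["H", "H", "He"],
    "fragments": [[2], [0, 1]],
    "fragment_charges": [1.0, 0.0],
    "fragment_multiplicities": [2, 1],
}
canonical = sort_fragments(mol)
```

The catch is that any per-fragment data stored elsewhere (results, for instance) would need the same permutation applied, which is where the trouble discussed in this thread starts.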


langner commented on June 20, 2024

Is there a place that collects these "un-written rules"? Maybe formalizing these conventions a little would help with topics like these, where there is some expectation of use but it doesn't feel like a schema should enforce it outright.


dgasmith commented on June 20, 2024

I guess on the flip side, is there a reason to require ordering here, and who does the enforcing, since JSON Schema specs do not enforce it? I am hesitant to enforce requirements above and beyond what the base schema provides as it increases implementation details for adopters.

I think molecule hashing is a question for QCFractal itself and not particularly specific to the Schema itself.


loriab commented on June 20, 2024

Taking a step back, here are three choices. Hope this makes more sense.

For the examples, consider a H2/He system where "QC" is a quantum chemistry program like psi that can't handle non-contiguous fragments and "FQC" is a fancy program that can.

  • (a) Fragment order doesn't matter to mol identity, fragments sorted for Schema

    • this is what I was pushing for in original post, with the assumption that (a) vs. (b) was the choice. this (a) has some aesthetic advantages, but they're not pressing. happy to give up on (a).
    • easy at-a-glance comparison of schema instances
    • no need for schema processors to reorder frags
    • visualizers will always show fragment A or bond B0 as same color
    • examples
      • [[0, 1], [2]] — Schema happy. QC happy. FQC happy. Both: {A: H2, B: He}
      • [[2], [0, 1]] — Schema violated (though unenforceable to schemavalidator) as frags unsorted
  • (b) Fragment order doesn't matter to mol identity, Fragment order unconstrained for Schema

    • I think this is what the schema is representing now, where fragment order matters only to correlate fragments across fragments, fragment_multiplicities, and fragment_charges fields. I think this'll work for a long time, as most QC programs aren't reporting per-fragment data (e.g., F-SAPT). As soon as programs do report per-frag data, this (b) may become problematic for the same reason that atom reshuffling is already recognized as problematic. Hence, my new question is (b) vs. (c).
    • examples
      • [[0, 1], [2]] — Schema happy. QC happy. FQC happy. Both: {A: H2, B: He}
      • [[2], [0, 1]] — Schema happy. QC can sort fragments into [[0, 1], [2]] {A: H2; B: He}. FQC happy {A: He; B: H2} Conflict!
  • (c) Fragment order does matter to mol identity

    • new rule that like atom order, schema processors must not reorder frags
    • visualizers will always show fragment A or bond B0 as same color
    • examples
      • [[0, 1], [2]] — Schema happy. QC happy. FQC happy. Both: {A: H2, B: He}
      • -- diff sys --
      • [[2], [0, 1]] — Schema happy. QC stop as atoms out of order. FQC happy {A: He, B: H2}
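The (b) conflict and the (c) behavior above can be sketched in a few lines. This is only an illustration under stated assumptions: `qc_view`, `fqc_view`, and `qc_view_strict` are hypothetical stand-ins for the "QC" and "FQC" programs, labels are assigned A, B, ... in processing order, and H2 is printed as "HH" because the label is built by joining atom symbols.

```python
from string import ascii_uppercase

SYMBOLS = ["H", "H", "He"]


def label_fragments(fragments, symbols=SYMBOLS):
    """Label fragments A, B, ... in the order they are given."""
    return {
        ascii_uppercase[i]: "".join(symbols[at] for at in frag)
        for i, frag in enumerate(fragments)
    }


def qc_view(fragments):
    """Hypothetical "QC" under (b): sorts fragments so atoms are contiguous."""
    return label_fragments(sorted(fragments, key=min))


def fqc_view(fragments):
    """Hypothetical "FQC": keeps the fragment order exactly as supplied."""
    return label_fragments(fragments)


def qc_view_strict(fragments):
    """Hypothetical "QC" under (c): refuses rather than silently reordering."""
    flat = [at for frag in fragments for at in frag]
    if flat != sorted(flat):
        raise ValueError("atoms out of order; refusing to reorder fragments")
    return label_fragments(fragments)


print(qc_view([[2], [0, 1]]))   # {'A': 'HH', 'B': 'He'}
print(fqc_view([[2], [0, 1]]))  # {'A': 'He', 'B': 'HH'}
```

Under (b) the two programs disagree about which fragment is "A" for the same input; under (c) the strict variant raises instead, so the mismatch surfaces immediately.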


dgasmith commented on June 20, 2024

Currently for QCArchive we use c), where order does matter to the strictest molecular identity for the reasons mentioned above (there are less strict identities for which this does not matter). I do agree that a) is the best scenario.

If we do a) the main question to me is how do we enforce this? At the moment we can say use json-schema draft4 validators or higher and everything is great. Stepping off json-schema validators we will need to say draft4 + extra stuff which starts to get complex. Saying that this is the "recommended" way of organizing fragments will likely cause odd conflicts in the future. Not sure I have a good answer here.

I do recommend trying to separate identity and the schema. QCA has multiple identity tags depending on the required precision, and they are fairly specific to the QCA framework. It would be great to specify a generic identity here, but I believe it would require much more input and work than the schema is currently receiving.


loriab commented on June 20, 2024

I agree that "draft4 + extra stuff" is to be avoided. I guess I consider that not much of the schema beyond structure is enforced now. There's nothing to prevent geometry from being (2 * nat,) or atomic number from being -3 or fragments from 1-indexing atoms or symbols from being all-caps or two atoms from being co-spatial. All those checks are part of the validation step in code (the extension of which sparked this issue), and I figured enforcement of (a) would be similar.

EDIT: That is, there's extra spec in the "description" subfield, and (a)/(b)/(c) would be similar.
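For concreteness, the kind of in-code checks listed above might look roughly like the following. This is a sketch, not the project's actual validation step: `check_molecule` is a hypothetical name, the molecule is assumed to be a plain dict with a flat `geometry` list, and the tolerance for co-spatial atoms is arbitrary.

```python
import itertools
import math


def check_molecule(mol, tol=1e-6):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    symbols = mol["symbols"]
    nat = len(symbols)
    geom = mol["geometry"]

    if len(geom) != 3 * nat:
        problems.append(f"geometry has {len(geom)} entries, expected {3 * nat}")
    else:
        # Only look for co-spatial atoms once the geometry has a usable shape.
        xyz = [geom[3 * i: 3 * i + 3] for i in range(nat)]
        for i, j in itertools.combinations(range(nat), 2):
            if math.dist(xyz[i], xyz[j]) < tol:
                problems.append(f"atoms {i} and {j} are co-spatial")

    for z in mol.get("atomic_numbers", []):
        if z < 0:
            problems.append(f"negative atomic number {z}")

    for s in symbols:
        if s != s.capitalize():
            problems.append(f"symbol {s!r} is not capitalized like 'He'")

    if "fragments" in mol:
        # Catches 1-indexing, out-of-range indices, and duplicated atoms.
        flat = sorted(at for frag in mol["fragments"] for at in frag)
        if flat != list(range(nat)):
            problems.append("fragments do not cover atoms 0..nat-1 exactly once")

    return problems
```

A check for (a) or (c) would be one more rule of the same kind, outside what a JSON Schema validator can express.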

I think (c) is best for future-proofing. (a) takes advantage of "fragmentN" being a dummy variable of sorts, but I can imagine it scaring nbody writers. (b) is what I think the Schema intends now, and it looks dangerous.

(Trying to keep identity and schema separated ...) The schema contains info on what the consumer can do to the molecule (e.g., if fix_com=True, then COM mustn't shift). There's also un-written info like consumer must not attach this molecule to results containing per-atom arrays (like gradients) if atoms have been reordered in the calculation thereof (recall that I guessed wrongly on this interpretation). Wherever that un-written rule goes, I just think that the analogous rule about fragments (diff btwn (b) and (c)) should also be clarified.


dgasmith commented on June 20, 2024

@cryos Thanks for the input. Do you have examples where schemas document expected use? As @loriab implied, we are already doing that with the (2 * nat, ) language. It would be best if we could represent this in a better way.

As per hashing JSON molecule representations, it is certainly doable but not particularly transferable outside of a single piece of code which is why it does need to remain isolated outside the schema. Numerical accuracy is the fun one where you have to pin rounding rules and (my favorite) flip zeros so that they are all negative/positive and also potentially orientating systems to a common frame. Feel free to ping me if you want to spin off that discussion.
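A rough sketch of the pieces mentioned here (pinned rounding, zero flipping, and a fixed key order) is below. It is not QCFractal's implementation: the function names, the choice of 8 decimal places, and SHA-1 are all assumptions for illustration, and re-orienting to a common frame is left out entirely.

```python
import hashlib
import json


def _normalize(value, ndigits):
    """Round floats, turn -0.0 into 0.0, and recurse into containers."""
    if isinstance(value, float):
        value = round(value, ndigits)
        return 0.0 if value == 0 else value
    if isinstance(value, dict):
        return {k: _normalize(v, ndigits) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [_normalize(v, ndigits) for v in value]
    return value


def molecule_hash(mol, ndigits=8):
    """Hash a JSON-like molecule dict independent of key order and float noise."""
    payload = json.dumps(
        _normalize(mol, ndigits),
        sort_keys=True,            # object key order no longer matters
        separators=(",", ":"),     # no whitespace differences
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()
```

Two values that straddle a rounding boundary can still hash differently, and an integer `1` and a float `1.0` serialize differently, which is part of why this tends to stay tied to one codebase.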


cryos commented on June 20, 2024

I don't have one off the top of my head, I will try to take a better look when I have time to come up for air - so many deadlines these next few days...


loriab commented on June 20, 2024

It sounds like we don't know where a schema guru would go looking for "use guidelines". But is my impression that we're settling toward (c) correct? Any reason not to plop 'em into molecule.py (both atom-ordering and frag-ordering guidance) until "correct" location identified?

Back to 'connectivity' (in https://github.com/MolSSI/QCSchema/blob/master/qcschema/dev/molecule.py#L81-L94), I withdraw my first-post wish to constrain to sorted (equiv. of (a)). Do we agree that the schema's intent for bonds is (b)? (I think anyone using QC data simply expects bond/angle/etc. ordering to differ between programs and geometries.) And am I misreading the connectivity snippet or is the bond between atoms 6 & 7 going to fail schema validation?
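If the intent for bonds is indeed (b), a consumer comparing two connectivity lists would have to ignore ordering. A minimal sketch, assuming each entry is an `[atom_i, atom_j, bond_order]` triple (an assumption about the layout, not something checked against the linked snippet) and using a hypothetical helper name:

```python
def normalized_bonds(connectivity):
    """Return bonds as a set, ignoring bond order in the list and atom order in each bond."""
    return {(min(i, j), max(i, j), float(order)) for i, j, order in connectivity}


# Same bonds, listed differently by two hypothetical programs.
prog1 = [[0, 1, 1], [1, 2, 2]]
prog2 = [(2, 1, 2.0), (1, 0, 1.0)]
same = normalized_bonds(prog1) == normalized_bonds(prog2)
```

This only works because nothing else is indexed by bond position; per-bond results would raise the same question as per-fragment data.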
