First, a big thank you to (I think primarily) @sauliusg for the work on the formal grammar for the filtering language. Now that I have gone through actually implementing these filters based on the grammar, I have a few thoughts.
I realize it may at first seem quite a major thing to propose changes to the filtering grammar at this point. However, I stress that none of the changes proposed below actually changes any behavior of presently working API implementations. These aren't changes to how API implementations should interpret filtering strings; they are changes in how to best define the present behavior in a way that is as unambiguous, consistent, and useful for implementors as possible.
The 'filter=' keyword
I propose that 'filter=' should not formally be part of the filtering language grammar at all. To me, 'filter=' is the delivery mechanism of the filter in the URL query, which is outside the filtering language itself. My two primary motivations:
- I now want to refer to 'OPTIMaDe filter strings' in various contexts, not just as delivered in the query API. It seems awkward to keep 'filter=' as a prefix in those contexts where it has no function, and it seems equally awkward to talk about, e.g., "the standard OPTIMaDe grammar but starting from the <Expression> node".
- It introduces an obstructive keyword that may get in the way of relevant queries. It is fairly easy to understand that you cannot name your Identifiers AND, OR, or NOT, but 'filter=' will down the line surely get in the way of someone's attempt to query on a field named 'filter', with potentially weird error messages: if an Expression contains "filter='test'", it will be tokenized as (Keyword: 'filter=', Identifier: 'test') instead of (Identifier: 'filter', Operator: '=', Identifier: 'test') as expected.
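The mis-tokenization in the second point can be sketched with two hypothetical lexer configurations; the token names and regexes below are my own illustration, not taken from the specification:

```python
import re

# Hypothetical illustration: a lexer that recognizes 'filter=' as a
# keyword (tried before identifiers) mis-tokenizes a query against a
# field that happens to be named 'filter'.
with_keyword = re.compile(
    r"(?P<KEYWORD>filter=)|(?P<IDENT>[a-zA-Z_]\w*)|(?P<OP>=)|(?P<STR>'[^']*')")
without_keyword = re.compile(
    r"(?P<IDENT>[a-zA-Z_]\w*)|(?P<OP>=)|(?P<STR>'[^']*')")

def tokenize(regex, s):
    # The name of the alternative that matched becomes the token kind.
    return [(m.lastgroup, m.group()) for m in regex.finditer(s)]

tokenize(with_keyword, "filter='test'")
# → [('KEYWORD', "filter="), ('STR', "'test'")]
tokenize(without_keyword, "filter='test'")
# → [('IDENT', 'filter'), ('OP', '='), ('STR', "'test'")]
```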
Hence, my proposal is to change the top of the grammar tree to:

OPTIMaDeFilter = Expression ;

and completely remove <Filter> and <Keyword> from the grammar.
Spaces
Is this way of handling whitespace with [Spaces] a good idea? Does any other 'standard' published EBNF grammar use anything resembling this? It seems rare when looking around. The Wikipedia article says "Whitespaces and comments are typically ignored in EBNF grammars".
Isn't the common way to deal with this rather to defer whitespace handling to the tokenizer (unless whitespace really is an integral part of the syntax of a language)? I believe we could just say something in the specification along the lines of: Except for strings, tokens do not span whitespace. All other whitespace (space, tab, and newline) should be discarded during tokenizing.
As the specification presently stands, in my implementation that uses a lexer handling whitespace "in the normal way", I'm very tempted to fetch the official grammar and then just do: grammar = grammar.replace("[Spaces]", ""). Is that what we want implementors to do?
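As a sketch of "the normal way", here is a minimal hypothetical tokenizer that matches whitespace like any other token and then simply drops it, so the grammar itself never needs to mention it. The token names and the operator set here are my own illustration, not from the specification:

```python
import re

# Hypothetical whitespace-discarding lexer sketch. The String and Number
# patterns follow the regexes proposed later in this post, translated to
# Python's re flavor.
TOKEN_SPEC = [
    ("STRING",     r'"([^\\"]|\\.)*"'),
    ("NUMBER",     r'[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?'),
    ("IDENTIFIER", r'[a-zA-Z_][a-zA-Z_0-9]*'),
    ("OPERATOR",   r'<=|>=|!=|=|<|>'),
    ("SKIP",       r'[ \t\n]+'),  # whitespace: matched, then discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(filter_string):
    tokens, pos = [], 0
    while pos < len(filter_string):
        match = MASTER.match(filter_string, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        if match.lastgroup != "SKIP":  # drop whitespace, keep everything else
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

With this approach, `tokenize('nelements > 3')` yields the same token stream as `tokenize('nelements>3')`, with no [Spaces] anywhere in the grammar.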
The non-standard definition of UnicodeHighChar
I think it is not such a good idea to insert Grammatica-specific syntax in the middle of the otherwise standard EBNF. If we could create a completely resolved standard EBNF I would be all for including it in the specification, but I do not see how that can be done with the choice of allowing arbitrary Unicode in strings.
That means we need to go to some non-standard EBNF for defining the <String> token anyway. And since that must be done, I suggest splitting the present EBNF into two machine-separable parts: one would be the formal standard EBNF grammar of non-terminals; the other would define all the suggested tokens to use in the lexer in a format useful for implementors. I propose POSIX Extended Regular Expressions.
Below follow the POSIX Extended Regular Expression token definitions I presently use in my implementation, which I believe are equivalent to the present specification. I suggest we incorporate them in the specification:
Identifier: [a-zA-Z_][a-zA-Z_0-9]*
String: "([^\"]|\\.)*"
Number: (\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)((e|E)(\+|-)?[0-9]+)?
(Note that due to differences in backslash escaping between regex flavors, the String definition above is edited from what I use in Python; the one above should be right for POSIX ERE.) These definitions then technically obsolete all of <UnicodeHighChar>, <EscapedChar>, <UnescapedChar>, <Punctuator>, <Exponent>, <Sign>, <Digits>, and <Digit>, which would be removed from the formal standard EBNF grammar part of the specification.
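For illustration, the three token definitions translate roughly as follows into Python's re flavor (a hypothetical translation; the backslash handling in the String pattern differs from the POSIX ERE form, as noted above):

```python
import re

# The \Z anchors force a full-token match when used with .match().
IDENTIFIER = re.compile(r'[a-zA-Z_][a-zA-Z_0-9]*\Z')
STRING     = re.compile(r'"([^\\"]|\\.)*"\Z')
NUMBER     = re.compile(r'[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?\Z')
```

This covers, e.g., identifiers like `chemical_formula`, strings like `"Al, Ga"` (including escaped quotes), and numbers like `-1.5e+10` or `.5`.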
But I'm certainly not opposed to ALSO including a Grammatica definition of the tokens, which would be the present EBNF-like version of those definitions with the Grammatica extension.
EDIT 2018-03-22: (To keep everything up here, I've appended another issue)
Allowing value=value, value=identifier, and identifier=identifier
Arguably, the most commonly expected construct in what is meant to be a somewhat straightforward filtering language is of the form identifier <operator> value. But the grammar also explicitly allows the following constructs: value <operator> value, value <operator> identifier, and identifier <operator> identifier. As I am trying to implement the handling of these, I run into some difficulties because OPTIMaDe doesn't (yet) properly define types for its fields. (I brought that up at the last CECAM meeting, but I couldn't find an issue filed for it; I need to look more, and if it is not there, file it as an issue.)
- value=identifier: from the technical standpoint, this one is trivial. I've included it here only because one can question whether there is a need to allow it, or whether the querying language would be simpler if it were disallowed.
- identifier=identifier: as the specification presently stands, what is the formally correct way of handling such a comparison if the identifiers are not the same? E.g., chemical_formula=prototype_formula or nelements > _exmp_other_numerical_field. Note that presently the OPTIMaDe type model essentially makes every property its own type, each defining its own semantics for comparisons. E.g., elements are equal regardless of order, and equal even if they contain subsets of elements (which really seems an abuse of the equals operator when there exists a >= ...).

However, I suspect that the correct handling here is to simply reject any comparison of two different identifiers unless it is clear in the specification that they have the same semantics (e.g., for integers with no comments about non-standard comparisons). But we are not so clear on that in the spec presently.
But the one that truly baffles me as to how to implement correctly is this one: value <operator> value. Since we don't have a type model where we can unambiguously detect the type from the expression of the value, I don't see how I can derive the semantics for this comparison. If I see "Al, Ga" = "Ga": is that an "element"-type comparison? Or a string comparison? How can I know?
So, in summary: going forward we absolutely need to think about the typing system for OPTIMaDe. In my opinion we need a type system where types (including the semantics for comparisons) are clearly derivable from the value expression. Then one can confidently either carry out a comparison, or throw a type error if the types do not match. With that, value <operator> value and identifier <operator> identifier become well-defined. This means that if we want to keep the particular comparison semantics for, e.g., the elements property, we need to define it as a "set" type and give it a form of expression that is recognizable, preferably down at the token level.
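As a sketch of what "derivable from the value expression" could mean in practice, the following hypothetical helper assumes the lexer has already produced float for Number tokens and str for String tokens, and rejects any comparison across types (the names and the operator table are my own invention, not from any OPTIMaDe specification):

```python
import operator

# Map filter-language operators to Python comparisons.
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       ">": operator.gt, "<=": operator.le, ">=": operator.ge}

def compare(lhs, rhs, op):
    """Compare two already-lexed values; raise on a type mismatch."""
    if type(lhs) is not type(rhs):
        raise TypeError(
            f"cannot compare {type(lhs).__name__} with {type(rhs).__name__}")
    return OPS[op](lhs, rhs)
```

With such a rule, 3 > 2 and "Al" = "Al" are well-defined, while 3 = "Al" is a type error instead of silently doing something implementation-specific.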
Until we have sorted that out, would it be better to disallow all other forms than identifier <operator> value?