First, a big thank you to (I think primarily) @sauliusg for the work on the formal grammar for the filtering language. Now that I have gone through actually implementing these filters based on the grammar, I have a few thoughts.
I realize it may at first seem quite a major thing to propose changes to the filtering grammar at this point. However, I stress that none of the changes proposed below actually changes any behavior of presently working API implementations. These aren't changes to how API implementations should interpret filtering strings; they are changes in how to best define the present behavior in a way that is as unambiguous, consistent, and useful for implementors as possible.
The 'filter=' keyword
I propose that 'filter=' should not formally be part of the filtering language grammar at all. To me, 'filter=' is the delivery mechanism of the filter in the URL query, which is outside the filtering language itself. My two primary motivations:
- I now want to refer to 'OPTIMaDe filter strings' in various contexts, not just as delivered in the query API. It seems awkward to keep 'filter=' as a prefix in those contexts where it has no function, and it seems equally awkward to talk about, e.g., "the standard OPTIMaDe grammar but starting from the <Expression> node".
- It introduces an obstructive keyword that may get in the way of relevant queries. It is fairly easy to understand that you cannot name your Identifiers AND, OR, or NOT, but 'filter=' will down the line surely get in the way of someone's attempt to query on a field named 'filter', with potentially weird error messages: if an Expression contains "filter='test'", it will be tokenized as (Keyword: 'filter=', Identifier: 'test') instead of (Identifier: 'filter', Operator: '=', Identifier: 'test') as expected.
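The mis-tokenization in the second point can be sketched with two hypothetical lexer configurations; the token names and regexes below are my own illustration, not taken from the specification:

```python
import re

# Hypothetical illustration: a lexer that recognizes 'filter=' as a
# keyword (tried before identifiers) mis-tokenizes a query against a
# field that happens to be named 'filter'.
with_keyword = re.compile(
    r"(?P<KEYWORD>filter=)|(?P<IDENT>[a-zA-Z_]\w*)|(?P<OP>=)|(?P<STR>'[^']*')")
without_keyword = re.compile(
    r"(?P<IDENT>[a-zA-Z_]\w*)|(?P<OP>=)|(?P<STR>'[^']*')")

def tokenize(regex, s):
    # The name of the alternative that matched becomes the token kind.
    return [(m.lastgroup, m.group()) for m in regex.finditer(s)]

tokenize(with_keyword, "filter='test'")
# → [('KEYWORD', "filter="), ('STR', "'test'")]
tokenize(without_keyword, "filter='test'")
# → [('IDENT', 'filter'), ('OP', '='), ('STR', "'test'")]
```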
Hence, my proposal is to change the top of the grammar tree to:

OPTIMaDeFilter = Expression ;

and completely remove <Filter> and <Keyword> from the grammar.
Spaces
Is this way of handling whitespace with [Spaces] a good idea? Does any other 'standard' published EBNF grammar use anything resembling this? It seems rare when looking around. The Wikipedia article says "Whitespaces and comments are typically ignored in EBNF grammars".
Isn't the common way to deal with this rather to defer whitespace handling to the tokenizer (unless whitespace really is an integral part of the syntax of a language)? I believe we could just say something in the specification along the lines of: Except for strings, tokens do not span whitespace. All other whitespace (space, tab, and newline) should be discarded during tokenizing.
As the specification presently stands, in my implementation that uses a lexer handling whitespace "in the normal way", I'm very tempted to fetch the official grammar and then just do: grammar = grammar.replace("[Spaces]", ""). Is that what we want implementors to do?
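As a sketch of "the normal way", here is a minimal hypothetical tokenizer that matches whitespace like any other token and then simply drops it, so the grammar itself never needs to mention it. The token names and the operator set here are my own illustration, not from the specification:

```python
import re

# Hypothetical whitespace-discarding lexer sketch. The String and Number
# patterns follow the regexes proposed later in this post, translated to
# Python's re flavor.
TOKEN_SPEC = [
    ("STRING",     r'"([^\\"]|\\.)*"'),
    ("NUMBER",     r'[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?'),
    ("IDENTIFIER", r'[a-zA-Z_][a-zA-Z_0-9]*'),
    ("OPERATOR",   r'<=|>=|!=|=|<|>'),
    ("SKIP",       r'[ \t\n]+'),  # whitespace: matched, then discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(filter_string):
    tokens, pos = [], 0
    while pos < len(filter_string):
        match = MASTER.match(filter_string, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        if match.lastgroup != "SKIP":  # drop whitespace, keep everything else
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

With this approach, `tokenize('nelements > 3')` yields the same token stream as `tokenize('nelements>3')`, with no [Spaces] anywhere in the grammar.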
The non-standard definition of UnicodeHighChar
I think it is not such a good idea to insert Grammatica-specific syntax in the middle of the otherwise standard EBNF. If we could create a completely resolved standard EBNF I would be all for including it in the specification, but I do not see how that can be done with the choice of allowing arbitrary Unicode in strings.
That means we need to go to some non-standard EBNF for defining the <String> token anyway. And since that must be done, I suggest splitting the present EBNF into two machine-separable parts: one would be the formal standard EBNF grammar of non-terminals; the other would define all the suggested tokens to use in the lexer in a format useful for implementors. I propose POSIX Extended Regular Expressions.
Below follow the POSIX Extended Regular Expression token definitions I presently use in my implementation, which I believe are equivalent to the present specification. I suggest we incorporate them in the specification:
Identifier: [a-zA-Z_][a-zA-Z_0-9]*
String: "([^\"]|\\.)*"
Number: (\+|-)?([0-9]+(\.[0-9]*)?|\.[0-9]+)((e|E)(\+|-)?[0-9]+)?
(Note that due to differences in backslash escaping between regex flavors, the String definition above is edited from what I use in Python; the one above should be right for POSIX ERE.) These definitions then technically obsolete all of <UnicodeHighChar>, <EscapedChar>, <UnescapedChar>, <Punctuator>, <Exponent>, <Sign>, <Digits>, and <Digit>, which would be removed from the formal standard EBNF grammar part of the specification.
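For illustration, the three token definitions translate roughly as follows into Python's re flavor (a hypothetical translation; the backslash handling in the String pattern differs from the POSIX ERE form, as noted above):

```python
import re

# The \Z anchors force a full-token match when used with .match().
IDENTIFIER = re.compile(r'[a-zA-Z_][a-zA-Z_0-9]*\Z')
STRING     = re.compile(r'"([^\\"]|\\.)*"\Z')
NUMBER     = re.compile(r'[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?\Z')
```

This covers, e.g., identifiers like `chemical_formula`, strings like `"Al, Ga"` (including escaped quotes), and numbers like `-1.5e+10` or `.5`.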
But I'm certainly not opposed to ALSO including a Grammatica definition of the tokens, which would be the present EBNF-like version of those definitions with the Grammatica extension.
EDIT 2018-03-22: (To keep everything up here, I've appended another issue)
Allowing value=value, value=identifier, and identifier=identifier
Arguably, the most commonly expected construct in what is meant to be a somewhat straightforward filtering language is of the form identifier <operator> value. But the grammar also explicitly allows the following constructs: value <operator> value, value <operator> identifier, and identifier <operator> identifier. As I am trying to implement the handling of these, I run into some difficulties because OPTIMaDe doesn't (yet) properly define types for its fields. (I brought that up at the last CECAM meeting, but I couldn't find an issue filed for it; I need to look more, and if it is not there, file it as an issue.)
- value=identifier: from the technical standpoint, this one is trivial. I've included it here only because one can question whether there is a need to allow it, or whether the querying language would be simpler if it were disallowed.
- identifier=identifier: as the specification presently stands, what is the formally correct way of handling such a comparison if the identifiers are not the same? E.g., chemical_formula=prototype_formula or nelements > _exmp_other_numerical_field. Note that presently the OPTIMaDe type model essentially makes every property its own type, each defining its own semantics for comparisons. E.g., elements are equal regardless of order, and equal even if they contain subsets of elements (which really seems an abuse of the equals operator when there exists a >= ...).

However, I suspect that the correct handling here is to simply reject any comparison of two different identifiers unless it is clear in the specification that they have the same semantics (e.g., for integers with no comments about non-standard comparisons). But we are not so clear on that in the spec presently.
But the one that truly baffles me as to how to implement correctly is this one: value <operator> value. Since we don't have a type model where we can unambiguously detect the type from the expression of the value, I don't see how I can derive the semantics for this comparison. If I see "Al, Ga" = "Ga": is that an "element"-type comparison? Or a string comparison? How can I know?
So, in summary: going forward we absolutely need to think about the typing system for OPTIMaDe. In my opinion we need a type system where types (including the semantics for comparisons) are clearly derivable from the value expression. Then one can confidently either carry out a comparison, or throw a type error if the types do not match. With that, value <operator> value and identifier <operator> identifier become well-defined. This means that if we want to keep the particular comparison semantics for, e.g., the elements property, we need to define it as a "set" type and give it a form of expression that is recognizable, preferably down at the token level.
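As a sketch of what "derivable from the value expression" could mean in practice, the following hypothetical helper assumes the lexer has already produced float for Number tokens and str for String tokens, and rejects any comparison across types (the names and the operator table are my own invention, not from any OPTIMaDe specification):

```python
import operator

# Map filter-language operators to Python comparisons.
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       ">": operator.gt, "<=": operator.le, ">=": operator.ge}

def compare(lhs, rhs, op):
    """Compare two already-lexed values; raise on a type mismatch."""
    if type(lhs) is not type(rhs):
        raise TypeError(
            f"cannot compare {type(lhs).__name__} with {type(rhs).__name__}")
    return OPS[op](lhs, rhs)
```

With such a rule, 3 > 2 and "Al" = "Al" are well-defined, while 3 = "Al" is a type error instead of silently doing something implementation-specific.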
Until we have sorted that out, would it be better to disallow all other forms than identifier <operator> value?