modlanguage / grammar

MODL ANTLR4 grammar http://www.modl.uk

License: MIT License

ANTLR 21.82% HTML 22.33% CSS 10.09% JavaScript 45.76%

grammar's Introduction

MODL Grammar

This is the definition for Minimal Object Description Language (MODL). For more information about MODL take a look at the MODL Specification.

Machine diagrams are shown below:

Tests

Tests are provided in the tests directory. Parser implementations should use those tests to check that they behave as expected.

ANTLR4 Parser Generation

antlr4 MODLLexer.g4 MODLParser.g4 -o ../../typescript-modl-interpreter/lib/gen/MODL -package MODL -Dlanguage=JavaScript
antlr4 MODLLexer.g4 MODLParser.g4 -o ../../java-interpreter/src/main/java/uk/modl/parser/antlr -package uk.modl.parser.antlr -Dlanguage=Java

grammar's Issues

Upcase method not invoked

The input:

_testing = quick-test of John's variable_methods
url_encode_example = %testing.ue

Produces:

{
    "url_encode_example": "quick-test+of+John%27s+variable_methods"
}

but should produce:

{
    "url_encode_example": "QUICK-TEST+OF+JOHN%27S+VARIABLE_METHODS"
}

@alexdalitz please confirm this and I'll update the grammar tests.

New Built-in Method Suggestions

Possibilities for the future:

%x.len - Get the string length
%x.head(n) - Extract the first n characters.
%x.tail(n) - Extract everything but the first n characters.
%x.after(s) - Extract everything after string s
%x.toptail - Remove the first and last characters, e.g. for quotes, brackets, etc.
%x.join(x,a,b,c...) - Return all params joined with x as the separator. If any param is an array then join the individual elements.
%x.split(c) - Create an array of parts of a string separated by char c.
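
As a rough illustration of the semantics suggested above, here is a minimal Python sketch. The function names mirror the proposed method names but are hypothetical, and edge-case behaviour (for example, what after returns when s is absent) is an assumption rather than anything the specification defines.

```python
# Hypothetical Python equivalents of the proposed MODL built-in methods.

def length(x: str) -> int:
    """%x.len: the string length."""
    return len(x)

def head(x: str, n: int) -> str:
    """%x.head(n): the first n characters."""
    return x[:n]

def tail(x: str, n: int) -> str:
    """%x.tail(n): everything but the first n characters."""
    return x[n:]

def after(x: str, s: str) -> str:
    """%x.after(s): everything after the first occurrence of s.

    Returning an empty string when s is absent is an assumption.
    """
    _, found, rest = x.partition(s)
    return rest if found else ""

def toptail(x: str) -> str:
    """%x.toptail: drop the first and last characters."""
    return x[1:-1]

def join(sep: str, *params) -> str:
    """%x.join(x,a,b,c...): join all params with sep, flattening arrays one level."""
    parts = []
    for p in params:
        if isinstance(p, (list, tuple)):
            parts.extend(str(e) for e in p)
        else:
            parts.append(str(p))
    return sep.join(parts)

def split(x: str, c: str) -> list:
    """%x.split(c): parts of x separated by c."""
    return x.split(c)
```

For example, toptail('"quoted"') strips the surrounding quotes, and join(",", "a", ["b", "c"]) flattens the array before joining.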

Only strings in conditional tests

At the moment we can only test a string, e.g.:

_test="http://www.tesco.com"

result={
  test="http://www.tesco.com"?
      yes
  /?
     no
}

Due to this rule in the grammar:

modl_condition
  // e.g. country=gb
  : NEWLINE* STRING? modl_operator? modl_value (PIPE modl_value )* NEWLINE*
;

This means we can't quote a test or use a number as the test. I can't really come up with any reason why someone would need to do this, but I'm interested in your thoughts @twalmsley @alexdalitz

String Unicode Fragment Issues

👋 Hey folks! I saw the NUM project on Reddit and thought it was a great idea. Diving in I found MODL, which made me even more excited, but noticed it didn't have a ton of libraries yet, so I started hacking on one just to see if I could get something working. It's still very much a work in progress, and just something I'm hacking on in my free time, so no promises on quality.

https://github.com/bign8/modl.go

Anyway, I ran into an issue with my unicode parsing logic. Based on the test added in d066849, it appears MODL is supporting non-4 digit unicode characters which doesn't seem to match with the grammar defined below or the written specification: https://www.modl.uk/specification#hex-values.

grammar/antlr4/MODLLexer.g4

Lines 73 to 78 in 3c78809

    
	fragment UNICODE
	: 'u' HEX HEX HEX HEX
	;
	fragment HEX
	: [0-9a-fA-F]
	;

But, the Java library looks to support this behavior, which is great, I just didn't notice it really documented anywhere besides the test case and in the java source.

https://github.com/MODLanguage/java-interpreter/blob/d9cc9d76f73687a03114d57fccc253c3c82fad71/src/main/java/uk/modl/utils/UnicodeEscapeReplacer.java#L104-L174

Given the complexity of the UnicodeEscapeReplacer, I'm not really sure the best way to represent those nuances in the grammar effectively. But having a note somewhere that non-4 digit code points are supported would be dope. Anyway, let me know what you think and I can get something in a PR for ya.
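
To make the mismatch concrete, here is a small Python sketch of what a strict reading of the UNICODE fragment implies: exactly four hex digits after the u. The backslash prefix and the function name are assumptions for illustration only, and this is not how UnicodeEscapeReplacer works.

```python
import re

# Mirrors the lexer fragment: 'u' followed by exactly four HEX characters.
# The leading backslash is an assumed escape prefix for this sketch.
FOUR_HEX = re.compile(r"\\u([0-9a-fA-F]{4})")

def replace_four_digit_escapes(text: str) -> str:
    """Replace each \\uXXXX escape with the character for that code point."""
    return FOUR_HEX.sub(lambda m: chr(int(m.group(1), 16)), text)

# A four-digit escape is replaced as expected.
cafe = replace_four_digit_escapes(r"caf\u00e9")

# A five-digit sequence is consumed as four digits plus a stray literal digit.
five_digits = replace_four_digit_escapes(r"\u1F600")

# A two-digit sequence does not match the fragment at all and is left alone.
two_digits = replace_four_digit_escapes(r"\u41")
```

Under this strict reading a five-digit code point is split into a four-digit escape followed by a literal digit, which is why the non-4-digit behaviour would need either a grammar change or a documented note.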

Cheers 🍻

Operator Precedence

The precedence of '=' is not specified - equality should probably be the same precedence as '!='.

Except in something like '!a=2' where the '!' is higher precedence than '=', so it means ((!a)=2) when I suspect it should mean !(a=2)

Lexer Grammar Observations

HASH_PREFIX is absorbed into STRING and is not recognised as a separate token. It isn't mentioned in the specification and isn't handled by the java interpreter.
STRING and CSTRING are greedy, which prevents detection of tokens such as INSIDE_GRAVES and, as already mentioned, HASH_PREFIX. These cases need to be handled by custom code in the Interpreter/ModlParsed classes.
HASH_PREFIX and GRAVED are still required otherwise the unit tests fail.

Is there a separate mechanism to disallow characters that are invalid for DNS TXT records? (Since MODL supports characters that DNS can't handle.)

Stray tokens files

In the antlr4 directory there are some *.tokens files that don't need to be saved as they're generated by ANTLR4. Also, they seem to interfere with code generation on Windows and result in test failures due to incorrect token constants.

I found this when trying to remove HASH_PREFIX from the MODLLexer.g4 file. It's fine when the code is generated on Mac, but the *.tokens files aren't updated properly on Windows, resulting in a stray HASH_PREFIX constant that causes the problem.

Short path for loading modules

Suggestion:

Use site modl.uk to store standard modules rather than using an organisation-specific site.

This is to encourage standardised modules rather than proprietary ones, and the MODL site then becomes a repository for the standard numbered (or named) modules. The language can then use shorter instructions for loading modules hosted at the default location, e.g. modules.modl.uk, or override the location using *base.

Use *repo=x to load module x from https://modules.modl.uk/x.modl

Use *base="http://my-site.com/my-repo/" to redefine the base URL for *repo modules.

The standard module 1 can then be loaded using:

*r=1

instead of

*l="http://modules.my-org.uk/1.modl"
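
A minimal Python sketch of how an interpreter might resolve the suggested *repo and *base instructions into a URL. The function name resolve_repo and the trailing-slash handling are assumptions; only the default location and the .modl suffix come from the suggestion above.

```python
# Default location taken from the suggestion above.
DEFAULT_BASE = "https://modules.modl.uk/"

def resolve_repo(name: str, base: str = DEFAULT_BASE) -> str:
    """Build the URL for a module loaded with *repo=name (hypothetical helper)."""
    # Tolerating a missing trailing slash on *base is an assumption.
    if not base.endswith("/"):
        base += "/"
    return f"{base}{name}.modl"

standard = resolve_repo("1")
custom = resolve_repo("x", "http://my-site.com/my-repo/")
```

With the default base, *r=1 would resolve to https://modules.modl.uk/1.modl, while a redefined *base redirects every later *repo lookup.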

Standard Error Messages for Failures

Is it worth standardising error messages? The error_tests.json file could be updated to hold the expected message for each error for all interpreters to conform to.

Deep object referencing in conditionals

This works:

_array[1;2]
_array_item=%array>0
test={
  array_item=1?
     yes
  /?
     no
}

{
    "test": "yes"
}

But this doesn't:

_array[1;2]
test={
  array>0=1?
     yes
  /?
     no
}

-> line 3:10 no viable alternative at input '{\n%array>0='

I guess this is because it's assuming > is greater than in the conditional test?

Identifier names

Would it be useful to restrict the character set for identifiers (i.e. the left side of a pair)?
The current grammar allows these as identifiers, when maybe it should be something like _?[_a-zA-Z][_a-zA-Z0-9]*.

!a=1
a!=1
!a!=1
!~a~!=1
£x=y
$x=y
@x=y
-x=y
+x=y
'x'=y
'x?=y

The playground produces the following output for the above:

[
    {
        "!a": 1
    },
    {
        "a!": 1
    },
    {
        "!a!": 1
    },
    {
        "!~a~!": 1
    },
    {
        "£x": "y"
    },
    {
        "$x": "y"
    },
    {
        "@x": "y"
    },
    {
        "-x": "y"
    },
    {
        "+x": "y"
    },
    {
        "'x'": "y"
    },
    {
        "'x?": "y"
    }
]
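
For comparison, here is a short Python check of the suggested pattern against the same keys. The helper name is hypothetical, and the pattern is copied from the proposal above rather than from the current grammar.

```python
import re

# The identifier pattern proposed in this issue.
IDENTIFIER = re.compile(r"_?[_a-zA-Z][_a-zA-Z0-9]*")

def is_valid_identifier(name: str) -> bool:
    """True if the whole key matches the proposed identifier pattern."""
    return IDENTIFIER.fullmatch(name) is not None

# Keys the current grammar accepts, as listed in the playground output.
keys = ["!a", "a!", "!a!", "!~a~!", "£x", "$x", "@x", "-x", "+x", "'x'", "'x?"]
rejected = [k for k in keys if not is_valid_identifier(k)]

# Ordinary keys used elsewhere in the tests would still be accepted.
accepted = [k for k in ["a", "_testing", "url_encode_example"] if is_valid_identifier(k)]
```

Every key in the list above would be rejected by the proposed pattern, while plain and underscore-prefixed names still pass.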

Nested Arrays

Given:

x=[1;2:3:4]

The grammar says this is an array with 2 entries: the first is the number 1 and the second is the NB array 2:3:4. However, the interpreter returns it as the 4-element array [1;2;3;4]. The ANTLR4 rule test parses it correctly.

I would expect to refer to the number 3 as %x>1>1, but I have to use %x>2 instead.

It might be worth disallowing modl_nb_array as a member of modl_array to force 'proper' nesting of arrays. (I tried it and all of the current unit tests still pass.)
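
The difference between the two readings can be sketched with plain Python lists. The resolve helper is hypothetical and stands in for a %x>i>j reference with zero-based indexes.

```python
def resolve(value, path):
    """Follow a list of zero-based indexes, like %x>1>1 in a MODL reference."""
    for index in path:
        value = value[index]
    return value

# What the grammar describes: two entries, the second being the NB array 2:3:4.
nested = [1, [2, 3, 4]]

# What the interpreter currently returns.
flattened = [1, 2, 3, 4]

via_grammar = resolve(nested, [1, 1])      # %x>1>1
via_interpreter = resolve(flattened, [2])  # %x>2
```

Both paths reach the number 3, but only the nested form preserves the structure that the grammar describes.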

File update needed for new grammar

At some point the file here:

http://s3-eu-west-1.amazonaws.com/modltestfiles/testing.txt

will need the quotes removed, otherwise the %var reference won't be processed under the new interpreter behaviour. It's causing one base_test.json test to fail at the moment.

Not urgent, this is just so it doesn't get forgotten.

Grammar problem - escaped grave

Given:

a= %`x`.p;
b= %~`x`.p;
c= %`x`.p;

The ANTLR4 parser reports an error on line 3 due to the escaped grave on line 2 - removing the escape allows it to be parsed successfully.

@elliottinvent would you expect this to work or can we just let it fail as an invalid reference?

Mixed arrays and nb_arrays

This is a valid array:

	x=[1;;;;;;;2:::::::::3;;;;;;;;;;4:::::::::5]

Double negatives

Should '!a!=2' be interpreted as 'a=2'?
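
The two possible readings can be compared in a short Python sketch. Python's truthiness rules are only a stand-in for MODL's here, so the tight-binding result is illustrative rather than a statement of what any interpreter does.

```python
def loose(a, b):
    """Reading !(a != b): the negation applies to the whole comparison."""
    return not (a != b)

def tight(a, b):
    """Reading (!a) != b: the negation binds to the operand first."""
    return (not a) != b

# Under the loose reading, '!a!=2' is equivalent to 'a=2' for every a.
loose_matches_equality = all(loose(a, 2) == (a == 2) for a in range(5))

# Under the tight reading, the result no longer depends on a in a useful way.
tight_results = [tight(a, 2) for a in range(5)]
```

If the double negative is meant to collapse to equality, the loose reading is the one that delivers it.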

Graves and References Handling

For discussion:

_replace_me=new value;

one=%replace_me.u;

two=`replace_me`.u;

three=`replace_me.u`;

four=`%replace_me.u`;

five=`%replace_me`.u;

Should produce:

[
    {
        "one": "NEW VALUE"
    },
    {
        "two": "REPLACE_ME"
    },
    {
        "three": "replace_me.u"
    },
    {
        "four": "NEW VALUE"
    },
    {
        "five": "new value.u"
    }
]

Because:

one has a reference that needs to be replaced and converted to upper case.
two is not a reference so shouldn't be replaced, but should be converted to upper case because we are supposed to run methods on graved strings.
three is not a reference and has no methods outside the graves so should be de-graved but otherwise unchanged.
four is a reference and the .u is attached to it so it should be replaced with new value then converted to upper case.
five has %replace_me substituted with new value, producing new value.u, which is neither a reference nor a graved string, so no further processing takes place.

See the specification Section 12, item 10, for how to handle graved strings.

Newlines before semicolons

A structure cannot have a semicolon on the next line, e.g.

		a=1:2:3;
		b=4:5:6

is valid, but

		a=1:2:3
		;
		b=4:5:6

is not.

Optional modl_operator

The modl_operator in modl_condition is optional so can be excluded while still parsing correctly, but the behaviour is incorrect. E.g.

		country=uk
		british={country.u=GB|UK?true/?false}

Can be parsed successfully as:

		country=uk
		british={country.u GB|UK?true/?false}

(by removing the '=' in the condition)
But the results are incorrect because british is set to 'false' in the second case.

Interpreter escape for %

We can use percents like this:

key=%letters

{
  "key": "%letters"
}

If the reference exists it'll replace the reference:

_letters=abc;
key=%letters

{
  "key": "abc"
}

However, it's impossible to print the characters "%letters" if _letters exists as a reference.

So we need an escape here, and I think it needs to be an interpreter escape:

_letters=abc;
key="~%letters"

Or:

_letters=abc;
key="\%letters"

{
    "key": "%letters"
}
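
A minimal Python sketch of the proposed interpreter escape, assuming the ~ form and a simple identifier pattern for reference names. The function name and the regular expression are hypothetical and ignore methods, graves and nested references.

```python
import re

# An optional '~' escape, then '%' and a simple reference name (assumed pattern).
REFERENCE = re.compile(r"(~?)%([A-Za-z_][A-Za-z0-9_]*)")

def substitute(text: str, variables: dict) -> str:
    """Replace %name with its value unless it is escaped as ~%name."""
    def replace(match):
        escaped, name = match.groups()
        if escaped:
            # Escaped: drop the '~' and keep the literal percent text.
            return "%" + name
        if name in variables:
            return str(variables[name])
        # Unknown references are left untouched, as in the first example.
        return match.group(0)
    return REFERENCE.sub(replace, text)

no_reference = substitute("%letters", {})
replaced = substitute("%letters", {"letters": "abc"})
escaped = substitute("~%letters", {"letters": "abc"})
```

The same structure would work for the backslash form by changing the escape character in the pattern.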

Grammar Problem - double quotes

This fails to parse:

a="x".p;

but it's fine if the .p is removed, or if we use graves instead of double quotes, or if both double quotes are escaped.

Is this a problem that needs to be fixed or can we live with it?

More Compact Conditionals

country=uk
british={country.u=GB|UK?true/?false}

Could possibly be reduced to:

country=uk
british={country.u=GB|UK?}

With a suitable grammar change.

Array problem

Pretty sure these examples used to work but don't anymore.

This is standard stuff:

[
one;
two;
three
]

This is an example with a trailing semicolon, which can be useful for config files:

[
one;
two;
three;
]

Both error:
-> line 2:4 extraneous input '\n' expecting {NULL, TRUE, FALSE, SC, LBRAC, LSBRAC, NUMBER, STRING, QUOTED, '{'}

White Space and newlines

White space and newlines are not allowed in pairs, so formatting like this is rejected (but it would be easy to add if it would be useful):

		a = 
			b =
				c

Non-UTF-8 Input Characters in tests

The following base test uses non-UTF-8 characters for the apostrophes so it fails to parse in the Ruby interpreter. The spec says the input files should be valid UTF-8 so this test uses invalid MODL (it works in the playground however!):

_colour = green
_test = { colour=green? true /? false }

{
 !test?
   result=it’s not green
 /?
   result=it’s green
}

The apostrophe used in this test is \u2019 when it should be \u0027
@elliottinvent do we change the spec and the grammar or stick to the UTF-8 restriction? Personally I prefer the latter.
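
To locate characters like this in test inputs, a small Python helper can list every character outside the ASCII range with its code point. The function name is hypothetical and this is only a diagnostic sketch, not part of any interpreter.

```python
def non_ascii_characters(text: str) -> list:
    """Return (index, code point) pairs for characters outside the ASCII range."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]

# The apostrophe used in the base test.
curly = non_ascii_characters("it\u2019s green")

# The plain apostrophe suggested as the replacement.
plain = non_ascii_characters("it's green")
```

Running this over the test files would show which inputs contain U+2019 rather than U+0027.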

modl_array_conditional_return

modl_array_conditional_return doesn't require square brackets around the array items, it just requires them to be separated by semicolons. (Presumably correct because the entire modl_array_conditional is already embedded inside "[]"?)

modl_value_conditional_return allows colons

modl_value_conditional_return allows colons, so this is valid:

		b=2;c=2;d=2;e=2;a={{b=c & d=e}?v1:v2:v3/?v4:v5:v6}

I would expect this to return the modl_nb_arrays 'v1:v2:v3' and 'v4:v5:v6', is this correct? (it sets a=2 due to the general issue with expression evaluation)

a=3
b=4
c={

a=b

?

0

/

a>b

?

1

/

?

-1

}
