ianh / owl
A parser generator for visibly pushdown languages.
License: MIT License
I was thinking about adding support for hexadecimal arguments in my grammar when I noticed that owl already parses things like "0xff". But why is every hex input parsed as a number and not as an integer? In my application this interferes with the type checking, because a function might expect an integer and would refuse to accept "0xff".
For example, the grammar
input = argument{','}
argument =
integer : int
number : num
with the input
1, 1.2, 0xff, 0xff.f
yields:
1 , 1.2 , 0xff , 0xff.f
argument:int argument:num argument:num argument:num
Would it be possible to distinguish hex integers from hex floating-point numbers without writing a new tokenize function?
owl is fantastic, thanks for sharing it! I've started experimenting with it for creating a little language that defines a system of differential equations.
The predefined number token is convenient, but I would like more flexible parsing of numbers. In some expressions, I want integers only. For example, when parsing the expression in an index operation such as x[k - 1], the numerical values inside the brackets must be integers; I want to treat x[k - 1.5] as invalid.
For other expressions, I would like to parse complex numbers, where the imaginary part of the number is indicated with a j, e.g. z = -1.0 + 2.5j (I'm using the Python convention for complex numbers).
Is there already an obvious way to handle these cases, or will I have to modify the tokenizer so it doesn't automatically consume numerical values as doubles? After only a brief look at the code, I suspect modifying the tokenizer would involve adding two more numerical token types, say integer and imaginary. If the input stream contains a number terminated by a j (with no whitespace), create an imaginary token; if the number is only digits (and not terminated by j), create an integer token.
Hi there,
Awesome project. I love the strict approach taken to validation, and the owl utility is really nifty; it makes writing a definition for an existing language a breeze.
However, I'm running into an issue that I think can be solved rather easily in the library: the current token classes are non-comprehensive, in that there are symbols that don't fall into the category number | string | identifier.
For context, I'm actually trying to use owl to model a shell scripting language, where there are two issues that don't play well with owl's current architecture: tokens not matching a reserved pattern should be treated as strings ("literal data") and line breaks are special characters that shouldn't be treated as generic whitespace (I'll file another issue for that separately).
This first issue should be solvable by simply introducing a final token class, other or unmatched, which accepts literally anything so long as no keyword is expected at that point, and can be used to accept unquoted content where it does not introduce ambiguities into the parser.
In construct_node_alloc() there is a possibly empty allocation, because number_of_slots can be 0:
Lines 1675 to 1676 in b7f1cf4
In this case the behavior of calloc() is implementation-defined:
If nmemb or size is 0, then calloc() returns either NULL, or a unique pointer value that can later be successfully passed to free().
https://linux.die.net/man/3/calloc
I was trying to compile owl for an embedded system, and abort() was called when creating the tree. So I fixed the parser with the following sed command, taking the number of slots into account in the if statement:
sed -i '' 's/(!node->slots)/(number_of_slots \&\& !node->slots)/' src/parser.h
I guess a similar fix would help to ensure a more consistent behavior across different compilers/architectures.
Hello @ianh,
I'm wondering, do you have any insights into incremental parsing, where VPLs could offer some kind of advantage over parsing context-free languages using, e.g., LR/LL parsing?
Or, specifically, how would you go about making owl work in an incremental fashion? That is, if owl has parsed some source once, and only small parts of it have changed, subsequent parses should not have to reparse everything from scratch.
I'd appreciate any thoughts you might have on this.
Consider the following grammar:
sl = [ "'" "a"* "'" ]
which is meant to model a single-quoted, single-line string ('', 'aaa', ...).
My mental model of how it would work with VPLs is the following:
sl = [ "'" "a"* "'" ]
^ ^ | |
|----- | | |
| | | |
'-> state 0 | '-> state 0 again
| |
| |
| '-> we see a "'", we pop to state 0
|
'-> state 1 (any "'" will pop to state 0)
- we start in state 0.
- we switch to state 1 on an "'".
- we do our regular parsing in state 1.
- we see a "'" in state 1, so we pop state 1 and we go back to state 0.
The goal here is to support recursive definitions like the following and, as expected, the following works:
sl = [ "<" ("a" | [ "{" sl "}" ] )* ">" ]
However, if the "bracket symbols" are the same, it doesn't work anymore:
sl = [ "'" ("a" | [ "{" sl "}" ] )* "'" ]
error: token ''' can't be used as both a start and an end
keyword
sl = [ "'" ("a" | [ "{" sl "}" ] )* "'" ]
~~~ ~~~
I'm wondering, is my mental model of VPLs wrong, and having identical "start" and "end" symbols is simply not possible, or is this a limitation of owl and not a limitation of VPLs?
Hello @ianh,
Here you wrote:
So the last thing to do is to incrementally update the tree itself based on the change to the sequence of construction actions. I'm not sure how to do this, but I'm sure someone has written about it somewhere.
This reminds me a lot of GLR parsing. One thing that has made GLR parsing much simpler for me is not to construct an explicit tree, but to maintain a postordered list of tree events (see "Post-Order Parser Output" for a brief explanation).
A postordered list of events is inherent to LR parsing; a preordered list of events is inherent to LL parsing. I was wondering, how feasible would it be to generate such a list of events with the approach that owl takes? That is, would it be possible to compress your construction actions into just two categories of events: begin_subtree_of_type_x and end_subtree_of_type_x?
If so, do you think that your approach would support a preordered or a postordered list of events?
With such a list of events, updating subtrees could be made very simple with a Piece table, which I'm currently investigating in a different project.
I can't find any resources on how to define a custom tokenizer other than the hex-color example. That example seems to be considerably simpler than the custom tokenizers owl can currently support (for example, it doesn't use the tokenization info).
Additionally, it would be good to know more about how the default tokenizer works, particularly how whitespace is handled. I would have no idea how to write a whitespace-handling parser from looking at the current examples and readme.
I would like to point out that identifiers like "_AUTOMATON_H_" and "_TERMINAL_H_" do not fit the naming conventions of the C language standard (identifiers beginning with an underscore followed by an uppercase letter are reserved for the implementation).
Would you like to adjust your selection of unique names?
Hello,
I am currently working on OpenCRS, and I used your app for testing the Vulnerability Detection module.
While doing so, I discovered an out-of-bounds read that could prove to be a vulnerability:
valgrind ./owl -T -g -
I already forked the repository and proposed a patch in Pull Request #32.
Owl is awesome, thank you!
My proposals:
- range(cp1, cp2) or range[cp1, cp2]: cp1 and cp2 are code points (hex or decimal)
- block(name): a Unicode block name (Basic_Latin, Latin-1_Supplement, etc.)
- property(name): a Unicode property name (White_Space, Hyphen, Ps, Mn, etc.)
- script(name): a Unicode script name (Common, Latin, etc.)
What do you think?
The grammar:
.token foo
.token bar
program = 'f'
Produces a parser.h that fails to compile, because it defines one write_custom_token function for each custom token:
static void write_custom_token(size_t offset, size_t length, uint32_t token, uint64_t data, void *info) {
struct owl_tree *tree = info;
size_t token_offset = tree->next_offset;
write_tree(tree, token_offset - tree->next_foo_token_offset);
write_tree(tree, offset);
write_tree(tree, length);
write_tree(tree, data);
tree->next_foo_token_offset = token_offset;
}
static void write_custom_token(size_t offset, size_t length, uint32_t token, uint64_t data, void *info) { // <== Redefinition of write_custom_token
struct owl_tree *tree = info;
size_t token_offset = tree->next_offset;
write_tree(tree, token_offset - tree->next_bar_token_offset);
write_tree(tree, offset);
write_tree(tree, length);
write_tree(tree, data);
tree->next_bar_token_offset = token_offset;
}
The grammar
a = identifier@return
won't compile because return is a C keyword.
We need to reserve all the C keywords, along with range and type (because these conflict with built-in fields).
Consider the grammar of smileys with an arbitrary long space as a nose:
face = eyes nose mouth
eyes =
':' : normal
';' : winking
nose =
' '* : space
mouth =
')' : smile
The playground says:
error: token ' ' can't be used as both a whitespace and a normal keyword
' '* : space
~~~
It's not quite clear to me why a nose can't be a space. Are space characters like ' ' treated as whitespace by default? Is there a way to disable the default whitespace rules?
According to #43 (comment), call and return symbols need to be disjoint.
Therefore, #38 is not possible because string literals viewed as a VPL would have an identical call and return symbol, and would not be a VPL.
However, I'm wondering, could it be that this requirement by Rajeev Alur & P. Madhusudan for VPLs is too restrictive?
I don't have an intuitive understanding of the inner workings of owl, and this might be obvious, but @ianh can you think of a counterexample where having identical call and return symbols would break any of the invariants that owl depends on?
What would happen (where would owl break) if owl ignored this requirement, and a grammar like [ '[' ... '[' ] were accepted and compiled by owl?
Owl's generated parsers declare function names with external linkage. This is a problem if you want to include two different parsers in the same program.
emit(<"a" )
Calling
parsed_string_get(parsed_expr_get(parsed_arg_get(arg).expr).string).string
results in
a" )
though a was expected.
#using owl.v4
program = stmt*
stmt =
   identifier ['(' arg* ')'] : func
   identifier '=' expr : assign
   'stream' identifier ':' identifier : creates
   'tick' identifier : ticks
   'move' move : moves
arg = '<' expr
move =
   identifier '<' expr : movel
   expr '>' identifier : mover
expr =
   [ '(' expr ')' ] : parens
   number : literal
   identifier : ident
   string : str
.operators prefix
   '-' : negative
.operators infix left
   '*' : times
   '/' : divided-by
.operators infix left
   '+' : plus
   '-' : minus
Hello @ianh,
This is essentially the same issue as this one.
I've used that feature to collect a bunch of grammars here. I think it would be useful if the owl playground could support that too.
Lots of programming languages support integers, and it'd be nice to support them in Owl too. If both types of tokens were used in the same grammar, both would be matched, with the longer match winning; if integer weren't used in the grammar, number would behave the same as it does now.
If I try to implement another "built-in token class" like hexinteger (I will follow the implementation of the other built-ins), would the PR be accepted?
Unfortunately, for my use case I can't match integers like 0xABC, because the built-in integer matcher doesn't handle the 0x prefix.
I'm trying to implement a macro keyword. Is it possible to do something like this?
'def' identifier remaining_line: define
If not, where would be a good place for me to start looking to add such a feature?
Also this is a really cool project! I've tested out quite a few different parser generators and this one just blows everything out of the water so far for what I'm trying to use it for.
(@ianh I hope you don't mind me reporting any issues I find)
This one is really not urgent, but the following grammar:
foo = a
Leads to the following error message:
error: 'ᅡᅠ' isn't a valid token
foo = aᅡᅠ
~~
The space after the a is a non-breaking space (command+space), and it is being reported as a right turnstile.
This is a really nice project. I am considering porting a shell scripting language to a formal owl definition and an owl-based parser, as I really like the design of this project.
However, one blocker for us is that in scripting languages with an optional terminating semicolon, a line break may be used instead of an explicit ;. The hard-coded interpretation of \n as a mere token separator is already a blocking issue here, but even if it weren't, I feel as though handling this would require a separate class of tokens sitting somewhere between whitespace and literal tokens, if I'm not mistaken?
In particular, I am looking to parse syntax like
for f in foo; bar; baz; end
as being equal to
for f in foo
bar # or bar;
baz # or baz;
end
This example could work by replacing all line breaks with semicolons prior to parsing (feasibility/overhead aside), but that doesn't always work; for example:
echo hello |
cat
as some symbols are allowed to wrap onto the next line without hard-escaping the newline with a backslash, but this wouldn't be accepted if it were input as echo hello |; cat, as that's two separate statements.
Any ideas?
Would you like to add more error handling for return values from functions like the following?
offset gets shadowed here:
Line 618 in 0622ab1
This results in a warning in most compilers. I don't know the project well, but at first glance offset%%rule-index might work.
Consider the following grammar:
start = kabstract | ident
kabstract = "abstract"
ident = atoz_lower+
atoz_lower = "a"|"b"|"c"|"d"|"e"|"f"|"g"|"h"|"i"|"j"|"k"|"l"|"m"|"n"|"o"|"p"|"q"|"r"|"s"|"t"|"u"|"v"|"w"|"x"|"y"|"z"
If we use the playground to parse abstract
, we get the following result:
. abstract
kabstract
This is a little confusing to me: abstract could also be parsed as
"a" "b" "s" "t" "r" "a" "c" "t"
via ident. Why does owl not report any ambiguities here?
I find myself writing some variation of RuleName ? ("," RuleName)* repeatedly, even in simple grammars. This is not a tremendous hardship, but it would be more elegant to have a "separated by" operator, similar to those found in many parser combinator libraries. For example:
FnDefinition = "function" identifier "(" identifier ? ("," identifier)* ")" Block
would instead be:
FnDefinition = "function" identifier "(" identifier % "," ")" Block
Again, I recognize that this is merely syntactic sugar, but so are the + and ? operators, and they reduce repetition and increase readability.
I have a rough specification of the lexical structure of Dart, which I think is a VPL. It looks like it just needs a little massaging to make a great real-world example grammar for owl.
One of my parsing systems attempts to simulate VPAs by having a set of regular automata, and a stack to track which regular automaton is active. Push and pop actions associated with certain literals switch between regular automata. The specification is designed around that idea.
final key_blockcomment = LexiKey();
final key_mldq = LexiKey();
final key_mlsq = LexiKey();
final key_sldq = LexiKey();
final key_slsq = LexiKey();
final key_rmldq = LexiKey();
final key_rmlsq = LexiKey();
final key_rsldq = LexiKey();
final key_rslsq = LexiKey();
final key_base = LexiKey();
return LexiLexerInfoImpl(
root: inner(
name: "top",
atoms: [],
children: () {
final lang_base = leaf(
name: "base",
atoms: [
apgm(v: "'", n: "startslsq", k: key_slsq),
apgm(v: '"', n: "startsldq", k: key_sldq),
argm(r: r"'''((\\? )*\\?(\n|\r|\r\n))?", n: "startmlsq", k: key_mlsq),
argm(r: r'"""((\\? )*\\?(\n|\r|\r\n))?', n: "startmldq", k: key_mldq),
apgm(v: "r'", n: "startrslsq", k: key_rslsq),
apgm(v: 'r"', n: "startrsldq", k: key_rsldq),
argm(r: r"r'''((\\? )*\\?(\n|\r|\r\n))?", n: "startrmlsq", k: key_rmlsq),
argm(r: r'r"""((\\? )*\\?(\n|\r|\r\n))?', n: "startrmldq", k: key_rmldq),
arnm(r: r'[0-9]+([eE][+\-]?[0-9]+)?', n: "integer"),
arnm(r: r'[0-9]*\.[0-9]+([eE][+\-]?[0-9]+)?', n: "decimal"),
arnm(r: r'0[xX][0-9ABCDEFabcdef]+', n: "hex"),
apgm(v: r'{', k: key_base, n: "lcur"),
aplm(v: r'}', n: "rcur"),
],
handle: key_base,
);
return [
() {
final base_identifier = arnm(r: r'[_$a-zA-Z][_$0-9a-zA-Z]*', n: "identifier");
final base_kabstract = apnm(v: r'abstract', n: "kabstract");
final base_kas = apnm(v: r'as', n: "kas");
final base_kassert = apnm(v: r'assert', n: "kassert");
final base_kasync = apnm(v: r'async', n: "kasync");
final base_kawait = apnm(v: r'await', n: "kawait");
final base_ksealed = apnm(v: r'sealed', n: "ksealed");
final base_kbase = apnm(v: r'base', n: "kbase");
final base_kwhen = apnm(v: r'when', n: "kwhen");
final base_kbreak = apnm(v: r'break', n: "kbreak");
final base_kcase = apnm(v: r'case', n: "kcase");
final base_kcatch = apnm(v: r'catch', n: "kcatch");
final base_kclass = apnm(v: r'class', n: "kclass");
final base_kconst = apnm(v: r'const', n: "kconst");
final base_kcontinue = apnm(v: r'continue', n: "kcontinue");
final base_kcovariant = apnm(v: r'covariant', n: "kcovariant");
final base_kdefault = apnm(v: r'default', n: "kdefault");
final base_kdeferred = apnm(v: r'deferred', n: "kdeferred");
final base_kdo = apnm(v: r'do', n: "kdo");
final base_kdynamic = apnm(v: r'dynamic', n: "kdynamic");
final base_kelse = apnm(v: r'else', n: "kelse");
final base_kenum = apnm(v: r'enum', n: "kenum");
final base_kexport = apnm(v: r'export', n: "kexport");
final base_kextends = apnm(v: r'extends', n: "kextends");
final base_kextension = apnm(v: r'extension', n: "kextension");
final base_kexternal = apnm(v: r'external', n: "kexternal");
final base_kfactory = apnm(v: r'factory', n: "kfactory");
final base_kfalse = apnm(v: r'false', n: "kfalse");
final base_kfinal = apnm(v: r'final', n: "kfinal");
final base_kfinally = apnm(v: r'finally', n: "kfinally");
final base_kfor = apnm(v: r'for', n: "kfor");
final base_kfunction = apnm(v: r'Function', n: "kfunction");
final base_kget = apnm(v: r'get', n: "kget");
final base_khide = apnm(v: r'hide', n: "khide");
final base_kif = apnm(v: r'if', n: "kif");
final base_kimplements = apnm(v: r'implements', n: "kimplements");
final base_kimport = apnm(v: r'import', n: "kimport");
final base_kin = apnm(v: r'in', n: "kin");
final base_kinterface = apnm(v: r'interface', n: "kinterface");
final base_kis = apnm(v: r'is', n: "kis");
final base_klate = apnm(v: r'late', n: "klate");
final base_klibrary = apnm(v: r'library', n: "klibrary");
final base_kmixin = apnm(v: r'mixin', n: "kmixin");
final base_knew = apnm(v: r'new', n: "knew");
final base_knull = apnm(v: r'null', n: "knull");
final base_kof = apnm(v: r'of', n: "kof");
final base_kon = apnm(v: r'on', n: "kon");
final base_koperator = apnm(v: r'operator', n: "koperator");
final base_kpart = apnm(v: r'part', n: "kpart");
final base_krequired = apnm(v: r'required', n: "krequired");
final base_krethrow = apnm(v: r'rethrow', n: "krethrow");
final base_kreturn = apnm(v: r'return', n: "kreturn");
final base_kset = apnm(v: r'set', n: "kset");
final base_kshow = apnm(v: r'show', n: "kshow");
final base_kstatic = apnm(v: r'static', n: "kstatic");
final base_ksuper = apnm(v: r'super', n: "ksuper");
final base_kswitch = apnm(v: r'switch', n: "kswitch");
final base_ksync = apnm(v: r'sync', n: "ksync");
final base_kthis = apnm(v: r'this', n: "kthis");
final base_kthrow = apnm(v: r'throw', n: "kthrow");
final base_ktrue = apnm(v: r'true', n: "ktrue");
final base_ktry = apnm(v: r'try', n: "ktry");
final base_ktypedef = apnm(v: r'typedef', n: "ktypedef");
final base_kvar = apnm(v: r'var', n: "kvar");
final base_kvoid = apnm(v: r'void', n: "kvoid");
final base_kwhile = apnm(v: r'while', n: "kwhile");
final base_kwith = apnm(v: r'with', n: "kwith");
final base_kyield = apnm(v: r'yield', n: "kyield");
return inner(
name: "kw",
atoms: [
base_identifier,
// region keywords
base_kabstract,
base_kas,
base_kassert,
base_kasync,
base_kawait,
base_ksealed,
base_kbase,
base_kwhen,
base_kbreak,
base_kcase,
base_kcatch,
base_kclass,
base_kconst,
base_kcontinue,
base_kcovariant,
base_kdefault,
base_kdeferred,
base_kdo,
base_kdynamic,
base_kelse,
base_kenum,
base_kexport,
base_kextends,
base_kextension,
base_kexternal,
base_kfactory,
base_kfalse,
base_kfinal,
base_kfinally,
base_kfor,
base_kfunction,
base_kget,
base_khide,
base_kif,
base_kimplements,
base_kimport,
base_kin,
base_kinterface,
base_kis,
base_klate,
base_klibrary,
base_kmixin,
base_knew,
base_knull,
base_kof,
base_kon,
base_koperator,
base_kpart,
base_krequired,
base_krethrow,
base_kreturn,
base_kset,
base_kshow,
base_kstatic,
base_ksuper,
base_kswitch,
base_ksync,
base_kthis,
base_kthrow,
base_ktrue,
base_ktry,
base_ktypedef,
base_kvar,
base_kvoid,
base_kwhile,
base_kwith,
base_kyield,
// endregion
],
children: [
lang_base,
],
r: LexiResolutionHandlerExplicitImpl(
resolvers: [
LexiResolverImpl(match: [base_identifier, base_kabstract], prefer: base_kabstract),
LexiResolverImpl(match: [base_identifier, base_kas], prefer: base_kas),
LexiResolverImpl(match: [base_identifier, base_kwhen], prefer: base_kwhen),
LexiResolverImpl(match: [base_identifier, base_ksealed], prefer: base_ksealed),
LexiResolverImpl(match: [base_identifier, base_kbase], prefer: base_kbase),
LexiResolverImpl(match: [base_identifier, base_kassert], prefer: base_kassert),
LexiResolverImpl(match: [base_identifier, base_kasync], prefer: base_kasync),
LexiResolverImpl(match: [base_identifier, base_kawait], prefer: base_kawait),
LexiResolverImpl(match: [base_identifier, base_kbreak], prefer: base_kbreak),
LexiResolverImpl(match: [base_identifier, base_kcase], prefer: base_kcase),
LexiResolverImpl(match: [base_identifier, base_kcatch], prefer: base_kcatch),
LexiResolverImpl(match: [base_identifier, base_kclass], prefer: base_kclass),
LexiResolverImpl(match: [base_identifier, base_kconst], prefer: base_kconst),
LexiResolverImpl(match: [base_identifier, base_kcontinue], prefer: base_kcontinue),
LexiResolverImpl(match: [base_identifier, base_kcovariant], prefer: base_kcovariant),
LexiResolverImpl(match: [base_identifier, base_kdefault], prefer: base_kdefault),
LexiResolverImpl(match: [base_identifier, base_kdeferred], prefer: base_kdeferred),
LexiResolverImpl(match: [base_identifier, base_kdo], prefer: base_kdo),
LexiResolverImpl(match: [base_identifier, base_kdynamic], prefer: base_kdynamic),
LexiResolverImpl(match: [base_identifier, base_kelse], prefer: base_kelse),
LexiResolverImpl(match: [base_identifier, base_kenum], prefer: base_kenum),
LexiResolverImpl(match: [base_identifier, base_kexport], prefer: base_kexport),
LexiResolverImpl(match: [base_identifier, base_kextends], prefer: base_kextends),
LexiResolverImpl(match: [base_identifier, base_kextension], prefer: base_kextension),
LexiResolverImpl(match: [base_identifier, base_kexternal], prefer: base_kexternal),
LexiResolverImpl(match: [base_identifier, base_kfactory], prefer: base_kfactory),
LexiResolverImpl(match: [base_identifier, base_kfalse], prefer: base_kfalse),
LexiResolverImpl(match: [base_identifier, base_kfinal], prefer: base_kfinal),
LexiResolverImpl(match: [base_identifier, base_kfinally], prefer: base_kfinally),
LexiResolverImpl(match: [base_identifier, base_kfor], prefer: base_kfor),
LexiResolverImpl(match: [base_identifier, base_kfunction], prefer: base_kfunction),
LexiResolverImpl(match: [base_identifier, base_kget], prefer: base_kget),
LexiResolverImpl(match: [base_identifier, base_khide], prefer: base_khide),
LexiResolverImpl(match: [base_identifier, base_kif], prefer: base_kif),
LexiResolverImpl(match: [base_identifier, base_kimplements], prefer: base_kimplements),
LexiResolverImpl(match: [base_identifier, base_kimport], prefer: base_kimport),
LexiResolverImpl(match: [base_identifier, base_kin], prefer: base_kin),
LexiResolverImpl(match: [base_identifier, base_kinterface], prefer: base_kinterface),
LexiResolverImpl(match: [base_identifier, base_kis], prefer: base_kis),
LexiResolverImpl(match: [base_identifier, base_klate], prefer: base_klate),
LexiResolverImpl(match: [base_identifier, base_klibrary], prefer: base_klibrary),
LexiResolverImpl(match: [base_identifier, base_kmixin], prefer: base_kmixin),
LexiResolverImpl(match: [base_identifier, base_knew], prefer: base_knew),
LexiResolverImpl(match: [base_identifier, base_knull], prefer: base_knull),
LexiResolverImpl(match: [base_identifier, base_kof], prefer: base_kof),
LexiResolverImpl(match: [base_identifier, base_kon], prefer: base_kon),
LexiResolverImpl(match: [base_identifier, base_koperator], prefer: base_koperator),
LexiResolverImpl(match: [base_identifier, base_kpart], prefer: base_kpart),
LexiResolverImpl(match: [base_identifier, base_krequired], prefer: base_krequired),
LexiResolverImpl(match: [base_identifier, base_krethrow], prefer: base_krethrow),
LexiResolverImpl(match: [base_identifier, base_kreturn], prefer: base_kreturn),
LexiResolverImpl(match: [base_identifier, base_kset], prefer: base_kset),
LexiResolverImpl(match: [base_identifier, base_kshow], prefer: base_kshow),
LexiResolverImpl(match: [base_identifier, base_kstatic], prefer: base_kstatic),
LexiResolverImpl(match: [base_identifier, base_ksuper], prefer: base_ksuper),
LexiResolverImpl(match: [base_identifier, base_kswitch], prefer: base_kswitch),
LexiResolverImpl(match: [base_identifier, base_ksync], prefer: base_ksync),
LexiResolverImpl(match: [base_identifier, base_kthis], prefer: base_kthis),
LexiResolverImpl(match: [base_identifier, base_kthrow], prefer: base_kthrow),
LexiResolverImpl(match: [base_identifier, base_ktrue], prefer: base_ktrue),
LexiResolverImpl(match: [base_identifier, base_ktry], prefer: base_ktry),
LexiResolverImpl(match: [base_identifier, base_ktypedef], prefer: base_ktypedef),
LexiResolverImpl(match: [base_identifier, base_kvar], prefer: base_kvar),
LexiResolverImpl(match: [base_identifier, base_kvoid], prefer: base_kvoid),
LexiResolverImpl(match: [base_identifier, base_kwhile], prefer: base_kwhile),
LexiResolverImpl(match: [base_identifier, base_kwith], prefer: base_kwith),
LexiResolverImpl(match: [base_identifier, base_kyield], prefer: base_kyield),
],
),
);
}(),
lang_base,
inner(
name: "sy",
atoms: [
apnm(v: r'[', n: "slbra"),
apnm(v: r']', n: "srbra"),
apnm(v: r'(', n: "slpar"),
apnm(v: r')', n: "srpar"),
apnm(v: r',', n: "scomma"),
apnm(v: r':', n: "scolon"),
apnm(v: r';', n: "ssemicolon"),
apnm(v: r'&', n: "sand"),
apnm(v: r'&=', n: "sandeq"),
apnm(v: r'^', n: "scaret"),
apnm(v: r'^=', n: "scareteq"),
apnm(v: r'>', n: "sg"),
apnm(v: r'<', n: "sl"),
apnm(v: r'%', n: "spercent"),
apnm(v: r'%=', n: "spercenteq"),
apnm(v: r'+', n: "splus"),
apnm(v: r'+=', n: "spluseq"),
apnm(v: r'|', n: "spipe"),
apnm(v: r'|=', n: "spipeeq"),
apnm(v: r'??', n: "sqq"),
apnm(v: r'??=', n: "sqqeq"),
apnm(v: r'~/', n: "stildeslash"),
apnm(v: r'~/=', n: "stildeslasheq"),
apnm(v: r'*', n: "sstar"),
apnm(v: r'*=', n: "sstareq"),
apnm(v: r'/', n: "sslash"),
apnm(v: r'/=', n: "sslasheq"),
apnm(v: r'-', n: "sminus"),
apnm(v: r'-=', n: "sminuseq"),
apnm(v: r'=', n: "seq"),
apnm(v: r'==', n: "seqeq"),
apnm(v: r'&&', n: "sandand"),
apnm(v: r'@', n: "sat"),
apnm(v: r'.', n: "sdot"),
apnm(v: r'..', n: "sdotdot"),
apnm(v: r'...', n: "sdotdotdot"),
apnm(v: r'...?', n: "sdotdotdotq"),
apnm(v: r'--', n: "sminusminus"),
apnm(v: r'!', n: "sbang"),
apnm(v: r'!=', n: "sbangeq"),
apnm(v: r'++', n: "splusplus"),
apnm(v: r'#', n: "shash"),
apnm(v: r'||', n: "spipepipe"),
apnm(v: r'?', n: "squestion"),
apnm(v: r'?.', n: "squestiondot"),
apnm(v: r'?..', n: "squestiondotdot"),
apnm(v: r'~', n: "stilde"),
],
children: [
lang_base,
],
),
leaf(
name: "blockcomment",
handle: key_blockcomment,
atoms: [
argm(r: r"(\*+[^/*]+|/+[^/*]+|[^/*]+)*/*/\*", k: key_blockcomment, n: "bcstart"),
arlm(r: r"(\*+[^/*]+|/+[^/*]+|[^/*]+)*\**\*/", n: "bcend"),
],
),
inner(
name: "t",
atoms: [
arnm(r: r'( |\t|\n|\r)+', n: "ws"),
arnm(r: r'#![^\n\r]*(\r|\n|\r\n)?', n: "scripttag"),
// TODO • can't parse '/**' because this will mess up parsing '/**/' I think I'll need a new language for the body of bcstart if it starts with *.
apgm(v: r'/*', k: key_blockcomment, n: "bcinit"),
// TODO • have a separate language for line comments to parse '///' correctly.
arnm(r: r'//([^\n\r]+)?(\r|\n|\r\n)?', n: "singlelinenondoccomment"),
],
children: [
lang_base,
],
r: const LexiResolutionHandlerExplicitImpl(
resolvers: [],
),
),
inner(
name: "bsc",
atoms: [
arnm(r: r'\\n', n: "escaped_lf"),
arnm(r: r'\\r', n: "escaped_cr"),
arnm(r: r'\\b', n: "escaped_b"),
arnm(r: r'\\t', n: "escaped_ht"),
arnm(r: r'\\v', n: "escaped_vt"),
arnm(r: r'\\x[0-9a-fA-F][0-9a-fA-F]', n: "escaped_byte"),
arnm(r: r'\\u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]', n: "escaped_unicodesimple"),
arnm(r: r'\\u{[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?}', n: "escaped_unicoderaw"),
arnm(r: r'\\[^bnrtuvx]', n: "escaped_unescaped"),
arnm(r: r'$[_a-zA-Z][_0-9a-zA-Z]*', n: "simpleinterpolation"),
argm(r: r'${', k: key_base, n: "advancedinterpolation"),
],
children: [
leaf(
name: "slsq",
atoms: [
aplm(v: "'", n: "slsqpop"),
arnm(r: r"[^'$\\\n\r]+", n: "slsqcontent"),
],
handle: key_slsq,
),
leaf(
name: "sldq",
atoms: [
aplm(v: '"', n: "sldqpop"),
arnm(r: r'[^"$\\\n\r]+', n: "sldqcontent"),
],
handle: key_sldq,
),
leaf(
name: "mlsq",
atoms: [
aplm(v: "'''", n: "mlsqpop"),
arnm(r: r"[^'$\\]+", n: "mlsqcontent"),
arnm(r: r"('|'')", n: "mlsqq"),
],
handle: key_mlsq,
),
leaf(
name: "mldq",
atoms: [
aplm(v: '"""', n: "mldqpop"),
arnm(r: r'[^"$\\]+', n: "mldqcontent"),
arnm(r: r'("|"")', n: "mldqq"),
],
handle: key_mldq,
),
],
),
leaf(
name: "rslsq",
atoms: [
aplm(v: "'", n: "rslsqend"),
arnm(r: r"[^\n\r']+", n: "rslsqcontent"),
],
handle: key_rslsq,
),
leaf(
name: "rsldq",
atoms: [
aplm(v: '"', n: "rsldqend"),
arnm(r: r'[^\n\r"]+', n: "rsldqcontent"),
],
handle: key_rsldq,
),
leaf(
name: "rmlsq",
atoms: [
aplm(v: r"'''", n: "rmlsqend"),
arnm(r: r"(('|'')?[^'])+", n: "rmlsqcontent"),
],
handle: key_rmlsq,
),
leaf(
name: "rmldq",
atoms: [
aplm(v: '"""', n: "rmldqend"),
arnm(r: r'(("|"")?[^"])+', n: "rmldqcontent"),
],
handle: key_rmldq,
),
];
}(),
),
start: key_base,
);
Here's a visual representation that I use to visualize that specification:
dot.pdf
Here's the ANTLR specification of Dart.
ANTLR uses order to disambiguate between identifiers and keywords. The way I handle the specification, only keywords and identifiers need to be disambiguated in favor of the keyword.
I plan to translate that specification to owl just to see how it works out.
Utterly random question: how difficult do you think it would be to extend owl with indentation-sensitive parsing in the style of this paper? https://michaeldadams.org/papers/layout_parsing/LayoutParsing.pdf
I'm mainly curious since this is a seriously cool project and I've learned a fair bit from it already. Basically, I'm hoping to absorb your expertise by osmosis. ^_^
I am running PopOS (an Ubuntu derivative):
$ uname -a
Linux pop-os 5.11.0-7620-generic #21~1626191760~20.10~55de9c3-Ubuntu SMP Wed Jul 21 20:34:49 UTC x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 10.3.0-1ubuntu1~20.10) 10.3.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
When I run make
with commit db462c6, I get the following errors and warnings:
$ make
cc -O3 -std=c11 -pedantic -Wall -Wno-missing-braces -Wno-overlength-strings -o owl src/*.c -ldl
In file included from src/owl.c:9:
src/test.h:12:5: error: unknown type name ‘pid_t’
12 | pid_t child;
| ^~~~~
In file included from src/test.c:2:
src/test.h:12:5: error: unknown type name ‘pid_t’
12 | pid_t child;
| ^~~~~
src/test.c: In function ‘spawn_child’:
src/test.c:70:19: warning: implicit declaration of function ‘fdopen’; did you mean ‘fopen’? [-Wimplicit-function-declaration]
70 | t->file = fdopen(fd[1], "w");
| ^~~~~~
| fopen
src/test.c:70:17: warning: assignment to ‘FILE *’ from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
70 | t->file = fdopen(fd[1], "w");
| ^
src/test.c:58:5: warning: ignoring return value of ‘pipe’ declared with attribute ‘warn_unused_result’ [-Wunused-result]
58 | pipe(fd);
| ^~~~~~~~
make: *** [Makefile:24: owl] Error 1
See, for example, this Stack Overflow question for a discussion of the problem with the definition of pid_t. I can fix that by adding #include <sys/types.h> to src/test.h.
The issue with fdopen is that it is POSIX-only and not part of C11, so its declaration is hidden when compiling with -std=c11.
It occurs to me that a C-based custom token matcher is overkill in most cases. For example:
.token hex '0x0BADCAFE' '0x248c'
.token oct '0644' '0777771'
.token bin '11001011b' '10b'
These could be defined entirely in the language specification using regexes:
.token hex /0x[0-9A-Fa-f]+/
.token oct /0[1-7][0-7]*/
.token bin /[01]+b/
I'd suggest adopting a higher-level build system in place of the current make script, so that checks for platform features like these become easier.
I'm trying to use owl in a C++ project, so I need to include parser.h (declarations only) in main.cpp, where I call e.g. owl_tree_create_from_string(). The implementation is included from an almost empty parser.c. The C++ compiler, however, fails to parse parser.h while compiling main.cpp because of the keyword class, which occurs in unexpected places as a plain identifier.
I fixed the generated parser.h with the following sed command, which appends an underscore to every occurrence of "class":
sed -i '' 's/class/class_/g' src/parser.h
This is more than strictly necessary, because "class" also occurs inside longer words, but the fix works.
It might be reasonable to slightly change the naming in 1-parse.h to make the generated header C++-compatible. After all, owl works nicely from C++ as well as from C.
>make
cc -O3 -std=c11 -pedantic -Wall -Wno-missing-braces -Wno-overlength-strings -DNOT_UNIX -o owl src/*.c
cc: error: src/*.c: Invalid argument
cc: fatal error: no input files
compilation terminated.
make: *** [owl] Error 1
What build tools do I need to get this to work?
FYI: I just tried re-building this owl experiment
https://github.com/WarrenWeckesser/experiments/tree/master/c/gufunc-out-expr
using 93a6b7c. I get the following compiler warning (converted to an error because of -Werror
):
$ make
gcc -Wall -Werror -pedantic -std=c99 main.c -lm -o main
In file included from main.c:9:
dim-expr-parser.h: In function ‘owl_default_tokenizer_advance’:
dim-expr-parser.h:787:25: error: variable ‘string’ set but not used [-Werror=unused-but-set-variable]
787 | const char *string = text + content_offset;
| ^~~~~~
cc1: all warnings being treated as errors
make: *** [Makefile:5: main] Error 1
The following is the generated code around line 787:
else if (token == 4294967295U) {
    size_t content_offset = offset + 1;
    size_t content_length = token_length - 2;
    const char *string = text + content_offset; // <<< line 787
    size_t string_length = content_length;
    if (has_escapes) {
        for (size_t i = 0; i < content_length; ++i) {
            if (text[content_offset + i] == '\\') {
                string_length--;
                i++;
            }
        }
        char *unescaped = allocate_string_contents(string_length, tokenizer->info);
        size_t j = 0;
        for (size_t i = 0; i < content_length; ++i) {
            if (text[content_offset + i] == '\\') i++;
            unescaped[j++] = ESCAPE_CHAR(text[content_offset + i], tokenizer->info);
        }
        string = unescaped;
    }
    IGNORE_TOKEN_WRITE(offset, token_length, string, string_length, has_escapes, tokenizer->info);
}
This is not a major issue for me, since this is still just an experiment with owl, but I thought you'd like to know.