ianh / owl
A parser generator for visibly pushdown languages.
License: MIT License
I was thinking about adding support for hexadecimal arguments in my grammar when I noticed that owl already parses things like "0xff". But why is every hex input parsed as a number and not as an integer? In my application this interferes with the type checking, because a function might expect an integer and would refuse to accept "0xff".
For example, the grammar
input = argument{','}
argument =
integer : int
number : num
with the input
1, 1.2, 0xff, 0xff.f
yields:
1 , 1.2 , 0xff , 0xff.f
argument:int argument:num argument:num argument:num
Would it be possible to distinguish hex integers from hex floating-point numbers without writing a new tokenize function?
owl is fantastic, thanks for sharing it! I've started experimenting with it for creating a little language that defines a system of differential equations.
The predefined number token is convenient, but I would like more flexible parsing of numbers. In some expressions, I want integers only. For example, when parsing the expression in an index operation such as x[k - 1], the numerical values inside the brackets must be integers; I want to treat x[k - 1.5] as invalid.
For other expressions, I would like to parse complex numbers, where the imaginary part of the number is indicated with a j, e.g. z = -1.0 + 2.5j (I'm using the Python convention for complex numbers).
Is there already an obvious way to handle these cases, or will I have to modify the tokenizer so it doesn't automatically consume numerical values as doubles? After only a brief look at the code, I suspect modifying the tokenizer would involve adding two more numerical token types, say integer and imaginary. If the input stream contains a number terminated by a j (with no whitespace), create an imaginary token; if the number is only digits (and not terminated by j), create an integer token.
Hi there,
Awesome project. I love the strict approach taken to validation, and the owl utility is really nifty; it makes writing a definition for an existing language a breeze.
However, I'm running into an issue that I think can be solved rather easily in the library: the current token classes are non-comprehensive, in that there are symbols that don't fall into the category number | string | identifier.
For context, I'm actually trying to use owl to model a shell scripting language, where there are two issues that don't play well with owl's current architecture: tokens not matching a reserved pattern should be treated as strings ("literal data") and line breaks are special characters that shouldn't be treated as generic whitespace (I'll file another issue for that separately).
This first issue should be solvable by simply introducing a final token class, other or unmatched, which accepts literally anything so long as no keyword is expected at that point, and can be used to accept unquoted content where it does not introduce ambiguities into the parser.
In construct_node_alloc() there is a possibly empty allocation, because number_of_slots can be 0:
Lines 1675 to 1676 in b7f1cf4
In this case the behavior of calloc() is implementation-defined:
If nmemb or size is 0, then calloc() returns either NULL, or a unique pointer value that can later be successfully passed to free().
https://linux.die.net/man/3/calloc
I was trying to compile owl for an embedded system, and abort() was called when creating the tree. So I fixed the parser with the following sed command, taking the number of slots into account in the if statement:
sed -i '' 's/(!node->slots)/(number_of_slots \&\& !node->slots)/' src/parser.h
I guess a similar fix would help to ensure a more consistent behavior across different compilers/architectures.
Hello @ianh,
I'm wondering, do you have any insights into incremental parsing, where VPLs could offer some kind of advantage over parsing context-free languages using, e.g., LR/LL parsing?
Or, specifically, how would you go about making owl work in an incremental fashion? That is, if owl has parsed some source once, and only small parts of it have changed, subsequent parses should not have to reparse everything from scratch.
I'd appreciate any thoughts you might have on this.
Consider the following grammar:
sl = [ "'" "a"* "'" ]
which is meant to model a single-quoted, single-line string ('', 'aaa', ...).
My mental model of how it would work with VPLs is the following:
sl = [ "'" "a"* "'" ]
^ ^ | |
|----- | | |
| | | |
'-> state 0 | '-> state 0 again
| |
| |
| '-> we see a "'", we pop to state 0
|
'-> state 1 (any "'" will pop to state 0)
- we start in state 0.
- we switch to state 1 on an "'".
- we do our regular parsing in state 1.
- we see a "'" in state 1, so we pop state 1 and we go back to state 0.
The goal here is to support recursive definitions like the following and, as expected, the following works:
sl = [ "<" ("a" | [ "{" sl "}" ] )* ">" ]
However, if the "bracket symbols" are the same, it doesn't work anymore:
sl = [ "'" ("a" | [ "{" sl "}" ] )* "'" ]
error: token ''' can't be used as both a start and an end
keyword
sl = [ "'" ("a" | [ "{" sl "}" ] )* "'" ]
~~~ ~~~
I'm wondering, is my mental model of VPLs wrong, and having identical "start" and "end" symbols is simply not possible, or is this a limitation of owl and not a limitation of VPLs?
Hello @ianh,
Here you wrote:
So the last thing to do is to incrementally update the tree itself based on the change to the sequence of construction actions. I'm not sure how to do this, but I'm sure someone has written about it somewhere.
This reminds me a lot of GLR parsing. One thing that has made GLR parsing much simpler for me is not to construct an explicit tree, but to maintain a postordered list of tree events (see "Post-Order Parser Output" for a brief explanation).
A postordered list of events is inherent to LR parsing; a preordered list of events is inherent to LL parsing. I was wondering, how feasible would it be to generate such a list of events with the approach that owl takes? That is, would it be possible to compress your construction actions into just two categories of events: begin_subtree_of_type_x and end_subtree_of_type_x?
If so, do you think that your approach would support a preordered or a postordered list of events?
With such a list of events, updating subtrees could be made very simple with a Piece table, which I'm currently investigating in a different project.
I can't find any resources on how to define a custom tokenizer other than the hex-color example. That example seems to be considerably simpler than the custom tokenizers owl can currently support (for example, it doesn't use the tokenization info).
Additionally, it would be good to know more about how the default tokenizer works, particularly how whitespace is handled. I would have no idea how to write a whitespace-handling parser from looking at the current examples and readme.
I would like to point out that identifiers like "_AUTOMATON_H_" and "_TERMINAL_H_" do not fit the naming conventions of the C language standard (identifiers beginning with an underscore followed by an uppercase letter are reserved for the implementation).
Would you like to adjust your selection of unique names?
Hello,
I am currently working on OpenCRS, and I used your app for testing the Vulnerability Detection module.
While doing so, I discovered an out-of-bounds read that could prove to be a vulnerability:
valgrind ./owl -T -g -
I already forked the repository and proposed a patch in Pull Request #32.
Owl is awesome, thank you!
My proposals:
- range(cp1, cp2) or range[cp1, cp2]: cp1 and cp2 are code points (hex or decimal)
- block(name): a Unicode block name (Basic_Latin, Latin-1_Supplement, etc.)
- property(name): a Unicode property name (White_Space, Hyphen, Ps, Mn, etc.)
- script(name): a Unicode script name (Common, Latin, etc.)
What do you think?
The grammar:
.token foo
.token bar
program = 'f'
Produces a parser.h that fails to compile, because it defines one write_custom_token function for each custom token:
static void write_custom_token(size_t offset, size_t length, uint32_t token, uint64_t data, void *info) {
struct owl_tree *tree = info;
size_t token_offset = tree->next_offset;
write_tree(tree, token_offset - tree->next_foo_token_offset);
write_tree(tree, offset);
write_tree(tree, length);
write_tree(tree, data);
tree->next_foo_token_offset = token_offset;
}
static void write_custom_token(size_t offset, size_t length, uint32_t token, uint64_t data, void *info) { // <== Redefinition of write_custom_token
struct owl_tree *tree = info;
size_t token_offset = tree->next_offset;
write_tree(tree, token_offset - tree->next_bar_token_offset);
write_tree(tree, offset);
write_tree(tree, length);
write_tree(tree, data);
tree->next_bar_token_offset = token_offset;
}
The grammar
a = identifier@return
won't compile because return is a C keyword.
We need to reserve all the C keywords, along with range and type (because these conflict with built-in fields).
Consider the grammar of smileys with an arbitrary long space as a nose:
face = eyes nose mouth
eyes =
':' : normal
';' : winking
nose =
' '* : space
mouth =
')' : smile
The playground says:
error: token ' ' can't be used as both a whitespace and a normal keyword
' '* : space
~~~
It's not quite clear to me why a nose can't be a space. Are space characters like ' ' treated as whitespace by default? Is there a way to disable the default whitespace rules?
According to #43 (comment), call and return symbols need to be disjoint.
Therefore, #38 is not possible because string literals viewed as a VPL would have an identical call and return symbol, and would not be a VPL.
However, I'm wondering, could it be that this requirement by Rajeev Alur & P. Madhusudan for VPLs is too restrictive?
I don't have an intuitive understanding of the inner workings of owl, and this might be obvious, but @ianh can you think of a counterexample where having identical call and return symbols would break any of the invariants that owl depends on?
What would happen (where would owl break) if owl ignored this requirement, and a grammar like [ '[' ... '[' ] were accepted and compiled by owl?
Owl's generated parsers declare function names with external linkage. This is a problem if you want to include two different parsers in the same program.
emit(<"a" )
Calling
parsed_string_get(parsed_expr_get(parsed_arg_get(arg).expr).string).string
results in
a" )
though a was expected.
#using owl.v4
program = stmt*
stmt =
   identifier ['(' arg* ')'] : func
   identifier '=' expr : assign
   'stream' identifier ':' identifier : creates
   'tick' identifier : ticks
   'move' move : moves
arg = '<' expr
move =
   identifier '<' expr : movel
   expr '>' identifier : mover
expr =
   [ '(' expr ')' ] : parens
   number : literal
   identifier : ident
   string : str
.operators prefix
   '-' : negative
.operators infix left
   '*' : times
   '/' : divided-by
.operators infix left
   '+' : plus
   '-' : minus
Hello @ianh,
This is essentially the same issue as this one.
I've used that feature to collect a bunch of grammars here. I think it would be useful if the owl playground could support that too.
Lots of programming languages support integers, and it'd be nice to support them in Owl too. If both types of tokens were used in the same grammar, both would be matched, with the longer match winning; if integer weren't used in the grammar, number would behave the same as it does now.
If I try to implement another "built-in token class" like hexinteger (I will follow the implementation of the other built-ins), would the PR be accepted?
Unfortunately, for my use case I can't match integers like 0xABC, because the built-in integer matcher doesn't handle the 0x prefix.
I'm trying to implement a macro keyword. Is it possible to do something like this?
'def' identifier remaining_line: define
If not, where would be a good place for me to start looking to add such a feature?
Also this is a really cool project! I've tested out quite a few different parser generators and this one just blows everything out of the water so far for what I'm trying to use it for.
(@ianh I hope you don't mind me reporting any issues I find)
This one is really not urgent, but the following grammar:
foo = a
Leads to the following error message:
error: 'ᅡᅠ' isn't a valid token
foo = aᅡᅠ
~~
The space after the a is a non-breaking space (command+space), and it is being reported as a right turnstile.
This is a really nice project. I am considering porting a shell scripting language to a formal owl definition and an owl-based parser, as I really like the design of this project.
However, one blocker for us is that in scripting languages with an optional terminating semicolon, a line break may be used instead of an explicit ;. The hard-coded interpretation of \n as a mere token separator is already a blocking issue here, but even if it weren't, I feel as though handling this would require a separate class of tokens sitting somewhere between whitespace and literal tokens, if I'm not mistaken?
In particular, I am looking to parse syntax like
for f in foo; bar; baz; end
as being equal to
for f in foo
bar # or bar;
baz # or baz;
end
This example could work by replacing all line breaks with semicolons prior to parsing (feasibility/overhead aside), but that doesn't always work; for example:
echo hello |
cat
as some symbols are allowed to wrap onto the next line without hard-escaping the newline with a backslash, but this wouldn't be accepted if it were input as echo hello |; cat, as that's two separate statements.
Any ideas?
Would you like to add more error handling for return values from functions like the following?
offset gets shadowed here:
Line 618 in 0622ab1
This results in a warning in most compilers. I don't know the project well, but at first glance offset%%rule-index might work.
Consider the following grammar:
start = kabstract | ident
kabstract = "abstract"
ident = atoz_lower+
atoz_lower = "a"|"b"|"c"|"d"|"e"|"f"|"g"|"h"|"i"|"j"|"k"|"l"|"m"|"n"|"o"|"p"|"q"|"r"|"s"|"t"|"u"|"v"|"w"|"x"|"y"|"z"
If we use the playground to parse abstract
, we get the following result:
. abstract
kabstract
This is a little confusing to me: abstract could also be parsed as
"a" "b" "s" "t" "r" "a" "c" "t"
via ident. Why does owl not report any ambiguities here?
I find myself writing some variation of RuleName ? ("," RuleName)* repeatedly, even in simple grammars. This is not a tremendous hardship, but it would be more elegant to have a "separated by" operator, similar to those found in many parser combinator libraries. For example:
FnDefinition = "function" identifier "(" identifier ? ("," identifier)* ")" Block
would instead be:
FnDefinition = "function" identifier "(" identifier % "," ")" Block
Again, I recognize that this is merely syntactic sugar, but so are the + and ? operators, and they reduce repetition and increase readability.
I have a rough specification of the lexical structure of Dart, which I think is a VPL. It looks like it just needs a little massaging to make a great real-world example grammar for owl.
One of my parsing systems attempts to simulate VPAs by having a set of regular automata, and a stack to track which regular automaton is active. Push and pop actions associated with certain literals switch between regular automata. The specification is designed around that idea.
final key_blockcomment = LexiKey();
final key_mldq = LexiKey();
final key_mlsq = LexiKey();
final key_sldq = LexiKey();
final key_slsq = LexiKey();
final key_rmldq = LexiKey();
final key_rmlsq = LexiKey();
final key_rsldq = LexiKey();
final key_rslsq = LexiKey();
final key_base = LexiKey();
return LexiLexerInfoImpl(
root: inner(
name: "top",
atoms: [],
children: () {
final lang_base = leaf(
name: "base",
atoms: [
apgm(v: "'", n: "startslsq", k: key_slsq),
apgm(v: '"', n: "startsldq", k: key_sldq),
argm(r: r"'''((\\? )*\\?(\n|\r|\r\n))?", n: "startmlsq", k: key_mlsq),
argm(r: r'"""((\\? )*\\?(\n|\r|\r\n))?', n: "startmldq", k: key_mldq),
apgm(v: "r'", n: "startrslsq", k: key_rslsq),
apgm(v: 'r"', n: "startrsldq", k: key_rsldq),
argm(r: r"r'''((\\? )*\\?(\n|\r|\r\n))?", n: "startrmlsq", k: key_rmlsq),
argm(r: r'r"""((\\? )*\\?(\n|\r|\r\n))?', n: "startrmldq", k: key_rmldq),
arnm(r: r'[0-9]+([eE][+\-]?[0-9]+)?', n: "integer"),
arnm(r: r'[0-9]*\.[0-9]+([eE][+\-]?[0-9]+)?', n: "decimal"),
arnm(r: r'0[xX][0-9ABCDEFabcdef]+', n: "hex"),
apgm(v: r'{', k: key_base, n: "lcur"),
aplm(v: r'}', n: "rcur"),
],
handle: key_base,
);
return [
() {
final base_identifier = arnm(r: r'[_$a-zA-Z][_$0-9a-zA-Z]*', n: "identifier");
final base_kabstract = apnm(v: r'abstract', n: "kabstract");
final base_kas = apnm(v: r'as', n: "kas");
final base_kassert = apnm(v: r'assert', n: "kassert");
final base_kasync = apnm(v: r'async', n: "kasync");
final base_kawait = apnm(v: r'await', n: "kawait");
final base_ksealed = apnm(v: r'sealed', n: "ksealed");
final base_kbase = apnm(v: r'base', n: "kbase");
final base_kwhen = apnm(v: r'when', n: "kwhen");
final base_kbreak = apnm(v: r'break', n: "kbreak");
final base_kcase = apnm(v: r'case', n: "kcase");
final base_kcatch = apnm(v: r'catch', n: "kcatch");
final base_kclass = apnm(v: r'class', n: "kclass");
final base_kconst = apnm(v: r'const', n: "kconst");
final base_kcontinue = apnm(v: r'continue', n: "kcontinue");
final base_kcovariant = apnm(v: r'covariant', n: "kcovariant");
final base_kdefault = apnm(v: r'default', n: "kdefault");
final base_kdeferred = apnm(v: r'deferred', n: "kdeferred");
final base_kdo = apnm(v: r'do', n: "kdo");
final base_kdynamic = apnm(v: r'dynamic', n: "kdynamic");
final base_kelse = apnm(v: r'else', n: "kelse");
final base_kenum = apnm(v: r'enum', n: "kenum");
final base_kexport = apnm(v: r'export', n: "kexport");
final base_kextends = apnm(v: r'extends', n: "kextends");
final base_kextension = apnm(v: r'extension', n: "kextension");
final base_kexternal = apnm(v: r'external', n: "kexternal");
final base_kfactory = apnm(v: r'factory', n: "kfactory");
final base_kfalse = apnm(v: r'false', n: "kfalse");
final base_kfinal = apnm(v: r'final', n: "kfinal");
final base_kfinally = apnm(v: r'finally', n: "kfinally");
final base_kfor = apnm(v: r'for', n: "kfor");
final base_kfunction = apnm(v: r'Function', n: "kfunction");
final base_kget = apnm(v: r'get', n: "kget");
final base_khide = apnm(v: r'hide', n: "khide");
final base_kif = apnm(v: r'if', n: "kif");
final base_kimplements = apnm(v: r'implements', n: "kimplements");
final base_kimport = apnm(v: r'import', n: "kimport");
final base_kin = apnm(v: r'in', n: "kin");
final base_kinterface = apnm(v: r'interface', n: "kinterface");
final base_kis = apnm(v: r'is', n: "kis");
final base_klate = apnm(v: r'late', n: "klate");
final base_klibrary = apnm(v: r'library', n: "klibrary");
final base_kmixin = apnm(v: r'mixin', n: "kmixin");
final base_knew = apnm(v: r'new', n: "knew");
final base_knull = apnm(v: r'null', n: "knull");
final base_kof = apnm(v: r'of', n: "kof");
final base_kon = apnm(v: r'on', n: "kon");
final base_koperator = apnm(v: r'operator', n: "koperator");
final base_kpart = apnm(v: r'part', n: "kpart");
final base_krequired = apnm(v: r'required', n: "krequired");
final base_krethrow = apnm(v: r'rethrow', n: "krethrow");
final base_kreturn = apnm(v: r'return', n: "kreturn");
final base_kset = apnm(v: r'set', n: "kset");
final base_kshow = apnm(v: r'show', n: "kshow");
final base_kstatic = apnm(v: r'static', n: "kstatic");
final base_ksuper = apnm(v: r'super', n: "ksuper");
final base_kswitch = apnm(v: r'switch', n: "kswitch");
final base_ksync = apnm(v: r'sync', n: "ksync");
final base_kthis = apnm(v: r'this', n: "kthis");
final base_kthrow = apnm(v: r'throw', n: "kthrow");
final base_ktrue = apnm(v: r'true', n: "ktrue");
final base_ktry = apnm(v: r'try', n: "ktry");
final base_ktypedef = apnm(v: r'typedef', n: "ktypedef");
final base_kvar = apnm(v: r'var', n: "kvar");
final base_kvoid = apnm(v: r'void', n: "kvoid");
final base_kwhile = apnm(v: r'while', n: "kwhile");
final base_kwith = apnm(v: r'with', n: "kwith");
final base_kyield = apnm(v: r'yield', n: "kyield");
return inner(
name: "kw",
atoms: [
base_identifier,
// region keywords
base_kabstract,
base_kas,
base_kassert,
base_kasync,
base_kawait,
base_ksealed,
base_kbase,
base_kwhen,
base_kbreak,
base_kcase,
base_kcatch,
base_kclass,
base_kconst,
base_kcontinue,
base_kcovariant,
base_kdefault,
base_kdeferred,
base_kdo,
base_kdynamic,
base_kelse,
base_kenum,
base_kexport,
base_kextends,
base_kextension,
base_kexternal,
base_kfactory,
base_kfalse,
base_kfinal,
base_kfinally,
base_kfor,
base_kfunction,
base_kget,
base_khide,
base_kif,
base_kimplements,
base_kimport,
base_kin,
base_kinterface,
base_kis,
base_klate,
base_klibrary,
base_kmixin,
base_knew,
base_knull,
base_kof,
base_kon,
base_koperator,
base_kpart,
base_krequired,
base_krethrow,
base_kreturn,
base_kset,
base_kshow,
base_kstatic,
base_ksuper,
base_kswitch,
base_ksync,
base_kthis,
base_kthrow,
base_ktrue,
base_ktry,
base_ktypedef,
base_kvar,
base_kvoid,
base_kwhile,
base_kwith,
base_kyield,
// endregion
],
children: [
lang_base,
],
r: LexiResolutionHandlerExplicitImpl(
resolvers: [
LexiResolverImpl(match: [base_identifier, base_kabstract], prefer: base_kabstract),
LexiResolverImpl(match: [base_identifier, base_kas], prefer: base_kas),
LexiResolverImpl(match: [base_identifier, base_kwhen], prefer: base_kwhen),
LexiResolverImpl(match: [base_identifier, base_ksealed], prefer: base_ksealed),
LexiResolverImpl(match: [base_identifier, base_kbase], prefer: base_kbase),
LexiResolverImpl(match: [base_identifier, base_kassert], prefer: base_kassert),
LexiResolverImpl(match: [base_identifier, base_kasync], prefer: base_kasync),
LexiResolverImpl(match: [base_identifier, base_kawait], prefer: base_kawait),
LexiResolverImpl(match: [base_identifier, base_kbreak], prefer: base_kbreak),
LexiResolverImpl(match: [base_identifier, base_kcase], prefer: base_kcase),
LexiResolverImpl(match: [base_identifier, base_kcatch], prefer: base_kcatch),
LexiResolverImpl(match: [base_identifier, base_kclass], prefer: base_kclass),
LexiResolverImpl(match: [base_identifier, base_kconst], prefer: base_kconst),
LexiResolverImpl(match: [base_identifier, base_kcontinue], prefer: base_kcontinue),
LexiResolverImpl(match: [base_identifier, base_kcovariant], prefer: base_kcovariant),
LexiResolverImpl(match: [base_identifier, base_kdefault], prefer: base_kdefault),
LexiResolverImpl(match: [base_identifier, base_kdeferred], prefer: base_kdeferred),
LexiResolverImpl(match: [base_identifier, base_kdo], prefer: base_kdo),
LexiResolverImpl(match: [base_identifier, base_kdynamic], prefer: base_kdynamic),
LexiResolverImpl(match: [base_identifier, base_kelse], prefer: base_kelse),
LexiResolverImpl(match: [base_identifier, base_kenum], prefer: base_kenum),
LexiResolverImpl(match: [base_identifier, base_kexport], prefer: base_kexport),
LexiResolverImpl(match: [base_identifier, base_kextends], prefer: base_kextends),
LexiResolverImpl(match: [base_identifier, base_kextension], prefer: base_kextension),
LexiResolverImpl(match: [base_identifier, base_kexternal], prefer: base_kexternal),
LexiResolverImpl(match: [base_identifier, base_kfactory], prefer: base_kfactory),
LexiResolverImpl(match: [base_identifier, base_kfalse], prefer: base_kfalse),
LexiResolverImpl(match: [base_identifier, base_kfinal], prefer: base_kfinal),
LexiResolverImpl(match: [base_identifier, base_kfinally], prefer: base_kfinally),
LexiResolverImpl(match: [base_identifier, base_kfor], prefer: base_kfor),
LexiResolverImpl(match: [base_identifier, base_kfunction], prefer: base_kfunction),
LexiResolverImpl(match: [base_identifier, base_kget], prefer: base_kget),
LexiResolverImpl(match: [base_identifier, base_khide], prefer: base_khide),
LexiResolverImpl(match: [base_identifier, base_kif], prefer: base_kif),
LexiResolverImpl(match: [base_identifier, base_kimplements], prefer: base_kimplements),
LexiResolverImpl(match: [base_identifier, base_kimport], prefer: base_kimport),
LexiResolverImpl(match: [base_identifier, base_kin], prefer: base_kin),
LexiResolverImpl(match: [base_identifier, base_kinterface], prefer: base_kinterface),
LexiResolverImpl(match: [base_identifier, base_kis], prefer: base_kis),
LexiResolverImpl(match: [base_identifier, base_klate], prefer: base_klate),
LexiResolverImpl(match: [base_identifier, base_klibrary], prefer: base_klibrary),
LexiResolverImpl(match: [base_identifier, base_kmixin], prefer: base_kmixin),
LexiResolverImpl(match: [base_identifier, base_knew], prefer: base_knew),
LexiResolverImpl(match: [base_identifier, base_knull], prefer: base_knull),
LexiResolverImpl(match: [base_identifier, base_kof], prefer: base_kof),
LexiResolverImpl(match: [base_identifier, base_kon], prefer: base_kon),
LexiResolverImpl(match: [base_identifier, base_koperator], prefer: base_koperator),
LexiResolverImpl(match: [base_identifier, base_kpart], prefer: base_kpart),
LexiResolverImpl(match: [base_identifier, base_krequired], prefer: base_krequired),
LexiResolverImpl(match: [base_identifier, base_krethrow], prefer: base_krethrow),
LexiResolverImpl(match: [base_identifier, base_kreturn], prefer: base_kreturn),
LexiResolverImpl(match: [base_identifier, base_kset], prefer: base_kset),
LexiResolverImpl(match: [base_identifier, base_kshow], prefer: base_kshow),
LexiResolverImpl(match: [base_identifier, base_kstatic], prefer: base_kstatic),
LexiResolverImpl(match: [base_identifier, base_ksuper], prefer: base_ksuper),
LexiResolverImpl(match: [base_identifier, base_kswitch], prefer: base_kswitch),
LexiResolverImpl(match: [base_identifier, base_ksync], prefer: base_ksync),
LexiResolverImpl(match: [base_identifier, base_kthis], prefer: base_kthis),
LexiResolverImpl(match: [base_identifier, base_kthrow], prefer: base_kthrow),
LexiResolverImpl(match: [base_identifier, base_ktrue], prefer: base_ktrue),
LexiResolverImpl(match: [base_identifier, base_ktry], prefer: base_ktry),
LexiResolverImpl(match: [base_identifier, base_ktypedef], prefer: base_ktypedef),
LexiResolverImpl(match: [base_identifier, base_kvar], prefer: base_kvar),
LexiResolverImpl(match: [base_identifier, base_kvoid], prefer: base_kvoid),
LexiResolverImpl(match: [base_identifier, base_kwhile], prefer: base_kwhile),
LexiResolverImpl(match: [base_identifier, base_kwith], prefer: base_kwith),
LexiResolverImpl(match: [base_identifier, base_kyield], prefer: base_kyield),
],
),
);
}(),
lang_base,
inner(
name: "sy",
atoms: [
apnm(v: r'[', n: "slbra"),
apnm(v: r']', n: "srbra"),
apnm(v: r'(', n: "slpar"),
apnm(v: r')', n: "srpar"),
apnm(v: r',', n: "scomma"),
apnm(v: r':', n: "scolon"),
apnm(v: r';', n: "ssemicolon"),
apnm(v: r'&', n: "sand"),
apnm(v: r'&=', n: "sandeq"),
apnm(v: r'^', n: "scaret"),
apnm(v: r'^=', n: "scareteq"),
apnm(v: r'>', n: "sg"),
apnm(v: r'<', n: "sl"),
apnm(v: r'%', n: "spercent"),
apnm(v: r'%=', n: "spercenteq"),
apnm(v: r'+', n: "splus"),
apnm(v: r'+=', n: "spluseq"),
apnm(v: r'|', n: "spipe"),
apnm(v: r'|=', n: "spipeeq"),
apnm(v: r'??', n: "sqq"),
apnm(v: r'??=', n: "sqqeq"),
apnm(v: r'~/', n: "stildeslash"),
apnm(v: r'~/=', n: "stildeslasheq"),
apnm(v: r'*', n: "sstar"),
apnm(v: r'*=', n: "sstareq"),
apnm(v: r'/', n: "sslash"),
apnm(v: r'/=', n: "sslasheq"),
apnm(v: r'-', n: "sminus"),
apnm(v: r'-=', n: "sminuseq"),
apnm(v: r'=', n: "seq"),
apnm(v: r'==', n: "seqeq"),
apnm(v: r'&&', n: "sandand"),
apnm(v: r'@', n: "sat"),
apnm(v: r'.', n: "sdot"),
apnm(v: r'..', n: "sdotdot"),
apnm(v: r'...', n: "sdotdotdot"),
apnm(v: r'...?', n: "sdotdotdotq"),
apnm(v: r'--', n: "sminusminus"),
apnm(v: r'!', n: "sbang"),
apnm(v: r'!=', n: "sbangeq"),
apnm(v: r'++', n: "splusplus"),
apnm(v: r'#', n: "shash"),
apnm(v: r'||', n: "spipepipe"),
apnm(v: r'?', n: "squestion"),
apnm(v: r'?.', n: "squestiondot"),
apnm(v: r'?..', n: "squestiondotdot"),
apnm(v: r'~', n: "stilde"),
],
children: [
lang_base,
],
),
leaf(
name: "blockcomment",
handle: key_blockcomment,
atoms: [
argm(r: r"(\*+[^/*]+|/+[^/*]+|[^/*]+)*/*/\*", k: key_blockcomment, n: "bcstart"),
arlm(r: r"(\*+[^/*]+|/+[^/*]+|[^/*]+)*\**\*/", n: "bcend"),
],
),
inner(
name: "t",
atoms: [
arnm(r: r'( |\t|\n|\r)+', n: "ws"),
arnm(r: r'#![^\n\r]*(\r|\n|\r\n)?', n: "scripttag"),
// TODO • can't parse '/**' because this will mess up parsing '/**/' I think I'll need a new language for the body of bcstart if it starts with *.
apgm(v: r'/*', k: key_blockcomment, n: "bcinit"),
// TODO • have a separate language for line comments to parse '///' correctly.
arnm(r: r'//([^\n\r]+)?(\r|\n|\r\n)?', n: "singlelinenondoccomment"),
],
children: [
lang_base,
],
r: const LexiResolutionHandlerExplicitImpl(
resolvers: [],
),
),
inner(
name: "bsc",
atoms: [
arnm(r: r'\\n', n: "escaped_lf"),
arnm(r: r'\\r', n: "escaped_cr"),
arnm(r: r'\\b', n: "escaped_b"),
arnm(r: r'\\t', n: "escaped_ht"),
arnm(r: r'\\v', n: "escaped_vt"),
arnm(r: r'\\x[0-9a-fA-F][0-9a-fA-F]', n: "escaped_byte"),
arnm(r: r'\\u[0-9a-fA-F][0-9a-fA-F][0-9a-fA-F][0-9a-fA-F]', n: "escaped_unicodesimple"),
arnm(r: r'\\u{[0-9a-fA-F][0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?[0-9a-fA-F]?}', n: "escaped_unicoderaw"),
arnm(r: r'\\[^bnrtuvx]', n: "escaped_unescaped"),
arnm(r: r'$[_a-zA-Z][_0-9a-zA-Z]*', n: "simpleinterpolation"),
argm(r: r'${', k: key_base, n: "advancedinterpolation"),
],
children: [
leaf(
name: "slsq",
atoms: [
aplm(v: "'", n: "slsqpop"),
arnm(r: r"[^'$\\\n\r]+", n: "slsqcontent"),
],
handle: key_slsq,
),
leaf(
name: "sldq",
atoms: [
aplm(v: '"', n: "sldqpop"),
arnm(r: r'[^"$\\\n\r]+', n: "sldqcontent"),
],
handle: key_sldq,
),
leaf(
name: "mlsq",
atoms: [
aplm(v: "'''", n: "mlsqpop"),
arnm(r: r"[^'$\\]+", n: "mlsqcontent"),
arnm(r: r"('|'')", n: "mlsqq"),
],
handle: key_mlsq,
),
leaf(
name: "mldq",
atoms: [
aplm(v: '"""', n: "mldqpop"),
arnm(r: r'[^"$\\]+', n: "mldqcontent"),
arnm(r: r'("|"")', n: "mldqq"),
],
handle: key_mldq,
),
],
),
leaf(
name: "rslsq",
atoms: [
aplm(v: "'", n: "rslsqend"),
arnm(r: r"[^\n\r']+", n: "rslsqcontent"),
],
handle: key_rslsq,
),
leaf(
name: "rsldq",
atoms: [
aplm(v: '"', n: "rsldqend"),
arnm(r: r'[^\n\r"]+', n: "rsldqcontent"),
],
handle: key_rsldq,
),
leaf(
name: "rmlsq",
atoms: [
aplm(v: r"'''", n: "rmlsqend"),
arnm(r: r"(('|'')?[^'])+", n: "rmlsqcontent"),
],
handle: key_rmlsq,
),
leaf(
name: "rmldq",
atoms: [
aplm(v: '"""', n: "rmldqend"),
arnm(r: r'(("|"")?[^"])+', n: "rmldqcontent"),
],
handle: key_rmldq,
),
];
}(),
),
start: key_base,
);
Here's a visual representation that I use to visualize that specification:
dot.pdf
Here's the ANTLR specification of Dart.
ANTLR uses order to disambiguate between identifiers and keywords. The way I handle the specification, only keywords and identifiers need to be disambiguated in favor of the keyword.
I plan to translate that specification to owl just to see how it works out.
Utterly random question: how difficult do you think it would be to extend owl with indentation-sensitive parsing in the style of this paper? https://michaeldadams.org/papers/layout_parsing/LayoutParsing.pdf
I'm mainly curious since this is a seriously cool project and I've learned a fair bit from it already. Basically, I'm hoping to absorb your expertise by osmosis. ^_^
I am running PopOS (an Ubuntu derivative):
$ uname -a
Linux pop-os 5.11.0-7620-generic #21~1626191760~20.10~55de9c3-Ubuntu SMP Wed Jul 21 20:34:49 UTC x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 10.3.0-1ubuntu1~20.10) 10.3.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
When I run make
with commit db462c6, I get the following errors and warnings:
$ make
cc -O3 -std=c11 -pedantic -Wall -Wno-missing-braces -Wno-overlength-strings -o owl src/*.c -ldl
In file included from src/owl.c:9:
src/test.h:12:5: error: unknown type name ‘pid_t’
12 | pid_t child;
| ^~~~~
In file included from src/test.c:2:
src/test.h:12:5: error: unknown type name ‘pid_t’
12 | pid_t child;
| ^~~~~
src/test.c: In function ‘spawn_child’:
src/test.c:70:19: warning: implicit declaration of function ‘fdopen’; did you mean ‘fopen’? [-Wimplicit-function-declaration]
70 | t->file = fdopen(fd[1], "w");
| ^~~~~~
| fopen
src/test.c:70:17: warning: assignment to ‘FILE *’ from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
70 | t->file = fdopen(fd[1], "w");
| ^
src/test.c:58:5: warning: ignoring return value of ‘pipe’ declared with attribute ‘warn_unused_result’ [-Wunused-result]
58 | pipe(fd);
| ^~~~~~~~
make: *** [Makefile:24: owl] Error 1
See, for example, this Stack Overflow question for a discussion of the problem with the definition of pid_t. I can fix that by adding #include <sys/types.h> to src/test.h.
The issue with fdopen is that it is POSIX-only and not part of C11, so its declaration is hidden when compiling with -std=c11.
It occurs to me that a C-based custom token matcher is overkill in most cases. For example:
.token hex '0x0BADCAFE' '0x248c'
.token oct '0644' '0777771'
.token bin '11001011b' '10b'
These could be defined entirely in the language specification using regexes:
.token hex /0x[0-9A-Fa-f]+/
.token oct /0[1-7][0-7]*/
.token bin /[01]+b/
I'd suggest adopting a higher-level build system in place of the current make script, so that checks for platform features like these become easier.
I'm trying to use owl in a C++ project, so I need to include parser.h (declarations only) in main.cpp, where I call e.g. owl_tree_create_from_string(). The implementation is included from an almost empty parser.c. The C++ compiler, however, fails to parse parser.h while compiling main.cpp because of the keyword class, which occurs in unexpected places as a plain identifier.
I fixed the generated parser.h with the following sed command, which appends an underscore to every occurrence of "class":
sed -i '' 's/class/class_/g' src/parser.h
This is more than strictly necessary, because "class" also occurs inside longer words, but the fix works.
It might be reasonable to slightly change the naming in 1-parse.h to make the generated header C++-compatible. After all, owl works nicely from C++ as well as from C.
>make
cc -O3 -std=c11 -pedantic -Wall -Wno-missing-braces -Wno-overlength-strings -DNOT_UNIX -o owl src/*.c
cc: error: src/*.c: Invalid argument
cc: fatal error: no input files
compilation terminated.
make: *** [owl] Error 1
What build tools do I need to get this to work?
FYI: I just tried re-building this owl experiment
https://github.com/WarrenWeckesser/experiments/tree/master/c/gufunc-out-expr
using 93a6b7c. I get the following compiler warning (converted to an error because of -Werror
):
$ make
gcc -Wall -Werror -pedantic -std=c99 main.c -lm -o main
In file included from main.c:9:
dim-expr-parser.h: In function ‘owl_default_tokenizer_advance’:
dim-expr-parser.h:787:25: error: variable ‘string’ set but not used [-Werror=unused-but-set-variable]
787 | const char *string = text + content_offset;
| ^~~~~~
cc1: all warnings being treated as errors
make: *** [Makefile:5: main] Error 1
The following is the generated code around line 787:
else if (token == 4294967295U) {
    size_t content_offset = offset + 1;
    size_t content_length = token_length - 2;
    const char *string = text + content_offset; // <<< line 787
    size_t string_length = content_length;
    if (has_escapes) {
        for (size_t i = 0; i < content_length; ++i) {
            if (text[content_offset + i] == '\\') {
                string_length--;
                i++;
            }
        }
        char *unescaped = allocate_string_contents(string_length, tokenizer->info);
        size_t j = 0;
        for (size_t i = 0; i < content_length; ++i) {
            if (text[content_offset + i] == '\\') i++;
            unescaped[j++] = ESCAPE_CHAR(text[content_offset + i], tokenizer->info);
        }
        string = unescaped;
    }
    IGNORE_TOKEN_WRITE(offset, token_length, string, string_length, has_escapes, tokenizer->info);
}
This is not a major issue for me, since this is still just an experiment with owl, but I thought you'd like to know.