dmitrysoshnikov / syntax
Syntactic analysis toolkit, language-agnostic parser generator.
License: MIT License
Hi, I'm trying to implement an experimental parser with a custom tokenizer, in Rust, via syntax-cli.
Here's a simple grammar (partially ripped from Ruby's):
0. $accept -> program
-----------------------
1. program -> top_compstmt
2. top_compstmt -> top_stmts opt_terms
3. top_stmts -> top_stmt
4. top_stmt -> stmt
5. stmt -> expr
6. expr -> arg
7. arg -> primary
8. primary -> literal
9. literal -> numeric
10. numeric -> simple_numeric
11. simple_numeric -> tINTEGER
12. opt_terms -> terms
13. term -> tNL
14. terms -> term
┌────────────────┬───────────┐
│ Symbol │ First set │
├────────────────┼───────────┤
│ $accept │ tINTEGER │
├────────────────┼───────────┤
│ program │ tINTEGER │
├────────────────┼───────────┤
│ top_compstmt │ tINTEGER │
├────────────────┼───────────┤
│ top_stmts │ tINTEGER │
├────────────────┼───────────┤
│ top_stmt │ tINTEGER │
├────────────────┼───────────┤
│ stmt │ tINTEGER │
├────────────────┼───────────┤
│ expr │ tINTEGER │
├────────────────┼───────────┤
│ arg │ tINTEGER │
├────────────────┼───────────┤
│ primary │ tINTEGER │
├────────────────┼───────────┤
│ literal │ tINTEGER │
├────────────────┼───────────┤
│ numeric │ tINTEGER │
├────────────────┼───────────┤
│ simple_numeric │ tINTEGER │
├────────────────┼───────────┤
│ tINTEGER │ tINTEGER │
├────────────────┼───────────┤
│ opt_terms │ tNL │
├────────────────┼───────────┤
│ terms │ tNL │
├────────────────┼───────────┤
│ term │ tNL │
├────────────────┼───────────┤
│ tNL │ tNL │
└────────────────┴───────────┘
And some productions:
...
top_compstmt
: top_stmts opt_terms {
|$1: Node; $2: Token| -> Node;
$$ = Node::Dummy;
}
;
...
Would produce handlers like:
enum SV {
Undefined,
_0(Token),
_1(Node)
}
...
fn _handler2(&mut self) -> SV {
// Semantic values prologue.
let mut _1 = pop!(self.values_stack, _1);
let mut _2 = pop!(self.values_stack, _0);
let __ = Node::Dummy;
SV::_1(__)
}
...
The issue I encountered is that at the beginning of top_compstmt (aka _handler2), the values stack looks like:
[
_1(Dummy),
_0(Token { kind: 15, value: "\n", ... })
]
That seems legit to me: the first is the reduced result value for top_stmt <- ... <- tINTEGER, and the second is the result of opt_terms <- ... <- tNL.
Then the statement `let mut _1 = pop!(self.values_stack, _1);` assumes a _1(Node) is on top of the stack, while in reality the top of the stack is _0(Token), hence the issue.
So do you think this is an issue in syntax-cli or somewhere else in my implementation? Thanks.
Similar to 28880c9 for Example plugin.
A lex rule may return different token types, based on the logic of a handler. E.g.
[`\\n( *)`, `
const matchedIndent = yyleng - 1;
if (matchedIndent === currentIndent) {
return 'NL';
}
return 'INDENT';
`]
We also need to support returning several tokens from a rule handler:
[`\\n( *)`, `
const matchedIndent = yyleng - 1;
if (matchedIndent === currentIndent) {
return 'NL';
}
return ['INDENT', 'NL'];
`]
This allows structuring grammar rules with "fake" tokens, making grammar rules simpler.
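One way to implement the multi-token return is a small pending-token queue inside the tokenizer. A minimal sketch, assuming a simplified {type, value} token shape rather than Syntax's actual internals:

```javascript
// Sketch (hypothetical tokenizer internals): when a rule handler returns an
// array of token types, emit the first token and queue the rest.
class TokenQueue {
  constructor() {
    this.pending = [];
  }

  // `handlerResult` is whatever the lex-rule handler returned:
  // a single type like 'NL', or an array like ['INDENT', 'NL'].
  emit(handlerResult, yytext) {
    const types = Array.isArray(handlerResult) ? handlerResult : [handlerResult];
    const tokens = types.map(type => ({type, value: yytext}));
    this.pending.push(...tokens.slice(1));
    return tokens[0];
  }

  // Drained before matching any new input against the lex rules.
  next() {
    return this.pending.shift() || null;
  }
}
```

getNextToken() would then check the queue first, and only fall back to regex matching when the queue is empty.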
Currently one has to explicitly define an optional rule as:
Foo
: Bar OptBaz
;
OptBaz
: Baz
| /* empty */
;
This makes it inconvenient to define a bunch of Opt<Symbol> rules. Instead, we can support EBNF notation for optional rules:
Foo
: Bar Baz?
;
This can be done as a grammar transform, by inserting the optional rule (basically transforming to the former version internally).
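As a sketch of that transform, assuming rules are plain {lhs, rhs} records with RHS symbol arrays (not Syntax's real internal representation): each X? occurrence is replaced by a generated X_opt nonterminal with an epsilon alternative.

```javascript
// Rewrite `Baz?` on a RHS into a generated `Baz_opt` nonterminal.
// The rule representation here is an assumption for illustration.
function expandOptionals(rules) {
  const generated = {};
  const out = rules.map(({lhs, rhs}) => ({
    lhs,
    rhs: rhs.map(sym => {
      if (!sym.endsWith('?')) return sym;
      const base = sym.slice(0, -1);
      const opt = `${base}_opt`;
      // Generate: X_opt : X | /* empty */ ;
      generated[opt] = [{lhs: opt, rhs: [base]}, {lhs: opt, rhs: []}];
      return opt;
    }),
  }));
  return out.concat(...Object.values(generated));
}
```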
PHP reports an unexpected error instead of raising an exception when encountering unbalanced parentheses.
This occurs because the PHP template is attempting to reference a global function instead of a local, static function.
See lr.template.php:165.
Some languages like PHP do not adhere to the POSIX grammar. Since lex-rule.js instantiates a field using JavaScript's RegExp, it implicitly expects the expression to conform to the POSIX grammar, and will raise an "Invalid regular expression" exception if it does not.
Example:
PCRE: (?<=[:space:])[Aa][Nn][Dd](?=[:space:])
POSIX: (?:[:space:])[Aa][Nn][Dd](?=[:space:])
Notes:
Currently for productions like:
e : e '+' e
3 instances of the e symbol are created. We can track a global registry of symbols, and avoid allocating a new symbol if one has already been created.
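A minimal sketch of such a registry, interning symbols by name (the GrammarSymbol shape here is hypothetical, not Syntax's actual class):

```javascript
// Global symbol registry: intern symbols by name so that the three
// occurrences of `e` in `e : e '+' e` share a single instance.
const symbolRegistry = new Map();

class GrammarSymbol {
  constructor(name) {
    this.name = name;
  }
}

function getSymbol(name) {
  if (!symbolRegistry.has(name)) {
    symbolRegistry.set(name, new GrammarSymbol(name));
  }
  return symbolRegistry.get(name);
}
```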
This could be useful for integration with separate, e.g. hand-written, lexers, which is exactly my case now.
Currently code is not statically type-checked. We can use Flow to provide static type safety.
Rules in yacc/bison may contain embedded actions, which are useful when constructing complex grammars.
Do you think we could have this feature in Syntax?
Reference: http://dinosaur.compilertools.net/bison/bison_6.html#SEC48
The templates contain insertion points delimited by '<<' and '>>'. I have not delved into the code to see how this is handled but '<<' and '>>' represent reserved operators in several C-based languages (e.g., C, C++, C# and Java use '<<' to represent a left-shift). This could make it awkward to write a proper template.
Consider using something that is not already used by any language or framework. Most double-character tokens are already used. For example:
{{ : Mustache
[[ : PHP, JavaScript, etc.
@@ : I cannot remember
I recommend using {{{ since I do not know of any language or framework using this particular delimiter.
PHP does not handle Unicode very well, so I found it necessary to preprocess the supplied text by folding the Unicode characters into the ASCII character set using iconv('UTF-8', 'ASCII//TRANSLIT', $string).
Instead of passing a string to the parser, I would prefer passing a character stream or specifying a translation function so that I do not have to modify the generated code.
When we get "Unexpected token" error from the tokenizer or parser, we currently just show:
Unexpected token: "..." at <line>:<column>
In addition it'd be good to show an actual line from the original source, and a pointer to the invalid token:
foo + &7;
^
Unexpected token: "&" at <line>:<column>
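A sketch of how such a message could be assembled, assuming a 1-based line number and a 0-based column (a common tokenizer convention, not necessarily Syntax's):

```javascript
// Format an "Unexpected token" message with the offending source line
// and a caret pointing at the bad token's column.
function formatTokenError(source, token, line, column) {
  const sourceLine = source.split('\n')[line - 1];
  const pointer = ' '.repeat(column) + '^';
  return [
    sourceLine,
    pointer,
    `Unexpected token: "${token}" at ${line}:${column}`,
  ].join('\n');
}
```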
Currently there is no way to refer to a specific matched group in a lex rule handler. We need to support them as variables: $1, $2, etc. Example:
[`\\n( *)`, `yytext = $1; return 'INDENT'`]
We should target the latest Node version, which already implements most of ES6/ES7, and use only the needed Babel plugins (if any) instead of the full ES2015 preset.
Currently, to check "shift-reduce" and other conflicts, one needs to pass either the --collection or the --table option, which actually builds and outputs the parsing table.
We need to add a --validate option which will just show:
✓ Grammar is valid
Or
✕ Grammar has the following errors:
<list-of-errors>
Currently we have one semantic action per whole production, which is called the onReduce action. For better parser-lexer communication we could build a semantic action per each symbol on the RHS ("before" and "after" hooks). Alternatively, we can provide a generic onShift hook which receives a production, symbol, lexer/parser instance, etc., and performs some action (e.g. sets a lexer state).
function onBeforeShift(grammarSymbol, production, lexer, parser, ...) {
// do actions on shift
}
function onShift(...) { ... }
Object : "{" Props "}" { $$ = $1 + $2 }
Here onBeforeShift and onShift are called for { and }.
To support Tokenizer-only code generation in #7, it's good to have the lexical and syntactic grammars as two separate classes (currently both are mixed into one Grammar class; we can introduce LexGrammar for the lexical one).
We should support start/end locations for tokens and for AST nodes, and also a loc structure with line and column numbers.
These can be either properties sitting on the handler arguments themselves, e.g.:
$1.start; $1.loc.start.line;
Or an extra "location" structure (a "global" loc variable) passed as the last parameter to a handler function:
loc.$1.start.line;
For the example code:
foo;
A resulting AST node can have a structure:
{
type: "Identifier",
name: "foo",
start: 0,
end: 3,
loc: {
start: {
line: 1,
column: 0
},
end: {
line: 1,
column: 3
}
}
}
Some tools may need to analyze the original source code corresponding to AST nodes (combined with node locations, this makes it possible to get the substring of a node in the source code). We need to expose the source code on the parser, the tokenizer, and in the handlers. This can be just a global (or scoped) yysource variable.
Program : StatementList
{ $$ = {
type: 'Program',
body: $StatementList,
source: yysource,
}
}
Currently handlers use number-notation for production variables. E.g.:
add
: mult { $$ = $1 }
| mult + add { $$ = $1 + $3 }
;
We also need to support variable names corresponding to non-terminals:
add
: mult { $$ = $mult }
| mult + add { $$ = $mult + $add }
;
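One way to implement this is a pre-processing pass over the action code that rewrites $name to the positional form based on the production's RHS. A sketch over simplified inputs (first occurrence wins; productions that repeat a symbol would still need positional disambiguation):

```javascript
// Resolve `$mult`, `$add`, etc. in a handler body to positional `$1`, `$3`
// by scanning the production's RHS symbols. `$$` is left untouched since
// the regex requires an identifier character after `$`.
function namedToPositional(rhs, action) {
  return action.replace(/\$([A-Za-z_]\w*)/g, (match, name) => {
    const index = rhs.indexOf(name);
    return index === -1 ? match : `$${index + 1}`;
  });
}
```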
We should expose an onToken event, so that clients can subscribe and route parsing events based on the token.
yy.tokenizer.onToken = token => {
if (token.type === 'L_PAREN') {
...
}
};
Hello,
I have just cloned the repo and run npm install and npm run build, both of which executed successfully. However, when I run the utility with ./bin/syntax, it gives me the following output:
[syntax-master]$ ./bin/syntax --help
syntax-master/bin/syntax:196
const [reducePart, shiftPart] = table.splitSRParts(conflict);
^
SyntaxError: Unexpected token [
at exports.runInThisContext (vm.js:53:16)
at Module._compile (module.js:374:25)
at Object.Module._extensions..js (module.js:405:10)
at Module.load (module.js:344:32)
at Function.Module._load (module.js:301:12)
at Function.Module.runMain (module.js:430:10)
at startup (node.js:141:18)
at node.js:980:3
I am running Node v5.3.0 on Fedora 24 x64.
Similar to 28880c9 for Example plugin.
While studying your algorithm for computing the First set (source), I wondered whether it is correct.
Consider a grammar like this:
S -> A | x
A -> S | y
From my understanding of your algorithm, you initially set _firstSets[S] to the empty set. Thus, when doing the recursion on A, the First set of A is computed to be the union of { } (as _firstSets[S] is still empty) and { y }.
However, the First set of A is { x, y }.
What am I missing?
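For comparison, the standard way to handle such mutual recursion is a fixpoint iteration: keep re-scanning all productions until no First set grows, rather than trusting a possibly-incomplete intermediate set during recursion. A simplified sketch (no epsilon handling; only the first RHS symbol is considered):

```javascript
// Fixpoint computation of First sets. Handles mutual recursion like
// S -> A | x, A -> S | y by iterating until nothing changes.
function computeFirstSets(rules, terminals) {
  const first = {};
  for (const {lhs} of rules) first[lhs] = new Set();

  let changed = true;
  while (changed) {
    changed = false;
    for (const {lhs, rhs} of rules) {
      const sym = rhs[0]; // simplified: no epsilon productions
      const add = terminals.has(sym) ? [sym] : [...(first[sym] || [])];
      for (const t of add) {
        if (!first[lhs].has(t)) {
          first[lhs].add(t);
          changed = true;
        }
      }
    }
  }
  return first;
}
```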
We need to support a case-insensitive option in the lexical grammar declaration:
"lex": {
"rules": [
[`\\band\\b`, `return 'AND'`],
],
"options": {
"case-insensitive": true,
},
}
In addition to a global option for all rules, we can support this flag per rule.
Disclaimer: I am not a PHP developer but I am writing a translator for our PHP team that converts a Sphinx search expression into an ElasticSearch search expression as part of their migration from Sphinx to ElasticSearch.
The PHP team has provided a little over 1000 unit tests. Some are positive tests and some are negative tests. An example negative test is mismatched parentheses. What we would like to do is instruct PHPUnit to expect an exception in this case; however, PHPUnit does not allow expecting \Exception, since the authors claim that it leads to unintended consequences. PHPUnit wants you to use something more concrete, like SyntaxException or SemanticException.
It could be that I am misunderstanding what PHPUnit is requesting so I am open to correction. However, if I understand this correctly please use a more specific exception type.
Currently, building the canonical collection and compressing from CLR(1) to LALR(1) are the longest operations, and for some grammars can be very slow (up to several minutes). We should investigate better compression algorithms here.
time syntax-cli -g ~/grammar.g -d -m lalr1 -o ~/test.py
DEBUG mode is: ON
[DEBUG] Grammar (bnf) is in JS format
[DEBUG] Grammar loaded in: 1.081ms
Parsing mode: LALR(1).
[DEBUG] Building canonical collection: 46531.131ms
[DEBUG] Number of states in the collection: 5311
[DEBUG] Compressing CLR to LALR: 15057.025ms
[DEBUG] Number of states after compression: 299
[DEBUG] Building LR parsing table: 12.715ms
[DEBUG] Generating parser module: 24.013ms
✓ Successfully generated: /Users/dmitrys/test.py
real 1m2.014s
user 1m1.273s
sys 0m1.768s
Currently a lex rule handler returns only a token type (potentially modifying yytext), e.g.:
[`[0-9]+`, `yytext = yytext.slice(1); return 'NUMBER';`]
In this case "100" would result in:
{type: 'NUMBER', value: "00"}
We also need to support returning a full token (type-value pair) from a handler:
[`[0-9]+`, `return {type: 'NUMBER', value: 'Override!'};`]
A use-case: matching indentation could return the number of spaces instead of the spaces themselves:
[`\\n( *)`, `return {type: 'INDENT', value: $1.length};`]
Currently the npm scripts work only on Unix/Mac. While developers could use Cygwin to work with them, it would be great to support Windows out of the box.
We can introduce a build system (gulp
or similar) for this. Or just rewrite the scripts using Node's modules for file operations, instead of raw Bash scripts.
Currently the parser fails on the first error found. To be more practical, we should consider building error recovery mechanisms and restoring parsing after finding an error. We can start with the simplest panic-mode recovery, allowing several errors to be caught in one pass:
1 +& 2; 5 #* 3; 5 + 4;
This should issue two errors (currently only the first one is issued):
Unexpected token "&" at 1:3
Unexpected token "#" at 1:10
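A sketch of the panic-mode idea, assuming a pre-tokenized input and ';' as the synchronization token (both are assumptions for illustration, not Syntax's actual recovery design):

```javascript
// Panic-mode recovery: on an unexpected token, report it, then skip
// ahead to the next synchronization token (';') and keep going,
// collecting all errors in a single pass.
function collectErrors(tokens, isExpected) {
  const errors = [];
  for (let i = 0; i < tokens.length; i++) {
    const token = tokens[i];
    if (!isExpected(token)) {
      errors.push(`Unexpected token "${token.value}" at ${token.line}:${token.column}`);
      // Panic: discard input up to the synchronization point.
      while (i < tokens.length && tokens[i].value !== ';') i++;
    }
  }
  return errors;
}
```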
Record all debug and timing information with the --debug (-d) option.
Calculating an LALR(1) table by compressing from CLR(1) for a big grammar can take some time. It'd be good to log each step, making the calculation more interactive. Potentially this could also be an actual progress bar.
Similar to 28880c9 for Example plugin.
Currently the tokenizer automatically adds ^ at the beginning of all regexes. Some regexes, e.g. lookbehind assertions, can use it explicitly:
'/(?<=^[[:space:]])A(?=[[:space:]])/'
In this case we don't need to prepend the ^.
Note: this doesn't work:
'/^(?<=[[:space:]])A(?=[[:space:]])/'
We should provide tooling for automatic left-recursion elimination (in cases where it's possible).
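For the immediate-recursion case this is the standard textbook transform: A -> A α | β becomes A -> β A', A' -> α A' | ε. A sketch over a simplified rule representation (alternatives as arrays of symbols, not Syntax's internal format):

```javascript
// Eliminate immediate left recursion for a single nonterminal.
// E -> E '+' T | T   becomes   E -> T E' ;  E' -> '+' T E' | /* empty */ ;
function eliminateImmediateLeftRecursion(lhs, alternatives) {
  const recursive = alternatives.filter(rhs => rhs[0] === lhs);
  const base = alternatives.filter(rhs => rhs[0] !== lhs);
  if (recursive.length === 0) return {[lhs]: alternatives};
  const prime = `${lhs}'`;
  return {
    [lhs]: base.map(rhs => [...rhs, prime]),
    // Tail rules, plus the epsilon alternative.
    [prime]: [...recursive.map(rhs => [...rhs.slice(1), prime]), []],
  };
}
```

Indirect left recursion needs the full ordering-based algorithm on top of this, and semantic actions complicate the rewrite, hence "in cases where it's possible".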
Lexical grammar can already be specified as a separate file; however, at code generation a tokenizer is always embedded into the generated parser file. We need to support generating a module for a standalone tokenizer (so it can be required from any parser).
./bin/syntax --lex ~/lexer.l --tokenizer-only --output ~/Tokenizer.js
Which can be required later in any parser:
const Tokenizer = require('Tokenizer');
const lexer = new Tokenizer('a b c');
console.log(lexer.getNextToken()); // {type: '...', value: 'a'}
console.log(lexer.getNextToken()); // {type: '...', value: 'b'}
...
Should support API from the custom tokenizer section:
// initString is supported to reuse the same tokenizer instance
lexer.initString('x y z');
console.log(lexer.getNextToken()); // {type: '...', value: 'x'}
...
We can use standard template for the tokenizer at code generation.
It looks like currently the tool just generates table-driven parsers for various languages, which is often suboptimal.
Recursive ascent parsers are fully compatible with (LA)LR grammars, but they instead use native language features (functions, switches, etc.) so that they can be statically analysed and, as a result, much better optimised by the target compiler or JIT.
It would be great if syntax could generate such parsers for various languages.
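To illustrate the idea, here is a tiny hand-written sketch for the grammar E -> E '+' 'n' | 'n', with the LR(0) states derived by hand. In the optimized recursive-ascent form, the reduce-only states inline away and the goto-on-E loop in the start state becomes plain iteration, which is exactly what makes this style friendly to compilers and JITs:

```javascript
// Hand-written sketch for E -> E '+' 'n' | 'n'. Each LR(0) state would be a
// function in generated output; here states 2 and 4 (which reduce
// immediately) are inlined, and the left recursion shows up as a loop.
function parse(tokens) {
  let pos = 0;
  const peek = () => (pos < tokens.length ? tokens[pos] : '$');

  // state 0: expect the first 'n' (state 2 reduces E -> 'n' immediately)
  if (peek() !== 'n') throw new SyntaxError(`Unexpected token "${peek()}"`);
  pos++;
  let value = 1; // semantic value: number of 'n' terminals seen

  // state 1: after an E; shift '+' and 'n', reduce E -> E '+' 'n' (state 4)
  while (peek() === '+') {
    pos++;
    if (peek() !== 'n') throw new SyntaxError(`Unexpected token "${peek()}"`);
    pos++;
    value += 1;
  }

  if (peek() !== '$') throw new SyntaxError(`Unexpected token "${peek()}"`);
  return value; // accept
}
```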
Currently it's possible to use raw (wrapped in quotes) literal tokens in BNF.
%%
E : N;
N : D N | /* empty */ ;
D : "1" | "2" | "3" ... ;
We should also allow embedding regexp rules from lexical grammar directly to the BNF.
%%
E : N;
N : D N | /* empty */ ;
D : [0-9]+ ;
While writing a translator that converts from Sphinx's expression syntax to ElasticSearch's expression syntax, I found that I needed to pass the document's field as a scope parameter. For example, I needed to instruct ElasticSearch to limit its search to just the account_number field.
{
"query": { "match": { "account_number": 20 } }
}
Other times, I needed to search the entire document.
{
"query": { "match": { "_all": 20 } }
}
As a client, I need the parser to accept additional arguments and make them available in the semantic actions. For example:
{"query":{"match":{$context:$1}}}
Hi!
The generated handler for an empty production contains the line self.values_stack.pop();. I guess that's not the expected behavior. I'd be glad to put together a repro case if you need one. Thanks!
Currently "global" variables like yytext, yyleng, $$, etc. are implemented either as module-level variables (in JS) or as static class variables (in some plugins: C#, PHP, etc.).
This makes parsing several files in parallel (e.g. in threads) impossible, since they mutate the same "global" data. We should make them instance variables, and always export a class for the Tokenizer and the Parser (instead of a singleton instance which just accesses the global or static class variables).
This will require rewriting global code in handlers like:
yytext = yytext.slice(1, -1);
to an instance property access. Currently we already rewrite it to static prop access in some plugins:
Yyparse.yytext = Yyparse.yytext.slice(1, -1);
Eventually it should be an instance prop (specific to each plugin) instead of a static one:
this.yytext = this.yytext.slice(1, -1);
I have the following BNF rules, and I have the functions disjunction and conjunction declared in the moduleInclude section.
["E OR E", "$$ = disjunction($1, $3)"],
["E AND E", "$$ = conjunction($1, $3)"],
This places the functions in the global namespace. Is there a way to add these functions to the generated parser, or can you recommend an idiom that does not pollute the global namespace? In other words, given two parsers that both declare the functions disjunction and conjunction, how would I isolate the two parsers?
I know it's an odd issue but it took me 30 minutes to find this project! Finally recovered it from my GitHub stars.
P.S. Trying to write a brand new YAML parser with this. Do you have any recommendations?
Currently lookbehind assertions do not work, since ^ is inserted at the beginning of the whole regexp (the tokenizer applies all regexes at the beginning of the rest of the string).
This should match foo ("foo" surrounded by spaces):
(?<= )foo(?= )
This is incorrectly transformed to:
^(?<= )foo(?= )
While it should be:
(?<=^ )foo(?= )
We need to support an LR parser generator for Java. See the Example plugin, or the C# plugin, which is closer to Java.
We should provide an ability to access lexer (e.g. to change its state) from semantic actions of productions:
e: e + e { yy.lexer.pushState('condition'); ... }