dmitrysoshnikov / syntax
Syntactic analysis toolkit, language-agnostic parser generator.
License: MIT License
Hi, I'm trying to implement an experimental parser with a custom tokenizer, in Rust, via syntax-cli.
Here's a simple grammar (partially ripped from Ruby's):
0. $accept -> program
-----------------------
1. program -> top_compstmt
2. top_compstmt -> top_stmts opt_terms
3. top_stmts -> top_stmt
4. top_stmt -> stmt
5. stmt -> expr
6. expr -> arg
7. arg -> primary
8. primary -> literal
9. literal -> numeric
10. numeric -> simple_numeric
11. simple_numeric -> tINTEGER
12. opt_terms -> terms
13. term -> tNL
14. terms -> term
┌────────────────┬───────────┐
│ Symbol │ First set │
├────────────────┼───────────┤
│ $accept │ tINTEGER │
├────────────────┼───────────┤
│ program │ tINTEGER │
├────────────────┼───────────┤
│ top_compstmt │ tINTEGER │
├────────────────┼───────────┤
│ top_stmts │ tINTEGER │
├────────────────┼───────────┤
│ top_stmt │ tINTEGER │
├────────────────┼───────────┤
│ stmt │ tINTEGER │
├────────────────┼───────────┤
│ expr │ tINTEGER │
├────────────────┼───────────┤
│ arg │ tINTEGER │
├────────────────┼───────────┤
│ primary │ tINTEGER │
├────────────────┼───────────┤
│ literal │ tINTEGER │
├────────────────┼───────────┤
│ numeric │ tINTEGER │
├────────────────┼───────────┤
│ simple_numeric │ tINTEGER │
├────────────────┼───────────┤
│ tINTEGER │ tINTEGER │
├────────────────┼───────────┤
│ opt_terms │ tNL │
├────────────────┼───────────┤
│ terms │ tNL │
├────────────────┼───────────┤
│ term │ tNL │
├────────────────┼───────────┤
│ tNL │ tNL │
└────────────────┴───────────┘
And some productions:
...
top_compstmt
: top_stmts opt_terms {
|$1: Node; $2: Token| -> Node;
$$ = Node::Dummy;
}
;
...
Would produce handlers like:
enum SV {
Undefined,
_0(Token),
_1(Node)
}
...
fn _handler2(&mut self) -> SV {
// Semantic values prologue.
let mut _1 = pop!(self.values_stack, _1);
let mut _2 = pop!(self.values_stack, _0);
let __ = Node::Dummy;
SV::_1(__)
}
...
The issue I encountered is that at the beginning of top_compstmt (aka _handler2), the values stack looks like:
[
_1(Dummy),
_0(Token { kind: 15, value: "\n", ... })
]
That seems legit to me: the first is the reduced result value for top_stmt <- ... <- tINTEGER, and the second is the result of opt_terms <- ... <- tNL.
Then the statement `let mut _1 = pop!(self.values_stack, _1);` assumes a _1(Node) is on top of the stack, while in reality the top of the stack is _0(Token), hence the issue.
So do you think this is an issue in syntax-cli or somewhere else in my implementation? Thanks.
Similar to 28880c9 for Example plugin.
A lex rule may return different token types, based on the logic of a handler. E.g.
[`\\n( *)`, `
const matchedIndent = yyleng - 1;
if (matchedIndent === currentIndent) {
return 'NL';
}
return 'INDENT';
`]
We also need to support returning several tokens from a rule handler:
[`\\n( *)`, `
const matchedIndent = yyleng - 1;
if (matchedIndent === currentIndent) {
return 'NL';
}
return ['INDENT', 'NL'];
`]
This allows structuring grammar rules with "fake" tokens, making grammar rules simpler.
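One way to implement the multi-token return is a small pending-token queue inside the tokenizer. A minimal sketch, assuming a simplified {type, value} token shape rather than Syntax's actual internals:

```javascript
// Sketch (hypothetical tokenizer internals): when a rule handler returns an
// array of token types, emit the first token and queue the rest.
class TokenQueue {
  constructor() {
    this.pending = [];
  }

  // `handlerResult` is whatever the lex-rule handler returned:
  // a single type like 'NL', or an array like ['INDENT', 'NL'].
  emit(handlerResult, yytext) {
    const types = Array.isArray(handlerResult) ? handlerResult : [handlerResult];
    const tokens = types.map(type => ({type, value: yytext}));
    this.pending.push(...tokens.slice(1));
    return tokens[0];
  }

  // Drained before matching any new input against the lex rules.
  next() {
    return this.pending.shift() || null;
  }
}
```

getNextToken() would then check the queue first, and only fall back to regex matching when the queue is empty.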
Currently one has to explicitly define an optional rule as:
Foo
: Bar OptBaz
;
OptBaz
: Baz
| /* empty */
;
This makes it inconvenient to define a bunch of Opt<Symbol> rules. Instead, we can support EBNF notation for optional rules:
Foo
: Bar Baz?
;
This can be done as a grammar transform, by inserting the optional rule (basically transforming to the former version internally).
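As a sketch of that transform, assuming rules are plain {lhs, rhs} records with RHS symbol arrays (not Syntax's real internal representation): each X? occurrence is replaced by a generated X_opt nonterminal with an epsilon alternative.

```javascript
// Rewrite `Baz?` on a RHS into a generated `Baz_opt` nonterminal.
// The rule representation here is an assumption for illustration.
function expandOptionals(rules) {
  const generated = {};
  const out = rules.map(({lhs, rhs}) => ({
    lhs,
    rhs: rhs.map(sym => {
      if (!sym.endsWith('?')) return sym;
      const base = sym.slice(0, -1);
      const opt = `${base}_opt`;
      // Generate: X_opt : X | /* empty */ ;
      generated[opt] = [{lhs: opt, rhs: [base]}, {lhs: opt, rhs: []}];
      return opt;
    }),
  }));
  return out.concat(...Object.values(generated));
}
```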
PHP reports an unexpected error instead of raising an exception when encountering unbalanced parentheses.
This occurs because the PHP template is attempting to reference a global function instead of a local, static function.
See lr.template.php:165.
Some languages like PHP do not adhere to the POSIX grammar. Since lex-rule.js instantiates a field using JavaScript's RegExp, it implicitly expects the expression to conform to the POSIX grammar, and will raise an "Invalid regular expression" exception if it does not.
Example:
PCRE: (?<=[:space:])[Aa][Nn][Dd](?=[:space:])
POSIX: (?:[:space:])[Aa][Nn][Dd](?=[:space:])
Notes:
Currently for productions like:
e : e '+' e
3 instances of the e symbol are created. We can track a global registry of symbols, and avoid allocating a new symbol if one has already been created.
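A minimal sketch of such a registry, interning symbols by name (the GrammarSymbol shape here is hypothetical, not Syntax's actual class):

```javascript
// Global symbol registry: intern symbols by name so that the three
// occurrences of `e` in `e : e '+' e` share a single instance.
const symbolRegistry = new Map();

class GrammarSymbol {
  constructor(name) {
    this.name = name;
  }
}

function getSymbol(name) {
  if (!symbolRegistry.has(name)) {
    symbolRegistry.set(name, new GrammarSymbol(name));
  }
  return symbolRegistry.get(name);
}
```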
This could be useful for integration with separate, e.g. hand-written, lexers, which is exactly my case now.
Currently code is not statically type-checked. We can use Flow to provide static type safety.
Rules in yacc/bison may contain embedded actions, which are useful when constructing complex grammars.
Do you think we could have this feature in Syntax?
Reference: http://dinosaur.compilertools.net/bison/bison_6.html#SEC48
The templates contain insertion points delimited by '<<' and '>>'. I have not delved into the code to see how this is handled but '<<' and '>>' represent reserved operators in several C-based languages (e.g., C, C++, C# and Java use '<<' to represent a left-shift). This could make it awkward to write a proper template.
Consider using something that is not already used by any language or framework. Most double-character tokens are already used. For example:
{{ : Mustache
[[ : PHP, JavaScript, etc.
@@ : I cannot remember
I recommend using {{{ since I do not know of any language or framework using this particular delimiter.
PHP does not handle Unicode very well, so I found it necessary to preprocess the supplied text by folding the Unicode characters into the ASCII character set using iconv('UTF-8', 'ASCII//TRANSLIT', $string).
Instead of passing a string to the parser, I would prefer passing a character stream or specifying a translation function so that I do not have to modify the generated code.
When we get "Unexpected token" error from the tokenizer or parser, we currently just show:
Unexpected token: "..." at <line>:<column>
In addition it'd be good to show an actual line from the original source, and a pointer to the invalid token:
foo + &7;
^
Unexpected token: "&" at <line>:<column>
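A sketch of how such a message could be assembled, assuming a 1-based line number and a 0-based column (a common tokenizer convention, not necessarily Syntax's):

```javascript
// Format an "Unexpected token" message with the offending source line
// and a caret pointing at the bad token's column.
function formatTokenError(source, token, line, column) {
  const sourceLine = source.split('\n')[line - 1];
  const pointer = ' '.repeat(column) + '^';
  return [
    sourceLine,
    pointer,
    `Unexpected token: "${token}" at ${line}:${column}`,
  ].join('\n');
}
```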
Currently there is no way to refer to a specific matched group in a lex rule handler. We need to support them as variables: $1, $2, etc. Example:
[`\\n( *)`, `yytext = $1; return 'INDENT'`]
We should target the latest Node version, which already implements most of ES6/ES7, and use only the needed Babel plugins (if any) instead of the full ES2015 preset.
Currently, to check "shift-reduce" and other conflicts, one needs to pass either the --collection or the --table option, which actually builds and outputs the parsing table.
We need to add a --validate option which will just show:
✓ Grammar is valid
Or
✕ Grammar has the following errors:
<list-of-errors>
Currently we have one semantic action per whole production, which is called the onReduce action. For better parser-lexer communication we could build a semantic action per each symbol on the RHS ("before" and "after" hooks). Alternatively, we can provide a generic onShift hook which receives a production, symbol, lexer/parser instance, etc., and performs some action (e.g. sets a lexer state).
function onBeforeShift(grammarSymbol, production, lexer, parser, ...) {
// do actions on shift
}
function onShift(...) { ... }
Object : "{" Props "}" { $$ = $1 + $2 }
Here onBeforeShift and onShift are called for { and }.
To support Tokenizer-only code generation in #7, it's good to have the lexical and syntactic grammars as two separate classes (currently both are mixed into one Grammar class; we can introduce LexGrammar for the lexical one).
We should support start/end locations for tokens and for AST nodes, and also a loc structure with line and column numbers.
These can be either properties sitting on the handler arguments themselves, e.g.:
$1.start; $1.loc.start.line;
Or an extra "location" structure (a "global" loc variable) passed as the last parameter to a handler function:
loc.$1.start.line;
For the example code:
foo;
A resulting AST node can have a structure:
{
type: "Identifier",
name: "foo",
start: 0,
end: 3,
loc: {
start: {
line: 1,
column: 0
},
end: {
line: 1,
column: 3
}
}
}
Some tools may need to analyze the original source code corresponding to AST nodes (combined with node locations, this makes it possible to get the substring of a node in the source code). We need to expose the source code on the parser, the tokenizer, and in the handlers. This can be just a global (or scoped) yysource variable.
Program : StatementList
{ $$ = {
type: 'Program',
body: $StatementList,
source: yysource,
}
}
Currently handlers use number-notation for production variables. E.g.:
add
: mult { $$ = $1 }
| mult + add { $$ = $1 + $3 }
;
We also need to support variable names corresponding to non-terminals:
add
: mult { $$ = $mult }
| mult + add { $$ = $mult + $add }
;
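One way to implement this is a pre-processing pass over the action code that rewrites $name to the positional form based on the production's RHS. A sketch over simplified inputs (first occurrence wins; productions that repeat a symbol would still need positional disambiguation):

```javascript
// Resolve `$mult`, `$add`, etc. in a handler body to positional `$1`, `$3`
// by scanning the production's RHS symbols. `$$` is left untouched since
// the regex requires an identifier character after `$`.
function namedToPositional(rhs, action) {
  return action.replace(/\$([A-Za-z_]\w*)/g, (match, name) => {
    const index = rhs.indexOf(name);
    return index === -1 ? match : `$${index + 1}`;
  });
}
```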
We should expose an onToken event, so that clients can subscribe and route parsing events based on the token.
yy.tokenizer.onToken = token => {
if (token.type === 'L_PAREN') {
...
}
};
Hello,
I have just cloned the repo and run npm install and npm run build, both of which executed successfully. However, when I run the utility with ./bin/syntax, it gives me the following output:
[syntax-master]$ ./bin/syntax --help
syntax-master/bin/syntax:196
const [reducePart, shiftPart] = table.splitSRParts(conflict);
^
SyntaxError: Unexpected token [
at exports.runInThisContext (vm.js:53:16)
at Module._compile (module.js:374:25)
at Object.Module._extensions..js (module.js:405:10)
at Module.load (module.js:344:32)
at Function.Module._load (module.js:301:12)
at Function.Module.runMain (module.js:430:10)
at startup (node.js:141:18)
at node.js:980:3
I am running Node v5.3.0 on Fedora 24 x64.
Similar to 28880c9 for Example plugin.
While studying your algorithm for computing the First set (source), I wondered whether it is correct.
Consider a grammar like this:
S -> A | x
A -> S | y
From my understanding of your algorithm, you initially set _firstSets[S] to the empty set. Thus, when doing the recursion on A, the First set of A is computed to be the union of { } (as _firstSets[S] is still empty) and { y }.
However, the First set of A is { x, y }.
What am I missing?
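For comparison, the standard way to handle such mutual recursion is a fixpoint iteration: keep re-scanning all productions until no First set grows, rather than trusting a possibly-incomplete intermediate set during recursion. A simplified sketch (no epsilon handling; only the first RHS symbol is considered):

```javascript
// Fixpoint computation of First sets. Handles mutual recursion like
// S -> A | x, A -> S | y by iterating until nothing changes.
function computeFirstSets(rules, terminals) {
  const first = {};
  for (const {lhs} of rules) first[lhs] = new Set();

  let changed = true;
  while (changed) {
    changed = false;
    for (const {lhs, rhs} of rules) {
      const sym = rhs[0]; // simplified: no epsilon productions
      const add = terminals.has(sym) ? [sym] : [...(first[sym] || [])];
      for (const t of add) {
        if (!first[lhs].has(t)) {
          first[lhs].add(t);
          changed = true;
        }
      }
    }
  }
  return first;
}
```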
We need to support a case-insensitive option in the lexical grammar declaration:
"lex": {
"rules": [
[`\\band\\b`, `return 'AND'`],
],
"options": {
"case-insensitive": true,
},
}
In addition to a global option for all rules, we can support this flag per rule.
Disclaimer: I am not a PHP developer but I am writing a translator for our PHP team that converts a Sphinx search expression into an ElasticSearch search expression as part of their migration from Sphinx to ElasticSearch.
The PHP team has provided a little over 1000 unit tests. Some are positive tests and some are negative tests. An example negative test is mismatched parentheses. What we would like to do is instruct PHPUnit to expect an exception in this case; however, PHPUnit does not allow expecting \Exception, since the authors claim that it leads to unintended consequences. PHPUnit wants you to use something more concrete, like SyntaxException or SemanticException.
It could be that I am misunderstanding what PHPUnit is requesting so I am open to correction. However, if I understand this correctly please use a more specific exception type.
Currently, building the canonical collection and compressing from CLR(1) to LALR(1) are the longest operations, and for some grammars can be very slow (up to several minutes). We should investigate better compression algorithms here.
time syntax-cli -g ~/grammar.g -d -m lalr1 -o ~/test.py
DEBUG mode is: ON
[DEBUG] Grammar (bnf) is in JS format
[DEBUG] Grammar loaded in: 1.081ms
Parsing mode: LALR(1).
[DEBUG] Building canonical collection: 46531.131ms
[DEBUG] Number of states in the collection: 5311
[DEBUG] Compressing CLR to LALR: 15057.025ms
[DEBUG] Number of states after compression: 299
[DEBUG] Building LR parsing table: 12.715ms
[DEBUG] Generating parser module: 24.013ms
✓ Successfully generated: /Users/dmitrys/test.py
real 1m2.014s
user 1m1.273s
sys 0m1.768s
Currently a lex rule handler returns only a token type (potentially modifying yytext), e.g.:
[`[0-9]+`, `yytext = yytext.slice(1); return 'NUMBER';`]
In this case "100" would result in:
{type: 'NUMBER', value: "00"}
We also need to support returning a full token (type-value pair) from a handler:
[`[0-9]+`, `return {type: 'NUMBER', value: 'Override!'};`]
A use-case: matching indentation could return the number of spaces instead of the spaces themselves:
[`\\n( *)`, `return {type: 'INDENT', value: $1.length};`]
Currently the npm scripts work only on Unix/Mac. While developers could use Cygwin to work with them, it would be great to support Windows out of the box.
We can introduce a build system (gulp
or similar) for this. Or just rewrite the scripts using Node's modules for file operations, instead of raw Bash scripts.
Currently the parser fails on the first error found. To be more practical, we should consider building error recovery mechanisms and restoring parsing after finding an error. We can start with the simplest panic-mode recovery, allowing several errors to be caught in one pass:
1 +& 2; 5 #* 3; 5 + 4;
This should issue two errors (currently only the first one is issued):
Unexpected token "&" at 1:3
Unexpected token "#" at 1:10
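A sketch of the panic-mode idea, assuming a pre-tokenized input and ';' as the synchronization token (both are assumptions for illustration, not Syntax's actual recovery design):

```javascript
// Panic-mode recovery: on an unexpected token, report it, then skip
// ahead to the next synchronization token (';') and keep going,
// collecting all errors in a single pass.
function collectErrors(tokens, isExpected) {
  const errors = [];
  for (let i = 0; i < tokens.length; i++) {
    const token = tokens[i];
    if (!isExpected(token)) {
      errors.push(`Unexpected token "${token.value}" at ${token.line}:${token.column}`);
      // Panic: discard input up to the synchronization point.
      while (i < tokens.length && tokens[i].value !== ';') i++;
    }
  }
  return errors;
}
```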
Record all debug and timing information with the --debug (-d) option.
Calculating an LALR(1) table by compressing from CLR(1) for a big grammar can take some time. It'd be good to log each step, making the calculation more interactive. Potentially this could also be an actual progress bar.
Similar to 28880c9 for Example plugin.
Currently the tokenizer automatically adds ^ at the beginning of all regexes. Some regexes, e.g. lookbehind assertions, can use it explicitly:
'/(?<=^[[:space:]])A(?=[[:space:]])/'
In this case we don't need to prepend the ^.
Note: this doesn't work:
'/^(?<=[[:space:]])A(?=[[:space:]])/'
We should provide tooling for automatic left-recursion elimination (in cases where it's possible).
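For the immediate-recursion case this is the standard textbook transform: A -> A α | β becomes A -> β A', A' -> α A' | ε. A sketch over a simplified rule representation (alternatives as arrays of symbols, not Syntax's internal format):

```javascript
// Eliminate immediate left recursion for a single nonterminal.
// E -> E '+' T | T   becomes   E -> T E' ;  E' -> '+' T E' | /* empty */ ;
function eliminateImmediateLeftRecursion(lhs, alternatives) {
  const recursive = alternatives.filter(rhs => rhs[0] === lhs);
  const base = alternatives.filter(rhs => rhs[0] !== lhs);
  if (recursive.length === 0) return {[lhs]: alternatives};
  const prime = `${lhs}'`;
  return {
    [lhs]: base.map(rhs => [...rhs, prime]),
    // Tail rules, plus the epsilon alternative.
    [prime]: [...recursive.map(rhs => [...rhs.slice(1), prime]), []],
  };
}
```

Indirect left recursion needs the full ordering-based algorithm on top of this, and semantic actions complicate the rewrite, hence "in cases where it's possible".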
Lexical grammar can already be specified as a separate file; however, at code generation a tokenizer is always embedded into the generated parser file. We need to support generating a module for a standalone tokenizer (so it can be required from any parser).
./bin/syntax --lex ~/lexer.l --tokenizer-only --output ~/Tokenizer.js
Which can be required later in any parser:
const Tokenizer = require('Tokenizer');
const lexer = new Tokenizer('a b c');
console.log(lexer.getNextToken()); // {type: '...', value: 'a'}
console.log(lexer.getNextToken()); // {type: '...', value: 'b'}
...
Should support API from the custom tokenizer section:
// initString is supported to reuse the same tokenizer instance
lexer.initString('x y z');
console.log(lexer.getNextToken()); // {type: '...', value: 'x'}
...
We can use standard template for the tokenizer at code generation.
It looks like currently the tool just generates table-driven parsers for various languages, which is often suboptimal.
Recursive ascent parsers are fully compatible with (LA)LR grammars, but they instead use native language features (functions, switches, etc.) so that they can be statically analysed and, as a result, much better optimised by the target compiler or JIT.
It would be great if syntax could generate such parsers for various languages.
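To illustrate the idea, here is a tiny hand-written sketch for the grammar E -> E '+' 'n' | 'n', with the LR(0) states derived by hand. In the optimized recursive-ascent form, the reduce-only states inline away and the goto-on-E loop in the start state becomes plain iteration, which is exactly what makes this style friendly to compilers and JITs:

```javascript
// Hand-written sketch for E -> E '+' 'n' | 'n'. Each LR(0) state would be a
// function in generated output; here states 2 and 4 (which reduce
// immediately) are inlined, and the left recursion shows up as a loop.
function parse(tokens) {
  let pos = 0;
  const peek = () => (pos < tokens.length ? tokens[pos] : '$');

  // state 0: expect the first 'n' (state 2 reduces E -> 'n' immediately)
  if (peek() !== 'n') throw new SyntaxError(`Unexpected token "${peek()}"`);
  pos++;
  let value = 1; // semantic value: number of 'n' terminals seen

  // state 1: after an E; shift '+' and 'n', reduce E -> E '+' 'n' (state 4)
  while (peek() === '+') {
    pos++;
    if (peek() !== 'n') throw new SyntaxError(`Unexpected token "${peek()}"`);
    pos++;
    value += 1;
  }

  if (peek() !== '$') throw new SyntaxError(`Unexpected token "${peek()}"`);
  return value; // accept
}
```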
Currently it's possible to use raw (wrapped in quotes) literal tokens in BNF.
%%
E : N;
N : D N | /* empty */ ;
D : "1" | "2" | "3" ... ;
We should also allow embedding regexp rules from lexical grammar directly to the BNF.
%%
E : N;
N : D N | /* empty */ ;
D : [0-9]+ ;
While writing a translator that converts from Sphinx's expression syntax to ElasticSearch's expression syntax, I found that I needed to pass the document's field as a scope parameter. For example, I needed to instruct ElasticSearch to limit its search to just the account_number field.
{
"query": { "match": { "account_number": 20 } }
}
Other times, I needed to search the entire document.
{
"query": { "match": { "_all": 20 } }
}
As a client, I need the parser to accept additional arguments and make them available in the semantic actions. For example:
{"query":{"match":{$context:$1}}}
Hi!
The generated handler for an empty production contains the line self.values_stack.pop();. I guess that's not the expected behavior. I'd be glad to put together a repro case if you need one. Thanks!
Currently "global" variables like yytext, yyleng, $$, etc. are implemented either as module-level variables (in JS) or as static class variables (in some plugins: C#, PHP, etc.).
This makes parsing several files in parallel (e.g. in threads) impossible, since they mutate the same "global" data. We should make them instance variables, and always export a class for the Tokenizer and the Parser (instead of a singleton instance which just accesses the global or static class variables).
This will require rewriting global code in handlers like:
yytext = yytext.slice(1, -1);
to an instance property access. Currently we already rewrite it to static prop access in some plugins:
Yyparse.yytext = Yyparse.yytext.slice(1, -1);
Eventually it should be an instance prop (specific to each plugin) instead of a static one:
this.yytext = this.yytext.slice(1, -1);
I have the following BNF rules, and I have the functions disjunction and conjunction declared in the moduleInclude section.
["E OR E", "$$ = disjunction($1, $3)"],
["E AND E", "$$ = conjunction($1, $3)"],
This places the functions in the global namespace. Is there a way to add these functions to the generated parser, or can you recommend an idiom that does not pollute the global namespace? In other words, given two parsers that both declare the functions disjunction and conjunction, how would I isolate the two parsers?
I know it's an odd issue but it took me 30 minutes to find this project! Finally recovered it from my GitHub stars.
P.S. Trying to write a brand new YAML parser with this. Do you have any recommendations?
Currently lookbehind assertions do not work, since ^ is inserted at the beginning of the whole regexp (the tokenizer applies all regexes at the beginning of the rest of the string).
This should match foo ("foo" surrounded by spaces):
(?<= )foo(?= )
This is incorrectly transformed to:
^(?<= )foo(?= )
While it should be:
(?<=^ )foo(?= )
We need to support an LR parser generator for Java. See the Example plugin, or the C# plugin, which is closer to Java.
We should provide an ability to access lexer (e.g. to change its state) from semantic actions of productions:
e: e + e { yy.lexer.pushState('condition'); ... }