dmitrysoshnikov / syntax


Syntactic analysis toolkit, language-agnostic parser generator.

License: MIT License

JavaScript 73.10% Python 2.36% PHP 3.09% Ruby 2.37% C# 4.70% GAP 1.37% Lex 1.22% Rust 3.06% Shell 0.08% Java 3.47% Makefile 0.08% C++ 2.44% Julia 2.66%

syntax's People

Contributors

andrebaltazar8, anru, cscott, danielruf, davidmclifton, dependabot[bot], dmitrysoshnikov, fdutton, gankoji, inspirnathan, mashplant, misdocumeno, namiwang, paul-31415, rasteiner, robertszefler, rodrigopedra, shimaore, tjvr, tobie, tpirc3, zer0chance


syntax's Issues

rust plugin: wrong type annotation for args

Hi, I'm trying to implement an experimental parser with a custom tokenizer, in Rust, via syntax-cli.

Here's a simple grammar (partially ripped from Ruby's):

     0. $accept -> program
    -----------------------
     1. program -> top_compstmt
     2. top_compstmt -> top_stmts opt_terms
     3. top_stmts -> top_stmt
     4. top_stmt -> stmt
     5. stmt -> expr
     6. expr -> arg
     7. arg -> primary
     8. primary -> literal
     9. literal -> numeric
    10. numeric -> simple_numeric
    11. simple_numeric -> tINTEGER
    12. opt_terms -> terms
    13. term -> tNL
    14. terms -> term

┌────────────────┬───────────┐
│ Symbol         │ First set │
├────────────────┼───────────┤
│ $accept        │ tINTEGER  │
├────────────────┼───────────┤
│ program        │ tINTEGER  │
├────────────────┼───────────┤
│ top_compstmt   │ tINTEGER  │
├────────────────┼───────────┤
│ top_stmts      │ tINTEGER  │
├────────────────┼───────────┤
│ top_stmt       │ tINTEGER  │
├────────────────┼───────────┤
│ stmt           │ tINTEGER  │
├────────────────┼───────────┤
│ expr           │ tINTEGER  │
├────────────────┼───────────┤
│ arg            │ tINTEGER  │
├────────────────┼───────────┤
│ primary        │ tINTEGER  │
├────────────────┼───────────┤
│ literal        │ tINTEGER  │
├────────────────┼───────────┤
│ numeric        │ tINTEGER  │
├────────────────┼───────────┤
│ simple_numeric │ tINTEGER  │
├────────────────┼───────────┤
│ tINTEGER       │ tINTEGER  │
├────────────────┼───────────┤
│ opt_terms      │ tNL       │
├────────────────┼───────────┤
│ terms          │ tNL       │
├────────────────┼───────────┤
│ term           │ tNL       │
├────────────────┼───────────┤
│ tNL            │ tNL       │
└────────────────┴───────────┘

And some productions:

...

top_compstmt
    : top_stmts opt_terms {
        |$1: Node; $2: Token| -> Node;

        $$ = Node::Dummy;
    }
;

...

This would produce handlers like:

enum SV {
    Undefined,
    _0(Token),
    _1(Node)
}

...

fn _handler2(&mut self) -> SV {
    // Semantic values prologue.
    let mut _1 = pop!(self.values_stack, _1);
    let mut _2 = pop!(self.values_stack, _0);

    let __ = Node::Dummy;
    SV::_1(__)
}

...

The issue I encountered is that at the beginning of top_compstmt (aka _handler2), the values stack looks like:

[
    _1(Dummy),
    _0(Token { kind: 15, value: "\n", ... })
]

It seems legit to me: the first is the reduced result value for top_stmts <- ... <- tINTEGER, and the second one is the result of opt_terms <- ... <- tNL.

Then the statement let mut _1 = pop!(self.values_stack, _1); assumes a _1(Node) is on top of the stack, while in reality the top of the stack is _0(Token); hence the issue.

So do you think this is an issue in syntax-cli or somewhere else in my implementation? Thanks.

Tokenizer: support returning a list of tokens from a lex rule

A lex rule may return different token types based on the logic of a handler, e.g.:

[`\\n( *)`, `
  const matchedIndent = yyleng - 1;

  if (matchedIndent === currentIndent) {
    return 'NL';
  }

  return 'INDENT';
`]

We also need to support returning several tokens from a rule handler:

[`\\n( *)`, `
  const matchedIndent = yyleng - 1;

  if (matchedIndent === currentIndent) {
    return 'NL';
  }

  return ['INDENT', 'NL'];
`]

This allows structuring grammar rules with "fake" tokens, making grammar rules simpler.
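A minimal sketch of how the tokenizer could support this (class and helper names here are hypothetical, not the tool's API): keep a queue of pending tokens, drain it first, and only then match new input.

class QueueingTokenizer {
  constructor() {
    this._tokensQueue = [];
  }

  getNextToken() {
    // Serve queued tokens before matching anything new.
    if (this._tokensQueue.length > 0) {
      return this._tokensQueue.shift();
    }

    const result = this._matchNextRule(); // runs the matched lex-rule handler

    // A handler may return a single token type, or a list of types.
    if (Array.isArray(result)) {
      const [firstType, ...restTypes] = result;
      this._tokensQueue.push(...restTypes.map(type => this._toToken(type)));
      return this._toToken(firstType);
    }
    return this._toToken(result);
  }

  // Stubs for illustration only.
  _matchNextRule() { return ['INDENT', 'NL']; }
  _toToken(type) { return {type, value: ''}; }
}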

Grammar: Support "optional symbol" EBNF notation

Currently one has to explicitly define an optional rule as:

Foo
  : Bar OptBaz
  ;

OptBaz
  : Baz
  | /* empty */
  ;

This makes it inconvenient to define a bunch of Opt<Symbol> rules. Instead, we can support the EBNF notation for optional symbols:

Foo
  : Bar Baz?
  ;

This can be done as a grammar transform that inserts the optional rule (basically transforming to the former version internally), as sketched below.
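A hedged sketch of such a transform (the generated rule name Baz_opt is an arbitrary choice): every Symbol? occurrence on a RHS is rewritten to a generated optional rule, so Foo : Bar Baz? expands to the explicit two-rule form shown above.

// Rules are represented as {lhs: [[rhs symbols], ...]}.
function expandOptionalSymbols(rules) {
  const generatedRules = {};

  const expanded = Object.fromEntries(
    Object.entries(rules).map(([lhs, alternatives]) => [
      lhs,
      alternatives.map(rhs =>
        rhs.map(symbol => {
          if (!symbol.endsWith('?')) {
            return symbol;
          }
          const base = symbol.slice(0, -1);
          const optName = `${base}_opt`;
          // Generated rule: Baz_opt : Baz | /* empty */ ;
          generatedRules[optName] = [[base], []];
          return optName;
        })
      ),
    ])
  );

  return {...expanded, ...generatedRules};
}

// expandOptionalSymbols({Foo: [['Bar', 'Baz?']]})
// => {Foo: [['Bar', 'Baz_opt']], Baz_opt: [['Baz'], []]}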

PHP: Unresolved function

PHP reports an unexpected error instead of raising an exception when encountering unbalanced parentheses.

This occurs because the PHP template is attempting to reference a global function instead of a local, static function.

See lr.template.php:165.

Support non-capturing group in native language.

Some languages like PHP do not adhere to the POSIX grammar. Since lex-rule.js instantiates a field using JavaScript's RegExp, it implicitly expects the expression to conform to the POSIX grammar and will raise an "Invalid regular expression" exception if it does not.

Example:
PCRE: (?<=[:space:])[Aa][Nn][Dd](?=[:space:])
POSIX: (?:[:space:])[Aa][Nn][Dd](?=[:space:])

Notes:

  1. The upper- and lower-case variants exist because I was unable to determine how to make the parser case insensitive.
  2. The reason for the non-capturing groups is that the parser matched "and" in "android".

Reuse same grammar symbol objects

Currently for productions like:

e : e '+' e

Three instances of the e symbol are created. We can keep a global registry of symbols and avoid allocating a new symbol if one has already been created.
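A minimal sketch of such a registry (names hypothetical, not the project's actual classes): the lookup memoizes symbols by name, so repeated occurrences share one instance.

class GrammarSymbol {
  constructor(symbol) {
    this._symbol = symbol;
  }
  getSymbol() {
    return this._symbol;
  }
}

const symbolRegistry = new Map();

function getSymbol(name) {
  if (!symbolRegistry.has(name)) {
    symbolRegistry.set(name, new GrammarSymbol(name));
  }
  return symbolRegistry.get(name);
}

// All three `e` occurrences in `e : e '+' e` now resolve to one instance:
console.log(getSymbol('e') === getSymbol('e')); // true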

Consider using different tokens to delimit insertion points

The templates contain insertion points delimited by '<<' and '>>'. I have not delved into the code to see how this is handled, but '<<' and '>>' represent reserved operators in several C-based languages (e.g., C, C++, C# and Java use '<<' to represent a left-shift). This could make it awkward to write a proper template.

Consider using something that is not already used by any language or framework. Most double-character tokens are already used. For example:

  • {{ : Mustache
  • [[ : PHP, JavaScript, etc.
  • @@ : I cannot remember

I recommend using {{{ since I do not know of any language or framework using this particular delimiter.

Show source line and pointer to the invalid token in error message

When we get an "Unexpected token" error from the tokenizer or parser, we currently just show:

Unexpected token: "..." at <line>:<column>

In addition, it'd be good to show the actual line from the original source, and a pointer to the invalid token:

foo + &7;
      ^ 
Unexpected token: "&" at <line>:<column>

CLI: Add --validate option to check a grammar

Currently, to check for "shift-reduce" and other conflicts, one needs to pass either the --collection or --table option, which actually builds and outputs the parsing table.

We need to add a --validate option which will just show:

✓ Grammar is valid

Or

✕ Grammar has the following errors:

<list-of-errors>
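A possible invocation, combining the existing -g option with the proposed flag:

syntax-cli -g ~/grammar.g --validate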

Support `onShift` hook

Currently we have one semantic action per whole production, which is effectively an onReduce action. For better parser-lexer communication we could support a semantic action per each symbol on the RHS ("before" and "after" hooks). Alternatively, we can provide a generic onShift hook which receives a production, symbol, lexer/parser instance, etc., and performs some action (e.g. sets a lexer state).

function onBeforeShift(grammarSymbol, production, lexer, parser, ...) {
  // do actions on shift
}

function onShift(...) { ... }


Object : "{"  Props "}" { $$ = $1 + $2 }

Here onBeforeShift and onShift are called for { and }.

Separate LexGrammar into own class

To support tokenizer-only code generation in #7, it's good to have the lexical and syntactic grammars as two separate classes (currently both are mixed into one Grammar class; we can introduce LexGrammar for the lexical one).

Support locations in tokens and AST nodes

We should support start / end locations for tokens and AST nodes, and also a loc structure with line and column numbers.

These can either be properties sitting on the handler arguments themselves, e.g.:

$1.start; $1.loc.start.line;

Or an extra "location" structure ("global" loc variable) passed as the last parameter to a handler function:

loc.$1.start.line;

For the example code:

foo;

The resulting AST node can have the following structure:

{
  type: "Identifier",
   name: "foo",
  start: 0,
  end: 3,
  loc: {
    start: {
      line: 1,
      column: 0
    },
    end: {
      line: 1,
      column: 3
    }
  }
}

Expose original source code to user-level

Some tools may need to analyze the original source code corresponding to AST nodes (combined with node locations, this makes it possible to get the substring of a node in the source code).

We need to expose the source code on the parser, on the tokenizer, and in the handlers. This can be just a global (or scoped) yysource variable.

Program : StatementList
  { $$ = {
      type: 'Program',
      body: $StatementList,
      source: yysource,
    }
  }

Support named variables in handlers

Currently handlers use numbered notation for production variables, e.g.:

add
  : mult          { $$ = $1 }
  | mult + add    { $$ = $1 + $3 }
  ;

We also need to support variable names corresponding to the non-terminals:

add
  : mult          { $$ = $mult }
  | mult + add    { $$ = $mult + $add }
  ;

Tokenizer: support onToken event

We should expose an onToken event, so that clients can subscribe and route parsing events based on the token.

yy.tokenizer.onToken = token => {
  if (token.type === 'L_PAREN') {
    ...
  }
};

Latest build not working

Hello,

I have just cloned the repo and run npm install and npm run build, which both executed successfully. However, when I run the utility with ./bin/syntax, it gives me the following output:

[syntax-master]$ ./bin/syntax --help
syntax-master/bin/syntax:196
            const [reducePart, shiftPart] = table.splitSRParts(conflict);
                  ^

SyntaxError: Unexpected token [
    at exports.runInThisContext (vm.js:53:16)
    at Module._compile (module.js:374:25)
    at Object.Module._extensions..js (module.js:405:10)
    at Module.load (module.js:344:32)
    at Function.Module._load (module.js:301:12)
    at Function.Module.runMain (module.js:430:10)
    at startup (node.js:141:18)
    at node.js:980:3

I am running Node v5.3.0 on Fedora 24 x64.

First set computation - algorithm correct?

While studying your algorithm for computing the First set (source), I wondered whether it is correct.

Consider a grammar like this:

S -> A | x
A -> S | y

From my understanding of your algorithm, you initially set _firstSets[S] to the empty set. Thus, when doing the recursion on A, the First set of A is computed to be the union of { } (as _firstSets[S] is still empty) and { y }.
However, the First set of A is { x, y }.
What am I missing?
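For comparison, the textbook formulation computes FIRST sets by iterating over all productions until no set changes (a fixpoint), rather than relying on a single recursive pass. A minimal sketch (not the project's implementation), ignoring epsilon productions for brevity:

// Grammar: map from non-terminal to an array of productions (arrays of symbols).
// Any symbol that is not a key of `grammar` is treated as a terminal.
function computeFirstSets(grammar) {
  const first = {};
  const isTerminal = s => !(s in grammar);

  for (const nt in grammar) {
    first[nt] = new Set();
  }

  let changed = true;
  while (changed) {
    changed = false;
    for (const nt in grammar) {
      for (const production of grammar[nt]) {
        const head = production[0];
        const candidates = isTerminal(head) ? [head] : [...first[head]];
        for (const t of candidates) {
          if (!first[nt].has(t)) {
            first[nt].add(t);
            changed = true; // keep iterating until nothing changes
          }
        }
      }
    }
  }
  return first;
}

// The grammar from the question: S -> A | x ; A -> S | y
// After reaching the fixpoint, FIRST(S) = FIRST(A) = { x, y }.
console.log(computeFirstSets({
  S: [['A'], ['x']],
  A: [['S'], ['y']],
}));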

Lexer: support case-insensitive option

We need to support a case-insensitive option in the lexical grammar declaration:

"lex": {
  "rules": [
    [`\\band\\b`,    `return 'AND'`],
  ],

  "options": {
    "case-insensitive": true,
  },
}

In addition to the global option for all rules, we could support this flag per rule as well.
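One possible per-rule shape (hypothetical; the exact format is open) would attach an options object to the rule itself:

"lex": {
  "rules": [
    [`\\band\\b`,  `return 'AND'`,  {"case-insensitive": true}],
  ],
}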

Use more specific exception types

Disclaimer: I am not a PHP developer but I am writing a translator for our PHP team that converts a Sphinx search expression into an ElasticSearch search expression as part of their migration from Sphinx to ElasticSearch.

The PHP team has provided a little over 1000 unit tests. Some are positive tests and some are negative tests. An example negative test is mismatched parentheses. What we would like to do is instruct PHPUnit to expect an exception in this case; however, PHPUnit does not allow expecting \Exception since the authors claim that it leads to unintended consequences. PHPUnit wants you to use something more concrete like SyntaxException or SemanticException.

It could be that I am misunderstanding what PHPUnit is requesting, so I am open to correction. However, if I understand this correctly, please use a more specific exception type.

Improve performance of canonical collection building and CLR to LALR compression

Currently, building the canonical collection and compressing from CLR(1) to LALR(1) are the longest operations, and for some grammars they can be very slow (up to several minutes). We should investigate better algorithms here.

time syntax-cli -g ~/grammar.g -d -m lalr1 -o ~/test.py

DEBUG mode is: ON

[DEBUG] Grammar (bnf) is in JS format
[DEBUG] Grammar loaded in: 1.081ms

Parsing mode: LALR(1).
[DEBUG] Building canonical collection: 46531.131ms
[DEBUG] Number of states in the collection: 5311
[DEBUG] Compressing CLR to LALR: 15057.025ms
[DEBUG] Number of states after compression: 299
[DEBUG] Building LR parsing table: 12.715ms
[DEBUG] Generating parser module: 24.013ms

✓ Successfully generated: /Users/dmitrys/test.py


real  1m2.014s
user  1m1.273s
sys 0m1.768s

Tokenizer: support returning an actual token from a handler

Currently a lex rule handler returns only a token type (potentially modifying yytext), e.g.:

[`[0-9]+`,  `yytext = yytext.slice(1); return 'NUMBER';`]

In this case "100" would result to:

{type: 'NUMBER', value: "00"}

We also need to support returning a full token (a type-value pair) from a handler:

[`[0-9]+`,  `return {type: 'NUMBER', value: 'Override!'};`]

A use case: matching spaces would return the number of spaces instead of the spaces themselves:

[`\\n( *)`,  `return {type: 'INDENT', value: $1.length};`]

Support Windows for development process

Currently the npm scripts work only on Unix/Mac. While developers could use Cygwin to work with them, it would be great to support Windows out of the box.

We can introduce a build system (gulp or similar) for this, or just rewrite the scripts using Node's modules for file operations instead of raw Bash scripts.

Investigate different error recovery strategies

Currently the parser fails on the first error found. To be more practical, we should consider building some error recovery mechanisms and resume parsing after finding an error. We can start with the simplest panic-mode recovery, and allow catching several errors in one pass (see the sketch after the example below):

1 +& 2; 5 #* 3; 5 + 4;

This should issue two errors (currently only the first one is reported):

Unexpected token "&" at 1:3
Unexpected token "#" at 1:10

Show progress steps for longer operations

Calculating an LALR(1) table by compressing from CLR(1) for a big grammar can take some time. It'd be good to log each step, making the calculation more interactive. Potentially this could even be an actual progress bar.

Tokenizer: support beginning of a string ^ in actual regexp

Currently the tokenizer automatically adds ^ at the beginning of all regexes. Some regexes, e.g. those using lookbehind assertions, may use it explicitly:

'/(?<=^[[:space:]])A(?=[[:space:]])/'

In this case we don't need to prepend the ^.

Note: this doesn't work:

'/^(?<=[[:space:]])A(?=[[:space:]])/'

Support only tokenizer code generation

A lexical grammar can already be specified as a separate file; however, at code generation the tokenizer is always embedded into the generated parser file. We need to support generating a module for a standalone tokenizer (so it can be required from any parser).

./bin/syntax --lex ~/lexer.l --tokenizer-only --output ~/Tokenizer.js

This can later be required in any parser:

const Tokenizer = require('Tokenizer');

const lexer = new Tokenizer('a b c');
console.log(lexer.getNextToken()); // {type: '...', value: 'a'}
console.log(lexer.getNextToken()); // {type: '...', value: 'b'}
...

It should support the API from the custom tokenizer section:

// initString is supported to reuse the same tokenizer instance
lexer.initString('x y z');
console.log(lexer.getNextToken()); // {type: '...', value: 'x'}
...

We can use the standard tokenizer template at code generation.

Add support for recursive ascent parser generation

It looks like currently the tool just generates table-driven parsers for various languages, which is often suboptimal.

Recursive ascent parsers are fully compatible with (LA)LR grammars, but they instead use native language features (functions, switches, etc.) so that they can be statically analysed and, as a result, much better optimised by the target compiler or JIT.

It would be great if syntax could generate such parsers for various languages.
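For illustration, a deliberately tiny hand-written example of the idea in JavaScript (hypothetical, not what syntax generates): each state of the LR automaton becomes a function, shifting is a call into the next state, and reducing is a return. Grammar: E -> NUMBER '+' NUMBER.

function parseE(tokens) {
  return state0(tokens, 0);
}

// State 0: expect the first NUMBER, shift into state 1.
function state0(tokens, pos) {
  if (tokens[pos].type !== 'NUMBER') throw new Error('Expected NUMBER');
  const left = Number(tokens[pos].value);
  return state1(tokens, pos + 1, left);
}

// State 1: after NUMBER, expect '+', shift into state 2.
function state1(tokens, pos, left) {
  if (tokens[pos].type !== '+') throw new Error("Expected '+'");
  return state2(tokens, pos + 1, left);
}

// State 2: expect the second NUMBER, then reduce E -> NUMBER '+' NUMBER.
function state2(tokens, pos, left) {
  if (tokens[pos].type !== 'NUMBER') throw new Error('Expected NUMBER');
  return left + Number(tokens[pos].value);
}

console.log(parseE([
  {type: 'NUMBER', value: '2'},
  {type: '+', value: '+'},
  {type: 'NUMBER', value: '3'},
])); // 5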

Allow regexp definitions directly in BNF grammar

Currently it's possible to use raw (wrapped in quotes) literal tokens in the BNF:

%%

E : N;

N : D N | /* empty */ ;

D : "1" | "2" | "3" ... ;

We should also allow embedding regexp rules from the lexical grammar directly in the BNF:

%%

E : N;

N : D N | /* empty */ ;

D : [0-9]+ ;

Allow clients to pass additional variables for use in semantic actions

While writing a translator that converts from Sphinx's expression syntax to ElasticSearch's expression syntax, I found that I needed to pass the document's field as a scope parameter. For example, I needed to instruct ElasticSearch to limit its search to just the account_number field.

{
  "query": { "match": { "account_number": 20 } }
}

Other times, I needed to search the entire document.

{
  "query": { "match": { "_all": 20 } }
}

As a client, I need the parser to accept additional arguments and make them available in the semantic actions. For example:

{"query":{"match":{$context:$1}}}

rust plugin: epsilon handling

@DmitrySoshnikov

Hi!

The generated handler for an empty production contains the line self.values_stack.pop();. I guess that's not the expected behavior. I'd be happy to put together a repro case if you need one. Thanks!

Make generated parsers thread-safe

Currently "global" variables like yytext, yyleng, $$, etc. are implemented either as module-level variables (in JS), or static class variables (in some plugins, C#, PHP, etc).

This makes parsing several files in parallel (e.g. in threads) impossible, since they mutate the same "global" data. We should make them instance variables, and always export a class for the Tokenizer and the Parser (instead of a singleton instance which just accesses the global or static class variables).

This will require rewriting global code in handlers like:

yytext = yytext.slice(1, -1); 

to an instance property access. Currently we already rewrite it to static property access in some plugins:

Yyparse.yytext = Yyparse.yytext.slice(1, -1); 

Eventually this should be an instance property (specific to each plugin) instead of a static one:

this.yytext = this.yytext.slice(1, -1); 

Need recommendation for namespace management

I have the following BNF rules and I have the functions disjunction and conjunction declared in the moduleInclude section.

["E OR E",  "$$ = disjunction($1, $3)"],
["E AND E",  "$$ = conjunction($1, $3)"],

This places the functions in the global namespace. Is there a way to add these functions to the generated parser, or can you recommend an idiom that does not pollute the global namespace? In other words, given two parsers that both declare the functions disjunction and conjunction, how would I isolate the two parsers?

Project name is so very hard to search on internet!

I know it's an odd issue but it took me 30 minutes to find this project! Finally recovered it from my GitHub stars.

P.S. Trying to write a brand new YAML parser with this. Do you have any recommendations?

Tokenizer: support lookbehind assertions

Currently lookbehind assertions do not work, since ^ is inserted at the beginning of the whole regexp (the tokenizer applies all regexes at the beginning of the remaining string).

This should match foo ("foo" surrounded by spaces):

(?<= )foo(?= )

This is incorrectly transformed to:

^(?<= )foo(?= )

Whereas it should be:

(?<=^ )foo(?= )

Expose lexer in parser semantic actions

We should provide the ability to access the lexer (e.g. to change its state) from the semantic actions of productions:

e: e + e { yy.lexer.pushState('condition'); ... }
