philippesigaud / pegged

A Parsing Expression Grammar (PEG) module, using the D programming language.

Makefile 0.19% D 97.35% TeX 0.50% Python 1.35% Shell 0.36% Batchfile 0.26%

pegged's Issues

Missing docs on memoization

There should be some docs on what exactly the memoization feature does, how it can affect a grammar/parser, when it's useful/not useful, etc.

Parenthesizing part of sequence fails

Hi, I was working on a Ruby grammar some, but when I factored out some parts of a float literal, the grammar stopped compiling. (dmd 2.058) This works:

FloatLiteral <~ DecLiteral ('e' [+-]? [0-9] [0-9]*)?
DecLiteral <- [0-9] [0-9]*

But this doesn't:

FloatLiteral <~ DecLiteral ('e' [+-]? DecLiteral)?
DecLiteral <- [0-9] [0-9]*

Nor does this:

FloatLiteral <~ DecLiteral ('e' [+-]? ([0-9] [0-9]*))?
DecLiteral <- [0-9] [0-9]*

Shouldn't the latter two examples work?

I get this as a compile error on the line that calls FloatLiteral.parse():

ruby_parse.d(25): Error: undefined identifier FloatLiteral

So generating the parser fails silently and then I get an error when I try to use it? It is hard to track down the part of the grammar that Pegged doesn't like. It'd be nice to have an error message, but that would be a separate issue. I know you've mentioned the error handling needs to be improved.

simple main func does not parse

import pegged.examples.ddump;

enum dcode = q{void main() {}};

enum parseTree1 = Module.parse(dcode);

pragma(msg, parseTree1.capture);

// outputs "null"

I am trying to get my head around the grammar and PEG in general, but I don't understand why FunctionBody does not work here...

Comments and Spacing

I believe this is a silly question but at the moment, I'm just trying to understand what the best practices are for discarding comments across all rules. At the moment I have the following code and it appears to just pause during the CTFE pragma evaluation:

mixin(grammar(`
TEST:
    Spacing     <- (spacing / Comment)* 
    Comment     <- ';' (!eol .*) eol
`));

version (unittest) {
pragma(msg, TEST(`
    ; a comment
`));
}

I'm not sure what I'm doing wrong? :/

Add <^ operator for orthogonality

I know it's a bit of a far-fetched argument, but I think that for orthogonality purposes, Pegged should support a <^ operator. It follows the principle of least surprise when/if a user tries to use it.

Make EOI a built-in rule

Since EOL is a built-in rule, wouldn't it make sense to make EOI the same?

How to use the ParseTree for expression evaluation

Given a simple grammar for math expressions (e.g. the one defined on https://github.com/PhilippeSigaud/Pegged/wiki/Writing-Your-Own-Grammar) I do not see how to evaluate an expression given its ParseTree. The problem is that I do not know whether Add corresponds to the operator - or +. How does one handle this?
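
One possible approach, sketched under assumptions: if the grammar keeps the operator literal in the tree (for example with `^('+'/'-')`, as other grammars on this page do), the evaluator can switch on that matched string. The `Node` struct below is a hypothetical stand-in for a parse-tree node with `name`, `matches` and `children` fields, and the tree is built by hand rather than produced by Pegged.

```d
import std.conv : to;

// Hypothetical stand-in for a parse-tree node; only the fields used here.
struct Node
{
    string name;       // rule name, e.g. "Expr", "AddExpr", "Number"
    string[] matches;  // matched strings; the operator comes first for AddExpr
    Node[] children;
}

// Evaluates trees shaped like: Expr <- Number AddExpr*, AddExpr <- ('+'/'-') Number
int eval(Node n)
{
    switch (n.name)
    {
        case "Number":
        {
            return n.matches[0].to!int;
        }
        case "Expr":
        {
            int acc = eval(n.children[0]);
            foreach (add; n.children[1 .. $])
            {
                int rhs = eval(add.children[0]);
                // The kept operator literal tells '+' apart from '-'.
                acc = (add.matches[0] == "+") ? acc + rhs : acc - rhs;
            }
            return acc;
        }
        default:
            assert(false, "unexpected node: " ~ n.name);
    }
}

void main()
{
    // Hand-built tree for "1 + 2 - 3".
    auto tree = Node("Expr", ["1", "+", "2", "-", "3"], [
        Node("Number", ["1"]),
        Node("AddExpr", ["+", "2"], [Node("Number", ["2"])]),
        Node("AddExpr", ["-", "3"], [Node("Number", ["3"])]),
    ]);
    assert(eval(tree) == 0);
}
```

An alternative design is to give each operator its own rule (say `Add` and `Sub`), so the rule name alone identifies the operation.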

arithmetic example does not work

I don't know if I am too stupid, but after a couple of months I wanted to give Pegged another try and failed miserably. Even the arithmetic example does not work for me in dmd 2.059 or the current gdc based on v2.057.

It simply does not generate the grammar. dmd gives "out of memory" if I try to mix in the generated grammar, and gdc says:
Error 1 Error: template pegged.grammar.PEGGED!(ParseTree).PEGGED.parse(ParseLevel pl = ParseLevel.parsing) parse(ParseLevel pl = ParseLevel.parsing) matches more than one template declaration, ../../_extern/pegged/grammar.d(111):parse(ParseLevel pl = ParseLevel.parsing) and ../../_extern/pegged/grammar.d(126):parse(ParseLevel pl = ParseLevel.parsing) c:\Users\Dilly\Downloads_code\Projects_extern\pegged\grammar.d 128

i have cut it down to this:
[CODE]
enum grammarStr =
"Expr < Factor AddExpr*
AddExpr < ^('+'/'-') Factor
Factor < Primary MulExpr*
MulExpr < ^('*'/'/') Primary
Primary < Parens / Number / Variable / ^'-' Primary

Parens   <  '(' Expr ')'
Number   <~ [0-9]+
Variable <- Identifier";

enum genGrammar = grammar(grammarStr);
pragma(msg,"grammar: " ~ genGrammar);
[/CODE]

Even without mixing it in, it does not work. The string coming out of grammar() already seems to be wrong...

Casing for rules and documentation

It seems that Pegged has moved to camelCase for rule identifiers, which is great. Documentation needs to be updated, though; some of it still points to stuff like Spacing.

Grammar name seemingly required

It seems that a grammar name is required for Pegged to accept any grammar now. This isn't a problem, but it took some digging to figure out what was wrong. If this is intended behavior, the docs should probably be updated to reflect it.

Use Pegged's grammar() with an external file

I'm not really sure what the D programming language's limitations are for opening external files with CTFE.

I've seen this: http://www.dsource.org/projects/tutorials/wiki/ImportFile which suggests that it's possible to use the import statement to include the contents of a file at compile time.

Is there a way to use this to include a PEG grammar definition? (e.g.):

mixin(grammar(import("path/to/file.peg")));
// other code

Example doesn't work

I am probably doing something wrong, but this simple example doesn't work:

module pegged.test;

import std.algorithm;
import std.conv;
import std.stdio;
import std.traits;
import std.typecons;
import std.typetuple;

import pegged.grammar;


void main()
{
    mixin(grammar(
   `Expr     <- Num AddExpr*
    AddExpr  <- ('+'/'-') Num
    Num   <~ [0-9]+`
));

    auto tree = Expr.parse(`1 + 2-3+4`);
    writeln(tree);
}

During compilation (dmd t2.d) I get error:
t2.d(15): Error: undefined identifier Num, did you mean struct No?

Any ideas?

Parameterized rules still have problems

This is a 'post-it' issue:

There are problems with the way parameterized rules are generated: string/TParseTree overloading does not work.

Expression-level '>' operator?

So I just got back to working on an old Pegged grammar and noticed that I used a > operator at the expression level. I cannot for the life of me recall what that does and it seems to be undocumented (possibly removed?).

@PhilippeSigaud can you shed some light on this?

Silly, I know. I probably shouldn't have used something undocumented to begin with...

Ranges trigger ICE under DMD 2.059

Under DMD 2.059, using a Range expression triggers an internal compiler exception. I'd be happy to file off a bug report under the DMD project, but I'm having a hard time isolating the fault.

import pegged.grammar;
mixin(grammar("Number < [0-9]*"));
enum result = Number.parse("123");

dmd: interpret.c:6642: bool isCtfeValueValid(Expression*): Assertion `((ArrayLiteralExp *)se->e1)->ownedByCtfe' failed.

I believe the fault is in the return expression of the Range class, in peg.d. The content of okfailMixin() is kind of beyond me at this point. I'd like to help get a bug filed on this one, as I'm eager to use Pegged in my future work in D.

json example bug

import pegged.examples.json;
import pegged.examples.jsonExample;

import std.stdio;

void main()
{
// Parsing at compile-time:
enum parseTree1 = JSON.parse(example1);

pragma(msg, parseTree1.capture);
writeln(parseTree1);

}

when i try to build with win32 dmd i get:
"Error setting up build: Invalid UTF-8 sequence (at index 1)"

request: pretty-print error on parsing failure

Example of current output:
Arithmetic.Instantiate failure at pos [index: 15, line: 0, col: 15]

Example of a better, pretty-print output:
Arithmetic.Instantiate failure at pos [index: 15, line: 0, col: 15]
instantiate(A,B(int))
               ^
(ie, show where the parsing error is with a "^" as in iron python. that should be really easy to do)
(and perhaps the current parse result and valid rules at that point, but I guess that's already there)
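
A minimal sketch of the caret output requested above, in plain D with no Pegged dependency. `caretMessage` is a hypothetical helper name, and it assumes a single-line input without tabs and a 0-based column.

```d
import std.array : replicate;
import std.stdio : writeln;

// Hypothetical helper: echo the input line, then put '^' under column `col` (0-based).
string caretMessage(string line, size_t col)
{
    return line ~ "\n" ~ replicate(" ", col) ~ "^";
}

void main()
{
    // Column 15 is the '(' after 'B' in the example above.
    writeln(caretMessage("instantiate(A,B(int))", 15));
}
```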

The example from the README doesn't work

The example from the README file does not work.

#!/usr/bin/env rdmd

import std.stdio;
import pegged.grammar;

mixin( grammar( `
Arithmetic:
    Expr     <  Factor AddExpr*
    AddExpr  <  ^('+'/'-') Factor
    Factor   <  Primary MulExpr*
    MulExpr  <  ^('*'/'/') Primary
    Primary  <  '(' Expr ')' / Number / Variable / ^'-' Primary

    Number   <~ [0-9]+
    Variable <- identifier
` ) );

void main() {
  auto tree = Arithmetic(" 0 + 123 - 456 ");
  writeln( tree ); 
  writeln( tree.matches ); // prints ["0"]
  writeln( tree.matches == ["0", "+", "123", "-", "456"] ); // prints false
}

I recommend adding more unit tests to the project so this kind of breakage doesn't happen in the future. At the very least, the examples from the docs should always work. Seeing the main example break doesn't instill a lot of confidence in the project.

This is with dmd v2.060 on Mac OS 10.8.

Examples should be unit-tested

That means:

- having all code examples as independent D modules:
  - inside these modules, the interesting part is between two markers;
  - only this part will be extracted and copied into the docs;
  - but, as part of the doc-generation process, the entire module shall be compiled and unit-tested.
- having the basic docs contain just a special 'import thisExample' statement.
- having a special D script that finds the import statements, compiles the corresponding modules, extracts the example code and inserts it into a new version of the docs.

Which parsing algorithm is used? Is it possible to add more efficient ones?

I have a grammar where some input is parsed very slowly.
Consider a grammar like:

Expr < Plus / Minus / Term
Plus ...
Minus ...

Term < Mul / Div / Factor
Mul ...
Div ...

Factor < UnaryMinus / UnaryPlus / Function
UnaryMinus ...
UnaryPlus ...

Function < BinaryFunction / UnaryFunction / Primary
BinaryFunction < Identifier '(' Expr ',' Expr ')'
UnaryFunction < Identifier '(' Expr ')'

Assume you want to parse the expression min(0, max(2, 3)). Each alternative is tried in order, so before "min" is finally parsed, on the order of 3 * 3 * 3 alternatives are attempted. It is especially costly when only the last alternative ever succeeds. I assume this is why PEGs can take exponential time to parse.
I think a packrat parser solves those issues by some memory trade-off. For some (probably most) grammars a LR parser may work even better in practice. Are there any plans to support more efficient parsing schemes? Are they possible to integrate with the current design?
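
To illustrate the memory trade-off mentioned above, here is a rough sketch of packrat-style memoization: each (rule, start position) pair is parsed at most once and the result is cached. All names (`Packrat`, `Result`, `apply`, the toy `identifier` rule) are hypothetical and not taken from Pegged.

```d
import std.typecons : Tuple, tuple;

struct Result { bool ok; size_t end; }

struct Packrat
{
    string input;
    Result[Tuple!(string, size_t)] memo; // key: (rule name, start position)
    size_t calls;                        // number of real (uncached) rule runs

    Result apply(string rule, size_t pos, Result delegate(size_t) parse)
    {
        auto key = tuple(rule, pos);
        if (auto hit = key in memo)
            return *hit;                 // cached: no re-parsing
        ++calls;
        auto r = parse(pos);
        memo[key] = r;
        return r;
    }
}

void main()
{
    auto p = Packrat("min(0)");

    // Toy rule: one or more lowercase ASCII letters.
    Result identifier(size_t pos)
    {
        size_t i = pos;
        while (i < p.input.length && p.input[i] >= 'a' && p.input[i] <= 'z')
            ++i;
        return Result(i > pos, i);
    }

    // Two alternatives that both start with Identifier at position 0,
    // like BinaryFunction and UnaryFunction above.
    auto a = p.apply("Identifier", 0, &identifier);
    auto b = p.apply("Identifier", 0, &identifier);
    assert(a == b && a.end == 3);
    assert(p.calls == 1); // the second attempt was served from the memo table
}
```

The cache must be cleared whenever the input changes, otherwise results from a previous input can be returned for the new one.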

Out of memory error when compiling mixed in grammar

The following mixed-in grammar compiled at commit 04e2b60 (a couple of days ago) but gives an out of memory error now when compiling with git head. I'll revert to the point it last worked at, but here is the grammar if you want to track down the issue:

mixin(grammar(`
Parse:
        Line <- (Spaces (Keyword / Other / Number / String / Parens / Symbol) Spaces)*
        Other <~ [a-zA-Z_]+
        Number <~ digit+ / (digit+ '.' digit*) / (digit* '.' digit+)
        String < FullString / PartialString
        FullString <~ quote (!quote .)* quote
                    / backquote (!backquote .)* backquote
                    / doublequote (!doublequote .)* doublequote
        PartialString <~ quote (!quote .)*
                       / backquote (!backquote .)*
                       / doublequote (!doublequote .)*
        Symbol <- '~' / '!' / '@' / '#' / '$' / '%' / '^' / '&' / '*' / '/' /
                  '+' / '=' / '<' / '.' / '>' / ',' / ':' / ';' / backslash
        Parens <- '(' / ')' / '{' / '}' / '[' / ']'
        Spaces <~ (' ' / '\n' / '\t')*
        Keyword <- "abstract" / "alias" / "align" / "asm" / "assert" / "auto" / "body" / "bool" / "break" / "byte"
                 / "case" / "cast" / "catch" / "cdouble" / "cent" / "cfloat" / "char" / "class" / "const" / "continue" / "creal" / "dchar"
                 / "debug" / "default" / "delegate" / "delete" / "deprecated" / "double" / "do" / "else" / "enum" / "export" / "extern"
                 / "false" / "finally" / "final" / "float" / "foreach_reverse" / "foreach" / "for" / "function" / "goto" / "idouble" / "if"
                 / "ifloat" / "immutable" / "import" / "inout" / "interface" / "invariant" / "int" / "in" / "ireal" / "is" / "lazy"
                 / "long" / "macro" / "mixin" / "module" / "new" / "nothrow" / "null" / "out" / "override" / "package" / "pragma"
                 / "private" / "protected" / "public" / "pure" / "real" / "ref" / "return" / "scope" / "shared" / "short" / "static"
                 / "struct" / "super" / "switch" / "synchronized" / "template" / "this" / "throw" / "true" / "try" / "typedef" / "typeid"
                 / "typeof" / "ubyte" / "ucent" / "uint" / "ulong" / "union" / "unittest" / "ushort" / "version" / "void" / "volatile"
                 / "wchar" / "while" / "with" / "__FILE__" / "__LINE__" / "__gshared" / "__thread" / "__traits"
`));

Wishlist: debug output for grammar parsing

Hi,

I'm trying to parse a file that has standard Unix-style '#-to-EOL' comments; but it's not working, and I believe at this point it's a bug in Pegged (though I'm not 100% sure).

This is my grammar:

ENI:
Grammar <- Statement (EOL Statement)* EOL? EOI
Statement <- ( AutoStatement / HotplugStatement / EmptyLine / Comment )
Comment <: '#'
EmptyLine <: S*
AutoStatement <- 'auto' S ^Identifier
HotplugStatement <- 'allow-hotplug' S ^Identifier

S <: ' ' / '\t'

When I give that the following file:

auto lo

allow-hotplug eth0

(i.e., the first line is empty, the third line starts with #)

then it won't parse correctly:

Parse output: failure
named captures: []
position: [index: 0, line: 0, col: 0]
ENI.Grammar failure at pos [index: 9, line: 2, col: 0]
Pegged.EOI failure at pos [index: 9, line: 2, col: 0]

When taking out the #, it works.

Maybe I'm doing something wrong, but I believe this should just work, no?

Operator for merging a node's children into the node's parent

[as per discussion in issue 60]

Given a rule tree A -> B -> C, it would be great if Pegged could provide a rule operator (along the lines of ;, : and ^) that, when applied to B, would turn the tree into A -> C. In other words, the marked rule disappears and all of its children are attached to its parent at the point where the node was in the list of A's children.

Performance improvements

The performance of Pegged is currently not fantastic in some laboratory cases. For instance, given a simple D-like grammar, a file with 100,000 lines of public class A {} takes several minutes to parse. This generally isn't good enough for a compiler, and needs improvement.

I don't know if it's related at all, but if some of the slowness comes from Pegged needing to be CTFE-compatible, you could special-case the runtime path by using if (!__ctfe).

Keywords

OK, so I know the whole idea behind a "keyword" may not make sense in PEG, but most computer languages do need to specify reserved keywords such that non-PEG tools have an easier time figuring out the language.

Could Pegged get a feature to specify certain reserved words?

Error handling improvements

Just creating this as a sort of meta-bug. A couple of things that come to mind:

A solid error handling interface needs to be devised.
Error handling needs to be documented.
Error handling must give enough contextual information to give useful error messages to the user.
Error handling should preferably be pluggable in some way so that the library user can tell Pegged what to do on a failure (e.g. try to continue parsing at a sync point or similar*).

See http://www.ssw.uni-linz.ac.at/coco/Doc/UserManual.pdf - the documentation on the SYNC and WEAK keywords.

Failed compilation with '"'

It fails compilation if I put '"', using '\"' compiles fine.

Readme.md example does not work anymore

The example from the readme.md does not work anymore. The parser just matches the "1".
Why don't you have unit tests for those things?

Line/column information

It would be nice to have some way to record line/column information for every capture. This is essential for producing good errors in a compiler.

why multiple naming conventions for rule generation?

Why do we have all the following allowed:
A < B
A <~ B
A <- B

instead of just, say, A < B?
Wouldn't it be simpler if we just had the latter?

dgrammar does not work anymore

With the current master I cannot generate the parser for the dgrammar.

mixin(grammar(Dgrammar)); leads to an out-of-memory exception.

Problems with parsing spaces and tabs

I've been playing around with the example markdown grammar[1] when I noticed some peculiarities. I've gotten it down to this test case:

#!/usr/bin/env rdmd

import std.stdio;
import pegged.grammar;

mixin( grammar( `
Test:
  Inlines <- Inline+
  Inline  <- String / Spaces

  String     <~ NormalChar+
  Spaces     <~ Spacechar+
  Spacechar  <- " " / "\t"
  NormalChar <- !( Spacechar ) .
`));

void main() {
  auto tree = Test("foo bar baz ");
  writeln( tree );
  writeln( tree.matches );
}

I've noticed several issues:

First, the following output is observed for the given program:

Test  [0, 12]["foo bar baz "]
 +-Test.Inlines  [0, 12]["foo bar baz "]
    +-Test.Inline  [0, 12]["foo bar baz "]
       +-Test.String  [0, 12]["foo bar baz "]

["foo bar baz "]

This seems very wrong. I would expect it to alternate between Strings of NormalChars which don't have spaces, then a space, then a String again etc. It seems that a NormalChar will match a space even though it shouldn't.

Second, if I change the input to " foo bar baz " (notice the starting space!), the program hangs.

Lastly, if I change Spacechar rule to Spacechar <- " ", everything works. So why is the \t killing things?

[1] Which is just terrible BTW. I know, it's not your fault, the original peg-markdown grammar has the bugs, I checked. I'm improving it so that it's correct and uses the very nice Pegged extensions and I'll pull-request the new grammar once I'm done.

Whitespace woes.

enum string testGrammar = ` 
TestGrammar:

Root < 'a' '.'

Spacing <- blank*
`;

import pegged.grammar;
import std.stdio;

mixin(grammar(testGrammar));

void main()
{
    stdout.writefln("%s", TestGrammar.Root("a."));
}

I would expect the above grammar to recognize the example. Instead, it never halts.

On the other hand, this works:

enum string testGrammar = ` 
TestGrammar:

Root < 'a' '.'

Spacing <- blank+
`;

import pegged.grammar;
import std.stdio;

mixin(grammar(testGrammar));

void main()
{
    stdout.writefln("%s", TestGrammar.Root("a."));
}

/+
Prints:
TestGrammar.Root [0, 2]["a", "."]
 +-literal!("a") [0, 1]["a"]
 +-literal!(".") [1, 2]["."]
+/

This surprises me because the Spacing symbol explicitly requests at least 1 blank, yet the text it recognizes has zero blanks.

There's another issue that might be the same thing: I am unable to place the right-hand-side of rules on a different line than the lhs like I used to:

enum string testGrammar = ` 
TestGrammar:

Root <
    'a'
    '.'

Spacing <- blank+
`;

import pegged.grammar;
import std.stdio;

mixin(grammar(testGrammar));

void main()
{
    stdout.writefln("%s", TestGrammar.Root("a."));
}

/+
During compilation:
test.d(14): Error: static assert  "Pegged (failure)
 +-Pegged.Grammar (failure)
    +-Pegged.GrammarName [2, 13]["TestGrammar"]
       +-Pegged.Identifier [2, 13]["TestGrammar"]
    +-oneOrMore!(Pegged.Definition) (failure)
       +-Pegged.Definition (failure)
       |  +-Pegged.LhsName [16, 20]["Root"]
       |     +-Pegged.Identifier [16, 20]["Root"]
       |  +-Pegged.Arrow (failure)
       |     +-literal!("< ") failure at line 3, col 5, after "ar:

Root" expected "< ", but got "<
        'a'
        '."
"
+/

This makes it difficult to align rules that are best represented vertically:

`
Branch < '^' ^identifier
    / '->' Node
    / '{' Node+ '}'
`

It might not be clear what's going on there, but there is no possible way for me to tab the alternations over to line up with the '^' terminal (nor would I want to: tabs are terrible/evil for alignment, but great for indentation). I'd rather just put the entire rhs into its own indentation level. That way my editor won't choke on spaces that don't match an indentation level. I'd like to be able to write it this way:

`
Branch < 
      '^' ^identifier
    / '->' Node
    / '{' Node+ '}'
`

Or, in explicit form:

`
Branch < 
{tab}{space}{space}'^' ^identifier
{tab}/ '->' Node
{tab}/ '{' Node+ '}'
`

The Pegged grammar looks like it can handle this, but it doesn't.

Ideally I'd even be able to do something like this:

`
Branch < ThisSymbolNeverMatches
    / '^' ^identifier
    / '->' Node
    / '{' Node+ '}'
`

which would entirely eliminate the desire to have tabs adjacent to spaces (tabs for indentation, spaces for alignment).

`
Branch < []==
    / '^' ^identifier
    / '->' Node
    / '{' Node+ '}'
`

It's the hammer operator, because you can't touch this ;)

This is all from commit 0406fd1:

commit 0406fd19e6c6261f2adab213d57e91d608fcf8f9
Author: Philippe Sigaud
Date:   Wed Oct 17 21:15:21 2012 +0200

    Testing callumenator out-of-memory error.

reference example broken

In your reference example (https://github.com/PhilippeSigaud/Pegged/wiki), the provided code doesn't compile:
enum parseTree1 = Expr.parse("1 + 2 - (3*x-5)*6");

instead this compiles:
enum parseTree1 = Arithmetic.parse("1 + 2 - (3*x-5)*6");

Also, as mentioned elsewhere, the tutorial should state that the grammar definition must not be put inside a function.

Creating comments not practical

It's currently not very practical to create comments in Pegged. One would have to insert them in every single rule in the program to allow what most programming languages let you do.

Generate 64-bit friendly parser code?

Hi Philippe,

I've found some time to experiment with the speedup1 work that you've incorporated onto the master branch. I wrote some test code (a copy of your arithmetic.d example) like so:

module test.parser;

import pegged.grammar;

enum parser = grammar(`
TEST:
    Term    < Factor (Add / Sub)*
    Add     < "+" Factor
    Sub     < "-" Factor
    Factor  < Primary (Mul / Div)*
    Mul     < "*" Primary
    Div     < "/" Primary
    Primary < Parens / Neg / Number
    Parens  < :"(" Term :")"
    Neg     < "-" Primary
    Number  < ~([0-9]+)
`);

pragma(msg, parser);
mixin(parser);

I compiled it with the following options:

dmd -c -ofsrc/parser.o -fPIC -O -inline -release -w -wi -I./src/ -I./Pegged/ src/parser.d

and I get the following errors during compilation (of the generated code):

src/parser.d(50): Error: cannot implicitly convert expression (tuple("Term",p.end)) of type Tuple!(string,ulong) to Tuple!(string,uint)
src/parser.d(55): Error: cannot implicitly convert expression (tuple("Term",p.end)) of type Tuple!(string,ulong) to Tuple!(string,uint)
src/parser.d(68): Error: cannot implicitly convert expression (tuple("Add",p.end)) of type Tuple!(string,ulong) to Tuple!(string,uint)
src/parser.d(73): Error: cannot implicitly convert expression (tuple("Add",p.end)) of type Tuple!(string,ulong) to Tuple!(string,uint)
// snip

But if I pass the -m32 option to the compiler it builds and links fine. I'm not sure what the problem is; you can find the generated code (before the -m32 flag) here: http://paste.ubuntu.com/1192434/

Let me know if there's anything else I can do to test. Cheers.

Memoization interferes with explicit sub-rule calls

Just wanted to point this out, as it took me a while to track down the cause, so is good to be aware of.

Currently I use a generated parser to parse some text, then later run individual rules within the parser to analyze specific parts of different text, so I do this:

auto p = Glint.decimateTree(Glint.Type(ParseTree(``, false, [], s, 0, 0)));

where s is some new string I want to parse with a given sub-rule of the Glint grammar. The parser doesn't always pick up the new text, but recycles an old result, from a previous (different) text input which was parsed using Glint(input_text). This is caused by memoization coupled with the fact that I am calling a sub-rule explicitly. The way I get around this at the moment is to do this:

Glint.memo = null

before the call to the explicit sub-rules. Currently memo is nulled before a new call to the main parser (when using a string as input). It would be cool if a similar facility existed to call sub-rules using just a string (or maybe this exists already?) which also nulls out the memo.

Thanks!

isRule undefined, then forward reference

Trying to compile Pegged master branch, I get the error:

..\Pegged\pegged\peg.d 374 Error: undefined identifier isRule, did you mean function Rule?

However I try to fix this (importing pegged\parser.d, etc) I get a forward reference error on GenericPegged.Pegged.

This is with an up-to-date master, but it has been happening for a couple of weeks now. This is simply when compiling Pegged, not trying to generate a parser or anything.

Problem with simple grammar

This may be related to another recent issue:

mixin(grammar(`
    Parse:
        Line < Keyword*
        Keyword <- "one" / "two"
`));

void main()
{
    string input =  "one two";
    auto res = Parse(input);
    writeln(res);
}

This hangs indefinitely. Changing the grammar to:

mixin(grammar(`
    Parse:
        Line < Keyword Keyword
        Keyword <- "one" / "two"
`));

Returns:

Parse  [0, 4]["one", "one"]
 +-Parse.Line  [0, 4]["one", "one"]
    +-Parse.Keyword  [0, 3]["one"]
    +-Parse.Keyword  [0, 3]["one"]

Like it is not consuming the input or something.

Compilation fails on Linux 64 bit, with git DMD.

dmd -version=select -w -Ivendor/pegged  -ofvendor/pegged/pegged/peg.o -c vendor/pegged/pegged/peg.d
dmd -version=select -w -Ivendor/pegged  -ofvendor/pegged/pegged/grammar.o -c vendor/pegged/pegged/grammar.d
vendor/pegged/pegged/grammar.d(2550): Error: cannot implicitly convert expression (diag.infiniteLoops.length()) of type ulong to int
vendor/pegged/pegged/grammar.d(2593): Error: cannot implicitly convert expression (cast(ulong)(breaker + 1) % diag.infiniteLoops.length()) of type ulong to int
vendor/pegged/pegged/grammar.d(2608): Error: cannot implicitly convert expression (cast(ulong)(breaker + 1) % diag.infiniteLoops.length()) of type ulong to int
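
The errors above follow from `.length` being `size_t`, which is 64 bits wide on this target and does not implicitly narrow to `int`. A small sketch of two usual ways around it, using stand-in data rather than Pegged's actual code:

```d
import std.conv : to;

void main()
{
    int[] infiniteLoops = [1, 2, 3];        // stand-in data, not Pegged's

    // int n = infiniteLoops.length;        // rejected on 64-bit: ulong -> int
    size_t n = infiniteLoops.length;        // portable: keep the value as size_t
    int m = infiniteLoops.length.to!int;    // or convert explicitly (throws on overflow)

    int breaker = 2;
    size_t next = (breaker + 1) % infiniteLoops.length; // result stays size_t
    assert(n == 3 && m == 3 && next == 0);
}
```

Keeping indices as `size_t` throughout is usually the simpler choice, since the same source then compiles for both 32-bit and 64-bit targets.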

semantic actions

When I try to create a grammar module with semantic actions, like

asModule("parser","
        Number < [0-9]+ {doStuff} "
             );

The generated grammar always gives:

static assert(false, `Bad grammar: ["PEGGED.Grammar failure at pos [index: 25, line: 1, col: 24]", "Pegged.EOI failure at pos [index: 25, line: 1, col: 24]"]`);

Are semantic actions currently implemented? What is the correct way to use them?

Bug in comment example?

On: https://github.com/PhilippeSigaud/Pegged/wiki/Extended-PEG-Syntax

Text    <~ (!("/*"/"/*") .)*

Was probably meant to be:

Text    <~ (!("/*"/"*/") .)*

Utility function to get error location

It would be nice to have a utility function to get the location of a fatal failure during parsing, at least as a temporary measure until we work out a proper error handling mechanism.

(This might exist already and I just can't find it...)

Using memoization breaks parsing in some cases

It took me two hours to track this bug down... Here's a test case:

#!/usr/bin/env rdmd

import std.stdio;
import pegged.grammar;

/*mixin( grammar( `*/
mixin( grammar!(Memoization.yes)( `
Test:
  Div <- HtmlBlockTag( 'div' )
  HtmlBlockTag( Tag ) <- HtmlTagOpen( Tag )
                        ( HtmlBlockTag( Tag ) /
                          AllIfNot( HtmlTagClose( Tag ) ) )*
                        HtmlTagClose( Tag )

  HtmlTag( Contents ) <- Lt Spnl ^Contents Spnl Gt
  HtmlTagClose( Tag ) <- HtmlTag( ^slash ^Tag )

  # The version of HtmlTagOpen that uses the HtmlTag rule fails under memoization;
  # both versions work under no memoization
# HtmlTagOpen( Tag )  <- Lt Spnl ^Tag Spnl ( HtmlAttribute Spnl )* Spnl Gt
  HtmlTagOpen( Tag )  <- HtmlTag( ^Tag Spnl HtmlAttribute* )

  HtmlAttributeValue <~ (Quoted / (!"/" !">" Nonspacechar)+)
  HtmlAttributeName <~ (AlphanumericAscii / "-")+
  HtmlAttribute <- HtmlAttributeName Spnl (^"=" Spnl HtmlAttributeValue)? Spnl

  Lt <- "<"
  Gt <- ">"

  Quoted <-    ^doublequote FuseAllUntil(doublequote) ^doublequote
             / ^quote FuseAllUntil(quote) ^quote
  BlankLine <~     Spaces Newline
  AlphanumericAscii <~ [A-Za-z0-9]
  Nonspacechar <~  !Spacechar !Newline .
  Spacechar <~     " " / "\t"
  Newline <~       "\n" / "\r" "\n"?
  Spaces <~        Spacechar*
  Spnl <~          Spaces (Newline Spaces)?

  AllIfNot(Predicate) <- (!Predicate .)
  AllUntil(Predicate) <- AllIfNot(Predicate)*
  FuseAllUntil(Predicate) <~ AllUntil(Predicate)
`));

void main() {
  auto tree = Test(
`<div id="bar">
  foo <br/> bar
</div>

`);

  writeln( tree );
  writeln( tree.matches );
}

With memoization turned on, this test code fails to parse the input. With memoization off, parsing succeeds.

I also tracked it down to a single rule; look at the comments in the grammar code. The version of HtmlTagOpen that uses the HtmlTag rule fails under memoization; both versions work without memoization. Switch between the two versions of HtmlTagOpen to see the behavior.

Action documentation/examples doesn't work

rdmd -I. pegged/examples/xml
pegged/examples/xml.d(12): Error: struct pegged.peg.Output(TParseTree) if (isParseTree!(TParseTree)) is used as a type
pegged/examples/xml.d(12): Error: struct pegged.peg.Output(TParseTree) if (isParseTree!(TParseTree)) is used as a type
pegged/examples/xml.d(22): Error: struct pegged.peg.Output(TParseTree) if (isParseTree!(TParseTree)) is used as a type
pegged/examples/xml.d(22): Error: struct pegged.peg.Output(TParseTree) if (isParseTree!(TParseTree)) is used as a type
Failed: 'dmd' '-I.' '-v' '-o-' 'pegged/examples/xml.d' '-Ipegged/examples'

'line' semantics

I'm a little confused as to what 'line' means in a ParseTree's begin position. For example when parsing via the JSON example:

import std.stdio;
import pegged.grammar;
import json;

enum example3 =
`{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": 1
        }
    }
}`;


void main()
{
    auto pt = JSON.parse(example3);

    foreach (child; pt.children)
    {
        foreach (sub1; child.children)
        {
            foreach (sub2; sub1.children)
            {
                foreach (sub3; sub2.children)
                {
                    writefln("%s : '%s'", sub3.begin, sub3.capture);
                }
            }
        }
    }
}

This prints:

[index: 18, line: 1, col: 16] : '["title", "example glossary", "GlossDiv", "title", "S", "GlossList", "1"]'

I thought "title" would begin at line 4 or 5, but not 1. If I'm misunderstanding the meaning of this it could be a good idea to document it for newbies like myself. :)
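
One reading of the output above is that `line` and `col` are 0-based, and that the printed node starts at the `{` of the object following `"glossary":` rather than at `"title"`. The sketch below recomputes a 0-based position from a byte index for the first lines of the example; `lineCol` is a hypothetical helper, not part of Pegged.

```d
import std.typecons : Tuple;

// Hypothetical helper: 0-based (line, col) of byte offset `index` in `text`.
Tuple!(size_t, "line", size_t, "col") lineCol(string text, size_t index)
{
    size_t line = 0, col = 0;
    foreach (c; text[0 .. index])
    {
        if (c == '\n') { ++line; col = 0; }
        else ++col;
    }
    return typeof(return)(line, col);
}

void main()
{
    enum text = "{\n    \"glossary\": {\n        \"title\": \"example glossary\"\n    }\n}";
    auto p = lineCol(text, 18);
    assert(text[18] == '{');            // index 18 is the inner object's brace
    assert(p.line == 1 && p.col == 16); // matches the reported [line: 1, col: 16]
}
```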

Predefined Identifier rule is wrong

I stumbled over strange behavior and I believe the cause is the built-in Identifier rule:

Identifier <~ Alpha Alphanum*

This rule ignores spacing. That's why a something is recognized as an Identifier which is wrong. The rule should be

Identifier <- ~(Alpha Alphanum*)

I'm not completely confident that I'm right. I'm sorry, for not writing a test case and verifying it further.

Constants for tree node names

It would be nice if each generated tree node had an enum string field holding the name used in the grammar. This would be much more maintainable/clean when switching over node names (since you'd get errors when you rename a node).
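
A sketch of what such constants could look like if written by hand. The struct name, field names and string values are all hypothetical; Pegged is not claimed to generate anything like this.

```d
// Hypothetical hand-written constants mirroring rule names of a grammar.
struct ArithmeticNames
{
    enum string Expr   = "Arithmetic.Expr";
    enum string Number = "Arithmetic.Number";
}

string describe(string nodeName)
{
    switch (nodeName)
    {
        // Renaming or removing a constant turns these cases into compile errors.
        case ArithmeticNames.Expr:   return "expression";
        case ArithmeticNames.Number: return "number";
        default:                     return "unknown";
    }
}

void main()
{
    assert(describe("Arithmetic.Number") == "number");
    assert(describe("Arithmetic.Foo") == "unknown");
}
```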

Error: template instance Char!('0') Char is not a template declaration, it is a class

Code:

mixin(grammar("Binary <- '0' ('b' / 'B') [01]+"));

Am I doing it wrong or is this a bug?

Infinite loop.

import std.stdio : writeln;
import pegged.grammar;

mixin(grammar(`
Test:
    A <- B*
    B <- .*
`));

void main () {
    writeln(Test.A.parse("lol"));
}

This is the reduced testcase for the culprit in my grammar.
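
A likely cause, offered as an assumption rather than a diagnosis of Pegged's code: `B <- .*` can succeed without consuming input, so `B*` keeps succeeding at the same position forever. The sketch below uses hypothetical mini-combinators (not Pegged's implementation) to show a repetition operator that stops when no progress is made.

```d
// Hypothetical mini-combinators: a parser maps (input, start) to (success, end).
struct Result { bool ok; size_t end; }
alias Parser = Result delegate(string input, size_t pos);

// Matches any single character ('.' in PEG notation).
Parser any()
{
    return delegate Result(string input, size_t pos) {
        return pos < input.length ? Result(true, pos + 1) : Result(false, pos);
    };
}

// p* with a progress guard.
Parser zeroOrMore(Parser p)
{
    return delegate Result(string input, size_t pos) {
        for (;;)
        {
            auto r = p(input, pos);
            // Stop on failure, and also when p succeeded without consuming input.
            if (!r.ok || r.end == pos)
                return Result(true, pos);
            pos = r.end;
        }
    };
}

void main()
{
    auto B = zeroOrMore(any());   // B <- .*
    auto A = zeroOrMore(B);       // A <- B*
    auto r = A("lol", 0);
    assert(r.ok && r.end == 3);   // terminates instead of looping
}
```

Without the `r.end == pos` check, the outer loop would call `B` at position 3 indefinitely, since `B` always succeeds there with an empty match.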