
A single file C++ header-only PEG (Parsing Expression Grammars) library

License: MIT License


cpp-peglib

Build Status

C++17 header-only PEG (Parsing Expression Grammars) library. You can start using it right away just by including peglib.h in your project.

Since this library only supports C++17 compilers, please make sure that the compiler option -std=c++17 is enabled. (/std:c++17 and /Zc:__cplusplus for MSVC)

You can also try the online version, PEG Playground at https://yhirose.github.io/cpp-peglib.

The PEG syntax is well described on page 2 of the document by Bryan Ford. cpp-peglib also supports the following additional syntax for now:

  • '...'i (Case-insensitive literal operator)
  • [...]i (Case-insensitive character class operator)
  • [^...] (Negated character class operator)
  • [^...]i (Case-insensitive negated character class operator)
  • {2,5} (Regex-like repetition operator)
  • < ... > (Token boundary operator)
  • ~ (Ignore operator)
  • \x20 (Hex number char)
  • \u10FFFF (Unicode char)
  • %whitespace (Automatic whitespace skipping)
  • %word (Word expression)
  • $name( ... ) (Capture scope operator)
  • $name< ... > (Named capture operator)
  • $name (Backreference operator)
  • | (Dictionary operator)
  • ↑ (Cut operator)
  • MACRO_NAME( ... ) (Parameterized rule or Macro)
  • { precedence L - + L / * } (Parsing infix expression)
  • %recovery( ... ) (Error recovery operator)
  • exp⇑label or exp^label (Syntax sugar for (exp / %recover(label)))
  • label { error_message "..." } (Error message instruction)
  • { no_ast_opt } (No AST node optimization instruction)

An 'End of Input' check is performed by default. To disable the check, call disable_eoi_check.

This library supports linear-time parsing, known as Packrat parsing.
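As a rough illustration of the idea (not cpp-peglib's actual internals; all names here are hypothetical), packrat parsing caches each rule's result per input position, so no rule is evaluated twice at the same position:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical sketch of packrat memoization: the result of each
// (rule id, input position) pair is cached, giving linear-time parsing.
struct MemoTable {
  // key: (rule id, position) -> (success, match length)
  std::map<std::pair<int, size_t>, std::pair<bool, size_t>> cache;
  int hits = 0;
};

// A toy rule (id 0): matches one or more 'a' characters.
std::pair<bool, size_t> match_as(const std::string &s, size_t pos,
                                 MemoTable &memo) {
  auto key = std::make_pair(0, pos);
  auto it = memo.cache.find(key);
  if (it != memo.cache.end()) { memo.hits++; return it->second; }
  size_t len = 0;
  while (pos + len < s.size() && s[pos + len] == 'a') len++;
  auto result = std::make_pair(len > 0, len);
  memo.cache[key] = result;
  return result;
}
```

A second attempt to match the same rule at the same position is answered from the cache instead of re-scanning the input.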

IMPORTANT NOTE for some Linux distributions such as Ubuntu and CentOS: the -pthread option is needed when linking. See #23, #46 and #62.

I am sure that you will enjoy this excellent "Practical parsing with PEG and cpp-peglib" article by bert hubert!

How to use

This is a simple calculator sample. It shows how to define a grammar, associate semantic actions with the grammar, and handle semantic values.

// (1) Include the header file
#include <peglib.h>
#include <assert.h>
#include <iostream>

using namespace peg;
using namespace std;

int main(void) {
  // (2) Make a parser
  parser parser(R"(
    # Grammar for Calculator...
    Additive    <- Multiplicative '+' Additive / Multiplicative
    Multiplicative   <- Primary '*' Multiplicative / Primary
    Primary     <- '(' Additive ')' / Number
    Number      <- < [0-9]+ >
    %whitespace <- [ \t]*
  )");

  assert(static_cast<bool>(parser) == true);

  // (3) Setup actions
  parser["Additive"] = [](const SemanticValues &vs) {
    switch (vs.choice()) {
    case 0: // "Multiplicative '+' Additive"
      return any_cast<int>(vs[0]) + any_cast<int>(vs[1]);
    default: // "Multiplicative"
      return any_cast<int>(vs[0]);
    }
  };

  parser["Multiplicative"] = [](const SemanticValues &vs) {
    switch (vs.choice()) {
    case 0: // "Primary '*' Multiplicative"
      return any_cast<int>(vs[0]) * any_cast<int>(vs[1]);
    default: // "Primary"
      return any_cast<int>(vs[0]);
    }
  };

  parser["Number"] = [](const SemanticValues &vs) {
    return vs.token_to_number<int>();
  };

  // (4) Parse
  parser.enable_packrat_parsing(); // Enable packrat parsing.

  int val;
  parser.parse(" (1 + 2) * 3 ", val);

  assert(val == 9);
}

To show syntax errors in grammar text:

auto grammar = R"(
  # Grammar for Calculator...
  Additive    <- Multiplicative '+' Additive / Multiplicative
  Multiplicative   <- Primary '*' Multiplicative / Primary
  Primary     <- '(' Additive ')' / Number
  Number      <- < [0-9]+ >
  %whitespace <- [ \t]*
)";

parser parser;

parser.set_logger([](size_t line, size_t col, const string& msg, const string &rule) {
  cerr << line << ":" << col << ": " << msg << "\n";
});

auto ok = parser.load_grammar(grammar);
assert(ok);

There are four semantic action signatures available:

[](const SemanticValues& vs, any& dt)
[](const SemanticValues& vs)
[](SemanticValues& vs, any& dt)
[](SemanticValues& vs)

A SemanticValues value contains the following information:

  • Semantic values
  • Matched string information
  • Token information if the rule is literal or uses a token boundary operator
  • Choice number when the rule is 'prioritized choice'

any& dt is 'read-write' context data that can be used for whatever purpose. The initial context data is set in the peg::parser::parse method.

A semantic action can return a value of an arbitrary data type, which will be wrapped by peg::any. If a user returns nothing in a semantic action, the first semantic value in the const SemanticValues& vs argument will be returned. (Yacc parsers have the same behavior.)

Here is the SemanticValues structure:

struct SemanticValues : protected std::vector<any>
{
  // Input text
  const char* path;
  const char* ss;

  // Matched string
  std::string_view sv() const { return sv_; }

  // Line number and column at which the matched string is
  std::pair<size_t, size_t> line_info() const;

  // Tokens
  std::vector<std::string_view> tokens;
  std::string_view token(size_t id = 0) const;

  // Token conversion
  std::string token_to_string(size_t id = 0) const;
  template <typename T> T token_to_number() const;

  // Choice number (0 based index)
  size_t choice() const;

  // Transform the semantic value vector to another vector
  template <typename T> vector<T> transform(size_t beg = 0, size_t end = -1) const;
}
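Conceptually, transform maps any_cast over the underlying std::any vector. A standalone sketch of that idea (transform_values is a hypothetical name, not the library's code):

```cpp
#include <algorithm>
#include <any>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: convert a range of std::any semantic values
// into a typed vector via std::any_cast.
template <typename T>
std::vector<T> transform_values(const std::vector<std::any> &vs,
                                size_t beg = 0,
                                size_t end = static_cast<size_t>(-1)) {
  std::vector<T> out;
  end = std::min(end, vs.size());
  for (size_t i = beg; i < end; i++) {
    out.push_back(std::any_cast<T>(vs[i])); // throws if the type mismatches
  }
  return out;
}
```

This is convenient when a rule collects a list of homogeneous child values, such as the items of a comma-separated list.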

The following example uses the < ... > operator, which is the token boundary operator.

peg::parser parser(R"(
  ROOT  <- _ TOKEN (',' _ TOKEN)*
  TOKEN <- < [a-z0-9]+ > _
  _     <- [ \t\r\n]*
)");

parser["TOKEN"] = [](const SemanticValues& vs) {
  // 'token' doesn't include trailing whitespaces
  auto token = vs.token();
};

auto ret = parser.parse(" token1, token2 ");

We can ignore unnecessary semantic values from the list by using the ~ operator.

peg::parser parser(R"(
  ROOT  <-  _ ITEM (',' _ ITEM _)*
  ITEM  <-  ([a-z0-9])+
  ~_    <-  [ \t]*
)");

parser["ROOT"] = [&](const SemanticValues& vs) {
  assert(vs.size() == 2); // should be 2 instead of 5.
};

auto ret = parser.parse(" item1, item2 ");

The following grammar is the same as the above.

peg::parser parser(R"(
  ROOT  <-  ~_ ITEM (',' ~_ ITEM ~_)*
  ITEM  <-  ([a-z0-9])+
  _     <-  [ \t]*
)");

Semantic predicate support is available with a predicate action.

peg::parser parser("NUMBER  <-  [0-9]+");

parser["NUMBER"] = [](const SemanticValues &vs) {
  return vs.token_to_number<long>();
};

parser["NUMBER"].predicate = [](const SemanticValues &vs,
                                const std::any & /*dt*/, std::string &msg) {
  if (vs.token_to_number<long>() != 100) {
    msg = "value error!!";
    return false;
  }
  return true;
};

long val;
auto ret = parser.parse("100", val);
assert(ret == true);
assert(val == 100);

ret = parser.parse("200", val);
assert(ret == false);

enter and leave actions are also available.

parser["RULE"].enter = [](const Context &c, const char* s, size_t n, any& dt) {
  std::cout << "enter" << std::endl;
};

parser["RULE"] = [](const SemanticValues& vs, any& dt) {
  std::cout << "action!" << std::endl;
};

parser["RULE"].leave = [](const Context &c, const char* s, size_t n, size_t matchlen, any& value, any& dt) {
  std::cout << "leave" << std::endl;
};

You can receive error information via a logger:

parser.set_logger([](size_t line, size_t col, const string& msg) {
  ...
});

parser.set_logger([](size_t line, size_t col, const string& msg, const string &rule) {
  ...
});

Ignoring Whitespaces

As you can see in the first example, we can ignore whitespace between tokens automatically with the %whitespace rule.

The %whitespace rule is applied in the following three situations:

  • trailing spaces on tokens
  • leading spaces on text
  • trailing spaces on literal strings in rules

These are valid tokens:

KEYWORD   <- 'keyword'
KEYWORDI  <- 'case_insensitive_keyword'
WORD      <-  < [a-zA-Z0-9] [a-zA-Z0-9-_]* >    # token boundary operator is used.
IDENT     <-  < IDENT_START_CHAR IDENT_CHAR* >  # token boundary operator is used.

The following grammar accepts one, "two three", four.

ROOT         <- ITEM (',' ITEM)*
ITEM         <- WORD / PHRASE
WORD         <- < [a-z]+ >
PHRASE       <- < '"' (!'"' .)* '"' >

%whitespace  <-  [ \t\r\n]*

Word expression

peg::parser parser(R"(
  ROOT         <-  'hello' 'world'
  %whitespace  <-  [ \t\r\n]*
  %word        <-  [a-z]+
)");

parser.parse("hello world"); // OK
parser.parse("helloworld");  // NG

Capture/Backreference

peg::parser parser(R"(
  ROOT      <- CONTENT
  CONTENT   <- (ELEMENT / TEXT)*
  ELEMENT   <- $(STAG CONTENT ETAG)
  STAG      <- '<' $tag< TAG_NAME > '>'
  ETAG      <- '</' $tag '>'
  TAG_NAME  <- 'b' / 'u'
  TEXT      <- TEXT_DATA
  TEXT_DATA <- ![<] .
)");

parser.parse("This is <b>a <u>test</u> text</b>."); // OK
parser.parse("This is <b>a <u>test</b> text</u>."); // NG
parser.parse("This is <b>a <u>test text</b>.");     // NG

Dictionary

The | operator allows us to build a word dictionary for fast lookup, using a trie structure internally. We don't have to worry about the order of words.

START <- 'This month is ' MONTH '.'
MONTH <- 'Jan' | 'January' | 'Feb' | 'February' | '...'

We can find which item matched with choice().

parser["MONTH"] = [](const SemanticValues &vs) {
  auto id = vs.choice();
};
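Internally, the dictionary can be thought of as a trie searched for the longest match at the current position. A standalone sketch of that idea (hypothetical types, not cpp-peglib's actual internals):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical trie sketch: returns the longest dictionary word that is
// a prefix of the input, so 'Jan' and 'January' may appear in any order.
struct TrieNode {
  std::map<char, std::unique_ptr<TrieNode>> next;
  int word_id = -1; // index of the word ending at this node, or -1
};

struct Trie {
  TrieNode root;

  void insert(const std::string &w, int id) {
    TrieNode *n = &root;
    for (char c : w) {
      auto &child = n->next[c];
      if (!child) child = std::make_unique<TrieNode>();
      n = child.get();
    }
    n->word_id = id;
  }

  // (matched length, word id) of the longest match, or (0, -1).
  std::pair<size_t, int> longest_match(const std::string &s) const {
    const TrieNode *n = &root;
    std::pair<size_t, int> best{0, -1};
    for (size_t i = 0; i < s.size(); i++) {
      auto it = n->next.find(s[i]);
      if (it == n->next.end()) break;
      n = it->second.get();
      if (n->word_id >= 0) best = {i + 1, n->word_id};
    }
    return best;
  }
};
```

Because all words share one trie, lookup cost depends on the length of the match, not on the number of words in the dictionary.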

It also supports case-insensitive mode.

START <- 'This month is ' MONTH '.'
MONTH <- 'Jan'i | 'January'i | 'Feb'i | 'February'i | '...'i

Cut operator

The ↑ operator can mitigate the backtracking performance problem, but it carries the risk of changing the meaning of the grammar.

S <- '(' ↑ P ')' / '"' ↑ P '"' / P
P <- 'a' / 'b' / 'c'

When we parse (z with the above grammar, we don't have to backtrack in S after ( is matched, because a cut operator is inserted there.
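The behavior can be illustrated with a small hand-written sketch (hypothetical code, not cpp-peglib; the grammar is simplified to two alternatives of S):

```cpp
#include <cassert>
#include <string>

// Sketch of cut semantics: once the parser passes the cut point inside an
// alternative, it commits to that alternative and fails outright instead
// of backtracking to try the remaining ones.
// Simplified grammar:  S <- '(' ↑ P ')' / P ;  P <- 'a'
enum class Result { Match, Fail, CommittedFail };

Result parse_S(const std::string &s) {
  if (!s.empty() && s[0] == '(') {
    // Cut: '(' matched, so we never fall back to the "/ P" alternative.
    if (s.size() >= 3 && s[1] == 'a' && s[2] == ')') return Result::Match;
    return Result::CommittedFail; // report failure here, no backtracking
  }
  if (!s.empty() && s[0] == 'a') return Result::Match;
  return Result::Fail;
}
```

Without the cut, parsing "(z" would uselessly retry the other alternatives after the '(' branch fails; with it, the failure is reported immediately.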

Parameterized Rule or Macro

# Syntax
Start      ← _ Expr
Expr       ← Sum
Sum        ← List(Product, SumOpe)
Product    ← List(Value, ProOpe)
Value      ← Number / T('(') Expr T(')')

# Token
SumOpe     ← T('+' / '-')
ProOpe     ← T('*' / '/')
Number     ← T([0-9]+)
~_         ← [ \t\r\n]*

# Macro
List(I, D) ← I (D I)*
T(x)       ← < x > _

Parsing infix expression by Precedence climbing

Regarding the precedence climbing algorithm, please see this article.

parser parser(R"(
  EXPRESSION             <-  INFIX_EXPRESSION(ATOM, OPERATOR)
  ATOM                   <-  NUMBER / '(' EXPRESSION ')'
  OPERATOR               <-  < [-+/*] >
  NUMBER                 <-  < '-'? [0-9]+ >
  %whitespace            <-  [ \t]*

  # Declare order of precedence
  INFIX_EXPRESSION(A, O) <-  A (O A)* {
    precedence
      L + -
      L * /
  }
)");

parser["INFIX_EXPRESSION"] = [](const SemanticValues& vs) -> long {
  auto result = any_cast<long>(vs[0]);
  if (vs.size() > 1) {
    auto ope = any_cast<char>(vs[1]);
    auto num = any_cast<long>(vs[2]);
    switch (ope) {
      case '+': result += num; break;
      case '-': result -= num; break;
      case '*': result *= num; break;
      case '/': result /= num; break;
    }
  }
  return result;
};
parser["OPERATOR"] = [](const SemanticValues& vs) { return vs.sv().front(); };
parser["NUMBER"] = [](const SemanticValues& vs) { return vs.token_to_number<long>(); };

long val;
parser.parse(" -1 + (1 + 2) * 3 - -1", val);
assert(val == 9);

The precedence instruction can be applied only to the following 'list'-style rule:

Rule <- Atom (Operator Atom)* {
  precedence
    L - +
    L / *
    R ^
}

The precedence instruction contains precedence info entries. Each entry starts with the associativity, 'L' (left) or 'R' (right), followed by operator literal tokens. Entries are ordered from the lowest to the highest precedence level, so in the example above ^ binds tightest.
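The underlying algorithm can be sketched in plain C++ (a standalone illustration of precedence climbing, not cpp-peglib's implementation; it handles single-digit numbers and left-associative operators only):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Precedence climbing sketch: parse_expr(min_prec) consumes operators whose
// precedence is at least min_prec; recursing with prec+1 makes the operator
// left-associative (an 'R' operator would recurse with the same prec).
struct Climber {
  const std::string &s;
  size_t pos = 0;
  std::map<char, int> prec{{'+', 1}, {'-', 1}, {'*', 2}, {'/', 2}};

  long parse_atom() { return s[pos++] - '0'; } // single digit

  long parse_expr(int min_prec) {
    long lhs = parse_atom();
    while (pos < s.size() && prec.count(s[pos]) && prec[s[pos]] >= min_prec) {
      char op = s[pos++];
      long rhs = parse_expr(prec[op] + 1); // +1 => left associative
      switch (op) {
      case '+': lhs += rhs; break;
      case '-': lhs -= rhs; break;
      case '*': lhs *= rhs; break;
      case '/': lhs /= rhs; break;
      }
    }
    return lhs;
  }
};

long eval(const std::string &s) { return Climber{s}.parse_expr(1); }
```

The precedence table plays the same role as the instruction's entries: '*' and '/' sit on a higher level than '+' and '-', so they bind tighter.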

AST generation

cpp-peglib is able to generate an AST (Abstract Syntax Tree) when parsing. The enable_ast method on the peg::parser class enables this feature.

NOTE: An AST node holds its corresponding token as a std::string_view for performance and lower memory usage. It is the user's responsibility to keep the original source text alive alongside the generated AST.

peg::parser parser(R"(
  ...
  definition1 <- ... { no_ast_opt }
  definition2 <- ... { no_ast_opt }
  ...
)");

parser.enable_ast();

shared_ptr<peg::Ast> ast;
if (parser.parse("...", ast)) {
  cout << peg::ast_to_s(ast);

  ast = parser.optimize_ast(ast);
  cout << peg::ast_to_s(ast);
}

optimize_ast removes redundant nodes to make an AST simpler. If you want to disable this behavior for particular rules, the no_ast_opt instruction can be used.

It internally calls peg::AstOptimizer to do the job. You can make your own AST optimizers to fit your needs.
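The core idea of such an optimizer, collapsing chains of single-child nodes unless a node is exempted, can be sketched on a toy tree (hypothetical Node type, not peg::Ast):

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Sketch of AST optimization: a node with exactly one child is replaced by
// that child, unless it is marked no_opt (as no_ast_opt does in cpp-peglib).
struct Node {
  std::string name;
  std::vector<std::shared_ptr<Node>> children;
  bool no_opt = false;
};

std::shared_ptr<Node> optimize(std::shared_ptr<Node> n) {
  while (n->children.size() == 1 && !n->no_opt) n = n->children[0];
  for (auto &c : n->children) c = optimize(c);
  return n;
}
```

For the calculator grammar this is what turns a chain like Multiplicative → Primary → Number into a single Number node.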

See actual usages in the AST calculator example and PL/0 language example.

Make a parser with parser combinators

Instead of making a parser by parsing PEG syntax text, we can also construct a parser by hand with parser combinators. Here is an example:

using namespace peg;
using namespace std;

vector<string> tags;

Definition ROOT, TAG_NAME, _;
ROOT     <= seq(_, zom(seq(chr('['), TAG_NAME, chr(']'), _)));
TAG_NAME <= oom(seq(npd(chr(']')), dot())), [&](const SemanticValues& vs) {
              tags.push_back(vs.token_to_string());
            };
_        <= zom(cls(" \t"));

auto ret = ROOT.parse(" [tag1] [tag:2] [tag-3] ");

The following operators are available:

  • seq (Sequence)
  • cho (Prioritized Choice)
  • zom (Zero or More)
  • oom (One or More)
  • opt (Optional)
  • apd (And predicate)
  • npd (Not predicate)
  • lit (Literal string)
  • liti (Case-insensitive literal string)
  • cls (Character class)
  • ncls (Negated character class)
  • chr (Character)
  • dot (Any character)
  • tok (Token boundary)
  • ign (Ignore semantic value)
  • csc (Capture scope)
  • cap (Capture)
  • bkr (Back reference)
  • dic (Dictionary)
  • pre (Infix expression)
  • rec (Recovery)
  • usr (User-defined parser)
  • rep (Repetition)

Adjust definitions

It's possible to add/override definitions.

auto syntax = R"(
  ROOT <- _ 'Hello' _ NAME '!' _
)";

Rules additional_rules = {
  {
    "NAME", usr([](const char* s, size_t n, SemanticValues& vs, any& dt) -> size_t {
      static vector<string> names = { "PEG", "BNF" };
      for (const auto& name: names) {
        if (name.size() <= n && !name.compare(0, name.size(), s, name.size())) {
          return name.size(); // processed length
        }
      }
      return -1; // parse error
    })
  },
  {
    "~_", zom(cls(" \t\r\n"))
  }
};

auto g = parser(syntax, additional_rules);

assert(g.parse(" Hello BNF! "));

Unicode support

cpp-peglib accepts UTF-8 text. The . expression matches a single Unicode codepoint. It also supports \u????.
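Matching a codepoint with . therefore consumes one to four bytes, not a single byte. A standalone sketch of determining the byte length from the UTF-8 lead byte (hypothetical helper, not cpp-peglib's decoder):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Returns the byte length of the UTF-8 codepoint starting at s[pos],
// or 0 for a continuation/invalid lead byte or out-of-range position.
size_t codepoint_length(const std::string &s, size_t pos) {
  if (pos >= s.size()) return 0;
  auto b = static_cast<uint8_t>(s[pos]);
  if (b < 0x80) return 1;           // 0xxxxxxx: ASCII
  if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
  if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
  if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
  return 0;                         // continuation or invalid lead byte
}
```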

Error report and recovery

cpp-peglib reports the furthest failure error position, as described in Bryan Ford's original document.

For better error reporting and recovery, cpp-peglib supports a 'recovery' operator with a label, which can be associated with a recovery expression and a custom error message. This idea comes from the fantastic paper "Syntax Error Recovery in Parsing Expression Grammars" by Sergio Medeiros and Fabio Mascarenhas.

The custom message supports %t, which is a placeholder for the unexpected token, and %c for the unexpected Unicode character.
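Conceptually, the substitution is a simple string replace (a hypothetical sketch, not the library's code):

```cpp
#include <cassert>
#include <string>

// Sketch: expand the %t placeholder in a custom error message with the
// text of the unexpected token.
std::string expand_message(std::string msg, const std::string &token) {
  auto pos = msg.find("%t");
  if (pos != std::string::npos) msg.replace(pos, 2, token);
  return msg;
}
```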

Here is an example of Java-like grammar:

# java.peg
Prog        ← 'public' 'class' NAME '{' 'public' 'static' 'void' 'main' '(' 'String' '[' ']' NAME ')' BlockStmt '}'
BlockStmt   ← '{' (!'}' Stmt^stmtb)* '}' # Annotated with `stmtb`
Stmt        ← IfStmt / WhileStmt / PrintStmt / DecStmt / AssignStmt / BlockStmt
IfStmt      ← 'if' '(' Exp ')' Stmt ('else' Stmt)?
WhileStmt   ← 'while' '(' Exp^condw ')' Stmt # Annotated with `condw`
DecStmt     ← 'int' NAME ('=' Exp)? ';'
AssignStmt  ← NAME '=' Exp ';'^semia # Annotated with `semia`
PrintStmt   ← 'System.out.println' '(' Exp ')' ';'
Exp         ← RelExp ('==' RelExp)*
RelExp      ← AddExp ('<' AddExp)*
AddExp      ← MulExp (('+' / '-') MulExp)*
MulExp      ← AtomExp (('*' / '/') AtomExp)*
AtomExp     ← '(' Exp ')' / NUMBER / NAME

NUMBER      ← < [0-9]+ >
NAME        ← < [a-zA-Z_][a-zA-Z_0-9]* >

%whitespace ← [ \t\n]*
%word       ← NAME

# Recovery operator labels
semia       ← '' { error_message "missing semicolon in assignment." }
stmtb       ← (!(Stmt / 'else' / '}') .)* { error_message "invalid statement" }
condw       ← &'==' ('==' RelExp)* / &'<' ('<' AddExp)* / (!')' .)*

For instance, ';'^semia is syntactic sugar for (';' / %recovery(semia)). The %recovery operator tries to recover from the error at ';' by skipping input text with the recovery expression semia. semia is also associated with the custom message "missing semicolon in assignment.".

Here is the result:

> cat sample.java
public class Example {
  public static void main(String[] args) {
    int n = 5;
    int f = 1;
    while( < n) {
      f = f * n;
      n = n - 1
    };
    System.out.println(f);
  }
}

> peglint java.peg sample.java
sample.java:5:12: syntax error, unexpected '<', expecting '(', <NUMBER>, <NAME>.
sample.java:8:5: missing semicolon in assignment.
sample.java:8:6: invalid statement

As you can see, it can now show more than one error, and provide more meaningful error messages than the default messages.

Custom error message for definitions

We can associate custom error messages to definitions.

# custom_message.peg
START       <- CODE (',' CODE)*
CODE        <- < '0x' [a-fA-F0-9]+ > { error_message 'code format error...' }
%whitespace <- [ \t]*

> cat custom_message.txt
0x1234,0x@@@@,0xABCD

> peglint custom_message.peg custom_message.txt
custom_message.txt:1:8: code format error...

NOTE: If there is more than one element with an error_message instruction in a prioritized choice, this feature may not work as you expect.

peglint - PEG syntax lint utility

Build peglint

> cd lint
> mkdir build
> cd build
> cmake ..
> make
> ./peglint
usage: grammar_file_path [source_file_path]

  options:
    --source: source text
    --packrat: enable packrat memoise
    --ast: show AST tree
    --opt, --opt-all: optimize all AST nodes except nodes selected with `no_ast_opt` instruction
    --opt-only: optimize only AST nodes selected with `no_ast_opt` instruction
    --trace: show concise trace messages
    --profile: show profile report
    --verbose: verbose output for trace and profile

Grammar check

> cat a.peg
Additive    <- Multiplicative '+' Additive / Multiplicative
Multiplicative   <- Primary '*' Multiplicative / Primary
Primary     <- '(' Additive ')' / Number
%whitespace <- [ \t\r\n]*

> peglint a.peg
[commandline]:3:35: 'Number' is not defined.

Source check

> cat a.peg
Additive    <- Multiplicative '+' Additive / Multiplicative
Multiplicative   <- Primary '*' Multiplicative / Primary
Primary     <- '(' Additive ')' / Number
Number      <- < [0-9]+ >
%whitespace <- [ \t\r\n]*

> peglint --source "1 + a * 3" a.peg
[commandline]:1:3: syntax error

AST

> cat a.txt
1 + 2 * 3

> peglint --ast a.peg a.txt
+ Additive
  + Multiplicative
    + Primary
      - Number (1)
  + Additive
    + Multiplicative
      + Primary
        - Number (2)
      + Multiplicative
        + Primary
          - Number (3)

AST optimization

> peglint --ast --opt --source "1 + 2 * 3" a.peg
+ Additive
  - Multiplicative[Number] (1)
  + Additive[Multiplicative]
    - Primary[Number] (2)
    - Multiplicative[Number] (3)

Adjust AST optimization with no_ast_opt instruction

> cat a.peg
Additive    <- Multiplicative '+' Additive / Multiplicative
Multiplicative   <- Primary '*' Multiplicative / Primary
Primary     <- '(' Additive ')' / Number          { no_ast_opt }
Number      <- < [0-9]+ >
%whitespace <- [ \t\r\n]*

> peglint --ast --opt --source "1 + 2 * 3" a.peg
+ Additive/0
  + Multiplicative/1[Primary]
    - Number (1)
  + Additive/1[Multiplicative]
    + Primary/1
      - Number (2)
    + Multiplicative/1[Primary]
      - Number (3)

> peglint --ast --opt-only --source "1 + 2 * 3" a.peg
+ Additive/0
  + Multiplicative/1
    - Primary/1[Number] (1)
  + Additive/1
    + Multiplicative/0
      - Primary/1[Number] (2)
      + Multiplicative/1
        - Primary/1[Number] (3)

Sample codes

License

MIT license (© 2022 Yuji Hirose)

cpp-peglib's Issues

Releases

Hello
Great library.
It would be great if you had official releases of cpp-peglib even if it's only a header only library, it would make packaging and versioning possible with vcpkg and/or conan, even with cmake fetchcontent which fetches content by git tags.
Thank you for your consideration.

rule called with too many semantic values

Hey Yuji!

I have encountered a strange bug:

If I have this grammar

term <- ( ws atom ws op )* ws atom ws
op <- '+'
ws <- ' '*
atom <- [0-9]*

and parse the text "99" then the rule "term" is called with 5 semantic values although there should only be 3:

for(auto& rule:parser.get_rule_names()){
	parser[rule.c_str()] = [rule](const SemanticValues& sv, any&) {
		cout << "rule " << rule << " called\n";
		for(int i = 0; i<sv.size(); i++){
			cout << "  sv[" << i << "] = " << sv[i].get<string>() << "\n";
		}
		return rule;
	};
}

produces

rule ws called
rule atom called
rule ws called
rule ws called
rule atom called
rule ws called
rule term called
  sv[0] = atom
  sv[1] = ws
  sv[2] = ws
  sv[3] = atom
  sv[4] = ws

Or is there something I don't get?

peglint crashes

Hi!

Somehow peglint does not work for me like described in the readme:

$ cat a.peg
Additive    <- Multitive '+' Additive / Multitive
Multitive   <- Primary '*' Multitive / Primary
Primary     <- '(' Additive ')' / Number
Number      <- < [0-9]+ >
%whitespace <- [ \t]*

$ ./peglint --ast --source "1 + 2 * 3" a.peg
terminate called after throwing an instance of 'std::system_error'
  what():  Unknown error -1
Aborted (core dumped)

When I run in server mode, it shows the page and then crashes:

$ ./peglint --ast --server 8001 --source "1 + 2 * 3" a.peg
Server running at http://localhost:8001/
(now I open the browser and it shows a promising page)
terminate called after throwing an instance of 'std::system_error'
  what():  Unknown error -1
Aborted (core dumped)

I compiled and ran under Ubuntu,
The C compiler identification is GNU 7.2.0
The CXX compiler identification is GNU 7.2.0

All the best!

PS: If you can't reproduce this, I can try to provide more debug information.

[Request] external macros

Here's the situation:
I think sooner or later I will probably want the ability to parse a context-dependent grammar, so it would be nice to be able to call external functions during the parsing process.
The first issue where this popped up is this:
I want to be able to parse custom operators with custom priority. So the user should be able to say
"× is infix and should be evaluated before + but after *"
But in order to make the priority customizeable, the parser must know which operator comes on which priority level during parsing. But since the definition of the operator also happens during parsing, it cannot know it during grammar construction.
And in the future I will probably also need to create some sort of registry during parsing where the parser can look things up.
So the best way that I see would be this:

Create a way to call external functions that determine whether there is a match or not. Something like:

Result <- Operand @Operator Operand

parser.definition["Operator"] = [](const char* s, size_t n, SemanticValues& sv, any& dt){
    ... // do custom stuff like look ups
    return matchlen;
}

Do you think this is a good idea and useful in general? Or is there maybe a more elegant solution to my problem? (Like in the end I also didn't need left recursion although I thought I did)

In return I can implement UTF8 support ;)

C/C++ escaped quotes in strings?

Hello Yuji.

Cannot get this to work using master. The intention is to be able to parse a C/C++ style literal string with escaped quotes as in:

// i.e. file content, not a compiled string
" Hello \"Yuji\" "

Any thoughts? The rule is specified this:

// 
RULE <- '"' (LITERAL_ESC_QUOTE / LITERAL_CHAR)* '"'
// i.e. match anything that is not a single quote character
LITERAL_CHAR <- (!["] .)
// this is a string not an escaped quote
LITERAL_ESC_QUOTE  <- '\"'

TAIA.

Jerry

Semantic Values seems weird if string and string are nearby

I would better to use less code to complain the problem that I had occurred.
For example the parser expressions are as below:

ASSIGN			<-	"Set" TYPE IDSTR '=' EXPRESSION
TYPE			<-	["Interger""Decimal"]
IDSTR			<-	[_A-Za-z][_A-Za-z0-9]*
 EXPRESSION		<-	...brabrabra, just return double value...

And I want to parse "SetDecimalvariable=50.0"
sv that I get in ASSIGN will be as follow:

sv.size() == 3        // fine, (TYPE, IDSTR, EXPRESSION) are three element
sv.str_c() == "SetDecimalvariable=50.0"        // fine, originally data
sv[0].get<string>() == "Decimalvariable=50.0"        // weired, it should be "Decimal" as first element
sv[1].get<string>() == "ecimalvariable=50.0"        // more weired, it should be "variable" as second element
sv[2].get<Ele>() == 50.0        // fine, it's the third element

Am I thought wrong?
Thanks for the project, it helps me a lot.

[Question] how to make parser continue after syntax error?

I am trying to write a scripting language using this library. In compilers/interpreters of other scripting languages I've used, you can get multiple errors from a single file/class/function. I know that I can check for a lot of errors after the parser (from this lib) has finished (accessing a private var, calling a function that doesn't exist, etc.), but I can't seem to find a way to check for multiple syntax errors, because the parser stops after encountering one. Is there be a way to ignore a rule if the parser finds a syntax error in my input?

Memory leaks

I was playing with this version of your parser https://github.com/yhirose/cpp-peglib/blob/57f866c6ca77f5a5afe37f72942d5526c45d7e87/peglib.h and accidentally found unlimited memory consumption. Consider this example:

#include <cpp-peglib/peglib.h>
#include <iostream>
#include <cstdlib>

using namespace peg;
using namespace std;

int main(int , char** ) try
{
	do	
	{
		function<long (const Ast&)> eval = [&](const Ast& ast) {
			if (ast.name == "NUMBER") {
				return stol(ast.token);
			} else {
				const auto& nodes = ast.nodes;
				auto result = eval(*nodes[0]);
				for (auto i = 1u; i < nodes.size(); i += 2) {
					auto num = eval(*nodes[i + 1]);
					auto ope = nodes[i]->token[0];
					switch (ope) {
						case '+': result += num; break;
						case '-': result -= num; break;
						case '*': result *= num; break;
						case '/': result /= num; break;
					}
				}
				return result;
			}
		};

		parser parser(R"(
			EXPRESSION       <-  TERM (TERM_OPERATOR TERM)*
			TERM             <-  FACTOR (FACTOR_OPERATOR FACTOR)*
			FACTOR           <-  NUMBER / '(' EXPRESSION ')'
			TERM_OPERATOR    <-  < [-+] >
			FACTOR_OPERATOR  <-  < [/*] >
			NUMBER           <-  < [0-9]+ >
			%whitespace      <-  [ \t\r\n]*
		)");

		parser.enable_ast();
		parser.enable_packrat_parsing();

		auto expr = " 2+2*2 ";
		shared_ptr<Ast> ast;
		if (parser.parse(expr, ast)) {
			ast = AstOptimizer(true).optimize(ast);
			//cout << ast_to_s(ast);
			//cout << expr << " = " << eval(*ast) << endl;
		}
	}
	while(0);
	

	return 0;
}
catch(const std::exception &ex)
{
	std::cerr << "Error: " << ex.what() << "\n";
	return 1;
}

(Built with g++ (i686-posix-dwarf-rev0, Built by MinGW-W64 project) 8.1.0, Win7, 'g++ -std=c++17 -Wall -Wextra -Wpedantic -g -O0 -fno-inline -fno-omit-frame-pointer -ggdb -isystemK:/1/0/source/cpp-peglib main.cpp -o a.exe'.)
If I make infinite loop while(1), process memory in a few minutes grows up to 200 MB and even more. Program with while(0) requires less than 1 MB.
There is full output of 'drmemory -- a.exe' https://pastebin.com/raw/5qkLW8Zj. Its important parts:

Error #1: LEAK 172 direct bytes 0x020fbd20-0x020fbdcc + 540 indirect bytes
peglib.h:3269 _ZZN3peg6parser10enable_astINS_7AstBaseINS_9EmptyTypeEEEEERS0_vENKUlRKNS_14SemanticValuesEE_clES8_ 

Error #3: LEAK 8 direct bytes 0x021051d0-0x021051d8 + 344 indirect bytes
peglib.h:2942 peg::AstBase<>::AstBase   

peglib.h:3269 in lambda:

auto ast = std::make_shared<T>(

peglib.h:2942 in c-tor:

, nodes(a_nodes)

Am I doing something wrong or is it error in your library? Could you fix it?

Tree rewriting

Have a secondary stage to walk the parse tree that is currently output in AST mode. This stage could then create a much reduced AST according to an additional grammar or extra markup.

I have some example code written for the GDB grammar that demonstrates this idea. Though this version relies on the functor callabacks for nodes in the parse tree.

I will explain this more clearly with an example. Just want to have a placeholder.

AST crashes on using optional param in grammar

For the csv grammar in the given example it core dumps when forming an AST of it

CSV grammar based on RFC 4180 (http://www.ietf.org/rfc/rfc4180.txt)

file <- (header NL)? record (NL record)* NL?
header <- name (COMMA name)*
record <- field (COMMA field)*
name <- field
field <- escaped / non_escaped
escaped <- DQUOTE (TEXTDATA / COMMA / CR / LF / D_DQUOTE)* DQUOTE
non_escaped <- TEXTDATA*
COMMA <- ','
CR <- '\r'
DQUOTE <- '"'
LF <- '\n'
NL <- CR LF / CR / LF
TEXTDATA <- !([",] / NL) .
D_DQUOTE <- '"' '"'

#0 0x000000000043267a in std::_Hashtable<std::string, std::pair<std::string const, peg::Definition>, std::allocator<std::pair<std::string const, peg::Definition> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_begin (this=0x0)
at /opt/gcc/4.8.3/include/c++/4.8.3/bits/hashtable.h:369
#1 0x0000000000429eae in std::_Hashtable<std::string, std::pair<std::string const, peg::Definition>, std::allocator<std::pair<std::string const, peg::Definition> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::begin (this=0x0)
at /opt/gcc/4.8.3/include/c++/4.8.3/bits/hashtable.h:455
#2 0x00000000004216f6 in std::unordered_map<std::string, peg::Definition, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, peg::Definition> > >::begin (this=0x0)
at /opt/gcc/4.8.3/include/c++/4.8.3/bits/unordered_map.h:249
#3 0x00000000004237e9 in peg::parser::enable_ast<peg::AstBase<peg::EmptyType> > (this=0x7ffe1c236ca0) at cpp-peglib/peglib.h:3254

If the (?) operator is not used, it doesn't crash. Is there any reason not to use the (?) operator?

utf-8 support

As part of a larger project that uses UTF-8 internally, I had to read files that could be either UTF-8 or UTF-16. While the project is not ready for posting, I have broken off some utilities, including UTF conversion routines (and, in the header file, a link to where I got the information). I hope this may be of use to you in your efforts.

https://github.com/mjsurette/easyUtils

Mike

Capture and AST generation

Hi Yuji, would be very interested to get your thoughts on this too.

  1. Parser fills the ast.token value with the literal content of the string:
STRING_LITERAL  <- < '"' (('\\"' / '\\t' / '\\n') / (!["] .))* '"' >
  2. No capture; the ast.token value is empty:
STRING_LITERAL  <- '"' (('\\"' / '\\t' / '\\n') / (!["] .))* '"'
  3. Parser creates an AST node for STRING_LITERAL with a new child leaf for each matching occurrence of ESC and CHAR. Obviously there is no content in the AST token member:
STRING_LITERAL  <- < '"' (ESC / CHAR)* '"' >
ESC             <- ('\\"' / '\\t' / '\\n')
CHAR            <- (!["] .)

Questions:
Are the differences in 1 and 2 intentional?
3 is possibly unexpected, but it is consistent throughout, so not a problem. I think this is definitely worth documenting in the readme.

[Enhancement] Const access to definitions

Hi, I noticed that all the parsing methods of peg::parser and peg::Definition are const, which is nice.
Could we get const access to the definitions too in order to parse specific rules via const peg::parser&?

diff --git a/peglib.h b/peglib.h
index 463cd3b..8050ec2 100644
--- a/peglib.h
+++ b/peglib.h
@@ -3116,10 +3116,14 @@ public:
 
     Definition& operator[](const char* s) {
         return (*grammar_)[s];
     }
 
+    const Definition& operator[](const char* s) const {
+        return (*grammar_)[s];
+    }
+
     std::vector<std::string> get_rule_names(){
         std::vector<std::string> rules;
         rules.reserve(grammar_->size());
         for (auto const& r : *grammar_) {
             rules.emplace_back(r.first);

Thought: Profiling backtracking ...

Hi Yuji,

Something of a 'nice to have': Some statistics on the amount of backtracking during a parse might suggest better/optimal rule ordering (?)

matching of choice expression

I have very simple grammar to illustrate:

ARRAY <- SPACEONLY? TEST SPACEONLY? ( COMMA SPACEONLY? TEST SPACEONLY?)*  

TEST <- NUM  		
 / VARNAME
		
COMMA <- ','

NUM <- [0-9]+

VARNAME <- [a-zA-Z0-9_]+

~SPACEONLY <- [ \t]+

I expect this to try NUM first and, if that fails, VARNAME, per the semantics of the ordered-choice expression. Bryan Ford's paper says in the bottom-left paragraph of page 2:

"The choice expression ‘e1 / e2’ first attempts
pattern e1, then attempts e2 from the same starting point if e1 fails."

Given that, I expect the following inputs to match:

1DA, SA1_1WS
1, SA1_1WS
sa_1,2
1,2,3

Is this behavior by design, or is it a bug?

Thank you for your wonderful project.
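For what it's worth, this looks like standard PEG behavior rather than a cpp-peglib quirk: once NUM succeeds inside the choice, the choice is committed, and a later failure (e.g. at COMMA) does not re-try VARNAME at the same position. If inputs like 1DA should fall through to VARNAME, one common technique is a negative lookahead that accepts NUM only when no identifier character follows; a sketch, not the only possible fix:

```peg
# Accept NUM only when it is not immediately followed by an
# identifier character, so '1DA' falls through to VARNAME.
TEST <- NUM ![a-zA-Z0-9_] / VARNAME
```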

Locally disable %whitespace

I have a grammar for a programming language. It defines %whitespace, because whitespaces are not significant.

Now, I want to parse string literals with a rule like this:

StrQuot   <- '"' (StrEscape / StrChars)* '"'
StrEscape <- < '\\' any >
StrChars  <- < (!'"' !'\\' any)+ >

StrEscape and StrChars both have actions that produce a std::string, which I combine in the action for StrQuot. The problem is that the whitespace inside the strings is ignored, and thus the resulting string has all its whitespace filtered out.

Is there a way to deactivate the %whitespace rule locally?
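One approach that may help: in cpp-peglib, %whitespace is not applied inside a token boundary (< ... >), so wrapping the whole literal in a token boundary should preserve the interior spaces, at the cost of getting the literal as one token rather than per-piece semantic values. A sketch:

```peg
# Inside < ... >, %whitespace skipping is suspended, so spaces in
# the string content survive.
StrQuot   <- < '"' (StrEscape / StrChars)* '"' >
StrEscape <- '\\' any
StrChars  <- (!'"' !'\\' any)+
```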

Parser reduce() functions?

Hi Yuji, more of a meta question here. I was experimenting to see if I could generate a much more minimal AST by associating specific rules with a custom reduce() function. However, this is not quite working as expected: the peg::SemanticValues& argument is always empty. Does a functor need to be associated with each and every rule in the grammar? Or have I missed something here?

A minimal example based on the GDB/MI grammar:

        auto mknode = [](const peg::SemanticValues& sv, peg::any& arg) -> peg::any
        {
            if (sv.size())
            {
                std::cout << sv[0].name << ' ' << sv[0].s << std::endl;
            }
            return peg::any();
        };

        peg::any arg;
        // set up functors for rules of interest *only*
        parser["STRING_LITERAL"] =  mknode;
        parser["IDENTIFIER"] = mknode;
        parser["LBRACE"] = mknode;
        parser["RBRACE"] = mknode;
        parser["LBRACK"] = mknode;
        parser["RBRACK"] = mknode;

        if (!parser.parse_n(source.data(), source.size(), arg ))
        {
            ret = -1;
        }

Simple grammar failing

I hate opening GitHub issues simply asking for help; however, I've been fighting with a seemingly simple grammar that I've not been able to get working correctly. Luckily, I stumbled across peglint for testing.

File a.peg contains:

Any         <- Placeholder / Text
Placeholder <- '${' Int ':' Any '}'
Int         <- [0-9]+
Text        <- [a-z]+

Running a simple example works as expected:

> peglint.exe --ast --source "${1:hi}" a.peg
+ Any
  + Placeholder
    - Int (1)
    + Any
      - Text (hi)

A bit more complex example doesn't work:

> peglint.exe --ast --source "${1:hi${2:bye}}" a.peg
[commendline]:1:7: syntax error

I'm using Win7 and VS2015 if that makes a difference at all. Thanks.
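For reference, the failure appears to come from the grammar rather than from peglint: Any matches exactly one Placeholder or one Text, so the mixed content hi${2:bye} inside the outer placeholder has no possible parse. Allowing a sequence of them should fix it; a sketch:

```peg
# Any now matches one or more pieces, so "hi" followed by a nested
# placeholder is accepted inside the outer '${1:...}'.
Any         <- (Placeholder / Text)+
Placeholder <- '${' Int ':' Any '}'
Int         <- [0-9]+
Text        <- [a-z]+
```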

ast tree does not record selected rule

So, here's a simple grammar to parse protobuf:

statements <- statement*
statement <-
    "syntax" '=' string ';' /
    "import" string ';' /
    "package" token ';' /
    enum_statement /
    message_statement

enum_statement <- "enum" token '{' enum_decl* '}'
enum_decl <- token '=' number ';'

message_statement <- "message" token '{' field* '}'
field <-
    type_decl /
    repeated_decl /
    oneof_decl /
    map_decl /
    message_statement /
    enum_statement

type_decl <- type token '=' number ';'
type <- token ('.' token)*

repeated_decl <- "repeated" type token '=' number ';'
oneof_decl <- "oneof" token '{' type_decl* '}'
map_decl <- "map" '<' type ',' type '>' token '=' number ';'

%word <- token / number
string <- < '"' (!'"' .)* '"' >
token  <- < [a-zA-Z_][a-zA-Z0-9_]* >
number <- < [0-9]+ >

%whitespace <- [ \t\r\n]*

The problem is that the AST doesn't record which statement alternative was matched: the "syntax" and "import" nodes look the same.
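One workaround, until the AST records the selected alternative, is to give each alternative its own rule so the node names differ (the rule names below are illustrative). Depending on the cpp-peglib version, the Ast node may also carry a choice field recording which alternative matched, so it is worth checking that first.

```peg
statement <- syntax_stmt / import_stmt / package_stmt /
             enum_statement / message_statement

syntax_stmt  <- "syntax" '=' string ';'
import_stmt  <- "import" string ';'
package_stmt <- "package" token ';'
```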

Allow peglib to work w/o RTTI

The current library is pretty close to working without C++ RTTI enabled. peglib is useful in environments where RTTI adds more space/time overhead than is acceptable, but it currently requires some very minor changes to the code (i.e. #ifdef __cpp_rtti guards around the uses of dynamic_cast, plus replacement methods that return void* instead).

Would this be something that you'd be OK accepting a patch for?

left recursion not detected

Hi Yuji!

I stumbled upon an undetected left recursion. It was hiding deep down in my grammar and I was able to reduce it to the following pattern:

_ <- ' '*
A <- B
B <- _ A

Peglib/Peglint does not see a problem there. However, if I substitute the _ rule, it works fine:

A <- B
B <- ' '* A
lrec.peg:1:6: 'B' is left recursive.
lrec.peg:2:11: 'A' is left recursive.

pegdebug

I wrote an interactive debug inspector for PEGs using peglib. I probably wouldn't have started it had I found peglint first, but now it's done and there's nothing we can do about it. What I needed most, however, was a means of finding out why rules don't match certain parts of the text although I wanted them to. So pegdebug displays the complete parsing process, not just the resulting AST.

You can check it out here:
https://github.com/mqnc/pegdebug

I had to modify peglib slightly to make it work. You can see the changes here:
https://github.com/mqnc/cpp-peglib

I also need these changes in my other projects. Maybe you find the functionality useful and can include it into your library. However, as it is, it breaks code that uses your peglib since the enter and leave functions have different signatures.

Please let me know if I did something wrong with the licensing or anything else in that direction.

I hope it's useful!

(sorry I made this an issue, I don't see another way for communication)

[Question] Is it safe to modify semantic values in parse action?

At least in the way I structured my parse actions, I think it could be useful at times to build onto semantic values from previous actions by modifying or reusing data instead of copying it. For example:

// A <- (rule for matching/creating A)
parser["A"] = [](const peg::SemanticValues& sv) -> StructA {
    return { /* using data from sv string */ };
}
// ModA <- (rule for matching a modification of A)
parser["ModA"] = [](const peg::SemanticValues& sv) -> /* StructA */ {
    /* modify StructA within sv[0] and let sv[0] be passed on as usual? */
}
// B <- (rule for matching/creating B from A)
parser["B"] = [](const peg::SemanticValues& sv) -> StructB {
    return { /* std::move data from sv[0] for creating StructB efficiently? */ }
}

In order to do this, I'd need a T& from the const any& items returned by sv[], but the references from any::get<T>() const are const T&. If it is guaranteed that the parser itself doesn't utilize the semantic value contents, wouldn't it be safe to drop const from the reference returned by sv[].get<T>()?

If this is true, could we add this guarantee to SemanticValues as a getter for non-const refs? Maybe something like:

template<typename T> T& SemanticValues::value(size_t index = 0) const {
    return const_cast<T&>(operator[](index).get<T>());
}

The only other problem I could think of is if semantic values were shared between multiple parse handlers, which shouldn't happen since there is only one parse action per match (input is not shared) and AST nodes don't have multiple parents (output is not shared).

build errors with clang++ in debug mode

As the subject line indicates I have build problems with clang 5.0.0 in debug mode.

[ 18%] Linking CXX executable test-main
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84b85): undefined reference to `peg::enabler'
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84bbf): undefined reference to `peg::enabler'
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84bed): undefined reference to `peg::enabler'
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84c1b): undefined reference to `peg::enabler'
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84c49): undefined reference to `peg::enabler'
CMakeFiles/test-main.dir/test.cc.o:(.debug_info+0x84c77): more undefined references to `peg::enabler' follow
clang-5.0.0: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [test/CMakeFiles/test-main.dir/build.make:95: test/test-main] Error 1
make[1]: *** [CMakeFiles/Makefile2:86: test/CMakeFiles/test-main.dir/all] Error 2
make: *** [Makefile:95: all] Error 2

It works in release mode and gcc works in both modes.

Awesome library, by the way. I especially like being able to use parser combinators. It really shoots up the performance.

Mike

Suggestion: Action forwarding

Hey Yuji!

While I was designing the grammar for my language, I found that many rules require the same actions.
Consider this:

sentence <- subject verb object '.'
question <- verb subject something '?'
quote <- subject verb '"' (sentence/question) '."'
subject <- word
verb <- word
object <- word
something <- word
word <- [a-zA-Z]*

Now, the word rules have to return their matched strings, and the sentence rules (1 to 3) need to perform some sort of concatenation, so I kind of need two different default rules. Of course I can assign them like this:

parser["sentence"] = parser["question"] = parser["quote"] = [](sv){...};
parser["subject"] = parser["verb"] = parser["object"] = parser["something"] = [](sv){...};

or maybe make one of the two the default rule.
(This example is a bit stupid, in reality I need many more default rules)

But I was thinking, maybe it would be nice to be able to specify some forwarding in the grammar already:

sentence>>concat <- subject verb object '.'
question>>concat <- verb subject something '?'
quote>>concat <- subject verb '"' (sentence/question) '."'
subject>>match <- word
verb>>match <- word
object>>match <- word
something>>match <- word
word <- [a-zA-Z]*

parser["concat"] = [](sv){...};
parser["match"] = [](sv){...};

I'm not sure if the syntax is confusing, though. Maybe rather:
name: forward <- pattern

Parsley does it like this:
name = pattern -> action
which I find more intuitive, but it's no longer original PEG syntax, so I think it's unacceptable.

You think this might be a good idea? It's nothing that I desperately need but maybe a useful feature.

Cheers!

Doesn't compile with MinGW, despite CMake

Why is there an "@" symbol in that "ar" line?

$ cmake -G "MSYS Makefiles" .
-- The C compiler identification is GNU 8.2.0
-- The CXX compiler identification is GNU 8.2.0
-- Check for working C compiler: C:/MinGW/bin/gcc.exe
-- Check for working C compiler: C:/MinGW/bin/gcc.exe -- broken
CMake Error at C:/Program Files/CMake/share/cmake-3.13/Modules/CMakeTestCCompiler.cmake:52 (message):
  The C compiler

    "C:/MinGW/bin/gcc.exe"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: C:/Users//git/cpp-peglib/CMakeFiles/CMakeTmp

    Run Build Command:"C:/MinGW/msys/1.0/bin/make.exe" "cmTC_c9e41/fast"
    /usr/bin/make -f CMakeFiles/cmTC_c9e41.dir/build.make CMakeFiles/cmTC_c9e41.dir/build
    make[1]: Entering directory `/c/Users//git/cpp-peglib/CMakeFiles/CMakeTmp'
    Building C object CMakeFiles/cmTC_c9e41.dir/testCCompiler.c.obj
    /C/MinGW/bin/gcc.exe    -o CMakeFiles/cmTC_c9e41.dir/testCCompiler.c.obj   -c /C/Users//git/cpp-peglib/CMakeFiles/CMakeTmp/testCCompiler.c
    Linking C executable cmTC_c9e41.exe
    "/C/Program Files/CMake/bin/cmake.exe" -E remove -f CMakeFiles/cmTC_c9e41.dir/objects.a
    /C/MinGW/bin/ar.exe cr CMakeFiles/cmTC_c9e41.dir/objects.a @CMakeFiles/cmTC_c9e41.dir/objects1.rsp
    c:\MinGW\bin\ar.exe: could not create temporary file whilst writing archive: no more archived files
    make[1]: *** [cmTC_c9e41.exe] Error 1
    make[1]: Leaving directory `/c/Users//git/cpp-peglib/CMakeFiles/CMakeTmp'
    make: *** [cmTC_c9e41/fast] Error 2




  CMake will not be able to correctly generate this project.


-- Configuring incomplete, errors occurred!

Also, compilation fails:

g++ -std=c++11 peglib.h

[Question] How to get line/column information in a semantic action?

When parsing some grammar, I would like to keep parsing and only record minor semantic issues as warnings at the end, along with the line/column number for each warning. Therefore, it would be nice to have the line/col information during a semantic action.

However, the only way I can see to get at the line/column information is to actually throw the parse_error exception and let the log function be called.

Any suggestions?

adding mutable lambdas action

OK, I want to contribute CSV or JSON parser examples. However, it is sometimes awkward, since we cannot write something like the following:

auto syntax = R"(
    ROOT  <- _ TOKEN (',' _ TOKEN)*
    TOKEN <- < [a-z0-9]+ > _

    _     <- [ \t\r\n]*
)";

tree<string> sym ;
peg pg(syntax); 
pg["TOKEN"] = [=](const char* s, size_t l, const vector<any>& v){ // this triggers an error: the lambda would need to be mutable, which peglib.h:273 and 276 don't allow
   sym.insert(sym.begin(),string(s,l));
   return string(s,l);
}

This is also useful for generating an AST, or for letting the user manage the data structure. I also have an idea: a PEG grammar is powerful enough to handle some EBNF, with some restrictions on how prioritization is used.

Lint checker, munmap friendly for mingw?

Adding -D_MSC_VER doesn't work either, and I get a lot of errors with GCC 4.9:

In file included from D:/mingw32/i686-w64-mingw32/include/combaseapi.h:154:0,
                 from D:/mingw32/i686-w64-mingw32/include/objbase.h:14,
                 from D:/mingw32/i686-w64-mingw32/include/ole2.h:17,
                 from D:/mingw32/i686-w64-mingw32/include/wtypes.h:12,
                 from D:/mingw32/i686-w64-mingw32/include/winscard.h:10,
                 from D:/mingw32/i686-w64-mingw32/include/windows.h:97,
                 from mmap.h:6,
                 from peglint.cc:11:
D:/mingw32/i686-w64-mingw32/include/unknwnbase.h: In member function 'HRESULT IUnknown::QueryInterface(Q**)':
D:/mingw32/i686-w64-mingw32/include/unknwnbase.h:74:39: error: expected primary-expression before ')' token
       return QueryInterface(__uuidof(Q), (void **)pp);
                                       ^
D:/mingw32/i686-w64-mingw32/include/unknwnbase.h:74:39: error: there are no arguments to '__uuidof' that depend on a template parameter, so a declaration of '__uuidof' must be available [-fpermissive]
D:/mingw32/i686-w64-mingw32/include/unknwnbase.h:74:39: note: (if you use '-fpermissive', G++ will accept your code, but allowing the use of an undeclared name is deprecated)
In file included from D:/mingw32/i686-w64-mingw32/include/urlmon.h:289:0,
                 from D:/mingw32/i686-w64-mingw32/include/objbase.h:163,
                 from D:/mingw32/i686-w64-mingw32/include/ole2.h:17,
                 from D:/mingw32/i686-w64-mingw32/include/wtypes.h:12,
                 from D:/mingw32/i686-w64-mingw32/include/winscard.h:10,
                 from D:/mingw32/i686-w64-mingw32/include/windows.h:97,
                 from mmap.h:6,
                 from peglint.cc:11:
D:/mingw32/i686-w64-mingw32/include/servprov.h: In member function 'HRESULT IServiceProvider::QueryService(const GUID&, Q**)':
D:/mingw32/i686-w64-mingw32/include/servprov.h:66:46: error: expected primary-expression before ')' token
   return QueryService(guidService, __uuidof(Q), (void **)pp);
                                              ^
D:/mingw32/i686-w64-mingw32/include/servprov.h:66:46: error: there are no arguments to '__uuidof' that depend on a template parameter, so a declaration of '__uuidof' must be available [-fpermissive]

Using cpp-peglib inside Circle

Hi, great work on the library. I've been writing my own C++ compiler, and I've found a very novel use for cpp-peglib. A friend of mine has been using cpp-peglib with Circle to generate code targeting an exotic architecture. I decided to port calc3.cc as a tutorial for using cpp-peglib at compile time to define and implement DSLs in C++.

https://github.com/seanbaxter/circle/blob/master/peg_dsl/peg_dsl.md

Basically what I do is #include peglib.h and make a compile-time instance of peg::parser. Circle has an integrated interpreter so any code can be executed during source translation. I beefed up the calc3 grammar by adding an IDENTIFIER rule. Then I create C++ functions that expand a Circle macro and feed it the text to be parsed in a string literal. A Circle macro invokes the parser object (at compile time!), gets back the AST, and traverses the AST using some other macros. IDENTIFIER nodes in the AST are evaluated with @expressions, which lexes, parses and injects the contained text. That is, the result object for an IDENTIFIER node is an lvalue to the object named. The other parts of the AST are processed similarly.

When all this is done, compilation ends, and you're left with a C++ program that has the code you specified in the string lowered to LLVM IR. There is no remnant of the parser in the executable, because its job was to translate the DSL into an AST at compile time.

Since parsing libraries are mostly used to build developer tools anyway, having a C++ compiler serve as a host or scripting language that binds the desired grammar to a C++ frontend via dynamic parsers like cpp-peglib is a real win.

My main project page is here, and there are tons of other examples.
https://www.circle-lang.org/

I think Circle could also simplify the implementation of cpp-peglib. Since the grammar is almost always known at compile time, you could move parsing of that to compile time and benefit from the type system and error handling already built into the compiler. It would allow you to achieve the performance of a statically scheduled compiler (like a hand-written RD parser) while having the expressiveness of the dynamic system you built.

Thanks,
sean

Results differ if using AST mode?

Hi Yuji, this one is a bit odd. I'm using the peglint framework so loading grammars from a text file. If I enable AST mode I get a clean parse and a tree etc. If I use the parse() member function I get a syntax error from the same test. Am I missing something obvious?

        # a grammar to parse the GDB 'machine interface' (GDB/MI)
        GDB_MI          <- (GDB_ELEMENT EOL)*
        GDB_ELEMENT     <- (AT_STRING / NEG_STRING / OP_LIST)
        AT_STRING       <- '@' STRING_LITERAL
        NEG_STRING      <- '~' STRING_LITERAL
        OP_LIST         <- OP_CHAR IDENTIFIER ',' (RESULT_LIST)*
        OP_CHAR         <- ( '~' / '*' / '=' / '+' / '^')
        RESULT_LIST     <- RESULT (COMMA RESULT)*
        RESULT          <- (IDENTIFIER (ASSIGNMENT_OP/HASH_OP))? (STRING_LITERAL / LBRACE (RESULT_LIST)* RBRACE / LBRACK (RESULT_LIST)* RBRACK)
        ~ASSIGNMENT_OP  <- < '=' >
        HASH_OP         <- < '#' >
        LBRACE          <- < '{' >
        RBRACE          <- < '}' >
        LBRACK          <- < '[' >
        RBRACK          <- < ']' >
        ~COMMA          <- < ',' >
        # GDB/MI strings contain escape sequences. The <> ensures the token 
        # content is captured in the STRING_LITERAL AST node
        STRING_LITERAL  <- < '"' (('\\"' / '\\t' / '\\n') / (!["] .))* '"' > 
        # GDB/MI identifiers can contain '-' as in -info-breakpoint
        IDENTIFIER      <- < [_a-zA-Z] ([_A-Za-z0-9] / '-')* > 
        # recognize but ignore end of line characters
        ~EOL            <- '\n'
        # sent at the end of a sequence
        ~TERMINATOR     <- '(gdb)' 
        # consume the following during parse. +1!
        %whitespace     <-  [ \t\r]*

Here's the test string

^done,address="0x4000bb28",load-size="116244",transfer-rate="41104",write-rate="369",BreakpointTable={nr_rows="1",nr_cols="6",hdr=[{width="7",alignment="-1",col_name="number",colhdr="Num"},{width="14",alignment="-1",col_name="type",colhdr="Type"},{width="4",alignment="-1",col_name="disp",colhdr="Disp"},{width="3",alignment="-1",col_name="enabled",colhdr="Enb"},{width="10",alignment="-1",col_name="addr",colhdr="Address"},{width="40",alignment="2",col_name="what",colhdr="What"}],body=[bkpt={number="1",type="breakpoint",disp="keep",enabled="y",addr="0x40003e10",func="main",file="../cyfxbulklpauto.c",fullname="r:\\src\\cypress\\fx3\\usbbulkloopauto\\cyfxbulklpauto.c",line="702",thread-groups=["i1"],times="0",original-location="main"}]}

VS2017 linking errors

Adding target_link_libraries(peglint "Ws2_32.lib") to the peglint CMake target stopped the linker errors.

Noted for future developers.

make 'release' tag?

Just want to know: are you going to make a 'release' or 'tag' on this project? I assume the code is pretty stable by now.

Syntax parsing fails with MSVC but not with GCC

Hey! I am having a problem where a simple syntax fails to parse with the MSVC compiler but works with GCC. Compiling the following code with MSVC (x64) versions 19.00.24213.1 and 18.00.31101 results in a parsing error, i.e. !parser == true. C++14 support is enabled. If 'in?' is removed from the op_cmd rule, it starts working again. This works with GCC 5.3.0. I used the latest cpp-peglib commit 5e67fcb. Any ideas what could be going wrong, or how I could debug it further?

#include <peglib.h>
#include <boost/log/trivial.hpp>

int main()
{
    const auto testSyntax =
    R"(
        main <- op_cmd

        # Set interpolation mode
        intpol_lin      <- 'G01'
        intpol_cw       <- 'G02'
        intpol_ccw      <- 'G03'
        in              <- intpol_lin / intpol_cw / intpol_ccw

        # Operations
        move_intpol     <- 'D01'
        move            <- 'D02'
        flash           <- 'D03'
        sel             <- [XYIJ]
        # If 'in?' is removed, parsing works with VS2015
        op_cmd          <- in? (sel coord)+ (move_intpol / move / flash) '*'

        # General token identifiers
        digit           <- [0-9]
        coord           <- ('+' / '-')? digit+
    )";
    peg::parser parser(testSyntax);

    if(!parser)
    {
        BOOST_LOG_TRIVIAL(error) << "Parser syntax error!";
    }
}

Handling comments?

Hi Yuji

Any thoughts on the best way to handle single line comments, C++ or Python style?

// I am a comment
# ditto

Not sure if this can be easily done in a grammar. I wonder if there could be a %comment directive, as with %whitespace.

I had forgotten how utterly cool this code is for writing parsers. 👍
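In the absence of a %comment directive, one common approach is to fold comments into %whitespace so they are skipped anywhere whitespace is. A sketch (line-ending handling simplified):

```peg
%whitespace <- ([ \t\r\n] / COMMENT)*
COMMENT     <- ('//' / '#') (!'\n' .)*
```

The comment body stops before the newline, which the surrounding whitespace loop then consumes; grammars where line endings are significant would need a more careful variant.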

Tree building notation (III)

Another idea for a tree-annotated grammar. Note the N: prefix. The 0'th node becomes the parent. Then its children are assigned in numerical order. So this makes it really easy to create new nodes with an arbitrary number of children. (Yuji, this is only a partially-considered scribble right now. I'm just posting it here for more brain fodder)

                # annotation equivalent to the default parse tree construction
        0:RESULT_LIST       <- 1:RESULT (',' 2:RESULT)*
        RESULT          <- (NAMED_RESULT / ANON_RESULT)
                # ASSIGNMENT_OP becomes parent to IDENTIFIER and ANON_RESULT
        NAMED_RESULT    <- 1:IDENTIFIER 0:(ASSIGNMENT_OP/HASH_OP) 2:ANON_RESULT
        ANON_RESULT     <- (0:STRING_LITERAL / BRACE_LIST / BRACK_LIST)
                # i.e. here we end up with a new RESULT_LIST node, not a BRACE_LIST
        BRACE_LIST      <- '{' (0:RESULT_LIST)* '}'
        BRACK_LIST      <- '[' (0:RESULT_LIST)* ']'
        ASSIGNMENT_OP   <- < '=' >
        HASH_OP         <- < '#' >

Example code should demonstrate the use of Log

[praise]
I have been using cpp-peglib for a few weeks now, and this library works tremendously well. The API is easy to grasp and its behavior is very predictable. I originally started writing my grammar (a subset of Python's grammar) using boost::spirit, and I stopped when things became unmanageable. I can do a lot more, and more easily, with cpp-peglib, so thank you for that.
[/praise]

One thing I was not happy about with cpp-peglib was that, when I got my grammar wrong, the only thing I saw was a crash at runtime (access through a nullptr). I finally looked more closely at the sources and found two useful things:

  1. It is possible to use the constructor of parser with no argument, and to call load_grammar afterwards. This returns a bool that tells if the grammar is ok.
  2. It is possible to attach a logger function (parser::log), to get the details of what is wrong with our grammar.

It took me a few weeks to realize that these tools were already there, waiting for me to use them. My suggestion is to use them in the example code of the readme file:

Instead of:

    parser parser(syntax);

Use something like this:

    parser parser;
    parser.log = [](size_t line, size_t col, const string& msg) {
      cerr << line << ":" << col << ": " << msg << "\n";
    };
    bool ok = parser.load_grammar(grammar);
    assert(ok);

It is not as pretty as the current version, but it will be a huge help for users to understand the mistakes in their grammars. Alternatively, there could be a default logger installed on all parser instances that can be deactivated if needed.

if enable PackratParsing ,program crash

I set enablePackratParsing = 1, and my PEG is:
auto syntax = R"xyz(decl<-decl_specs init_declarator_list ';'

	decl_specs<-'char' / 'short' / 'int' / 'long' / 'float'

	init_declarator_list<-direct_declarator
	/ direct_declarator '=' initializer

	direct_declarator<-id


	initializer<-additive

	additive<-left:multiplicative "+" right : additive
	/ multiplicative

	multiplicative<-left : primary "*" right : multiplicative
	/ primary

	primary<-primary_exp
	/ "(" additive : additive ")"

	primary_exp<-id
	/ const)xyz";
  parser parser(syntax);

When I debug, I set a breakpoint after "parser parser(syntax);", but the program crashes before reaching it.

Tree building notation (II)

Here's the annotated grammar. Note the ^ prefix: the notation means take the name and value of the prefixed node and transfer them to its parent, in effect transforming an n-ary node into a suitably configured binary node. Such a rotated/transformed tree is much simpler to walk.

        RESULT_LIST     <- RESULT (',' RESULT)*
        RESULT          <- (NAMED_RESULT / ANON_RESULT)
        NAMED_RESULT    <- IDENTIFIER ^(ASSIGNMENT_OP/HASH_OP) ANON_RESULT
        ANON_RESULT     <- (STRING_LITERAL / BRACE_LIST / BRACK_LIST)
        BRACE_LIST      <- '{' (RESULT_LIST)* '}'
        BRACK_LIST      <- '[' (RESULT_LIST)* ']'
        ASSIGNMENT_OP   <- < '=' >
        HASH_OP         <- < '#' >

Standard output:

     + RESULT (NAMED_RESULT)
      - IDENTIFIER: 'times'
      - ASSIGNMENT_OP: '='
      - ANON_RESULT (STRING_LITERAL): '"0"'
     + RESULT (NAMED_RESULT)
      - IDENTIFIER: 'original-location'
      - ASSIGNMENT_OP: '='
      - ANON_RESULT (STRING_LITERAL): '"main"'

Transformed output:

     +  ASSIGNMENT_OP: '='
      - IDENTIFIER: 'times'
      - ANON_RESULT (STRING_LITERAL): '"0"'
     + ASSIGNMENT_OP: '='
      - IDENTIFIER: 'original-location'
      - ANON_RESULT (STRING_LITERAL): '"main"'

[Question] Can you match c++ raw string literals?

Hi! Great job on the library! I love that it's just a single header file! I have a question, though:

A raw string literal in C++ looks like this:
R"CustomDelimiter(any text you could possibly want)CustomDelimiter"

Can this somehow be parsed with cpp-peglib? It would have to be something like this:

CPPRAWSTRING <- 'R"' '(' .* ')' {token0} '"'

Is that possible? If not with a simple grammar, maybe using enter and leave?
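This is exactly what cpp-peglib's named capture and backreference operators ($name< ... > and $name) are for: capture the delimiter, then require the same text before the closing quote. A sketch (the delimiter character class is simplified; real d-char-seqs allow more characters):

```peg
# $delim< ... > captures the delimiter; the later $delim must match
# the same text, ending the raw string at the right spot.
CPPRAWSTRING <- 'R"' $delim< [a-zA-Z_]* > '(' (!(')' $delim '"') .)* ')' $delim '"'
```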

peglint example crashes

peglint exits with an error when trying to reproduce the tutorial example:

./peglint --ast --opt --source "1 + 2 * 3" a.peg
terminate called after throwing an instance of 'std::system_error'
  what():  Unknown error -1

Seems to be a regression of #46

My environment is: Linux Mint 19.1, gcc 7.3.0.
