OK, so I know the whole idea behind a "keyword" may not make sense in PEG, but most co

At least it would be possible for rules like <code class="notranslat

On Tue, Jul 3, 2012 at 11:04 PM, Roman D. Boiko <a href="mailto:reply@reply.github

Keywords,about philippesigaud/pegged

Comments (37)

PhilippeSigaud commented on July 18, 2024

What would the effect be, exactly?

from pegged.

alexrp commented on July 18, 2024

I imagine simply a parse error anywhere where the keyword is encountered (separated by spaces) and not expected, i.e. suppose I say 'foo' is a keyword and my language allows syntax such as type ident = expr;:

int foo = 1;

This would be an error because foo is a keyword and was not explicitly specified to be allowed at this point in the grammar.

Does that make sense?

from pegged.

PhilippeSigaud commented on July 18, 2024

Does that make sense?

Yes that makes sense. There are articles on Bryan Ford site
(http://bford.info/packrat/), for example:

"Parsing Expression Grammar as a primitive recursive-descent parser
with backtracking" Roman Redziejowski, CS&P'2006, September 2006.

The guy coded the Java grammar in PEG and wanted to get keywords out
of the way. So he defined identifier to be:

Identifier <- !ReservedWords StandardIdentifierDefinition
ReservedWords <- "abstract" / "byte" / ...

The problem with that is that any time you encounter a putative
identifier, you need to test for all keywords. At the very least,
Pegged would need to recognize a list of string literals and use a
special algorithm for them (maybe a switch).

from pegged.

commented on July 18, 2024

I'd like to write a trie-based lookup for any string literals inside grammar. Or even for any terminal. However, I've spent too little time browsing Pegged code and documentation, and thus not sure this is a good idea. For example, I'm not sure this would behave nice with respect to ordered or (/), etc.

Would you like to have something like that? I can create a draft implementation for further discussion.

from pegged.

commented on July 18, 2024

At least it would be possible for rules like

Keyword <- "abstract" / "alias" / "align" / [* all the other keywords sorted lexicographically *]

from pegged.

commented on July 18, 2024

Another option is to introduce a predicate like bool isEitherOf(...) that would perform a fast (trie?) lookup and could be used inside a rule:

Identifier <- !Keyword [_a-zA-Z] [_a-zA-Z0-9]* Keyword <- &isEitherOf(...) [_a-zA-Z] [_a-zA-Z0-9]*

(excuse me if syntax is ugly, that should be fixable)

from pegged.

PhilippeSigaud commented on July 18, 2024

On Tue, Jul 3, 2012 at 11:04 PM, Roman D. Boiko
[email protected]
wrote:

I'd like to write a trie-based lookup for any string literals inside grammar. Or even for any terminal. However, I've spent too little time browsing Pegged code and documentation, and thus not sure this is a good idea.

Would you like to have something like that? I can create a draft implementation for further discussion.

First, thanks for looking at Pegged :)

I'm interested in tries, even if they are not used right there. I mean, they should be part of std.collection.

Do not forget that with PEGs, choices are ordered. A trie (AFACT) does not distinguish between "A" / "B" / "C" and "C" / "B" / "A", or rather
cannot insure the 'A' case will be tested first.

Moreover, most of the lists are 3-6 elements long, apart from the D keyword list which reaches 100+. I think having the engine detect a
list of or-ed strings is a very good idea. But I wonder whether a trie is then particularly faster than a simple linear lookup in an array of
switch on compile-time-known strings.

from pegged.

commented on July 18, 2024

I'll benchmark a few cases, e.g., linear lookup, binary lookup, switch over string values, switches over characters (top level corresponding to the first character of every keyword, then the first nested switch (inside some top-level branch) corresponding to the second character, and so on), and probably some more.

I'm mostly interested in a particular use case of D keywords.

Anyway I wonder which API would be preferable for that.

from pegged.

commented on July 18, 2024

Total # of Identifiers & Keywords: 28500

from them # of Keywords: 6505

isKeyword_Dictionary
Duration: 1331[mks]

isKeyword_SwitchChar
Duration: 1888[mks]

isKeyword_SwitchString
Duration: 2303[mks]

isKeyword_LinearArrayLookup
Duration: 1632344[mks]

isKeyword_BinaryArrayLookup
Duration: 17117[mks]

Total # of Identifiers & Keywords: 121671

from them # of Keywords: 17488

isKeyword_Dictionary
Duration: 4306[mks]

isKeyword_SwitchChar
Duration: 6735[mks]

isKeyword_SwitchString
Duration: 9207[mks]

isKeyword_LinearArrayLookup
Duration: 21460119[mks]

isKeyword_BinaryArrayLookup
Duration: 70797[mks]

from pegged.

commented on July 18, 2024

Tested on two Phobos files with largest number of identifiers.

from pegged.

commented on July 18, 2024

Later I will also try switching by identifier length and then doing either dictionary, or some other (binary?) lookup among only keywords of that length. And also will check trie created by Dmitry Olshansky.

from pegged.

PhilippeSigaud commented on July 18, 2024

On Wed, Jul 4, 2012 at 4:36 PM, Roman D. Boiko
[email protected]
wrote:

Later I will also try switching by identifier length and then doing either dictionary, or some other (binary?) lookup among only keywords of that length. And also will check trie created by Dmitry Olshansky.

Thanks.

Does 'dictionnary" means the built-in D AA's?

from pegged.

commented on July 18, 2024

Yes

from pegged.

PhilippeSigaud commented on July 18, 2024

And you used an list of about 100 keywords, right?

In that case, I'm impressed by their speed (supposing lower mks means faster)

from pegged.

commented on July 18, 2024

D keywords on D sources. Times are total for all identifiers, so I'm also impressed. However, LinearArrayLookup on the largest Phobos file would consume 21s :)

from pegged.

commented on July 18, 2024

There was a bug in linear lookup (I used enum as array, so it was allocating on each access).

Corrected numbers:

isKeyword_Dictionary: 1341 [microsec] total, 241 [nanosec / lookup].
isKeyword_SwitchChar: 2838 [microsec] total, 511 [nanosec / lookup].
isKeyword_SwitchString: 3291 [microsec] total, 592 [nanosec / lookup].
isKeyword_BinaryArrayLookup: 19706 [microsec] total, 3549 [nanosec / lookup].
isKeyword_LinearArrayLookup: 20412 [microsec] total, 5223 [nanosec / lookup].
Total: 24592 identifiers + 3908 keywords.

isKeyword_Dictionary: 4754 [microsec] total, 271 [nanosec / lookup].
isKeyword_SwitchChar: 10717 [microsec] total, 612 [nanosec / lookup].
isKeyword_SwitchString: 14535 [microsec] total, 831 [nanosec / lookup].
isKeyword_BinaryArrayLookup: 79191 [microsec] total, 4528 [nanosec / lookup].
isKeyword_LinearArrayLookup: 89874 [microsec] total, 7444 [nanosec / lookup].
Total: 109598 identifiers + 12073 keywords.

from pegged.

commented on July 18, 2024

Benchmarks are at https://github.com/roman-d-boiko/dct/tree/experimental

make
make run

(see fe/lexer.d and sample/client.d for actual code).

There will be a lot of diagnostics output from std.uni, which is non-standard, just ignore that.

from pegged.

commented on July 18, 2024

isKeyword_Dictionary: 1406 [microsec] total, 253 [nanosec / lookup].
isKeyword_SwitchByLengthThenByChar: 1372 [microsec] total, 247 [nanosec / lookup].
isKeyword_SwitchChar: 1866 [microsec] total, 336 [nanosec / lookup].
isKeyword_SwitchString: 2218 [microsec] total, 399 [nanosec / lookup].
isKeyword_BinaryArrayLookup: 6067 [microsec] total, 1092 [nanosec / lookup].
isKeyword_LinearArrayLookup: 14766 [microsec] total, 2660 [nanosec / lookup].
Total: 22949 identifiers + 5551 keywords.

isKeyword_Dictionary: 4354 [microsec] total, 248 [nanosec / lookup].
isKeyword_SwitchByLengthThenByChar: 3955 [microsec] total, 226 [nanosec / lookup].
isKeyword_SwitchChar: 5708 [microsec] total, 326 [nanosec / lookup].
isKeyword_SwitchString: 9281 [microsec] total, 530 [nanosec / lookup].
isKeyword_BinaryArrayLookup: 22881 [microsec] total, 1308 [nanosec / lookup].
isKeyword_LinearArrayLookup: 68785 [microsec] total, 3933 [nanosec / lookup].
Total: 104183 identifiers + 17488 keywords.

Added benchmark for switch on length and nested switches for each character inside. That one has performance similar to AA, and its implementation is still far from optimal.

Switched DMD options to -release -O.

from pegged.

commented on July 18, 2024

Added -inline and replaced two slower switch-based algorithms with trie-based algorithms by Dmitry Olshansky. Also added a trivial loop over string characters as baseline.

isKeyword_Dummy: 1403 [microsec] total, 25 [nanosec / lookup].
isKeyword_Dictionary: 4657 [microsec] total, 266 [nanosec / lookup].
isKeyword_SwitchByLengthThenByChar: 2286 [microsec] total, 130 [nanosec / lookup].
isKeyword_BinaryArrayLookup: 13800 [microsec] total, 789 [nanosec / lookup].
isKeyword_LinearArrayLookup: 63555 [microsec] total, 3634 [nanosec / lookup].
isKeyword_UnicodeTrie: 4496 [microsec] total, 257 [nanosec / lookup].
isKeyword_UnicodeTrieBoolLookup: 3445 [microsec] total, 196 [nanosec / lookup]
Total: 104183 identifiers + 17488 keywords.

Now AA (aka Dictionary) is clearly not the winner.

from pegged.

commented on July 18, 2024

isKeyword_Dummy (baseline): 427 [microsec] total, 28 [nanosec / lookup].
isKeyword_Dictionary: 1190 [microsec] total, 214 [nanosec / lookup].
isKeyword_SwitchByLengthThenByChar: 466 [microsec] total, 83 [nanosec / lookup].
isKeyword_BinaryArrayLookup: 2722 [microsec] total, 490 [nanosec / lookup].
isKeyword_LinearArrayLookup: 13822 [microsec] total, 2490 [nanosec / lookup].
isKeyword_UnicodeTrie: 1317 [microsec] total, 237 [nanosec / lookup].
isKeyword_UnicodeTrieBoolLookup: 1072 [microsec] total, 193 [nanosec / lookup].
Total: 22949 identifiers + 5551 keywords.

isKeyword_Dummy (baseline): 2738 [microsec] total, 50 [nanosec / lookup].
isKeyword_Dictionary: 4247 [microsec] total, 242 [nanosec / lookup].
isKeyword_SwitchByLengthThenByChar: 1593 [microsec] total, 91 [nanosec / lookup].
isKeyword_BinaryArrayLookup: 14351 [microsec] total, 820 [nanosec / lookup].
isKeyword_LinearArrayLookup: 59564 [microsec] total, 3405 [nanosec / lookup].
isKeyword_UnicodeTrie: 4167 [microsec] total, 238 [nanosec / lookup].
isKeyword_UnicodeTrieBoolLookup: 3466 [microsec] total, 198 [nanosec / lookup].
Total: 104183 identifiers + 17488 keywords.

(Switch code optimization)

from pegged.

alexrp commented on July 18, 2024

Hey @PhilippeSigaud what's the status on this? Does Pegged have some kind of built-in support for keyword recognition yet?

from pegged.

PhilippeSigaud commented on July 18, 2024

On Thu, Sep 27, 2012 at 2:00 PM, Alex Rønne Petersen <
[email protected]> wrote:

Hey @PhilippeSigaud https://github.com/PhilippeSigaud what's the status
on this? Does Pegged have some kind of built-in support for keyword
recognition yet?

No. I coded a trie a few days ago, but performance was disappointing. Using
switch is quite efficient, but does not play well with the rest: kw do not
all have the same length, so I cannot do

switch(inputPrefix)
{
     case "class":
     case "interface":
}

I'll go back to producing a compile-time AA and if that does not work, my
next step will be to generate basic code like this:

if (input[0..5] == "class" || input[0..9] == "interface" || ...)
    return true;
else
    return false;

If that's a blocking point for you, I can lift it up my priority list and
try to code it tonight.

from pegged.

PhilippeSigaud commented on July 18, 2024

OK, done. I introduced a specific check for long choices of literals, like "abstract" / "alias" / "align" /...

Pegged now detects these and automatically generates specific, lower-level code to match them.
My tests give a 150 to 800% speed-up in keyword matching.

So, for your identifiers rules, use:

Keywords <- "abstract" / "alias" / ...  # Keyword list -> specific code is created
Identifier <- !Keywords identifier     # identifier is a predefined rule

Note that Pegged only does that for alternate choices of literals. If there is anything else in the choices, it won't create it. Maybe as a future add-on?

Rule <- "abc" / "def" / .  # No specific code

from pegged.

alexrp commented on July 18, 2024

This sounds great. I'll give it a go either sometime tonight or tomorrow.

Unrelated, but: What exactly is the regenerate binary supposed to do? After a fresh clone, make, and running it, it appears it generates a different parser than what's currently in the repo. Intended?

from pegged.

PhilippeSigaud commented on July 18, 2024

On Thu, Sep 27, 2012 at 10:22 PM, Alex Rønne Petersen <
[email protected]> wrote:

This sounds great. I'll give it a go either sometime tonight or tomorrow.

Unrelated, but: What exactly is the regenerate binary supposed to do?
After a fresh clone, make, and running it, it appears it generates a
different parser than what's currently in the repo. Intended?

Regenerate recreates the Pegged parser. So, each time pegged.grammar or
pegged.peg are changed, I should use it to maintain the parser up to date
with the lastest engine evolution. I think I forgot to do this for a few
days.

from pegged.

alexrp commented on July 18, 2024

OK, I'm now defining keywords like this:

keywords <- "rec" / ....
identifier <~ !keywords [a-zA-Z_] [a-zA-Z_0-9]*

Later I have a rule like this:

recordDeclaration < :"rec" identifier ....

To which I feed:

rec record ....

Note that the identifier starts with "rec" which is a keyword. The identifier rule thus thinks it's dealing with a keyword even though the full identifier isn't such.

How can I work around this?

from pegged.

PhilippeSigaud commented on July 18, 2024

So you need a compulsory space after a keyword.

I see the following possibilities:

change your ìdentifier` rule to:

identifier <~ !(keywords Spacing) [a-zA-Z_] [a-zA-Z_0-9]*`

I change the way Pegged deals to string choices like your keywords
rules and add an automatic space parser after each possibility. I think
it's not a good idea, because some other user might use these choice list
in another, incompatible way.

from pegged.

alexrp commented on July 18, 2024

So you need a compulsory space after a keyword.

No, imagine code like:

rec record{ ....

It can be anything except the things that make up an identifier, really.

from pegged.

alexrp commented on July 18, 2024

I suppose I can just use . instead of Spacing but I'm not sure if there are any catches to that...

from pegged.

PhilippeSigaud commented on July 18, 2024

No, imagine code like:

rec record{ ....

It can be anything except the things that make up an identifier, really.

I'm not sure I get it. 'rec' is a keyword, 'record' is an identifier,
right? I thought the problem was that 'record' was parsed as 'rec' 'ord'.

This works for me:

import pegged.grammar;

mixin(grammar(`
TEST:
 recordDeclaration < :"rec" identifier '{' '}'

 keywords <- "rec" / "foo" / "bar"
 identifier <~ !(keywords Spacing) [a-zA-Z_] [a-zA-Z_0-9]*
`));

void main()
{
    string input =  "rec record { }";
    writeln(TEST(input));
}

from pegged.

alexrp commented on July 18, 2024

I thought the problem was that 'record' was parsed as 'rec' 'ord'.

I don't know what the problem actually is... Somehow (without that extra Spacing added) the main rule of the grammar is marked as failure and half of the tree is gone (but with no further indication of where the parse failed). But if I change the identifier to reecord or whatever that doesn't start with rec things work just fine. So, the identifier rule needs to somehow deal with the fact that once it encounters something that's not part of an identifier, the rule must succeed (which is what I assume my original identifier rule didn't).

I hope this makes sense...

from pegged.

PhilippeSigaud commented on July 18, 2024

On Fri, Oct 5, 2012 at 5:50 PM, Alex Rønne Petersen <
[email protected]> wrote:

I suppose I can just use . instead of Spacing but I'm not sure if there
are any catches to that...

That's might be problematic for keywords that are prefixes of another
keyword, maybe (for / foreach).
Why don't you want to use Spacing?

from pegged.

PhilippeSigaud commented on July 18, 2024

On Fri, Oct 5, 2012 at 7:32 PM, Alex Rønne Petersen <
[email protected]> wrote:

I thought the problem was that 'record' was parsed as 'rec' 'ord'.

I don't know what the problem actually is... Somehow (without that extra
Spacing added) the main rule of the grammar is marked as failure and half
of the tree is gone (but with no further indication of where the parse
failed). But if I change the identifier to reecord or whatever that
doesn't start with rec things work just fine. So, the identifier rule
needs to somehow deal with the fact that once it encounters something
that's not part of an identifier, the rule must succeed (which is what I
assume my original identifier rule didn't).

That's what you rule does.But since keywords succeeds on a prefix of
'record' then !keywords fails. And thus identifier, being a sequence
with its first subrule which failed, also fails. When a sequence fails, its
last child is the failed node. In this case, there is only one child,
created by !keywords, but this child, being anonymous (not from a grammar
named rule) is cut by the tree decimation. Hence the failure with no child.

from pegged.

alexrp commented on July 18, 2024

So would a correct rule be:

identifier <~ !(keywords .) [a-zA-Z_] [a-zA-Z_0-9]*

from pegged.

PhilippeSigaud commented on July 18, 2024

So would a correct rule be:

identifier <~ !(keywords .) [a-zA-Z_] [a-zA-Z_0-9]*

Well, no. Let say your keywords are "int" / ... and the "interface"
identifier "int_a".
Then keywords . will succeeds on "int_" and (!keywords .) will fail,
bringing identifier down.

Really, the only correct rule I see is:

identifier <~ !(keywords Spacing) [a-zA-Z_] [a-zA-Z_0-9]*

This still fail for "int;": (keywords Spacing) fails on "int;", thus
!(...) succeeds and pass the relay to [a-zA-Z_]... which will happily
match on "int".

So I guess the real rule should be:

identifier <~ !(keywords Separator) [a-zA-Z_] [a-zA-Z_0-9]*
Separator <- Spacing / ';'     # Or anything else needed in your language

from pegged.

alexrp commented on July 18, 2024

So Separator would have to contain any non-keyword, non-identifier token that might exist in the language, I suppose?

from pegged.

PhilippeSigaud commented on July 18, 2024

On Sat, Oct 6, 2012 at 10:28 AM, Alex Rønne Petersen <
[email protected]> wrote:

So Separator would have to contain any non-keyword, non-identifier token
that might exist in the language, I suppose?

Not so much. It's only to eliminate cases like "int;" or "int}" in places
where "identifier;" or "identifier}" would make sense. The second condition
is important, because in most case "identifier separator-token" would not
make sense.

from pegged.

Keywords about pegged HOT 37 CLOSED

Comments (37)

Total # of Identifiers & Keywords: 28500

Total # of Identifiers & Keywords: 121671

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent