Comments (37)
What would the effect be, exactly?
from pegged.
I imagine simply a parse error anywhere where the keyword is encountered (separated by spaces) and not expected, i.e. suppose I say 'foo' is a keyword and my language allows syntax such as type ident = expr;
:
int foo = 1;
This would be an error because foo is a keyword and was not explicitly specified to be allowed at this point in the grammar.
Does that make sense?
from pegged.
Does that make sense?
Yes that makes sense. There are articles on Bryan Ford site
(http://bford.info/packrat/), for example:
"Parsing Expression Grammar as a primitive recursive-descent parser
with backtracking" Roman Redziejowski, CS&P'2006, September 2006.
The guy coded the Java grammar in PEG and wanted to get keywords out
of the way. So he defined identifier to be:
Identifier <- !ReservedWords StandardIdentifierDefinition
ReservedWords <- "abstract" / "byte" / ...
The problem with that is that any time you encounter a putative
identifier, you need to test for all keywords. At the very least,
Pegged would need to recognize a list of string literals and use a
special algorithm for them (maybe a switch).
from pegged.
I'd like to write a trie-based lookup for any string literals inside grammar. Or even for any terminal. However, I've spent too little time browsing Pegged code and documentation, and thus not sure this is a good idea. For example, I'm not sure this would behave nice with respect to ordered or (/
), etc.
Would you like to have something like that? I can create a draft implementation for further discussion.
from pegged.
At least it would be possible for rules like
Keyword <- "abstract" / "alias" / "align" / [* all the other keywords sorted lexicographically *]
from pegged.
Another option is to introduce a predicate like bool isEitherOf(...)
that would perform a fast (trie?) lookup and could be used inside a rule:
Identifier <- !Keyword [_a-zA-Z] [_a-zA-Z0-9]* Keyword <- &isEitherOf(...) [_a-zA-Z] [_a-zA-Z0-9]*
(excuse me if syntax is ugly, that should be fixable)
from pegged.
On Tue, Jul 3, 2012 at 11:04 PM, Roman D. Boiko
[email protected]
wrote:
I'd like to write a trie-based lookup for any string literals inside grammar. Or even for any terminal. However, I've spent too little time browsing Pegged code and documentation, and thus not sure this is a good idea.
Would you like to have something like that? I can create a draft implementation for further discussion.
First, thanks for looking at Pegged :)
I'm interested in tries, even if they are not used right there. I mean, they should be part of std.collection.
Do not forget that with PEGs, choices are ordered. A trie (AFACT) does not distinguish between "A" / "B" / "C" and "C" / "B" / "A", or rather
cannot insure the 'A' case will be tested first.
Moreover, most of the lists are 3-6 elements long, apart from the D keyword list which reaches 100+. I think having the engine detect a
list of or-ed strings is a very good idea. But I wonder whether a trie is then particularly faster than a simple linear lookup in an array of
switch on compile-time-known strings.
from pegged.
I'll benchmark a few cases, e.g., linear lookup, binary lookup, switch over string values, switches over characters (top level corresponding to the first character of every keyword, then the first nested switch (inside some top-level branch) corresponding to the second character, and so on), and probably some more.
I'm mostly interested in a particular use case of D keywords.
Anyway I wonder which API would be preferable for that.
from pegged.
Total # of Identifiers & Keywords: 28500
from them # of Keywords: 6505
isKeyword_Dictionary
Duration: 1331[mks]
isKeyword_SwitchChar
Duration: 1888[mks]
isKeyword_SwitchString
Duration: 2303[mks]
isKeyword_LinearArrayLookup
Duration: 1632344[mks]
isKeyword_BinaryArrayLookup
Duration: 17117[mks]
Total # of Identifiers & Keywords: 121671
from them # of Keywords: 17488
isKeyword_Dictionary
Duration: 4306[mks]
isKeyword_SwitchChar
Duration: 6735[mks]
isKeyword_SwitchString
Duration: 9207[mks]
isKeyword_LinearArrayLookup
Duration: 21460119[mks]
isKeyword_BinaryArrayLookup
Duration: 70797[mks]
from pegged.
Tested on two Phobos files with largest number of identifiers.
from pegged.
Later I will also try switching by identifier length and then doing either dictionary, or some other (binary?) lookup among only keywords of that length. And also will check trie created by Dmitry Olshansky.
from pegged.
On Wed, Jul 4, 2012 at 4:36 PM, Roman D. Boiko
[email protected]
wrote:
Later I will also try switching by identifier length and then doing either dictionary, or some other (binary?) lookup among only keywords of that length. And also will check trie created by Dmitry Olshansky.
Thanks.
Does 'dictionnary" means the built-in D AA's?
from pegged.
Yes
from pegged.
And you used an list of about 100 keywords, right?
In that case, I'm impressed by their speed (supposing lower mks means faster)
from pegged.
D keywords on D sources. Times are total for all identifiers, so I'm also impressed. However, LinearArrayLookup on the largest Phobos file would consume 21s :)
from pegged.
There was a bug in linear lookup (I used enum as array, so it was allocating on each access).
Corrected numbers:
isKeyword_Dictionary: 1341 [microsec] total, 241 [nanosec / lookup].
isKeyword_SwitchChar: 2838 [microsec] total, 511 [nanosec / lookup].
isKeyword_SwitchString: 3291 [microsec] total, 592 [nanosec / lookup].
isKeyword_BinaryArrayLookup: 19706 [microsec] total, 3549 [nanosec / lookup].
isKeyword_LinearArrayLookup: 20412 [microsec] total, 5223 [nanosec / lookup].
Total: 24592 identifiers + 3908 keywords.
isKeyword_Dictionary: 4754 [microsec] total, 271 [nanosec / lookup].
isKeyword_SwitchChar: 10717 [microsec] total, 612 [nanosec / lookup].
isKeyword_SwitchString: 14535 [microsec] total, 831 [nanosec / lookup].
isKeyword_BinaryArrayLookup: 79191 [microsec] total, 4528 [nanosec / lookup].
isKeyword_LinearArrayLookup: 89874 [microsec] total, 7444 [nanosec / lookup].
Total: 109598 identifiers + 12073 keywords.
from pegged.
Benchmarks are at https://github.com/roman-d-boiko/dct/tree/experimental
make
make run
(see fe/lexer.d and sample/client.d for actual code).
There will be a lot of diagnostics output from std.uni, which is non-standard, just ignore that.
from pegged.
isKeyword_Dictionary: 1406 [microsec] total, 253 [nanosec / lookup].
isKeyword_SwitchByLengthThenByChar: 1372 [microsec] total, 247 [nanosec / lookup].
isKeyword_SwitchChar: 1866 [microsec] total, 336 [nanosec / lookup].
isKeyword_SwitchString: 2218 [microsec] total, 399 [nanosec / lookup].
isKeyword_BinaryArrayLookup: 6067 [microsec] total, 1092 [nanosec / lookup].
isKeyword_LinearArrayLookup: 14766 [microsec] total, 2660 [nanosec / lookup].
Total: 22949 identifiers + 5551 keywords.
isKeyword_Dictionary: 4354 [microsec] total, 248 [nanosec / lookup].
isKeyword_SwitchByLengthThenByChar: 3955 [microsec] total, 226 [nanosec / lookup].
isKeyword_SwitchChar: 5708 [microsec] total, 326 [nanosec / lookup].
isKeyword_SwitchString: 9281 [microsec] total, 530 [nanosec / lookup].
isKeyword_BinaryArrayLookup: 22881 [microsec] total, 1308 [nanosec / lookup].
isKeyword_LinearArrayLookup: 68785 [microsec] total, 3933 [nanosec / lookup].
Total: 104183 identifiers + 17488 keywords.
Added benchmark for switch on length and nested switches for each character inside. That one has performance similar to AA, and its implementation is still far from optimal.
Switched DMD options to -release -O.
from pegged.
Added -inline and replaced two slower switch-based algorithms with trie-based algorithms by Dmitry Olshansky. Also added a trivial loop over string characters as baseline.
isKeyword_Dummy: 1403 [microsec] total, 25 [nanosec / lookup].
isKeyword_Dictionary: 4657 [microsec] total, 266 [nanosec / lookup].
isKeyword_SwitchByLengthThenByChar: 2286 [microsec] total, 130 [nanosec / lookup].
isKeyword_BinaryArrayLookup: 13800 [microsec] total, 789 [nanosec / lookup].
isKeyword_LinearArrayLookup: 63555 [microsec] total, 3634 [nanosec / lookup].
isKeyword_UnicodeTrie: 4496 [microsec] total, 257 [nanosec / lookup].
isKeyword_UnicodeTrieBoolLookup: 3445 [microsec] total, 196 [nanosec / lookup]
Total: 104183 identifiers + 17488 keywords.
Now AA (aka Dictionary) is clearly not the winner.
from pegged.
isKeyword_Dummy (baseline): 427 [microsec] total, 28 [nanosec / lookup].
isKeyword_Dictionary: 1190 [microsec] total, 214 [nanosec / lookup].
isKeyword_SwitchByLengthThenByChar: 466 [microsec] total, 83 [nanosec / lookup].
isKeyword_BinaryArrayLookup: 2722 [microsec] total, 490 [nanosec / lookup].
isKeyword_LinearArrayLookup: 13822 [microsec] total, 2490 [nanosec / lookup].
isKeyword_UnicodeTrie: 1317 [microsec] total, 237 [nanosec / lookup].
isKeyword_UnicodeTrieBoolLookup: 1072 [microsec] total, 193 [nanosec / lookup].
Total: 22949 identifiers + 5551 keywords.
isKeyword_Dummy (baseline): 2738 [microsec] total, 50 [nanosec / lookup].
isKeyword_Dictionary: 4247 [microsec] total, 242 [nanosec / lookup].
isKeyword_SwitchByLengthThenByChar: 1593 [microsec] total, 91 [nanosec / lookup].
isKeyword_BinaryArrayLookup: 14351 [microsec] total, 820 [nanosec / lookup].
isKeyword_LinearArrayLookup: 59564 [microsec] total, 3405 [nanosec / lookup].
isKeyword_UnicodeTrie: 4167 [microsec] total, 238 [nanosec / lookup].
isKeyword_UnicodeTrieBoolLookup: 3466 [microsec] total, 198 [nanosec / lookup].
Total: 104183 identifiers + 17488 keywords.
(Switch code optimization)
from pegged.
Hey @PhilippeSigaud what's the status on this? Does Pegged have some kind of built-in support for keyword recognition yet?
from pegged.
On Thu, Sep 27, 2012 at 2:00 PM, Alex Rønne Petersen <
[email protected]> wrote:
Hey @PhilippeSigaud https://github.com/PhilippeSigaud what's the status
on this? Does Pegged have some kind of built-in support for keyword
recognition yet?
No. I coded a trie a few days ago, but performance was disappointing. Using
switch is quite efficient, but does not play well with the rest: kw do not
all have the same length, so I cannot do
switch(inputPrefix)
{
case "class":
case "interface":
}
I'll go back to producing a compile-time AA and if that does not work, my
next step will be to generate basic code like this:
if (input[0..5] == "class" || input[0..9] == "interface" || ...)
return true;
else
return false;
If that's a blocking point for you, I can lift it up my priority list and
try to code it tonight.
from pegged.
OK, done. I introduced a specific check for long choices of literals, like "abstract" / "alias" / "align" /...
Pegged now detects these and automatically generates specific, lower-level code to match them.
My tests give a 150 to 800% speed-up in keyword matching.
So, for your identifiers rules, use:
Keywords <- "abstract" / "alias" / ... # Keyword list -> specific code is created
Identifier <- !Keywords identifier # identifier is a predefined rule
Note that Pegged only does that for alternate choices of literals. If there is anything else in the choices, it won't create it. Maybe as a future add-on?
Rule <- "abc" / "def" / . # No specific code
from pegged.
This sounds great. I'll give it a go either sometime tonight or tomorrow.
Unrelated, but: What exactly is the regenerate binary supposed to do? After a fresh clone, make
, and running it, it appears it generates a different parser than what's currently in the repo. Intended?
from pegged.
On Thu, Sep 27, 2012 at 10:22 PM, Alex Rønne Petersen <
[email protected]> wrote:
This sounds great. I'll give it a go either sometime tonight or tomorrow.
Unrelated, but: What exactly is the regenerate binary supposed to do?
After a fresh clone, make, and running it, it appears it generates a
different parser than what's currently in the repo. Intended?
Regenerate recreates the Pegged parser. So, each time pegged.grammar or
pegged.peg are changed, I should use it to maintain the parser up to date
with the lastest engine evolution. I think I forgot to do this for a few
days.
from pegged.
OK, I'm now defining keywords like this:
keywords <- "rec" / ....
identifier <~ !keywords [a-zA-Z_] [a-zA-Z_0-9]*
Later I have a rule like this:
recordDeclaration < :"rec" identifier ....
To which I feed:
rec record ....
Note that the identifier starts with "rec" which is a keyword. The identifier
rule thus thinks it's dealing with a keyword even though the full identifier isn't such.
How can I work around this?
from pegged.
So you need a compulsory space after a keyword.
I see the following possibilities:
- change your ìdentifier` rule to:
identifier <~ !(keywords Spacing) [a-zA-Z_] [a-zA-Z_0-9]*`
- I change the way Pegged deals to string choices like your
keywords
rules and add an automatic space parser after each possibility. I think
it's not a good idea, because some other user might use these choice list
in another, incompatible way.
from pegged.
So you need a compulsory space after a keyword.
No, imagine code like:
rec record{ ....
It can be anything except the things that make up an identifier, really.
from pegged.
I suppose I can just use .
instead of Spacing
but I'm not sure if there are any catches to that...
from pegged.
No, imagine code like:
rec record{ ....
It can be anything except the things that make up an identifier, really.
I'm not sure I get it. 'rec' is a keyword, 'record' is an identifier,
right? I thought the problem was that 'record' was parsed as 'rec' 'ord'.
This works for me:
import pegged.grammar;
mixin(grammar(`
TEST:
recordDeclaration < :"rec" identifier '{' '}'
keywords <- "rec" / "foo" / "bar"
identifier <~ !(keywords Spacing) [a-zA-Z_] [a-zA-Z_0-9]*
`));
void main()
{
string input = "rec record { }";
writeln(TEST(input));
}
from pegged.
I thought the problem was that 'record' was parsed as 'rec' 'ord'.
I don't know what the problem actually is... Somehow (without that extra Spacing
added) the main rule of the grammar is marked as failure and half of the tree is gone (but with no further indication of where the parse failed). But if I change the identifier to reecord
or whatever that doesn't start with rec
things work just fine. So, the identifier
rule needs to somehow deal with the fact that once it encounters something that's not part of an identifier, the rule must succeed (which is what I assume my original identifier
rule didn't).
I hope this makes sense...
from pegged.
On Fri, Oct 5, 2012 at 5:50 PM, Alex Rønne Petersen <
[email protected]> wrote:
I suppose I can just use . instead of Spacing but I'm not sure if there
are any catches to that...
That's might be problematic for keywords that are prefixes of another
keyword, maybe (for / foreach).
Why don't you want to use Spacing?
from pegged.
On Fri, Oct 5, 2012 at 7:32 PM, Alex Rønne Petersen <
[email protected]> wrote:
I thought the problem was that 'record' was parsed as 'rec' 'ord'.
I don't know what the problem actually is... Somehow (without that extra
Spacing added) the main rule of the grammar is marked as failure and half
of the tree is gone (but with no further indication of where the parse
failed). But if I change the identifier to reecord or whatever that
doesn't start with rec things work just fine. So, the identifier rule
needs to somehow deal with the fact that once it encounters something
that's not part of an identifier, the rule must succeed (which is what I
assume my original identifier rule didn't).That's what you rule does.But since
keywords
succeeds on a prefix of
'record' then!keywords
fails. And thusidentifier
, being a sequence
with its first subrule which failed, also fails. When a sequence fails, its
last child is the failed node. In this case, there is only one child,
created by!keywords
, but this child, being anonymous (not from a grammar
named rule) is cut by the tree decimation. Hence the failure with no child.
from pegged.
So would a correct rule be:
identifier <~ !(keywords .) [a-zA-Z_] [a-zA-Z_0-9]*
?
from pegged.
So would a correct rule be:
identifier <~ !(keywords .) [a-zA-Z_] [a-zA-Z_0-9]*
Well, no. Let say your keywords are "int" / ... and the "interface"
identifier "int_a".
Thenkeywords .
will succeeds on "int_" and(!keywords .)
will fail,
bringingidentifier
down.
Really, the only correct rule I see is:
identifier <~ !(keywords Spacing) [a-zA-Z_] [a-zA-Z_0-9]*
This still fail for "int;": (keywords Spacing)
fails on "int;", thus
!(...)
succeeds and pass the relay to [a-zA-Z_]...
which will happily
match on "int".
So I guess the real rule should be:
identifier <~ !(keywords Separator) [a-zA-Z_] [a-zA-Z_0-9]*
Separator <- Spacing / ';' # Or anything else needed in your language
from pegged.
So Separator
would have to contain any non-keyword, non-identifier token that might exist in the language, I suppose?
from pegged.
On Sat, Oct 6, 2012 at 10:28 AM, Alex Rønne Petersen <
[email protected]> wrote:
So Separator would have to contain any non-keyword, non-identifier token
that might exist in the language, I suppose?
Not so much. It's only to eliminate cases like "int;" or "int}" in places
where "identifier;" or "identifier}" would make sense. The second condition
is important, because in most case "identifier separator-token" would not
make sense.
from pegged.
Related Issues (20)
- is it okay for the D grammar to compile in 10 minutes and use 45 gigs of ram? HOT 14
- Run-time parsing of Peg grammars HOT 1
- qualifiedIdentifier in dgrammar.d is not defined HOT 2
- Add equivalent of “important” blocks? HOT 3
- Release 0.4.5 HOT 2
- `pure` and `@safe` grammars/rules? HOT 3
- Can't fix this failure. HOT 2
- Grammar won't parse whole text, it stops short a few lines... HOT 1
- Comment syntax HOT 7
- Unwanted space consumption in rule parameter
- Syntax wrappers HOT 1
- How does one do Intellisense using a Pegged grammar? HOT 3
- Is there a way to break up a grammar into D classes, each with their own data to parse into & thus subgrammar? HOT 2
- I can't find the Wikipedia ParseTree handling code in the Pegged wiki any more... HOT 1
- Will there be a speed-up when using a Dlang switch-case and shorter pegged variable names? HOT 2
- Can we create a grammar induction algorithm based upon expected Pegged parsing failures? HOT 2
- How do you use dub + pegged + asModule? HOT 1
- Bump new version HOT 7
- Remove obsolete CI.
- How do you branch on complex names such as caseSensitiveLiteral!("let") (they get much more complicated) HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from pegged.