Hi, I was trying to parse regexes with this parser and couldn't get the <code class="n

Also, with <a class="issue-link js-issue-link" data-error-text="Failed to load title"

If you want you can have a look here: <a href="https://github.com/jkrumbiegel/Readable

Precedence about pikaparser.jl HOT 10 CLOSED

jkrumbiegel commented on May 20, 2024

Precedence

from pikaparser.jl.

Comments (10)

exaexa commented on May 20, 2024

Hi,

predecence with PEGs is often a bit surprising. Original pikaparser basically compiles the precedence integers into "base grammar" by the usual failthrough construction with a base case, which removes much ambiguity and also allows you to nicely specify the fixity of whatever operators you have.

In your case, there's no telling whether the "glue chars together" in :sequence should have lower or higher precedence than the "alternatives" in :either, and what you got is a valid parse with respect to the grammar.

I succeeded with implementing the failthrough gadget manually:

using PikaParser
const P = PikaParser

r = Dict(
    :regex => P.seq(:expr1),  # toplevel
    :expr1 => P.first(:or => P.seq(:expr1, P.token('|'), :expr2), :expr2),  # precedence level 1
    :expr2 => P.one_or_more(:basic),   # precedence level 2
    :basic => P.first(  # easily decidable single-item base cases
        :char => P.satisfy(isletter),
        :parens => P.seq(P.token('('), :expr1, P.token(')')),
    )
)

g = P.make_grammar([:regex], P.flatten(r))

input = collect("abc|d(e|fg|h)ij|kl")

p = P.parse(g, input)

println(P.traverse_match(g, p, P.find_match_at(g, p, :regex, 1), :regex))

(At precedence level 1, you may also try other combinations of :expr1 and :expr2 operands to change the fixity of the |.)

On the string abc|d(e|fg|h)ij|kl it should give (with a bit of juliaformatter) something like:

regex(
    expr1(
        or(
            expr1(
                or(
                    expr1(expr2(basic(char()), basic(char()), basic(char()))),
                    var"or-2"(),
                    expr2(
                        basic(char()),
                        basic(
                            parens(
                                var"parens-1"(),
                                expr1(
                                    or(
                                        expr1(
                                            or(
                                                expr1(expr2(basic(char()))),
                                                var"or-2"(),
                                                expr2(basic(char()), basic(char())),
                                            ),
                                        ),
                                        var"or-2"(),
                                        expr2(basic(char())),
                                    ),
                                ),
                                var"parens-3"(),
                            ),
                        ),
                        basic(char()),
                        basic(char()),
                    ),
                ),
            ),
            var"or-2"(),
            expr2(basic(char()), basic(char())),
        ),
    ),
)

I don't see the immediate issue why your parser would produce the "bad" parsetree, but I expect the problem to be a bad interplay of greedy matching and left-recursion problems, which always gives surprises. In this case, if you'd parse with normal recursive descent, you would expect the :sequence rule to expand to :expr, then try :either which again tries :sequence at the same position and then loops forever. Pikaparser fixes this by reconstructing the parsetree from the end, and prefers matching the b|c as a continuation in ab|c as a higher-priority :either, which is a perfectly valid :expr (even better match than the :char because it doesn't produce failure later), but that doesn't give you the way to glue the chars together. In short, "priority" in first isn't the same thing as "precedence".

You might be able to quickfix your grammar by instead using :chars that greedily matches the whole char group, but I didn't try that.

Also, I should probably add some support for easily creating the precedence chains :D

from pikaparser.jl.

exaexa commented on May 20, 2024

Bonus: parsing the regex suffixes with normal grammars is a major source of PITA but I guess here you'll have luck with just

...
:expr2 => P.one_or_more(:expr3),
:expr3 => P.first(:one_or_more, :zero_or_more, ..., :basic),
:one_or_more => P.seq(:basic, P.token('+')),
...

from pikaparser.jl.

exaexa commented on May 20, 2024

Also, with #3 merged you can write the ruleset from above roughly as:

r = Dict(
    P.@precedences (i -> Symbol(:r, i)) c n begin
      :regex => P.seq(n, P.token('|'), c)
      :sequence => P.one_or_more(n)
      P.first(
        :group => P.seq(P.token('('), n, P.token(')')),
        :char => P.satisfy(isletter),
      )
    end
)

(untested)

from pikaparser.jl.

jkrumbiegel commented on May 20, 2024

Cool, I will play around with this. It already looked like your previous suggestions worked, and I understood better why it didn't work before, so thanks!

from pikaparser.jl.

exaexa commented on May 20, 2024

OK. Feel free to report your final grammar, might be useful to add some extra bits from it to tests.

from pikaparser.jl.

jkrumbiegel commented on May 20, 2024

If you want you can have a look here: https://github.com/jkrumbiegel/ReadableRegex.jl/blob/explain-regex/explain_regex.jl

I wanted to do the reverse of what the package is doing, translate a given regex into function code that should hopefully be easier to understand.

For example, explain_regex(r"^-?[0-9]{0,2}(\.[0-9]{1,2})?$|^-?(100)(\.[0]{1,2})?$") gives this (with a tiny bit manual indentation to make the either arguments more salient:

either(
    BEGIN *
        maybe('-') *
        between(0, 2, char_in('0':1:'9')) *
        maybe(capture('.' * between(1, 2, char_in('0':1:'9')))) *
        END,
    BEGIN *
        maybe('-') *
        capture("100") *
        maybe(capture('.' * between(1, 2, char_in('0')))) *
        END,
)

from pikaparser.jl.

exaexa commented on May 20, 2024

Aaaah nice. Regex is a bit write-only tbh, but this should work. If I get it right, you are able to reconstruct actual ReadableRegex structure from any regex pattern?

Anyway, let me know in case you hit any difficulties!

(also, how new is the import Package as P syntax? the last time I tried it didn't work :D :D )

from pikaparser.jl.

jkrumbiegel commented on May 20, 2024

Yes, that's the goal. And yeah it's mostly read only, but it's very hard to understand as it is. Like spotting a | in the middle of a long regex etc. So I thought it would be cool to have that split out a bit, and with the ReadableRegex functions you can then more easily try out changes. But anyway it's mostly a didactical thing :)

from pikaparser.jl.

exaexa commented on May 20, 2024

Thinking about that, is there any regular language matcher in Julia? This way we could have a completely regex-less regular language matching. (Technically you could translate the regular expressions to pikaparser and match them as is, but compared to plain DFA/NFA required for regexes there's an ugly lot of overhead here...)

from pikaparser.jl.

exaexa commented on May 20, 2024

Closing this because the precedence helpers gonna get released asap. Thanks for input!

from pikaparser.jl.

Precedence about pikaparser.jl HOT 10 CLOSED

Comments (10)

Related Issues (7)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent