In yacc, it is possible to declare a fake token with the highest precedence, ...


error when using fake UMINUS token in %prec directive about grmtools HOT 9 CLOSED

yuanweixin commented on July 18, 2024

error when using fake UMINUS token in %prec directive

from grmtools.

Comments (9)

ratmice commented on July 18, 2024 1

> Looks like grmtools lexer is implemented as a loop over the list of regexes and keep track of the longest match. And the rust regex doc says ". any character except new line (includes new line with s flag)", so [^.] should match new line but no other characters. So, since my .l file matches [ \t\r\n], it should be safe to use [^.] as a catch-all "never match" rule.

Whether dot accepts newline is configurable in regex via https://docs.rs/regex/latest/regex/struct.RegexBuilder.html#method.dot_matches_new_line, which is called here:
https://github.com/softdevteam/grmtools/blob/master/lrlex/src/lib/lexer.rs#L59
So I really think [^.] should match no characters on grmtools, even if you do not have a token matching newline.
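The dot/newline switch described above can be illustrated with a small Python analogue. This is only a sketch: Python's `re.DOTALL` stands in for the `dot_matches_new_line` option of Rust's `RegexBuilder`, and the helper name `matches_newline` is made up for the example. It says nothing about what lrlex itself does.

```python
import re

# Python analogue only: re.DOTALL plays the role that the thread attributes
# to RegexBuilder::dot_matches_new_line in Rust's regex crate.
dot_default = re.compile(r".")
dot_with_newline = re.compile(r".", re.DOTALL)


def matches_newline(pattern):
    """Return True if the pattern matches a lone newline character."""
    return pattern.fullmatch("\n") is not None


if __name__ == "__main__":
    print("default dot matches newline:", matches_newline(dot_default))
    print("DOTALL dot matches newline:", matches_newline(dot_with_newline))
```

The same pattern text behaves differently depending on one builder-level flag, which is why exposing that flag as a lexer option was being considered.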

I didn't see this listed in the lex compatibility section of the docs (and I'm working solely off of memory for lex's dot newline behavior), but maybe this is something that should be documented there. I'll try to test lex out, then open a PR for the docs change if my memory is correct. In that PR perhaps we should investigate whether there is any difficulty in making a CTLexerBuilder config option for dot newline behavior.


yuanweixin commented on July 18, 2024 1

> Testing of flex confirmed that [.] does not match newline, however it also seems to show that [^.] rather than not matching any character, matches any character including newlines instead of only newlines. This seems like an odd/unexpected behavior to me though, so I don't see at the moment with flex a good way to get a token which doesn't match. Perhaps I'm making a silly mistake?
>
> %{
> /* Prints each matched character, then 1 if it is a newline, else 0. */
> #include <stdio.h>
> %}
> %%
> [^.] { printf("match: %s %d\n", yytext, yytext[0] == '\n'); }
> %%
> int yywrap(void) { return 1; }
> int main(void) { yylex(); return 0; }

I think . is treated as a normal character inside character classes. For flex you could specify [^\x00-\xFF] to negate everything.
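A quick way to see both points is to enumerate which single bytes each pattern accepts. The sketch below uses Python's `re` on byte strings as a stand-in; that it treats character classes the same way as flex or Rust's regex crate is an assumption, and the helper `matching_bytes` is made up for the illustration.

```python
import re

# Assumption: Python's re stands in for the engines discussed in the thread.
negated_dot = re.compile(rb"[^.]")            # '.' is a literal inside [...]
never_matches = re.compile(rb"[^\x00-\xFF]")  # negates every possible byte value


def matching_bytes(pattern):
    """Return the set of single-byte values that the pattern matches."""
    return {b for b in range(256) if pattern.fullmatch(bytes([b]))}


if __name__ == "__main__":
    accepted = matching_bytes(negated_dot)
    print("[^.] accepts newline:", ord("\n") in accepted)
    print("[^.] accepts '.':", ord(".") in accepted)
    print("[^\\x00-\\xFF] accepts:", matching_bytes(never_matches))
```

Under these assumptions `[^.]` accepts every byte except the literal `.`, newline included, while the fully negated range accepts nothing.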

In my usage, instead of using a catch-all like [^.], I think I can use - for UMINUS: since - is already matched by an earlier rule, UMINUS should never match.


ratmice commented on July 18, 2024

One thought that comes to mind is to include a rule in your lex file such as [^.] "UMINUS", probably making sure that it is at the end, which should in theory make a token which never matches anything.

Edit: The reason I say "in theory" is that I believe there is a slight difference between how lex and grmtools treat . with respect to newlines: grmtools includes newlines in . and lex does not. So I believe it would never match anything on grmtools. On lex it might match a newline, unless newline is already matched by an earlier rule, as is the case when you match whitespace early in the lex file. Given those stipulations, I believe it should behave the same on grmtools and lex.


yuanweixin commented on July 18, 2024

> One thought that comes to mind is to include a rule in your lex file such as [^.] "UMINUS", probably making sure that it is at the end. Which should in theory make a token which never matches anything.

Good idea. I am so stuck in the yacc mindset that it is hard for me to switch and think outside the box.

For posterity, I had to do 2 things for the error to go away: first put [^.] "UMINUS" in the .l file, then add %token "UMINUS" to the .y file.

> Edit: The reason i say "in theory", is that I believe there is a slight difference between lex and grmtools treatment of . with respect to the inclusion of newlines with grmtools including newlines in . and lex not including them.

It looks like the grmtools lexer is implemented as a loop over the list of regexes that keeps track of the longest match. The Rust regex docs say ". any character except new line (includes new line with s flag)", so [^.] should match new line but no other characters. Since my .l file already matches [ \t\r\n], it should be safe to use [^.] as a catch-all "never match" rule.

Edit:
For posterity, I had to do 2 things for the error to go away: first put - "UMINUS" in the .l file, then add %token "UMINUS" to the .y file.
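To see why a second - rule should never fire, here is a minimal Python sketch of a longest-match lexer loop of the kind described above. It is not lrlex's code: the rule list is hypothetical, and breaking ties in favour of the earliest rule is an assumption borrowed from the usual lex convention.

```python
import re

# Hypothetical rule list, in file order: (regex, token name or None to skip).
RULES = [
    (re.compile(r"[0-9]+"), "INT"),
    (re.compile(r"-"), "MINUS"),
    (re.compile(r"[ \t\r\n]+"), None),
    (re.compile(r"-"), "UMINUS"),  # same regex as MINUS, listed later
]


def lex(text):
    """Tokenise text by longest match; on a tie the earliest rule wins (assumed)."""
    pos, tokens = 0, []
    while pos < len(text):
        best_len, best_name = 0, None
        for regex, name in RULES:
            m = regex.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if m and m.end() - pos > best_len:
                best_len, best_name = m.end() - pos, name
        if best_len == 0:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name is not None:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    print(lex("3 - -4"))
```

In this sketch every - comes out as MINUS, so UMINUS exists as a token name for the grammar's %prec directive without ever being produced by the lexer.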


ratmice commented on July 18, 2024

Testing of flex confirmed that [.] does not match newline, however it also seems to show that [^.], rather than not matching any character, matches any character including newlines instead of only newlines. This seems like an odd/unexpected behavior to me though, so I don't see at the moment with flex a good way to get a token which doesn't match. Perhaps I'm making a silly mistake?

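%{
/* Prints each matched character, then 1 if it is a newline, else 0. */
#include <stdio.h>
%}
%%
[^.] { printf("match: %s %d\n", yytext, yytext[0] == '\n'); }
%%
int yywrap(void) { return 1; }
int main(void) { yylex(); return 0; }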


ltratt commented on July 18, 2024

AFAIK, lex/flex predate the "semi-standardisation" of regexes that occurred after Perl-style regexes started to dominate. It's why things like vim, Unix's re_format, and so on have what to us seem to be "weird" regex formats, each a little different.

I did hope -- but I know I won't get around to! -- that someone might implement a "proper" lex/flex compatibility mode for lrlex that implements that particular style of regexs. It's not that I think it's a particularly good format, but it will mean that we could move from "we do whatever Rust's regex engine does" to "we run lex files the same as any other lex".

In the specific case we have here, I'm a bit neutral about whether . should match newlines or not. I lean towards thinking that both behaviours have surprises, and that breaking backwards compatibility just to opt into another surprise might not be worth it, but I can be convinced otherwise.


ratmice commented on July 18, 2024

@yuanweixin Nice! That sounds like the best option and avoids all of these compatibility woes.

@ltratt Yeah, I probably won't get around to implementing a full lex compatible regex engine either (fond of being able to just use rust regex instead). I'm not really too keen on changing the default newline behavior either. I guess I was just entertaining the idea of exposing RegexBuilder options to CTLexerBuilder, which can increase compatibility in this regard without much effort. But with the above seemingly sorted there might not be anyone who needs it. Anyway, I'll happily post a patch for that if it seems useful.


ltratt commented on July 18, 2024

> I guess I was just entertaining the idea of exposing RegexBuilder options to CTLexerBuilder.

That could be interesting!


ltratt commented on July 18, 2024

Closing this in favour of #400.
