Comments (9)

ratmice commented on July 18, 2024

> It looks like the grmtools lexer is implemented as a loop over the list of regexes that keeps track of the longest match. The Rust regex docs say . is "any character except new line (includes new line with s flag)", so [^.] should match a newline but no other characters. So, since my .l file matches [ \t\r\n], it should be safe to use [^.] as a catch-all "never match" rule.

Whether dot matches newline is configurable in regex via https://docs.rs/regex/latest/regex/struct.RegexBuilder.html#method.dot_matches_new_line, which is called here:
https://github.com/softdevteam/grmtools/blob/master/lrlex/src/lib/lexer.rs#L59
So I really think it should match no characters on grmtools, even if you do not have a token matching newline.

I didn't see this listed in the lex compatibility section of the docs (and I'm working solely from memory for lex's dot newline behavior), but maybe this is something that should be documented there. I'll try to test lex out, then open a PR for the docs change if my memory is correct. In that PR perhaps we should investigate whether there is any difficulty in making a CTLexerBuilder config option for dot newline behavior.

from grmtools.

yuanweixin commented on July 18, 2024

> Testing of flex confirmed that [.] does not match newline. However, it also seems to show that [^.], rather than matching no characters, matches any character including newlines, instead of only newlines. This seems like odd/unexpected behavior to me, so at the moment I don't see a good way with flex to get a token which doesn't match. Perhaps I'm making a silly mistake?
>
> %%
> [^.] { printf("match: %s %d\n", yytext, yytext[0] == '\n'); }
> %%
> int yywrap(void) { return 1; }
> int main(void) { yylex(); return 0; }

I think . is treated as a normal character inside character classes. For flex you could specify [^\x00-\xFF] to negate everything.
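A minimal sketch of the point about character classes. It uses POSIX extended regexes through regcomp/regexec rather than flex or Rust's regex crate (an assumption, made only so the example is self-contained C), and the helper name matches is hypothetical. POSIX bracket expressions follow the same rule: inside brackets, . is a literal dot.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` (a POSIX ERE) matches somewhere
   in `text`, 0 if it does not, and -1 if the pattern fails to compile.
   REG_NEWLINE is deliberately not set, so newline gets no special treatment. */
static int matches(const char *pattern, const char *text) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, matches("[^.]", "a") and matches("[^.]", "\n") both report a match while matches("[^.]", ".") does not, so [^.] behaves as "anything but a literal dot" and is not a never-match rule.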

In my usage, instead of using a catch-all like [^.], I think I can use - for UMINUS, since - is already used for PLUS earlier, so UMINUS should never match.

ratmice commented on July 18, 2024

One thought that comes to mind is to include a rule such as [^.] "UMINUS" in your lex file, probably making sure that it is at the end. In theory, that should make a token which never matches anything.

Edit: The reason I say "in theory" is that I believe there is a slight difference between how lex and grmtools treat . with respect to newlines: grmtools includes newlines in . and lex does not. So I believe it would never match anything on grmtools, but on lex it might match a newline unless newline is already matched, as is the case when you match whitespace early in the lex file. Given those stipulations, I believe it should behave the same between grmtools and lex.

yuanweixin commented on July 18, 2024

> One thought that comes to mind is to include a rule such as [^.] "UMINUS" in your lex file, probably making sure that it is at the end. In theory, that should make a token which never matches anything.

Good idea. I am so stuck in the yacc mindset that it is hard for me to switch and think outside the box.

For posterity, I had to do two things for the error to go away: first put [^.] "UMINUS" in the .l file, then add %token "UMINUS" to the .y file.

> Edit: The reason I say "in theory" is that I believe there is a slight difference between how lex and grmtools treat . with respect to newlines: grmtools includes newlines in . and lex does not.

It looks like the grmtools lexer is implemented as a loop over the list of regexes that keeps track of the longest match. The Rust regex docs say . is "any character except new line (includes new line with s flag)", so [^.] should match a newline but no other characters. So, since my .l file matches [ \t\r\n], it should be safe to use [^.] as a catch-all "never match" rule.

Edit:
For posterity, I had to do two things for the error to go away: first put - "UMINUS" in the .l file, then add %token "UMINUS" to the .y file.
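The lexer behaviour described above (loop over the rules, keep the longest match) can be sketched roughly as follows. This is not lrlex's code: it uses POSIX regexes in C, the struct and function names are hypothetical, and the tie-breaking rule (an earlier rule wins over a later rule of equal length) is an assumption consistent with the observation that a second - rule never matches.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical lexer rule: a POSIX ERE and the token name it produces. */
struct rule {
    const char *pattern;
    const char *token;
};

/* Try every rule at the start of `input` and return the token name of the
   longest match, storing its length in `*len`. Returns NULL (and length 0)
   if no rule matches a non-empty prefix. */
static const char *longest_match(const struct rule *rules, size_t nrules,
                                 const char *input, size_t *len) {
    const char *best = NULL;
    size_t best_len = 0;
    for (size_t i = 0; i < nrules; i++) {
        regex_t re;
        regmatch_t m;
        if (regcomp(&re, rules[i].pattern, REG_EXTENDED) != 0)
            continue; /* skip rules whose pattern does not compile */
        /* Only accept matches that begin at the current position. */
        if (regexec(&re, input, 1, &m, 0) == 0 && m.rm_so == 0) {
            size_t n = (size_t)m.rm_eo;
            /* Strictly greater: a later rule of equal length never wins. */
            if (n > best_len) {
                best = rules[i].token;
                best_len = n;
            }
        }
        regfree(&re);
    }
    *len = best_len;
    return best;
}
```

Given rules in the order [0-9]+ "INT", \+ "PLUS", - "MINUS", - "UMINUS" (illustrative names), the input "-3" yields MINUS with length 1. The duplicate - rule exists as a token for the grammar but is never produced by the lexer.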

ratmice commented on July 18, 2024

Testing of flex confirmed that [.] does not match newline. However, it also seems to show that [^.], rather than matching no characters, matches any character including newlines, instead of only newlines. This seems like odd/unexpected behavior to me, so at the moment I don't see a good way with flex to get a token which doesn't match. Perhaps I'm making a silly mistake?

%%
[^.] { printf("match: %s %d\n", yytext, yytext[0] == '\n'); }
%%
int yywrap(void) { return 1; }
int main(void) { yylex(); return 0; }

ltratt commented on July 18, 2024

AFAIK, lex/flex predate the "semi-standardisation" of regexes that occurred after Perl-style regexes started to dominate. It's why things like vim, Unix's re_format, and so on have what seem to us to be "weird" regex formats, each a little different.

I did hope -- but I know I won't get around to it! -- that someone might implement a "proper" lex/flex compatibility mode for lrlex that implements that particular style of regexes. It's not that I think it's a particularly good format, but it would mean that we could move from "we do whatever Rust's regex engine does" to "we run lex files the same as any other lex".

In the specific case we have here, I'm a bit neutral about whether . should match newlines or not. I lean towards thinking that both behaviours have surprises, and that breaking backwards compatibility just to opt into another surprise might not be worth it, but I can be convinced otherwise.

ratmice commented on July 18, 2024

@yuanweixin Nice! That sounds like the best option and avoids all of these compatibility woes.

@ltratt Yeah, I probably won't get around to implementing a full lex-compatible regex engine either (I'm fond of being able to just use Rust's regex instead). I'm not really too keen on changing the default newline behavior either. I guess I was just entertaining the idea of exposing RegexBuilder options to CTLexerBuilder, which could increase compatibility in this regard without much effort. But with the above seemingly sorted, there might not be anyone who needs it. Anyway, I'll happily post a patch for that if it seems useful.

ltratt commented on July 18, 2024

> I guess I was just entertaining the idea of exposing RegexBuilder options to CTLexerBuilder.

That could be interesting!

ltratt commented on July 18, 2024

Closing this in favour of #400.
