Comments (9)
Looks like the grmtools lexer is implemented as a loop over the list of regexes that keeps track of the longest match. And the Rust regex doc says "." is "any character except new line (includes new line with s flag)", so [^.] should match new line but no other characters. So, since my .l file matches [ \t\r\n], it should be safe to use [^.] as a catch-all "never match" rule.
Whether dot accepts newline is configurable in regex via https://docs.rs/regex/latest/regex/struct.RegexBuilder.html#method.dot_matches_new_line, which is called here:
https://github.com/softdevteam/grmtools/blob/master/lrlex/src/lib/lexer.rs#L59
So I really think it should match no characters on grmtools, even if you do not have a token matching newline.
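For readers who want to see the effect of such a switch, here is a rough sketch using Python's re module as a stand-in. This is an assumption for illustration only: re.DOTALL plays the role of the "s" flag quoted above, and nothing here exercises the Rust regex crate or grmtools. The helper name matches is made up for this example.

```python
import re

# Stand-in for the "s" flag discussed in the thread: in Python's re module,
# re.DOTALL decides whether "." accepts a newline.
dot_default = re.compile(r".")
dot_with_s = re.compile(r".", re.DOTALL)


def matches(pattern, text):
    """Return True if the compiled pattern matches the whole text."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for name, pat in [("default", dot_default), ("DOTALL", dot_with_s)]:
        print(name,
              "| matches 'a':", matches(pat, "a"),
              "| matches newline:", matches(pat, "\n"))
```

Note that this only covers a bare ".", not "." inside a bracketed character class.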
I didn't see this listed in the lex compatibility section of the docs (and I'm working solely off of memory for lex's dot newline behavior), but maybe this is something that should be documented here. I'll try to test lex and then open a PR for the docs change if my memory is correct. In that PR perhaps we should investigate whether there is any difficulty in making a CTLexerBuilder config option for dot newline behavior.
from grmtools.
Testing of flex confirmed that [.] does not match newline; however, it also seems to show that [^.], rather than not matching any character, matches any character including newlines instead of only newlines. This seems like an odd/unexpected behavior to me though, so I don't see at the moment with flex a good way to get a token which doesn't match. Perhaps I'm making a silly mistake?
%%
[^.] { printf("match: %s %d\n", yytext, yytext[0] == '\n');}
%%
int yywrap(){return(1);}
int main(int argc, char **argv){yylex(); return 0;}
I think . is treated as a normal character inside character classes. For flex you could specify [^\x00-\xFF] to negate everything.
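To make the character-class point concrete, here is a small sketch using Python's re module as a stand-in. That is an assumption on my part: it is neither flex nor the Rust regex crate, so it only illustrates the idea that "." loses its special meaning inside brackets. Byte patterns are used because flex scans 8-bit input, and the helper name matching_bytes is made up for this example.

```python
import re

# Analogy in Python's re module, not a test of flex or of Rust's regex crate.
literal_dot = re.compile(rb"[.]")      # a class containing only the byte '.'
not_dot = re.compile(rb"[^.]")         # every byte except '.'
never = re.compile(rb"[^\x00-\xff]")   # negates the whole byte range


def matching_bytes(pattern):
    """Return the set of single-byte values that the pattern matches."""
    return {b for b in range(256) if pattern.fullmatch(bytes([b]))}


if __name__ == "__main__":
    print("[.] matches", len(matching_bytes(literal_dot)), "byte value(s)")
    print("[^.] matches newline:", ord("\n") in matching_bytes(not_dot))
    print("[^.] matches", len(matching_bytes(not_dot)), "byte value(s)")
    print("[^\\x00-\\xff] matches", len(matching_bytes(never)), "byte value(s)")
```

In this engine, [^.] accepts every byte other than the dot itself (newline included), while the fully negated range accepts nothing, which is the "never match" behaviour being looked for.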
In my usage, instead of using a catch-all like [^.], I think I can use - for UMINUS, since - is already used for PLUS earlier, so UMINUS should never match.
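A minimal sketch of why a later duplicate rule should never fire, assuming the lexer picks the longest match and breaks ties in favour of the earliest rule, which is what the reasoning above relies on. This is Python written for illustration, not grmtools code, and the rule table and token names (INT, MINUS, UMINUS, WS) are hypothetical.

```python
import re

# Hypothetical rule table in .l file order: (token name, pattern).
RULES = [
    ("INT", re.compile(r"[0-9]+")),
    ("MINUS", re.compile(r"-")),
    ("UMINUS", re.compile(r"-")),        # same pattern as MINUS, listed later
    ("WS", re.compile(r"[ \t\r\n]+")),
]


def lex(text):
    """Tokenise text by longest match; on a tie the earliest rule wins."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    for token in lex("3 - -4"):
        print(token)
```

Under that tie-breaking assumption the lexer never emits UMINUS, so the token exists only for the grammar to refer to.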
from grmtools.
One thought that comes to mind is to include a rule in your lex file such as [^.] "UMINUS", probably making sure that it is at the end. Which should in theory make a token which never matches anything.
Edit: The reason I say "in theory" is that I believe there is a slight difference between lex's and grmtools' treatment of . with respect to newlines, with grmtools including newlines in . and lex not including them. So I believe it would never match anything on grmtools. But on lex it might match a newline, unless newline is already matched, as is the case where you match whitespace early in the lex file. So given those stipulations I believe it should behave the same between grmtools and lex.
from grmtools.
One thought that comes to mind is to include a rule in your lex file such as [^.] "UMINUS", probably making sure that it is at the end. Which should in theory make a token which never matches anything.
Good idea. I am so stuck in the yacc mindset that it is hard for me to switch and think outside the box.
For posterity, I had to do two things for the error to go away: first put [^.] "UMINUS" in the .l file, then add %token "UMINUS" in the .y file.
Edit: The reason I say "in theory" is that I believe there is a slight difference between lex's and grmtools' treatment of . with respect to newlines, with grmtools including newlines in . and lex not including them.
Looks like the grmtools lexer is implemented as a loop over the list of regexes that keeps track of the longest match. And the Rust regex doc says "." is "any character except new line (includes new line with s flag)", so [^.] should match new line but no other characters. So, since my .l file matches [ \t\r\n], it should be safe to use [^.] as a catch-all "never match" rule.
Edit:
For posterity, I had to do two things for the error to go away: first put - "UMINUS" in the .l file, then add %token "UMINUS" in the .y file.
from grmtools.
Testing of flex confirmed that [.] does not match newline; however, it also seems to show that [^.], rather than not matching any character, matches any character including newlines instead of only newlines. This seems like an odd/unexpected behavior to me though, so I don't see at the moment with flex a good way to get a token which doesn't match. Perhaps I'm making a silly mistake?
%%
[^.] { printf("match: %s %d\n", yytext, yytext[0] == '\n');}
%%
int yywrap(){return(1);}
int main(int argc, char **argv){yylex(); return 0;}
from grmtools.
AFAIK, lex/flex predate the "semi-standardisation" of regexes that occurred after Perl-style regexes started to dominate. It's why things like vim, Unix's re_format, and so on have what to us seem to be "weird" regex formats, each a little different.
I did hope -- but I know I won't get around to! -- that someone might implement a "proper" lex/flex compatibility mode for lrlex that implements that particular style of regexes. It's not that I think it's a particularly good format, but it would mean that we could move from "we do whatever Rust's regex engine does" to "we run lex files the same as any other lex".
In the specific case we have here, I'm a bit neutral about whether . should match newlines or not. I lean towards thinking that both behaviours have surprises, and that breaking backwards compatibility just to opt into another surprise might not be worth it, but I can be convinced otherwise.
from grmtools.
@yuanweixin Nice! That sounds like the best option and avoids all of these compatibility woes.
@ltratt Yeah, I probably won't get around to implementing a full lex-compatible regex engine either (I'm fond of being able to just use Rust regex instead). I'm not really too keen on changing the default newline behavior either. I guess I was just entertaining the idea of exposing RegexBuilder options to CTLexerBuilder, which could increase compatibility in this regard without much effort. But with the above seemingly sorted, there might not be anyone who needs it. Anyway, I'll happily post a patch for that if it seems useful.
from grmtools.
I guess I was just entertaining the idea of exposing RegexBuilder options to CTLexerBuilder.
That could be interesting!
from grmtools.
Closing this in favour of #400.
from grmtools.