jcgoble3 / lua-matchext Goto Github PK
View Code? Open in Web Editor NEWFork of Lua 5.3 pattern matching with added features
License: Other
Fork of Lua 5.3 pattern matching with added features
License: Other
The ability to use |
for regex-style alternation within groups would be nice. Probably can use a modified version of match_balance
to find the end of the group, then match until encountering |
or )
. On success, make a recursive call to match
beginning at the end of the group. If that returns success, pass it up the chain, else on NULL
backtrack to the start of the group in the source string and attempt matching in the next alternative.
The major problem will be correctly counting capture groups, including capture groups passed over in unused alternatives. Also, need to handle groups that don't participate in the match (return nil
s in match table, but how to handle gsub()
and :expand()
?).
EDIT: Here's a proper task list:
MatchState
start_capture()
to parse the namematch()
to parse and handle named and named-style numbered backreferencesbuild_result_table
to check for and handle named capturesadd_s()
to handle named and named-style numbered backreferencesmatchobj_expand()
to handle named and named-style numbered backreferencesOriginal:
Named captures would be awesome. Here's my idea: add a third member to the captures
struct in MatchState
, which would be an fixed array of char
. start_capture
would then read the upcoming characters, and if a named capture is detected, obtain the name and insert it into the array, followed by '\0'
. (We can't just store a pointer in the MatchState
, since then the array would expire when start_capture
returns, or worse, when the inner if block ends. I'd rather not get into malloc
stuff here.) Then it would proceed with the match as normal. In order to allow position captures to be named, the check for a position capture would be moved out of match
and into start_capture
.
Alternatively, a separate array can be added to MatchState
, mapping group names to group numbers. This would simplify backreferences, but complicate result table building, and would also require either a NULL
sentinel or an extra member tracking the number of named groups. I think this is better, though.
Table matching functions would include named captures in the table (obviously), but it's a bit tricky: because Lua (unlike Python) does not distinguish between indexing and attribute access, and some special fields are already defined. Thus, a new table, groups
will be added, which will contain all captures, both numbered and named, instead of the main table. Next, to avoid having it shadowed by a capture named "expand", the expand
method will be placed directly in the table. Finally, the table's __index metamethod will point to the groups
subtable, thereby making all groups accessible by direct indexing, except for those that share the name of a special field. (Note that this will require a separate metatable for each result table, instead of a single metatable stored in the registry that all result tables share. expand
will also have to be modified to pull from the groups
subtable.
startpos
and endpos
would also carry the named fields, but fortunately they do not suffer from this problem. The documentation will officially recommend accessing named captures through groups
rather than directly, especially when the name is unknown (e.g. user input). Using the groups
subtable will also make pairs
iteration over captures easier.
The syntax would be the same as most regex: (?<name>...)
. Excluding the subpattern (...
) would turn it into a position capture. Backreferences within the pattern would use the PCRE syntax (but with Lua's escape character) %k<name>
, while group references in a gsub
replacement pattern would use simply %<name>
. As a bonus, these backreference syntaxes can be overloaded to allow referencing numbered captures 10 or higher.
Valid group names would be anything that is a valid variable in Lua, but with a cap on length (maybe 15). Duplicate names would throw an error when encountered. Backreferences to non-existent groups would also throw an error.
Backwards-compatibility: I do not consider this proposed syntax to be backward incompatible with PUC-Rio Lua. In PUC-Lua, (?
will open a capture and then match ?
literally, but this is undefined. The manual defines ?
as a special character which always has special meaning; the fact that it matches literally when following another special sequence (e.g. try it after another quantifier) is thus undefined behavior, and I do not consider a change in behavior that was previously undefined to be backwards-incompatible. Likewise, %<
in a PUC-Lua replacement string would throw an error for an invalid escape; thus, all replacement strings that previously worked will continue to work. Finally, %k
is currently not a defined character class, and thus the fact that it matches itself is undefined. Therefore, named captures can be included in the basic functions, even though their only use will be backreferences (since the basic functions don't have a means to return names).
(I think this explanation is more long-winded than the implementation will be. :P
For enhanced patterns covered by #9 (#6, #7, and #8) might it be a good idea to pre-compile these patterns to bytecode? It would likely make these features easier to implement. Again, need to study Python's re
module code to see how this is done (in particular, how to build an array of unknown length of numbers; maybe also study Lua's source code compiler to see how it builds the bytecode?). Could also build one byte at a time using Lua's buffer facilities, then convert the string to a proper array of numbers.
A staple of any regex flavor. For the enhanced patterns (issue #9), this could simply be a flag passed as a separate argument that changes the meaning of ^
and $
. But it could be implemented for the basic functions as well, if an appropriate letter code can be chosen.
Unfortunately, many letters are already taken by PUC Lua: a
, b
, c
, d
, f
, g
, l
, p
, s
, u
, w
, x
, and z
. That's half the alphabet in use by stock patterns alone, not to mention that named captures (issue #14) have taken k
. So we're running low on letters and numbers are not an option (backreferences). Also, I'm guaranteeing that all patterns written for stock Lua will remain compatible with the basic functions in this library as long as they do not rely on undefined behavior, and the 5.3 manual specifies that non-magic characters match themselves literally whether escaped or not, so new punctuation characters cannot be used even with an escape. So I may have to relegate this to enhanced patterns only.
As it's a simple boolean integer, this argument probably should be stored directly in the GMatchState
struct, rather than in an additional upvalue.
See http://www.lua.org/work/diffu-lua-5.3.2-lua-5.3.3.html (Ctrl+F for strlib
). Of note: removal of non-linear complexity stuff, adjustments to handle a coroutine bug (see issue #17), and changes to how gmatch
and gsub
handle empty matches.
This last item is a bit tricky; technically this change can be made freely since the behavior is undefined, but many use cases depend on a specific behavior for empty matches (e.g. split functions) and so some code has come to rely on the existing behavior. To be consistent with upstream, the default functions ought to be changed. However, is there any good reason to retain the older behavior in a new set of functions, or with an optional flag to the existing functions? I don't think so, but it's worth considering.
I mean, we've copied the entire Lua 5.3 pattern-matching source code, so why don't they work?
(Note: as the current behavior matches what Lua 5.1 users are used to, I consider the idea of getting nulls to work in 5.1 a feature rather than a bug. A bug is something that the user expects to work, but doesn't; while I, the developer, expected this to work, end users are just seeing the same behavior they're already used to, so it's not a bug, per se.)
Should have proper documentation, not just some stuff in the README. I think LuaRocks does something special with anything in a docs/
folder?
Just noticed this. First line of matchbalance
should subtract 2 instead of 1 if dealing with the uppercase B form. Should be a quick fix.
sub
be renamed expand
?source
be renamed string
?Provide an "escape" function that allows passing in a string and getting back a version will all non-alphanumeric characters escaped. Might be a good idea to restrict the escaping to non-alphanumeric characters that are ASCII and not control codes (c <= 126 && c >= 32 && !isalnum(c)
), as control codes and non-ASCII characters will never be given special meaning and escaping non-ASCII characters could interfere with UTF-8 matching.
It would be nice to support both greedy and lazy forms of {m,n}
quantifiers, as well as a lazy version of '?'. *
already has a direct lazy version as -
, but depending on how this resolves, it might be nice to support *?
as well for consistency. As the lazy form of a+
is aa-
, necessitating repeating the token, a lazy form of +
would be helpful as well, especially for group quantifiers (issue #6).
It would also be awesome to support possessive quantifiers, essentially a greedy quantifier that never backtracks or gives up a match. Not all regex flavors support this, but I think it should be fairly easy to implement.
Bug and patch here.
This would be a nice feature: match on UTF-8 codepoints instead of single bytes. Things to figure out:
What changes would be needed?
umatch
plus utmatch
and so forth? Or is that too much, and I should just use table functions (in which case, is the t
needed in the function names)? Thinking just table functions for this and all further "enhanced" features that can't be merged directly with the basic functions.MatchState
needs a new utf8
pseudo-boolean field indicating whether to perform byte matching or UTF-8 matching; changes below will only take effect when utf8
is trueinit
parameters should refer to codepoints instead of bytes (pull from stock utf8
module?)posrelat
might need changes, depending on how I handle init
parametersmatch_class
should force ASCII meanings (we're not going to deal with Unicode meanings, just like the standard utf8
library)match_class
should also match on codepoints in the default
branchmatchbracketclass
will need a new implementation to match codepointssinglematch
should match a codepoint in the default
branch whenmatchbalance
needs to handle codepoint arguments -- can ignore UTF-8 when all arguments are ASCIImatch
's default
branch > else
branch needs to handle codepoints on s++
and s + 1
; function is otherwise high-level enough to be finepush_onecapture
needs to count codepoints for position capturesprepstate
needs a new argument to set ms->utf8
with build_result_table
: should this use codepoints in startpos
and endpos
? Or should UTF-8 functions add startcodepoint
and endcodepoint
?build_result_table
should probably also output a utf8
boolean fieldstr_find_aux
should handle init
as codepoints or delegate to posrelat
str_find_aux
should count on codepoints in do...while
condition (s++
)gmatch_aux
should advance one codepoint on empty matchstr_gsub_aux
should add a whole codepoint on no matchFunctions that are fine as is:
check_capture
capture_to_close
classend
(operates on bytes, but should play nice with proper UTF-8)max_expand
and min_expand
start_capture
and end_capture
(len
will be bytes, but that's needed for other functions and can be modified for codepoints where it's used)lmemfind
push_captures
nospecials
reprepstate
add_value
add_s
and matchobj_expand
(operate on bytes, but should play nice with proper UTF-8)See my travis-test repo. I can take advantage of caching on the Linux builds to save the Lua and LuaRocks installations, so that we don't have to redownload and recompile them every time. Unfortunately, the OS X workers don't support caching for public repositories, so I'll need to consider if that's worth keeping. On the plus side, caching would necessitate combining all three Lua versions into one build, so I'll no longer have to wait for the first two OS X builds to finish before the third one can boot.
A "nice to have" feature would be the ability to apply quantifiers to groups and not just individual characters/classes/sets. Probably not possible without a new MatchState (MatchStateE
for Enhanced?) that has space to track backtracking points.
The current code uses recursive calls over the match
function to handle backtracking (match
returns NULL
on no match, which either unwinds the stack all the way to signal no match at all or is caught somewhere down the stack by a caller that performs the backtracking and makes another recursive call). However, the quantifiers handle all their matches with only one recursive call, adding or subtracting another character each time NULL
is returned. Here, each iteration could match different characters and a different number of bytes, so doing it that way would require a separate recursive call for each match, easily triggering a stack overflow.
Need to study the code for Python's re
module here to understand how the matching works there; that would probably help a lot here.
Need to figure out AppVeyor testing, so I can run Windows builds alongside the Ubuntu and Mac builds on Travis CI.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.