Giter Site home page Giter Site logo

lua-matchext's People

Contributors

jcgoble3 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Forkers

dmabrothers

lua-matchext's Issues

Alternation within groups

The ability to use | for regex-style alternation within groups would be nice. Probably can use a modified version of match_balance to find the end of the group, then match until encountering | or ). On success, make a recursive call to match beginning at the end of the group. If that returns success, pass it up the chain, else on NULL backtrack to the start of the group in the source string and attempt matching in the next alternative.

The major problem will be correctly counting capture groups, including capture groups passed over in unused alternatives. Also, need to handle groups that don't participate in the match (return nils in match table, but how to handle gsub() and :expand()?).

Named captures

EDIT: Here's a proper task list:

  • add member to MatchState
  • modify start_capture() to parse the name
  • modify match() to parse and handle named and named-style numbered backreferences
  • modify build_result_table to check for and handle named captures
  • modify add_s() to handle named and named-style numbered backreferences
  • modify matchobj_expand() to handle named and named-style numbered backreferences
  • add documentation
  • for the love of all things programming, ADD TESTS!!!

Original:

Named captures would be awesome. Here's my idea: add a third member to the captures struct in MatchState, which would be an fixed array of char. start_capture would then read the upcoming characters, and if a named capture is detected, obtain the name and insert it into the array, followed by '\0'. (We can't just store a pointer in the MatchState, since then the array would expire when start_capture returns, or worse, when the inner if block ends. I'd rather not get into malloc stuff here.) Then it would proceed with the match as normal. In order to allow position captures to be named, the check for a position capture would be moved out of match and into start_capture.

Alternatively, a separate array can be added to MatchState, mapping group names to group numbers. This would simplify backreferences, but complicate result table building, and would also require either a NULL sentinel or an extra member tracking the number of named groups. I think this is better, though.

Table matching functions would include named captures in the table (obviously), but it's a bit tricky: because Lua (unlike Python) does not distinguish between indexing and attribute access, and some special fields are already defined. Thus, a new table, groups will be added, which will contain all captures, both numbered and named, instead of the main table. Next, to avoid having it shadowed by a capture named "expand", the expand method will be placed directly in the table. Finally, the table's __index metamethod will point to the groups subtable, thereby making all groups accessible by direct indexing, except for those that share the name of a special field. (Note that this will require a separate metatable for each result table, instead of a single metatable stored in the registry that all result tables share. expand will also have to be modified to pull from the groups subtable.

startpos and endpos would also carry the named fields, but fortunately they do not suffer from this problem. The documentation will officially recommend accessing named captures through groups rather than directly, especially when the name is unknown (e.g. user input). Using the groups subtable will also make pairs iteration over captures easier.

The syntax would be the same as most regex: (?<name>...). Excluding the subpattern (...) would turn it into a position capture. Backreferences within the pattern would use the PCRE syntax (but with Lua's escape character) %k<name>, while group references in a gsub replacement pattern would use simply %<name>. As a bonus, these backreference syntaxes can be overloaded to allow referencing numbered captures 10 or higher.

Valid group names would be anything that is a valid variable in Lua, but with a cap on length (maybe 15). Duplicate names would throw an error when encountered. Backreferences to non-existent groups would also throw an error.

Backwards-compatibility: I do not consider this proposed syntax to be backward incompatible with PUC-Rio Lua. In PUC-Lua, (? will open a capture and then match ? literally, but this is undefined. The manual defines ? as a special character which always has special meaning; the fact that it matches literally when following another special sequence (e.g. try it after another quantifier) is thus undefined behavior, and I do not consider a change in behavior that was previously undefined to be backwards-incompatible. Likewise, %< in a PUC-Lua replacement string would throw an error for an invalid escape; thus, all replacement strings that previously worked will continue to work. Finally, %k is currently not a defined character class, and thus the fact that it matches itself is undefined. Therefore, named captures can be included in the basic functions, even though their only use will be backreferences (since the basic functions don't have a means to return names).

(I think this explanation is more long-winded than the implementation will be. :P

Pre-compilation of enhanced patterns

For enhanced patterns covered by #9 (#6, #7, and #8) might it be a good idea to pre-compile these patterns to bytecode? It would likely make these features easier to implement. Again, need to study Python's re module code to see how this is done (in particular, how to build an array of unknown length of numbers; maybe also study Lua's source code compiler to see how it builds the bytecode?). Could also build one byte at a time using Lua's buffer facilities, then convert the string to a proper array of numbers.

Non-capturing groups

Support (?: for non-capturing groups. This would be a nice bonus to either #6 and/or #7, but has no use until then.

Multiline mode (anchors for individual lines)

A staple of any regex flavor. For the enhanced patterns (issue #9), this could simply be a flag passed as a separate argument that changes the meaning of ^ and $. But it could be implemented for the basic functions as well, if an appropriate letter code can be chosen.

Unfortunately, many letters are already taken by PUC Lua: a, b, c, d, f, g, l, p, s, u, w, x, and z. That's half the alphabet in use by stock patterns alone, not to mention that named captures (issue #14) have taken k. So we're running low on letters and numbers are not an option (backreferences). Also, I'm guaranteeing that all patterns written for stock Lua will remain compatible with the basic functions in this library as long as they do not rely on undefined behavior, and the 5.3 manual specifies that non-magic characters match themselves literally whether escaped or not, so new punctuation characters cannot be used even with an escape. So I may have to relegate this to enhanced patterns only.

"table" argument for gmatch

As it's a simple boolean integer, this argument probably should be stored directly in the GMatchState struct, rather than in an additional upvalue.

Merge Lua 5.3.3 changes from upstream

See http://www.lua.org/work/diffu-lua-5.3.2-lua-5.3.3.html (Ctrl+F for strlib). Of note: removal of non-linear complexity stuff, adjustments to handle a coroutine bug (see issue #17), and changes to how gmatch and gsub handle empty matches.

This last item is a bit tricky; technically this change can be made freely since the behavior is undefined, but many use cases depend on a specific behavior for empty matches (e.g. split functions) and so some code has come to rely on the existing behavior. To be consistent with upstream, the default functions ought to be changed. However, is there any good reason to retain the older behavior in a new set of functions, or with an optional flag to the existing functions? I don't think so, but it's worth considering.

Find out why "\0" characters still don't work in Lua 5.1

I mean, we've copied the entire Lua 5.3 pattern-matching source code, so why don't they work?

(Note: as the current behavior matches what Lua 5.1 users are used to, I consider the idea of getting nulls to work in 5.1 a feature rather than a bug. A bug is something that the user expects to work, but doesn't; while I, the developer, expected this to work, end users are just seeing the same behavior they're already used to, so it's not a bug, per se.)

Improved documentation

Should have proper documentation, not just some stuff in the README. I think LuaRocks does something special with anything in a docs/ folder?

Match tables

  • Complete the basic code
  • Should sub be renamed expand?
  • Should source be renamed string?
  • Promote from alpha (develop) to beta (master)
  • Add some docs to README?
  • Update description in rockspec

"escape" function

Provide an "escape" function that allows passing in a string and getting back a version will all non-alphanumeric characters escaped. Might be a good idea to restrict the escaping to non-alphanumeric characters that are ASCII and not control codes (c <= 126 && c >= 32 && !isalnum(c)), as control codes and non-ASCII characters will never be given special meaning and escaping non-ASCII characters could interfere with UTF-8 matching.

Expand quantifiers to include arbitrary {m,n} and lazy/possessive

It would be nice to support both greedy and lazy forms of {m,n} quantifiers, as well as a lazy version of '?'. * already has a direct lazy version as -, but depending on how this resolves, it might be nice to support *? as well for consistency. As the lazy form of a+ is aa-, necessitating repeating the token, a lazy form of + would be helpful as well, especially for group quantifiers (issue #6).

It would also be awesome to support possessive quantifiers, essentially a greedy quantifier that never backtracks or gives up a match. Not all regex flavors support this, but I think it should be fairly easy to implement.

UTF-8 matching

This would be a nice feature: match on UTF-8 codepoints instead of single bytes. Things to figure out:

What changes would be needed?

  • Probably should be a separate set of functions, but should I have umatch plus utmatch and so forth? Or is that too much, and I should just use table functions (in which case, is the t needed in the function names)? Thinking just table functions for this and all further "enhanced" features that can't be merged directly with the basic functions.
  • MatchState needs a new utf8 pseudo-boolean field indicating whether to perform byte matching or UTF-8 matching; changes below will only take effect when utf8 is true
  • init parameters should refer to codepoints instead of bytes (pull from stock utf8 module?)
  • posrelat might need changes, depending on how I handle init parameters
  • match_class should force ASCII meanings (we're not going to deal with Unicode meanings, just like the standard utf8 library)
  • match_class should also match on codepoints in the default branch
  • matchbracketclass will need a new implementation to match codepoints
  • singlematch should match a codepoint in the default branch when
  • matchbalance needs to handle codepoint arguments -- can ignore UTF-8 when all arguments are ASCII
  • match's default branch > else branch needs to handle codepoints on s++ and s + 1; function is otherwise high-level enough to be fine
  • push_onecapture needs to count codepoints for position captures
  • prepstate needs a new argument to set ms->utf8 with build_result_table: should this use codepoints in startpos and endpos? Or should UTF-8 functions add startcodepoint and endcodepoint?
  • build_result_table should probably also output a utf8 boolean field
  • str_find_aux should handle init as codepoints or delegate to posrelat
  • str_find_aux should count on codepoints in do...while condition (s++)
  • gmatch_aux should advance one codepoint on empty match
  • str_gsub_aux should add a whole codepoint on no match

Functions that are fine as is:

  • check_capture
  • capture_to_close
  • classend (operates on bytes, but should play nice with proper UTF-8)
  • max_expand and min_expand
  • start_capture and end_capture (len will be bytes, but that's needed for other functions and can be modified for codepoints where it's used)
  • lmemfind
  • push_captures
  • nospecials
  • reprepstate
  • add_value
  • add_s and matchobj_expand (operate on bytes, but should play nice with proper UTF-8)

Improve testing on Travis CI

See my travis-test repo. I can take advantage of caching on the Linux builds to save the Lua and LuaRocks installations, so that we don't have to redownload and recompile them every time. Unfortunately, the OS X workers don't support caching for public repositories, so I'll need to consider if that's worth keeping. On the plus side, caching would necessitate combining all three Lua versions into one build, so I'll no longer have to wait for the first two OS X builds to finish before the third one can boot.

Quantifiers on parenthetical groups

A "nice to have" feature would be the ability to apply quantifiers to groups and not just individual characters/classes/sets. Probably not possible without a new MatchState (MatchStateE for Enhanced?) that has space to track backtracking points.

The current code uses recursive calls over the match function to handle backtracking (match returns NULL on no match, which either unwinds the stack all the way to signal no match at all or is caught somewhere down the stack by a caller that performs the backtracking and makes another recursive call). However, the quantifiers handle all their matches with only one recursive call, adding or subtracting another character each time NULL is returned. Here, each iteration could match different characters and a different number of bytes, so doing it that way would require a separate recursive call for each match, easily triggering a stack overflow.

Need to study the code for Python's re module here to understand how the matching works there; that would probably help a lot here.

Start testing on AppVeyor

Need to figure out AppVeyor testing, so I can run Windows builds alongside the Ubuntu and Mac builds on Travis CI.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.