apple / swift-experimental-string-processing

An early experimental general-purpose pattern matching engine for Swift.

License: Apache License 2.0

Swift 98.45% C 1.23% CMake 0.14% Python 0.18%

swift-experimental-string-processing's Issues

Fails with release build configuration

The string processing package (SwiftPM) fails to build in release mode because the PEG prototype uses @testable import _StringProcessing to access the matching engine. It’s not urgent to fix, since we never release using SwiftPM, but I think we should either move PEG to a test target or expose the matching engine symbols as SPI for PEG to access.

Bring pretty printer up to date

The pretty printer converts a Regex defined in syntax into a RegexBuilder-style one, but it isn't up to date with the latest changes. This needs to be brought up to date, ideally with a validator that can take a regex and some input/output data, convert the regex to builder style, and then validate that both regexes have the same behavior.

For example, #"(?P)\w+ \d{2,} -"# converts to

Group(/* TODO: changeMatchingOptions */) {
  Concatenation {
    /* TODO: assertions */
    OneOrMore(.eager) {
      .wordCharacter
    }
    "-"
  }
}

instead of

Regex {
  Anchor.startOfInput
  OneOrMore(.eager) {
    CharacterClass.wordCharacter
  }
  "-"
}.asciiOnlyCharacterClasses()
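
The validator idea described above could be sketched roughly as follows. This is only an illustration, assuming Swift 5.7 or later with the released Regex and RegexBuilder APIs; the helper name haveSameFirstMatches is hypothetical, the pattern is a simple stand-in, and the check compares first-match ranges only, not captures.

```swift
import RegexBuilder

/// Hypothetical helper: true when both regexes produce the same first-match
/// range (or both fail to match) on every sample input.
func haveSameFirstMatches(
  _ lhs: some RegexComponent,
  _ rhs: some RegexComponent,
  on inputs: [String]
) -> Bool {
  inputs.allSatisfy { input in
    input.firstMatch(of: lhs)?.range == input.firstMatch(of: rhs)?.range
  }
}

// A regex compiled from syntax at run time (stand-in pattern).
let fromSyntax = try! Regex(#"^\w+-"#)

// A hand-written builder-style regex intended to behave the same way.
let fromBuilder = Regex {
  Anchor.startOfSubject
  OneOrMore(.word)
  "-"
}

let samples = ["abc-", "-abc", "a b-", ""]
print(haveSameFirstMatches(fromSyntax, fromBuilder, on: samples))
```

A real validator would presumably also compare capture contents and run the pretty printer's output rather than a hand-written builder.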

MatchingEngine Capabilities and Roadmap

Details

TODO

Regex feature status

The matching engine supports (modulo bugs) the following

Basic constructs: concatenation, alternation, non-capturing grouping, etc.
Literal constructs: scalar literals, quotes
Character classes, custom and built-in
- Including ranges, nested custom character classes, some set operations (e.g. subtraction), inversion, etc
Quantification
- eager, reluctant, possessive x* x*? x*+
- bounded and unbounded x+ x{n,m} x{n,} x{,m} x{n} (and kind variants)
Character properties: named characters, general category, most UCD properties
Assertions: built-in and custom
- Including anchors such as $, ^
- Including custom look-ahead assertions
Arbitrary consumer call-outs (Input, Range<Input.Index>) -> Input.Index?
- This is the basic extension point for library-driven pattern matching
- This is how character classes are currently implemented
Arbitrary assertion call-outs (Input, Input.Index, Range<Input.Index>) -> Bool
- This is how anchors are currently implemented
- Note the provided bounds, as assertions often deal with boundary conditions
Arbitrary value-producing callouts (CustomRegexComponent)
A function call stack for recursive PEG-style grammars
- Note: Backtracking properly manages this by restoring stack position
- Note: This is only very very lightly tested (mostly for PEGs)
- TODO: Hook up to (?R), etc.
Captures
Backreferences
Scripts and some missing Unicode scalar properties
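
The consumer and assertion call-out shapes listed above can be illustrated with a small sketch specialized to String. The typealias names and the two closures are hypothetical and are not the engine's actual declarations.

```swift
// Hypothetical aliases mirroring the listed signatures, with Input == String.
typealias Consumer = (String, Range<String.Index>) -> String.Index?
typealias Assertion = (String, String.Index, Range<String.Index>) -> Bool

// Consumes one ASCII digit at the start of `bounds` and returns the index
// after it, or nil to signal a local match failure.
let asciiDigit: Consumer = { input, bounds in
  guard bounds.lowerBound < bounds.upperBound,
        let value = input[bounds.lowerBound].asciiValue,
        (48...57).contains(value)
  else { return nil }
  return input.index(after: bounds.lowerBound)
}

// A zero-width check: true when `position` is the end of the searched bounds.
// It receives the bounds because it reasons about a boundary condition.
let endOfBounds: Assertion = { _, position, bounds in
  position == bounds.upperBound
}

let subject = "7x"
let wholeSubject = subject.startIndex..<subject.endIndex
let afterDigit = asciiDigit(subject, wholeSubject)
print(afterDigit == subject.index(after: subject.startIndex))
```

The consumer advances the position on success, while the assertion only answers yes or no without consuming input.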

The following has some corner-case known bugs in it

Backtracking to completely different function call stack
- Currently we restore a stack index, but it's not clear whether we need to restore the entire stack

The following are currently unsupported

All Unicode scalar properties
Atomic grouping
- Will need to figure out how best to play with the rest of the stdlib here
Custom look-behind assertions
Script runs and PCRE-style call outs
- We have engine support, but it isn't hooked up to syntax
- We will likely want something strongly-typed and better checked
Matching options
- Things like case-insensitivity, semantic mode switching, etc
Subpatterns
Conditional patterns
- (awaiting parser support)
Oniguruma style absent functions
Keep/reset (\K)

The following is undetermined

Grapheme-semantic mode switching and behavior/design
Options, especially controlling backtracking
Provisioning for the interpreter
How best to do word-boundary analysis

Performance

TODO

Better parser recovery

Currently we throw parser errors. While this is very convenient, we need to start recovering where we can, and still produce an AST.

`@RegexComponentBuilder` overloads for collection algorithms

Each collection/string algorithm should gain an overload that takes a @RegexComponentBuilder closure.

Support dynamically obtaining the capture count from `Regex`

extension RegexComponent {
    public var captureCount: Int { get }
}

Allow end-of-line comments in custom character classes

We currently don't parse end-of-line comments as comments in a character class such as:

let r = #/
[
  a # comment
]
/#

However ICU supports this.

`split` methods should have `maxSplits` and `omittingEmptySubsequences`

split methods should have maxSplits and omittingEmptySubsequences configurations to match their stdlib String counterpart:

split(separator:maxSplits:omittingEmptySubsequences:)
split(maxSplits:omittingEmptySubsequences:whereSeparator:)

https://developer.apple.com/documentation/swift/string/2894564-split
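
For reference, this is how the two parameters behave on the existing stdlib String.split with a Character separator. The sketch uses only the standard library and shows the behavior the regex-based variants would be expected to mirror.

```swift
let csv = "a,b,,c"
let comma: Character = ","

// Defaults: unlimited splits, empty subsequences omitted.
let defaults = csv.split(separator: comma)
print(defaults)   // ["a", "b", "c"]

// Keep the empty field between the two adjacent commas.
let keepEmpty = csv.split(separator: comma, omittingEmptySubsequences: false)
print(keepEmpty)  // ["a", "b", "", "c"]

// At most one split; the remainder is returned unsplit.
let firstOnly = csv.split(separator: comma, maxSplits: 1)
print(firstOnly)  // ["a", "b,,c"]
```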

Type-erasing `Regex<AnyRegexOutput>.init(_:)` is not implemented

Throws a fatal error with FIXME: Not implemented

I'm raising this issue as a request not to drop this functionality (however unlikely that is).
It will have its uses!

Strip extra protocols / implementations from algorithms

For performance, see what we can remove from implementations.

_RegexParser fails to build with a 5.6 toolchain

When doing a --bootstrapping=hosttools Swift compiler build with Xcode 13.4, you get the error:

/Users/hamish/src/swift-dev/swift-experimental-string-processing/Sources/_RegexParser/Utility/TypeConstruction.swift:63:64: error: cannot infer return type for closure with multiple statements; add explicit type to disambiguate
    let result = elementTypes.withContiguousStorageIfAvailable { elementTypesBuffer in
                                                               ^
                                                                                    -> <#Result#>

We should ideally allow _RegexParser to compile with a 5.6 toolchain.

Substring matches base?

From the forums

    let regex = Regex { OneOrMore(.any) }
    print("abc".wholeMatch(of: regex)!.0) // prints "abc", as expected
    print("abc".suffix(1).wholeMatch(of: regex)!.0) // also prints "abc"

Reject quantifiers on zero-width assertions

A repeated assertion, like \b+ causes an infinite loop, since it checks the same position over and over without advancing in the input. We should reject regexes with this pattern and prevent RegexBuilder regexes from creating them.
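
A rejection check of this kind might look like the sketch below. The Node type is a hypothetical, heavily simplified pattern tree and not the package's real AST. The check only catches quantifiers whose operand can never consume input; it does not handle operands that merely may match empty.

```swift
// Hypothetical simplified pattern tree for illustration only.
indirect enum Node {
  case literal(String)
  case wordBoundary            // zero-width assertion such as \b
  case concatenation([Node])
  case oneOrMore(Node)         // unbounded quantifier such as x+
}

/// True when the node can never consume any input.
func isZeroWidth(_ node: Node) -> Bool {
  switch node {
  case .literal(let text): return text.isEmpty
  case .wordBoundary: return true
  case .concatenation(let children): return children.allSatisfy(isZeroWidth)
  case .oneOrMore(let child): return isZeroWidth(child)
  }
}

/// True when some quantifier in the tree wraps a zero-width operand.
func hasQuantifiedZeroWidth(_ node: Node) -> Bool {
  switch node {
  case .literal, .wordBoundary:
    return false
  case .concatenation(let children):
    return children.contains(where: hasQuantifiedZeroWidth)
  case .oneOrMore(let child):
    return isZeroWidth(child) || hasQuantifiedZeroWidth(child)
  }
}

// \b+ is flagged; ab+ is not.
print(hasQuantifiedZeroWidth(.oneOrMore(.wordBoundary)))
print(hasQuantifiedZeroWidth(.concatenation([.literal("a"), .oneOrMore(.literal("b"))])))
```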

AST post-processing

This issue tracks logic that needs to be implemented after the parser has produced an AST.

To be implemented:

Group reference validity checking (as group references may refer to groups that come after them)
- This will require deciding how to handle the ambiguity with the syntax (?(xxx)). PCRE always treats this as a named reference. .NET only treats it as a named reference if there is a group defined with that name, otherwise it treats it as an arbitrary regex condition. It's possible we may want to require users explicitly spell named references (?('xxx')) to avoid this ambiguity.
Errors for AST nodes that are unsupported by the matching engine
Warnings for syntax that has a better spelling

Unaligned store in `CaptureStructure.encode(to:)`

https://gist.github.com/rjmccall/28fa7ff66402fddde4114970201f7da5

Reproducible with an asan build. (Thanks @rjmccall!)

    buffer.storeBytes(
      of: Self.currentSerializationVersion, as: SerializationVersion.self)

Whitespace in CCCs in extended syntax causes trap when matching

The following traps in Processor.cycle(), due to the unescaped space inside the custom character class:

try! Regex(compiling: #"(?xx)[ \t]+"#).matchWhole(" \t ")

Removing the space ([\t]) or escaping it ([\ \t]) resolves the crash.

Flatten optionality of regex captures

We currently nest optionals for regex literal captures, we should flatten them out to match the regex literal proposal.

Case insensitivity behavior of character class ranges differs from PCRE

PCRE allows (?i)[B-a], which matches e.g B, [, b, and A. However, we crash at runtime.

rdar://96898279

Diagnostic Improvements

This tracks some improvements we can make to the parser diagnostics we currently emit

Add octal fix-it for e.g [\7]
For e.g \p{otherLowercase}, redirect users to \p{isAlphabetic}

Implement parser warnings

We should start emitting warnings for at least these cases:

Unnecessary escaping (in e.g extended literals)
Cases where - is treated as literal in a custom character class
When ] is treated as literal when it is the first member of a custom character class
Certain non-canonical syntax?

Engine limiters and checking task cancellation

During proposal reviews, concerns about RDoS and responsiveness came up. We should add engine limiters that halt execution once some threshold is exceeded, measured either in time or in the number of bytecode instructions executed (possibly proportional to input length). Additionally, we should call Task.checkCancellation() in case the parent task has been cancelled.

Basic limiter infrastructure based on engine cycle count
Occasionally (say every thousand bytecode instructions executed) check cancellation
Testing infrastructure around limiters
Limiters based on input length
Surface as API

From https://forums.swift.org/t/se-0350-regex-type-and-overview/56530/19:

In the meantime, before that is available, would it be possible to include some calls to Task.checkCancellation() during the evaluation of the match? The use case I'm thinking of for this for evaluating a user-specified Regex in an interactive application. If the match operation is taking too long, it should be possible for the user to cancel this.
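
A cycle-count limiter with periodic cancellation checks could be sketched like this. EngineLimiter, CycleLimitExceeded, and run(instructions:limiter:) are hypothetical names, and the loop is a stand-in for the real engine; only Task.checkCancellation() is an existing API.

```swift
struct CycleLimitExceeded: Error {}

// Hypothetical limiter; the thresholds are illustrative.
struct EngineLimiter {
  let maxCycles: Int
  let cancellationCheckInterval: Int
  private(set) var cycles = 0

  init(maxCycles: Int, cancellationCheckInterval: Int = 1_000) {
    self.maxCycles = maxCycles
    self.cancellationCheckInterval = cancellationCheckInterval
  }

  /// Call once per executed instruction.
  mutating func tick() throws {
    cycles += 1
    if cycles > maxCycles { throw CycleLimitExceeded() }
    if cycles % cancellationCheckInterval == 0 {
      // Throws CancellationError if the current task has been cancelled.
      try Task.checkCancellation()
    }
  }
}

/// Stand-in for an engine loop executing `instructions` bytecode steps.
func run(instructions: Int, limiter: inout EngineLimiter) throws {
  for _ in 0..<instructions {
    try limiter.tick()
  }
}

var limiter = EngineLimiter(maxCycles: 10)
do {
  try run(instructions: 11, limiter: &limiter)
} catch {
  print("halted after \(limiter.cycles) cycles: \(error)")
}
```

Checking cancellation only every so many cycles keeps the common path to an increment and two comparisons.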

`\N{name}` needs to use fuzzy matching

The name in a \N{...} block should use fuzzy matching instead of exact equality.

"👩‍👩‍👧‍👦".contains(/(?u).\N{ZERO WIDTH JOINER}/)   // true
"👩‍👩‍👧‍👦".contains(/(?u).\N{ZeroWidthJoiner}/)     // false, should be true

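A loose comparison for names could be sketched as below. This is a simplified approximation of the loose matching rule in UAX #44 (ignore case, whitespace, and underscores); the hyphen handling that the rule also specifies is deliberately left out, and the function names are hypothetical.

```swift
/// Normalizes a Unicode scalar name for loose comparison by lowercasing it
/// and dropping whitespace and underscores. Hyphens are left untouched.
func looseKey(_ name: String) -> String {
  name.lowercased().filter { !$0.isWhitespace && $0 != "_" }
}

/// True when two names are equal under the simplified loose rule.
func looselyEqual(_ lhs: String, _ rhs: String) -> Bool {
  looseKey(lhs) == looseKey(rhs)
}

print(looselyEqual("ZERO WIDTH JOINER", "ZeroWidthJoiner"))
print(looselyEqual("ZERO WIDTH JOINER", "ZERO WIDTH NON-JOINER"))
```
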
Adopt swift-atomics as a dependency

If we take the approach in #457 of subsetting a sliver of swift-atomics, we should switch to rely on the actual package when we're able to do so. (i.e. when that doesn't pose problems for integration into the compiler, etc.)

Fix matching of `\N`

\N currently uses _CharacterClassModel and is defined as an inverted newline sequence. However this doesn't seem correct as it shouldn't be affected by the options that change what \R matches. It seems like it ought to use emitAny() instead, as it should be identical to . except not being affected by (?s) (i.e never matching a newline).

`Regex` is only available in macOS 9999 or newer

To try out the functionality provided here, download the latest open source development toolchain. Import _StringProcessing in your source file to get access to the API and specify -Xfrontend -enable-experimental-string-processing to get access to the literals.

Following these steps from the README doesn't seem to be all that is necessary to try out the Regex features with the swift-DEVELOPMENT-SNAPSHOT-2022-04-21-a-osx toolchain.

Decide on repeated empty match behavior

Does "1".matches(of: /(|\d)/) match twice or three times?

Build warning "'matchLevel' is deprecated"

This warning has been around for a while in both the package and the compiler builds. It would be good to clear it.

Sources/_StringProcessing/ConsumerInterface.swift:139:12: warning build: 'matchLevel' is deprecated

Implement case-folded comparisons as stdlib SPI

Following on the behavior improvement in #383, we should provide case-folded comparisons as stdlib SPI. We should be able to skip UTF-8 encoding and do character-by-character canonicalization much more efficiently that way.

Pitch and Proposal Status

Regex type and overview

TODO: Draft/PR/Thread

Presents basic Regex type and gives an overview of how everything fits into the overall story

TODO: Should we pull more into this topic, perhaps either typed captures or more run-time creation aspects?

Run-time Regex construction

Pitch thread: Regex Syntax
- Brief: Syntactic superset of PCRE2, Oniguruma, ICU, UTS#18, etc.
TODO: Pull in discussion of initializers, extended syntaxes, and AnyRegexOutput into this thread

Covers the "interior" syntax, extended syntaxes, run-time construction of a regex from a string, and details of AnyRegexOutput.

Regex literals

TODO: Thread
Draft: delimiters
(Old) original pitch:
- Thread
- Update

Covers the choice of regex delimiter and integration into the Swift programming language.

TODO: Should we pull more into this topic? E.g. introducing typed captures here?

Regex builder DSL

Pitch thread

Covers the result builder approach and basic API.

String processing algorithms

Pitch thread

Proposes a slew of Regex-powered algorithms.

Introduces CustomMatchingRegexComponent, which is a monadic-parser style interface for external parsers to be used as components of a regex.

Unicode for String Processing

TODO: Where is this at @natecook1000?

Thread on Swift Forums

Covers three topics:

Proposes literal and DSL API for library-defined character classes, Unicode scripts and properties, and custom character classes.
Proposes literal and DSL API for options that affect matching behavior.
Defines how Unicode scalar-based classes are extended to grapheme clusters in the different semantic and other matching modes.

TODO:

Pitch API for character classes/scripts/etc. (by 2/25)
Pitch option API (by 2/25)
Verify grapheme-break semantics (by 2/28)

Target date: Proposal ready for review by 3/7

⏳ SE-0348: `buildPartialBlock` for result builders

(Old) Overview

Thread

Introduces our general approach: regex literals and result builders together.

(Old) Pitch: Regular expression literals

(Old) Pitch: Strongly typed regex captures

Thread

Presents the basic result builder approach of putting type arity and kind of capture in generic type parameter position.

The pitch proposes that the whole match ("capture 0") also be a generic parameter, pending a deeper look into details concerning variadic generics and/or additional result builder API.

TODO: Probably makes sense to slurp this up into one of the other pitches.

Look at `resetStartOfMatch` behavior

How are we handling this via DSL? How does it impact replacement?

Non-semantic whitespace doesn't work for character class ranges

The following doesn't parse correctly:

(?x)[ a - c ]

Regex API Status

TBD

Regex parser should be wary of combining characters

The regex parser operates on Character, which is fine, but it means that a program could put combining scalars after a metacharacter, and the resulting grapheme cluster will not compare equal to that metacharacter. We have a few options:

We process scalars instead
We error out for any multi-scalar grapheme cluster that starts with a metacharacter scalar

The latter seems simpler and there's an easy (and highly advisable!) fall back path of representing the combining scalar through an escape.
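
The problem, and the second option, can be seen with the standard library alone. In this sketch the metacharacter set and the helper name are hypothetical and cover only a few characters.

```swift
// "(" followed by U+0301 COMBINING ACUTE ACCENT forms one grapheme cluster.
let text = "(\u{301}"
let firstCharacter = text.first!
let firstScalar = text.unicodeScalars.first!

print(text.count)             // 1
print(firstCharacter == "(")  // false
print(firstScalar == "(")     // true

// Hypothetical, incomplete set of metacharacter scalars.
let metacharacters: Set<Unicode.Scalar> = ["(", ")", "[", "]", "|", "*", "+", "?"]

/// True for a multi-scalar grapheme cluster whose first scalar is a metacharacter.
func startsWithDecoratedMetacharacter(_ character: Character) -> Bool {
  let scalars = character.unicodeScalars
  return scalars.count > 1 && metacharacters.contains(scalars.first!)
}

print(startsWithDecoratedMetacharacter(firstCharacter))
```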

Parser should error out for unsupported runtime features

Right now we will error out during regex-compilation for unsupported features or combinations, but the parser should also surface these errors for swift-compilation time.

Regex literals with invalid `\N{...}` names should not compile

UTS18 makes a distinction between providing an invalid name in a Unicode property character class (\p{name=...}) and an individual named character (\N{...}). If a programmer uses an invalid name in a property character class, the expression should compile and that character class should simply not match anything:

"🐯".contains(/\p{name=TIGER FACE}/)     // true
"🐯".contains(/\p{name=TIEGR FACE}/)     // false

However, an invalid name given in a \N{...} named character should be a syntax/compilation error:

"🐯".contains(/\N{TIEGR FACE}/)          // error: Invalid Unicode scalar name

See https://unicode.org/reports/tr18/#Individually_Named_Characters

Implement the special Java character properties

These character properties are described in the regex syntax proposal but are not implemented:

... javaLowerCase, javaUpperCase, javaWhitespace, javaMirrored.

Implement named backreference support

Named backreferences, e.g. (?<x>)\k<x>, are not currently supported, but it seems we ought to be able to look up the capture index in the capture list and emit the same instruction as for a numbered backreference (i.e. \1).
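
The lookup could be as simple as the following sketch, which assumes a hypothetical capture list stored as one optional name per group in declaration order. The real capture list type in the package may look different.

```swift
/// Hypothetical lookup: returns the 1-based group number for `name`,
/// or nil when no group has that name. Capture 0 is the whole match,
/// so the first group is number 1.
func captureNumber(named name: String, in captureNames: [String?]) -> Int? {
  captureNames.firstIndex(of: name).map { $0 + 1 }
}

// (?<x>) declares a single group named "x", so \k<x> resolves like \1.
print(captureNumber(named: "x", in: ["x"]) as Any)
```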

Support obtaining captures by name on `AnyRegexOutput`

extension Regex.Match where Output == AnyRegexOutput {
    public subscript(_ name: String) -> AnyRegexOutput.Element { get }
}

extension AnyRegexOutput {
    public subscript(_ name: String) -> AnyRegexOutput.Element { get }
}

Regex DSL Status

This tracks the status and progress of built-in result builder DSL APIs. It doesn't necessarily reflect other related API, the protocols used for library extension, or API details beyond result builders such as options and custom character classes.

See #63 and #132

Current Status

Needed

Conditional patterns
Pattern inversions
Options and option scopes
Redundantly named captures, branch-reset, etc
Built-in properties
Balancing groups
Recursion level for backreferences
Named captures
TBD

Type system concerns / impact

TBD

Structural and flat captures

TBD

Allow empty inline comments

We currently reject (?#), but it should be allowed.

API design: Custom abort errors and API threading

For our extension points, including tryCapture, returning nil signals a local matching failure and triggers backtracking, while a thrown error aborts matching. How should we surface thrown errors, and are they worth threading through our API?

Update integration doc on README

The integration section in the README needs an update.

Remove _CUnicode module as it no longer exists.
Add RegexBuilder as another "specially integrated module".

Require unique names for named captures

According to regex101, PCRE2 requires a unique name for each capture, but the current implementation of Regex.init(compiling:) doesn't throw an error when there are duplicate names.

Digit matching behaving as intended?

Reposted from the Swift forums: https://forums.swift.org/t/bad-digit-matching-bugreport-regarding-se-0354-regex-literals/57262/1

Problem: Some digit character groups match number-like grapheme clusters.

// this matches:
try /[1-2]/.wholeMatch(in: "1️⃣")

// still matches:
try /[1-2]/.asciiOnlyDigits().wholeMatch(in: "1️⃣")

// does not match:
try /[12]/.wholeMatch(in: "1️⃣")

The behavior described above seems inconsistent and difficult to predict. Shouldn't [1-2] and [12] be identical? Should they match anything outside of ASCII?

Note: 1️⃣ is U+0031 (ascii digit 1) U+FE0F (VARIATION SELECTOR-16) U+20E3 (COMBINING ENCLOSING KEYCAP)

Same is true for 1︎⃣: U+0031 (ascii digit 1) U+FE0E (VARIATION SELECTOR-15) U+20E3 (COMBINING ENCLOSING KEYCAP)

rdar://96898279
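
The scalar breakdown in the note can be checked with the standard library. The range comparison at the end shows one possible explanation for the discrepancy: if the range were evaluated with Character ordering, the keycap would fall inside it. That is an assumption for illustration and is not confirmed by the report.

```swift
let keycap: Character = "1\u{FE0F}\u{20E3}"   // 1️⃣

// Three scalars make up the single Character.
let scalarValues = keycap.unicodeScalars.map {
  String($0.value, radix: 16, uppercase: true)
}
print(scalarValues)   // ["31", "FE0F", "20E3"]

// Under Character ordering the keycap sorts between "1" and "2" ...
let oneToTwo: ClosedRange<Character> = "1"..."2"
let inRange = oneToTwo.contains(keycap)

// ... but it is equal to neither member of the set [12].
let isMember = keycap == "1" || keycap == "2"

print(inRange, isMember)
```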

Near-Future Work

I want to gather up many areas of near-future work that we've been clarifying through the proposal reviews.

Loose categorization:

Language and integration

Ability to use a String-backed, CaseIterable enum as a regex component
Define error types for compilation and type mismatches
Callouts from literals
A Regex-backed enum that will construct a ChoiceOf all cases in order

API

Ability to map over a regex, perhaps per-capture, to supply post-processing transforms at regex declaration time
A modifier on a regex to convert it to matches-anywhere semantics
- E.g. regex.matchingAnywhere => Regex { /.*?/ ; regex ; /.*/ }.
- But we'd preserve the matched range, i.e. reset start/end position
Character alignment queries
- API for whether start/end is Character-aligned for whole match and each capture
API to query options (e.g. is this case insensitive?)
API for (?n), could be nice to strip out captures you don't care about, especially for type erased regexes.
- compilation error if there are back-references or if it changes the semantics of the program

Algorithms

Add a replace(_:withTemplate:) method that recognizes $1 or \1 placeholders
A separator-preserving split variant
Suffix / from-the-end operations (trim etc)
Customize search

String and Unicode

Add unsupported Unicode properties to Unicode.Properties and support in regexes
Add Unicode.AllScalars as a public type (semi-tangential)
Add var Substring.range: Range<String.Index> to simplify getting the range of a capture group
Inits for making a NFC string from UTF-8
String.lines() and String.words()
Add option for canonical equivalence in scalar-semantic mode

Dynamic Regex API

Add a capture-description API to all regexes
- some RAC of capture, which has a type and optionality
Missing match conversions
- Regex<T>.Match.init?(_:ARO)
- Regex<T>.Match.init?(_:Regex<ARO>.Match)

Builders

A high-level helper for separated/quoted repetitions, e.g Repeat(separator: \.whitespace) { ... }
A helper for repeated matching lookahead and negative lookahead, e.g. Repeat(while:) Repeat(whileNot:)
- Until(negLookaheadCondition) { ... }
A func compile() throws to explicitly trigger compilation and get errors, such as quantifying the unquantifiable
- This is useful when composing regexes together to check the final result instead of trapping at run time.
Default Reference capture type to Substring.self

Engine

Engine limiters, low-level backtracking control and timeouts
- #262
Provide a way to access all values of a repeated capture (e.g. subscribe)
Conditionals (?(x)...) (requires updated parsing)
Quoted string inside custom character classes (e.g. [a-z\q{ch}])

Parser

Support for duplicate group names through (?J) (requires figuring out typed captures)
Support for branch reset alternations (?|) (parsing is implemented, but requires figuring out typed captures)
Parsing of conditionals (?(x)...) in accordance with what is in the syntax proposal (we currently parse the condition differently)
- Including interpolation conditions (?(?{...}))
- Conditional conditions don't capture on their own, only for child nodes e.g (?((x))x). .NET also forbids named capture conditions, we should ban that.
- Stop parsing named reference conditions for (?(x)...)
- Don't allow (?(DEFINE)) to have a false branch
Support for regex property values \p{key=/regex/}
Support for transform matching e.g \p{toNFKC_Casefold=@toNFKC@}
Support for alternative character property separators?
- UTS#18 suggests key≠value, key!=value
- Perl allows key:value
Support a** syntax as explicitly eager quantification
- I.e. it's not affected by API to change default quantification kind, (probably) not affected by (?U)

Failing ParseableInterface test

_MatchingEngine is failing the ParseableInterface test in the Swift repo.

/Volumes/Media/Development/Swift/swift-source/build/Ninja-ReleaseAssert/swift-macosx-x86_64/lib/swift/macosx/_MatchingEngine.swiftmodule/x86_64-apple-macos.swiftinterface:1460:1: error: type 'TypedIndex<C, 👻>' does not conform to protocol 'RangeReplaceableCollection'
extension _MatchingEngine.TypedIndex : Swift.RangeReplaceableCollection where C : Swift.RangeReplaceableCollection {
^
/Volumes/Media/Development/Swift/swift-source/build/Ninja-ReleaseAssert/swift-macosx-x86_64/lib/swift/macosx/_MatchingEngine.swiftmodule/x86_64-apple-macos.swiftinterface:1460:1: error: unavailable instance method 'replaceSubrange(_:with:)' was used to satisfy a requirement of protocol 'RangeReplaceableCollection'
extension _MatchingEngine.TypedIndex : Swift.RangeReplaceableCollection where C : Swift.RangeReplaceableCollection {
^
Swift.RangeReplaceableCollection:4:26: note: 'replaceSubrange(_:with:)' declared here
    public mutating func replaceSubrange<C>(_ subrange: Range<Self.Index>, with newElements: C) where C : Collection, Self.Element == C.Element
                         ^
Swift.RangeReplaceableCollection:4:19: note: requirement 'replaceSubrange(_:with:)' declared here
    mutating func replaceSubrange<C>(_ subrange: Range<Self.Index>, with newElements: __owned C) where C : Collection, Self.Element == C.Element
                  ^
/Volumes/Media/Development/Swift/swift-source/build/Ninja-ReleaseAssert/swift-macosx-x86_64/lib/swift/macosx/_MatchingEngine.swiftmodule/x86_64-apple-macos.swiftinterface:1464:14: error: no exact matches in call to instance method 'replaceSubrange'
    rawValue.replaceSubrange(rawRange, with: newElements)
             ^
Swift.RangeReplaceableCollection:4:19: note: candidate requires that the types 'C.Element' and 'C.Element' be equivalent (requirement specified as 'Self.Element' == 'C.Element')
    mutating func replaceSubrange<C>(_ subrange: Range<Self.Index>, with newElements: __owned C) where C : Collection, Self.Element == C.Element
                  ^
Swift.RangeReplaceableCollection:2:37: note: candidate requires that the types 'C.Element' and 'C.Element' be equivalent (requirement specified as 'Self.Element' == 'C.Element')
    @inlinable public mutating func replaceSubrange<C, R>(_ subrange: R, with newElements: __owned C) where C : Collection, R : RangeExpression, Self.Element == C.Element, Self.Index == R.Bound
                                    ^
/Volumes/Media/Development/Swift/swift-source/build/Ninja-ReleaseAssert/swift-macosx-x86_64/lib/swift/macosx/_MatchingEngine.swiftmodule/x86_64-apple-macos.swiftinterface:1:1: error: failed to build module '_MatchingEngine' for importation due to the errors above; the textual interface may be broken by project issues or a compiler bug
// swift-interface-format-version: 1.0
^

--

********************
********************
Failed Tests (1):
  Swift-validation(macosx-x86_64) :: ParseableInterface/verify_all_overlays.py

`Regex.Match` element accessors should not materialize the whole output

Currently Regex.Match element accessors are implemented this way:

  /// Lookup a capture by name or number
  public subscript<T>(dynamicMember keyPath: KeyPath<Output, T>) -> T {
    output[keyPath: keyPath]
  }

  // Allows `.0` when `Match` is not a tuple.
  @_disfavoredOverload
  public subscript(
    dynamicMember keyPath: KeyPath<(Output, _doNotUse: ()), Output>
  ) -> Output {
    output
  }

This is not correct as we should not materialize the entire output.

Adopt lightweight generics for generic functions

Functions that take a generic argument should adopt lightweight generics and refer to the type directly as the argument type.

For example,

public func contains<C: Collection>(_ other: C) -> Bool

would be

public func contains(_ other: some Collection<Element>) -> Bool

Adoption of standard library types is blocked by swiftlang/swift#41843
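
The two spellings can be compared side by side in a small sketch, assuming Swift 5.7 or later where Collection declares Element as a primary associated type. The method names beginsGeneric and beginsOpaque are hypothetical.

```swift
extension Collection where Element: Equatable {
  // Traditional spelling with a named generic parameter and a where clause.
  func beginsGeneric<C: Collection>(with other: C) -> Bool
  where C.Element == Element {
    starts(with: other)
  }

  // Same constraint written with an opaque parameter type.
  func beginsOpaque(with other: some Collection<Element>) -> Bool {
    starts(with: other)
  }
}

print([1, 2, 3].beginsGeneric(with: [1, 2]))
print([1, 2, 3].beginsOpaque(with: 1...2))
```

Both methods accept the same arguments; the second form simply drops the named type parameter.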

Syntax Status and Roadmap

For the regex literal syntax, we're looking at supporting a syntactic superset of:

PCRE2, an "industry standard" of sorts, and a rough superset of Perl, Python, etc.
Oniguruma, an internationalization-oriented engine with some modern features
ICU, used by NSRegularExpression, a Unicode-focused engine
Our interpretation of UTS#18's guidance, which is about semantics, but we can infer syntactic feature sets.
TODO: .NET, which has delimiter-balancing and some interesting minor details on conditional patterns

These aren't all strictly compatible (e.g. a set operator in PCRE2 would just be a redundant statement of a set member). We can explore adding strict compatibility modes, but in general the syntactic superset is fairly straightforward.

Status

The below are (roughly) implemented. There may be bugs, but we have some support and some testing coverage:

Alternations a|b
Capture groups e.g (x), (?:x), (?<name>x)
Escaped character sequences e.g \n, \a
Unicode scalars e.g \u{...}, \x{...}, \uHHHH
Builtin character classes e.g ., \d, \w, \s
Custom character classes [...], including binary operators &&, ~~, --
Quantifiers x?, x+, x*, x{n,m}
Anchors e.g \b, ^, $
Quoted sequences \Q ... \E
Comments (?#comment)
Character properties \p{...}, [:...:]
Named characters \N{...}, \N{U+hh}
Lookahead and lookbehind e.g (?=), (?!), (*pla:), (?*...), (?<*...), (*napla:...)
Script runs e.g (*script_run:...), (*sr:...), (*atomic_script_run:...), (*asr:...)
Octal sequences \ddd, \o{...}
Backreferences e.g \1, \g2, \g{2}, \k<name>, \k'name', \g{name}, \k{name}, (?P=name)
Matching options e.g (?m), (?-i), (?:si), (?^m)
Sub-patterns e.g \g<n>, \g'n', (?R), (?1), (?&name), (?P>name)
Conditional patterns e.g (?(R)...), (?(n)...), (?(<n>)...), (?('n')...), (?(condition)then|else)
PCRE callouts e.g (?C2), (?C"text")
PCRE backtracking directives e.g (*ACCEPT), (*SKIP:NAME)
[.NET] Balancing group definitions (?<name1-name2>...)
[Oniguruma] Recursion level for backreferences e.g \k<n+level>, (?(n+level))
[Oniguruma] Extended callout syntax e.g (?{...}), (*name)
- NOTE: In Perl, (?{...}) has in-line code in it, we could consider the same (for now, we just parse an arbitrary string)
[Oniguruma] Absent functions e.g (?~absent)
PCRE global matching options e.g (*LIMIT_MATCH=d), (*LF)
Extended-mode (?x)/(?xx) syntax allowing for non-semantic whitespace and end-of-line comments abc # comment

Experimental syntax

Additionally, we have (even more experimental) support for some syntactic conveniences, if specified. Note that each of these (except perhaps ranges) may introduce a syntactic incompatibility with existing traditional-syntax regexes. Thus, they are mostly illustrative, showing what happens and where we go as we slide down this "slippery slope".

Non-semantic whitespace: /a b c/ === /abc/
Modern quotes: /"a.b"/ === /\Qa.b\E/
Swift style ranges: /a{2..<10} b{...3}/ === /a{2,9}b{0,3}/
Non-captures: /a (_: b) c/ === /a(?:b)c/
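
The mapping from Swift-style ranges to traditional quantifier bounds shown above can be sketched as two small functions. The function name quantifier is hypothetical and this is not how the parser implements it.

```swift
/// Renders a half-open range such as 2..<10 as the traditional {2,9}.
func quantifier(_ range: Range<Int>) -> String {
  precondition(!range.isEmpty, "an empty range has no traditional equivalent")
  return "{\(range.lowerBound),\(range.upperBound - 1)}"
}

/// Renders an inclusive upper bound such as ...3 as the traditional {0,3}.
func quantifier(_ range: PartialRangeThrough<Int>) -> String {
  "{0,\(range.upperBound)}"
}

print("a" + quantifier(2..<10) + "b" + quantifier(...3))   // a{2,9}b{0,3}
```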

TBD:

Modern named captures: /a (name: b) c/ === /a(?<name>b)c/
Modern comments using /* comment */ or // comment instead of (?#. comment)
Multi-line expressions
- Line-terminating comments as // comment
Full Swift-lexed comments, string literals as quotes (includes raw and interpolation), etc.
- Makes sense to add as we suck actual literal lexing through our wormhole in the compiler

Swift's syntactic additions

Options for selecting a semantic level
- X: grapheme cluster semantics
- O: Unicode scalar semantics
- b: byte semantics

Source location tracking

Implemented:

Location of | in alternation
Location of - in [a-f]