swift-experimental-string-processing's Introduction

Declarative String Processing for Swift

An early experimental general-purpose pattern matching engine for Swift.

See Declarative String Processing Overview

Requirements

Trying it out

To try out the functionality provided here, download the latest open source development toolchain. Import _StringProcessing in your source file to get access to the API and specify -Xfrontend -enable-experimental-string-processing to get access to the literals.

For example, in a Package.swift file's target declaration:

.target(
    name: "foo",
    dependencies: ["depA"],
    swiftSettings: [.unsafeFlags(["-Xfrontend", "-enable-experimental-string-processing"])]
 ),

Integration with Swift

_RegexParser and _StringProcessing are specially integrated modules that are built as part of apple/swift.

Specifically, _RegexParser contains the parser for regular expression literals and is built both as part of the compiler and as a core library. _CUnicode and _StringProcessing are built together as a core library named _StringProcessing.

Module            | Swift toolchain component
------------------|--------------------------
_RegexParser      | SwiftCompilerSources/Sources/_RegexParser and stdlib/public/_RegexParser
_CUnicode         | stdlib/public/_StringProcessing
_StringProcessing | stdlib/public/_StringProcessing

Branching scheme

Development branch

The main branch is the branch for day-to-day development. Generally, you should create PRs against this branch.

Swift integration branches

Branches whose name starts with swift/ are Swift integration branches similar to those in apple/llvm-project. For each branch, dropping the swift/ prefix is the corresponding branch in apple/swift.

apple/swift branch | apple/swift-experimental-string-processing branch
-------------------|--------------------------------------------------
main               | swift/main
release/5.7        | swift/release/5.7
...                | swift/...

A pair of corresponding branches are expected to build successfully together and pass all tests.
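The naming rule above is mechanical enough to express as a small helper. This is only an illustrative sketch; `correspondingSwiftBranch(for:)` is a hypothetical name and not part of any tooling in this repository.

```swift
/// Returns the apple/swift branch that pairs with a Swift integration branch,
/// or nil when the name does not start with "swift/".
/// (Hypothetical helper, written only to illustrate the naming rule.)
func correspondingSwiftBranch(for integrationBranch: String) -> String? {
    let prefix = "swift/"
    guard integrationBranch.hasPrefix(prefix) else { return nil }
    // Dropping the prefix yields the apple/swift branch name.
    return String(integrationBranch.dropFirst(prefix.count))
}
```

For example, `swift/release/5.7` maps to `release/5.7`, while `main` (the development branch) has no counterpart under this rule.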

Integration workflow

To integrate the latest changes in apple/swift-experimental-string-processing to apple/swift, carefully follow the workflow:

  • Create pull requests.
    • Create a branch from a commit on main that you would like to integrate into swift/main.
    • Create a pull request in apple/swift-experimental-string-processing from that branch to swift/main, e.g. "[Integration] main () -> swift/main".
    • If apple/swift needs to be modified to work with the latest main in apple/swift-experimental-string-processing, create a pull request in apple/swift. Note: Since CI in apple/swift-experimental-string-processing has not yet been set up to run full toolchain tests, you should create a PR in apple/swift regardless; if the integration does not require changing apple/swift, create a dummy PR in apple/swift by changing the README, and do not merge it in the end.
  • Trigger CI.
    • In the apple/swift-experimental-string-processing pull request, trigger CI using the following command (replacing <PR NUMBER> with the apple/swift pull request number, if any):
      apple/swift#<PR NUMBER> # use this line only if there is a corresponding apple/swift PR
      @swift-ci please test
      
    • In the apple/swift pull request (if any), trigger CI using the following command (replacing <PR NUMBER> with the apple/swift-experimental-string-processing pull request number):
      apple/swift-experimental-string-processing#<PR NUMBER>
      @swift-ci please test
      
  • Merge when approved.
    • Merge the pull request in apple/swift-experimental-string-processing as a merge commit.
    • Merge the pull request in apple/swift (if any).

Development notes

Compiler integration can be tricky. Use special caution when developing the _RegexParser and _StringProcessing modules.

  • Do not change the names of these modules without due approval from compiler and infrastructure teams.
  • Do not modify the existing ABI (e.g. C API, serialization format) between the regular expression parser and the Swift compiler unless absolutely necessary.
  • Always minimize the number of lockstep integrations, i.e. when apple/swift-experimental-string-processing and apple/swift have to change together. Whenever possible, introduce new API first, migrate Swift compiler onto it, and then deprecate old API. Use versioning if helpful.
  • In _StringProcessing, do not write fully qualified references to symbols in _CUnicode, and always wrap import _CUnicode in a #if canImport(_CUnicode). This is because _CUnicode is built as part of _StringProcessing with CMake.

swift-experimental-string-processing's People

Contributors

amartini51, azoy, benedictst, catfish-man, compnerd, daveewing, etcwilde, finagolfin, glessard, grynspan, hamishknight, harlanhaskins, itingliu, jpsim, kateinoigakukun, kylemacomber, milseman, natecook1000, ole, rctcwyvrn, rintaro, rxwei, shahmishal, slavapestov, stephentyrone, tkremenek, tshortli, uhooi, valeriyvan


swift-experimental-string-processing's Issues

Implement case-folded comparisons as stdlib SPI

Following on the behavior improvement in #383, we should provide case-folded comparisons as stdlib SPI. We should be able to skip UTF-8 encoding and do character-by-character canonicalization much more efficiently that way.
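To make the idea of scalar-by-scalar comparison concrete, here is a rough sketch. It uses `Unicode.Scalar.Properties.lowercaseMapping` as a stand-in for real case folding, which is an assumption: lowercasing and case folding differ for some scalars, and this version allocates intermediate arrays, so it shows the shape of the comparison rather than the efficient SPI the issue asks for. The function names are hypothetical.

```swift
/// Expands each scalar to its lowercase mapping.
/// Assumption: lowercase mapping approximates case folding for this sketch.
func approximatelyFoldedScalars(_ string: String) -> [Unicode.Scalar] {
    var result: [Unicode.Scalar] = []
    for scalar in string.unicodeScalars {
        result.append(contentsOf: scalar.properties.lowercaseMapping.unicodeScalars)
    }
    return result
}

/// Compares two strings scalar by scalar after the approximate folding above.
func approximatelyCaseFoldedEqual(_ lhs: String, _ rhs: String) -> Bool {
    approximatelyFoldedScalars(lhs) == approximatelyFoldedScalars(rhs)
}
```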

_RegexParser fails to build with a 5.6 toolchain

When doing a --bootstrapping=hosttools Swift compiler build with Xcode 13.4, you get the error:

/Users/hamish/src/swift-dev/swift-experimental-string-processing/Sources/_RegexParser/Utility/TypeConstruction.swift:63:64: error: cannot infer return type for closure with multiple statements; add explicit type to disambiguate
    let result = elementTypes.withContiguousStorageIfAvailable { elementTypesBuffer in
                                                               ^
                                                                                    -> <#Result#>

We should ideally allow _RegexParser to compile with a 5.6 toolchain.

Adopt lightweight generics for generic functions

Functions that take a generic argument should adopt lightweight generics and refer to the type directly as the argument type.

For example,

public func contains<C: Collection>(_ other: C) -> Bool

would be

public func contains(_ other: some Collection<Element>) -> Bool

Adoption of standard library types is blocked by apple/swift#41843
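For a self-contained illustration of the spelling, the sketch below uses a hypothetical `hasSubsequence(_:)` method rather than the real `contains` API, and a deliberately naive search. It assumes a toolchain that accepts `some` in parameter position together with primary associated types.

```swift
extension Collection where Element: Equatable {
    /// Naive subsequence search, written with the lightweight spelling.
    /// `hasSubsequence` is a hypothetical name used only for this example.
    func hasSubsequence(_ other: some Collection<Element>) -> Bool {
        if other.isEmpty { return true }
        var start = startIndex
        while start != endIndex {
            // Check whether `other` is a prefix of the remaining elements.
            if self[start...].starts(with: other) { return true }
            formIndex(after: &start)
        }
        return false
    }
}
```

The caller does not see any difference: `[1, 2, 3, 4].hasSubsequence([2, 3])` reads the same as it would with an explicit generic parameter.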

Require unique names for named captures

According to regex101, PCRE2 requires a unique name for each capture, but the current implementation of Regex.init(compiling:) doesn't throw an error when there are duplicate names.
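As a rough illustration of the check, the sketch below scans a pattern string for `(?<name>` groups and reports names that occur more than once. It is a simplification and not the parser's logic: it only knows the `(?<name>...)` spelling, skips escaped characters, and does not understand custom character classes or quoted sequences. `duplicateCaptureNames(in:)` is a hypothetical helper.

```swift
/// Returns capture names that appear more than once in `pattern`.
/// Simplified scan: only the `(?<name>...)` spelling is recognized.
func duplicateCaptureNames(in pattern: String) -> Set<String> {
    var seen = Set<String>()
    var duplicates = Set<String>()
    let chars = Array(pattern)
    var i = 0
    while i < chars.count {
        if chars[i] == "\\" {
            i += 2  // Skip the escaped character.
            continue
        }
        if i + 3 < chars.count, chars[i] == "(", chars[i + 1] == "?", chars[i + 2] == "<",
           chars[i + 3] != "=", chars[i + 3] != "!" {  // Not a lookbehind.
            var j = i + 3
            var name = ""
            while j < chars.count, chars[j] != ">" {
                name.append(chars[j])
                j += 1
            }
            // Only count the name if the closing ">" was found.
            if j < chars.count, !seen.insert(name).inserted {
                duplicates.insert(name)
            }
            i = j
        }
        i += 1
    }
    return duplicates
}
```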

Regex parser should be wary of combining characters

The regex parser is built on Character, which is fine, but it means that some programs could put combining scalars after metacharacters, and the resulting grapheme clusters will not compare equal to the metacharacter. We have a few options:

  • We process scalars instead
  • We error out for any multi-scalar grapheme cluster that starts with a metacharacter scalar

The latter seems simpler and there's an easy (and highly advisable!) fall back path of representing the combining scalar through an escape.
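The second option can be sketched as follows. The metacharacter set here is a hypothetical, abbreviated list and `clustersHidingMetacharacters(in:)` is an illustrative name; the real parser would use its own definitions and report a diagnostic instead of returning an array.

```swift
// Hypothetical, abbreviated set of metacharacter scalars for this sketch.
let metacharacterScalars: Set<Unicode.Scalar> = [
    "\\", "(", ")", "[", "]", "{", "}", "*", "+", "?", "|", ".", "^", "$",
]

/// Returns every multi-scalar grapheme cluster whose first scalar is a metacharacter.
func clustersHidingMetacharacters(in pattern: String) -> [Character] {
    var result: [Character] = []
    for character in pattern {
        let scalars = character.unicodeScalars
        if scalars.count > 1, let first = scalars.first, metacharacterScalars.contains(first) {
            result.append(character)
        }
    }
    return result
}
```

With this check, `"a*\u{0301}b"` is flagged because `*` followed by U+0301 forms a single Character that does not compare equal to `"*"`, whereas `"a*b"` passes.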

`Regex` is only available in macOS 9999 or newer

To try out the functionality provided here, download the latest open source development toolchain. Import _StringProcessing in your source file to get access to the API and specify -Xfrontend -enable-experimental-string-processing to get access to the literals.

Following these steps from the README doesn't seem to be all that is necessary to try out the Regex features with the swift-DEVELOPMENT-SNAPSHOT-2022-04-21-a-osx toolchain.

Fix matching of `\N`

\N currently uses _CharacterClassModel and is defined as an inverted newline sequence. However, this doesn't seem correct, as it shouldn't be affected by the options that change what \R matches. It seems like it ought to use emitAny() instead, as it should be identical to . except that it is not affected by (?s) (i.e. it never matches a newline).

Implement parser warnings

We should start emitting warnings for at least these cases:

  • Unnecessary escaping (in e.g extended literals)
  • Cases where - is treated as literal in a custom character class
  • When ] is treated as literal when it is the first member of a custom character class
  • Certain non-canonical syntax?

Bring pretty printer up to date

The pretty printer converts a Regex defined in syntax into a RegexBuilder-style one, but it isn't up to date with the latest changes. This needs to be brought up to date, ideally with a validator that can take a regex and some input/output data, convert the regex to builder style, and then validate that both regexes have the same behavior.

For example, #"(?P)\w+ \d{2,} -"# converts to

Group(/* TODO: changeMatchingOptions */) {
  Concatenation {
    /* TODO: assertions */
    OneOrMore(.eager) {
      .wordCharacter
    }
    "-"
  }
}

instead of

Regex {
  Anchor.startOfInput
  OneOrMore(.eager) {
    CharacterClass.wordCharacter
  }
  "-"
}.asciiOnlyCharacterClasses()

Regex literals with invalid `\N{...}` names should not compile

UTS18 makes a distinction between providing an invalid name in a Unicode property character class (\p{name=...}) and an individual named character (\N{...}). If a programmer uses an invalid name in a property character class, the expression should compile and that character class should simply not match anything:

"🐯".contains(/\p{name=TIGER FACE}/)     // true
"🐯".contains(/\p{name=TIEGR FACE}/)     // false

However, an invalid name given in a \N{...} named character should be a syntax/compilation error:

"🐯".contains(/\N{TIEGR FACE}/)          // error: Invalid Unicode scalar name

See https://unicode.org/reports/tr18/#Individually_Named_Characters
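The distinction can be sketched outside the regex engine using `Unicode.Scalar.Properties.name`. This is a run-time illustration only: in the proposal the `\N{...}` failure would surface when the regex is compiled, and the standard library has no public name-to-scalar lookup, so the `candidates` parameter and both function names are assumptions made for the example.

```swift
struct InvalidScalarName: Error {
    let name: String
}

/// Property-class behavior: an unknown name simply never matches.
func propertyClassMatches(_ scalar: Unicode.Scalar, name: String) -> Bool {
    scalar.properties.name == name
}

/// Named-character behavior: an unknown name is an error.
/// `candidates` stands in for a full name table, which this sketch does not have.
func resolveNamedScalar(_ name: String, among candidates: [Unicode.Scalar]) throws -> Unicode.Scalar {
    guard let scalar = candidates.first(where: { $0.properties.name == name }) else {
        throw InvalidScalarName(name: name)
    }
    return scalar
}
```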

Better parser recovery

Currently we throw parser errors. While this is very convenient, we need to start recovering where we can, and still produce an AST.

Syntax Status and Roadmap

For the regex literal syntax, we're looking at supporting a syntactic superset of:

  • PCRE2, an "industry standard" of sorts, and a rough superset of Perl, Python, etc.

  • Oniguruma, an internationalization-oriented engine with some modern features

  • ICU, used by NSRegularExpression, a Unicode-focused engine

  • Our interpretation of UTS#18's guidance, which is about semantics, but we can infer syntactic feature sets.

  • TODO: .NET, which has delimiter-balancing and some interesting minor details on conditional patterns

These aren't all strictly compatible (e.g. a set operator in PCRE2 would just be a redundant statement of a set member). We can explore adding strict compatibility modes, but in general the syntactic superset is fairly straightforward.

Status

The below are (roughly) implemented. There may be bugs, but we have some support and some testing coverage:

  • Alternations a|b
  • Capture groups e.g (x), (?:x), (?<name>x)
  • Escaped character sequences e.g \n, \a
  • Unicode scalars e.g \u{...}, \x{...}, \uHHHH
  • Builtin character classes e.g ., \d, \w, \s
  • Custom character classes [...], including binary operators &&, ~~, --
  • Quantifiers x?, x+, x*, x{n,m}
  • Anchors e.g \b, ^, $
  • Quoted sequences \Q ... \E
  • Comments (?#comment)
  • Character properties \p{...}, [:...:]
  • Named characters \N{...}, \N{U+hh}
  • Lookahead and lookbehind e.g (?=), (?!), (*pla:), (?*...), (?<*...), (*napla:...)
  • Script runs e.g (*script_run:...), (*sr:...), (*atomic_script_run:...), (*asr:...)
  • Octal sequences \ddd, \o{...}
  • Backreferences e.g \1, \g2, \g{2}, \k<name>, \k'name', \g{name}, \k{name}, (?P=name)
  • Matching options e.g (?m), (?-i), (?:si), (?^m)
  • Sub-patterns e.g \g<n>, \g'n', (?R), (?1), (?&name), (?P>name)
  • Conditional patterns e.g (?(R)...), (?(n)...), (?(<n>)...), (?('n')...), (?(condition)then|else)
  • PCRE callouts e.g (?C2), (?C"text")
  • PCRE backtracking directives e.g (*ACCEPT), (*SKIP:NAME)
  • [.NET] Balancing group definitions (?<name1-name2>...)
  • [Oniguruma] Recursion level for backreferences e.g \k<n+level>, (?(n+level))
  • [Oniguruma] Extended callout syntax e.g (?{...}), (*name)
    • NOTE: In Perl, (?{...}) has in-line code in it, we could consider the same (for now, we just parse an arbitrary string)
  • [Oniguruma] Absent functions e.g (?~absent)
  • PCRE global matching options e.g (*LIMIT_MATCH=d), (*LF)
  • Extended-mode (?x)/(?xx) syntax allowing for non-semantic whitespace and end-of-line comments abc # comment

Experimental syntax

Additionally, we have (even more experimental) support for some syntactic conveniences, if specified. Note that each of these (except perhaps ranges) may introduce a syntactic incompatibility with existing traditional-syntax regexes. Thus, they are mostly illustrative, showing what happens and where we go as we slide down this "slippery slope".

  • Non-semantic whitespace: /a b c/ === /abc/
  • Modern quotes: /"a.b"/ === /\Qa.b\E/
  • Swift style ranges: /a{2..<10} b{...3}/ === /a{2,9}b{0,3}/
  • Non-captures: /a (_: b) c/ === /a(?:b)c/
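The range equivalence above amounts to a small bounds conversion. The sketch below shows that arithmetic for the two range forms in the example; `traditionalQuantifier(_:)` is a hypothetical helper and not how the parser represents quantifier bounds.

```swift
/// Converts a half-open Swift range such as 2..<10 into traditional {n,m} bounds.
func traditionalQuantifier(_ range: Range<Int>) -> String {
    precondition(!range.isEmpty, "an empty range has no traditional equivalent")
    // The upper bound is exclusive in Swift and inclusive in {n,m}.
    return "{\(range.lowerBound),\(range.upperBound - 1)}"
}

/// Converts a partial range such as ...3 into traditional {0,m} bounds.
func traditionalQuantifier(_ range: PartialRangeThrough<Int>) -> String {
    "{0,\(range.upperBound)}"
}
```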

TBD:

  • Modern named captures: /a (name: b) c/ === /a(?<name>b)c/
  • Modern comments using /* comment */ or // comment instead of (?#. comment)
  • Multi-line expressions
    • Line-terminating comments as // comment
  • Full Swift-lexed comments, string literals as quotes (includes raw and interpolation), etc.
    • Makes sense to add as we suck actual literal lexing through our wormhole in the compiler

Swift's syntactic additions

  • Options for selecting a semantic level
    • X: grapheme cluster semantics
    • O: Unicode scalar semantics
    • b: byte semantics

Source location tracking

Implemented:

  • Location of | in alternation
  • Location of - in [a-f]

TBD:

Integration with the Swift compiler

Initial parser support landed in apple/swift#40595, using the delimiters '/.../', which are lexed in-package.

Implement named backreference support

Named backreferences e.g (?<x>)\k<x> are not currently supported, but it seems like we ought to be able to lookup the capture index from the capture list, and emit it the same as a numbered backreference (i.e \1).
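The lookup described here could look roughly like the following. The capture list is modeled as an array of optional names in capture order, which is an assumption for the sketch; the actual capture list type in the package is richer.

```swift
/// Maps a capture name to the number a numbered backreference would use.
/// Capture numbers start at 1, so the array position is offset by one.
func backreferenceNumber(forName name: String, in captureNames: [String?]) -> Int? {
    guard let position = captureNames.firstIndex(of: name) else { return nil }
    return position + 1
}
```

For a pattern whose captures are (unnamed), `x`, (unnamed), `\k<x>` would then be emitted like `\2`.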

Pitch and Proposal Status

Regex type and overview

  • TODO: Draft/PR/Thread

Presents basic Regex type and gives an overview of how everything fits into the overall story

TODO: Should we pull more into this topic, perhaps either typed captures or more run-time creation aspects?

Run-time Regex construction

  • Pitch thread: Regex Syntax
    • Brief: Syntactic superset of PCRE2, Oniguruma, ICU, UTS#18, etc.
  • TODO: Pull in discussion of initializers, extended syntaxes, and AnyRegexOutput into this thread

Covers the "interior" syntax, extended syntaxes, run-time construction of a regex from a string, and details of AnyRegexOutput.

Regex literals

Covers the choice of regex delimiter and integration into the Swift programming language.

TODO: Should we pull more into this topic? E.g. introducing typed captures here?

Regex builder DSL

Covers the result builder approach and basic API.

String processing algorithms

Proposes a slew of Regex-powered algorithms.

Introduces CustomMatchingRegexComponent, which is a monadic-parser style interface for external parsers to be used as components of a regex.

Unicode for String Processing

TODO: Where is this at @natecook1000?

Thread on Swift Forums

Covers three topics:

  • Proposes literal and DSL API for library-defined character classes, Unicode scripts and properties, and custom character classes.
  • Proposes literal and DSL API for options that affect matching behavior.
  • Defines how Unicode scalar-based classes are extended to grapheme clusters in the different semantic and other matching modes.

TODO:

  • Pitch API for character classes/scripts/etc. (by 2/25)
  • Pitch option API (by 2/25)
  • Verify grapheme-break semantics (by 2/28)

Target date: Proposal ready for review by 3/7

⏳ SE-0348: buildPartialBlock for result builders

(Old) Overview

Introduces our general approach: regex literals and result builders together.

(Old) Pitch: Regular expression literals

(Old) Pitch: Strongly typed regex captures

Presents the basic result builder approach of putting type arity and kind of capture in generic type parameter position.

The pitch proposes that the whole match ("capture 0") also be a generic parameter, pending a deeper look into details concerning variadic generics and/or additional result builder API.

TODO: Probably makes sense to slurp this up into one of the other pitches.

Digit matching behaving as intended?

Reposted from the Swift forums: https://forums.swift.org/t/bad-digit-matching-bugreport-regarding-se-0354-regex-literals/57262/1

Problem: Some digit character groups match number-like grapheme clusters.

// this matches:
try /[1-2]/.wholeMatch(in: "1️⃣")

// still matches:
try /[1-2]/.asciiOnlyDigits().wholeMatch(in: "1️⃣")

// does not match:
try /[12]/.wholeMatch(in: "1️⃣")

The behavior described above seems inconsistent and difficult to predict. Shouldn't [1-2] and [12] be identical? Should they match anything outside of ASCII?

Note: 1️⃣ is U+0031 (ascii digit 1) U+FE0F (VARIATION SELECTOR-16) U+20E3 (COMBINING ENCLOSING KEYCAP)

Same is true for 1︎⃣: U+0031 (ascii digit 1) U+FE0E (VARIATION SELECTOR-15) U+20E3 (COMBINING ENCLOSING KEYCAP)
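The cluster structure described in the note can be checked with plain String API, independent of the regex engine. `scalarValues(of:)` is a helper introduced only for this illustration.

```swift
// The keycap sequence from the note: digit one, VS-16, combining enclosing keycap.
let keycapOne = "1\u{FE0F}\u{20E3}"

/// Lists the scalar values that make up one Character.
func scalarValues(of character: Character) -> [UInt32] {
    character.unicodeScalars.map { $0.value }
}
```

The string is a single Character made of three scalars. It begins with the ASCII digit scalar but does not compare equal to `Character("1")`, which is why a range test on the leading scalar and a set-membership test on the whole Character can disagree.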

rdar://96898279

Diagnostic Improvements

This tracks some improvements we can make to the parser diagnostics we currently emit

  • Add octal fix-it for e.g [\7]
  • For e.g \p{otherLowercase}, redirect users to \p{isAlphabetic}

Engine limiters and checking task cancellation

During proposal reviews, concerns about RDoS and responsiveness came up. We should add engine limiters which can halt execution if some threshold is exceeded, counted either in time or in number of bytecode instructions (possibly proportional to input length). Additionally, we should check Task.checkCancellation() in case the parent task has been cancelled.

  • Basic limiter infrastructure based on engine cycle count
  • Occasionally (say every thousand bytecode instructions executed) check cancellation
  • Testing infrastructure around limiters
  • Limiters based on input length
  • Surface as API

From https://forums.swift.org/t/se-0350-regex-type-and-overview/56530/19:

In the meantime, before that is available, would it be possible to include some calls to Task.checkCancellation() during the evaluation of the match? The use case I'm thinking of for this for evaluating a user-specified Regex in an interactive application. If the match operation is taking too long, it should be possible for the user to cancel this.
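A cycle-count limiter with periodic cancellation checks might be shaped like this. Everything here is a sketch under assumptions: `CycleLimiter` and `EngineHalt` are hypothetical names, and cancellation is injected as a closure so the example stays independent of the concurrency runtime; in the engine that closure would presumably be backed by the task's cancellation state.

```swift
enum EngineHalt: Error, Equatable {
    case cycleLimitExceeded
    case cancelled
}

struct CycleLimiter {
    let maxCycles: Int
    let cancellationCheckInterval: Int
    let isCancelled: () -> Bool
    private(set) var cycles = 0

    init(maxCycles: Int,
         cancellationCheckInterval: Int = 1000,
         isCancelled: @escaping () -> Bool = { false }) {
        self.maxCycles = maxCycles
        self.cancellationCheckInterval = cancellationCheckInterval
        self.isCancelled = isCancelled
    }

    /// Call once per executed instruction.
    mutating func tick() throws {
        cycles += 1
        if cycles > maxCycles { throw EngineHalt.cycleLimitExceeded }
        // Only consult the (potentially costlier) cancellation state occasionally.
        if cycles % cancellationCheckInterval == 0 && isCancelled() {
            throw EngineHalt.cancelled
        }
    }
}
```

A limit proportional to input length would only change how `maxCycles` is chosen, for example as a multiple of the input's count.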

Fails with release build configuration

The string processing package (SwiftPM) fails to build in release mode because the PEG prototype uses @testable import _StringProcessing in order to access the matching engine. It’s not urgent to fix it since we don't ever release using SwiftPM, but I think we should either move PEG to a test target or make the matching engine symbols SPI for PEG to access.

Support obtaining captures by name on `AnyRegexOutput`

extension Regex.Match where Output == AnyRegexOutput {
    public subscript(_ name: String) -> AnyRegexOutput.Element { get }
}

extension AnyRegexOutput {
    public subscript(_ name: String) -> AnyRegexOutput.Element { get }
}

`\N{name}` needs to use fuzzy matching

The name in a \N{...} block should use fuzzy matching instead of exact equality.
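One way to picture loose name comparison is to reduce both names to a key before comparing. The sketch below is a simplification of the Unicode loose-matching rule for character names (UAX44-LM2): it ignores case, spaces, underscores, and all hyphens, whereas the actual rule only ignores medial hyphens and has a special case this sketch does not handle. `looseNameKey(_:)` is a hypothetical helper.

```swift
/// Reduces a character name to a key for loose comparison.
/// Simplified: drops spaces, underscores, and every hyphen, then lowercases.
func looseNameKey(_ name: String) -> String {
    let ignored: Set<Character> = [" ", "_", "-"]
    return name.filter { !ignored.contains($0) }.lowercased()
}
```

Under this key, `ZERO WIDTH JOINER`, `ZeroWidthJoiner`, and `zero_width_joiner` all compare equal.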

"👩‍👩‍👧‍👦".contains(/(?u).\N{ZERO WIDTH JOINER}/)   // true
"👩‍👩‍👧‍👦".contains(/(?u).\N{ZeroWidthJoiner}/)     // false, should be true

Reject quantifiers on zero-width assertions

A repeated assertion, like \b+ causes an infinite loop, since it checks the same position over and over without advancing in the input. We should reject regexes with this pattern and prevent RegexBuilder regexes from creating them.
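A validation pass for this could be sketched over a toy AST. The `Pattern` enum below is hypothetical and far smaller than the package's real AST, and the check only covers quantified nodes that are purely zero-width; quantified patterns that merely can match the empty string, such as (a*)*, are a separate problem it does not address.

```swift
// Toy AST for illustration only.
indirect enum Pattern {
    case literal(Character)
    case assertion(String)        // e.g. "\\b", "^", "$"
    case concatenation([Pattern])
    case quantified(Pattern)      // stands in for x+, x*, x{n,m}, ...
}

/// True when the pattern can never consume input.
func isZeroWidth(_ pattern: Pattern) -> Bool {
    switch pattern {
    case .literal: return false
    case .assertion: return true
    case .concatenation(let children): return children.allSatisfy(isZeroWidth)
    case .quantified(let child): return isZeroWidth(child)
    }
}

/// True when some quantifier is applied to a zero-width pattern.
func containsQuantifiedAssertion(_ pattern: Pattern) -> Bool {
    switch pattern {
    case .literal, .assertion:
        return false
    case .concatenation(let children):
        return children.contains(where: containsQuantifiedAssertion)
    case .quantified(let child):
        return isZeroWidth(child) || containsQuantifiedAssertion(child)
    }
}
```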

Update integration doc on README

The integration section in the README needs an update.

  • Remove _CUnicode module as it no longer exists.
  • Add RegexBuilder as another "specially integrated module".

AST post-processing

This issue tracks logic that needs to be implemented after the parser has produced an AST.

To be implemented:

  • Group reference validity checking (as group references may refer to groups that come after them)
    • This will require deciding how to handle the ambiguity with the syntax (?(xxx)). PCRE always treats this as a named reference. .NET only treats it as a named reference if there is a group defined with that name, otherwise it treats it as an arbitrary regex condition. It's possible we may want to require users explicitly spell named references (?('xxx')) to avoid this ambiguity.
  • Errors for AST nodes that are unsupported by the matching engine
  • Warnings for syntax that has a better spelling

Adopt swift-atomics as a dependency

If we take the approach in #457 of subsetting a sliver of swift-atomics, we should switch to rely on the actual package when we're able to do so. (i.e. when that doesn't pose problems for integration into the compiler, etc.)

MatchingEngine Capabilities and Roadmap

Details

TODO

Regex feature status

The matching engine supports (modulo bugs) the following

  • Basic constructs: concatenation, alternation, non-capturing grouping, etc.
  • Literal constructs: scalar literals, quotes
  • Character classes, custom and built-in
    • Including ranges, nested custom character classes, some set operations (e.g. subtraction), inversion, etc
  • Quantification
    • eager, reluctant, possessive x* x*? x*+
    • bounded and unbounded x+ x{n,m} x{n,} x{,m} x{n} (and kind variants)
  • Character properties: named characters, general category, most UCD properties
  • Assertions: built-in and custom
    • Including anchors such as $, ^
    • Including custom look-ahead assertions
  • Arbitrary consumer call-outs (Input, Range<Input.Index>) -> Input.Index?
    • This is the basic extension point for library-driven pattern matching
    • This is how character classes are currently implemented
  • Arbitrary assertion call-outs (Input, Input.Index, Range<Input.Index>) -> Bool
    • This is how anchors are currently implemented
    • Note the provided bounds, as assertions often deal with boundary conditions
  • Arbitrary value-producing callouts (CustomRegexComponent)
  • A function call stack for recursive PEG-style grammars
    • Note: Backtracking properly manages this by restoring stack position
    • Note: This is only very very lightly tested (mostly for PEGs)
    • TODO: Hook up to (?R), etc.
  • Captures
  • Backreferences
  • Scripts and some missing Unicode scalar properties
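The consumer and assertion call-out shapes listed above can be illustrated with `String` as the input type. The typealiases mirror the signatures in the list; the concrete functions and the `matchPrefix(of:with:)` driver are hypothetical and only show how such call-outs compose.

```swift
typealias Consumer = (String, Range<String.Index>) -> String.Index?
typealias Assertion = (String, String.Index, Range<String.Index>) -> Bool

/// Consumes one ASCII digit at the start of `bounds`, or fails with nil.
func consumeASCIIDigit(_ input: String, _ bounds: Range<String.Index>) -> String.Index? {
    guard !bounds.isEmpty,
          let ascii = input[bounds.lowerBound].asciiValue,
          (48...57).contains(ascii) else { return nil }
    return input.index(after: bounds.lowerBound)
}

/// Succeeds only at the end of the provided bounds (an end anchor, roughly).
func atEndOfBounds(_ input: String, _ position: String.Index, _ bounds: Range<String.Index>) -> Bool {
    position == bounds.upperBound
}

/// Applies a consumer repeatedly from the start and returns what it consumed.
func matchPrefix(of input: String, with consumer: Consumer) -> Substring {
    var position = input.startIndex
    while let next = consumer(input, position..<input.endIndex) {
        position = next
    }
    return input[..<position]
}
```

The bounds parameter matters for assertions in particular: an end anchor compares against the upper bound of the searched range, not necessarily the end of the whole string.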

The following has some corner-case known bugs in it

  • Backtracking to completely different function call stack
    • Currently we restore a stack index, but it's not clear if we need to restore entire stack

The following are currently unsupported

  • All Unicode scalar properties
  • Atomic grouping
    • Will need to figure out how best to play with the rest of the stdlib here
  • Custom look-behind assertions
  • Script runs and PCRE-style call outs
    • We have engine support, but it isn't hooked up to syntax
    • We will likely want something strongly-typed and better checked
  • Matching options
    • Things like case-insensitivity, semantic mode switching, etc
  • Subpatterns
  • Conditional patterns
    • (awaiting parser support)
  • Oniguruma style absent functions
  • Keep/reset (\K)

The following is undetermined

  • Grapheme-semantic mode switching and behavior/design
  • Options, especially controlling backtracking
  • Provisioning for the interpreter
  • How best to do word-boundary analysis

Performance

TODO

Regex DSL Status

This tracks the status and progress of built-in result builder DSL APIs. It doesn't necessarily reflect other related API, the protocols used for library extension, or API details beyond result builders such as options and custom character classes.

See #63 and #132

Current Status

  • Concatenation
  • Alternation
  • Quantification
    • Shorthand operators (.?, .*, .+)
  • Conditional
  • Backreferences
  • Anchors
  • Lookaround assertions
  • Built-in character classes
  • Named subpatterns and subpattern invocations
  • Pattern inversions
  • Options and option scopes
  • Redundantly named captures, branch-reset, etc

Needed

  • Conditional patterns
  • Pattern inversions
  • Options and option scopes
  • Redundantly named captures, branch-reset, etc
  • Built-in properties
  • Balancing groups
  • Recursion level for backreferences
  • Named captures
  • TBD

Type system concerns / impact

TBD

Structural and flat captures

TBD

Near-Future Work

I want to gather up many areas of near-future work that we've been clarifying through the proposal reviews.

Loose categorization:

Language and integration

  • Ability to use a String-backed, CaseIterable enum as a regex component
  • Define error types for compilation and type mismatches
  • Callouts from literals
  • A Regex-backed enum that will construct a ChoiceOf all cases in order

API

  • Ability to map over a regex, perhaps per-capture, to supply post-processing transforms at regex declaration time
  • A modifier on a regex to convert it to matches-anywhere semantics
    • E.g. regex.matchingAnywhere => Regex { /.*?/ ; regex ; /.*/ }.
    • But we'd preserve the matched range, i.e. reset start/end position
  • Character alignment queries
    • API for whether start/end is Character-aligned for whole match and each capture
  • API to query options (e.g. is this case insensitive?)
  • API for (?n), could be nice to strip out captures you don't care about, especially for type erased regexes.
    • Compilation error if there are back-references or if it changes the semantics of the program

Algorithms

  • Add a replace(_:withTemplate:) method that recognizes $1 or \1 placeholders
  • A separator-preserving split variant
  • Suffix / from-the-end operations (trim etc)
  • Customize search
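For the template-based replace idea, the placeholder expansion step might look like the sketch below. It is not the proposed `replace(_:withTemplate:)` API: `expand(template:captures:)` is a hypothetical helper that handles only single-digit `$n` and `\n` placeholders, treats `captures[0]` as the whole match, and leaves out-of-range placeholders untouched.

```swift
/// Substitutes $n and \n placeholders (single digit) with captured text.
func expand(template: String, captures: [String]) -> String {
    let chars = Array(template)
    var result = ""
    var i = 0
    while i < chars.count {
        let isMarker = chars[i] == "$" || chars[i] == "\\"
        if isMarker, i + 1 < chars.count, chars[i + 1].isASCII,
           let number = chars[i + 1].wholeNumberValue, number < captures.count {
            result += captures[number]
            i += 2
        } else {
            // Not a valid placeholder: copy the character through unchanged.
            result.append(chars[i])
            i += 1
        }
    }
    return result
}
```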

String and Unicode

  • Add unsupported Unicode properties to Unicode.Properties and support in regexes
  • Add Unicode.AllScalars as a public type (semi-tangential)
  • Add var Substring.range: Range<String.Index> to simplify getting the range of a capture group
  • Inits for making a NFC string from UTF-8
  • String.lines() and String.words()
  • Add option for canonical equivalence in scalar-semantic mode

Dynamic Regex API

  • Add a capture-description API to all regexes
    • some RAC of capture, which has a type and optionality
  • Missing match conversions
    • Regex<T>.Match.init?(_:ARO)
    • Regex<T>.Match.init?(_:Regex<ARO>.Match)

Builders

  • A high-level helper for separated/quoted repetitions, e.g Repeat(separator: \.whitespace) { ... }
  • A helper for repeated matching lookahead and negative lookahead, e.g. Repeat(while:) Repeat(whileNot:)
    • Until(negLookaheadCondition) { ... }
  • A func compile() throws to explicitly trigger compilation and get errors, such as quantifying the unquantifiable
    • This is useful when composing regexes together to check the final result instead of trapping at run time.
  • Default Reference capture type to Substring.self

Engine

  • Engine limiters, low-level backtracking control and timeouts
  • Provide a way to access all values of a repeated capture (e.g. subscribe)
  • Conditionals (?(x)...) (requires updated parsing)
  • Quoted string inside custom character classes (e.g. [a-z\q{ch}])

Parser

  • Support for duplicate group names through (?J) (requires figuring out typed captures)
  • Support for branch reset alternations (?|) (parsing is implemented, but requires figuring out typed captures)
  • Parsing of conditionals (?(x)...) in accordance to what is in the syntax proposal (we currently parse the condition differently)
    • Including interpolation conditions (?(?{...}))
    • Conditional conditions don't capture on their own, only for child nodes e.g (?((x))x). .NET also forbids named capture conditions, we should ban that.
    • Stop parsing named reference conditions for (?(x)...)
    • Don't allow (?(DEFINE)) to have a false branch
  • Support for regex property values \p{key=/regex/}
  • Support for transform matching e.g \p{toNFKC_Casefold=@toNFKC@}
  • Support for alternative character property separators?
    • UTS#18 suggests key≠value, key!=value
    • Perl allows key:value
  • Support a** syntax as explicitly eager quantification
    • I.e. it's not affected by API to change default quantification kind, (probably) not affected by (?U)

Substring matches base?

From the forums

    let regex = Regex { OneOrMore(.any) }
    print("abc".wholeMatch(of: regex)!.0) // prints "abc", as expected
    print("abc".suffix(1).wholeMatch(of: regex)!.0) // also prints "abc"

API design: Custom abort errors and API threading

For our extension points, including tryCapture, a return of nil signals a local matching failure and backtrack. A thrown error aborts. How should we surface thrown errors, and are they worth threading through our API?
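The two outcomes can be modeled with a small driver that tries alternatives in order. This is only a sketch of the semantics described above: `firstValue(for:trying:)` and `AbortMatching` are hypothetical names, and real backtracking involves far more state than moving on to the next closure.

```swift
struct AbortMatching: Error {}

/// Tries each alternative in order.
/// A nil result is a local failure (move on); a thrown error aborts the whole attempt.
func firstValue<T>(for input: Substring,
                   trying alternatives: [(Substring) throws -> T?]) throws -> T? {
    for alternative in alternatives {
        if let value = try alternative(input) {
            return value
        }
    }
    return nil
}
```

Whether callers should see the thrown error directly, or whether it should be wrapped or swallowed at the API boundary, is exactly the open design question.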

Failing ParseableInterface test

_MatchingEngine is failing the ParseableInterface test in the Swift repo.

/Volumes/Media/Development/Swift/swift-source/build/Ninja-ReleaseAssert/swift-macosx-x86_64/lib/swift/macosx/_MatchingEngine.swiftmodule/x86_64-apple-macos.swiftinterface:1460:1: error: type 'TypedIndex<C, 👻>' does not conform to protocol 'RangeReplaceableCollection'
extension _MatchingEngine.TypedIndex : Swift.RangeReplaceableCollection where C : Swift.RangeReplaceableCollection {
^
/Volumes/Media/Development/Swift/swift-source/build/Ninja-ReleaseAssert/swift-macosx-x86_64/lib/swift/macosx/_MatchingEngine.swiftmodule/x86_64-apple-macos.swiftinterface:1460:1: error: unavailable instance method 'replaceSubrange(_:with:)' was used to satisfy a requirement of protocol 'RangeReplaceableCollection'
extension _MatchingEngine.TypedIndex : Swift.RangeReplaceableCollection where C : Swift.RangeReplaceableCollection {
^
Swift.RangeReplaceableCollection:4:26: note: 'replaceSubrange(_:with:)' declared here
    public mutating func replaceSubrange<C>(_ subrange: Range<Self.Index>, with newElements: C) where C : Collection, Self.Element == C.Element
                         ^
Swift.RangeReplaceableCollection:4:19: note: requirement 'replaceSubrange(_:with:)' declared here
    mutating func replaceSubrange<C>(_ subrange: Range<Self.Index>, with newElements: __owned C) where C : Collection, Self.Element == C.Element
                  ^
/Volumes/Media/Development/Swift/swift-source/build/Ninja-ReleaseAssert/swift-macosx-x86_64/lib/swift/macosx/_MatchingEngine.swiftmodule/x86_64-apple-macos.swiftinterface:1464:14: error: no exact matches in call to instance method 'replaceSubrange'
    rawValue.replaceSubrange(rawRange, with: newElements)
             ^
Swift.RangeReplaceableCollection:4:19: note: candidate requires that the types 'C.Element' and 'C.Element' be equivalent (requirement specified as 'Self.Element' == 'C.Element')
    mutating func replaceSubrange<C>(_ subrange: Range<Self.Index>, with newElements: __owned C) where C : Collection, Self.Element == C.Element
                  ^
Swift.RangeReplaceableCollection:2:37: note: candidate requires that the types 'C.Element' and 'C.Element' be equivalent (requirement specified as 'Self.Element' == 'C.Element')
    @inlinable public mutating func replaceSubrange<C, R>(_ subrange: R, with newElements: __owned C) where C : Collection, R : RangeExpression, Self.Element == C.Element, Self.Index == R.Bound
                                    ^
/Volumes/Media/Development/Swift/swift-source/build/Ninja-ReleaseAssert/swift-macosx-x86_64/lib/swift/macosx/_MatchingEngine.swiftmodule/x86_64-apple-macos.swiftinterface:1:1: error: failed to build module '_MatchingEngine' for importation due to the errors above; the textual interface may be broken by project issues or a compiler bug
// swift-interface-format-version: 1.0
^

--

********************
********************
Failed Tests (1):
  Swift-validation(macosx-x86_64) :: ParseableInterface/verify_all_overlays.py

Build warning "'matchLevel' is deprecated"

This warning has been around for some time in both the package and the compiler builds. It would be good to clear it.

Sources/_StringProcessing/ConsumerInterface.swift:139:12: warning build: 'matchLevel' is deprecated

`Regex.Match` element accessors should not materialize the whole output

Currently, Regex.Match element accessors are implemented this way:

  /// Look up a capture by name or number
  public subscript<T>(dynamicMember keyPath: KeyPath<Output, T>) -> T {
    output[keyPath: keyPath]
  }

  // Allows `.0` when `Match` is not a tuple.
  @_disfavoredOverload
  public subscript(
    dynamicMember keyPath: KeyPath<(Output, _doNotUse: ()), Output>
  ) -> Output {
    output
  }

This is not correct: accessing a single element should not materialize the entire output.
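A minimal sketch of the alternative, assuming a match stores capture ranges and builds each element only when it is asked for. LazyMatch and its members are hypothetical stand-ins for illustration, not the actual Regex.Match implementation.

```swift
// Hypothetical stand-in for Regex.Match: stores capture ranges, not values.
struct LazyMatch {
  let input: String
  // Element 0 is the whole match; nil marks a capture that did not participate.
  let ranges: [Range<String.Index>?]

  // Materializes only the requested element, on demand.
  subscript(position: Int) -> Substring? {
    ranges[position].map { input[$0] }
  }
}

let input = "key=value"
let equals = input.firstIndex(of: "=")!
let match = LazyMatch(
  input: input,
  ranges: [
    input.startIndex..<input.endIndex,            // whole match
    input.startIndex..<equals,                    // capture 1
    input.index(after: equals)..<input.endIndex,  // capture 2
  ])

print(match[1]!)  // prints "key"
print(match[2]!)  // prints "value"
```

Reading match[1] here slices only that one range; the other elements are never constructed.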
