apple / swift-experimental-string-processing Goto Github PK
View Code? Open in Web Editor NEWAn early experimental general-purpose pattern matching engine for Swift.
License: Apache License 2.0
An early experimental general-purpose pattern matching engine for Swift.
License: Apache License 2.0
The string processing package (SwiftPM) fails to build in release mode because the PEG prototype uses @testable import _StringProcessing
in order to access the matching engine. Itβs not urgent to fix it since we don't ever release using SwiftPM, but i think we should either move PEG to a test target or make the matching engine symbols be SPI for PEG to access.
The pretty printer converts a Regex
defined in syntax into a RegexBuilder
-style one, but it isn't up to date with the latest changes. This needs to be brought up to date, ideally with a validator that can take a regex and some input/output data, convert the regex to builder style, and then validate that both regexes have the same behavior.
For example, #"(?P)\w+ \d{2,} -"#
converts to
Group(/* TODO: changeMatchingOptions */) {
Concatenation {
/* TODO: assertions */
OneOrMore(.eager) {
.wordCharacter
}
"-"
}
}
instead of
Regex {
Anchor.startOfInput
OneOrMore(.eager) {
CharacterClass.wordCharacter
}
"-"
}.asciiOnlyCharacterClasses()
TODO
The matching engine supports (modulo bugs) the following
x*
x*?
x*+
x+
x{n,m}
x{n,}
x{,m}
x{n}
(and kind variants)$
, ^
(Input, Range<Input.Index>) -> Input.Index?
(Input, Input.Index, Range<Input.Index>) -> Bool
CustomRegexComponent
)(?R)
, etc.The following has some corner-case known bugs in it
The following are currently unsupported
\K
)The following is undetermined
TODO
Currently we throw
parser errors. While this is very convenient, we need to start recovering where we can, and still produce an AST.
Collection/string algorithms should include an overload of each one to take a @RegexComponentBuilder
extension RegexComponent {
public var captureCount: Int { get }
}
We currently don't parse end-of-line comments as comments in a character class such as:
let r = #/
[
a # comment
]
/#
However ICU supports this.
split
methods should have maxSplits
and omittingEmptySubsequences
configurations to match their stdlib String
counterpart:
split(separator:maxSplits:omittingEmptySubsequences:)
split(maxSplits:omittingEmptySubsequences:whereSeparator:)
https://developer.apple.com/documentation/swift/string/2894564-split
Throws a fatal error with FIXME: Not implemented
I'm raising this issue as request to not drop this functionality (however unlikely that is).
It will have its uses!
For performance, see what we can remove from implementations.
When doing a --bootstrapping=hosttools
Swift compiler build with Xcode 13.4, you get the error:
/Users/hamish/src/swift-dev/swift-experimental-string-processing/Sources/_RegexParser/Utility/TypeConstruction.swift:63:64: error: cannot infer return type for closure with multiple statements; add explicit type to disambiguate
let result = elementTypes.withContiguousStorageIfAvailable { elementTypesBuffer in
^
-> <#Result#>
We should ideally allow _RegexParser to compile with a 5.6 toolchain.
From the forums
let regex = Regex { OneOrMore(.any) }
print("abc".wholeMatch(of: regex)!.0) // prints "abc", as expected
print("abc".suffix(1).wholeMatch(of: regex)!.0) // also prints "abc"
A repeated assertion, like \b+
causes an infinite loop, since it checks the same position over and over without advancing in the input. We should reject regexes with this pattern and prevent RegexBuilder
regexes from creating them.
This issue tracks logic that needs to be implemented after the parser has produced an AST.
To be implemented:
(?(xxx))
. PCRE always treats this as a named reference. .NET only treats it as a named reference if there is a group defined with that name, otherwise it treats it as an arbitrary regex condition. It's possible we may want to require users explicitly spell named references (?('xxx'))
to avoid this ambiguity.https://gist.github.com/rjmccall/28fa7ff66402fddde4114970201f7da5
Reproducible with an asan build. (Thanks @rjmccall!)
buffer.storeBytes(
of: Self.currentSerializationVersion, as: SerializationVersion.self)
The following traps in Processor.cycle()
, due to the unescaped space inside the custom character class:
try! Regex(compiling: #"(?xx)[ \t]+"#).matchWhole(" \t ")
Removing the space ([\t]
) or escaping it ([\ \t]
) resolves the crash.
We currently nest optionals for regex literal captures, we should flatten them out to match the regex literal proposal.
PCRE allows (?i)[B-a]
, which matches e.g B
, [
, b
, and A
. However, we crash at runtime.
rdar://96898279
This tracks some improvements we can make to the parser diagnostics we currently emit
[\7]
\p{otherLowercase}
, redirect users to \p{isAlphabetic}
We should start emitting warnings for at least these cases:
-
is treated as literal in a custom character class]
is treated as literal when it is the first member of a custom character classDuring proposal reviews, concerns of RDoS and responsivity came up. We should add engine limiters which can halt execution if some threshold is exceeded, either counted in time or number of byte code instructions (possibly proportional to input length). Additionally, we should check Task.checkCancellation()
in case the parent task has been cancelled.
From https://forums.swift.org/t/se-0350-regex-type-and-overview/56530/19:
In the meantime, before that is available, would it be possible to include some calls to
Task.checkCancellation()
during the evaluation of the match? The use case I'm thinking of for this for evaluating a user-specified Regex in an interactive application. If the match operation is taking too long, it should be possible for the user to cancel this.
The name in a \N{...}
block should use fuzzy matching instead of exact equality.
"π©βπ©βπ§βπ¦".contains(/(?u).\N{ZERO WIDTH JOINER}/) // true
"π©βπ©βπ§βπ¦".contains(/(?u).\N{ZeroWidthJoiner}/) // false, should be true
If we take the approach in #457 of subsetting a sliver of swift-atomics, we should switch to rely on the actual package when we're able to do so. (i.e. when that doesn't pose problems for integration into the compiler, etc.)
\N
currently uses _CharacterClassModel
and is defined as an inverted newline sequence. However this doesn't seem correct as it shouldn't be affected by the options that change what \R
matches. It seems like it ought to use emitAny()
instead, as it should be identical to .
except not being affected by (?s)
(i.e never matching a newline).
To try out the functionality provided here, download the latest open source development toolchain. Import _StringProcessing in your source file to get access to the API and specify -Xfrontend -enable-experimental-string-processing to get access to the literals.
Following these steps from the README doesn't seem to be all that is necessary to try out the Regex features with the swift-DEVELOPMENT-SNAPSHOT-2022-04-21-a-osx
toolchain.
Does "1".matches(of: /(|\d)/)
match twice or three times?
This warning has been around for some while in both the package and the compiler builds. Would be good to clear this.
Sources/_StringProcessing/ConsumerInterface.swift:139:12: warning build: 'matchLevel' is deprecated
Following on the behavior improvement in #383, we should provide case-folded comparisons as stdlib SPI. We should be able to skip UTF-8 encoding and do character-by-character canonicalization much more efficiently that way.
Presents basic Regex type and gives an overview of how everything fits into the overall story
TODO: Should we pull more into this topic, perhaps either typed captures or more run-time creation aspects?
AnyRegexOutput
into this threadCovers the "interior" syntax, extended syntaxes, run-time construction of a regex from a string, and details of AnyRegexOutput
.
Covers the choice of regex delimiter and integration into the Swift programming language.
TODO: Should we pull more into this topic? E.g. introducing typed captures here?
Covers the result builder approach and basic API.
Proposes a slew of Regex-powered algorithms.
Introduces CustomMatchingRegexComponent
, which is a monadic-parser style interface for external parsers to be used as components of a regex.
TODO: Where is this at @natecook1000?
Covers three topics:
TODO:
Target date: Proposal ready for review by 3/7
buildPartialBlock
for result buildersIntroduces our general approach: regex literals and result builders together.
Presents the basic result builder approach of putting type arity and kind of capture in generic type parameter position.
The pitch proposes that the whole match ("capture 0") also be a generic parameter, pending a deeper look into details concerning variadic generics and/or additional result builder API.
TODO: Probably makes sense to slurp this up into one of the other pitches.
How are we handling this via DSL? How does it impact replacement?
The following doesn't parse correctly:
(?x)[ a - c ]
TBD
The regex parser is based over Character
, which is fine, but means that some programs could put combining scalars following meta characters and those will not compare equal. We have a few options:
The latter seems simpler and there's an easy (and highly advisable!) fall back path of representing the combining scalar through an escape.
Right now we will error out during regex-compilation for unsupported features or combinations, but the parser should also surface these errors for swift-compilation time.
UTS18 makes a distinction between providing an invalid name in a Unicode property character class (\p{name=...}
) and an individual named character (\N{...}
). If a programmer uses an invalid name in a property character class, the expression should compile and that character class should simply not match anything:
"π―".contains(/\p{name=TIGER FACE}/) // true
"π―".contains(/\p{name=TIEGR FACE}/) // false
However, an invalid name given in a \N{...}
named character should be a syntax/compilation error:
"π―".contains(/\N{TIEGR FACE}/) // error: Invalid Unicode scalar name
See https://unicode.org/reports/tr18/#Individually_Named_Characters
These character properties are described in the regex syntax proposal but are not implemented:
...
javaLowerCase
,javaUpperCase
,javaWhitespace
,javaMirrored
.
Named backreferences e.g (?<x>)\k<x>
are not currently supported, but it seems like we ought to be able to lookup the capture index from the capture list, and emit it the same as a numbered backreference (i.e \1
).
extension Regex.Match where Output == AnyRegexOutput {
public subscript(_ name: String) -> AnyRegexOutput.Element { get }
}
extension AnyRegexOutput {
public subscript(_ name: String) -> AnyRegexOutput.Element { get }
}
This tracks the status and progress of built-in result builder DSL APIs. It doesn't necessarily reflect other related API, the protocols used for library extension, or API details beyond result builders such as options and custom character classes.
.?
, .*
, .+
)TBD
TBD
We currently reject (?#)
, but it should be allowed.
For our extension points, including tryCapture, a return of nil
signals a local matching failure and backtrack. A throw
n error aborts. How should we surface thrown errors, and are they worth threading through our API?
The integration section in the README needs an update.
According to regex101, PCRE2 requires a unique name for each capture, but the current implementation of Regex.init(compiling:)
doesn't throw an error when there's duplicate names.
Reposted from the Swift forums: https://forums.swift.org/t/bad-digit-matching-bugreport-regarding-se-0354-regex-literals/57262/1
Problem: Some digit character groups match number-like grapheme clusters.
// this matches:
try /[1-2]/.wholeMatch(in: "1οΈβ£")
// still matches:
try /[1-2]/.asciiOnlyDigits().wholeMatch(in: "1οΈβ£")
// does not match:
try /[12]/.wholeMatch(in: "1οΈβ£")
Above described behavior seems inconsistent and difficult to predict. Shouldn't [1-2] and [12] be identical? Should they match anything outside of ascii?
Note: 1οΈβ£ is U+0031 (ascii digit 1) U+FE0F (VARIATION SELECTOR-16) U+20E3 (COMBINING ENCLOSING KEYCAP)
Same is true for 1οΈβ£: U+0031 (ascii digit 1) U+FE0E (VARIATION SELECTOR-15) U+20E3 (COMBINING ENCLOSING KEYCAP)
rdar://96898279
I want to gather up many areas of near-future work that we've been clarifying through the proposal reviews.
Loose categorization:
String
-backed, CaseIterable enum as a regex componentChoiceOf
all cases in ordermap
over a regex, perhaps per-capture, to supply post-processing transforms at regex declaration timeregex.matchingAnywhere => Regex { /.*?/ ; regex ; /.*/ }
.Character
-aligned for whole match and each capture(?n)
, could be nice to strip out captures you don't care about, especially for type erased regexes.
replace(_:withTemplate:)
method that recognizes $1
or \1
placeholdersUnicode.Properties
and support in regexesUnicode.AllScalars
as a public type (semi-tangential)var Substring.range: Range<String.Index>
to simplify getting the range of a capture groupString.lines()
and String.words()
Regex<T>.Match.init?(_:ARO)
Regex<T>.Match.init?(_:Regex<ARO>.Match)
Repeat(separator: \.whitespace) { ... }
Repeat(while:)
Repeat(whileNot:)
Until(negLookaheadCondition) { ... }
func compile() throws
to explicitly trigger compilation and get errors, such as quantifying the unquantifiable
Reference
capture type to Substring.self
(?(x)...)
(requires updated parsing)[a-z\q{ch}]
)(?J)
(requires figuring out typed captures)(?|)
(parsing is implemented, but requires figuring out typed captures)(?(x)...)
in accordance to what is in the syntax proposal (we currently parse the condition differently)
(?(?{...}))
(?((x))x)
. .NET also forbids named capture conditions, we should ban that.(?(x)...)
(?(DEFINE))
to have a false branch\p{key=/regex/}
\p{toNFKC_Casefold=@toNFKC@}
keyβ value
, key!=value
key:value
a**
syntax as explicitly eager quantification
(?U)
_MatchingEngine is failing the ParseableInterface test in the Swift repo.
/Volumes/Media/Development/Swift/swift-source/build/Ninja-ReleaseAssert/swift-macosx-x86_64/lib/swift/macosx/_MatchingEngine.swiftmodule/x86_64-apple-macos.swiftinterface:1460:1: error: type 'TypedIndex<C, π»>' does not conform to protocol 'RangeReplaceableCollection'
extension _MatchingEngine.TypedIndex : Swift.RangeReplaceableCollection where C : Swift.RangeReplaceableCollection {
^
/Volumes/Media/Development/Swift/swift-source/build/Ninja-ReleaseAssert/swift-macosx-x86_64/lib/swift/macosx/_MatchingEngine.swiftmodule/x86_64-apple-macos.swiftinterface:1460:1: error: unavailable instance method 'replaceSubrange(_:with:)' was used to satisfy a requirement of protocol 'RangeReplaceableCollection'
extension _MatchingEngine.TypedIndex : Swift.RangeReplaceableCollection where C : Swift.RangeReplaceableCollection {
^
Swift.RangeReplaceableCollection:4:26: note: 'replaceSubrange(_:with:)' declared here
public mutating func replaceSubrange<C>(_ subrange: Range<Self.Index>, with newElements: C) where C : Collection, Self.Element == C.Element
^
Swift.RangeReplaceableCollection:4:19: note: requirement 'replaceSubrange(_:with:)' declared here
mutating func replaceSubrange<C>(_ subrange: Range<Self.Index>, with newElements: __owned C) where C : Collection, Self.Element == C.Element
^
/Volumes/Media/Development/Swift/swift-source/build/Ninja-ReleaseAssert/swift-macosx-x86_64/lib/swift/macosx/_MatchingEngine.swiftmodule/x86_64-apple-macos.swiftinterface:1464:14: error: no exact matches in call to instance method 'replaceSubrange'
rawValue.replaceSubrange(rawRange, with: newElements)
^
Swift.RangeReplaceableCollection:4:19: note: candidate requires that the types 'C.Element' and 'C.Element' be equivalent (requirement specified as 'Self.Element' == 'C.Element')
mutating func replaceSubrange<C>(_ subrange: Range<Self.Index>, with newElements: __owned C) where C : Collection, Self.Element == C.Element
^
Swift.RangeReplaceableCollection:2:37: note: candidate requires that the types 'C.Element' and 'C.Element' be equivalent (requirement specified as 'Self.Element' == 'C.Element')
@inlinable public mutating func replaceSubrange<C, R>(_ subrange: R, with newElements: __owned C) where C : Collection, R : RangeExpression, Self.Element == C.Element, Self.Index == R.Bound
^
/Volumes/Media/Development/Swift/swift-source/build/Ninja-ReleaseAssert/swift-macosx-x86_64/lib/swift/macosx/_MatchingEngine.swiftmodule/x86_64-apple-macos.swiftinterface:1:1: error: failed to build module '_MatchingEngine' for importation due to the errors above; the textual interface may be broken by project issues or a compiler bug
// swift-interface-format-version: 1.0
^
--
********************
********************
Failed Tests (1):
Swift-validation(macosx-x86_64) :: ParseableInterface/verify_all_overlays.py
Currently Regex.Match
element accessors are implemented this way:
/// Lookup a capture by name or number
public subscript<T>(dynamicMember keyPath: KeyPath<Output, T>) -> T {
output[keyPath: keyPath]
}
// Allows `.0` when `Match` is not a tuple.
@_disfavoredOverload
public subscript(
dynamicMember keyPath: KeyPath<(Output, _doNotUse: ()), Output>
) -> Output {
output
}
This is not correct as we should not materialize the entire output.
Functions that take a generic argument should adopt light weight generics and refer to the type directly as the argument type.
For example,
public func contains<C: Collection>(_ other: C) -> Bool
would be
public func contains(_ other: some Collection<Element>) -> Bool
Adoption of standard library types is blocked by swiftlang/swift#41843
For the regex literal syntax, we're looking at supporting a syntactic superset of:
PCRE2, an "industry standard" of sorts, and a rough superset of Perl, Python, etc.
Oniguruma, an internationalization-oriented engine with some modern features
ICU, used by NSRegularExpression, a Unicode-focused engine
Our interpretation of UTS#18's guidance, which is about semantics, but we can infer syntactic feature sets.
TODO: .NET, which has delimiter-balancing and some interesting minor details on conditional patterns
These aren't all strictly compatible (e.g. a set operator in PCRE2 would just be a redundant statement of a set member). We can explore adding strict compatibility modes, but in general the syntactic superset is fairly straight-forward.
The below are (roughly) implemented. There may be bugs, but we have some support and some testing coverage:
a|b
(x)
, (?:x)
, (?<name>x)
\n
, \a
\u{...}
, \x{...}
, \uHHHH
.
, \d
, \w
, \s
[...]
, including binary operators &&
, ~~
, --
x?
, x+
, x*
, x{n,m}
\b
, ^
, $
\Q ... \E
(?#comment)
\p{...}
, [:...:]
\N{...}
, \N{U+hh}
(?=)
, (?!)
, (*pla:)
, (?*...)
, (?<*...)
, (napla:...)
(*script_run:...)
, (*sr:...)
, (*atomic_script_run:...)
, (*asr:...)
\ddd
, \o{...}
\1
, \g2
, \g{2}
, \k<name>
, \k'name'
, \g{name}
, \k{name}
, (?P=name)
(?m)
, (?-i)
, (?:si)
, (?^m)
\g<n>
, \g'n'
, (?R)
, (?1)
, (?&name)
, (?P>name)
(?(R)...)
, (?(n)...)
, (?(<n>)...)
, (?('n')...)
, (?(condition)then|else)
(?C2)
, (?C"text")
(*ACCEPT)
, (*SKIP:NAME)
(?<name1-name2>...)
\k<n+level>
, (?(n+level))
(?{...})
, (*name)
(?{...})
has in-line code in it, we could consider the same (for now, we just parse an arbitrary string)(?~absent)
(*LIMIT_MATCH=d)
, (*LF)
(?x)
/(?xx)
syntax allowing for non-semantic whitespace and end-of-line comments abc # comment
Additionally, we have (even more experimental) support for some syntactic conveniences, if specified. Note that each of these (except perhaps ranges) may introduce a syntactic incompatibility with existing traditional-syntax regexes. Thus, they are mostly illustrative, showing what happens and where we go as we slide down this "slippery slope".
/a b c/ === /abc/
/"a.b"/ === /\Qa.b\E/
/a{2..<10} b{...3}/ === /a{2,9}b{0,3}/
/a (_: b) c/ === /a(?:b)c/
TBD:
/a (name: b) c/ === /a(?<name>b)c/
/* comment */ or
// commentinstead of
(?#. comment)`// comment
X
: grapheme cluster semanticsO
: Unicode scalar semanticsb
: byte semanticsImplemented:
|
in alternation-
in [a-f]
TBD:
Initial parser support landed in swiftlang/swift#40595, using the delimiters '/.../'
, which are lexed in-package.
e.g. trimPrefix
on RangeReplaceableCollection
should call through to the customized version of RRC.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. πππ
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google β€οΈ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.