slevithan / regex

Regex template tag for readable, high-performance, native JS regexes, with context-aware interpolation and always-on best practices

License: MIT License

JavaScript 98.19% HTML 1.16% TypeScript 0.65%
regex regular-expression one-of-a-kind

regex's People

Contributors

benblank, jaslong, josephshannon, rauschma, slevithan, subtlegradient

regex's Issues

Add a regex optimizer

There are many optimizations that regex could make to provided expressions to improve the readability, brevity, and potentially performance of generated source.

For example, regex`[a-zA-Z_0-9][A-Z_\da-z]*` could become /\w+/v.
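A quick way to sanity-check a rewrite like this is to compare both forms against sample inputs. The sketch below uses plain native regexes rather than the regex tag, adds `^…$` anchors so whole strings are compared, and uses a hand-picked sample list; it is a spot check under those assumptions, not a proof of equivalence.

```javascript
// Verbose form from the example above, anchored for whole-string comparison
const verbose = /^[a-zA-Z_0-9][A-Z_\da-z]*$/;
// Proposed optimized form
const optimized = /^\w+$/;

// Hand-picked samples (an assumption; a real optimizer would need stronger guarantees)
const samples = ['abc', 'A_1', '9lives', '', 'a-b', 'snake_case', 'ünï'];

// Collect every sample on which the two forms disagree
const disagreements = samples.filter(s => verbose.test(s) !== optimized.test(s));
console.log(disagreements); // → []
```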

This would be especially valuable for regex's Babel plugin, since its build-time transpilation would avoid any added runtime cost from such an optimizer.

Maximally optimizing provided regexes would be much easier with a regex AST. The best JS regex AST builders are probably regexp-tree and regexpp. In fact, regexp-tree already includes what looks to be an excellent optimizer module, which could be used directly (after extending regexp-tree's supported syntax via its plugin API to allow for regex's extended syntax).

The optimizer could potentially be augmented with regex-specific optimizations that take advantage of atomic groups. One example is an automatic ReDoS buster that inserts atomic groups when nested quantifiers match overlapping strings and are followed by a token that does not overlap with the preceding group. E.g., (?:\w+\s?)+$ would become (?>\w+\s?)+$.

So far, regex has intentionally avoided building a regex AST or using an existing library to do so, since that would significantly add to bundle size, and remaining lightweight is a critical goal. However, if such an optimizer were created as a plugin (applied via regex's existing plugin support, documented here), the optimizer could then be selectively imported and applied by users, and regex's Babel plugin would be able to use it.

Two options for how such an optimizer could be applied by regex's Babel plugin:

  • Only when it's explicitly included in source via e.g. regex({plugins: [optimize]})`…`. This would maintain consistency in output with the base regex library. However, it would probably require updating the Babel plugin with special knowledge about the optimizer, possibly by just checking for plugin name optimize when transpiling each regex call.
  • The optimize plugin could be added to all regex tags processed by the Babel plugin when a new option optimize: true is set on the Babel plugin's configuration options.

If anyone wants to help with or take over this issue, that would be very welcome.

Invalid escape

Creating this regex natively works:

console.log('Native: ', new RegExp('(?<Q>[abc])[\k<Q>]', 'v'))

Doing the same with regex throws an "Invalid escape" error:

import { regex, pattern } from 'regex';
const JZON_REGEX = regex`(?<Q>[abc])[\k<Q>]`;
console.log(JZON_REGEX);

The error message is:

Invalid regular expression: /(?<Q>[abc])[\k<Q\>]/v: Invalid escape

Notice the \ in front of the last >. I think there is an automatic conversion that goes wrong here.

Is there a way to check if this package is supported?

I stumbled over this package as I was trying to improve the performance of our RegExes and noticed the disclaimer about the v flag that's required. I know that not every customer has a "modern" machine, so is there a way to check if this package is supported at all?

Something like isSupported to check if I can even use this setup would be nice, so that I can implement a fall back for older browsers/systems.

I guess one way of doing it is to check via a try-catch, but I am not sure if I am missing anything else:

function isSupported() {
  try {
    new RegExp('', 'v');
    return true;
  } catch (e) {
    return false;
  }
}

Rename `partial` → `pattern`

I'm considering renaming function/tag partial to pattern, and class PartialPattern to Pattern. The latter isn't directly usable, but return values of partial are instances of this class. These changes would be included in v3.0.0.

@rauschma, since you wrote about regex and partial here, I wanted to give you a heads up about this. Do you think this is a good idea?

Saving unneeded draft `@@replace` override

regex's implementation of WrappedRegex (used when option subclass is true) overrides exec, which is then automatically used by all RegExp methods and RegExp-using String methods.

Following is draft code I was working on before realizing I only needed to override exec. I'm saving it here for posterity, or in case it's needed in the future.

class WrappedRegex extends RegExp {
  #captureNums;
  constructor(expression, flags, data) {
    // ...
  }
  // Adjust numbered backreferences to point to the correct capturing group, accounting for
  // anonymous captures added only for extended syntax emulation
  [Symbol.replace](str, replacement) {
    const replaceFn = RegExp.prototype[Symbol.replace];
    if (replacement instanceof Function) {
      return replaceFn.call(this, str, (...args) => {
        const newArgs = [];
        const hasGroupsArg = typeof args[args.length - 1] === 'object';
        for (let i = 0; i < args.length; i++) {
          const arg = args[i];
          const endOffset = args.length - 1 - i;
          if (
            // Keep last 2 args
            endOffset < 2 ||
            // Keep third last arg if a groups object is included at the end
            (endOffset === 2 && hasGroupsArg) ||
            // Keep backreferences that weren't added only for extended syntax emulation
            this.#captureNums[i] !== null
          ) {
            newArgs.push(arg);
          }
        }
        return replacement(...newArgs);
      });
    }
    replacement = replaceFn.call(/\$\$|\$([1-9]\d*)/g, replacement, (m, refNumStr) => {
      if (m === '$$') {
        return m;
      }
      // Ex: `[0, 1, null, 3]` to `[0, 1, 3]`
      const filtered = this.#captureNums.filter(c => c !== null);
      const mappedNum = filtered[+refNumStr];
      if (mappedNum) {
        return '$' + mappedNum;
      }
      return '$$' + refNumStr;
    });
    return replaceFn.call(this, str, replacement);
  }
  [Symbol.split](str, limit) {
    // Native `Symbol.split` first copies the regex before using it
    const result = RegExp.prototype[Symbol.split].call(this, str, limit);
    // TODO: Reimplement `split` to work with `this.#captureNums`
    return result;
  }
  [Symbol.matchAll](str) {
    // Native `Symbol.matchAll` first copies the regex before using it
    const result = RegExp.prototype[Symbol.matchAll].call(this, str);
    // TODO: Reimplement `matchAll` to work with `this.#captureNums`
    return result;
  }
}
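The point made at the top of this issue, that an overridden exec is picked up by the built-in RegExp methods and the RegExp-using String methods, can be observed with a small sketch. `CountingRegex` and `execCalls` are hypothetical names used only for illustration; this is not regex's WrappedRegex implementation.

```javascript
// Hypothetical subclass that only counts how often `exec` is invoked
class CountingRegex extends RegExp {
  execCalls = 0;
  exec(str) {
    this.execCalls++;
    return super.exec(str);
  }
}

const re = new CountingRegex('\\d+', 'g');

// `RegExp.prototype.test` delegates to the instance's `exec`
re.test('a1');
const afterTest = re.execCalls;

// `String.prototype.replace` goes through `Symbol.replace`, which also calls `exec`
re.lastIndex = 0;
const replaced = 'a1b22'.replace(re, '#');
const afterReplace = re.execCalls;

console.log(replaced); // → 'a#b#'
console.log(afterReplace > afterTest); // → true
```

As the comments in the draft above note, split and matchAll first copy the regex, so a real implementation also has to make sure the copy carries the same data.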

Feat: Add types

I would be happy to add types to regex, along with its extension regex-recursion and their shared utilities regex-utilities.

I'd love help with this or for someone to take over this issue. The types are fairly complex and I'm not yet well-versed with TypeScript.

Breaking: Throw for `[^\p{…}]` with flag `i` in environments without native flag `v`

regex v2.1.0 introduced src/backcompat.js, which extended support for most usage backward to environments without native flag v. This file is a postprocessor that is conditionally applied when flag v is not supported natively, and it does several things:

  1. Transpiles flag v's escaping rules to u's (by un-escaping some characters).
  2. Throws for character class syntax that is invalid with v but valid with u (unescaped (){}/|, reserved double punctuators, and leading/trailing -, since all of these would throw in environments with v).
  3. Throws for features that require native flag v (character class set operators -- and &&, and a more descriptive error for nested character classes which are already invalid with u).
  4. Throws when using doubly-negated [^\P{…}] if the regex uses flag i, to prevent an unintuitive (likely) behavior difference (likely, since many but not all Unicode properties include letters with case).

Regarding point 4, this is fine, but it's incomplete. At the time, I was referencing this and this, which only show examples with doubly-negated [^\P{…}]. However, as @rauschma points out here, this v/u incompatibility applies equally with \p (lowercase) in /[^\p{…}]/iu.

Updating this in src/backcompat.js to also cover \p would be easy, but would also be a (rarely applicable) breaking change. IMO it wouldn't be a problem to start throwing without a new major version if the behavior was always incompatible (since anyone with such patterns in their code would already have a bug in environments without v), but since not all uses of /[^\p{…}]/iu suffer from this issue, I will hold off on updating it until there are additional breaking changes to include in a v4.0.

Avoid eager escaping of "lone double punctuators"

In environments with native v, regex`[>]` returns /[\>]/v (with escaped \>). This is fine, since the escaping doesn't change the meaning with flag v. And since regex automatically applies flag v's escaping rules when flag v isn't natively supported or when v is explicitly disabled, regex({disable: {v: true}})`[>]` correctly returns /[>]/u. Here, > is unescaped, since it is invalid to escape this character within a character class with flag u, and since \> is valid with v, regex converts it to its valid equivalent (>) for flag u.

Under the hood, the escaping is coming from the emulation of implicit flag x and its rules for sandboxing lone double punctuators, as discussed in #21. This means that regex({disable: {x: true}})`[>]` returns /[>]/v, which is also fine.

However, an (extreme edge case) issue arises since regex includes an advanced option for changing or removing the unicodeSetsPlugin that applies flag v's character class escaping rules. If you replace it with your own function, well, the function you provide assumes responsibility for unescaping characters (like >) that can be escaped within character classes with flag v but are not allowed to be escaped with flag u. So far, still no issues. But, if you explicitly set unicodeSetsPlugin to null (rather than replacing it with your own plugin) AND explicitly disable flag v (or run in an environment without native v) AND don't disable flag x, now we have an issue. regex({unicodeSetsPlugin: null, disable: {v: true}})`[>]` returns /[\>]/u, which is an error. And since the user didn't escape their > in the input, this error doesn't seem appropriate.
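The underlying difference between the two flags can be reproduced with native regexes alone, without the regex library. The sketch below uses a hypothetical helper `compiles` to report whether a pattern is accepted; the flag v check is skipped in environments that lack native v.

```javascript
// Hypothetical helper: does this source compile with these flags?
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch {
    return false;
  }
}

const hasFlagV = compiles('', 'v');

// With flag u, `\>` is not a permitted escape inside a character class
const escapedWithU = compiles('[\\>]', 'u');
// Unescaped `>` is fine with flag u
const unescapedWithU = compiles('[>]', 'u');
// With flag v, `\>` is allowed; `null` means flag v is unavailable here
const escapedWithV = hasFlagV ? compiles('[\\>]', 'v') : null;

console.log({escapedWithU, unescapedWithU, escapedWithV});
```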

This extreme edge case is irrelevant for regular use, but could be relevant for tools that want to use regex as the handler for regexes in their own APIs but expect users to provide flag u syntax rather than flag v syntax. Apart from the issue discussed here, regex supports such usage via the combination of options regex({unicodeSetsPlugin: null, disable: {v: true}})`…`.

The solution would be to update flag x's handling to be less eager about escaping lone double punctuators in character classes, and only do so in cases where it is required for accurate sandboxing. An alternative, however, would be to not chase support for flag-u syntax as input, and just assert that setting unicodeSetsPlugin to null can cause issues in cases where regex emits flag-v-only escapes (I'd prefer not to do this).

Thanks

Amazing thing you have created here. Thank you for creating this for the world to enjoy.

(you can close this issue now that you read this.)

Operations to avoid `.lastIndex` hacks?

Are there any ideas for operations so that we don’t need .lastIndex hacks anymore (I use them to tokenize efficiently)?

Maybe (with better names...):

  • matchAt(string, regex, pos): {match, newPos}
    • Eventually: String.prototype.matchAt()
  • testAt(regex, string, pos): {isMatch, newPos}
    • Eventually: RegExp.prototype.testAt()
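One possible shape for these operations, sketched on top of the sticky flag (y), is shown below. The function names and return shapes follow the proposal above and are hypothetical; neither function exists in regex or in the language. This version clones the regex on every call for simplicity, so a real tokenizer would probably want to cache the sticky copy.

```javascript
// Hypothetical: match `regex` against `string` only at position `pos`
function matchAt(string, regex, pos) {
  // Work on a sticky copy so the caller's regex and its lastIndex are untouched
  const flags = regex.flags.includes('y') ? regex.flags : regex.flags + 'y';
  const sticky = new RegExp(regex.source, flags);
  sticky.lastIndex = pos;
  const match = sticky.exec(string);
  // After a successful sticky match, lastIndex is the end of the match
  return {match, newPos: match ? sticky.lastIndex : pos};
}

// Hypothetical: like `matchAt`, but only reports whether a match was found
function testAt(regex, string, pos) {
  const {match, newPos} = matchAt(string, regex, pos);
  return {isMatch: match !== null, newPos};
}

console.log(matchAt('foo123', /\d+/, 3).newPos); // → 6
console.log(testAt(/\d+/, 'foo123', 0)); // → { isMatch: false, newPos: 0 }
```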

Question about edge case handling with interpolated regexes

Note: The following is all a non-issue if flag n isn't explicitly disabled, since flag n prevents the use of numbered backreferences in the outer regex.

When flag n is disabled and a RegExp instance with captures is interpolated into a regex, numbered backreferences in the outer regex are not adjusted, which might be nonintuitive for numbered backreferences in the outer-regex that point to captures that appear after the interpolation.

Example:

regex({disable: {n: true}})`${/()/}()\1`;
// Currently returns → /()()\1/v
// Potential alternative → /()()\2/v

This is not a bug. The documentation doesn't claim that numbered backreferences in the outer regex are adjusted to account for captures in inner (interpolated) regexes. The reason that numbered backreferences within inner regexes are adjusted to work within the overall pattern is because they are self-contained, independent patterns (with no context on or ability to reference the pattern outside themselves). But this logic doesn't apply to the outer regex, where the developer is in control of the expression and knows the position where interpolated regexes will appear.

That said, arguably, given that the numbered backreferences of inner regexes are adjusted, the current behavior for the outer regex is nonintuitive, and maybe regex should adjust for cases like this.

Note that regex({disable: {n: true}})`()${/()\1/}\1` already works intuitively, returning /()()\2\1/v. In this case, the \1 in the outer regex references a capture that appears before the inner regex, so it correctly continues referring to the same group.
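The examples in this issue use empty groups, so the two candidate outputs match the same strings. With non-empty groups the difference becomes observable. The sketch below uses plain native regexes as stand-ins for the two possible outputs: the first group plays the role of the interpolated regex's capture and the second the outer regex's capture.

```javascript
// Stand-in for the current behavior: the outer `\1` is left as-is,
// so it points at the capture contributed by the interpolated regex
const unadjusted = /^(a)(b)\1$/;
// Stand-in for the potential alternative: the outer `\1` is renumbered to `\2`,
// so it keeps pointing at the outer regex's own capture
const adjusted = /^(a)(b)\2$/;

console.log(unadjusted.test('aba'), unadjusted.test('abb')); // → true false
console.log(adjusted.test('aba'), adjusted.test('abb')); // → false true
```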

I'll leave this issue open for now to think more about it, but I suspect I'll end up closing it with no change. Feedback is welcome.

add cjs dist build output?

for horrifically annoying and stupid reasons beyond my ability/interest to control, I am forced to keep using CommonJS in 2024. Only way I could use this was by manually copying the min.js file into our code and exporting it myself.

it'd be hip if there was a cjs CommonJS file in the distribution so this Just Works™

also, SO thrilled I'm finally getting to use your stuff!
I've been a huge fan for like 372 years

Combine types into single file

#6 added TypeScript .d.ts files. However, multiple .d.ts files are currently generated. It would be nicer to generate a single file index.d.ts. PRs (or any other help) would be welcome from someone who wants to take on this issue. I'd be fine with switching from esbuild to another bundler if that helps keep things simple.

@rauschma asked about how to do this in this Mastodon thread. Saving the post and replies below for posterity.

Axel Rauschmayer:
#TypeScript via JSDoc—the following works but produces multiple .d.ts files (only index.js is in the package exports, so a single .d.ts file would be better):

tsc src/index.js --rootDir src --declaration --allowJs --emitDeclarationOnly --outdir types

If I do --outfile index.d.ts (vs. --outdir) then I do get a single file but each module now has its own namespace which doesn’t go well with package exports.

Jon Koops:
Not sure if it fits the bill exactly but https://tsup.egoist.dev/?

thurti:
I had the same problem recently. I looked at how others do it (svelte) and ended up writing a script that bundles the d.ts files.
Something like https://github.com/TranscribeJs/transcribe.js/blob/main/scripts/bundle-types.js

Nicholas C. Zakas:
There is a Rollup plugin that combines .d.ts files.
Though these days I usually Rollup the JS into a single file and then run tsc on the bundle file to avoid the mess.

Note: The Rollup plugin is probably referring to rollup-plugin-dts.

amsyar:
https://github.com/timocov/dts-bundle-generator

Clarify the complete absence of numbered capturing groups with the n flag

I quite like the n flag, but not having previous experience with it, I still had questions after reading that section of the README which I was ultimately only able to answer by reading the source. Specifically, I was unclear on whether numbered / unnamed capturing groups were simply no longer possible or could still be created via some other syntax.

  • The flag is named "no auto capture", but that name doesn't make it clear that there isn't an explicit way to still create numbered capturing groups, either.
  • The (…) syntax is called out as no longer capturing, but there's nothing which makes it clear that's the only syntax for creating numbered capturing groups.
  • It's called out that "numbered backreferences to named groups" are also disabled, but doesn't mention that numbered backreferences are entirely disallowed. (After all, there's no way to create a numbered capturing group, so they can't refer to anything, anyway.)

I'm probably picking at nits here, but something about how that section is phrased left me wondering whether it was implying that numbered capturing groups could be created in some other way, even as a fairly experienced user of regular expressions. 🙂

Changing the name of the flag to something like "no numbered capture" might be more clear, but adding another implementation-specific name for the flag might not be great. Tweaking the description could also clear things up considerably, with or without a name change. Perhaps even just changing the first sentence to something like:

Flag n gives you no auto capture mode, which disables numbered capturing groups (so a plain (…) group is always non-capturing) but preserves named capture.

And thanks for creating this library! It looks like it adds some very helpful features and safe defaults to a regex engine which could use them.

Potential alternative for flags syntax

(Very useful library, thanks for creating it!)

Another option for specifying RegExp flags is:

regex`/^.+/gm`

That is equivalent to:

regex('gm')`^.+`

Benefits:

  • Fewer characters to type.
  • Syntax should be more familiar to programmers.
  • Easier to switch between RegExp literals and tagged templates.

One could make the slashes optional if there are no flags.
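To make the proposal concrete, here is a rough sketch of how a literal-style string could be split into pattern and flags. `splitLiteral` is a hypothetical helper, not part of regex, and it shows one ambiguity of optional slashes: a pattern that itself starts with / and contains a later / followed only by letters would be misread as a literal.

```javascript
// Hypothetical: split '/pattern/flags' into its parts; anything else is a bare pattern
function splitLiteral(input) {
  const end = input.lastIndexOf('/');
  if (input.startsWith('/') && end > 0) {
    const flags = input.slice(end + 1);
    // Only treat the tail as flags if it consists of letters (or is empty)
    if (/^[a-z]*$/i.test(flags)) {
      return {pattern: input.slice(1, end), flags};
    }
  }
  return {pattern: input, flags: ''};
}

console.log(splitLiteral('/^.+/gm')); // → { pattern: '^.+', flags: 'gm' }
console.log(splitLiteral('^.+')); // → { pattern: '^.+', flags: '' }
```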

hermes support for react native?

Not sure what it would take to make it work on RN in Hermes.
https://github.com/facebook/hermes/blob/main/README.md


https://github.com/facebook/hermes/blob/73cb6664fe233150e1313553a135ffc472c16227/doc/RegExp.md

## RegExp


The Hermes regexp engine is a traditional engine using a backtracking stack. 
It compiles a regexp into bytecode which can be executed efficiently. For regexp literals like `/abc/`, 
this occurs at compile time: the regexp bytecode is embedded into the Hermes bytecode file. 
Note regexp bytecode is distinct from Hermes bytecode.


The regexp engine proceeds as follows:


1. *Parse phase.* The regexp parser emits a tree of nodes, effectively an IR.
1. *Optimization phase.* The node tree is traversed and optimized in various ways.
1. *Emitting phase.* The node tree is traversed and emits regexp bytecode.
1. *Execution phase.* The bytecode is executed against an input string.


## Supported Syntax


As of this writing, Hermes regexp supports


1. All of ES6, including global, case-insensitive, multiline, sticky, and Unicode (and legacy).
1. ES9 lookbehinds.
1. Named capture groups.
1. Unicode property escapes.
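Since the list above may lag behind the engine, one way to find out what a given runtime accepts is to probe it directly. The sketch below is a generic feature probe using a hypothetical helper `compiles`; it is not specific to Hermes, and the results depend entirely on the engine it runs in.

```javascript
// Hypothetical helper: does this source compile with these flags?
function compiles(source, flags = '') {
  try {
    new RegExp(source, flags);
    return true;
  } catch {
    return false;
  }
}

// Probe the features named in the list above, plus flag v
const support = {
  lookbehind: compiles('(?<=a)b'),
  namedGroups: compiles('(?<name>a)\\k<name>'),
  unicodeProperties: compiles('\\p{L}', 'u'),
  flagV: compiles('', 'v'),
};

console.log(support);
```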
