Giter Site home page Giter Site logo

regexp-tree's Introduction

regexp-tree

Build Status npm version npm downloads

Regular expressions processor in JavaScript

TL;DR: RegExp Tree is a regular expressions processor, which includes parser, traversal, transformer, optimizer, and interpreter APIs.

You can get an overview of the tool in this article.

Table of Contents

Installation

The parser can be installed as an npm module:

npm install -g regexp-tree

You can also try it online using AST Explorer.

Development

  1. Fork https://github.com/DmitrySoshnikov/regexp-tree repo
  2. If there is an actual issue from the issues list you'd like to work on, feel free to assign it yourself, or comment on it to avoid collisions (open a new issue if needed)
  3. Make your changes
  4. Make sure npm test still passes (add new tests if needed)
  5. Submit a PR

The regexp-tree parser is implemented as an automatic LR parser using Syntax tool. The parser module is generated from the regexp grammar, which is based on the regular expressions grammar used in ECMAScript.

For development from the github repository, run build command to generate the parser module, and transpile JS code:

git clone https://github.com/<your-github-account>/regexp-tree.git
cd regexp-tree
npm install
npm run build

NOTE: JS code transpilation is used to support older versions of Node. For faster development cycle you can use npm run watch command, which continuously transpiles JS code.

Usage as a CLI

Note: the CLI is exposed as its own regexp-tree-cli module.

Check the options available from CLI:

regexp-tree-cli --help
Usage: regexp-tree-cli [options]

Options:
   -e, --expression   A regular expression to be parsed
   -l, --loc          Whether to capture AST node locations
   -o, --optimize     Applies optimizer on the passed expression
   -c, --compat       Applies compat-transpiler on the passed expression
   -t, --table        Print NFA/DFA transition tables (nfa/dfa/all)

To parse a regular expression, pass -e option:

regexp-tree-cli -e '/a|b/i'

Which produces an AST node corresponding to this regular expression:

{
  type: 'RegExp',
  body: {
    type: 'Disjunction',
    left: {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    },
    right: {
      type: 'Char',
      value: 'b',
      symbol: 'b',
      kind: 'simple',
      codePoint: 98
    }
  },
  flags: 'i',
}

NOTE: the format of a regexp is / Body / OptionalFlags.

Usage from Node

The parser can also be used as a Node module:

const regexpTree = require('regexp-tree');

console.log(regexpTree.parse(/a|b/i)); // RegExp AST

Note, regexp-tree supports parsing regexes from strings, and also from actual RegExp objects (in general -- from any object which can be coerced to a string). If some feature is not implemented yet in an actual JavaScript RegExp, it should be passed as a string:

// Pass an actual JS RegExp object.
regexpTree.parse(/a|b/i);

// Pass a string, since `s` flag may not be supported in older versions.
regexpTree.parse('/./s');

Also note, that in string-mode, escaping is done using two slashes \\ per JavaScript:

// As an actual regexp.
regexpTree.parse(/\n/);

// As a string.
regexpTree.parse('/\\n/');

Capturing locations

For source code transformation tools it might be useful also to capture locations of the AST nodes. From the command line it's controlled via the -l option:

regexp-tree-cli -e '/ab/' -l

This attaches loc object to each AST node:

{
  type: 'RegExp',
  body: {
    type: 'Alternative',
    expressions: [
      {
        type: 'Char',
        value: 'a',
        symbol: 'a',
        kind: 'simple',
        codePoint: 97,
        loc: {
          start: {
            line: 1,
            column: 1,
            offset: 1,
          },
          end: {
            line: 1,
            column: 2,
            offset: 2,
          },
        }
      },
      {
        type: 'Char',
        value: 'b',
        symbol: 'b',
        kind: 'simple',
        codePoint: 98,
        loc: {
          start: {
            line: 1,
            column: 2,
            offset: 2,
          },
          end: {
            line: 1,
            column: 3,
            offset: 3,
          },
        }
      }
    ],
    loc: {
      start: {
        line: 1,
        column: 1,
        offset: 1,
      },
      end: {
        line: 1,
        column: 3,
        offset: 3,
      },
    }
  },
  flags: '',
  loc: {
    start: {
      line: 1,
      column: 0,
      offset: 0,
    },
    end: {
      line: 1,
      column: 4,
      offset: 4,
    },
  }
}

From Node it's controlled via setOptions method exposed on the parser:

const regexpTree = require('regexp-tree');

const parsed = regexpTree
  .parser
  .setOptions({captureLocations: true})
  .parse(/a|b/);

The setOptions method sets global options, which are preserved between calls. It is also possible to provide options per a single parse call, which might be more preferred:

const regexpTree = require('regexp-tree');

const parsed = regexpTree.parse(/a|b/, {
  captureLocations: true,
});

Parsing options

The parser supports several options which can be set globally via the setOptions method on the parser, or by passing them with each parse method invocation.

Example:

const regexpTree = require('regexp-tree');

const parsed = regexpTree.parse(/a|b/, {
  allowGroupNameDuplicates: true,
});

The following options are supported:

  • captureLocations: boolean -- whether to capture AST node locations (false by default)
  • allowGroupNameDuplicates: boolean -- whether to skip duplicates check of the named capturing groups

Set allowGroupNameDuplicates would make the following expression possible:

/
  # YYY-MM-DD date format:

  (?<year>  \d{4}) -
  (?<month> \d{2}) -
  (?<day>   \d{2})

  |

  # DD.MM.YYY date format

  (?<day>   \d{2}) .
  (?<month> \d{2}) .
  (?<year>  \d{4})

/x

Using traversal API

The traverse module allows handling needed AST nodes using the visitor pattern. In Node the module is exposed as the regexpTree.traverse method. Handlers receive an instance of the NodePath class, which encapsulates node itself, its parent node, property, and index (in case the node is part of a collection).

Visiting a node follows this algorithm:

  • call pre handler.
  • recurse into node's children.
  • call post handler.

For each node type of interest, you can provide either:

  • a function (pre).
  • an object with members pre and post.

You can also provide a * handler which will be executed on every node.

Example:

const regexpTree = require('regexp-tree');

// Get AST.
const ast = regexpTree.parse('/[a-z]{1,}/');

// Traverse AST nodes.
regexpTree.traverse(ast, {

  // Visit every node before any type-specific handlers.
  '*': function({node}) {
    ...
  },

  // Handle "Quantifier" node type.
  Quantifier({node}) {
    ...
  },

  // Handle "Char" node type, before and after.
  Char: {
    pre({node}) {
      ...
    },
    post({node}) {
      ...
    }
  }

});

// Generate the regexp.
const re = regexpTree.generate(ast);

console.log(re); // '/[a-z]+/'

Using transform API

NOTE: you can play with transformation APIs, and write actual transforms for quick tests in AST Explorer. See this example.

While traverse module provides basic traversal API, which can be used for any purposes of AST handling, transform module focuses mainly on transformation of regular expressions.

It accepts a regular expressions in different formats (string, an actual RegExp object, or an AST), applies a set of transformations, and retuns an instance of TransformResult. Handles receive as a parameter the same NodePath object used in traverse.

Example:

const regexpTree = require('regexp-tree');

// Handle nodes.
const re = regexpTree.transform('/[a-z]{1,}/i', {

  /**
   * Handle "Quantifier" node type,
   * transforming `{1,}` quantifier to `+`.
   */
  Quantifier(path) {
    const {node} = path;

    // {1,} -> +
    if (
      node.kind === 'Range' &&
      node.from === 1 &&
      !node.to
    ) {
      path.replace({
        type: 'Quantifier',
        kind: '+',
        greedy: node.greedy,
      });
    }
  },
});

console.log(re.toString()); // '/[a-z]+/i'
console.log(re.toRegExp()); // /[a-z]+/i
console.log(re.getAST()); // AST for /[a-z]+/i

Transform plugins

A transformation plugin is a module which exports a transformation handler. We have seen above how we can pass a handler object directly to the regexpTree.transform method, here we extract it into a separate module, so it can be implemented and shared independently:

Example of a plugin:

// file: ./regexp-tree-a-to-b-transform.js


/**
 * This plugin replaces chars 'a' with chars 'b'.
 */
module.exports = {
  Char({node}) {
    if (node.kind === 'simple' && node.value === 'a') {
      node.value = 'b';
      node.symbol = 'b';
      node.codePoint = 98;
    }
  },
};

Once we have this plugin ready, we can require it, and pass to the transform function:

const regexpTree = require('regexp-tree');
const plugin = require('./regexp-tree-a-to-b-transform');

const re = regexpTree.transform(/(a|c)a+[a-z]/, plugin);

console.log(re.toRegExp()); // /(b|c)b+[b-z]/

NOTE: we can also pass a list of plugins to the regexpTree.transform. In this case the plugins are applied in one pass in order. Another approach is to run several sequential calls to transform, setting up a pipeline, when a transformed AST is passed further to another plugin, etc.

You can see other examples of transform plugins in the optimizer/transforms or in the compat-transpiler/transforms directories.

Using generator API

The generator module generates regular expressions from corresponding AST nodes. In Node the module is exposed as regexpTree.generate method.

Example:

const regexpTree = require('regexp-tree');

const re = regexpTree.generate({
  type: 'RegExp',
  body: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  flags: 'i',
});

console.log(re); // '/a/i'

Using optimizer API

Optimizer transforms your regexp into an optimized version, replacing some sub-expressions with their idiomatic patterns. This might be good for different kinds of minifiers, as well as for regexp machines.

NOTE: the Optimizer is implemented as a set of regexp-tree plugins.

Example:

const regexpTree = require('regexp-tree');

const originalRe = /[a-zA-Z_0-9][A-Z_\da-z]*\e{1,}/;

const optimizedRe = regexpTree
  .optimize(originalRe)
  .toRegExp();

console.log(optimizedRe); // /\w+e+/

From CLI the optimizer is available via --optimize (-o) option:

regexp-tree-cli -e '/[a-zA-Z_0-9][A-Z_\da-z]*\e{1,}/' -o

Result:

Optimized: /\w+e+/

See the optimizer README for more details.

Optimizer ESLint plugin

The optimizer module is also available as an ESLint plugin, which can be installed at: eslint-plugin-optimize-regex.

Using compat-transpiler API

The compat-transpiler module translates your regexp in new format or in new syntax, into an equivalent regexp in a legacy representation, so it can be used in engines which don't yet implement the new syntax.

NOTE: the compat-transpiler is implemented as a set of regexp-tree plugins.

Example, "dotAll" s flag:

/./s

Is translated into:

/[\0-\uFFFF]/

Or named capturing groups:

/(?<value>a)\k<value>\1/

Becomes:

/(a)\1\1/

To use the API from Node:

const regexpTree = require('regexp-tree');

// Using new syntax.
const originalRe = '/(?<all>.)\\k<all>/s';

// For legacy engines.
const compatTranspiledRe = regexpTree
  .compatTranspile(originalRe)
  .toRegExp();

console.log(compatTranspiledRe); // /([\0-\uFFFF])\1/

From CLI the compat-transpiler is available via --compat (-c) option:

regexp-tree-cli -e '/(?<all>.)\k<all>/s' -c

Result:

Compat: /([\0-\uFFFF])\1/

Compat-transpiler Babel plugin

The compat-transpiler module is also available as a Babel plugin, which can be installed at: babel-plugin-transform-modern-regexp.

Note, the plugin also includes extended regexp features.

RegExp extensions

Some of the non-standard feature are also supported by regexp-tree.

NOTE: "non-standard" means specifically ECMAScript standard, since in other regexp egnines, e.g. PCRE, Python, etc. these features are standard.

One of such features is the x flag, which enables extended mode of regular expressions. In this mode most of whitespaces are ignored, and expressions can use #-comments.

Example:

/
  # A regular expression for date.

  (?<year>\d{4})-    # year part of a date
  (?<month>\d{2})-   # month part of a date
  (?<day>\d{2})      # day part of a date

/x

This is normally parsed by the regexp-tree parser, and compat-transpiler has full support for it; it's translated into:

/(\d{4})-(\d{2})-(\d{2})/

RegExp extensions Babel plugin

The regexp extensions are also available as a Babel plugin, which can be installed at: babel-plugin-transform-modern-regexp.

Note, the plugin also includes compat-transpiler features.

Creating RegExp objects

To create an actual RegExp JavaScript object, we can use regexpTree.toRegExp method:

const regexpTree = require('regexp-tree');

const re = regexpTree.toRegExp('/[a-z]/i');

console.log(
  re.test('a'), // true
  re.test('Z'), // true
);

Executing regexes

It is also possible to execute regular expressions using exec API method, which has support for new syntax, and features, such as named capturing group, etc:

const regexpTree = require('regexp-tree');

const re = `/

  # A regular expression for date.

  (?<year>\\d{4})-    # year part of a date
  (?<month>\\d{2})-   # month part of a date
  (?<day>\\d{2})      # day part of a date

/x`;

const string = '2017-04-14';

const result = regexpTree.exec(re, string);

console.log(result.groups); // {year: '2017', month: '04', day: '14'}

Using interpreter API

NOTE: you can read more about implementation details of the interpreter in this series of articles.

In addition to executing regular expressions using JavaScript built-in RegExp engine, RegExp Tree also implements own interpreter based on classic NFA/DFA finite automaton engine.

Currently it aims educational purposes -- to trace the regexp matching process, transitioning in NFA/DFA states. It also allows building state transitioning table, which can be used for custom implementation. In API the module is exposed as fa (finite-automaton) object.

Example:

const {fa} = require('regexp-tree');

const re = /ab|c*/;

console.log(fa.test(re, 'ab')); // true
console.log(fa.test(re, '')); // true
console.log(fa.test(re, 'c')); // true

// NFA, and its transition table.
const nfa = fa.toNFA(re);
console.log(nfa.getTransitionTable());

// DFA, and its transition table.
const dfa = fa.toDFA(re);
console.log(dfa.getTransitionTable());

For more granular work with NFA and DFA, fa module also exposes convenient builders, so you can build NFA fragments directly:

const {fa} = require('regexp-tree');

const {
  alt,
  char,
  or,
  rep,
} = fa.builders;

// ab|c*
const re = or(
  alt(char('a'), char('b')),
  rep(char('c'))
);

console.log(re.matches('ab')); // true
console.log(re.matches('')); // true
console.log(re.matches('c')); // true

// Build DFA from NFA
const {DFA} = fa;

const reDFA = new DFA(re);

console.log(reDFA.matches('ab')); // true
console.log(reDFA.matches('')); // true
console.log(reDFA.matches('c')); // true

Printing NFA/DFA tables

The --table option allows displaying NFA/DFA transition tables. RegExp Tree also applies DFA minimization (using N-equivalence algorithm), and produces the minimal transition table as its final result.

In the example below for the /a|b|c/ regexp, we first obtain the NFA transition table, which is further converted to the original DFA transition table (down from the 10 non-deterministic states to 4 deterministic states), and eventually minimized to the final DFA table (from 4 to only 2 states).

./bin/regexp-tree-cli -e '/a|b|c/' --table all

Result:

> - starting
โœ“ - accepting

NFA transition table:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚     โ”‚ a โ”‚ b โ”‚ c  โ”‚ ฮต*          โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 1 > โ”‚   โ”‚   โ”‚    โ”‚ {1,2,3,7,9} โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 2   โ”‚   โ”‚   โ”‚    โ”‚ {2,3,7}     โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 3   โ”‚ 4 โ”‚   โ”‚    โ”‚ 3           โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 4   โ”‚   โ”‚   โ”‚    โ”‚ {4,5,6}     โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 5   โ”‚   โ”‚   โ”‚    โ”‚ {5,6}       โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 6 โœ“ โ”‚   โ”‚   โ”‚    โ”‚ 6           โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 7   โ”‚   โ”‚ 8 โ”‚    โ”‚ 7           โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 8   โ”‚   โ”‚   โ”‚    โ”‚ {8,5,6}     โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 9   โ”‚   โ”‚   โ”‚ 10 โ”‚ 9           โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 10  โ”‚   โ”‚   โ”‚    โ”‚ {10,6}      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜


DFA: Original transition table:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”
โ”‚     โ”‚ a โ”‚ b โ”‚ c โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ค
โ”‚ 1 > โ”‚ 4 โ”‚ 3 โ”‚ 2 โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ค
โ”‚ 2 โœ“ โ”‚   โ”‚   โ”‚   โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ค
โ”‚ 3 โœ“ โ”‚   โ”‚   โ”‚   โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ค
โ”‚ 4 โœ“ โ”‚   โ”‚   โ”‚   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”˜


DFA: Minimized transition table:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”
โ”‚     โ”‚ a โ”‚ b โ”‚ c โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ค
โ”‚ 1 > โ”‚ 2 โ”‚ 2 โ”‚ 2 โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”ค
โ”‚ 2 โœ“ โ”‚   โ”‚   โ”‚   โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”˜

AST nodes specification

Below are the AST node types for different regular expressions patterns:

Char

A basic building block, single character. Can be escaped, and be of different kinds.

Simple char

Basic non-escaped char in a regexp:

z

Node:

{
  type: 'Char',
  value: 'z',
  symbol: 'z',
  kind: 'simple',
  codePoint: 122
}

NOTE: to test this from CLI, the char should be in an actual regexp -- /z/.

Escaped char
\z

The same value, escaped flag is added:

{
  type: 'Char',
  value: 'z',
  symbol: 'z',
  kind: 'simple',
  codePoint: 122,
  escaped: true
}

Escaping is mostly used with meta symbols:

// Syntax error
*
\*

OK, node:

{
  type: 'Char',
  value: '*',
  symbol: '*',
  kind: 'simple',
  codePoint: 42,
  escaped: true
}
Meta char

A meta character should not be confused with an escaped char.

Example:

\n

Node:

{
  type: 'Char',
  value: '\\n',
  symbol: '\n',
  kind: 'meta',
  codePoint: 10
}

Among other meta character are: ., \f, \r, \n, \t, \v, \0, [\b] (backspace char), \s, \S, \w, \W, \d, \D.

NOTE: Meta characters representing ranges (like ., \s, etc.) have undefined value for symbol and NaN for codePoint.

NOTE: \b and \B are parsed as Assertion node type, not Char.

Control char

A char preceded with \c, e.g. \cx, which stands for CTRL+x:

\cx

Node:

{
  type: 'Char',
  value: '\\cx',
  symbol: undefined,
  kind: 'control',
  codePoint: NaN
}
HEX char-code

A char preceded with \x, followed by a HEX-code, e.g. \x3B (symbol ;):

\x3B

Node:

{
  type: 'Char',
  value: '\\x3B',
  symbol: ';',
  kind: 'hex',
  codePoint: 59
}
Decimal char-code

Char-code:

\42

Node:

{
  type: 'Char',
  value: '\\42',
  symbol: '*',
  kind: 'decimal',
  codePoint: 42
}
Octal char-code

Char-code started with \0, followed by an octal number:

\073

Node:

{
  type: 'Char',
  value: '\\073',
  symbol: ';',
  kind: 'oct',
  codePoint: 59
}
Unicode

Unicode char started with \u, followed by a hex number:

\u003B

Node:

{
  type: 'Char',
  value: '\\u003B',
  symbol: ';',
  kind: 'unicode',
  codePoint: 59
}

When using the u flag, unicode chars can also be represented using \u followed by a hex number between curly braces:

\u{1F680}

Node:

{
  type: 'Char',
  value: '\\u{1F680}',
  symbol: '๐Ÿš€',
  kind: 'unicode',
  codePoint: 128640
}

When using the u flag, unicode chars can also be represented using a surrogate pair:

\ud83d\ude80

Node:

{
  type: 'Char',
  value: '\\ud83d\\ude80',
  symbol: '๐Ÿš€',
  kind: 'unicode',
  codePoint: 128640,
  isSurrogatePair: true
}

Character class

Character classes define a set of characters. A set may include as simple characters, as well as character ranges. A class can be positive (any from the characters in the class match), or negative (any but the characters from the class match).

Positive character class

A positive character class is defined between [ and ] brackets:

[a*]

A node:

{
  type: 'CharacterClass',
  expressions: [
    {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    },
    {
      type: 'Char',
      value: '*',
      symbol: '*',
      kind: 'simple',
      codePoint: 42
    }
  ]
}

NOTE: some meta symbols are treated as normal characters in a character class. E.g. * is not a repetition quantifier, but a simple char.

Negative character class

A negative character class is defined between [^ and ] brackets:

[^ab]

An AST node is the same, just negative property is added:

{
  type: 'CharacterClass',
  negative: true,
  expressions: [
    {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    },
    {
      type: 'Char',
      value: 'b',
      symbol: 'b',
      kind: 'simple',
      codePoint: 98
    }
  ]
}
Character class ranges

As mentioned, a character class may also contain ranges of symbols:

[a-z]

A node:

{
  type: 'CharacterClass',
  expressions: [
    {
      type: 'ClassRange',
      from: {
        type: 'Char',
        value: 'a',
        symbol: 'a',
        kind: 'simple',
        codePoint: 97
      },
      to: {
        type: 'Char',
        value: 'z',
        symbol: 'z',
        kind: 'simple',
        codePoint: 122
      }
    }
  ]
}

NOTE: it is a syntax error if to value is less than from value: /[z-a]/.

The range value can be the same for from and to, and the special range - character is treated as a simple character when it stands in a char position:

// from: 'a', to: 'a'
[a-a]

// from: '-', to: '-'
[---]

// simple '-' char:
[-]

// 3 ranges:
[a-zA-Z0-9]+

Unicode properties

Unicode property escapes are a new type of escape sequence available in regular expressions that have the u flag set. With this feature it is possible to write Unicode expressions as:

const greekSymbolRe = /\p{Script=Greek}/u;

greekSymbolRe.test('ฯ€'); // true

The AST node for this expression is:

{
  type: 'UnicodeProperty',
  name: 'Script',
  value: 'Greek',
  negative: false,
  shorthand: false,
  binary: false,
  canonicalName: 'Script',
  canonicalValue: 'Greek'
}

All possible property names, values, and their aliases can be found at the specification.

For General_Category it is possible to use a shorthand:

/\p{Letter}/u;   // Shorthand

/\p{General_Category=Letter}/u; // Full notation

Binary names use the single value as well:

/\p{ASCII_Hex_Digit}/u; // Same as: /[0-9A-Fa-f]/

The capitalized P defines the negation of the expression:

/\P{ASCII_Hex_Digit}/u; // NOT a ASCII Hex digit

Alternative

An alternative (or concatenation) defines a chain of patterns followed one after another:

abc

A node:

{
  type: 'Alternative',
  expressions: [
    {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    },
    {
      type: 'Char',
      value: 'b',
      symbol: 'b',
      kind: 'simple',
      codePoint: 98
    },
    {
      type: 'Char',
      value: 'c',
      symbol: 'c',
      kind: 'simple',
      codePoint: 99
    }
  ]
}

Another examples:

// 'a' with a quantifier, followed by 'b'
a?b

// A group followed by a class:
(ab)[a-z]

Disjunction

The disjunction defines "OR" operation for regexp patterns. It's a binary operation, having left, and right nodes.

Matches a or b:

a|b

A node:

{
  type: 'Disjunction',
  left: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  right: {
    type: 'Char',
    value: 'b',
    symbol: 'b',
    kind: 'simple',
    codePoint: 98
  }
}

Groups

The groups play two roles: they define grouping precedence, and allow to capture needed sub-expressions in case of a capturing group.

Capturing group

"Capturing" means the matched string can be referred later by a user, including in the pattern itself -- by using backreferences.

Char a, and b are grouped, followed by the c char:

(ab)c

A node:

{
  type: 'Alternative',
  expressions: [
    {
      type: 'Group',
      capturing: true,
      number: 1,
      expression: {
        type: 'Alternative',
        expressions: [
          {
            type: 'Char',
            value: 'a',
            symbol: 'a',
            kind: 'simple',
            codePoint: 97
          },
          {
            type: 'Char',
            value: 'b',
            symbol: 'b',
            kind: 'simple',
            codePoint: 98
          }
        ]
      }
    },
    {
      type: 'Char',
      value: 'c',
      symbol: 'c',
      kind: 'simple',
      codePoint: 99
    }
  ]
}

As we can see, it also tracks the number of the group.

Another example:

// A grouped disjunction of a symbol, and a character class:
(5|[a-z])
Named capturing group

A capturing group can be given a name using the (?<name>...) syntax, for any identifier name.

For example, a regular expressions for a date:

/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u

For the group:

(?<foo>x)

We have the following node (the name property with value foo is added):

{
  type: 'Group',
  capturing: true,
  name: 'foo',
  nameRaw: 'foo',
  number: 1,
  expression: {
    type: 'Char',
    value: 'x',
    symbol: 'x',
    kind: 'simple',
    codePoint: 120
  }
}

Note: The nameRaw property represents the name as parsed from the original source, including escape sequences. The name property represents the canonical decoded form of the name.

For example, given the /u flag and the following group:

(?<\u{03C0}>x)

We would have the following node:

{
  type: 'Group',
  capturing: true,
  name: 'ฯ€',
  nameRaw: '\\u{03C0}',
  number: 1,
  expression: {
    type: 'Char',
    value: 'x',
    symbol: 'x',
    kind: 'simple',
    codePoint: 120
  }
}
Non-capturing group

Sometimes we don't need to actually capture the matched string from a group. In this case we can use a non-capturing group:

Char a, and b are grouped, but not captured, followed by the c char:

(?:ab)c

The same node, the capturing flag is false:

{
  type: 'Alternative',
  expressions: [
    {
      type: 'Group',
      capturing: false,
      expression: {
        type: 'Alternative',
        expressions: [
          {
            type: 'Char',
            value: 'a',
            symbol: 'a',
            kind: 'simple',
            codePoint: 97
          },
          {
            type: 'Char',
            value: 'b',
            symbol: 'b',
            kind: 'simple',
            codePoint: 98
          }
        ]
      }
    },
    {
      type: 'Char',
      value: 'c',
      symbol: 'c',
      kind: 'simple',
      codePoint: 99
    }
  ]
}
Backreferences

A capturing group can be referenced in the pattern using notation of an escaped group number.

Matches abab string:

(ab)\1

A node:

{
  type: 'Alternative',
  expressions: [
    {
      type: 'Group',
      capturing: true,
      number: 1,
      expression: {
        type: 'Alternative',
        expressions: [
          {
            type: 'Char',
            value: 'a',
            symbol: 'a',
            kind: 'simple',
            codePoint: 97
          },
          {
            type: 'Char',
            value: 'b',
            symbol: 'b',
            kind: 'simple',
            codePoint: 98
          }
        ]
      }
    },
    {
      type: 'Backreference',
      kind: 'number',
      number: 1,
      reference: 1,
    }
  ]
}

A named capturing group can be accessed using \k<name> pattern, and also using a numbered reference.

Matches www:

(?<foo>w)\k<foo>\1

A node:

{
  type: 'Alternative',
  expressions: [
    {
      type: 'Group',
      capturing: true,
      name: 'foo',
      nameRaw: 'foo',
      number: 1,
      expression: {
        type: 'Char',
        value: 'w',
        symbol: 'w',
        kind: 'simple',
        codePoint: 119
      }
    },
    {
      type: 'Backreference',
      kind: 'name',
      number: 1,
      reference: 'foo',
      referenceRaw: 'foo'
    },
    {
      type: 'Backreference',
      kind: 'number',
      number: 1,
      reference: 1
    }
  ]
}

Note: The referenceRaw property represents the reference as parsed from the original source, including escape sequences. The reference property represents the canonical decoded form of the reference.

For example, given the /u flag and the following pattern (matches www):

(?<ฯ€>w)\k<\u{03C0}>\1

We would have the following node:

{
  type: 'Alternative',
  expressions: [
    {
      type: 'Group',
      capturing: true,
      name: 'ฯ€',
      nameRaw: 'ฯ€',
      number: 1,
      expression: {
        type: 'Char',
        value: 'w',
        symbol: 'w',
        kind: 'simple',
        codePoint: 119
      }
    },
    {
      type: 'Backreference',
      kind: 'name',
      number: 1,
      reference: 'ฯ€',
      referenceRaw: '\\u{03C0}'
    },
    {
      type: 'Backreference',
      kind: 'number',
      number: 1,
      reference: 1
    }
  ]
}

Quantifiers

Quantifiers specify repetition of a regular expression (or of its part). Below are the quantifiers which wrap a parsed expression into a Repetition node. The quantifier itself can be of different kinds, and has Quantifier node type.

? zero-or-one

The ? quantifier is short for {0,1}.

a?

Node:

{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  quantifier: {
    type: 'Quantifier',
    kind: '?',
    greedy: true
  }
}
* zero-or-more

The * quantifier is short for {0,}.

a*

Node:

{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  quantifier: {
    type: 'Quantifier',
    kind: '*',
    greedy: true
  }
}
+ one-or-more

The + quantifier is short for {1,}.

// Same as `aa*`, or `a{1,}`
a+

Node:

{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  quantifier: {
    type: 'Quantifier',
    kind: '+',
    greedy: true
  }
}
Range-based quantifiers

Explicit range-based quantifiers are parsed as follows:

Exact number of matches
a{3}

The type of the quantifier is Range, and from, and to properties have the same value:

{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  quantifier: {
    type: 'Quantifier',
    kind: 'Range',
    from: 3,
    to: 3,
    greedy: true
  }
}
Open range

An open range doesn't have max value (assuming semantic "more", or Infinity value):

a{3,}

An AST node for such range doesn't contain to property:

{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  quantifier: {
    type: 'Quantifier',
    kind: 'Range',
    from: 3,
    greedy: true
  }
}
Closed range

A closed range has explicit max value: (which syntactically can be the same as min value):

a{3,5}

// Same as a{3}
a{3,3}

An AST node for a closed range:

{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  quantifier: {
    type: 'Quantifier',
    kind: 'Range',
    from: 3,
    to: 5,
    greedy: true
  }
}

NOTE: it is a syntax error if the max value is less than min value: /a{3,2}/

Non-greedy

If any quantifier is followed by the ?, the quantifier becomes non-greedy.

Example:

a+?

Node:

{
  type: 'Repetition',
  expression: {
    type: 'Char',
    value: 'a',
    symbol: 'a',
    kind: 'simple',
    codePoint: 97
  },
  quantifier: {
    type: 'Quantifier',
    kind: '+',
    greedy: false
  }
}

Other examples:

a??
a*?
a{1}?
a{1,}?
a{1,3}?

Assertions

Assertions appear as separate AST nodes, however instread of manipulating on the characters themselves, they assert certain conditions of a matching string. Examples: ^ -- beginning of a string (or a line in multiline mode), $ -- end of a string, etc.

^ begin marker

The ^ assertion checks whether a scanner is at the beginning of a string (or a line in multiline mode).

In the example below ^ is not a property of the a symbol, but a separate AST node for the assertion. The parsed node is actually an Alternative with two nodes:

^a

The node:

{
  type: 'Alternative',
  expressions: [
    {
      type: 'Assertion',
      kind: '^'
    },
    {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    }
  ]
}

Since assertion is a separate node, it may appear anywhere in the matching string. The following regexp is completely valid, and asserts beginning of the string; it'll match an empty string:

^^^^^
$ end marker

The $ assertion is similar to ^, but asserts the end of a string (or a line in a multiline mode):

a$

A node:

{
  type: 'Alternative',
  expressions: [
    {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    },
    {
      type: 'Assertion',
      kind: '$'
    }
  ]
}

And again, this is a completely valid regexp, and matches an empty string:

^^^^$$$$$

// valid too:
$^
Boundary assertions

The \b assertion check for word boundary, i.e. the position between a word and a space.

Matches x in x y, but not in xy:

x\b

A node:

{
  type: 'Alternative',
  expressions: [
    {
      type: 'Char',
      value: 'x',
      symbol: 'x',
      kind: 'simple',
      codePoint: 120
    },
    {
      type: 'Assertion',
      kind: '\\b'
    }
  ]
}

The \B is vice-versa checks for non-word boundary. The following example matches x in xy, but not in x y:

x\B

A node is the same:

{
  type: 'Alternative',
  expressions: [
    {
      type: 'Char',
      value: 'x',
      symbol: 'x',
      kind: 'simple',
      codePoint: 120
    },
    {
      type: 'Assertion',
      kind: '\\B'
    }
  ]
}
Lookahead assertions

These assertions check whether a pattern is followed (or not followed for the negative assertion) by another pattern.

Positive lookahead assertion

Matches a only if it's followed by b:

a(?=b)

A node:

{
  type: 'Alternative',
  expressions: [
    {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    },
    {
      type: 'Assertion',
      kind: 'Lookahead',
      assertion: {
        type: 'Char',
        value: 'b',
        symbol: 'b',
        kind: 'simple',
        codePoint: 98
      }
    }
  ]
}
Negative lookahead assertion

Matches a only if it's not followed by b:

a(?!b)

A node is similar, just negative flag is added:

{
  type: 'Alternative',
  expressions: [
    {
      type: 'Char',
      value: 'a',
      symbol: 'a',
      kind: 'simple',
      codePoint: 97
    },
    {
      type: 'Assertion',
      kind: 'Lookahead',
      negative: true,
      assertion: {
        type: 'Char',
        value: 'b',
        symbol: 'b',
        kind: 'simple',
        codePoint: 98
      }
    }
  ]
}
Lookbehind assertions

NOTE: Lookbehind assertions are not yet supported by JavaScript RegExp. It is an ECMAScript proposal which is at stage 3 at the moment.

These assertions check whether a pattern is preceded (or not preceded for the negative assertion) by another pattern.

Positive lookbehind assertion

Matches b only if it's preceded by a:

(?<=a)b

A node:

{
  type: 'Alternative',
  expressions: [
    {
      type: 'Assertion',
      kind: 'Lookbehind',
      assertion: {
        type: 'Char',
        value: 'a',
        symbol: 'a',
        kind: 'simple',
        codePoint: 97
      }
    },
    {
      type: 'Char',
      value: 'b',
      symbol: 'b',
      kind: 'simple',
      codePoint: 98
    },
  ]
}
Negative lookbehind assertion

Matches b only if it's not preceded by a:

(?<!a)b

A node:

{
  type: 'Alternative',
  expressions: [
    {
      type: 'Assertion',
      kind: 'Lookbehind',
      negative: true,
      assertion: {
        type: 'Char',
        value: 'a',
        symbol: 'a',
        kind: 'simple',
        codePoint: 97
      }
    },
    {
      type: 'Char',
      value: 'b',
      symbol: 'b',
      kind: 'simple',
      codePoint: 98
    },
  ]
}

regexp-tree's People

Contributors

brainmaestro avatar brettz9 avatar chantastic avatar danielruf avatar davisjam avatar dependabot[bot] avatar dmitrysoshnikov avatar duncanc avatar fisker avatar golmote avatar hg42 avatar jlhwung avatar just1ngray avatar kurtextrem avatar mathiasbynens avatar maxim-mazurok avatar rbuckton avatar rreverser avatar shvaikalesh avatar silentroach avatar teppeis avatar tjenkinson avatar voxpelli avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

regexp-tree's Issues

Traverse: wrap nodes into NodePath objects

The NodePath class encapsulates node, parent, parentPath (a NodePath), prop, index, and also provides access to API methods, such as removeNode, replaceNode, etc.

Example:

traverse(ast, {
  onChar({node}) {
    ...
  }
});

Using methods:

traverse(ast, {
  onChar(path) {
    const {node} = path;
    if (...) {
      path.removeNode(node);
    }
  }
});

Interface:

class NodePath {
  node: Object;
  parent: Object;
  parentPath: NodePath;
  property: string?;
  index: number?;

  remove();
  replace(node: Object);
}

Backreferences appearing before their capture groups incorrectly treated as decimal escapes

The ES6 spec says that an expression like \5 is a backreference "only if the integer value of DecimalEscape is <= NcapturingParens", referring to the total number of capturing groups in the entire expression. However regexp-tree only compares to the number of capture groups encountered by the time the backreference is parsed.

See 15.10.2.11_A1_T5 from test262:

/\1(A)/.exec("AA");

This is expected to match the first "A". The backreference is to a capture group that has not captured anything, which unconditionally succeeds.

Example:

./bin/regexp-tree -e '/\1(a)/'

reports:

{
  "type": "RegExp",
  "body": {
    "type": "Alternative",
    "expressions": [
      {
        "type": "Char",
        "value": "\\1",
        "kind": "decimal",
        "symbol": "\u0001"
      },
      {
        "type": "Group",
        "capturing": true,
        "number": 1,
        "expression": {
          "type": "Char",
          "value": "a",
          "kind": "simple"
        }
      }
    ]
  },
  "flags": ""
}

The first expression should be a backreference, not a decimal.

Optimizer: implement "remove empty groups" transform

We should remove empty groups:

/a()/ -> /a/
/a(?:)/ -> /a/

NOTE: removing a capturing group potentially changes semantics (the number of groups will be different), but this seems should be fine -- I cannot imagine any actual use-case when an empty group should be used, so this is mostly a "lint rule".

Implement regexp diagrams

We can use something like railroad-diagrams to generate an .html file with the diagram, drawn for a regexp.

./bin/regexp-tree -e '/[0-9]+/' --diagram

(we can consider other diagram frameworks, if there are better ones)

Should the optimizer avoid rewriting to `\W`?

See https://mathiasbynens.be/notes/es6-unicode-regex#impact-i.

Per ES6, /\W/iu was equivalent to /[^0-9a-jl-rt-zA-JL-RT-Z_]/u, and engines implemented it that way.

This was fixed in the spec in June 2016. Now, /\W/iu is equivalent to /[^0-9a-zA-Z_\u{017F}\u{212A}]/u. It has been fixed in the abovementioned engines, too, but some people are still using old versions.

TL;DR: inserting \W into character classes potentially changes the behavior of the regular expression, depending on the engine executing the optimized code.

Parser: Location: make start/end to store line/column/offset

Since we currently support only 1-liner rexes, there was no need to store line/column in locations information. With the x flag, which will enable multiline mode (ignoring the most of the whitespaces), and comments, we'll need to store line numbers as well.

Following ESTree location interface, we need store:

{
  type: 'Char',
  value: 'a',
  kind: 'simple',
  loc: {
    source: 'a',
    start: {
      line: 2,
      column: 3,
      offset: 14
    },
    end: {
      line: 2,
      column: 4,
      offset: 15
    },
  }

To check current format:

./bin/regexp-tree -e '/a/' -l

(the offset is added for convenience, and is a deviation from ESTree Position interface)

Show symbol for char-code nodes

For HEX, Octal, Decimal, and Unicode char codes we can show also an actual symbol corresponding to the code:

\u003B

Currently produces:

{
  "type": "Char",
  "value": "\\u003B",
  "kind": "unicode"
}

We should add symbol property to the node:

{
  "type": "Char",
  "value": "\\u003B",
  "kind": "unicode",
  "symbol": ";"
}

Transform: Add `remove` method

We should be able to remove needed node:

To replace this:

/aa*/

With this:

/a+/

We need to remove first Char node:

regexpTree.transform(regexp, {
  onChar(node) {
    // if condition is met:
    this.remove();
  }
});

reordering `a(b)` to `b(a)` does not set parentPath correctly (parent not in sync with parentPath)

I have a test case, where the parents get out of sync.

I think there is something wrong or missing in my latest setChild code or perhaps in getForNode.

Originally I used replace() in the test case and blamed it for the error.
But now I have rewritten it using setChild. I generally think replace() shouldn't be used for such reorders. But now I have a very similar error.

I already tried to find the bug, but I didn't succeed, yet.
My debugger is non-functional currently, so I only try to use logging and good conclusions.

Perhaps someone else wants to try?

But of course I'm not out of options (I guess I never will), so I'll try more myself.
I just wanted to let everyone know, that there is a problem.

This is the test case:

  it('replaces with already used node with another parent', () => {
    const ast = parser.parse('/a(b)/');

    const bodyPath = new NodePath(ast.body);
    const aCharPath = bodyPath.getChild(0);
    const groupPath = bodyPath.getChild(1);
    const bCharPath = groupPath.getChild();
    const aNode = aCharPath.node;
    const bNode = bCharPath.node;

    bodyPath.node.mark = "body";
    groupPath.node.mark = "group";

    // swap a and b
    groupPath.setChild(aNode);
    bodyPath.setChild(bNode, 0);

    // body and group still at same position
    expect(bodyPath.node.mark).toEqual('body')
    expect(groupPath.node.mark).toEqual('group')
    expect(bodyPath.getChild(1).node.mark).toEqual('group')
    // second is Char 'a' in the Group
    expect(bodyPath.getChild(1).getChild().node.type).toEqual('Char')
    expect(bodyPath.getChild(1).getChild().node.value).toEqual('a')
    // now the parent of 'a' is the Group
    expect(bodyPath.getChild(1).getChild().parent.mark).toEqual('group')
    
    // ...
  });

the last expect fails like this:

    Expected value to equal:
      "group"
    Received:
      "body"

If you add this (instead of the last expect):

expect(bodyPath.getChild(1).getChild().parentPath.node)
 .toBe(bodyPath.getChild(1).getChild().parent)

you get this error:

    - Expected
    + Received

    Object {
    -  "expressions": Array [
    -    Object {
    -      "kind": "simple",
    -      "symbol": undefined,
    -      "type": "Char",
    -      "value": "b",
    -    },
    -    Object {
    -      "capturing": true,
    -      "expression": Object {
    -        "kind": "simple",
    -        "symbol": undefined,
    -        "type": "Char",
    -        "value": "a",
    -      },
    -      "mark": "group",
    -      "number": 1,
    -      "type": "Group",
    -    },
    -  ],
    -  "mark": "body",
    -  "type": "Alternative",
    +  "capturing": true,
    +  "expression": Object {
    +    "kind": "simple",
    +    "symbol": undefined,
    +    "type": "Char",
    +    "value": "a",
    +  },
    +  "mark": "group",
    +  "number": 1,
    +  "type": "Group",
     }

so the node in the parentPath of 'a' is the body (which is wrong) and the parent is the group (which is correct). Of course they should be identical (=group).

@DmitrySoshnikov you added the same NodePath.getForNode to replace() like I did.
I still think, this is correct.

EDIT: I removed some lines about getForNode and index parameter, because that's not ready to be discussed

Parser: store group number for a capturing group

This:

./bin/regexp-tree -e '/(a)/'

Should give us (need to add "number" property):

{
  "type": "RegExp",
  "body": {
    "type": "Group",
    "capturing": true,
    "number": 1,
    "expression": {
      "type": "Char",
      "value": "a",
      "kind": "simple"
    }
  },
  "flags": ""
}

NodePath: implement `insert` functionality

This should handle nodes insertion in to a collection. Be careful to track inserted index to adjust in the traversal -- we should replace the removedIndices which tracks only removals with the actionJournal. The actionJournal will store records of type: [{remove: }, {insert: }, ...], and adjust the traversing index accordingly.

Re: API

Currently a path applies operations on itself, e.g. path.remove(), path.replace(), etc. The insert operation in contrast operations on another node. So it's a path instance is used to insert a new node to needed position.

Proposed options:

// Inserts a node at index.
path.insertNodeAt(node, index); // or just `insertAt`

Other convenient methods (which can be based on path.insertNodeAt internally):

// Makes a previous sibling.
path.insertBefore(node);

// Makes a next sibling
path.insertAfter(node);

Parser: some empty nodes fail when capturing locations

This one is OK:

regexp-tree -e '/(|)/'

This one fails:

regexp-tree -e '/(|)/' -l
TypeError: Cannot read property 'startOffset' of null

We should also return an empty collection for things like /[]/, and other similar, instead of null.

Parser/generator: implement PCRE mode

Currently we support only JavaScript regexes. However, regexp-tree can be language-agnostic, and support any other formats/modes/syntaxes of regular expressions. Thus, transform, and traverse modules can be fully reused for any syntax.

E.g., a named capturing group in PCRE, and Python syntax is:

(?P<foo>bar)

Similarly, we should support other syntax specifics.

Transform: Add `replaceWith` method

Currently transforming a node is easy, however in order to replace a node, one have to do more actions handling more params (parent, and prop). It'd be good to abstract this operation with the replaceWith method on the traversal module.

Modify node:

regexpTree.traverse(ast, {
  onQuantifier(node) {
    // {1,} -> +
    if (
      node.kind === 'Range' &&
      node.from === 1 &&
      !node.to
    ) {
      // Modify node inline.
      node.kind = '+';
      delete node.from;
    }
  },
});

Replace a node (current approach):

regexpTree.traverse(ast, {
  onQuantifier(node, parent, prop) {
    // {1,} -> +
    if (
      node.kind === 'Range' &&
      node.from === 1 &&
      !node.to
    ) {
      // Replace the node.
      parent[prop] = {
        type: 'Quantifier',
        kind: '+',
        loc: {
          start: node.loc.start,
          end: node.loc.start + 1,
        },
      };
    }
  },
});

Replace with the method (in transform module):

regexpTree.transform(regexp, {
  onQuantifier(node) {
    // {1,} -> +
    if (
      node.kind === 'Range' &&
      node.from === 1 &&
      !node.to
    ) {
      // Replace the node.
      this.replaceWith({
        type: 'Quantifier',
        kind: '+',
        loc: {
          start: node.loc.start,
          end: node.loc.start + 1,
        },
      });
    }
  },
});

Under the hood, the traversal module should just handle the same parent, and prop.

Make flags an alpha-sorted string

Currently flags is an array, we should make it as an alphabetically sorted string, as in RegExp.prototype.flags.

regexpTree.parse('/a/ig');

Should be:

{
  "type": "RegExp",
  "body": {
    "type": "Char",
    "value": "a",
    "kind": "simple"
  },
  "flags": "gi"
}

Compat-transpiler: implement `u` flag

The u flag should already be implemented by most of the engines, however, there are still old engines which do not implement it, so it'll still be nice to have it in compat-transpiler (ES2015 -> ES5).

There is an explicit implementation for the u flag in regexpu package. Potentially can be ported to regexp-tree.

traverse/transform: path.remove() or any other way to change the collection does not work

Hi,

Changing the parent collection (e.g. expressions) in a traverse() or transform() has an effect on the traversion.
E.g. the next node will be skipped if I use this (in coffeescript):

    RegExpTree.traverse(ast, {

      Char: (path) ->
        {node, parent, property, index} = path
        if parent.type == "Alternative"
          if node.value == " " or node.value == "\n"
            path.remove()
      }
    )

I assume, the index is used in the iteration and that changes when removing the item from the collection.

Perhaps I am doing it wrong? but I assume the remove() of NodePath is for exactly this purpose.
What is the preferred way to delete a node inside traverse()?

As a workaround I tried to set node.type = null or to an undefined type name, but this gives me an error.
Would be nice if traverse() and generate() could skip such a node.
Also optimize could remove such nodes.

Traverse: remove `on` prefix on handler

The on prefix seems unnecessary, and takes time to build the name; a handler name can be just the node type name.

onQuantifier(...) {
  ...
}

A simpler version:

Quantifier(...) {
  ...
}

Transform/runtime: Implement "named capturing groups"

Parser support for named capturing groups has been added in issue #28. It's an ECMAScript proposal at stage 3.

To be able to use it in older engines, we can translate:

/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/

Into:

new RegExpTree(/(\d{4})-(\d{2})-(\d{2})/, ['year', 'month', 'day']);

The regexpTree creates a lightweight facade wrapper on top of ES6-compatible regexp, and provides the same API (test, exec methods, etc).

class RegExpTree {
  constructor(re, groups) {
    this.re = re;
    this.groups = groups;
  }

  test(string) {
    return this.re.test(string);
  }
 
  exec(string) {
    const result = this.re.exec(string);

    // process `result`, and attach `groups`:
    // result.groups.year -> result[1]
    // result.groups.month -> result[2]
    // etc.
  }
  ...
}

As well as overrides String.prototype.match method to attach group names to the groups property on the match result.

const originalMatch = String.prototype.match;

String.prototype.match = function(regexp) {
  // Simple regexp.
  if (!(regexp instanceof RegExpTree) {
    return originalMatch.call(this, regexp);
  }
  
  // RegExpTree. 
  return regexp.exec(this);
};

exec(...): exception if javascript RegExp fails

I got an exception:

    result.groups = {};
                  ^

TypeError: Cannot set property 'groups' of null

which is internal stuff.

So, result should be checked in src/compat-transpiler/runtime/index.js:

  /**
   * Facade wrapper for RegExp `exec` method.
   */
  exec(string) {
    const result = this._re.exec(string);

    if (!this._groups) {
      return result;
    }

    result.groups = {};
  ...

Investigate whether we can compat-transpiler lookbehind assertions

The lookbehind assertions (stage 3 proposal) allow matching a pattern only if it's preceded (or not preceded for negative) by another pattern.

We should investigate, whether it's possible (in any approximation) to compat-transpile them to ES2018->ES2015.

Matches y only if it's preceded by x:

/(?<=x)y/

One of the approaches is to use non-capturing group:

/(?:x)y/

We're talking about covering exec, and other methods, since for just replace method it'd be possible to emulate as:

/(?<=x)y/
// <match>x
'xyx'.replace(/(x)?y/g, ($0, $1) => $1 ? '<match>' : $0);

But we need a more-less acceptable solution for transformation.

Parser: support named groups

The named groups proposal is at stage 3, so it's safe to add it to the parser.

/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u;

Backreferences are done via \k escape:

let duplicate = /^(?<half>.*).\k<half>$/u;
duplicate.test('a*b'); // false
duplicate.test('a*a'); // true

Named references can also be used simultaneously with numbered references:

let triplicate = /^(?<part>.*).\k<part>.\1$/u;
triplicate.test('a*a*a'); // true
triplicate.test('a*a*b'); // false

See the spec for productions.

Implement code generator from AST

An oposite operation to parsing, code generation.

The issue depends on issue #6 (traversing an AST, we should just generate code for each AST-node).

See TODO item, generator lives in src/generator.

Invariant:

regexpTree.generate(regexpTree.parse('/[a-z]+/g')); // '/[a-z]+/g'

Another example:

const re = regexpTree.generate({
  type: 'RegExp',
  body: {
    type: 'Alternative',
    expressions: [
      {
        type: 'Char',
        value: 'a',
        kind: 'simple'
      },
      {
        type: 'Char',
        value: 'b',
        kind: 'simple'
      },
      {
        type: 'Char',
        value: 'c',
        kind: 'simple'
      },
    ],
  },
  flags: ['i'],
});

console.log(re); // '/abc/i'

Transform:

const ast = regexpTree.parse('/a{1,}/');

// transform
ast.body.quantifier = {
  type: '+'
};

console.log(regexpTree.generate(ast)); // '/a+/'

Example for different AST format: regjsgen.

Implement interpreter module

This is a large size issue, but eventually we'll need to have an interpreter in order to test new features (like lookbehind assertions, named groups, etc) which are not yet implemented in JS engines.

The spec is here. We can follow the ECMAScript spec, though can also simplify the handling, as long as the semantic is preserved.

const regexpTree = require('regexp-tree');

// Lookbehind.
regexpTree.test('/(?<=a)b/', 'ab'); // true

// Named groups.
const match = regexpTree.exec('/(?<data>x)/', 'x'); // MatchResult

console.log(match[0]); // x
console.log(match.groups.data); // x

A separate engine (using NFA/DFA) can be implemented in another module.

Implement plugins system

I'd be good in the transform API (issue #7) to accept a list of plugins:

const regexpTree = require('regexp-tree');
const {Plugin} = regexpTree;

// Require custom plugin.
const simpleRepeatPlugin = require('regexp-tree-simple-repeat-plugin');

regexpTree.transform('/[a-zA-Z_0-9]/', [
  // Define a plugin inline.
  new Plugin('short-hand', {
    onChar(node) {
      ...
    },
  }),

  // Required
  simpleRepeatPlugin,
]);

Plugins are applied in order. If a node is removed by one plugin, further plugins in the list aren't applied, and the next node is proceeded.

API:

class Plugin {
  constructor(name, handler) {
    ...
  }
}

This issue depends on issue #7.

Parser: implement `x` flag

When x flag is used in a regexp, whitespace are ignored, and #-comments can be used.

This is not a standard flag (yet) of JavaScript regexes, however, in order to support different syntaxes (pcre, python etc), we can add support of x to the main parser.

In the future, if different syntaxes differ too much, we'll have to fork grammar file, and implement one per syntax. Currently handling x is trivial in the main grammar, so not forking yet.

/
  # A regular expression for date.

  (?<year>\d{4})-    # year part of a date  
  (?<month>\d{2})-   # month part of a date
  (?<day>\d{2})      # day part of a date

/x

Re: JavaScript: this can also be handled in compat-transpiler (and later proposed for standardization).

Development: Add Flow types for static types safety

Currently the code base is not statically type checked, we can use Flow type checker to validate it.

(actually, this will require another build step with transpiling JS code -- the flow type annotations should be stripped)

Add `fromRegExp` method

The fromRegExp method should accept an actual regular expression object, and return a parsed AST for it. This is a counterpart for the toRegExp method.

const regexpTree = require('regexp-tree');

regexpTree.fromRegExp(/[a-z]+/i); // AST
regexpTree.fromRegExp(new RegExp('[a-z]+', 'i')); // AST

Unicode code point escapes should accept any number of digits

E.g. this is correct:

$ regexp-tree -e '/\u{1D306}/u'
{
  "type": "RegExp",
  "body": {
    "type": "Char",
    "value": "\\u{1D306}",
    "kind": "unicode"
  },
  "flags": [
    "u"
  ]
}

However, this parses incorrectly:

$ regexp-tree -e '/\u{000001D306}/u'
{
  "type": "RegExp",
  "body": {
    "type": "Alternative",
    "expressions": [
      {
        "type": "Char",
        "value": "u",
        "kind": "simple",
        "escaped": true
      },
      {
        "type": "Char",
        "value": "{",
        "kind": "simple"
      },
      {
        "type": "Char",
        "value": "0",
        "kind": "simple"
      },
      {
        "type": "Char",
        "value": "0",
        "kind": "simple"
      },
      {
        "type": "Char",
        "value": "0",
        "kind": "simple"
      },
      {
        "type": "Char",
        "value": "0",
        "kind": "simple"
      },
      {
        "type": "Char",
        "value": "0",
        "kind": "simple"
      },
      {
        "type": "Char",
        "value": "1",
        "kind": "simple"
      },
      {
        "type": "Char",
        "value": "D",
        "kind": "simple"
      },
      {
        "type": "Char",
        "value": "3",
        "kind": "simple"
      },
      {
        "type": "Char",
        "value": "0",
        "kind": "simple"
      },
      {
        "type": "Char",
        "value": "6",
        "kind": "simple"
      },
      {
        "type": "Char",
        "value": "}",
        "kind": "simple"
      }
    ]
  },
  "flags": [
    "u"
  ]
}

Implement transform module

The transform method is a wrapper on top of "parse-traverse-generate" tool chain. Getting a regular expression string, and the nodes handler, transform should return a new (transformed) regular expression.

The issue depend on issue #5.

An example of manual traversal, and code generation is shown in the Using traversal API section. A transform version of it should look like:

const regexpTree = require('regexp-tree');

const re = regexpTree.transform('/[a-z]{1,}/', {
  // Handle "Quantifier" node type,
  // transforming `{1,}` quantifier to `+`.
  onQuantifier(node) {
    // {1,} -> +
    if (
      node.type === 'Range' &&
      node.from === 1 &&
      !node.to
    ) {
      node.type = '+';
      delete node.from;
    }
  },
});

console.log(re); // '/[a-z]+/'

Generator: fix empty groups

This currently throws in the generator:

regexpTree.generate(regexpTree.parse('/()/'));
regexpTree.generate(regexpTree.parse('/(?:)/'));

Implement optimizer module

Optimizer module should transform a regular expression into an optimized version, replacing some sub-expressions with their idiomatic patterns. This might be good for different kinds of minifiers, as well as for regexp machines.

Example:

/[a-zA-Z_0-9][a-zA-Z_0-9]*\e{1,}/

Is transformed into:

/\w+e+/

We can implement the optimizer as a set of small transforms (this depends on issue #7, and issue #23).

Patterns to replace:

  • aa* -> a+
  • (), (?:) - <empty> (remove empty groups)
  • a{1,} -> a+
  • a{1} -> a
  • a{3,3} -> a{3}
  • [a-zA-Z_0-9] -> \w
  • [^a-zA-Z_0-9] -> \W
  • [\d\d] -> [\d] (remove duplicates from char-class)
  • [\d] -> \d (remove char class for single char)
  • [0-9] -> \d
  • [^0-9] -> \D
  • [ \t\r\n\f] -> \s
  • [^ \t\r\n\f] -> \S
  • \e -> e (remove unnecessary escape)
  • ...

See TODO item, optimizer lives in src/optimizer.

Transform: utils for supporting tree consistency

Some utils might be useful to abstract away a boilerplate users have to do if they want to keep an AST consistent after some modifications.

Example: all capturing groups have number property, which is built during parsing. However, if some transform in pipeline adds/removes a capturing group, the numbers of all other existing groups become unsync, and further may not work as expected.

Responsibility of rebuilding needed data is on transform implementers, however, we can provide some useful utils to work with AST (and which do not fit to be on path). E.g.:

// Original, the group containing `foo`
// has index 1 in AST node.
/(foo)/

Transformed (inserted a group with bar):

// Bug, foo still has index 1.
/(bar)(foo)/

For this callers can do:

// Rest all the indices of capturing groups.
transformUtils.rebuildGroupIndex(ast);

There are some other inconsistencies, most noticeable of which are: location data, which store offsets, and original source -- obviously when inserting/replacing nodes, all such data become unsync. Some cases we can keep untouched (and mark them as "unreliable" if several transforms are applied in a pipeline).

Besides, the utils can contain other useful methods for higher-abstract tree manipulations/cleanups/etc (which are not on path).

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.