intxeger's Issues

Using a regex `.` causes newline breaks to be sampled

It seems that when intxeger is used to sample a regex that includes a `.` (which, without the DOTALL flag, should match anything except a newline), `\n` will show up in the samples:

import re
import intxeger

regex = r"."
x = intxeger.build(regex)
samples = x.sample(min(100, x.length))
non_matches = [item for item in samples if re.fullmatch(regex, item) is None]
print(non_matches)
# ['\n']
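A likely cause, judging from the `_to_node` source quoted in the "ValueError raised" report on this page, is that the ANY branch draws its characters from `string.printable`, which contains `\n`. The following is a minimal sketch of one possible fix, not the library's actual patch; the name `ANY_CHARS` is hypothetical.

```python
import re
import string

# string.printable ends with whitespace characters, including "\n".
print("\n" in string.printable)  # True

# Hypothetical alphabet for the ANY branch: printable characters minus "\n".
ANY_CHARS = [c for c in string.printable if c != "\n"]

# Every remaining character is accepted by "." under the default flags.
unmatched = [c for c in ANY_CHARS if re.fullmatch(r".", c) is None]
print(unmatched)  # []
```

If the DOTALL flag were ever supported, the newline would have to be added back for that case.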

Expand user API

  • Add an intxeger.sample(regex, N) function which builds the tree, optimizes it, and uses it to generate N samples.
  • Add an intxeger.iterator(regex, ordered=False) generator which yields the strings in random order, or in index order when ordered=True.
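As a rough sketch of what these two helpers might look like, the code below works on any object that exposes `length` and integer indexing, which is how nodes are used in the `_to_node` source quoted elsewhere on this page. The `Letters` class is a hypothetical stand-in for a built tree so the example runs without intxeger; real helpers would presumably call `intxeger.build(regex)` first and take the regex as their first argument.

```python
import random


class Letters:
    """Hypothetical stand-in for a built tree: three strings, indexable."""

    length = 3

    def __getitem__(self, index):
        return "abc"[index]


def sample(node, n, rng=random):
    """Return n distinct strings chosen uniformly by index."""
    indices = rng.sample(range(node.length), n)
    return [node[i] for i in indices]


def iterator(node, ordered=False, rng=random):
    """Yield every string once, in index order or in shuffled order."""
    indices = list(range(node.length))  # assumes the index list fits in memory
    if not ordered:
        rng.shuffle(indices)
    for i in indices:
        yield node[i]


print(list(iterator(Letters(), ordered=True)))  # ['a', 'b', 'c']
print(sorted(sample(Letters(), 2)))
```

Materializing the full index list is only reasonable for small trees; a real implementation would need a different shuffling strategy for large ones.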

ValueError raised

Hi there, I'm evaluating this library as a replacement for the alternatives since it looks quite nice, but I am encountering some issues.

For example, given this input:

from intxeger import build

regex = "a$"
result = build(regex)

I am getting this:

op = AT, args = AT_END, max_repeat = 10

    def _to_node(op, args, max_repeat):
        if op == sre_parse.IN:
            nodes = []
            for op, args in args:
                nodes.append(_to_node(op, args, max_repeat))
            if nodes[0] == "NEGATE":
                values = [c[i] for c in nodes[1:] for i in range(c.length)]
                nodes = [Constant(c) for c in string.printable if c not in values]
            return Choice(nodes)
        elif op == sre_parse.RANGE:
            min_value, max_value = args
            return Choice(
                [Constant(chr(value)) for value in range(min_value, max_value + 1)]
            )
        elif op == sre_parse.LITERAL:
            return Constant(chr(args))
        elif op == sre_parse.NEGATE:
            return "NEGATE"
        elif op == sre_parse.CATEGORY:
            return Choice([Constant(c) for c in CATEGORY_MAP[args]])
        elif op == sre_parse.ANY:
            return Choice([Constant(c) for c in string.printable])
        elif op == sre_parse.ASSERT:
            nodes = []
            for op, args in args[1]:
                nodes.append(_to_node(op, args, max_repeat))
            return Concatenate(nodes)
        elif op == sre_parse.BRANCH:
            nodes = []
            for group in args[1]:
                subnodes = []
                for op, args in group:
                    subnodes.append(_to_node(op, args, max_repeat))
                nodes.append(Concatenate(subnodes))
            return Choice(nodes)
        elif op == sre_parse.SUBPATTERN:
            nodes = []
            ref_id = args[0]
            for op, args in args[3]:
                nodes.append(_to_node(op, args, max_repeat))
            return Group(Concatenate(nodes), ref_id)
        elif op == sre_parse.GROUPREF:
            return GroupRef(ref_id=args)
        elif op == sre_parse.MAX_REPEAT or op == sre_parse.MIN_REPEAT:
            min_, max_, args = args
            op, args = args[0]
            if max_ == sre_parse.MAXREPEAT:
                max_ = max_repeat
            return Repeat(_to_node(op, args, max_repeat), min_, max_)
        elif op == sre_parse.NOT_LITERAL:
            return Choice([Constant(c) for c in string.printable if c != chr(args)])
        else:
>           raise ValueError(f"{op} {args}")
E           ValueError: AT AT_END
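The traceback shows that `$` is parsed into an `AT` token, which `_to_node` has no branch for. A minimal sketch of one possible workaround follows: drop top-level anchor tokens before building nodes. The helper name `strip_anchors` is hypothetical, and the approach is an assumption rather than the maintainers' fix. It is only safe for `^` and `$` at the edges of a pattern, since generated strings are whole matches anyway; dropping a word-boundary anchor such as `\b` would change the pattern's meaning.

```python
try:
    import re._parser as sre_parse  # Python 3.11 and later
except ImportError:
    import sre_parse  # older Python versions


def strip_anchors(regex):
    """Parse the regex and drop top-level AT tokens such as ^ and $."""
    return [
        (op, args)
        for op, args in sre_parse.parse(regex)
        if op != sre_parse.AT
    ]


# "a$" parses into a literal followed by the anchor token from the traceback.
tokens = list(sre_parse.parse("a$"))
print(tokens)
print(strip_anchors("a$"))
```

The filtered token list could then be fed to the existing per-token conversion without hitting the `else` branch.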

Strings may not be unique if the regex is ambiguous

For example, if your regex is:

(abc)|(abc)

Then it will report length=2 and generate ["abc", "abc"], since the two strings are generated by different nodes in the tree. It's not clear what the solution is, but this problem is not unique to intxeger; other libraries such as exrex have it too.
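Until the tree itself can detect overlapping branches, a caller can deduplicate after the fact. This is only a user-side sketch with a hypothetical helper name; it does not correct the reported length, and it may return fewer strings than were requested.

```python
def unique_in_order(strings):
    """Drop repeated strings while keeping first-seen order."""
    return list(dict.fromkeys(strings))


# The two branches of (abc)|(abc) produce the same string twice.
print(unique_in_order(["abc", "abc"]))  # ['abc']
```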

Support countably infinite regular expressions

The goal is to support array-based indexing for regular expressions with unbounded repeats. Currently, the max_repeat parameter limits the number of times any sequence can be repeated, so that only a finite number of strings can be generated from a regex.

After this change, the user will be able to choose between (1) specifying max_repeat and having a finite set of strings or (2) not specifying max_repeat and having an infinite set of strings they can iterate over and/or index into.

Repeat

Modify the Repeat class to apply Cantor's pairing function when max_repeat is not specified.

x -> (a, b) # decompose the index into two values
b -> interpret this as the length of repeated sequence
a -> (a1, a2, a3, ... a_b) # convert it into `b` values
a_i -> the integer index of the `i`th element in the sequence

Note that the length attribute will be set to float("-inf").
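The decomposition above can be sketched with the inverse of Cantor's pairing function. All function names here are hypothetical, and the way `a` is split into `b` values (repeated unpairing) is one assumption among several possible choices. One caveat: when `b` is 0, every value of `a` maps to the empty sequence, so a real implementation would need to handle the zero-length case separately to keep indices unique.

```python
import math


def cantor_pair(a, b):
    """Map a pair of non-negative integers to a single integer."""
    return (a + b) * (a + b + 1) // 2 + b


def cantor_unpair(x):
    """Inverse of cantor_pair: recover (a, b) from x."""
    w = (math.isqrt(8 * x + 1) - 1) // 2
    b = x - w * (w + 1) // 2
    return w - b, b


def split_index(a, count):
    """Split a into `count` non-negative integers by repeated unpairing."""
    if count == 0:
        return ()
    values = []
    for _ in range(count - 1):
        a, last = cantor_unpair(a)
        values.append(last)
    values.append(a)
    return tuple(values)


def index_to_sequence(x):
    """x -> (a, b); b is the repeat count, a encodes the element indices."""
    a, b = cantor_unpair(x)
    return split_index(a, b)


print(cantor_unpair(cantor_pair(3, 4)))  # (3, 4)
print(index_to_sequence(32))
```

Each returned integer would then be used as the index into the repeated sub-node for that position.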

Choice

Modify the Choice class to handle both finite nodes and infinite nodes. It should assign the smallest integers to the finite nodes; then, once those are all assigned, it should start handling the infinite nodes by rotating between them.
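A minimal sketch of that index routing, assuming the Choice node knows the lengths of its finite children and how many infinite children it has. The function name `route_index` and its return shape are hypothetical.

```python
def route_index(index, finite_lengths, num_infinite):
    """Return (kind, child_position, local_index) for a Choice index.

    Finite children receive the smallest indices, in order. Remaining
    indices rotate round-robin across the infinite children.
    """
    for child, length in enumerate(finite_lengths):
        if index < length:
            return ("finite", child, index)
        index -= length
    if num_infinite == 0:
        raise IndexError("index out of range for a finite Choice")
    return ("infinite", index % num_infinite, index // num_infinite)


# Two finite children (2 and 3 strings) and two infinite children.
for i in (0, 4, 5, 6, 7):
    print(i, route_index(i, [2, 3], 2))
```

With these inputs, indices 0 to 4 land in the finite children, and indices 5, 6, 7 alternate between the two infinite children.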
