Giter Site home page Giter Site logo

lorenzofelletti / pyregex Goto Github PK

View Code? Open in Web Editor NEW
8.0 8.0 2.0 1.88 MB

Backtracking regular expression engine written in Python

Home Page: https://lorenzofelletti.github.io/pyregex

License: MIT License

Python 98.96% Shell 1.04%
backtracking engine regex regular-expressions text-processing

pyregex's Introduction

Hey there! I am using GitHub

pyregex's People

Contributors

dependabot[bot] avatar lorenzofelletti avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Forkers

phifour axnunezc

pyregex's Issues

Add Named Groups

Named Groups
It would be nice to have the possibility of having named groups with a syntax like the following:
"(?<group_name>foobar*)".

Add ignore case flag

Describe the solution you'd like
It would be awesome if there was an option to match a regex and a test string ignoring the case of the two.

Empty regex result in infinite loop if continue_after_match is True

Describe the bug
Empty regex result in infinite loop if continue_after_match is True

To Reproduce
Steps to reproduce the behavior:

  1. start python
  2. import RegexEngine
  3. execute:
reng = RegexEngine()
res, _ = reng.match(regex, test_str, continue_after_match=True)
  1. See infinite loop.

Expected behavior
The regex should return in a finite time.

Change range element grammar and behavior to a standard one

How it should be
[^a-zA-M] should match each character that is not contained between A-M and a-z. No '|' should be required to separate ranges, and it should be possible to create a range with each - (given that ch1 < ch2).
Also, the '-' should be considered different based on its position:
[a-z] -> range a to z
[-] -> dash
[a-z-] -> a-z and dash
[a-\s] -> a, dash, and whitespaces.

Experiment with https://regexr.com/ to see the expected behavior.

Question mark quantifier not working properly

Describe the bug
When using the question mark quantifier (zero or one) '?' in a regex, the test string matches only if the quantified character matches one time, and not when the character is not present.

To Reproduce
Steps to reproduce the behavior:

  1. In the project base directory, run
  2. ./regex.sh 'https?' 'httpa' "httpsa"
  3. check the output
Regular expression: 'https?'
'httpa' doesn't match the given regex
'httpsa' match with the regex

Expected behavior
The question mark quantifier should return a match both when the character is present and when it is not.

In the scenario presented above, the expected behavior is to get an output like this:

Regular expression: 'https?'
'httpa' match the given regex
'httpsa' match with the regex

Returned matches in case of previous partial match

Describe the bug
When there is a partial match before the actual match, the returned matches contains the groups that were previously matched instead of the actual matched groups, although the match representing the "whole match" is correct, and contains the correct matched string and match indexes.

To Reproduce
Steps to reproduce the behavior:

  1. run this code in debug mode
reng = RegexEngine()
res, consumed, matches = reng.match(r"(a)(a)(a)(a)(a)(a)", "aaaaaaaaaacccaaaaaac"
  1. inspect the second list of matches
  2. check that the global match is correct (the one in position 0)
  3. check that the last 4 matches aren't correct and refer to the "a"s at the indexes 6 to 9.

Expected behavior
The program should return the correct matches (so the "a"s at the indexes from 13 to 16).

Additional context
Likely, the problem is that the previous partial-match matched groups are not "flushed".

Groups in the same parentheses, separate with a "|" should be linked together

Is your feature request related to a problem? Please describe.
Right now, the following regex "a(b|c)" will result in an AST with 3 groups, with 3 different names (one for the whole regex, one for b and one for c). This is happening because the of the or operator that creates two groups, one for its left child, and one for the right.

Describe the solution you'd like
Although this is not a bug per se, it is just a behavior of the system, this may be very confusing especially when matches are returned.
The returned matches should provide a way to "acknowledge" that the groups b and c are inside the same parentheses.

Groups names are not much meaningful

Describe the solution you'd like
Right now, groups are by default named "default", unless they are named groups. This is not so much meaningful. It would be better if groups were named something like "Group x" with a progressive x starting from 1.

Cache Lexer tokens and parser AST

Describe the solution you'd like
It would be nice if the Engine was able to cache the Lexer and Parser results (tokens and AST), so that if you call a regex that you've recently called these (costly) operations won't be recomputed.

Describe alternatives you've considered
An alternative solution would be to save the AST directly in the RegexEngine class, but this way the AST will be exposed to the "outside world", and moreover you will need to create another instance if you change regex, or an additional method to allow the user to change the regex must be provided.
Moreover, the match function signature will be changed this way, as you no longer need to provide the regex.

Group matching multiple times in a single match return policy

Is your feature request related to a problem? Please describe.
If a group matches multiple times (within a single regex match) the returned matches will contain the first time the group was matched only. This could be a bit confusing, because it is not in line with the "greedy" approach of the engine.

Describe the solution you'd like
It would be better to return either the last one, thus every new match of the same group "overwrites" the previous, or return each and every match (but this could largely increase the size of the returned matches structure).

Curly brace syntax raising exception if values before or after comma aren't specified

Describe the bug
The following curly brace syntaxes aren't working:
a{,3} and b{3,}

To Reproduce
Steps to reproduce the behavior:

  1. reng = RegexEngine()
  2. reng.match(r'a{,3})
  3. See exception
  4. reng.match(r'b{3,})
  5. See exception

Expected behavior
The expected behavior is to set the minimum number of times the char a should appear to 0, in the first case, and the maximum number of times b should appear to infinite in the second case.

Backtracking fails with nested quantifiers

Describe the bug
Backtracking fails with nested quantifiers

To Reproduce
Steps to reproduce the behavior:

  1. Try to match the regex "(a*|b*)*ab" with the test string "ab"
  2. Notice that the result is False

Expected behavior
The match result should be True.

Additional context
First experienced on version 0.2.4.
The bug affects all prior versions.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.