lorenzofelletti / pyregex
Backtracking regular expression engine written in Python
Home Page: https://lorenzofelletti.github.io/pyregex
License: MIT License
Named Groups
It would be nice to have the possibility of having named groups with a syntax like the following:
"(?<group_name>foobar*)".
Describe the solution you'd like
It would be awesome if there was an option to match a regex and a test string ignoring the case of the two.
Describe the bug
An empty regex results in an infinite loop if continue_after_match is True.
To Reproduce
Steps to reproduce the behavior:
reng = RegexEngine()
res, _ = reng.match(regex, test_str, continue_after_match=True)
Expected behavior
The match should return in finite time.
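One common way to guarantee termination is to force the scan position forward whenever a match consumes no characters. The following is a minimal sketch of that idea, not pyregex's actual loop: `find_all` and the `match_at` callback (which returns the end index of a match starting at `i`, or -1) are hypothetical names introduced only for illustration.

```python
from typing import Callable, List, Tuple


def find_all(match_at: Callable[[str, int], int], text: str) -> List[Tuple[int, int]]:
    """Collect (start, end) spans, always making progress on empty matches.

    match_at is a hypothetical matcher: it returns the end index of a match
    starting at position i, or -1 when there is no match at i.
    """
    spans = []
    i = 0
    while i <= len(text):
        end = match_at(text, i)
        if end == -1:
            i += 1
            continue
        spans.append((i, end))
        # A zero-length match would otherwise leave i unchanged forever,
        # so step one character forward in that case.
        i = end if end > i else i + 1
    return spans


# An empty regex matches the empty string at every position.
empty_regex = lambda text, i: i
print(find_all(empty_regex, "ab"))
```

With this guard, an empty regex yields one zero-length match per position and then stops.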
How it should be
[^a-zA-M] should match each character that is not contained in A-M or a-z. No '|' should be required to separate ranges, and every '-' placed between two characters ch1 and ch2 should create a range (given that ch1 < ch2).
Also, the '-' should be considered different based on its position:
[a-z] -> range a to z
[-] -> dash
[a-z-] -> a-z and dash
[a-\s] -> a, dash, and whitespace.
Experiment with https://regexr.com/ to see the expected behavior.
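The dash rules listed above can be sketched as a small parser for the text between '[' and ']'. This is an illustrative sketch only, under the assumption that `\s` is the only escape class handled. `parse_char_class` and `class_matches` are hypothetical names, not part of pyregex.

```python
def parse_char_class(body: str):
    """Parse the text between '[' and ']' into (negated, set_of_chars)."""
    negated = body.startswith("^")
    if negated:
        body = body[1:]

    # Split the body into items: single characters or two-character escapes.
    items = []
    i = 0
    while i < len(body):
        if body[i] == "\\" and i + 1 < len(body):
            items.append(body[i:i + 2])
            i += 2
        else:
            items.append(body[i])
            i += 1

    chars = set()
    i = 0
    while i < len(items):
        lo = items[i]
        # A dash forms a range only when it sits between two plain characters.
        is_range = (
            i + 2 < len(items)
            and items[i + 1] == "-"
            and len(lo) == 1
            and len(items[i + 2]) == 1
        )
        if is_range:
            hi = items[i + 2]
            if lo > hi:
                raise ValueError(f"invalid range {lo}-{hi}")
            chars.update(chr(c) for c in range(ord(lo), ord(hi) + 1))
            i += 3
        elif lo == "\\s":
            chars.update(" \t\n\r\f\v")  # whitespace class
            i += 1
        elif len(lo) == 2:
            chars.add(lo[1])  # any other escape is treated as a literal
            i += 1
        else:
            chars.add(lo)  # plain character, including a non-range dash
            i += 1
    return negated, chars


def class_matches(body: str, ch: str) -> bool:
    """Return True if ch is matched by the class with the given body."""
    negated, chars = parse_char_class(body)
    return (ch in chars) != negated
```

For example, `class_matches("^a-zA-M", "N")` is true while `class_matches("^a-zA-M", "B")` is false, and `parse_char_class("a-\\s")` yields a, the dash, and the whitespace characters.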
Describe the bug
When using the question mark quantifier (zero or one) '?' in a regex, the test string matches only if the quantified character matches one time, and not when the character is not present.
To Reproduce
Steps to reproduce the behavior:
Regular expression: 'https?'
'httpa' doesn't match the given regex
'httpsa' matches the regex
Expected behavior
The question mark quantifier should return a match both when the character is present and when it is not.
In the scenario presented above, the expected behavior is to get an output like this:
Regular expression: 'https?'
'httpa' matches the given regex
'httpsa' matches the regex
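As a point of comparison, the expected behavior can be checked against Python's built-in `re` module (not pyregex), which treats '?' as zero or one occurrence:

```python
import re

# Reference behavior from Python's standard library, not from pyregex.
pattern = re.compile(r"https?")

for test_str in ("httpa", "httpsa"):
    m = pattern.match(test_str)
    print(test_str, "->", m.group(0) if m else None)
```

Both strings match: the first consumes "http" and the second consumes "https".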
Describe the bug
When there is a partial match before the actual match, the returned matches contain the groups captured during the partial match instead of the groups of the actual match. The match representing the "whole match" is nonetheless correct: it contains the correct matched string and match indexes.
To Reproduce
Steps to reproduce the behavior:
reng = RegexEngine()
res, consumed, matches = reng.match(r"(a)(a)(a)(a)(a)(a)", "aaaaaaaaaacccaaaaaac")
Expected behavior
The program should return the correct matches (so the "a"s at the indexes from 13 to 16).
Additional context
Likely, the problem is that the previous partial-match matched groups are not "flushed".
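The "flush" idea can be sketched as follows: every attempt at a new start position gets a fresh group list, so captures recorded by a failed partial match cannot leak into the final result. This is a hypothetical sketch, not pyregex's code. `search` and `match_ab` (a hand-written matcher for the regex (a)(b)) are names invented for illustration.

```python
def match_ab(text, start, groups):
    """Hypothetical matcher for (a)(b): records each group as it matches."""
    pos = start
    for ch in "ab":
        if not text.startswith(ch, pos):
            return -1  # may fail after earlier groups were already recorded
        groups.append((pos, pos + 1))
        pos += 1
    return pos


def search(match_here, text):
    """Try each start position with a fresh list of captured groups."""
    for start in range(len(text) + 1):
        groups = []  # flushed on every attempt
        end = match_here(text, start, groups)
        if end != -1:
            return start, end, groups
    return None


print(search(match_ab, "acab"))
```

On "acab", the attempt at index 0 records group 1 and then fails. Because the list is recreated for each attempt, the result only contains the groups of the successful match at indexes 2 and 3.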
Is your feature request related to a problem? Please describe.
Right now, the regex "a(b|c)" results in an AST with 3 groups, with 3 different names (one for the whole regex, one for b and one for c). This happens because the or operator creates two groups, one for its left child and one for its right child.
Describe the solution you'd like
Although this is not a bug per se, just a behavior of the system, it may be very confusing, especially when matches are returned.
The returned matches should provide a way to "acknowledge" that the groups b and c are inside the same parentheses.
Line 427 in 1fde108
There is no attribute "min" or "max" in the LeafNode class.
Describe the solution you'd like
Right now, groups are named "default" unless they are named groups. This is not very meaningful. It would be better if groups were named something like "Group x", with a progressive x starting from 1.
Describe the solution you'd like
It would be nice if the engine could cache the Lexer and Parser results (tokens and AST), so that matching a recently used regex does not recompute these (costly) operations.
Describe alternatives you've considered
An alternative solution would be to save the AST directly in the RegexEngine class, but this would expose the AST to the "outside world". Moreover, changing the regex would require either creating another instance or providing an additional method to change it.
The match function signature would also change, since the regex would no longer need to be provided.
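A minimal sketch of the caching idea, assuming the lexing and parsing steps can be wrapped in a single function keyed by the regex string. `CachingEngine`, `ast_for`, and `compile_fn` are hypothetical names standing in for the engine internals; only `functools.lru_cache` is a real standard-library call.

```python
from functools import lru_cache


class CachingEngine:
    """Keeps recently built ASTs private, keyed by the regex string."""

    def __init__(self, compile_fn, maxsize=128):
        # compile_fn is a stand-in for "lex, then parse" (hypothetical).
        self.compile_calls = 0

        def _compile(regex):
            self.compile_calls += 1
            return compile_fn(regex)

        # Per-instance LRU cache: the least recently used regex is evicted.
        self._compile = lru_cache(maxsize=maxsize)(_compile)

    def ast_for(self, regex):
        return self._compile(regex)


engine = CachingEngine(lambda regex: ("AST", regex))
engine.ast_for("a|b")
engine.ast_for("a|b")  # served from the cache, no recomputation
print(engine.compile_calls)
```

Because the cache lives inside the engine, the AST is not exposed and the match signature can keep taking the regex as an argument.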
Is your feature request related to a problem? Please describe.
If a group matches multiple times (within a single regex match), the returned matches contain only the first occurrence. This could be a bit confusing, because it is not in line with the "greedy" approach of the engine.
Describe the solution you'd like
It would be better to return either the last one, so that every new match of the same group "overwrites" the previous one, or each and every match (although this could greatly increase the size of the returned matches structure).
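The two options can be sketched side by side. The `record` helper and its `policy` argument are hypothetical and only illustrate the trade-off. For reference, Python's built-in `re` module (not pyregex) follows the "last one wins" policy.

```python
import re


def record(captures, policy):
    """Build the returned matches from (group_id, text) pairs.

    policy "last": each new capture overwrites the previous one.
    policy "all":  every capture is kept, so the structure can grow large.
    """
    result = {}
    for group_id, text in captures:
        if policy == "last":
            result[group_id] = text
        else:
            result.setdefault(group_id, []).append(text)
    return result


# Group 1 of (a|b)+ is captured four times while matching "abab".
captures = [(1, "a"), (1, "b"), (1, "a"), (1, "b")]
print(record(captures, "last"))
print(record(captures, "all"))

# Python's standard library keeps the last capture of a repeated group.
print(re.match(r"(a|b)+", "abab").group(1))
```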
Describe the bug
The following curly brace syntaxes aren't working:
a{,3} and b{3,}
To Reproduce
Steps to reproduce the behavior:
reng = RegexEngine()
reng.match(r'a{,3}')
reng.match(r'b{3,}')
Expected behavior
The expected behavior is to set the minimum number of times the char a should appear to 0 in the first case, and the maximum number of times b should appear to infinity in the second case.
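The expected bounds can be sketched as a small parser for the text between '{' and '}'. `parse_bounds` is a hypothetical helper, not pyregex's implementation. An omitted minimum defaults to 0 and an omitted maximum to infinity.

```python
import math


def parse_bounds(spec: str):
    """Parse the text between '{' and '}' into a (min, max) pair."""
    if "," not in spec:
        n = int(spec)  # {n} means exactly n times
        return n, n
    lo, hi = spec.split(",", 1)
    minimum = int(lo) if lo.strip() else 0          # {,m}: at least 0 times
    maximum = int(hi) if hi.strip() else math.inf   # {n,}: no upper bound
    if minimum > maximum:
        raise ValueError(f"invalid quantifier bounds: {{{spec}}}")
    return minimum, maximum


print(parse_bounds(",3"))
print(parse_bounds("3,"))
```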
Describe the bug
Backtracking fails with nested quantifiers
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The match result should be True.
Additional context
First experienced on version 0.2.4.
The bug affects all prior versions.