jamadden / mrab-regex-hg

Automatically exported from code.google.com/p/mrab-regex-hg

Python 18.77% C 81.23%

mrab-regex-hg's Issues

not all keywords are found by named list with overlapping keywords when full Unicode casefolding is required

What steps will reproduce the problem?

>>> import regex
>>> p = regex.compile(ur'(?fi)\L<keywords>', keywords=['post','pos'])
>>> p.findall(u'POST, Post, post, poſt, poﬆ, and poﬅ')

What is the expected output? What do you see instead?

Expected:

[u'POST', u'Post', u'post', u'po\u017ft', u'po\ufb06', u'po\ufb05']

Got:

[u'POST', u'Post', u'post', u'po\u017ft']

What version of the product are you using? On what operating system?

regex.__version__ == '2.4.0'
sys.version_info == (2, 6, 5, 'final', 0)
platform.platform() == 'Linux-3.0.0-15-generic-x86_64-with-Ubuntu-11.10-oneiric'

Please provide any additional information below.

>>> p = regex.compile(ur'(?fi)pos|post')
>>> p.findall(u'POST, Post, post, poſt, poﬆ, and poﬅ')
[u'POS', u'Pos', u'pos', u'po\u017f']
>>> p = regex.compile(ur'(?fi)post|pos')
>>> p.findall(u'POST, Post, post, poſt, poﬆ, and poﬅ')
[u'POST', u'Post', u'post', u'po\u017ft']
>>> p = regex.compile(ur'(?fi)post|another')
>>> p.findall(u'POST, Post, post, poſt, poﬆ, and poﬅ')
[u'POST', u'Post', u'post', u'po\u017ft', u'po\ufb06', u'po\ufb05']
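
The expected list can be cross-checked with Python 3's built-in str.casefold(), which applies Unicode full case folding. This is only a sketch under assumptions: it uses Python 3 (the report was made on Python 2.6), the helper name folds_to_keyword is made up for illustration, and it compares whole words, so it says nothing about how regex tries the alternatives internally.

```python
# Hypothetical helper, not part of regex: checks whether a word equals one of
# the keywords after Unicode full case folding (Python 3 str.casefold()).
KEYWORDS = ["post", "pos"]


def folds_to_keyword(word, keywords=KEYWORDS):
    folded = word.casefold()
    return any(folded == keyword.casefold() for keyword in keywords)


# The six words from the report: plain, long s, and the two "st" ligatures.
words = ["POST", "Post", "post", "po\u017ft", "po\ufb06", "po\ufb05"]
results = {word: folds_to_keyword(word) for word in words}
```

Every word folds to "post", including the two three-character ligature forms, which is why all six are expected in the findall() result.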

Original issue reported on code.google.com by [email protected] on 26 Jan 2012 at 2:50

regex.search("^(a|)\\1{2}b", "b") returns None

What steps will reproduce the problem?
>>> import regex
>>> print(regex.search("^(a|)\\1{2}b", "b"))
None

What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("^(a|)\\1{2}b", "b"))
<_regex.Match object at 0x00C09B10>
>>> print(regex.search("^(a|)\\1{2}b", "b").group(0,1))
('b', '')
>>>

What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120112

Please provide any additional information below.
regex 0.1.20120103 has this issue too

Original issue reported on code.google.com by [email protected] on 14 Jan 2012 at 1:32

regex.search("^(?=ab(de))(abd)(e)", "abde").groups() returns (None, 'abd', 'e') instead of ('de', 'abd', 'e')

What steps will reproduce the problem?

>>> import regex
>>> print(regex.search("^(?=ab(de))(abd)(e)", "abde").groups())
(None, 'abd', 'e')
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> print(regex.search("^(?=ab(de))(abd)(e)", "abde").groups())
('de', 'abd', 'e')
>>>

What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120112

Please provide any additional information below.
regex ver.0.1.20120103 is OK.

Original issue reported on code.google.com by [email protected] on 14 Jan 2012 at 12:26

Broken pattern

re.search('Derde\s*:', 'aaaaaa:\nDerde:')

This matches on Python 2.6.5 (Linux x86_64) with the standard re module, but
does not match with regex from trunk rev b05186807a.

This one is very weird. For example, with one fewer leading "a":

re.search('Derde\s*:', 'aaaaa:\nDerde:')

matches just fine with both modules.
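
For reference, the behaviour of the standard re module on both inputs can be pinned down with a short script. This is a sketch on Python 3 rather than the reporter's Python 2.6.5, and it exercises only re, not regex; the variable names are illustrative.

```python
import re

PATTERN = r"Derde\s*:"

# Six leading "a" characters: the input that fails with regex in the report.
six_a = re.search(PATTERN, "aaaaaa:\nDerde:")

# Five leading "a" characters: the input that works with both modules.
five_a = re.search(PATTERN, "aaaaa:\nDerde:")

print(six_a.span(), five_a.span())
```

Running the same two searches through regex.search() and comparing the spans would show whether a given revision still has the problem.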

Original issue reported on code.google.com by [email protected] on 2 Jan 2011 at 7:18

Problem with shared iterators

Issue 1366311 talks about releasing the GIL in the "re" module, but points to a 
problem which could occur with scanner objects which are shared across threads.

Unfortunately this module has just that problem. :-(

I'm currently attempting to fix it, but it's proving to be surprisingly 
difficult. When the iterator terminates in a thread, it's being collected, 
which causes a failure later when another thread tries to use it.

Revision 4dc5f7f181 didn't fix it.
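
Until the module itself is fixed, a caller can sidestep the problem by serialising access to the shared iterator. The following is a user-level workaround sketch, not the fix being attempted in the C code; LockedIterator and collect_in_threads are hypothetical names, and the example uses the standard re module on Python 3.

```python
import re
import threading


class LockedIterator:
    """Hypothetical wrapper: lets only one thread at a time call next()."""

    def __init__(self, iterator):
        # Holding a reference here keeps the underlying iterator alive for as
        # long as any thread still holds the wrapper.
        self._iterator = iterator
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:
            return next(self._iterator)


def collect_in_threads(shared, thread_count=4):
    """Drain one shared iterator of matches from several threads."""
    found = []
    found_lock = threading.Lock()

    def worker():
        for match in shared:
            with found_lock:
                found.append(match.group())

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return found


shared_matches = LockedIterator(re.finditer(r"\d+", "1 22 333 4444 55555"))
collected = collect_in_threads(shared_matches)
```

Each match is handed to exactly one thread, so the order of collected depends on scheduling, but no match is lost or duplicated.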

Original issue reported on code.google.com by [email protected] on 14 Mar 2011 at 9:18

regex.search("(a*)*", "a", flags=regex.V1).span(1) returns (0, 1) instead of (1, 1)

What steps will reproduce the problem?
>>> import regex
>>> print(regex.search("(a*)*", "a", flags=regex.V1).span(1))
(0, 1)
>>>

What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("(a*)*", "a", flags=regex.V1).span(1))
(1, 1)
>>>

What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120119

Please provide any additional information below.
The following works fine:
>>> import regex
>>> print(regex.search("(a*)*", "aa").span(1))
(2, 2)
>>> print(regex.search("(a*)*", "aaa").span(1))
(3, 3)

Original issue reported on code.google.com by [email protected] on 20 Jan 2012 at 11:34

regex.compile("a#comment\n*", flags=regex.X) causes "_regex_core.error: nothing to repeat"

What steps will reproduce the problem?
>>> import regex
>>> regex.compile("a#comment\n*", flags=regex.X)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
    parsed = parse_pattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 339, in parse_pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 358, in parse_sequence
    item = parse_item(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 368, in parse_item
    element = parse_element(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 709, in parse_element
    raise error("nothing to repeat")
_regex_core.error: nothing to repeat
>>>

What is the expected output? What do you see instead?
>>> import regex
>>> regex.compile("a#comment\n*", flags=regex.X)
regex.Regex('a#comment\n*', flags=regex.X | regex.V0)
>>> regex.search("a#comment\n*", "aaa", flags=regex.X).group(0)
'aaa'
>>>

What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120122

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 22 Jan 2012 at 9:02

atomic and normal groups in recursive patterns

First, many thanks for adding support for recursive patterns to regex!
While playing with this feature, mainly in the context of balancing 
substrings, I found a discrepancy I can't understand. (I suspect this 
issue ought to be of the type "question" rather than a report, but anyway.)

I am trying a pattern for possibly multiply parenthesised substrings; a version 
with an atomic group works, as I would expect:
>>> regex.findall(r"\((?:(?>[^()]+)|(?R))*\)", "a(bcd(e)f)g(h)")
['(bcd(e)f)', '(h)']

However, using a normal group the result is:
>>> regex.findall(r"\((?:([^()]+)|(?R))*\)", "a(bcd(e)f)g(h)")
['f', 'h']
>>> 

I confess I am not fully able to trace the (non-)backtracking behaviour in 
detail, but I can't understand how there can be a match without the outer "(" 
and ")", which appear to be an unconditional part of the pattern. Or is 
something other than whole matches possibly returned from findall()?
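
A likely explanation, assuming regex.findall() follows the documented convention of re.findall(): when the pattern contains exactly one capturing group, findall() returns the text of that group instead of the whole match. The sketch below shows the convention with the standard re module on a non-recursive pattern (Python 3; variable names are illustrative).

```python
import re

text = "a(b)c(d)"

# One capturing group: findall() returns the group contents only.
with_capture = re.findall(r"\(([^()]+)\)", text)

# Non-capturing group: findall() returns the whole matches.
without_capture = re.findall(r"\((?:[^()]+)\)", text)

# finditer() plus group(0) always gives the whole match, captures or not.
whole_matches = [m.group(0) for m in re.finditer(r"\(([^()]+)\)", text)]

print(with_capture, without_capture, whole_matches)
```

Under that reading, ['f', 'h'] would be the last text captured by ([^()]+) in each match, while the matches themselves still include the outer parentheses.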


Regarding the above examples - I am trying to find a pattern for nested 
elements - parentheses in this simplified example - which would work for 
balanced "subexpressions" in both directions, so that the longest balanced part 
would be found.
e.g.:
>>> regex.findall(r"\((?:(?>[^()]+)|(?R))*\)", "a(b(cd)e)f)g)h")
['(b(cd)e)']
 works OK, the superfluous closing parentheses are ignored; 
however:
>>> regex.findall(r"\((?:(?>[^()]+)|(?R))*\)", "a(bc(d(e)f)gh")
[]

Is a reversed search the right way to ignore the superfluous opening 
parentheses?
>>> regex.findall(r"(?r)\((?:(?>[^()]+)|(?R))*\)", "a(bc(d(e)f)gh")
['(d(e)f)']
>>> 

Or is there maybe a pattern, which would match in both cases unchanged?
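
If a single pattern turns out not to be possible, a plain stack-based scan handles both cases unchanged. This is a sketch of an alternative technique, not a regex feature; balanced_spans is a hypothetical helper name and the code targets Python 3.

```python
def balanced_spans(text, opener="(", closer=")"):
    """Return the outermost balanced substrings of text, left to right.

    Unmatched openers and closers are simply ignored.
    """
    open_positions = []  # indices of openers still waiting for a closer
    spans = []           # (start, end) of every balanced pair found
    for index, char in enumerate(text):
        if char == opener:
            open_positions.append(index)
        elif char == closer and open_positions:
            spans.append((open_positions.pop(), index + 1))

    # Balanced spans are either nested or disjoint, so after sorting by start
    # a span is outermost exactly when it begins at or after the previous
    # outermost span's end.
    outermost = []
    last_end = 0
    for start, end in sorted(spans):
        if start >= last_end:
            outermost.append(text[start:end])
            last_end = end
    return outermost


print(balanced_spans("a(b(cd)e)f)g)h"))
print(balanced_spans("a(bc(d(e)f)gh"))
```

The list of unmatched opener positions left in open_positions at the end of the scan could also serve as a starting point for reporting the non-balanced elements.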

(The next step would be to find the *not*-balanced elements, if it would be at 
all possible only using regex.)
(the above tests on Win XP, Python 2.7.2, regex-0.1.20120105)

Thanks and regards,
   vbr

Original issue reported on code.google.com by [email protected] on 12 Jan 2012 at 11:29

Support for regex in Property Values

I just found a chapter in the Unicode guidelines for regular expressions 
concerning property values, and would like to ask whether supporting regex 
patterns (or some subset thereof) there would be possible.
http://unicode.org/reports/tr18/#Wildcard_Properties

I am not aware of any implementation already supporting it, nor do I know how 
much extra complexity for the parser would be needed, but it looks like a 
feature orthogonal with unicode properties and the set operations which regex 
already has.

I see that the use cases would cover rather special approaches. In my case it 
would allow for investigating the Unicode character repertoire itself 
(currently I can do something like that after grabbing all the character names 
via unicodedata).

Otherwise, on "normal" text, it could cover the cases where multiple character 
ranges should be considered together (i.e. basic xxx, xxx supplement, xxx 
extended ...). (I am not sure how the current Script property relates to this 
exactly.) 
For example, errors or even spoofing attempts might be checked for among 
graphically similar characters from different ranges, cf.:
o (dec: 111; hex: 0x6f) LATIN SMALL LETTER O
ο (dec: 959; hex: 0x3bf) GREEK SMALL LETTER OMICRON
о (dec: 1086; hex: 0x43e) CYRILLIC SMALL LETTER O
օ (dec: 1413; hex: 0x585) ARMENIAN SMALL LETTER OH
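
The unicodedata approach mentioned above can be sketched as follows: scan the character names and keep the characters whose name matches a regular expression. This is only an approximation of a wildcard property on the Name property, written for Python 3 with the standard re module; chars_with_name_matching is a hypothetical helper name.

```python
import re
import sys
import unicodedata


def chars_with_name_matching(pattern, stop=sys.maxunicode + 1):
    """Return all characters below `stop` whose Unicode name matches pattern."""
    name_re = re.compile(pattern)
    found = []
    for code_point in range(stop):
        # unicodedata.name() returns the default ("") for unnamed code points.
        name = unicodedata.name(chr(code_point), "")
        if name and name_re.search(name):
            found.append(chr(code_point))
    return found


# Look for the round "o"-like small letters from the four scripts listed above.
round_letters = chars_with_name_matching(
    r"^(LATIN|GREEK|CYRILLIC|ARMENIAN) SMALL LETTER (O|OMICRON|OH)$",
    stop=0x0600,
)
for char in round_letters:
    print("U+%04X" % ord(char), unicodedata.name(char))
```

The stop argument only limits the scan to keep the example quick; leaving it at its default walks the whole code space.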

I'd like to stress that this is only meant as a proposal for consideration; it 
surely wouldn't be worth extensive effort or the risk of becoming a source of 
bugs.

Regards
  Vlastimil  Brom

Original issue reported on code.google.com by [email protected] on 15 Sep 2011 at 3:51

locale flag behaviour - independent of locale.setlocale()

Hi,
I hope, I am not missing anything trivial, I just noticed a behaviour of the 
LOCALE flag I can't understand; however, both re and regex behave the same in 
this respect:

I thought, the search pattern (?L)\w would match any of the respective 
string.letters according to the current locale (and possibly additionally 
[0-9_]).

However, the locale doesn't seem to be reflected.



>>> unicode_BMP = " " + "".join(unichr(i)for i in range(1, 0xFFFF+1))

>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> print unicode(string.letters, "windows-1250")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzŠŚŤŽŹšśťžźŁĄŞŻ
łµąşĽľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖŘŮÚŰÜÝŢßŕá
âăäĺćçčéęëěíîďđńňóôőöřůúűüýţ
>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print unicode(string.letters, "windows-1253")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒΆµΈΉΊΌΎΏΐΑΒΓΔ
ΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμν
ξοπρςστυφχψωϊϋόύώ
>>> 
>>> import re
>>> import regex
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> print "".join(re.findall(r"(?L)\w", unicode_BMP))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz���������
��£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛ�
�ÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
>>> print "".join(regex.findall(r"(?L)\w", unicode_BMP))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz���������
��£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛ�
�ÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
>>> 

>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print "".join(re.findall(r"(?L)\w", unicode_BMP))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz�¢²³µ¸¹º�
�¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäå�
�çèéêëìíîïðñòóôõö÷øùúûüýþ
>>> print "".join(regex.findall(r"(?L)\w", unicode_BMP))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz�¢²³µ¸¹º�
�¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäå�
�çèéêëìíîïðñòóôõö÷øùúûüýþ
>>> 

It seems that the nearest letter set to the result of the re/regex LOCALE flag 
might be ASCII or the US locale:

>>> locale.setlocale(locale.LC_ALL, "US")
'English_United States.1252'
>>> print unicode(string.letters, "windows-1252")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒŠŒŽšœžŸªµºÀÁÂ
ÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêë
ìíîïðñòóôõöøùúûüýþÿ
>>> 

however, there are some differences too, namely between z� and À
re/regex (?L)\w : 
z�¢²³µ¸¹º¼¾¿À (as displayed in wxPython shell)
z����������£¥ª¯³µ¹º¼¾¿À (as displayed in tkinter Idle 
shell)
(in either case, there are some items, one wouldn't consider usual word 
characters, cf. ¿)
US string.letters:
zƒŠŒŽšœžŸªµºÀ (displayed identically in both shells)

There are likely some other issues (like some encoding/display peculiarities 
in wx and Tkinter), but regex matching using the LOCALE flag clearly doesn't 
reflect locale.setlocale(...).

Is it supposed to work this way and is there another possibility to get the 
expected locale aware matching?
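
For anyone repeating this experiment on Python 3: there the LOCALE flag is accepted only for bytes patterns, so the test has to run over byte values rather than over the BMP. A minimal sketch, assuming Python 3 and the standard re module (not the reporter's Python 2.7.1 setup); locale_word_bytes is a made-up helper name.

```python
import re

ALL_BYTES = bytes(range(256))


def locale_word_bytes():
    """Return every byte value that (?L)\\w matches under the current locale."""
    return b"".join(re.findall(rb"(?L)\w", ALL_BYTES))


# The result depends on whatever LC_CTYPE is in effect when this runs.
matched = locale_word_bytes()
print(matched)
```

Calling locale.setlocale(locale.LC_ALL, ...) before locale_word_bytes() and comparing the results for two locales shows directly whether the setting is being picked up.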

using Python 2.7.1, 32 bit;  win 7 Home Premium 64-bit, Czech; 
regex-0.1.20110315.

Regards,
    Vlastimil Brom

Original issue reported on code.google.com by [email protected] on 2 Apr 2011 at 9:41

regex.search("((?i)blah)\\s+\\1", "blah BLAH") doesn't return None

What steps will reproduce the problem?
>>> import regex
>>> print(regex.search("((?i)blah)\\s+\\1", "blah BLAH").group(0,1))
('blah BLAH', 'blah')
>>> print(regex.search("((?i)blah)\\s+\\1", "blah BLAH"))
<_regex.Match object at 0x00C0BBB8>
>>>

What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("((?i)blah)\\s+\\1", "blah BLAH"))
None
>>>

What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120114

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 15 Jan 2012 at 7:03

support concatenation of compiled patterns -- feature

With the new named lists feature, a compiled pattern may contain both the 
pattern string, and references to one or more lists.  This means that the 
standard way of composing patterns, concatenating the .pattern attribute of 
compiled patterns with other text, won't "just work".

Let me suggest an API modification to get around this.  Instead of the 
"pattern" parameter to regex.compile() being a string, allow it to be either a 
string, or a sequence of objects, each of which must be either a string or a 
compiled pattern.  The elements of the sequence are concatenated to generate 
the new pattern string, and list references in the compiled pattern elements of 
the sequence are transferred into the new compiled pattern.  Duplicate names 
for list references could be an error; or, they could be handled automatically 
by name-mangling the list names within the aggregate compiled pattern.
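
The proposal can be sketched in pure Python to make the merging rule concrete. Everything here is hypothetical: PatternPart stands in for a compiled pattern (it only carries the pattern string and its named lists), combine_parts is a made-up name, and duplicate list names are treated as an error, which is the first of the two options suggested above.

```python
from dataclasses import dataclass, field


@dataclass
class PatternPart:
    """Hypothetical stand-in for a compiled pattern with its named lists."""
    pattern: str
    named_lists: dict = field(default_factory=dict)


def combine_parts(parts):
    """Concatenate strings and PatternPart objects into one PatternPart."""
    pieces = []
    merged_lists = {}
    for part in parts:
        if isinstance(part, str):
            pieces.append(part)
            continue
        for name, values in part.named_lists.items():
            if name in merged_lists:
                raise ValueError("duplicate named list: %r" % name)
            merged_lists[name] = list(values)
        pieces.append(part.pattern)
    return PatternPart("".join(pieces), merged_lists)


keywords_part = PatternPart(r"\L<keywords>", {"keywords": ["post", "pos"]})
combined = combine_parts([r"\b", keywords_part, r"\b"])
```

With the current API, the combined result could then be compiled as regex.compile(combined.pattern, **combined.named_lists), since named lists are passed as keyword arguments.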

Original issue reported on code.google.com by [email protected] on 29 Jun 2011 at 4:24

non-capturing group (around surrogate character sets) cause recursion error

I just encountered a corner case while using the non-capturing group (?: 
...), which leads to a recursion error. The group I tested contains character 
sets of surrogate characters; I am not sure whether this is relevant to the 
problem.
The same pattern works as expected with normal parentheses and without any 
grouping, cf.:

>>> regex.findall(ur"(?s)(?:[\ud800-\udbff][\udc00-\udfff])", u"a𐀀bcdefg")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "regex.pyc", line 253, in findall
  File "regex.pyc", line 394, in _compile
  File "_regex_core.pyc", line 2187, in fix_groups
  File "_regex_core.pyc", line 2187, in fix_groups
  File "_regex_core.pyc", line 2187, in fix_groups
...
  File "_regex_core.pyc", line 2187, in fix_groups
  File "_regex_core.pyc", line 2187, in fix_groups
  File "_regex_core.pyc", line 2187, in fix_groups
  File "_regex_core.pyc", line 2187, in fix_groups
RuntimeError: maximum recursion depth exceeded
>>> regex.findall(ur"(?s)[\ud800-\udbff][\udc00-\udfff]", u"a𐀀bcdefg")
[u'\U00010000']
>>> regex.findall(ur"(?s)([\ud800-\udbff][\udc00-\udfff])", u"a𐀀bcdefg")
[u'\U00010000']
>>> 

re can handle all these patterns normally:
>>> re.findall(ur"(?s)([\ud800-\udbff][\udc00-\udfff])", u"a𐀀bcdefg")
[u'\U00010000']
>>> re.findall(ur"(?s)[\ud800-\udbff][\udc00-\udfff]", u"a𐀀bcdefg")
[u'\U00010000']
>>> re.findall(ur"(?s)([\ud800-\udbff][\udc00-\udfff])", u"a𐀀bcdefg")
[u'\U00010000']
>>> 

Using regex-0.1.20110616, win 7, python 2.7.2 (the official, narrow unicode 
build, of course)

vbr

Original issue reported on code.google.com by [email protected] on 22 Jun 2011 at 10:38

Setup.py is missing a reference to _regex_unicode.c

When I try to import regex I get the following import error:

dlopen(<snipped>/python2.5/site-packages/_regex.so, 2): Symbol not found: 
_re_is_same_char_ign
  Referenced from: <snipped>/python2.5/site-packages/_regex.so
  Expected in: dynamic lookup

Changing setup.py line 57 like below fixes this:

ext_modules=[Extension('_regex', [os.path.join(PKG_BASE, '_regex.c'),
                                  os.path.join(PKG_BASE, '_regex_unicode.c')])]

Original issue reported on code.google.com by [email protected] on 10 May 2011 at 2:14

casefolding specification

First, thanks for the new release (regex 0.1.20110917) ! (I especially like the 
changed fuzzy matching behaviour as discussed in 
http://code.google.com/p/mrab-regex-hg/issues/detail?id=12#c28 )

I'd like to ask about the specification of the case-folding behaviour used in 
case insensitive matching.
Is it chapter 5.18 in the Unicode standard
http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf
Or did I miss something else?
I tried some patterns, where I thought these would be "caselessly" equivalent 
(based on the above)
>>> for m in regex.findall(ur"(?V1i)[ΣΟ]",u"ρς στΣόο"): print m,
... 
σ Σ ο

Here I'd have thought that the accented lowercase omicron or the positional 
(final) lowercase sigma variant would be matched too.

On the other hand, the sharp s (which is more frequent in my texts) seems to be 
matched in all directions.

>>> for m in regex.findall(ur"(?V1i)ẞ",u"-s-S-ss-SS-ß-ẞ-"): print m,
... 
ss SS ß ẞ
>>> for m in regex.findall(ur"(?V1i)ss",u"-s-S-ss-SS-ß-ẞ-"): print m,
... 
ss SS ß ẞ
>>> 

I thought that only changes in case should be reflected in matching; now 
there is effectively an equivalence between lowercase ss and ß, which is 
not (at least not always) what is expected (both with respect to the current 
German orthography and for dealing with text preceding that official 
orthography regulation).

Is there now some way to handle these characters as distinct (other than not 
using the i flag)?

Where can I maybe find the specification for this behaviour? - it seems, that I 
will need to reflect it in the search patterns.
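
The difference between case conversion and full case folding can be seen with Python 3's own string methods. This sketch does not show which specification regex follows; it only illustrates why full folding (as implemented by str.casefold()) equates ß with ss while simple lowercasing does not. The two helper names are made up for illustration.

```python
def same_by_lowercasing(left, right):
    """Compare using plain lowercase conversion."""
    return left.lower() == right.lower()


def same_by_full_folding(left, right):
    """Compare using Unicode full case folding (Python 3 str.casefold())."""
    return left.casefold() == right.casefold()


pairs = [
    ("\u00df", "ss"),      # sharp s vs. "ss"
    ("\u1e9e", "\u00df"),  # capital sharp s vs. sharp s
    ("\u03c2", "\u03a3"),  # final sigma vs. capital sigma
    ("\u03cc", "\u039f"),  # omicron with tonos vs. capital omicron
]
for left, right in pairs:
    print(left, right,
          same_by_lowercasing(left, right),
          same_by_full_folding(left, right))
```

Neither comparison removes accents, so the accented omicron stays distinct from the plain one either way; only the sharp s and final sigma pairs change between the two methods.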

(I can't comment competently on the behaviour of the "prominent" case of the 
Turkic "i"s; personally I believe there must be other comparable cases, once 
we begin to care about them... I'd support the view of some contributors in the 
respective py-list thread ( 
http://mail.python.org/pipermail/python-list/2011-September/1280544.html ) 
that such cases are better dealt with individually, on an application basis, if 
need be. I'd just prefer keeping the flags repertoire shorter, if 
possible :-) )

Regards,
 Vlastimil Brom

Original issue reported on code.google.com by [email protected] on 17 Sep 2011 at 12:37

"bad set" error for unescaped ] at the beginning of the set

Hi,
I just found an inconsistency between regex and re in the handling of sets (it 
might depend on the newest addition of set operations).
I thought a pattern like "[][]" would be legal (although probably not very 
readable). It does work in re, but in regex it causes a "bad set" error:

>>> print re.sub(r"([][])", r"-", u"a[b]c")
a-b-c
>>> print regex.sub(r"([][])", r"-", u"a[b]c")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "regex.pyc", line 194, in sub
  File "regex.pyc", line 334, in _compile
  File "_regex_core.pyc", line 243, in _parse_pattern
  File "_regex_core.pyc", line 257, in _parse_sequence
  File "_regex_core.pyc", line 270, in _parse_item
  File "_regex_core.pyc", line 369, in _parse_element
  File "_regex_core.pyc", line 503, in _parse_paren
  File "_regex_core.pyc", line 243, in _parse_pattern
  File "_regex_core.pyc", line 257, in _parse_sequence
  File "_regex_core.pyc", line 270, in _parse_item
  File "_regex_core.pyc", line 382, in _parse_element
  File "_regex_core.pyc", line 924, in _parse_set
  File "_regex_core.pyc", line 933, in _parse_set_union
  File "_regex_core.pyc", line 943, in _parse_set_symm_diff
  File "_regex_core.pyc", line 950, in _parse_set_inter
  File "_regex_core.pyc", line 957, in _parse_set_diff
  File "_regex_core.pyc", line 971, in _parse_set_imp_union
  File "_regex_core.pyc", line 978, in _parse_set_member
  File "_regex_core.pyc", line 1046, in _parse_set_item
  File "_regex_core.pyc", line 933, in _parse_set_union
  File "_regex_core.pyc", line 943, in _parse_set_symm_diff
  File "_regex_core.pyc", line 950, in _parse_set_inter
  File "_regex_core.pyc", line 957, in _parse_set_diff
  File "_regex_core.pyc", line 971, in _parse_set_imp_union
  File "_regex_core.pyc", line 978, in _parse_set_member
  File "_regex_core.pyc", line 1048, in _parse_set_item
error: bad set

It can be easily remedied (after I found the problem in a more complex pattern) 
by escaping the square brackets:


>>> print re.sub(r"([\]\[])", r"-", u"a[b]c")
a-b-c
>>> print regex.sub(r"([\]\[])", r"-", u"a[b]c")
a-b-c
>>> 

Using regex-0.1.20110510 python 2.7.1, Win XP

regards,
   vbr

Original issue reported on code.google.com by [email protected] on 18 May 2011 at 11:09

regex.compile("^((?>\w+)|(?>\s+))*$") causes "TypeError: 'GreedyRepeat' object is not iterable"

What steps will reproduce the problem?
python3
Python 3.2.2 (default, Dec 23 2011, 15:22:48) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> regex.compile("^((?>\w+)|(?>\s+))*$", flags=regex.V1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 452, in _compile
    parsed = parsed.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2805, in optimise
    s = s.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2450, in optimise
    subpattern = self.subpattern.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2514, in optimise
    subpattern = self.subpattern.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1813, in optimise
    prefix, branches = Branch._split_common_prefix(info, branches)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1903, in _split_common_prefix
    while pos < end_pos and prefix[pos].can_be_affix() and all(a[pos] ==
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1731, in can_be_affix
    return all(s.can_be_affix() for s in self.subpattern)
TypeError: 'GreedyRepeat' object is not iterable
>>> try:
...     regex.compile("^((?>\w+)|(?>\s+))*$")
... except regex.error:
...     print("Wrong regexp!")
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 452, in _compile
    parsed = parsed.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2805, in optimise
    s = s.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2450, in optimise
    subpattern = self.subpattern.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2514, in optimise
    subpattern = self.subpattern.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1813, in optimise
    prefix, branches = Branch._split_common_prefix(info, branches)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1903, in _split_common_prefix
    while pos < end_pos and prefix[pos].can_be_affix() and all(a[pos] ==
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1731, in can_be_affix
    return all(s.can_be_affix() for s in self.subpattern)
TypeError: 'GreedyRepeat' object is not iterable
>>> 

What is the expected output? What do you see instead?
>>> import regex
>>> regex.compile("^((?>\w+)|(?>\s+))*$", flags=regex.V1)
regex.Regex('^((?>\\w+)|(?>\\s+))*$', flags=regex.F | regex.V1)
>>> try:
...     regex.compile("^((?>\w+)|(?>\s+))*$")
... except regex.error:
...     print("Wrong regexp!")
... 
Wrong regexp!
>>> 

What version of the product are you using? On what operating system?
Mac OS X 10.6.8
Python 3.2.2
regex 0.1.20111223

Please provide any additional information below.
# The case of Python 3.2.2 standard re module
>>> import re
>>> try:
...     re.compile("^((?>\w+)|(?>\s+))*$")
... except re.error:
...     print("Wrong regexp!")
... 
Wrong regexp!
>>>

Original issue reported on code.google.com by [email protected] on 3 Jan 2012 at 2:19

regex.search("([\da-f:]+)$", "E", regex.I|regex.V1) returns None

What steps will reproduce the problem?
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> print(regex.search("([\da-f:]+)$", "E", regex.I|regex.V1))
None
>>>

What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("([\da-f:]+)$", "E", regex.I|regex.V1))
<_regex.Match object at 0x00C09B10>
>>> print(regex.search("([\da-f:]+)$", "E", regex.I|regex.V1).group(0))
E
>>>

What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2
regex 0.1.20120112

Please provide any additional information below.
# "e" is ok.
>>> import regex
>>> print(regex.search("([\da-f:]+)$", "e", regex.I|regex.V1))
<_regex.Match object at 0x00C09B10>
>>> print(regex.search("([\da-f:]+)$", "e", regex.I|regex.V1).group(0))
e
>>>

Original issue reported on code.google.com by [email protected] on 13 Jan 2012 at 11:58

Recursive patterns

In order to parse certain strings belonging to certain kinds of languages, 
recursive patterns for recursive matching are necessary. For example, if you 
want to parse and capture a parenthesized chunk of text which may contain 
nested parentheses, you'll need to have recursion.

Perl already supports recursive matching, so does PCRE and with that many 
libraries for languages such as PHP. I always wondered why the Python regex 
engines never did.

Please refer to http://perldoc.perl.org/perlre.html for documentation on the 
widely accepted syntax for recursive patterns.

Original issue reported on code.google.com by [email protected] on 23 Dec 2011 at 2:56

negated unicode properties in case-insensitive mode

While trying to test some of the recently listed properties supported by regex, 
it appears to me, that the negated properties don't work in case insensitive 
search; cf.:

>>> regex.findall(ur"(?i)\P{InBasicLatin}",u"aáb")
[u'a', u'b']
>>> regex.findall(ur"(?i)\p{InBasicLatin}",u"aáb")
[u'a', u'b']
>>> 
>>> regex.findall(ur"\P{InBasicLatin}",u"aáb")
[u'\xe1']
>>> regex.findall(ur"\p{InBasicLatin}",u"aáb")
[u'a', u'b']
>>> 

as if the negated property escape \P were somehow taken as lowercase (?)

some other literals don't seem to be affected, e.g.

>>> regex.findall(ur"\s",u"a b\tcd")
[u' ', u'\t']
>>> regex.findall(ur"\S",u"a b\tcd")
[u'a', u'b', u'c', u'd']
>>> 
>>> regex.findall(ur"(?i)\s",u"a b\tcd")
[u' ', u'\t']
>>> regex.findall(ur"(?i)\S",u"a b\tcd")
[u'a', u'b', u'c', u'd']
>>> 
works as expected.
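
The expected results for the property tests can be cross-checked without regex, assuming InBasicLatin denotes the Basic Latin block U+0000..U+007F. A minimal Python 3 sketch; in_basic_latin is a made-up helper name.

```python
def in_basic_latin(char):
    """True if char lies in the Basic Latin block (U+0000..U+007F)."""
    return ord(char) <= 0x7F


text = "a\u00e1b"  # "a", "a with acute", "b", as in the report

# What \p{InBasicLatin} should find, with or without (?i).
inside = [char for char in text if in_basic_latin(char)]

# What \P{InBasicLatin} should find, with or without (?i).
outside = [char for char in text if not in_basic_latin(char)]

print(inside, outside)
```

Block membership does not depend on case, so the (?i) flag should leave both lists unchanged.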

Regards,
   vbr

Original issue reported on code.google.com by [email protected] on 28 Sep 2011 at 8:12

regex.search("(?>.*/)b", "a/b") returns None

What steps will reproduce the problem?
>>> import regex
>>> print(regex.search("(?>.*/)b", "a/b"))
None
>>>

What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("(?>.*/)b", "a/b"))
<_regex.Match object at 0x00C0BBB8>
>>> print(regex.search("(?>.*/)b", "a/b").group(0))
a/b
>>>

What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120114

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 15 Jan 2012 at 6:40

property database - aliases (question)

Well, it doesn't really count as normal usage of the regex module, but I tried 
(again) to toy with its data structures, e.g. to get some Unicode properties 
not available in unicodedata now.
(It is not all that practical, as I can just grab the Unicode data files for 
searching, but I also wanted to dig into the (Python) source of regex a bit...)

So far, I could make a check for a Unicode property (hopefully :-) work;
is it just e.g.:

>>> _regex.has_property_value((_regex.get_properties()["SCRIPT"][0] << 16) | 
_regex.get_properties()["SCRIPT"][1]["GREEK"], ord(u"Σ"))
1
>>> 
?

Is it the case, that there is no other access path, i.e. for getting some 
property to a given character, one has to check each property for every 
possible value and collect the successful matches? (Actually, it works 
surprisingly fast, given how clumsy approach this is.) 
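
The brute-force lookup described here can be written as a small loop. This is a sketch under assumptions: the table is taken to have the shape suggested by the call above (property name mapped to a pair of property id and a dict of value name to value id), the checker is passed in as a parameter rather than calling _regex directly, and the stand-in table and checker below are invented purely so the example runs without regex installed.

```python
def collect_properties(char, properties, has_value):
    """Try every (property, value) pair and keep those that hold for char."""
    code_point = ord(char)
    found = {}
    for prop_name, (prop_id, values) in properties.items():
        for value_name, value_id in values.items():
            # Same packing as in the call above: property id in the high bits.
            packed = (prop_id << 16) | value_id
            if has_value(packed, code_point):
                found.setdefault(prop_name, []).append(value_name)
    return found


# Invented stand-ins, NOT the real regex tables or ids.
FAKE_PROPERTIES = {"SCRIPT": (1, {"GREEK": 10, "LATIN": 11})}


def fake_has_value(packed, code_point):
    if packed == (1 << 16) | 10:
        return 0x0370 <= code_point <= 0x03FF
    if packed == (1 << 16) | 11:
        return code_point < 0x0250 and chr(code_point).isalpha()
    return False


print(collect_properties("\u03a3", FAKE_PROPERTIES, fake_has_value))
```

With regex installed, passing _regex.get_properties() and _regex.has_property_value in place of the stand-ins should give the real answer, provided the table really has the assumed shape.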

(It is really a kind of exercise, I wouldn't want to ask for a more comfortable 
access to this data, you already offered, as this belongs to unicodedata.)

On a related note, is it somehow possible to programmatically access the 
original, non-normalised property names and values, as listed on:
http://code.google.com/p/mrab-regex-hg/wiki/UnicodeProperties

It is possible to collect the aliases belonging to each other and take the 
longest ones as full forms, but the casing and spaces probably can't be 
recovered, can they?

Sorry for this possibly irrelevant "issue" (as having an issue-type "question" 
would likely be silly...).

And, of course, many thanks for the recent enhancements and fixes!

regards,
 vbr

Original issue reported on code.google.com by [email protected] on 29 Sep 2011 at 3:56

regex.search("(a)(?<=b(?1))", "baz", regex.V1) returns None incorrectly

What steps will reproduce the problem?
>>> import regex
>>> print(regex.search("(a)(?<=b(?1))", "baz", regex.V1))
None
>>>

What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("(a)(?<=b(?1))", "baz", regex.V1))
<_regex.Match object at 0x00C09C28>
>>> print(regex.search("(a)(?<=b(?1))", "baz", regex.V1).group(0))
a
>>>

What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120123

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 25 Jan 2012 at 1:56

regex.search("(a(?(1)\\1)){4}", "a"*10, flags=regex.V1).group(0,1) returns ('aaaaa', 'a') instead of ('aaaaaaaaaa', 'aaaa')

What steps will reproduce the problem?
>>> import regex
>>> print(regex.search("(a(?(1)\\1)){4}", "a"*10, flags=regex.V1).group(0,1))
('aaaaa', 'a')
>>>

What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("(a(?(1)\\1)){4}", "a"*10, flags=regex.V1).group(0,1))
('aaaaaaaaaa', 'aaaa')
>>>

What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
regex 0.1.20120123

Please provide any additional information below.
The following works fine:
>>> print(regex.search("(a(?(1)\\1)){1}", "a"*10, flags=regex.V1).group(0,1))
('a', 'a')
>>> print(regex.search("(a(?(1)\\1)){2}", "a"*10, flags=regex.V1).group(0,1))
('aaa', 'aa')
>>>

Original issue reported on code.google.com by [email protected] on 25 Jan 2012 at 1:01

regex.search("(\()?[^()]+(?(1)\)|)", "(abcd").group(0) returns "bcd" instead of "abcd"

What steps will reproduce the problem?
>>> import regex
>>> print(regex.search("(\\()?[^()]+(?(1)\\)|)", "(abcd").group(0))
bcd
>>>

What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("(\\()?[^()]+(?(1)\\)|)", "(abcd").group(0))
abcd
>>>

What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
regex 0.1.20120114

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 15 Jan 2012 at 7:08

hg EOL extension needs to be used and configured

All developers need to use the http://mercurial.selenic.com/wiki/EolExtension 
to deal with cross platform line ending conversion issues.

[extensions]
eol =

In their own .hgrc file.

This repo also needs the attached file checked in as a .hgeol file.

Original issue reported on code.google.com by [email protected] on 14 Mar 2011 at 10:12

Attachments:

hgeol

approximate matching -- feature request

I'm currently using the TRE regex engine to match output from OCR, because it 
supports approximate matching.  Very useful.  Would be nice to have that 
capability in Python regex, as well.

Original issue reported on code.google.com by [email protected] on 2 Jun 2011 at 8:33

regex.search("^(a){0,0}", "abc").group(0,1) returns ('a', 'a') instead of ('', None)

What steps will reproduce the problem?
>>> import regex
>>> print(regex.search("^(a){0,0}", "abc").group(0,1))
('a', 'a')
>>>

What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("^(a){0,0}", "abc").group(0,1))
('', None)
>>>

What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
regex 0.1.20120112

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 14 Jan 2012 at 4:06

regex.search("a(bc)d", "abcd", regex.I|regex.V1) returns None

What steps will reproduce the problem?
C:\Python32\3.2.2\Scripts>python.exe
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> print(regex.search("a(bc)d", "abcd", regex.I|regex.V1))
None

What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("a(bc)d", "abcd", regex.I|regex.V1))
<_regex.Match object at 0x00C09B10>
>>> print(regex.search("a(bc)d", "abcd", regex.I|regex.V1).group(0))
abcd
>>>

What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2
regex 0.1.20120112

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 13 Jan 2012 at 11:42

nested sets in case insensitive mode ("invalid RE code")

I just encountered an error while using nested sets in case-insensitive mode:

>>> regex.findall("(?V1)[[a-z]--[aei]]","abc")
['b', 'c']
>>> regex.findall("(?V1i)[[a-z]--[aei]]","abc")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python27\lib\regex.py", line 276, in findall
    return _compile(pattern, flags, kwargs).findall(string, pos, endpos,
  File "C:\Python27\lib\regex.py", line 499, in _compile
    named_lists, named_list_indexes)
RuntimeError: invalid RE code
>>> regex.findall("(?V0i)[[a-z]--[aei]]","abc")
[]
>>> 

using regex-0.1.20111004; python 2.7.2 on WinXP; I am not sure when it first 
appeared, but it has probably not been there for long.
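
As a stopgap, the same set difference can be written without nested sets by moving the excluded characters into a negative lookahead. A minimal sketch using the standard re module (chosen here only so that it runs without regex installed; that regex accepts the same lookahead form is an assumption):

```python
import re

# "[a-z] minus [aei]" expressed as: not one of a/e/i, then one of a-z.
DIFFERENCE = r"(?![aei])[a-z]"

print(re.findall(DIFFERENCE, "abc"))                  # case-sensitive
print(re.findall(DIFFERENCE, "ABC", re.IGNORECASE))   # case-insensitive
```

The lookahead and the character class are both subject to the case-insensitive flag, so the excluded letters are excluded in either case.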

regards,
  vbr

Original issue reported on code.google.com by [email protected] on 5 Oct 2011 at 8:37

module alias for regex_version1

Now that other features like nested sets have been moved under VERSION1 for 
compatibility with re, I am wondering how to make the regex behaviour in my 
scripts default to the new VERSION1 (if possible without adding the flag all 
over the code).
It has already been explained that it can't be a module-wide setting, as this 
would influence other programs or libraries using it.

Before I try some hackish approach, I wanted to ask, whether some kind of 
"aliasing" the module on import could work. 
I mentioned this idea marginally in 
http://bugs.python.org/issue2636#msg143442 but the discussion was concentrated 
on other topics.

Would it be possible to have something like
import regex_version0 as re
vs:
import regex_version1 as re

while the opposed flags could be still set if needed?

Could this be achieved (without much code duplication), or are there some 
drawbacks or limits?
Or are there other approaches how to activate the version1 behaviour in the 
user code "at once"?
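
One user-side approach that needs no change to the module is a thin wrapper object that ORs a default flag into every call. The following is only a sketch under assumptions: with_default_flags is a hypothetical helper, and re with re.IGNORECASE stands in for regex with regex.V1 so that the example runs without regex installed. It only handles flags passed by keyword.

```python
import re
import types


def with_default_flags(module, default_flags):
    """Return a namespace whose functions add default_flags to each call.

    Hypothetical helper; flags must be passed as a keyword argument,
    positional flags are not supported by this sketch.
    """
    namespace = types.SimpleNamespace()
    for name in ("compile", "search", "match", "findall", "finditer", "sub"):
        func = getattr(module, name)

        # _func is bound as a default so each wrapper keeps its own function.
        def wrapper(pattern, *args, _func=func, flags=0, **kwargs):
            return _func(pattern, *args, flags=flags | default_flags, **kwargs)

        setattr(namespace, name, wrapper)
    return namespace


# Stand-in usage; with regex installed this might read
# with_default_flags(regex, regex.V1) instead (untested assumption).
rx = with_default_flags(re, re.IGNORECASE)
print(rx.search("abc", "xABCx").group(0))
```

Other flags can still be added per call, e.g. rx.findall("a", "aA", flags=re.MULTILINE); switching the default off for a single call would need an inline flag or a second wrapper.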

Regards,
   vbr

Original issue reported on code.google.com by [email protected] on 17 Sep 2011 at 11:28

set qualifiers - feature idea

Some background:  I've been working with very large REs in CPython and 
IronPython.  We generate the RE pattern from lists, like lists of cities or 
lists of names, somewhat like this:

    namelist = open("names.txt").read().split()
    pattern = re.compile("|".join(namelist))

The one I'm working with now is just a pattern for finding substrings that look 
like the name of a person.  It's overflowing the 
System::Text::RegularExpressions buffers on IronPython, but works OK with 
CPython 2.6 on 64-bit Ubuntu.

One of the things I've been thinking is that this kind of pattern should be 
handled differently.  Suppose there was some syntax like

    pattern = re.compile("(?S<names>)", names=ImmutableSet(namelist))

where (?S indicates a named ImmutableSet, the members of that set to be drawn 
from the keyword argument of that name.  The compiler would generate a 
reasonably fast pattern from that set, say the union of all characters in all 
the strings in the set, and a max and min size based on the min-lengthed and 
max-lengthed elements of the set.  When the engine runs, it would match that 
fast pattern, and if it matches, it would then check to see if the matched 
group is a member of the named set.  If so, the match would be confirmed; if 
not, it would fail.
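
The idea described above can be sketched in plain Python on top of the standard re module. This is only an illustration under assumptions: the helper names build_set_matcher and find_members are hypothetical, the proposed (?S<names>) syntax is not implemented here, and the confirmation step also tries shorter prefixes of each coarse candidate so that a run of allowed characters longer than a name does not hide it.

```python
import re


def build_set_matcher(words):
    """Build the coarse pattern and the set used to confirm matches."""
    members = frozenset(words)
    alphabet = "".join(sorted({ch for word in members for ch in word}))
    min_len = min(len(word) for word in members)
    max_len = max(len(word) for word in members)
    # Coarse filter: any run of allowed characters of a plausible length.
    coarse = re.compile("[%s]{%d,%d}" % (re.escape(alphabet), min_len, max_len))
    return coarse, members


def find_members(text, words):
    """Return the set members found in text, scanning left to right."""
    coarse, members = build_set_matcher(words)
    found = []
    pos = 0
    while pos <= len(text):
        match = coarse.search(text, pos)
        if match is None:
            break
        candidate = match.group(0)
        # Confirm against the set, longest prefix first.
        for end in range(len(candidate), 0, -1):
            if candidate[:end] in members:
                found.append(candidate[:end])
                pos = match.start() + end
                break
        else:
            pos = match.start() + 1  # false positive: move on by one character
    return found


print(find_members("ann met bob and annabel", ["ann", "bob", "annabel"]))
```

The coarse pattern stays small no matter how many names are in the list; the cost moves into the set lookups, which are cheap per candidate.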

Seems like this might be a useful feature for regex to have, given the 
popularity of this kind of machine-generated RE.

Original issue reported on code.google.com by [email protected] on 2 Jun 2011 at 7:45

regex.compile("^(?:a(?:(?:))+)+") causes "_regex_core.error: nothing to repeat"

What steps will reproduce the problem?
>>> import regex
>>> regex.compile("^(?:a(?:(?:))+)+")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
    parsed = parse_pattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 336, in parse_
pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 355, in parse_
sequence
    item = parse_item(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 365, in parse_
item
    element = parse_element(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 649, in parse_
element
    element = parse_paren(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 783, in parse_
paren
    return parse_flags_subpattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 998, in parse_
flags_subpattern
    subpattern = parse_pattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 336, in parse_
pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 355, in parse_
sequence
    item = parse_item(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 365, in parse_
item
    element = parse_element(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 698, in parse_
element
    raise error("nothing to repeat")
_regex_core.error: nothing to repeat
>>>

What is the expected output? What do you see instead?
>>> import regex
>>> regex.compile("^(?:a(?:(?:))+)+")
regex.Regex('^(?:a(?:(?:))+)+', flags=regex.V0)
>>>

What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
regex 0.1.20120119

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 20 Jan 2012 at 1:54

regex.compile("a(?x: b c )d") causes "_regex_core.error: missing )"

What steps will reproduce the problem?
>>> import regex
>>> regex.compile("a(?x: b c )d")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
    parsed = parse_pattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 339, in parse_
pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 358, in parse_
sequence
    item = parse_item(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 368, in parse_
item
    element = parse_element(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 660, in parse_
element
    element = parse_paren(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 796, in parse_
paren
    return parse_flags_subpattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 1014, in parse
_flags_subpattern
    source.expect(")")
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 3439, in expec
t
    raise error("missing {}".format(substring))
_regex_core.error: missing )
>>>

What is the expected output? What do you see instead?
>>> import regex
>>> regex.compile("a(?x: b c )d")
regex.Regex('a(?x: b c )d', flags=regex.V0)
>>> regex.search("a(?x: b c )d", "abcd").group(0)
'abcd'
>>>

What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
regex 0.1.20120122

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 22 Jan 2012 at 8:51

can't build the project

I thought I'd try building the module, and expand the test cases a bit, but I 
can't seem to find the setup.py file.

I did the

  hg clone https://mrab-regex-hg.googlecode.com/hg/ mrab-regex-hg

Is there some trick to building it?

Original issue reported on code.google.com by [email protected] on 9 Jun 2011 at 4:35

regex.compile("a(?#xxx)*") causes "_regex_core.error: nothing to repeat"

What steps will reproduce the problem?
>>> import regex
>>> regex.compile("a(?#xxx)*")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
    parsed = parse_pattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 336, in parse_
pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 355, in parse_
sequence
    item = parse_item(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 365, in parse_
item
    element = parse_element(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 698, in parse_
element
    raise error("nothing to repeat")
_regex_core.error: nothing to repeat
>>>

What is the expected output? What do you see instead?
>>> regex.compile("a(?#xxx)*")
regex.Regex('a(?#xxx)*', flags=regex.V0)
>>>

What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
regex 0.1.20120119

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 20 Jan 2012 at 12:21

regex 0.1.20110514 findall overlapped not working with 'start of string' expression

Apologies if this is again not the right place to post this

I'm trying to use regex 0.1.20110514 with the overlapped=True feature

It works great, unless I have the 'start of string' (caret) character in my 
regular expression:

>>> regex.findall(r"a.*b","abadalaba",overlapped=True)
['abadalab', 'adalab', 'alab', 'ab']
>>> regex.findall(r"^a.*b","abadalaba",overlapped=True)
['abadalab']

If I understand correctly, the second regexp should also produce the same 
results as the first one, since all the results are at the beginning of the 
string
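
One way to see where the single result may come from is to emulate overlapped searching as "restart one character after the start of the previous match". The sketch below does that with the standard re module; findall_overlapped is a hypothetical helper, and that regex's overlapped=True works this way is an assumption. Under this reading, a pattern anchored with ^ can only ever match at the first attempt.

```python
import re


def findall_overlapped(pattern, string):
    """Emulate overlapped matching by restarting at match.start() + 1."""
    compiled = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(string):
        match = compiled.search(string, pos)
        if match is None:
            break
        results.append(match.group(0))
        pos = match.start() + 1  # allow the next match to overlap this one
    return results


print(findall_overlapped(r"a.*b", "abadalaba"))
print(findall_overlapped(r"^a.*b", "abadalaba"))
```

Getting every match that starts at the beginning of the string but ends at a different "b" would need a different mechanism, since each restart here changes the start position, not the end.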

Original issue reported on code.google.com by [email protected] on 20 May 2011 at 11:02

= for fuzzy matches

An = operator could be pretty handy for fuzzy matches, to find only erroneous 
text. For example, in a list of hotmail email accounts, you could search for 
misspellings with '@(hotmail\.com){e=1}'. This would save the user an extra "grep 
-v" for filtering the correct emails out of the list of matches.
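
What the proposed {e=1} would select can be sketched without any fuzzy regex support by computing the edit distance explicitly. A minimal sketch, assuming plain Levenshtein distance (insertions, deletions and substitutions each cost 1); edit_distance and misspelled_domains are hypothetical helper names, not part of regex, and the addresses are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution or match
            ))
        previous = current
    return previous[-1]


def misspelled_domains(addresses, correct="hotmail.com", errors=1):
    """Keep addresses whose domain is exactly `errors` edits from `correct`."""
    return [
        address for address in addresses
        if edit_distance(address.rpartition("@")[2], correct) == errors
    ]


addresses = ["a@hotmail.com", "b@hotmai.com", "c@hotmial.com", "d@example.org"]
print(misspelled_domains(addresses))
```

A swapped pair of letters counts as two edits under this distance, so "hotmial.com" is only reported with errors=2.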

Original issue reported on code.google.com by [email protected] on 17 Jan 2012 at 12:49

adding the set operations and possibly the supported properties to the help text

Hi,
first of all, many thanks for further excellent additions to regex, such as the 
extended unicode properties and newly the set operations!
I'd like to ask for some help text additions in this respect.
Could the Features.rst / Features.html be somehow accessible programmatically 
from within the library? Especially the parts: "Unicode codepoint properties, 
including scripts and blocks" and "Set operators" may be good additions as 
these features are probably less known. Maybe the set operations syntax could be 
added briefly to the initial part of the help: "The special characters are:" 
under "[]".
Furthermore, there could ideally be a reference for the multitude of 
supported unicode properties.
However, I can see that the help text might get too large. Maybe links to 
the respective data (the unicode standard etc.) might be more appropriate. (Some 
of these might eventually be added to the interface of unicodedata, but that's 
not relevant here.)
(now using regex-0.1.20110514.tar.gz, python 2.7, on win 7)
Thanks again
 vbr

Original issue reported on code.google.com by [email protected] on 15 May 2011 at 9:31

different handling of \w in unicode patterns in regex and re

Hi,
I think it may be intended behaviour, but I didn't find it mentioned 
anywhere in the docs. Sorry if it is already discussed somewhere I haven't 
looked ...
It seems that for unicode patterns like ur"..." regex implicitly sets the 
unicode flag (?u), while re doesn't seem to do that. 

>>> re.findall(ur"\w", u"aáb")
[u'a', u'b']
>>> regex.findall(ur"\w", u"aáb")
[u'a', u'\xe1', u'b']
>>> re.findall(r"\w", u"aáb")
[u'a', u'b']
>>> regex.findall(r"\w", u"aáb")
[u'a', u'b']
>>> re.findall(ur"(?u)\w", u"aáb")
[u'a', u'\xe1', u'b']
>>> regex.findall(ur"(?u)\w", u"aáb")
[u'a', u'\xe1', u'b']
>>> 

Python 2.7.1, Win XP SP3, 32 bit Czech; regex r902c02d44f
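
For readers on Python 3, where the session above cannot be replayed as written (the ur"..." prefix no longer exists), the same distinction shows up in the standard re module with the roles reversed: str patterns are Unicode-aware by default and re.ASCII restricts \w again. A small sketch:

```python
import re

text = "a\u00e1b"  # "aáb"

unicode_words = re.findall(r"\w", text)            # Unicode-aware by default
ascii_words = re.findall(r"\w", text, re.ASCII)    # \w limited to [a-zA-Z0-9_]

print(unicode_words)
print(ascii_words)
```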

regards,
   Vlastimil Brom

Original issue reported on code.google.com by [email protected] on 7 Feb 2011 at 1:13

Forward references; nested references?

I'd like to ask about the support for forward references and nested references 
in regex ( http://www.regular-expressions.info/brackets.html ).
I couldn't find any notice of this in the documentation, but it seems, that 
forward references are supported, while nested references are not:

>>> regex.search(r"(\2b|(a))+", "-aab-").group()
'aab'
>>> 
>>> regex.search(r"(\1b|(a))+", "-aab-").group()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python27\lib\regex.py", line 235, in search
    return _compile(pattern, flags, kwargs).search(string, pos, endpos,
  File "C:\Python27\lib\regex.py", line 423, in _compile
    parsed = parse_pattern(source, info)
  File "C:\Python27\lib\_regex_core.py", line 334, in parse_pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python27\lib\_regex_core.py", line 350, in parse_sequence
    item = parse_item(source, info)
  File "C:\Python27\lib\_regex_core.py", line 363, in parse_item
    element = parse_element(source, info)
  File "C:\Python27\lib\_regex_core.py", line 587, in parse_element
    element = parse_paren(source, info)
  File "C:\Python27\lib\_regex_core.py", line 723, in parse_paren
    subpattern = parse_pattern(source, info)
  File "C:\Python27\lib\_regex_core.py", line 334, in parse_pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python27\lib\_regex_core.py", line 350, in parse_sequence
    item = parse_item(source, info)
  File "C:\Python27\lib\_regex_core.py", line 363, in parse_item
    element = parse_element(source, info)
  File "C:\Python27\lib\_regex_core.py", line 584, in parse_element
    return parse_escape(source, info, False)
  File "C:\Python27\lib\_regex_core.py", line 1035, in parse_escape
    return parse_numeric_escape(source, info, ch, in_set)
  File "C:\Python27\lib\_regex_core.py", line 1069, in parse_numeric_escape
    raise error("can't refer to an open group")
error: can't refer to an open group
>>> 

Is it true, or am I misinterpreting something?

(re fails with the same error message for the second pattern and "bogus escape: 
'\\2'" for the first one.)
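
The behaviour of the standard re module mentioned above can be checked with a short script. This is a sketch for Python 3 (the exact error messages differ between versions, so only the exception type is inspected); compiles_ok is a hypothetical helper.

```python
import re


def compiles_ok(pattern):
    """Return True if the standard re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


forward_reference = r"(\2b|(a))+"  # refers to a group that opens later
nested_reference = r"(\1b|(a))+"   # refers to the group it is inside of

print(compiles_ok(forward_reference), compiles_ok(nested_reference))
```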

thanks and regards,
   vbr

Original issue reported on code.google.com by [email protected] on 26 Sep 2011 at 9:19

Patch to restore speed lost in commit 7abd9f9bb1

The attached diff against c0186afe8c50 restores the speed lost in 7abd9f9bb1 
for me (25% faster on my regression tests).  In addition, all my tests still 
pass without error.

Looking at the diff, I confirmed empirically that it's the conditional:

if (pattern->repeat_info[i].inner)

that seems to slow everything down.  I don't know why this test and dereference 
would be so expensive;  it might very well be a gcc bug (I haven't compared the 
generated assembly yet).

Original issue reported on code.google.com by [email protected] on 6 Jan 2011 at 2:02

Attachments:

_regex.c.diff

regex.compile("\\ ", regex.X) causes "_regex_core.error: bad escape"

What steps will reproduce the problem?

>>> import regex
>>> regex.compile("\\ ", regex.X)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
    parsed = parse_pattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 335, in parse_
pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 351, in parse_
sequence
    item = parse_item(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 364, in parse_
item
    element = parse_element(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 585, in parse_
element
    return parse_escape(source, info, False)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 1012, in parse
_escape
    raise error("bad escape")
_regex_core.error: bad escape
>>>

What is the expected output? What do you see instead?

>>> import regex
>>> print(regex.compile("\\ ", regex.X|regex.D))
CHARACTER MATCH 32
regex.Regex('\\ ', flags=regex.D | regex.X | regex.V0)
>>>

What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
regex 0.1.20120112

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 14 Jan 2012 at 12:59

regex.compile("(?>b)") causes "TypeError: 'Character' object is not subscriptable"

What steps will reproduce the problem?
$ python3
Python 3.2.2 (default, Dec 23 2011, 15:22:48) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> regex.compile("(?>b)", flags=regex.V1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 452, in _compile
    parsed = parsed.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1712, in optimise
    prefix, subpattern = Atomic._split_atomic_prefix(subpattern)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1765, in _split_atomic_prefix
    prefix, subpattern = prefix[ : count], subpattern[count : ]
TypeError: 'Character' object is not subscriptable
>>> try:
...     regex.compile("(?>b)")
... except regex.error:
...     print("Wrong Regexp!")
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 452, in _compile
    parsed = parsed.optimise(info)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1712, in optimise
    prefix, subpattern = Atomic._split_atomic_prefix(subpattern)
  File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1765, in _split_atomic_prefix
    prefix, subpattern = prefix[ : count], subpattern[count : ]
TypeError: 'Character' object is not subscriptable
>>> 

What is the expected output? What do you see instead?
>>> import regex
>>> regex.compile("(?>b)", flags=regex.V1)
regex.Regex('(?>b)', flags=regex.F | regex.V1)
>>> try:
...     regex.compile("(?>b)")
... except regex.error:
...     print("Wrong RegExp!")
... 
Wrong RegExp!
>>> 

What version of the product are you using? On what operating system?
Mac OS X 10.6.8
Python 3.2.2
regex 0.1.20111223

Please provide any additional information below.
# The case of Python 3.2.2 standard re module
>>> import re
>>> try:
...     re.compile("(?>b)")
... except re.error:
...     print("Wrong RegExp!")
... 
Wrong RegExp!
>>>

Original issue reported on code.google.com by [email protected] on 3 Jan 2012 at 1:59

regex.compile("(?=abc){3}abc") causes "_regex_core.error: nothing to repeat"

What steps will reproduce the problem?
>>> import regex
>>> regex.compile("(?=abc){3}abc")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
    return _compile(pattern, flags, kwargs)
  File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
    parsed = parse_pattern(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 336, in parse_
pattern
    branches = [parse_sequence(source, info)]
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 352, in parse_
sequence
    item = parse_item(source, info)
  File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 370, in parse_
item
    raise error("nothing to repeat")
_regex_core.error: nothing to repeat
>>>

What is the expected output? What do you see instead?
>>> import regex
>>> regex.compile("(?=abc){3}abc")
regex.Regex('(?=abc){3}abc', flags=regex.V0)
>>> regex.search("(?=abc){3}abc", "abcabcabc").span(0)
(0, 3)
>>>

What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
32
regex 0.1.20120119

Please provide any additional information below.

Original issue reported on code.google.com by [email protected] on 20 Jan 2012 at 1:35

Change NEW flag

Feature request: please change the NEW flag to something else. In five or six 
years (give or take), the re module will be long forgotten, compatibility with 
it will not be needed, so-called "new" features will no longer be new, and the 
NEW flag will just be silly.

If you care about future compatibility, some sort of version specification 
would be better, e.g. "VERSION=0" (current re module), "VERSION=1" (this regex 
module), "VERSION=2" (next generation). You could then default to VERSION=0 for 
the first few releases, and potentially change to VERSION=1 some time in the 
future.

Otherwise, I suggest swapping the sense of the flag: instead of "re behaviour 
unless NEW flag is given", I'd say "re behaviour only if OLD flag is given". 
(Old semantics will, of course, remain old even when the new semantics are no 
longer new.)

(Copied from http://bugs.python.org/issue2636)

Original issue reported on code.google.com by [email protected] on 28 Aug 2011 at 6:17

syntax for beginning and end of the word?

Hi,
I just noticed the addition for beginning/end of the word and I appreciate it 
very much.
I just wanted to ask what the "m" in \m and \M stands for.
I'd have rather expected 
\< and \> (maybe because of OpenOffice), but it turns out that \m and \M are used 
in Tcl and that the other syntax isn't much more widely used either:

http://www.regular-expressions.info/refflavors.html
[I must have somehow missed that page so far; it's nice to see that regex now 
has most of the "useful" features (marked as omissions in re) and some extras 
...]

Would it be possible to alias this alternative syntax too, or are there some 
drawbacks with it, or is it considered "line noise"?
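
In engines that only offer \b, start-of-word and end-of-word positions can be approximated by combining \b with a lookaround. A minimal sketch with the standard re module; the constant names are made up for illustration, and the equivalence to regex's \m and \M is an assumption based on their description as start and end of a word.

```python
import re

WORD_START = r"\b(?=\w)"   # boundary with a word character to the right
WORD_END = r"\b(?<=\w)"    # boundary with a word character to the left

text = "ab cd"
starts = [m.start() for m in re.finditer(WORD_START, text)]
ends = [m.start() for m in re.finditer(WORD_END, text)]

print(starts, ends)
```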


Just an aside: Am I supposed to be able to set the issue type (such as Feature 
request etc.), or is it always "Defect", which could be changed by the 
administrators later?

Thanks and regards,
          vbr

Original issue reported on code.google.com by [email protected] on 6 Aug 2011 at 11:46

HG history makes no sense

What steps will reproduce the problem?
1. hg clone
2. hg log
3. say aloud "whut"

As far as I can tell there are two parallel histories on the "default" branch. 
Google Code's commit browser shows only one of these histories. `hg log` shows 
both histories interleaved, one commit after the other, producing a single 
nonsensical history. hgview shows these histories in parallel, but still 
asserts that both are on the same branch. Is there a problem with the 
repository? Am I doing something dumb?

I'd like to know with certainty the latest version of this project and the 
history that led to it.

Original issue reported on code.google.com by [email protected] on 1 Nov 2011 at 11:32

fuzzy patterns in negative lookarounds - case sensitivity difference

Hi,
recently I used the fuzzy matching capability to detect some misspellings; I 
used lookarounds to filter out the correct forms.
In some cases I saw some differences I can't understand, it appears, there may 
be some correlation with the i and V0/V1 flags. (using regex 0.1.20111014; py 
2.7.2; win XP)

# caseless negative lookbehind in V1 somehow doesn't filter out e==1 matches
>>> regex.findall(r"(?iV1)\m(?:word){e<=3}\M(?<!\m(?:word){e<=1}\M)", "word word2 word word3 word word234 word23 word")
['word2', 'word3', 'word234', 'word23']

# while in case-sensitive mode this works as expected
>>> regex.findall(r"(?V1)\m(?:word){e<=3}\M(?<!\m(?:word){e<=1}\M)", "word word2 word word3 word word234 word23 word")
['word234', 'word23']
>>> 
>>> 
# the (hopefully) equivalent lookaheads both work the same 
>>> regex.findall(r"(?iV1)(?!\m(?:word){e<=1}\M)\m(?:word){e<=3}\M", "word word2 word word3 word word234 word23 word")
['word234', 'word23']
>>> regex.findall(r"(?V1)(?!\m(?:word){e<=1}\M)\m(?:word){e<=3}\M", "word word2 word word3 word word234 word23 word")
['word234', 'word23']
>>> 

# the original lookbehinds above both work with the V0 flag 
>>> regex.findall(r"(?V0)\m(?:word){e<=3}\M(?<!\m(?:word){e<=1}\M)", "word word2 word word3 word word234 word23 word")
['word234', 'word23']
>>> regex.findall(r"(?iV0)\m(?:word){e<=3}\M(?<!\m(?:word){e<=1}\M)", "word word2 word word3 word word234 word23 word")
['word234', 'word23']
>>> 

Unless I have made some stupid mistake, the patterns should match the variants 
of "word" with more than 1 and at most 3 errors. Are there some other aspects to 
consider when using lookarounds with fuzzy patterns?

BTW, a probably silly idea originated from this kind of searches - how about 
adding support for only erroneous matches via ... {1<=e<=3} ?
(I see, it could get quite complex with some more sophisticated error costs 
arithmetics.)
However, I am glad, the same could be done with lookarounds.

(As a side note, is this kind of possible bug report appropriate for a  
separate issue, or would it have been better within the "approximate matching" 
issue?)

regards,
   vbr

Original issue reported on code.google.com by [email protected] on 3 Nov 2011 at 5:17

regex.compile("qu", flags=regex.I|regex.V1) doesn't match "qu"

What steps will reproduce the problem?
$ python3
Python 3.2.2 (default, Dec 23 2011, 15:22:48) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> r = regex.compile("qu", flags=regex.I|regex.V1)
>>> r
regex.Regex('qu', flags=regex.F | regex.I | regex.V1)
>>> print(r.match("qu"))
None
>>> 

What is the expected output? What do you see instead?
>>> import regex
>>> r = regex.compile("qu", flags=regex.I|regex.V1)
>>> print(r.match("qu").group(0))
qu

What version of the product are you using? On what operating system?
Mac OS X 10.6.8
Python 3.2.2
regex 0.1.20120103

Please provide any additional information below.

I can't reproduce this issue by following steps
>>> import regex
>>> r = regex.compile("qu")
>>> print(r.match("qu").group(0))
qu
>>> r = regex.compile("qu", flags=regex.I)
>>> print(r.match("qu").group(0))
qu
>>> r = regex.compile("qu", flags=regex.V1)
>>> print(r.match("qu").group(0))
qu

Original issue reported on code.google.com by [email protected] on 3 Jan 2012 at 4:48
