jamadden / mrab-regex-hg
Automatically exported from code.google.com/p/mrab-regex-hg
What steps will reproduce the problem?
>>> import regex
>>> p = regex.compile(ur'(?fi)\L<keywords>', keywords=['post','pos'])
>>> p.findall(u'POST, Post, post, poſt, poﬆ, and poﬅ')
What is the expected output? What do you see instead?
Expected:
[u'POST', u'Post', u'post', u'po\u017ft', u'po\ufb06', u'po\ufb05']
Got:
[u'POST', u'Post', u'post', u'po\u017ft']
What version of the product are you using? On what operating system?
regex.__version__ == '2.4.0'
sys.version_info == (2, 6, 5, 'final', 0)
platform.platform() == 'Linux-3.0.0-15-generic-x86_64-with-Ubuntu-11.10-oneiric'
Please provide any additional information below.
>>> p = regex.compile(ur'(?fi)pos|post')
>>> p.findall(u'POST, Post, post, poſt, poﬆ, and poﬅ')
[u'POS', u'Pos', u'pos', u'po\u017f']
>>> p = regex.compile(ur'(?fi)post|pos')
>>> p.findall(u'POST, Post, post, poſt, poﬆ, and poﬅ')
[u'POST', u'Post', u'post', u'po\u017ft']
>>> p = regex.compile(ur'(?fi)post|another')
>>> p.findall(u'POST, Post, post, poſt, poﬆ, and poﬅ')
[u'POST', u'Post', u'post', u'po\u017ft', u'po\ufb06', u'po\ufb05']
Original issue reported on code.google.com by [email protected]
on 26 Jan 2012 at 2:50
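A possible workaround while the named-list case is open, sketched with only the standard library (Python 3) rather than regex: case-fold the text with str.casefold() and try longer keywords first. The helper name find_keywords is made up for this note. Because folding can change the length of the string, the hits returned are the folded forms, not slices of the original text.

```python
import re


def find_keywords(text, keywords):
    """Return keyword hits in the case-folded text (hypothetical helper)."""
    folded = text.casefold()
    # Longest keywords first, so 'post' is tried before its prefix 'pos'.
    ordered = sorted((k.casefold() for k in keywords), key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered))
    return pattern.findall(folded)


# Long s (U+017F) and the ligatures U+FB06 and U+FB05, written as escapes.
sample = "POST, Post, post, po\u017ft, po\ufb06, and po\ufb05"
print(find_keywords(sample, ["post", "pos"]))
```

All six words fold to 'post' here, so the order in which the keywords are supplied no longer matters.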
What steps will reproduce the problem?
>>> import regex
>>> print(regex.search("^(a|)\\1{2}b", "b"))
None
What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("^(a|)\\1{2}b", "b"))
<_regex.Match object at 0x00C09B10>
>>> print(regex.search("^(a|)\\1{2}b", "b").group(0,1))
('b', '')
>>>
What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120112
Please provide any additional information below.
regex 0.1.20120103 has this issue too
Original issue reported on code.google.com by [email protected]
on 14 Jan 2012 at 1:32
What steps will reproduce the problem?
>>> import regex
>>> print(regex.search("^(?=ab(de))(abd)(e)", "abde").groups())
(None, 'abd', 'e')
>>>
What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("^(?=ab(de))(abd)(e)", "abde").groups())
('de', 'abd', 'e')
>>>
What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120112
Please provide any additional information below.
regex ver.0.1.20120103 is OK.
Original issue reported on code.google.com by [email protected]
on 14 Jan 2012 at 12:26
re.search('Derde\s*:', 'aaaaaa:\nDerde:')
matches on Python 2.6.5 (Linux x86_64) using the standard re module, but doesn't
match using regex from trunk rev b05186807a.
This one is very weird; for example:
re.search('Derde\s*:', 'aaaaa:\nDerde:')
matches just fine using both modules.
Original issue reported on code.google.com by [email protected]
on 2 Jan 2011 at 7:18
Issue 1366311 talks about releasing the GIL in the "re" module, but points to a
problem which could occur with scanner objects which are shared across threads.
Unfortunately this module has just that problem. :-(
I'm currently attempting to fix it, but it's proving to be surprisingly
difficult. When the iterator terminates in a thread, it's being collected,
which causes a failure later when another thread tries to use it.
Revision 4dc5f7f181 didn't fix it.
Original issue reported on code.google.com by [email protected]
on 14 Mar 2011 at 9:18
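Until this is resolved inside the module, a caller can avoid the problem by serialising access to a shared iterator. This is a minimal sketch using the standard re and threading modules (Python 3), not the fix in the C code. LockedIterator and count_matches_concurrently are names invented for this note.

```python
import re
import threading


class LockedIterator:
    """Serialise next() calls on an iterator shared between threads."""

    def __init__(self, iterator):
        self._iterator = iterator
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # Only one thread at a time may advance the underlying iterator.
        with self._lock:
            return next(self._iterator)


def count_matches_concurrently(pattern, text, workers=4):
    """Consume one finditer() iterator from several threads; return the total."""
    shared = LockedIterator(re.finditer(pattern, text))
    counts = [0] * workers

    def consume(slot):
        for _ in shared:
            counts[slot] += 1

    threads = [threading.Thread(target=consume, args=(i,)) for i in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(counts)


print(count_matches_concurrently(r"\w+", "a b c d e f g h"))  # 8
```

The wrapper keeps a reference to the iterator for as long as any thread holds the wrapper, so the iterator is not collected when one thread finishes early.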
What steps will reproduce the problem?
>>> import regex
>>> print(regex.search("(a*)*", "a", flags=regex.V1).span(1))
(0, 1)
>>>
What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("(a*)*", "a", flags=regex.V1).span(1))
(1, 1)
>>>
What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120119
Please provide any additional information below.
The following works fine:
>>> import regex
>>> print(regex.search("(a*)*", "aa").span(1))
(2, 2)
>>> print(regex.search("(a*)*", "aaa").span(1))
(3, 3)
Original issue reported on code.google.com by [email protected]
on 20 Jan 2012 at 11:34
What steps will reproduce the problem?
>>> import regex
>>> regex.compile("a#comment\n*", flags=regex.X)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
return _compile(pattern, flags, kwargs)
File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
parsed = parse_pattern(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 339, in parse_pattern
branches = [parse_sequence(source, info)]
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 358, in parse_sequence
item = parse_item(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 368, in parse_item
element = parse_element(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 709, in parse_element
raise error("nothing to repeat")
_regex_core.error: nothing to repeat
>>>
What is the expected output? What do you see instead?
>>> import regex
>>> regex.compile("a#comment\n*", flags=regex.X)
regex.Regex('a#comment\n*', flags=regex.X | regex.V0)
>>> regex.search("a#comment\n*", "aaa", flags=regex.X).group(0)
'aaa'
>>>
What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120122
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 22 Jan 2012 at 9:02
First, many thanks for adding support for recursive patterns to regex!
While playing with this feature, mainly in the context of balancing
substrings, I found a discrepancy I can't understand. (I suspect this
issue ought to be of the type "question" rather than a bug report, but anyway.)
I am trying a pattern for substrings that may be parenthesised multiple times;
a version with an atomic group works as I would expect:
>>> regex.findall(r"\((?:(?>[^()]+)|(?R))*\)", "a(bcd(e)f)g(h)")
['(bcd(e)f)', '(h)']
However, using a normal group the result is:
>>> regex.findall(r"\((?:([^()]+)|(?R))*\)", "a(bcd(e)f)g(h)")
['f', 'h']
>>>
I confess I am not fully able to trace the backtracking (or non-backtracking)
behaviour in detail, but I can't understand how there can be a match without
the outer "(" and ")", which appear to be an unconditional part of the pattern.
Or is something other than the whole match possibly returned from findall()?
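On the findall() question: with the standard re module, findall() returns the contents of the capture group instead of the whole match whenever the pattern contains exactly one group, and a repeated group reports only its last capture. The sketch below uses re (Python 3) and a simplified, non-recursive pattern. That regex follows the same rule, which would account for ['f', 'h'] above, is an assumption here.

```python
import re

text = "x(a-b)y"

# One capture group: findall() returns that group's last capture per match.
with_group = re.findall(r"\((?:(\w)|-)*\)", text)
print(with_group)      # ['b']

# No capture group: findall() returns the whole match.
without_group = re.findall(r"\((?:\w|-)*\)", text)
print(without_group)   # ['(a-b)']

# finditer() gives access to the whole match regardless of groups.
whole_matches = [m.group(0) for m in re.finditer(r"\((?:(\w)|-)*\)", text)]
print(whole_matches)   # ['(a-b)']
```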
Regarding the above examples - I am trying to find a pattern for nested
elements - parentheses in this simplified example - which would work for
balanced "subexpressions" in both directions, so that the longest balanced part
would be found.
e.g.:
>>> regex.findall(r"\((?:(?>[^()]+)|(?R))*\)", "a(b(cd)e)f)g)h")
['(b(cd)e)']
works OK; the superfluous closing parentheses are ignored;
however:
>>> regex.findall(r"\((?:(?>[^()]+)|(?R))*\)", "a(bc(d(e)f)gh")
[]
Is a reversed search the right way to ignore the superfluous opening
parentheses?
>>> regex.findall(r"(?r)\((?:(?>[^()]+)|(?R))*\)", "a(bc(d(e)f)gh")
['(d(e)f)']
>>>
Or is there maybe a pattern which would match in both cases unchanged?
(The next step would be to find the *non*-balanced elements, if that is
possible at all using only regex.)
(the above tests on Win XP, Python 2.7.2, regex-0.1.20120105)
Thanks and regards,
vbr
Original issue reported on code.google.com by [email protected]
on 12 Jan 2012 at 11:29
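For comparison, the "longest balanced part in both directions" can also be found without a regular expression. The following is a minimal sketch in plain Python 3 using a stack of open-bracket positions. balanced_spans is a name invented for this note, and it is a swapped-in technique, not something regex does internally.

```python
def balanced_spans(text, open_ch="(", close_ch=")"):
    """Return the outermost balanced substrings, skipping stray brackets."""
    stack = []   # indices of opening brackets not yet closed
    pairs = []   # (start, end) indices of matched bracket pairs
    for index, ch in enumerate(text):
        if ch == open_ch:
            stack.append(index)
        elif ch == close_ch and stack:
            pairs.append((stack.pop(), index))
        # A closing bracket seen while the stack is empty is stray: ignore it.
    # Openers still on the stack at the end are stray as well.
    pairs.sort()
    result = []
    last_end = -1
    for start, end in pairs:
        if start > last_end:          # not nested inside the previous span
            result.append(text[start:end + 1])
            last_end = end
    return result


print(balanced_spans("a(bcd(e)f)g(h)"))   # ['(bcd(e)f)', '(h)']
print(balanced_spans("a(b(cd)e)f)g)h"))   # ['(b(cd)e)']
print(balanced_spans("a(bc(d(e)f)gh"))    # ['(d(e)f)']
```

The same loop could also report the unbalanced elements: the stray closing brackets it skips and whatever is left on the stack at the end.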
I just found a chapter in the Unicode guidelines for regular expressions
concerning property values, and would like to ask whether supporting regex
patterns (or some subset thereof) there would be possible.
http://unicode.org/reports/tr18/#Wildcard_Properties
I am not aware of any implementation already supporting it, nor do I know how
much extra complexity it would add to the parser, but it looks like a
feature orthogonal to the Unicode properties and the set operations which regex
already has.
I see that the use cases would be rather specialised; in my case it
would allow for investigating the Unicode character repertoire itself.
(Currently I can do something like that after grabbing all the character names
via unicodedata.)
Otherwise, on "normal" text, it could cover the cases where multiple
character ranges should be considered (i.e. basic xxx, xxx
supplement, xxx extended ...). (I am not sure how the current Script property
relates to this exactly.)
E.g. some errors or even spoofing attempts might be checked for on graphically
similar characters from different ranges, cf.
o (dec: 111; hex: 0x6f) LATIN SMALL LETTER O
ο (dec: 959; hex: 0x3bf) GREEK SMALL LETTER OMICRON
о (dec: 1086; hex: 0x43e) CYRILLIC SMALL LETTER O
օ (dec: 1413; hex: 0x585) ARMENIAN SMALL LETTER OH
I'd like to stress that this is only meant as a proposal for consideration; it
surely wouldn't be worth extensive effort or the risk of becoming a source of
bugs.
Regards
Vlastimil Brom
Original issue reported on code.google.com by [email protected]
on 15 Sep 2011 at 3:51
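The unicodedata approach mentioned above can be sketched as follows (Python 3, standard library only). chars_with_name_matching is a name invented for this note, and the default range is limited to code points below U+0600 only to keep the scan short.

```python
import re
import unicodedata


def chars_with_name_matching(pattern, start=0, stop=0x0600):
    """Yield characters in [start, stop) whose Unicode name matches pattern."""
    name_re = re.compile(pattern)
    for codepoint in range(start, stop):
        ch = chr(codepoint)
        name = unicodedata.name(ch, "")   # "" for code points without a name
        if name and name_re.search(name):
            yield ch


lookalikes = list(chars_with_name_matching(r"SMALL LETTER (O|OMICRON|OH)$"))
for ch in lookalikes:
    print(f"{ch} (dec: {ord(ch)}; hex: {ord(ch):#x}) {unicodedata.name(ch)}")
```

This emulates a wildcard on the Name property only; a wildcard over other property values would need data that unicodedata does not expose in this form.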
Hi,
I hope I am not missing anything trivial. I just noticed a behaviour of the
LOCALE flag I can't understand; however, both re and regex behave the same in
this respect:
I thought the search pattern (?L)\w would match any of the respective
string.letters according to the current locale (and possibly additionally
[0-9_]).
However, the locale doesn't seem to be reflected.
>>> unicode_BMP = " " + "".join(unichr(i)for i in range(1, 0xFFFF+1))
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> print unicode(string.letters, "windows-1250")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzŠŚŤŽŹšśťžźŁĄŞŻ
łµąşĽľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖŘŮÚŰÜÝŢßŕá
âăäĺćçčéęëěíîďđńňóôőöřůúűüýţ
>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print unicode(string.letters, "windows-1253")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒΆµΈΉΊΌΎΏΐΑΒΓΔ
ΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμν
ξοπρςστυφχψωϊϋόύώ
>>>
>>> import re
>>> import regex
>>> locale.setlocale(locale.LC_ALL, "")
'Czech_Czech Republic.1250'
>>> print "".join(re.findall(r"(?L)\w", unicode_BMP))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz���������
��£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛ�
�ÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
>>> print "".join(regex.findall(r"(?L)\w", unicode_BMP))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz���������
��£¥ª¯³µ¹º¼¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛ�
�ÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ
>>>
>>> locale.setlocale(locale.LC_ALL, "Greek")
'Greek_Greece.1253'
>>> print "".join(re.findall(r"(?L)\w", unicode_BMP))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz�¢²³µ¸¹º�
�¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäå�
�çèéêëìíîïðñòóôõö÷øùúûüýþ
>>> print "".join(regex.findall(r"(?L)\w", unicode_BMP))
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz�¢²³µ¸¹º�
�¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäå�
�çèéêëìíîïðñòóôõö÷øùúûüýþ
>>>
It seems that the nearest letter set to the result of the re/regex LOCALE flag
might be ASCII or the US locale:
>>> locale.setlocale(locale.LC_ALL, "US")
'English_United States.1252'
>>> print unicode(string.letters, "windows-1252")
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzƒŠŒŽšœžŸªµºÀÁÂ
ÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêë
ìíîïðñòóôõöøùúûüýþÿ
>>>
however, there are some differences too, namely between z� and À
re/regex (?L)\w :
z�¢²³µ¸¹º¼¾¿À (as displayed in wxPython shell)
z����������£¥ª¯³µ¹º¼¾¿À (as displayed in tkinter Idle
shell)
(in either case, there are some items one wouldn't consider usual word
characters, cf. ¿)
US string.letters:
zƒŠŒŽšœžŸªµºÀ (displayed identically in both shells)
There are likely some other issues (like some encoding/display peculiarities
in wx and Tkinter), but regex matching with the LOCALE flag clearly doesn't
reflect locale.setlocale(...).
Is it supposed to work this way, and is there another way to get the
expected locale-aware matching?
using Python 2.7.1, 32 bit; win 7 Home Premium 64-bit, Czech;
regex-0.1.20110315.
Regards,
Vlastimil Brom
Original issue reported on code.google.com by [email protected]
on 2 Apr 2011 at 9:41
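One way to narrow this down is to test byte strings in the locale's own encoding instead of Unicode text. The sketch below is for Python 3, where re.LOCALE is only accepted with bytes patterns; it lists which single-byte values \w matches under the active locale. The helper name locale_word_bytes is invented for this note, and the result depends entirely on the platform's locale data.

```python
import locale
import re


def locale_word_bytes():
    """Return the single-byte values that \\w matches under re.LOCALE."""
    all_bytes = bytes(range(256))
    return re.findall(rb"\w", all_bytes, flags=re.LOCALE)


try:
    locale.setlocale(locale.LC_ALL, "")   # adopt the user's default locale
except locale.Error:
    pass                                  # keep whatever locale is active

matched = locale_word_bytes()
print(len(matched), b"".join(matched))
```

Decoding the joined bytes with the locale's code page (for example "windows-1250") shows which letters were actually recognised.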
What steps will reproduce the problem?
>>> import regex
>>> print(regex.search("((?i)blah)\\s+\\1", "blah BLAH").group(0,1))
('blah BLAH', 'blah')
>>> print(regex.search("((?i)blah)\\s+\\1", "blah BLAH"))
<_regex.Match object at 0x00C0BBB8>
>>>
What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("((?i)blah)\\s+\\1", "blah BLAH"))
None
>>>
What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120114
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 15 Jan 2012 at 7:03
With the new named lists feature, a compiled pattern may contain both the
pattern string, and references to one or more lists. This means that the
standard way of composing patterns, concatenating the .pattern attribute of
compiled patterns with other text, won't "just work".
Let me suggest an API modification to get around this. Instead of the
"pattern" parameter to regex.compile() being a string, allow it to be either a
string, or a sequence of objects, each of which must be either a string or a
compiled pattern. The elements of the sequence are concatenated to generate
the new pattern string, and list references in the compiled pattern elements of
the sequence are transferred into the new compiled pattern. Duplicate names
for list references could be an error; or, they could be handled automatically
by name-mangling the list names within the aggregate compiled pattern.
Original issue reported on code.google.com by [email protected]
on 29 Jun 2011 at 4:24
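The proposal can be sketched in plain Python 3 without touching regex itself. PatternPart and compose are hypothetical names, not part of the module; the sketch only merges pattern text and named lists, and treats a conflicting duplicate name as an error (the first of the two options suggested above).

```python
from dataclasses import dataclass, field


@dataclass
class PatternPart:
    """A pattern fragment plus the named lists it refers to (hypothetical)."""
    source: str
    named_lists: dict = field(default_factory=dict)


def compose(parts):
    """Concatenate strings and PatternPart objects into one PatternPart."""
    sources = []
    merged = {}
    for part in parts:
        if isinstance(part, str):
            sources.append(part)
            continue
        sources.append(part.source)
        for name, items in part.named_lists.items():
            # The same name with different contents cannot be merged safely.
            if name in merged and merged[name] != items:
                raise ValueError(f"conflicting named list: {name!r}")
            merged[name] = items
    return PatternPart("".join(sources), merged)


keywords = PatternPart(r"\L<keywords>", {"keywords": ["post", "pos"]})
combined = compose([r"\b", keywords, r"\b"])
print(combined.source, combined.named_lists)
```

The result could then be handed to the existing API as regex.compile(combined.source, **combined.named_lists), since named lists are already passed as keyword arguments.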
I just encountered a corner case while using the non-capturing group (?: ...),
which leads to a recursion error. The group I tested contains character
sets of surrogate characters; I am not sure whether this is relevant to the
problem.
The same pattern works as expected with normal parentheses and without any
grouping, cf.
>>> regex.findall(ur"(?s)(?:[\ud800-\udbff][\udc00-\udfff])", u"a𐀀bcdefg")
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "regex.pyc", line 253, in findall
File "regex.pyc", line 394, in _compile
File "_regex_core.pyc", line 2187, in fix_groups
File "_regex_core.pyc", line 2187, in fix_groups
File "_regex_core.pyc", line 2187, in fix_groups
...
File "_regex_core.pyc", line 2187, in fix_groups
File "_regex_core.pyc", line 2187, in fix_groups
File "_regex_core.pyc", line 2187, in fix_groups
File "_regex_core.pyc", line 2187, in fix_groups
RuntimeError: maximum recursion depth exceeded
>>> regex.findall(ur"(?s)[\ud800-\udbff][\udc00-\udfff]", u"a𐀀bcdefg")
[u'\U00010000']
>>> regex.findall(ur"(?s)([\ud800-\udbff][\udc00-\udfff])", u"a𐀀bcdefg")
[u'\U00010000']
>>>
re can handle all these patterns normally:
>>> re.findall(ur"(?s)([\ud800-\udbff][\udc00-\udfff])", u"a𐀀bcdefg")
[u'\U00010000']
>>> re.findall(ur"(?s)[\ud800-\udbff][\udc00-\udfff]", u"a𐀀bcdefg")
[u'\U00010000']
>>> re.findall(ur"(?s)([\ud800-\udbff][\udc00-\udfff])", u"a𐀀bcdefg")
[u'\U00010000']
>>>
Using regex-0.1.20110616, win 7, python 2.7.2 (the official, narrow unicode
build, of course)
vbr
Original issue reported on code.google.com by [email protected]
on 22 Jun 2011 at 10:38
When I try to import regex I get the following import error:
dlopen(<snipped>/python2.5/site-packages/_regex.so, 2): Symbol not found:
_re_is_same_char_ign
Referenced from: <snipped>/python2.5/site-packages/_regex.so
Expected in: dynamic lookup
Changing setup.py line 57 like below fixes this:
ext_modules=[Extension('_regex', [os.path.join(PKG_BASE, '_regex.c'),
os.path.join(PKG_BASE, '_regex_unicode.c')])]
Original issue reported on code.google.com by [email protected]
on 10 May 2011 at 2:14
First, thanks for the new release (regex 0.1.20110917) ! (I especially like the
changed fuzzy matching behaviour as discussed in
http://code.google.com/p/mrab-regex-hg/issues/detail?id=12#c28 )
I'd like to ask about the specification of the case-folding behaviour used in
case-insensitive matching.
Is it chapter 5.18 of the Unicode standard?
http://www.unicode.org/versions/Unicode6.0.0/ch05.pdf
Or did I miss something else?
I tried some patterns which I thought would be "caselessly" equivalent
(based on the above):
>>> for m in regex.findall(ur"(?V1i)[ΣΟ]",u"ρς στΣόο"): print m,
...
σ Σ ο
Here I'd have thought the accented lowercase omicron or the positional lowercase
sigma variant would be matched too.
On the other hand, the sharp s (which is more frequent in my texts) seems to be
matched in all directions:
>>> for m in regex.findall(ur"(?V1i)ẞ",u"-s-S-ss-SS-ß-ẞ-"): print m,
...
ss SS ß ẞ
>>> for m in regex.findall(ur"(?V1i)ss",u"-s-S-ss-SS-ß-ẞ-"): print m,
...
ss SS ß ẞ
>>>
I thought that only the changes in case should be reflected in matching; now
there is effectively an equivalence between lowercase ss and ß, which is
not (at least not always) what is expected. (This holds both with respect to
the current German orthography and for dealing with text preceding that
official orthography regulation.)
Is there now some way to handle these characters as distinct (other than not
using the i flag)?
Where can I find the specification for this behaviour? It seems that I
will need to reflect it in the search patterns.
I can't comment competently on the behaviour of the "prominent" case of the
Turkic "i"s; personally I believe there must be other comparable cases, once
we begin to care about them... I'd support the view of some contributors in the
respective py-list thread (
http://mail.python.org/pipermail/python-list/2011-September/1280544.html )
that such cases are better dealt with individually, on an application basis, if
need be. I'd just prefer keeping the flags repertoire shorter, if possible. :-)
Regards,
Vlastimil Brom
Original issue reported on code.google.com by [email protected]
on 17 Sep 2011 at 12:37
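The results above resemble Unicode full case folding, which can be inspected from Python 3 with str.casefold() (the Python documentation refers to section 3.13 of the Unicode Standard for the algorithm). Whether regex uses exactly this folding is an assumption; the sketch only shows what full case folding does to the characters in question. strip_accents is a helper invented for this note.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Full case folding maps both sharp-s forms (U+00DF, U+1E9E) to "ss".
print("\u00df".casefold(), "\u1e9e".casefold())   # ss ss
# Final sigma (U+03C2) folds to the ordinary small sigma (U+03C3).
print("\u03c2".casefold() == "\u03c3")            # True
# Omicron with tonos (U+03CC) keeps its accent: folding only affects case.
print("\u03cc".casefold() == "\u03bf")            # False
print(strip_accents("\u03cc") == "\u03bf")        # True
```

Under this folding the accented omicron is not expected to match a plain one, while the final sigma would be; an accent-insensitive comparison has to strip the marks separately.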
Hi,
I just found one inconsistency of regex against re in the handling of sets (it
might depend on the newest addition of set operations).
I thought a pattern like "[][]" would be legal (although probably not very
readable). It does work in re, but in regex it causes a "bad set" error:
>>> print re.sub(r"([][])", r"-", u"a[b]c")
a-b-c
>>> print regex.sub(r"([][])", r"-", u"a[b]c")
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "regex.pyc", line 194, in sub
File "regex.pyc", line 334, in _compile
File "_regex_core.pyc", line 243, in _parse_pattern
File "_regex_core.pyc", line 257, in _parse_sequence
File "_regex_core.pyc", line 270, in _parse_item
File "_regex_core.pyc", line 369, in _parse_element
File "_regex_core.pyc", line 503, in _parse_paren
File "_regex_core.pyc", line 243, in _parse_pattern
File "_regex_core.pyc", line 257, in _parse_sequence
File "_regex_core.pyc", line 270, in _parse_item
File "_regex_core.pyc", line 382, in _parse_element
File "_regex_core.pyc", line 924, in _parse_set
File "_regex_core.pyc", line 933, in _parse_set_union
File "_regex_core.pyc", line 943, in _parse_set_symm_diff
File "_regex_core.pyc", line 950, in _parse_set_inter
File "_regex_core.pyc", line 957, in _parse_set_diff
File "_regex_core.pyc", line 971, in _parse_set_imp_union
File "_regex_core.pyc", line 978, in _parse_set_member
File "_regex_core.pyc", line 1046, in _parse_set_item
File "_regex_core.pyc", line 933, in _parse_set_union
File "_regex_core.pyc", line 943, in _parse_set_symm_diff
File "_regex_core.pyc", line 950, in _parse_set_inter
File "_regex_core.pyc", line 957, in _parse_set_diff
File "_regex_core.pyc", line 971, in _parse_set_imp_union
File "_regex_core.pyc", line 978, in _parse_set_member
File "_regex_core.pyc", line 1048, in _parse_set_item
error: bad set
It can be easily remedied (after I found the problem in a more complex pattern)
by escaping the square brackets:
>>> print re.sub(r"([\]\[])", r"-", u"a[b]c")
a-b-c
>>> print regex.sub(r"([\]\[])", r"-", u"a[b]c")
a-b-c
>>>
Using regex-0.1.20110510 python 2.7.1, Win XP
regards,
vbr
Original issue reported on code.google.com by [email protected]
on 18 May 2011 at 11:09
What steps will reproduce the problem?
python3
Python 3.2.2 (default, Dec 23 2011, 15:22:48)
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> regex.compile("^((?>\w+)|(?>\s+))*$", flags=regex.V1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 289, in compile
return _compile(pattern, flags, kwargs)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 452, in _compile
parsed = parsed.optimise(info)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2805, in optimise
s = s.optimise(info)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2450, in optimise
subpattern = self.subpattern.optimise(info)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2514, in optimise
subpattern = self.subpattern.optimise(info)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1813, in optimise
prefix, branches = Branch._split_common_prefix(info, branches)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1903, in _split_common_prefix
while pos < end_pos and prefix[pos].can_be_affix() and all(a[pos] ==
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1731, in can_be_affix
return all(s.can_be_affix() for s in self.subpattern)
TypeError: 'GreedyRepeat' object is not iterable
>>> try:
... regex.compile("^((?>\w+)|(?>\s+))*$")
... except regex.error:
... print("Wrong regexp!")
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 289, in compile
return _compile(pattern, flags, kwargs)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 452, in _compile
parsed = parsed.optimise(info)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2805, in optimise
s = s.optimise(info)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2450, in optimise
subpattern = self.subpattern.optimise(info)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 2514, in optimise
subpattern = self.subpattern.optimise(info)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1813, in optimise
prefix, branches = Branch._split_common_prefix(info, branches)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1903, in _split_common_prefix
while pos < end_pos and prefix[pos].can_be_affix() and all(a[pos] ==
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1731, in can_be_affix
return all(s.can_be_affix() for s in self.subpattern)
TypeError: 'GreedyRepeat' object is not iterable
>>>
What is the expected output? What do you see instead?
>>> import regex
>>> regex.compile("^((?>\w+)|(?>\s+))*$", flags=regex.V1)
regex.Regex('^((?>\\w+)|(?>\\s+))*$', flags=regex.F | regex.V1)
>>> try:
... regex.compile("^((?>\w+)|(?>\s+))*$")
... except regex.error:
... print("Wrong regexp!")
...
Wrong regexp!
>>>
What version of the product are you using? On what operating system?
Mac OS X 10.6.8
Python 3.2.2
regex 0.1.20111223
Please provide any additional information below.
# The case of Python 3.2.2 standard re module
>>> import re
>>> try:
... re.compile("^((?>\w+)|(?>\s+))*$")
... except re.error:
... print("Wrong regexp!")
...
Wrong regexp!
>>>
Original issue reported on code.google.com by [email protected]
on 3 Jan 2012 at 2:19
What steps will reproduce the problem?
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> print(regex.search("([\da-f:]+)$", "E", regex.I|regex.V1))
None
>>>
What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("([\da-f:]+)$", "E", regex.I|regex.V1))
<_regex.Match object at 0x00C09B10>
>>> print(regex.search("([\da-f:]+)$", "E", regex.I|regex.V1).group(0))
E
>>>
What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2
regex 0.1.20120112
Please provide any additional information below.
# "e" is ok.
>>> import regex
>>> print(regex.search("([\da-f:]+)$", "e", regex.I|regex.V1))
<_regex.Match object at 0x00C09B10>
>>> print(regex.search("([\da-f:]+)$", "e", regex.I|regex.V1).group(0))
e
>>>
Original issue reported on code.google.com by [email protected]
on 13 Jan 2012 at 11:58
In order to parse certain strings belonging to certain kinds of languages,
recursive patterns for recursive matching are necessary. For example, if you
want to parse and capture a parenthesized chunk of text which may contain
nested parentheses, you'll need recursion.
Perl already supports recursive matching, as does PCRE, and with it many
libraries for languages such as PHP. I always wondered why the Python regex
engines never did.
Please refer to http://perldoc.perl.org/perlre.html for documentation on the
widely accepted syntax for recursive patterns.
Original issue reported on code.google.com by [email protected]
on 23 Dec 2011 at 2:56
While trying to test some of the recently listed properties supported by regex,
it appears to me that the negated properties don't work in case-insensitive
search, cf.:
>>> regex.findall(ur"(?i)\P{InBasicLatin}",u"aáb")
[u'a', u'b']
>>> regex.findall(ur"(?i)\p{InBasicLatin}",u"aáb")
[u'a', u'b']
>>>
>>> regex.findall(ur"\P{InBasicLatin}",u"aáb")
[u'\xe1']
>>> regex.findall(ur"\p{InBasicLatin}",u"aáb")
[u'a', u'b']
>>>
as if the negated property escape \P were somehow taken as lowercase (?)
Some other escapes don't seem to be affected, e.g.
>>> regex.findall(ur"\s",u"a b\tcd")
[u' ', u'\t']
>>> regex.findall(ur"\S",u"a b\tcd")
[u'a', u'b', u'c', u'd']
>>>
>>> regex.findall(ur"(?i)\s",u"a b\tcd")
[u' ', u'\t']
>>> regex.findall(ur"(?i)\S",u"a b\tcd")
[u'a', u'b', u'c', u'd']
>>>
works as expected.
Regards,
vbr
Original issue reported on code.google.com by [email protected]
on 28 Sep 2011 at 8:12
What steps will reproduce the problem?
>>> import regex
>>> print(regex.search("(?>.*/)b", "a/b"))
None
>>>
What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("(?>.*/)b", "a/b"))
<_regex.Match object at 0x00C0BBB8>
>>> print(regex.search("(?>.*/)b", "a/b").group(0))
a/b
>>>
What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120114
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 15 Jan 2012 at 6:40
Well, it doesn't really count as normal usage of the regex module, but I tried
(again) to toy with its data structures, e.g. to get some Unicode properties
not currently available in unicodedata.
(It is not that practical, as I can just grab the Unicode data files for
searching, but I also wanted to dig into the (Python) source of regex a bit...)
So far, I could make a check for a Unicode property (hopefully :-) work;
is it just e.g.:
>>> _regex.has_property_value((_regex.get_properties()["SCRIPT"][0] << 16) |
_regex.get_properties()["SCRIPT"][1]["GREEK"], ord(u"Σ"))
1
>>>
?
Is it the case that there is no other access path, i.e. to get the
properties of a given character, one has to check each property for every
possible value and collect the successful matches? (Actually, it works
surprisingly fast, given how clumsy this approach is.)
(It is really a kind of exercise; I wouldn't want to ask for more comfortable
access to this data, which you already offered, as this belongs in unicodedata.)
On a related note, is it somehow possible to programmatically access the
original, non-normalised property names and values, as listed on:
http://code.google.com/p/mrab-regex-hg/wiki/UnicodeProperties
It is possible to collect the aliases belonging to each other and take the
longest ones as full forms, but the casing and spaces probably can't be
recovered, can they?
Sorry for this possibly irrelevant "issue" (as having an issue type "question"
would likely be silly...).
And, of course, many thanks for the recent enhancements and fixes!
regards,
vbr
Original issue reported on code.google.com by [email protected]
on 29 Sep 2011 at 3:56
What steps will reproduce the problem?
>>> import regex
>>> print(regex.search("(a)(?<=b(?1))", "baz", regex.V1))
None
>>>
What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("(a)(?<=b(?1))", "baz", regex.V1))
<_regex.Match object at 0x00C09C28>
>>> print(regex.search("(a)(?<=b(?1))", "baz", regex.V1).group(0))
a
>>>
What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120123
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 25 Jan 2012 at 1:56
What steps will reproduce the problem?
>>> import regex
>>> print(regex.search("(a(?(1)\\1)){4}", "a"*10, flags=regex.V1).group(0,1))
('aaaaa', 'a')
>>>
What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("(a(?(1)\\1)){4}", "a"*10, flags=regex.V1).group(0,1))
('aaaaaaaaaa', 'aaaa')
>>>
What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120123
Please provide any additional information below.
The following works fine:
>>> print(regex.search("(a(?(1)\\1)){1}", "a"*10, flags=regex.V1).group(0,1))
('a', 'a')
>>> print(regex.search("(a(?(1)\\1)){2}", "a"*10, flags=regex.V1).group(0,1))
('aaa', 'aa')
>>>
Original issue reported on code.google.com by [email protected]
on 25 Jan 2012 at 1:01
What steps will reproduce the problem?
>>> import regex
>>> print(regex.search("(\\()?[^()]+(?(1)\\)|)", "(abcd").group(0))
bcd
>>>
What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("(\\()?[^()]+(?(1)\\)|)", "(abcd").group(0))
abcd
>>>
What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120114
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 15 Jan 2012 at 7:08
All developers need to enable the EolExtension
(http://mercurial.selenic.com/wiki/EolExtension) to deal with cross-platform
line-ending conversion issues, by adding the following to their own .hgrc file:
[extensions]
eol =
This repo also needs the attached file checked in as a .hgeol file.
Original issue reported on code.google.com by [email protected]
on 14 Mar 2011 at 10:12
Attachments:
I'm currently using the TRE regex engine to match output from OCR, because it
supports approximate matching. Very useful. Would be nice to have that
capability in Python regex, as well.
Original issue reported on code.google.com by [email protected]
on 2 Jun 2011 at 8:33
What steps will reproduce the problem?
>>> import regex
>>> print(regex.search("^(a){0,0}", "abc").group(0,1))
('a', 'a')
>>>
What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("^(a){0,0}", "abc").group(0,1))
('', None)
>>>
What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120112
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 14 Jan 2012 at 4:06
What steps will reproduce the problem?
C:\Python32\3.2.2\Scripts>python.exe
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> print(regex.search("a(bc)d", "abcd", regex.I|regex.V1))
None
What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.search("a(bc)d", "abcd", regex.I|regex.V1))
<_regex.Match object at 0x00C09B10>
>>> print(regex.search("a(bc)d", "abcd", regex.I|regex.V1).group(0))
abcd
>>>
What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2
regex 0.1.20120112
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 13 Jan 2012 at 11:42
I just encountered an error while using nested sets in case-insensitive mode:
>>> regex.findall("(?V1)[[a-z]--[aei]]","abc")
['b', 'c']
>>> regex.findall("(?V1i)[[a-z]--[aei]]","abc")
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Python27\lib\regex.py", line 276, in findall
return _compile(pattern, flags, kwargs).findall(string, pos, endpos,
File "C:\Python27\lib\regex.py", line 499, in _compile
named_lists, named_list_indexes)
RuntimeError: invalid RE code
>>> regex.findall("(?V0i)[[a-z]--[aei]]","abc")
[]
>>>
using regex-0.1.20111004; python 2.7.2 on WinXP; I am not sure when it first
appeared, but it has probably not been there for long.
regards,
vbr
Original issue reported on code.google.com by [email protected]
on 5 Oct 2011 at 8:37
Now that other features like nested sets have been moved under VERSION1 for
compatibility with re, I am wondering how to make the regex behaviour in my
scripts default to the new VERSION1 (if possible, without adding the flag all
over the code).
It has already been explained that it can't be a module-wide setting, as this
would influence other programs or libraries using it.
Before I try some hackish approach, I wanted to ask whether some kind of
"aliasing" of the module on import could work.
I mentioned this idea in passing in
http://bugs.python.org/issue2636#msg143442 but the discussion concentrated
on other topics.
Would it be possible to have something like
import regex_version0 as re
vs:
import regex_version1 as re
while the opposite flag could still be set if needed?
Could this be achieved (without much code duplication), or are there some
drawbacks or limits?
Or are there other ways to activate the version1 behaviour in
user code "at once"?
Regards,
vbr
Original issue reported on code.google.com by [email protected]
on 17 Sep 2011 at 11:28
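One way to picture the "aliasing" idea is a thin wrapper object that ORs a default flag into every call and is then used in place of the module. The sketch below is only an illustration under stated assumptions: DefaultFlags is a hypothetical name, only three functions are wrapped, and the standard re module with re.IGNORECASE stands in for regex with its version flag, so nothing here is confirmed against regex itself.

```python
import re


class DefaultFlags:
    """Wrap a regex-like module so that every call gets some flags by default."""

    def __init__(self, module, default_flags):
        self._module = module
        self._default_flags = default_flags

    def compile(self, pattern, flags=0):
        # OR the caller's flags with the defaults before compiling.
        return self._module.compile(pattern, flags | self._default_flags)

    def search(self, pattern, string, flags=0):
        return self.compile(pattern, flags).search(string)

    def findall(self, pattern, string, flags=0):
        return self.compile(pattern, flags).findall(string)

    def __getattr__(self, name):
        # Anything not wrapped above (flag constants, error, ...) is passed through.
        return getattr(self._module, name)


# Stand-in demonstration: stdlib re with IGNORECASE switched on by default.
rx = DefaultFlags(re, re.IGNORECASE)
print(rx.findall("abc", "ABC abc"))
```

Because the wrapper is a separate object, other programs and libraries importing the real module are not affected. Note that this sketch can only add flags; switching the default off for a single call would need extra logic.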
Some background: I've been working with very large REs in CPython and
IronPython. We generate the RE pattern from lists, like lists of cities or
lists of names, somewhat like this:
namelist = open("names.txt").read().split()
pattern = re.compile("|".join(namelist))
The one I'm working with now is just a pattern for finding substrings that look
like the name of a person. It's overflowing the
System::Text::RegularExpressions buffers on IronPython, but works OK with
CPython 2.6 on 64-bit Ubuntu.
One of the things I've been thinking is that this kind of pattern should be
handled differently. Suppose there was some syntax like
pattern = re.compile("(?S<names>)", names=ImmutableSet(namelist))
where (?S indicates a named ImmutableSet, the members of that set to be drawn
from the keyword argument of that name. The compiler would generate a
reasonably fast pattern from that set, say the union of all characters in all
the strings in the set, and a max and min size based on the shortest and
longest elements of the set. When the engine runs, it would match that
fast pattern, and if it matches, it would then check to see if the matched
group is a member of the named set. If so, the match would be confirmed; if
not, it would fail.
Seems like this might be a useful feature for regex to have, given the
popularity of this kind of machine-generated RE.
Original issue reported on code.google.com by [email protected]
on 2 Jun 2011 at 7:45
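The two-stage idea described above (a cheap pattern first, then a set membership check) can be approximated with the standard re module. This is a minimal sketch, not how regex implements anything: compile_set_filter and find_members are hypothetical names, the sample names are made up, and the \b anchors assume each list entry is a single word made of word characters.

```python
import re


def compile_set_filter(words):
    """Build the cheap pre-filter: a character class over every character
    used by any word, repeated between the shortest and longest word length."""
    words = frozenset(words)
    alphabet = "".join(re.escape(ch) for ch in sorted({ch for w in words for ch in w}))
    shortest = min(len(w) for w in words)
    longest = max(len(w) for w in words)
    prefilter = re.compile(r"\b[%s]{%d,%d}\b" % (alphabet, shortest, longest))
    return prefilter, words


def find_members(text, prefilter, words):
    # A candidate is confirmed only if it is actually a member of the set.
    return [m.group() for m in prefilter.finditer(text) if m.group() in words]


prefilter, names = compile_set_filter(["Alice", "Bob", "Carol"])
print(find_members("Alice met Bob and Bobo and Carol.", prefilter, names))
```

In this example "Bobo" passes the pre-filter (all of its characters occur in the set and its length is in range) but is rejected by the membership check, which is the confirmation step the proposal describes.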
What steps will reproduce the problem?
>>> import regex
>>> regex.compile("^(?:a(?:(?:))+)+")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
return _compile(pattern, flags, kwargs)
File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
parsed = parse_pattern(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 336, in parse_pattern
branches = [parse_sequence(source, info)]
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 355, in parse_sequence
item = parse_item(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 365, in parse_item
element = parse_element(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 649, in parse_element
element = parse_paren(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 783, in parse_paren
return parse_flags_subpattern(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 998, in parse_flags_subpattern
subpattern = parse_pattern(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 336, in parse_pattern
branches = [parse_sequence(source, info)]
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 355, in parse_sequence
item = parse_item(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 365, in parse_item
element = parse_element(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 698, in parse_element
raise error("nothing to repeat")
_regex_core.error: nothing to repeat
>>>
What is the expected output? What do you see instead?
>>> import regex
>>> regex.compile("^(?:a(?:(?:))+)+")
regex.Regex('^(?:a(?:(?:))+)+', flags=regex.V0)
>>>
What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120119
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 20 Jan 2012 at 1:54
What steps will reproduce the problem?
>>> import regex
>>> regex.compile("a(?x: b c )d")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
return _compile(pattern, flags, kwargs)
File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
parsed = parse_pattern(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 339, in parse_pattern
branches = [parse_sequence(source, info)]
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 358, in parse_sequence
item = parse_item(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 368, in parse_item
element = parse_element(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 660, in parse_element
element = parse_paren(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 796, in parse_paren
return parse_flags_subpattern(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 1014, in parse_flags_subpattern
source.expect(")")
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 3439, in expect
raise error("missing {}".format(substring))
_regex_core.error: missing )
>>>
What is the expected output? What do you see instead?
>>> import regex
>>> regex.compile("a(?x: b c )d")
regex.Regex('a(?x: b c )d', flags=regex.V0)
>>> regex.search("a(?x: b c )d", "abcd").group(0)
'abcd'
>>>
What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120122
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 22 Jan 2012 at 8:51
I thought I'd try building the module, and expand the test cases a bit, but I
can't seem to find the setup.py file.
I did the
hg clone https://mrab-regex-hg.googlecode.com/hg/ mrab-regex-hg
Is there some trick to building it?
Original issue reported on code.google.com by [email protected]
on 9 Jun 2011 at 4:35
What steps will reproduce the problem?
>>> import regex
>>> regex.compile("a(?#xxx)*")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
return _compile(pattern, flags, kwargs)
File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
parsed = parse_pattern(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 336, in parse_pattern
branches = [parse_sequence(source, info)]
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 355, in parse_sequence
item = parse_item(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 365, in parse_item
element = parse_element(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 698, in parse_element
raise error("nothing to repeat")
_regex_core.error: nothing to repeat
>>>
What is the expected output? What do you see instead?
>>> regex.compile("a(?#xxx)*")
regex.Regex('a(?#xxx)*', flags=regex.V0)
>>>
What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120119
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 20 Jan 2012 at 12:21
Apologies if this is again not the right place to post this.
I'm trying to use regex 0.1.2011051 with the overlapped=True feature.
It works great, unless I have the 'start of string' (caret) character in my
regular expression:
>>> regex.findall(r"a.*b","abadalaba",overlapped=True)
['abadalab', 'adalab', 'alab', 'ab']
>>> regex.findall(r"^a.*b","abadalaba",overlapped=True)
['abadalab']
If I understand correctly, the second regexp should produce the same
results as the first one, since all the results are at the beginning of the
string.
Original issue reported on code.google.com by [email protected]
on 20 May 2011 at 11:02
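One possible reading of the two results above is that an overlapped search restarts one character after the start of the previous match, and that ^ only matches at the real start of the string. The sketch below emulates that reading with the standard re module; it is an assumption about the behaviour, not regex's actual implementation, and findall_overlapped is a hypothetical helper name.

```python
import re


def findall_overlapped(pattern, string):
    """Emulate overlapped matching: after each match, search again starting
    one character past the position where that match began."""
    compiled = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(string):
        match = compiled.search(string, pos)
        if match is None:
            break
        results.append(match.group())
        pos = match.start() + 1
    return results


print(findall_overlapped(r"a.*b", "abadalaba"))
print(findall_overlapped(r"^a.*b", "abadalaba"))
```

Under this emulation the unanchored pattern yields the four overlapping matches and the anchored one yields a single match, because in re the ^ anchor matches at the real beginning of the string and not at the index where a later search starts.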
An = operator could be pretty handy for fuzzy matches, to find only erroneous
text. For example, in a list of hotmail email accounts, you could search for
misspellings with '@(hotmail\.com){e=1}'. This would save the user an extra "grep
-v" for filtering out correct emails from the list of matches.
Original issue reported on code.google.com by [email protected]
on 17 Jan 2012 at 12:49
Hi,
first of all, many thanks for the further excellent additions to regex, such as the
extended unicode properties and, more recently, the set operations!
I'd like to ask for some additions to the help text in this respect.
Could the Features.rst / Features.html be somehow accessible programmatically
from within the library? Especially the parts "Unicode codepoint properties,
including scripts and blocks" and "Set operators" may be good additions, as
these features are probably less known. Maybe the set operations syntax could be
added briefly to the initial part of the help, "The special characters are:",
under "[]".
Furthermore, there could ideally be a reference for the multitude of
supported Unicode properties.
However, I can see that the help text might get too large. Maybe links to
the respective data (the Unicode standard etc.) might be more appropriate. (Some
of these might eventually be added to the interface of unicodedata, but that's
not relevant here.)
(now using regex-0.1.20110514.tar.gz, python 2.7, on win 7)
Thanks again
vbr
Original issue reported on code.google.com by [email protected]
on 15 May 2011 at 9:31
Hi,
I think it may be intended behaviour, but I didn't find it mentioned
anywhere in the docs. Sorry if it has already been discussed somewhere I haven't
looked ...
It seems that for unicode patterns like ur"..." regex implicitly sets the
unicode flag (?u), while re doesn't seem to do that.
>>> re.findall(ur"\w", u"aáb")
[u'a', u'b']
>>> regex.findall(ur"\w", u"aáb")
[u'a', u'\xe1', u'b']
>>> re.findall(r"\w", u"aáb")
[u'a', u'b']
>>> regex.findall(r"\w", u"aáb")
[u'a', u'b']
>>> re.findall(ur"(?u)\w", u"aáb")
[u'a', u'\xe1', u'b']
>>> regex.findall(ur"(?u)\w", u"aáb")
[u'a', u'\xe1', u'b']
>>>
Python 2.7.1, win XP SP3, 32 bit Czech; regex r902c02d44f
regards,
Vlastimil Brom
Original issue reported on code.google.com by [email protected]
on 7 Feb 2011 at 1:13
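For comparison, the sessions above are from Python 2. In Python 3's standard re module the situation is reversed: str patterns use Unicode matching by default, and ASCII-only matching has to be requested with re.ASCII or the inline (?a) flag. A short sketch of that difference, using the same sample text:

```python
import re

text = "a\u00e1b"  # the string "aáb" from the report

# Python 3: Unicode matching is the default for str patterns.
unicode_words = re.findall(r"\w", text)

# ASCII-only matching must be asked for explicitly, by flag or inline.
ascii_words = re.findall(r"\w", text, re.ASCII)
ascii_words_inline = re.findall(r"(?a)\w", text)

print(unicode_words, ascii_words, ascii_words_inline)
```

This only illustrates the standard library's behaviour on Python 3; it says nothing about which default regex should choose on Python 2.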
I'd like to ask about the support for forward references and nested references
in regex ( http://www.regular-expressions.info/brackets.html ).
I couldn't find any mention of this in the documentation, but it seems that
forward references are supported, while nested references are not:
>>> regex.search(r"(\2b|(a))+", "-aab-").group()
'aab'
>>>
>>> regex.search(r"(\1b|(a))+", "-aab-").group()
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Python27\lib\regex.py", line 235, in search
return _compile(pattern, flags, kwargs).search(string, pos, endpos,
File "C:\Python27\lib\regex.py", line 423, in _compile
parsed = parse_pattern(source, info)
File "C:\Python27\lib\_regex_core.py", line 334, in parse_pattern
branches = [parse_sequence(source, info)]
File "C:\Python27\lib\_regex_core.py", line 350, in parse_sequence
item = parse_item(source, info)
File "C:\Python27\lib\_regex_core.py", line 363, in parse_item
element = parse_element(source, info)
File "C:\Python27\lib\_regex_core.py", line 587, in parse_element
element = parse_paren(source, info)
File "C:\Python27\lib\_regex_core.py", line 723, in parse_paren
subpattern = parse_pattern(source, info)
File "C:\Python27\lib\_regex_core.py", line 334, in parse_pattern
branches = [parse_sequence(source, info)]
File "C:\Python27\lib\_regex_core.py", line 350, in parse_sequence
item = parse_item(source, info)
File "C:\Python27\lib\_regex_core.py", line 363, in parse_item
element = parse_element(source, info)
File "C:\Python27\lib\_regex_core.py", line 584, in parse_element
return parse_escape(source, info, False)
File "C:\Python27\lib\_regex_core.py", line 1035, in parse_escape
return parse_numeric_escape(source, info, ch, in_set)
File "C:\Python27\lib\_regex_core.py", line 1069, in parse_numeric_escape
raise error("can't refer to an open group")
error: can't refer to an open group
>>>
Is it true, or am I misinterpreting something?
(re fails with the same error message for the second pattern and "bogus escape:
'\\2'" for the first one.)
thanks and regards,
vbr
Original issue reported on code.google.com by [email protected]
on 26 Sep 2011 at 9:19
The attached diff against c0186afe8c50 restores the speed lost in 7abd9f9bb1
for me (25% faster on my regression tests). In addition, all my tests still
pass without error.
Looking at the diff, I confirmed empirically that it's the conditional:
if (pattern->repeat_info[i].inner)
that seems to slow everything down. I don't know why this test and dereference
would be so expensive; it might very well be a gcc bug (I haven't compared the
generated assembly yet).
Original issue reported on code.google.com by [email protected]
on 6 Jan 2011 at 2:02
Attachments:
What steps will reproduce the problem?
>>> import regex
>>> regex.compile("\\ ", regex.X)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
return _compile(pattern, flags, kwargs)
File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
parsed = parse_pattern(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 335, in parse_pattern
branches = [parse_sequence(source, info)]
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 351, in parse_sequence
item = parse_item(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 364, in parse_item
element = parse_element(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 585, in parse_element
return parse_escape(source, info, False)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 1012, in parse_escape
raise error("bad escape")
_regex_core.error: bad escape
>>>
What is the expected output? What do you see instead?
>>> import regex
>>> print(regex.compile("\\ ", regex.X|regex.D))
CHARACTER MATCH 32
regex.Regex('\\ ', flags=regex.D | regex.X | regex.V0)
>>>
What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120112
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 14 Jan 2012 at 12:59
What steps will reproduce the problem?
$ python3
Python 3.2.2 (default, Dec 23 2011, 15:22:48)
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> regex.compile("(?>b)", flags=regex.V1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 289, in compile
return _compile(pattern, flags, kwargs)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 452, in _compile
parsed = parsed.optimise(info)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1712, in optimise
prefix, subpattern = Atomic._split_atomic_prefix(subpattern)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1765, in _split_atomic_prefix
prefix, subpattern = prefix[ : count], subpattern[count : ]
TypeError: 'Character' object is not subscriptable
>>> try:
... regex.compile("(?>b)")
... except regex.error:
... print("Wrong Regexp!")
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 289, in compile
return _compile(pattern, flags, kwargs)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/regex.py", line 452, in _compile
parsed = parsed.optimise(info)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1712, in optimise
prefix, subpattern = Atomic._split_atomic_prefix(subpattern)
File "/Users/msmhrt/local/mypython/3.2.2/lib/python3.2/site-packages/_regex_core.py", line 1765, in _split_atomic_prefix
prefix, subpattern = prefix[ : count], subpattern[count : ]
TypeError: 'Character' object is not subscriptable
>>>
What is the expected output? What do you see instead?
>>> import regex
>>> regex.compile("(?>b)", flags=regex.V1)
regex.Regex('(?>b)', flags=regex.F | regex.V1)
>>> try:
... regex.compile("(?>b)")
... except regex.error:
... print("Wrong RegExp!")
...
Wrong RegExp!
>>>
What version of the product are you using? On what operating system?
Mac OS X 10.6.8
Python 3.2.2
regex 0.1.20111223
Please provide any additional information below.
# The case of Python 3.2.2 standard re module
>>> import re
>>> try:
... re.compile("(?>b)")
... except re.error:
... print("Wrong RegExp!")
...
Wrong RegExp!
>>>
Original issue reported on code.google.com by [email protected]
on 3 Jan 2012 at 1:59
What steps will reproduce the problem?
>>> import regex
>>> regex.compile("(?=abc){3}abc")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 289, in compile
return _compile(pattern, flags, kwargs)
File "C:\Python32\3.2.2\lib\site-packages\regex.py", line 423, in _compile
parsed = parse_pattern(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 336, in parse_pattern
branches = [parse_sequence(source, info)]
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 352, in parse_sequence
item = parse_item(source, info)
File "C:\Python32\3.2.2\lib\site-packages\_regex_core.py", line 370, in parse_item
raise error("nothing to repeat")
_regex_core.error: nothing to repeat
>>>
What is the expected output? What do you see instead?
>>> import regex
>>> regex.compile("(?=abc){3}abc")
regex.Regex('(?=abc){3}abc', flags=regex.V0)
>>> regex.search("(?=abc){3}abc", "abcabcabc").span(0)
(0, 3)
>>>
What version of the product are you using? On what operating system?
Windows XP Home SP3 (32-bit version)
Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
regex 0.1.20120119
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 20 Jan 2012 at 1:35
Feature request: please change the NEW flag to something else. In five or six
years (give or take), the re module will be long forgotten, compatibility with
it will not be needed, so-called "new" features will no longer be new, and the
NEW flag will just be silly.
If you care about future compatibility, some sort of version specification
would be better, e.g. "VERSION=0" (current re module), "VERSION=1" (this regex
module), "VERSION=2" (next generation). You could then default to VERSION=0 for
the first few releases, and potentially change to VERSION=1 some time in the
future.
Otherwise, I suggest swapping the sense of the flag: instead of "re behaviour
unless NEW flag is given", I'd say "re behaviour only if OLD flag is given".
(Old semantics will, of course, remain old even when the new semantics are no
longer new.)
(Copied from http://bugs.python.org/issue2636)
Original issue reported on code.google.com by [email protected]
on 28 Aug 2011 at 6:17
Hi,
I just noticed the addition for beginning/end of the word and I appreciate it
very much.
I just wanted to ask what the "m" in \m and \M stands for.
I'd have expected rather
\< and \> (maybe because of OpenOffice), but it turns out that \m \M are used
in TCL and that the other syntax isn't used much more either:
http://www.regular-expressions.info/refflavors.html
[I must have somehow missed that page so far; it's nice to see that regex now
has most of the "useful" features (marked as omissions in re) and some extras
...]
Would it be possible to alias this alternative syntax too, or are there some
drawbacks with it, or is it considered "line noise"?
Just an aside: am I supposed to be able to set the issue type (such as Feature
request etc.), or is it always "Defect", which could be changed by the
administrators later?
Thanks and regards,
vbr
Original issue reported on code.google.com by [email protected]
on 6 Aug 2011 at 11:46
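For readers without \m and \M, start-of-word and end-of-word positions can be approximated in the standard re module by combining \b with a lookaround. This is a sketch of that well-known equivalence, not a statement about how regex implements \m and \M; the constant names are hypothetical.

```python
import re

# A word boundary followed by a word character: start of a word.
START_OF_WORD = r"\b(?=\w)"
# A word boundary preceded by a word character: end of a word.
END_OF_WORD = r"\b(?<=\w)"

text = "foo bar"
starts = [m.start() for m in re.finditer(START_OF_WORD, text)]
ends = [m.start() for m in re.finditer(END_OF_WORD, text)]
print(starts, ends)
```

Both patterns are zero-width, so the positions reported are offsets between characters rather than matched text.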
What steps will reproduce the problem?
1. hg clone
2. hg log
3. say aloud "whut"
As far as I can tell there are two parallel histories on the "default" branch.
Google Code's commit browser shows only one of these histories. `hg log` shows
both histories interleaved, one commit after the other, producing a single
nonsensical history. hgview shows these histories in parallel, but still
asserts that both are on the same branch. Is there a problem with the
repository? Am I doing something dumb?
I'd like to know with certainty the latest version of this project and the
history that led to it.
Original issue reported on code.google.com by [email protected]
on 1 Nov 2011 at 11:32
Hi,
recently I used the fuzzy matching capability to detect some misspellings; I
used lookarounds to filter out the correct forms.
In some cases I saw differences I can't understand; there appears to be some
correlation with the i and V0/V1 flags. (using regex 0.1.20111014; py
2.7.2; win XP)
# caseless negative lookbehind in V1 somehow doesn't filter out e==1 matches
>>> regex.findall(r"(?iV1)\m(?:word){e<=3}\M(?<!\m(?:word){e<=1}\M)", "word word2 word word3 word word234 word23 word")
['word2', 'word3', 'word234', 'word23']
# while in case-sensitive mode this works as expected
>>> regex.findall(r"(?V1)\m(?:word){e<=3}\M(?<!\m(?:word){e<=1}\M)", "word word2 word word3 word word234 word23 word")
['word234', 'word23']
>>>
>>>
# the (hopefully) equivalent lookaheads both work the same
>>> regex.findall(r"(?iV1)(?!\m(?:word){e<=1}\M)\m(?:word){e<=3}\M", "word word2 word word3 word word234 word23 word")
['word234', 'word23']
>>> regex.findall(r"(?V1)(?!\m(?:word){e<=1}\M)\m(?:word){e<=3}\M", "word word2 word word3 word word234 word23 word")
['word234', 'word23']
>>>
# the original lookbehinds above both work with the V0 flag
>>> regex.findall(r"(?V0)\m(?:word){e<=3}\M(?<!\m(?:word){e<=1}\M)", "word word2 word word3 word word234 word23 word")
['word234', 'word23']
>>> regex.findall(r"(?iV0)\m(?:word){e<=3}\M(?<!\m(?:word){e<=1}\M)", "word word2 word word3 word word234 word23 word")
['word234', 'word23']
>>>
Unless I have made some stupid mistake, the patterns should match the variants
of "word" with more than 1 and at most 3 errors. Are there some other aspects to
consider when using lookarounds with fuzzy patterns?
BTW, a probably silly idea came out of this kind of search: how about
adding support for only erroneous matches via ... {1<=e<=3} ?
(I see that it could get quite complex with more sophisticated error cost
arithmetic.)
However, I am glad the same can be done with lookarounds.
(As a side note, is this kind of possible bug report appropriate for a
separate issue, or would it have been better within the "approximate matching"
issue?)
regards,
vbr
Original issue reported on code.google.com by [email protected]
on 3 Nov 2011 at 5:17
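As a cross-check that does not depend on regex, the result the report labels as expected (words that differ from "word" by more than one and at most three errors) can be reproduced with a plain edit-distance filter. The sketch assumes Levenshtein distance (insertions, deletions, substitutions, each costing 1) and whitespace-separated words; edit_distance and words_with_errors are hypothetical helper names, and regex's fuzzy matching is not guaranteed to count errors in exactly this way.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def words_with_errors(text, target, min_errors, max_errors):
    # Keep the words whose distance to target lies in the requested range.
    return [w for w in text.split()
            if min_errors <= edit_distance(w, target) <= max_errors]


text = "word word2 word word3 word word234 word23 word"
print(words_with_errors(text, "word", 2, 3))
print(words_with_errors(text, "word", 1, 3))
```

The first call corresponds to the lookaround patterns that exclude e<=1; the second corresponds to the {1<=e<=3} idea, which also keeps the single-error forms.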
What steps will reproduce the problem?
$ python3
Python 3.2.2 (default, Dec 23 2011, 15:22:48)
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
>>> r = regex.compile("qu", flags=regex.I|regex.V1)
>>> r
regex.Regex('qu', flags=regex.F | regex.I | regex.V1)
>>> print(r.match("qu"))
None
>>>
What is the expected output? What do you see instead?
>>> import regex
>>> r = regex.compile("qu", flags=regex.I|regex.V1)
>>> print(r.match("qu").group(0))
qu
What version of the product are you using? On what operating system?
Mac OS X 10.6.8
Python 3.2.2
regex 0.1.20120103
Please provide any additional information below.
I can't reproduce this issue with the following steps:
>>> import regex
>>> r = regex.compile("qu")
>>> print(r.match("qu").group(0))
qu
>>> r = regex.compile("qu", flags=regex.I)
>>> print(r.match("qu").group(0))
qu
>>> r = regex.compile("qu", flags=regex.V1)
>>> print(r.match("qu").group(0))
qu
Original issue reported on code.google.com by [email protected]
on 3 Jan 2012 at 4:48