Onigmo is a regular expressions library forked from Oniguruma.

onigmo's Introduction

Onigmo (Oniguruma-mod)

https://github.com/k-takata/Onigmo

Onigmo is a regular expressions library forked from Oniguruma. It focuses on supporting new expressions such as \K, \R and (?(cond)yes|no), which are supported in Perl 5.10+.

Since Onigmo is used as the default regexp library of Ruby 2.0 or later, many patches are backported from Ruby 2.x.

See also the Wiki page: https://github.com/k-takata/Onigmo/wiki

License

BSD license.

Install

Case 1: Unix and Cygwin platform

  1. ./autogen.sh (If configure doesn't exist.)
  2. ./configure
  3. make
  4. make install
  • test

    make test

  • uninstall

    make uninstall

  • configuration check

    onigmo-config --cflags
    onigmo-config --libs
    onigmo-config --prefix
    onigmo-config --exec-prefix

Case 2: Windows 64/32bit platform (Visual C++)

Execute build_nmake.cmd. build_x64 or build_x86 will be used as a working/output directory.

  onigmo_s.lib:  static link library
  onigmo.lib:    import library for dynamic link
  onigmo.dll:    dynamic link library
  • test (ASCII/Shift_JIS/EUC-JP/Unicode)

    Execute build_nmake.cmd test. Python (with the same bitness of Onigmo) is needed to run the tests.

Case 3: Windows 64/32bit platform (MinGW)

Execute mingw32-make -f win32/Makefile.mingw. build_x86-64, build_i686, etc. will be used as a working/output directory.

  libonigmo.a:     static link library
  libonigmo.dll.a: import library for dynamic link
  onigmo.dll:      dynamic link library
  • test (ASCII/Shift_JIS/EUC-JP/Unicode)

    Execute mingw32-make -f win32/Makefile.mingw test. Python (with the same bitness of Onigmo) is needed to run the tests.

  • If you use MinGW on MSYS2, you can also use ./configure and make as on Unix. In this case, the DLL name will include the API version number. E.g.:

    libonigmo-6.dll

Regular Expressions

See doc/RE, or doc/RE.ja for Japanese.

Usage

Include onigmo.h in your program. (Onigmo API) See doc/API for Onigmo API.

If you want to disable UChar type (== unsigned char) definition in onigmo.h, define ONIG_ESCAPE_UCHAR_COLLISION and then include onigmo.h.

If you want to disable regex_t type definition in onigmo.h, define ONIG_ESCAPE_REGEX_T_COLLISION and then include onigmo.h.

Example of a compile/link command line on Unix or Cygwin (assuming prefix == /usr/local):

cc sample.c -L/usr/local/lib -lonigmo

If you want to use the static link library (onigmo_s.lib) on Win32, add the option -DONIG_EXTERN=extern to the C compiler command line.

Sample Programs

File Description
sample/simple.c example of the minimum (Onigmo API)
sample/names.c example of the named group callback.
sample/encode.c example of some encodings.
sample/listcap.c example of the capture history.
sample/posix.c POSIX API sample.
sample/sql.c example of the variable meta characters.

Test Programs

File Description
sample/syntax.c Perl, Java and ASIS syntax test.
sample/crnl.c CRNL test

Source Files

File Description
onigmo.h Onigmo API header file (public)
onigmo-config.in configuration check program template
onigmo.py Onigmo module for Python
regenc.h character encodings framework header file
regint.h internal definitions
regparse.h internal definitions for regparse.c and regcomp.c
regcomp.c compiling and optimization functions
regenc.c character encodings framework
regerror.c error message function
regext.c extended API functions (deluxe version API)
regexec.c search and match functions
regparse.c parsing functions.
regsyntax.c pattern syntax functions and built-in syntax definition
regtrav.c capture history tree data traverse functions
regversion.c version info function
st.h hash table functions header file
st.c hash table functions
onigmognu.h GNU regex API header file (public)
reggnu.c GNU regex API functions
onigmoposix.h POSIX API header file (public)
regposerr.c POSIX error message function
regposix.c POSIX API functions
enc/mktable.c character type table generator
enc/ascii.c ASCII-8BIT encoding
enc/jis/ JIS properties data
enc/euc_jp.c EUC-JP encoding
enc/euc_tw.c EUC-TW encoding
enc/euc_kr.c EUC-KR, EUC-CN encoding
enc/shift_jis.c Shift_JIS encoding
enc/shift_jis.h Common part of Shift_JIS and Windows-31J encoding
enc/windows_31j.c Windows-31J (CP932) encoding
enc/big5.c Big5 encoding
enc/gb18030.c GB18030 encoding
enc/gbk.c GBK encoding
enc/koi8_r.c KOI8-R encoding
enc/koi8_u.c KOI8-U encoding
enc/iso_8859.h common definition of ISO-8859 encoding
enc/iso_8859_1.c ISO-8859-1 (Latin-1)
enc/iso_8859_2.c ISO-8859-2 (Latin-2)
enc/iso_8859_3.c ISO-8859-3 (Latin-3)
enc/iso_8859_4.c ISO-8859-4 (Latin-4)
enc/iso_8859_5.c ISO-8859-5 (Cyrillic)
enc/iso_8859_6.c ISO-8859-6 (Arabic)
enc/iso_8859_7.c ISO-8859-7 (Greek)
enc/iso_8859_8.c ISO-8859-8 (Hebrew)
enc/iso_8859_9.c ISO-8859-9 (Latin-5 or Turkish)
enc/iso_8859_10.c ISO-8859-10 (Latin-6 or Nordic)
enc/iso_8859_11.c ISO-8859-11 (Thai)
enc/iso_8859_13.c ISO-8859-13 (Latin-7 or Baltic Rim)
enc/iso_8859_14.c ISO-8859-14 (Latin-8 or Celtic)
enc/iso_8859_15.c ISO-8859-15 (Latin-9 or West European with Euro)
enc/iso_8859_16.c ISO-8859-16 (Latin-10)
enc/utf_8.c UTF-8 encoding
enc/utf_16be.c UTF-16BE encoding
enc/utf_16le.c UTF-16LE encoding
enc/utf_32be.c UTF-32BE encoding
enc/utf_32le.c UTF-32LE encoding
enc/unicode.c common codes of Unicode encoding
enc/unicode/ Unicode case folding data and properties data
enc/windows_1250.c Windows-1250 (CP1250) encoding (Central/Eastern Europe)
enc/windows_1251.c Windows-1251 (CP1251) encoding (Cyrillic)
enc/windows_1252.c Windows-1252 (CP1252) encoding (Latin)
enc/windows_1253.c Windows-1253 (CP1253) encoding (Greek)
enc/windows_1254.c Windows-1254 (CP1254) encoding (Turkish)
enc/windows_1257.c Windows-1257 (CP1257) encoding (Baltic Rim)
enc/cp949.c CP949 encoding (only used in Ruby)
enc/emacs_mule.c Emacs internal encoding (only used in Ruby)
enc/gb2312.c GB2312 encoding (only used in Ruby)
enc/us_ascii.c US-ASCII encoding (only used in Ruby)
win32/Makefile Makefile for Win32 (VC++)
win32/Makefile.mingw Makefile for Win32 (MinGW)
win32/config.h config.h for Win32
win32/onigmo.rc resource file for Win32

onigmo's People

Contributors

aycabta, duerst, imasahiro, k-takata, kazuho, keens, knu, kou, ksss, mattn, methodmissing, mtsuji, nobu, nurse, omochi, sebgod, shyouhei, sorbits, srawlins, timgates42, tom-lord, xcorail, znz

onigmo's Issues

Backward search with .*

Onigmo/testpy.py

Lines 1419 to 1422 in 2cb0bec

# These match differently. Is it okay?
x2(".*[a-z]bc", "abcabc", 0, 6, searchtype=SearchType.BACKWARD)
x2(".+[a-z]bc", "abcabc", 0, 6, searchtype=SearchType.BACKWARD)
x2(".{1,3}[a-z]bc", "abcabc", 2, 6, searchtype=SearchType.BACKWARD)

/.*[a-z]bc/ and /.+[a-z]bc/ both match 0-6 (abcabc),
but /.{1,3}[a-z]bc/ matches 2-6 (cabc).

Shouldn't the results be as follows?
/.*[a-z]bc/: 3-6
/.+[a-z]bc/: 2-6
/.{1,3}[a-z]bc/: 2-6
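
The proposed results can be reproduced with a small sketch, assuming that a backward search means "try an anchored match at the start position, then at each earlier position, and report the first success". That reading is an assumption, and the sketch uses Python's re module rather than Onigmo; backward_search is a hypothetical helper name.

```python
import re

def backward_search(pattern, text, start, stop=0):
    """Try an anchored match at start, start-1, ..., stop.

    Returns the (begin, end) span of the first position that matches,
    or None if no candidate position matches.
    """
    regex = re.compile(pattern)
    for pos in range(start, stop - 1, -1):
        m = regex.match(text, pos)  # anchored at pos, no scanning forward
        if m:
            return m.span()
    return None

for pat in (r".*[a-z]bc", r".+[a-z]bc", r".{1,3}[a-z]bc"):
    print(pat, backward_search(pat, "abcabc", 6))
```

Under this reading, the three patterns give 3-6, 2-6 and 2-6, which are the values suggested above.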

Named backrefs behave differently in Perl syntax

In Perl, named backrefs (\k<name>, \g{name}) refer only to the leftmost group with the name.
In Onigmo with Perl syntax, however, they behave differently, as doc/RE describes:

Onigmo/doc/RE

Lines 321 to 323 in b334081

When backreferencing with a name that is assigned to more than one groups,
the last group with the name is checked first, if not matched then the
previous one with the name, and so on, until there is a match.

They should refer only to the leftmost group in Perl syntax.
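
The difference between the two strategies can be sketched in a few lines. This is a simplified model, not Onigmo code: groups is a hypothetical list holding the text captured by each same-named group in left-to-right order (None if the group did not take part in the match), and both function names are made up for illustration.

```python
def backref_leftmost(groups, text, pos):
    """Perl-style, as described above: consult only the leftmost group."""
    value = groups[0]
    if value is not None and text.startswith(value, pos):
        return pos + len(value)  # new position after the backreference
    return None                  # backreference fails

def backref_last_first(groups, text, pos):
    """doc/RE-style: try the last group first, then earlier ones."""
    for value in reversed(groups):
        if value is not None and text.startswith(value, pos):
            return pos + len(value)
    return None

groups = ["ab", "cd"]  # two groups sharing one name
print(backref_leftmost(groups, "abcdcd", 4))    # leftmost "ab" does not fit here
print(backref_last_first(groups, "abcdcd", 4))  # last group "cd" fits
```

With the same captures, the leftmost strategy fails at a position where the last-first strategy succeeds, which is the behavioral gap this issue is about.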

Related: #73

/(?i)\u0149\u0149/ =~ "\u0149\u0149" doesn't match

How to reproduce:

$ LD_LIBRARY_PATH=.libs python

>>> from testpy import *
>>> set_encoding('UTF-8')
>>> set_output_encoding('UTF-8')
>>> x2(u"(?i)\u0149\u0149", u"\u0149\u0149", 0, 2)
FAIL: /(?i)ʼnʼn/ 'ʼnʼn'

It seems that Oniguruma 5.9.5 also has this bug.
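
For comparison, the following sketch inspects U+0149 with Python's standard library. It shows that the full Unicode case folding of this character is two code points long. Whether that multi-character folding is what triggers the failure in Onigmo is an assumption and is not confirmed here; the sketch only shows the property of the character and how a different engine (Python's re) treats the same pattern.

```python
import re
import unicodedata

ch = "\u0149"
name = unicodedata.name(ch)
folded = ch.casefold()  # full case folding, may expand to several code points

# Same pattern and subject as the failing test, run through Python's re.
match = re.search("(?i)" + ch + ch, ch + ch)

print(name)
print([f"U+{ord(c):04X}" for c in folded])
print(match.span() if match else None)
```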

question about the \G anchor and onig_search_gpos

hi,

Let's say we have a regex /\Gabc/ and the target string is "123abcdef". If we set gpos at offset 3 in the target string, the regex is supposed to match "abc" in the target string, right?

Can someone please advise why the code below does not work as I expected?

    UChar* pattern = (UChar*) "\\Gabc";
    UChar* pattern_end = pattern + strlen((char*) pattern);  /* strlen takes char* */
    OnigErrorInfo oni_error;
    regex_t* regex;
    int code = onig_new(&regex, pattern, pattern_end, ONIG_OPTION_CAPTURE_GROUP,
                        ONIG_ENCODING_UTF8, ONIG_SYNTAX_DEFAULT, &oni_error);
    if (code != ONIG_NORMAL) printf("error create regex\n");

    UChar* str = (UChar*) "123abcdef";
    UChar* end = str + strlen((char*) str);
    UChar* gpos = str + 3;   /* position that \G is expected to anchor to */
    UChar* start = str;      /* search start position */
    OnigRegion* region = onig_region_new();
    code = onig_search_gpos(regex, str, end, gpos, start, end, region, ONIG_OPTION_NONE);
    if (code == ONIG_MISMATCH) {
        printf("can't match\n");   // output this, why???
    } else {
        printf("matched\n");
    }

Plan for Onigmo 6.0

About one year ago, Oniguruma restarted on GitHub, and Oniguruma 6.0 was released about five months ago. Unfortunately, the new syntax I implemented in Onigmo, such as \K and \X, was not merged into Oniguruma, and I don't think K.Kosako is going to merge it.
I think Onigmo and Oniguruma have now diverged and won't be merged again.

Currently Onigmo and Oniguruma cannot be installed in the same system, because Onigmo still uses the same header/library filenames. They should be changed now.

I'm planning the following:

  • Rename the header and library filenames: onigmo.h, libonigmo.so, onigmo.dll, etc.
  • Import the latest Ruby's source code. This includes support for Unicode 9.0 and improvement of case folding. (Already merged into the ruby-2.x branch.)
  • Merge the ruby-2.x branch into master. Ruby-specific code should be surrounded by #ifdef RUBY.
  • Change some APIs. (Some of them come from the side effect of merging the ruby-2.x branch.)
    • Encoding structs and syntax structs will become const.
    • Change some encoding names. E.g. ONIG_ENCODING_UTF8 -> ONIG_ENCODING_UTF_8. However, the old names can still be used, because I will define aliases for backward compatibility. (But the old structure names like OnigEncodingUTF8 will not be used. They are not considered part of the public API.)
    • (Import Oniguruma 6.x's new APIs?)
      • onig_scan: Maybe useful and easy to import.
      • onig_unicode_define_user_property: Not sure how this is useful.
      • onig_initialize: Not sure why this is needed. Related to thread safety?
  • (Make Onigmo thread safe and remove all THREAD_* macros like Oniguruma did?)
  • Merge bug fixes done in Oniguruma 6.x. Need to check if they are already fixed in Onigmo.

Support Perl's \h and \v

The following two should be supported in Perl syntax:

\h: horizontal white space
\v: vertical white space

They conflict with Ruby syntax, so they will not be supported in Ruby syntax.
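
As a rough illustration of what the two escapes cover, here is an emulation with explicit character classes in Python. The character lists are my reading of Perl's perlrecharclass documentation and should be treated as an assumption (for example, older Perl versions also counted U+180E as horizontal white space); PERL_H, PERL_V and the helper names are hypothetical.

```python
import re

# Assumed contents of Perl's \h (horizontal) and \v (vertical) white space.
PERL_H = r"[\t \xA0\u1680\u2000-\u200A\u202F\u205F\u3000]"
PERL_V = r"[\n\x0B\f\r\x85\u2028\u2029]"

def is_horizontal(ch):
    """True if ch would be matched by the emulated \\h class."""
    return re.fullmatch(PERL_H, ch) is not None

def is_vertical(ch):
    """True if ch would be matched by the emulated \\v class."""
    return re.fullmatch(PERL_V, ch) is not None

for ch in ("\t", " ", "\u3000", "\n", "\r", "\u2028", "a"):
    print(repr(ch), is_horizontal(ch), is_vertical(ch))
```

The conflict mentioned above comes from Ruby syntax already using \h for a hexadecimal digit character.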

Backwards search not respecting range

Summary

If I call onig_search with a start/range of 3-0 then it may still return a match outside this range.

Steps to Reproduce

Build the following source:

#include <string.h>
#include <stdio.h>
#include "oniguruma.h"

int main (int argc, char const* argv[])
{
    char const* ptrn = "\\h+";
    char const* data = "abcdef";

    OnigErrorInfo einfo;
    OnigRegex regex = NULL;
    if(ONIG_NORMAL == onig_new(&regex, (OnigUChar const*)ptrn, (OnigUChar const*)ptrn + strlen(ptrn), ONIG_OPTION_CAPTURE_GROUP, ONIG_ENCODING_UTF8, ONIG_SYNTAX_RUBY, &einfo))
    {
        OnigRegion* res = onig_region_new();
        if(ONIG_MISMATCH != onig_search(regex, (OnigUChar const*)data, (OnigUChar const*)data + strlen(data), (OnigUChar const*)data + 3, (OnigUChar const*)data, res, 0))
            fprintf(stderr, "%td-%td\n", res->beg[0], res->end[0]);
    }
    return 0;
}

Expected Result

I would expect the output to be 0-3.

Actual Result

The program outputs 3-4.

Notes

This was tested with current master (9cd4fa1), but the problem has existed for as long as I know.

Warnings occur when compiling with MinGW

regcomp.c: In function 'disable_noname_group_capture':
regcomp.c:2020:3: warning: implicit declaration of function 'alloca' [-Wimplicit-function-declaration]
   map = (GroupNumRemap* )xalloca(sizeof(GroupNumRemap) * (env->num_mem + 1));
   ^
In file included from regparse.h:33:0,
                 from regcomp.c:31:
regint.h:195:21: warning: incompatible implicit declaration of built-in function 'alloca' [enabled by default]
 #define xalloca     alloca
                     ^
regcomp.c:2020:26: note: in expansion of macro 'xalloca'
   map = (GroupNumRemap* )xalloca(sizeof(GroupNumRemap) * (env->num_mem + 1));
                          ^

crashes from onig_new

  r = onig_new(&reg, data, data + size,
    ONIG_OPTION_DEFAULT, ONIG_ENCODING_ASCII, ONIG_SYNTAX_DEFAULT, &einfo);
(\2)(\1)
stack-overflow on address 0x7fffca3d5ff8 (pc 0x000000574964 bp 0x7fffca3d6110 sp 0x7fffca3d6000 T0)
    #0 0x574963 in get_min_match_length /home/skomski/Code/Onigmo/regcomp.c:2152:3
    #1 0x574dc5 in get_min_match_length /home/skomski/Code/Onigmo/regcomp.c:2188:11
    #2 0x574d1d in get_min_match_length /home/skomski/Code/Onigmo/regcomp.c:2244:8
    #3 0x57520a in get_min_match_length /home/skomski/Code/Onigmo/regcomp.c:2163:11
    #4 0x574d1d in get_min_match_length /home/skomski/Code/Onigmo/regcomp.c:2244:8
    #5 0x57520a in get_min_match_length /home/skomski/Code/Onigmo/regcomp.c:2163:11

   ...
(((?(700000))(?<y>)(())))
SEGV on unknown address 0x7ffcea194a00 (pc 0x000000572f74 bp 0x7ffcea194a00 sp 0x7ffce9ee8f80 T0)
==27959==The signal is caused by a READ memory access.
    #0 0x572f73 in renumber_by_map /home/skomski/Code/Onigmo/regcomp.c:1953:38
    #1 0x57325d in renumber_by_map /home/skomski/Code/Onigmo/regcomp.c:1943:11
    #2 0x54bf34 in disable_noname_group_capture /home/skomski/Code/Onigmo/regcomp.c:2030:7
    #3 0x5489c6 in onig_compile /home/skomski/Code/Onigmo/regcomp.c:5740:11
    #4 0x5718c4 in onig_new /home/skomski/Code/Onigmo/regcomp.c:5976:7

Merge Ruby's casefold.h

The latest Ruby's enc/unicode/casefold.h uses a perfect hash.
Maybe it is faster than the current implementation.

Regexp matching with \p{Upper} and \p{Lower} for EUC-JP doesn’t work.

U+FF21 (Ａ, FULLWIDTH LATIN CAPITAL LETTER A) and U+00C0 (À, LATIN CAPITAL LETTER A WITH GRAVE) are both Uppercase_Letter, so \p{Upper} should match at index 0 in the following cases, but the result is 1.

ruby -e 'puts "\uFF21A".encode("EUC-JP") =~ Regexp.compile("\\\p{Upper}".encode("EUC-JP"))' # => 1
ruby -e 'puts "\u00C0A".encode("EUC-JP") =~ Regexp.compile("\\\p{Upper}".encode("EUC-JP"))' # => 1

This also happens in lower case matching.

ruby -e 'puts "\uFF41a".encode("EUC-JP") =~ Regexp.compile("\\\p{Lower}".encode("EUC-JP"))' # => 1

In Unicode encoding it works as follows.

ruby -e 'puts "\uFF21A" =~ Regexp.compile("\\\p{Upper}")'  # => 0

Looks like EUC-JP \p{Upper} and \p{Lower} regex is limited to ASCII characters.
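
The Unicode side of this claim can be checked independently of Onigmo. The sketch below only looks up the general category of the three characters used above with Python's unicodedata module; it says nothing about how the EUC-JP encoding in Onigmo classifies them. The describe helper is a hypothetical name.

```python
import unicodedata

def describe(ch):
    """Return (code point, general category, character name) for ch."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

for ch in ("\uFF21", "\u00C0", "\uFF41"):
    print(describe(ch))
```

Lu is Uppercase_Letter and Ll is Lowercase_Letter, so an engine that follows Unicode properties for these characters would be expected to match at index 0.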

Out-of-bounds read in set_bm_skip()

The following code

#include <stdio.h>
#include <string.h>
#include "onigmo.h"

int main(int argc, char* argv[])
{
	int r;
	regex_t* reg;
	OnigErrorInfo einfo;

	UChar* pattern = (UChar* )"\\\xD3\xD5\xBE\x1E+";
	r = onig_new(&reg, pattern, pattern + strlen((char* )pattern),
		     ONIG_OPTION_DEFAULT, ONIG_ENCODING_EUC_JP, ONIG_SYNTAX_DEFAULT, &einfo);
	if (r != ONIG_NORMAL) {
		OnigUChar s[ONIG_MAX_ERROR_MESSAGE_LEN];
		onig_error_code_to_str(s, r, &einfo);
		fprintf(stderr, "ERROR: %s\n", s);
		return -1;
	}
	onig_free(reg);
	onig_end();
}

causes out-of-bounds heap read:

=================================================================
==13981==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60200000efd3 at pc 0x7f898811550c bp 0x7ffd279fb1c0 sp 0x7ffd279fb1b0
READ of size 1 at 0x60200000efd3 thread T0
    #0 0x7f898811550b in set_bm_skip /work/ref/Onigmo/regcomp.c:4252
    #1 0x7f898811c341 in set_optimize_exact_info /work/ref/Onigmo/regcomp.c:5294
    #2 0x7f898811cce5 in set_optimize_info_from_tree /work/ref/Onigmo/regcomp.c:5386
    #3 0x7f898811db90 in onig_compile /work/ref/Onigmo/regcomp.c:5798
    #4 0x7f898811e553 in onig_new /work/ref/Onigmo/regcomp.c:5938
    #5 0x400be1 in main /work/ref/Onigmo/sample/simple.c:12
    #6 0x7f8987d2d290 in __libc_start_main (/usr/lib/libc.so.6+0x20290)
    #7 0x4009d9 in _start (/work/ref/Onigmo/sample/.libs/simple+0x4009d9)

0x60200000efd3 is located 0 bytes to the right of 3-byte region [0x60200000efd0,0x60200000efd3)
allocated by thread T0 here:
    #0 0x7f89884aee60 in __interceptor_malloc /build/gcc-multilib/src/gcc/libsanitizer/asan/asan_malloc_linux.cc:62
    #1 0x7f898811be9e in set_optimize_exact_info /work/ref/Onigmo/regcomp.c:5268
    #2 0x7f898811cce5 in set_optimize_info_from_tree /work/ref/Onigmo/regcomp.c:5386
    #3 0x7f898811db90 in onig_compile /work/ref/Onigmo/regcomp.c:5798
    #4 0x7f898811e553 in onig_new /work/ref/Onigmo/regcomp.c:5938
    #5 0x400be1 in main /work/ref/Onigmo/sample/simple.c:12
    #6 0x7f8987d2d290 in __libc_start_main (/usr/lib/libc.so.6+0x20290)

SUMMARY: AddressSanitizer: heap-buffer-overflow /work/ref/Onigmo/regcomp.c:4252 in set_bm_skip
Shadow bytes around the buggy address:
  0x0c047fff9da0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c047fff9db0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c047fff9dc0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c047fff9dd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c047fff9de0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x0c047fff9df0: fa fa fa fa fa fa fa fa fa fa[03]fa fa fa 00 04
  0x0c047fff9e00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c047fff9e10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c047fff9e20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c047fff9e30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c047fff9e40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Heap right redzone:      fb
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack partial redzone:   f4
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==13981==ABORTING

(originally reported at https://bugs.ruby-lang.org/issues/12997)

Valgrind reports out-of-bounds memory access while creating a Regexp object with an invalid byte sequence:
$ valgrind ruby -e'Regexp.new("\\\xD3\xD5\xBE\x1E+".force_encoding("euc-jp"))'
==21986== Memcheck, a memory error detector
==21986== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==21986== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==21986== Command: ruby -eRegexp.new("\\\\\\xD3\\xD5\\xBE\\x1E+".force_encoding("euc-jp"))
==21986== 
==21986== Invalid read of size 1
==21986==    at 0x1EF7D0: set_bm_skip.isra.17 (regcomp.c:4271)
==21986==    by 0x1FC1FB: set_optimize_exact_info (regcomp.c:5310)
==21986==    by 0x1FC1FB: set_optimize_info_from_tree (regcomp.c:5396)
==21986==    by 0x1FC1FB: onig_compile (regcomp.c:5824)
==21986==    by 0x1E7C0C: onig_new_with_source (re.c:850)
==21986==    by 0x1E7C0C: make_regexp (re.c:874)
==21986==    by 0x1E7C0C: rb_reg_initialize (re.c:2681)
==21986==    by 0x1E7DEE: rb_reg_initialize_str (re.c:2715)
==21986==    by 0x1E8021: rb_reg_init_str (re.c:2751)
==21986==    by 0x1E8021: rb_reg_initialize_m (re.c:3293)
==21986==    by 0x2981AA: vm_call0_cfunc_with_frame (vm_eval.c:131)
==21986==    by 0x2981AA: vm_call0_cfunc (vm_eval.c:148)
==21986==    by 0x2981AA: vm_call0_body.constprop.142 (vm_eval.c:180)
==21986==    by 0x29897C: vm_call0 (vm_eval.c:61)
==21986==    by 0x29897C: rb_call0 (vm_eval.c:342)
==21986==    by 0x19BFA0: rb_class_new_instance (object.c:1895)
==21986==    by 0x2891D6: vm_call_cfunc_with_frame (vm_insnhelper.c:1752)
==21986==    by 0x2891D6: vm_call_cfunc (vm_insnhelper.c:1847)
==21986==    by 0x296A8D: vm_call_method_each_type (vm_insnhelper.c:2138)
==21986==    by 0x296FC2: vm_call_method (vm_insnhelper.c:2288)
==21986==    by 0x28FEC8: vm_exec_core (insns.def:1066)
==21986==  Address 0x73f7333 is 0 bytes after a block of size 3 alloc'd
==21986==    at 0x4C2AB8D: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==21986==    by 0x1FC083: set_optimize_exact_info (regcomp.c:5284)
==21986==    by 0x1FC083: set_optimize_info_from_tree (regcomp.c:5396)
==21986==    by 0x1FC083: onig_compile (regcomp.c:5824)
==21986==    by 0x1E7C0C: onig_new_with_source (re.c:850)
==21986==    by 0x1E7C0C: make_regexp (re.c:874)
==21986==    by 0x1E7C0C: rb_reg_initialize (re.c:2681)
==21986==    by 0x1E7DEE: rb_reg_initialize_str (re.c:2715)
==21986==    by 0x1E8021: rb_reg_init_str (re.c:2751)
==21986==    by 0x1E8021: rb_reg_initialize_m (re.c:3293)
==21986==    by 0x2981AA: vm_call0_cfunc_with_frame (vm_eval.c:131)
==21986==    by 0x2981AA: vm_call0_cfunc (vm_eval.c:148)
==21986==    by 0x2981AA: vm_call0_body.constprop.142 (vm_eval.c:180)
==21986==    by 0x29897C: vm_call0 (vm_eval.c:61)
==21986==    by 0x29897C: rb_call0 (vm_eval.c:342)
==21986==    by 0x19BFA0: rb_class_new_instance (object.c:1895)
==21986==    by 0x2891D6: vm_call_cfunc_with_frame (vm_insnhelper.c:1752)
==21986==    by 0x2891D6: vm_call_cfunc (vm_insnhelper.c:1847)
==21986==    by 0x296A8D: vm_call_method_each_type (vm_insnhelper.c:2138)
==21986==    by 0x296FC2: vm_call_method (vm_insnhelper.c:2288)
==21986==    by 0x28FEC8: vm_exec_core (insns.def:1066)
==21986== 
==21986== 
==21986== HEAP SUMMARY:
==21986==     in use at exit: 2,538,700 bytes in 17,476 blocks
==21986==   total heap usage: 43,758 allocs, 26,282 frees, 10,646,254 bytes allocated
==21986== 
==21986== LEAK SUMMARY:
==21986==    definitely lost: 349,991 bytes in 3,886 blocks
==21986==    indirectly lost: 474,023 bytes in 5,121 blocks
==21986==      possibly lost: 1,441,628 bytes in 7,599 blocks
==21986==    still reachable: 273,058 bytes in 870 blocks
==21986==         suppressed: 0 bytes in 0 blocks
==21986== Rerun with --leak-check=full to see details of leaked memory
==21986== 
==21986== For counts of detected and suppressed errors, rerun with: -v
==21986== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

Thread safety of onig_search and onig_match.

Forgive me if I'm working from incorrect assumptions as I've just begun integrating Onigmo into one of my projects. My understanding is that these functions are intended to be thread-safe.

Looking at the implementation of onig_search and onig_match, I noticed the following code attempting some synchronization at the start:

#if defined(USE_RECOMPILE_API) && defined(USE_MULTI_THREAD_SYSTEM)
start:
THREAD_ATOMIC_START;
...

This seems to be part of a (perhaps defunct?) feature, hidden behind USE_RECOMPILE_API that isn't set anywhere, that would recompile the regular expression from within onig_search and onig_match, something that's obviously not thread safe and needs to be protected (although a global lock would kill throughput).

Given that there appears to be no other synchronization in use other than this (dead?) code, I thought I would try changing the functions to take const regex_t* instead of regex_t* to at least signal thread safety with respect to the regular expression struct.

Here is a comparison of a commit that adds const where needed.

Well, that wasn't that bad of a change, but unfortunately it won't build because set_bm_backward_skip is used to modify the int_map_backward field of regex_t. At first glance, this does not appear to be a thread safe mutation of regex_t state.

Is that the case or is this not a problem?

Please update outdated config.guess because it causes FTBFS on ppc64el

On ppc64el, configure script fails due to outdated config.guess.

Here is the output log of configure:

checking dependency style of gcc... gcc3
checking build system type... ./config.guess: unable to guess system type

This script, last modified 2009-12-30, has failed to recognize
the operating system you are using. It is advised that you
download the most up to date version of the config scripts from

  http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD
and
  http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD

If the version you run (./config.guess) is already up to date, please
send the following data and any information you think might be
pertinent to <config-patches@gnu.org> in order to provide the needed
information to handle your system.

config.guess timestamp = 2009-12-30

FYI: there is a workaround when building the deb package (dh --with autotools-dev), because it updates config.(guess|sub), but it would be better to fix this upstream.

Whitespace in /[...]/x

http://www.kt.rim.or.jp/~kbk/zakkicho/14/zakkicho1408b.html#D20140820-3
See: [ruby-list:49923] whitespace in /[...]/x

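When a regular expression is written with the x option, whitespace is supposed to be ignored,
but whitespace written inside a character class does not seem to be ignored.
Is this the intended behavior?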

Currently, Onigmo doesn't ignore whitespace in a char class even if ONIG_OPTION_EXTEND option is specified.

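Perl doesn't ignore it under /x either: whitespace inside a bracketed character class stays significant. (Perl 5.26 added /xx, which ignores spaces and tabs there.)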
http://perldoc.perl.org/perlre.html#/x

Python doesn't ignore it.
https://docs.python.org/3.3/library/re.html#re.X

Would it be better to add an option?
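
The Python behavior mentioned above is easy to confirm with the standard re module. This sketch covers Python only and makes no claim about Onigmo; outside and inside are just illustrative variable names.

```python
import re

# Under re.X, whitespace outside a character class is ignored...
outside = re.compile(r"a b", re.X)
# ...but whitespace inside a character class is kept as a literal member.
inside = re.compile(r"[a b]", re.X)

print(bool(outside.fullmatch("ab")))   # pattern is effectively "ab"
print(bool(outside.fullmatch("a b")))  # the space was dropped from the pattern
print(bool(inside.fullmatch(" ")))     # the space is part of the class
```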

Named backrefs in conditional expressions behave differently

In Ruby syntax, named backrefs in conditional expressions ((?(<name>)yes|no)) behave differently from normal named backrefs (\k<name>).
\k<name> checks the last group with the name first; if it did not match, it checks the previous one, and so on, until there is a match, as doc/RE describes:

Onigmo/doc/RE

Lines 321 to 323 in b334081

When backreferencing with a name that is assigned to more than one groups,
the last group with the name is checked first, if not matched then the
previous one with the name, and so on, until there is a match.

However, (?(<name>)yes|no) checks only the leftmost group with the name. This behavior of conditional expressions is the same as in Perl, but it is inconsistent for Ruby.
The same strategy as \k<name> should be used for Ruby syntax.

Related: #74

onig_init() is not thread safe

onig_init() is not thread safe.

extern int
onig_init(void)
{
  if (onig_inited != 0)
    return 0;

  THREAD_SYSTEM_INIT;
  THREAD_ATOMIC_START;

  onig_inited = 1;

The above block is not an atomic block, so THREAD_SYSTEM_INIT might be called more than once.

This may cause problems when calling onig_new() from multiple threads without calling onig_init() explicitly. doc/API says:

You don't have to call it explicitly, because it is called in onig_new().

But it's not true when calling onig_new() from multiple threads.
A workaround is calling onig_init() explicitly (from the main thread) before calling onig_new().
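
The underlying problem is a check-then-act race: the test of the flag and the initialization have to happen inside the same critical section. The following is a language-neutral sketch of that idea written in Python, not Onigmo code; OnceInit and its members are hypothetical names. In C, a facility such as pthread_once serves the same purpose.

```python
import threading

class OnceInit:
    """Run a setup function at most once, even with concurrent callers."""

    def __init__(self, setup):
        self._setup = setup
        self._lock = threading.Lock()
        self._done = False

    def ensure(self):
        # The flag is checked and set while holding the lock, so two threads
        # can never both observe "not done" and both run the setup.
        with self._lock:
            if not self._done:
                self._setup()
                self._done = True

calls = []
once = OnceInit(lambda: calls.append("init"))
threads = [threading.Thread(target=once.ensure) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls))
```

Taking the lock on every call is the simplest correct form; the fast path can be optimized later if it shows up in profiles.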

Use Direct Threaded Code

Direct threaded code can make a VM faster, but a GCC extension is needed to implement it. (Of course, VC doesn't support the extension.)
However, there is a way to support both GCC-compatible and non-GCC-compatible compilers in one code base.
E.g. picrin supports both direct threaded code and normal switch-case based code. (https://github.com/picrin-scheme/picrin/blob/master/extlib/benz/vm.c#L583)

See also: Rubyist Magazine - YARV Maniacs [Part 3]: Speeding Up Instruction Dispatch

Character properties ignore the ignore case flag

/(?i)\p{lower}\p{upper}/ =~ "Ab"       → no match
/(?i)[\p{lower}][\p{upper}]/ =~ "Ab"   → match

If they are in a character class, the flag is not ignored. This is inconsistent.
Oniguruma 5.9.5 also behaves like this.

Perl 5.16 doesn't ignore the flag.

Irregular capture after recursive match

Thank you very much for maintaining the excellent library.

I found the latest version of Onigmo suffers from the following incorrect behavior when used with a recursive expression.

/\(((?:[^(]|\g<0>)*)\)/ matches "(abc)(abc)" => OK 💚
matches[0] == (0, 5) corresponding to "(abc)" => OK 💚
matches[1] == (6, 4) corresponding to a string with negative length => NG 💔

/\(((?:[^()]|\g<0>)*)\)/ matches "((abc)(abc))" => OK 💚
matches[0] == (0, 12) corresponding to "((abc)(abc))" => OK 💚
matches[1] == (7, 11) corresponding to "abc)" => NG 💔

It seems that matches[].rm_so refers to the last capture while matches[].rm_eo refers to the top-level capture. I believe users will be happier if both of them refer to the top-level capture. When tested in Ruby, the former example returns an invalid string for $2, which causes an ArgumentError when it is passed to a function, as described at ruby/正規表現 (Regular Expressions).
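
To make the requested result concrete, here is a small sketch that computes the top-level spans for the two subjects above with a plain parenthesis counter. It does not use a regex engine at all, and top_level_capture is a hypothetical helper; it only shows which offsets "both ends refer to the top-level capture" would mean under that reading.

```python
def top_level_capture(text, start=0):
    """Return (whole_span, inner_span) for the balanced group at text[start].

    whole_span covers the outer parentheses; inner_span is the text between
    them, i.e. what group 1 would hold at the top level of the recursion.
    """
    if text[start] != "(":
        raise ValueError("expected '(' at start")
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return (start, i + 1), (start + 1, i)
    return None  # unbalanced input

print(top_level_capture("(abc)(abc)"))
print(top_level_capture("((abc)(abc))"))
```

For the first subject this gives (0, 5) and (1, 4); for the second, (0, 12) and (1, 11).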

[[:punct:]] and \p{Punct}

Perl's document (perlrecharclass) says that:

\p{PosixPunct} and [[:punct:]] in the ASCII range match all non-controls, non-alphanumeric, non-space characters: [-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]

The similarly named property, \p{Punct}, matches a somewhat different set in the ASCII range, namely [-!"#%&'()*,./:;?@[\\\]_{}]. That is, it is missing the nine characters [$+<=>^`|~].

In current Onigmo, [[:punct:]] and \p{Punct} are the same in the ASCII range, and they depend on the encoding.
If the encoding is a Unicode encoding, [[:punct:]] and \p{Punct} don't match the nine characters.
If the encoding is not a Unicode encoding, [[:punct:]] and \p{Punct} match the nine characters.

Is it OK?
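
The nine characters can be derived from Unicode data instead of being listed by hand. The sketch below takes the 32 ASCII punctuation characters from Python's string.punctuation and keeps those whose Unicode general category is not one of the P (Punctuation) categories. It illustrates the Unicode property only, not Onigmo's tables.

```python
import string
import unicodedata

# ASCII characters that POSIX treats as punctuation.
posix_punct = string.punctuation

# Those that Unicode classifies as Symbol (S*) rather than Punctuation (P*).
not_unicode_punct = [
    c for c in posix_punct if not unicodedata.category(c).startswith("P")
]

print(len(posix_punct))
print("".join(not_unicode_punct))
print(sorted({unicodedata.category(c) for c in not_unicode_punct}))
```

The characters left over are exactly the symbols (currency, math and modifier symbols) that a Unicode-based \p{Punct} excludes.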

Segmentation fault occurs when many groups are used

see: https://bugs.ruby-lang.org/issues/8716

Verified on Windows and OS X.

  • Steps to reproduce: ruby 2.0.0p247 (2013-06-27) [x64-mingw32]
a="()"
32767.times{a<<'()'}
eval "/#{a}/=~''"
  • Steps to reproduce: ruby 2.0.0p0 (2013-02-24 revision 39474) [x86_64-darwin12.2.1]
a="()"
(1<<21).times{a<<'()'}
eval "/#{a}/=~''"

STACK_INIT, which is called from match_at() in regexec.c, calls xalloca without considering the size, so the stack overflows.

Two warnings use incorrect source file and line when found in instance_eval

This is ruby bug #9344:

$ cat 9344.rb
code = <<-RUBY
x = /]]/
y = /[a-z]+*/
z = /[a-z]**/
RUBY
instance_eval code, 'foo.rb'
$ ruby 9344.rb
foo.rb:1: warning: regular expression has ']' without escape: /]]/
9344.rb:6: warning: nested repeat operator + and * was replaced with '*': /[a-z]+*/
9344.rb:6: warning: redundant nested repeat operator: /[a-z]**/

My solution was to stop using onig_verb_warn and instead use onig_syntax_warn. These are the last two references to onig_verb_warn, so then that function can be deleted too.

I attached a patch to the ruby bug. Should I open a pull request here?
