yandex / pire
Perl Incompatible Regular Expressions library
Home Page: http://github.com/dprokoptsev/pire/wiki
License: Other
This is PIRE, the Perl Incompatible Regular Expressions library. It is aimed at checking a huge amount of text against relatively many regular expressions. Roughly speaking, it can only check whether a given text matches a certain regexp, but it does so really fast (more than 400 MB/s on our hardware is common). Moreover, multiple regexps can be combined, making it possible to check the text against approximately 10 regexps in a single pass while maintaining the same speed. Since Pire examines each character only once, without any lookaheads or rollbacks, spending about five machine instructions per character, it can be used even in realtime tasks.

On the other hand, Pire has very limited functionality compared to other regexp libraries. It has no Perlish conditional regexps, no lookaheads or backtracking, no greedy/nongreedy matches, and no capturing facilities.

Pire was developed at Yandex (http://company.yandex.ru/) as a part of its web crawler. More information can be found in README.ru (in Russian), which is yet to be translated. Please report bugs to [email protected] or [email protected].
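The single-pass property described above can be illustrated with a toy table-driven automaton. This is only a sketch of the general DFA technique, not Pire's actual implementation; `ToyDfa`, `MakeAbDfa`, and `Accepts` are hypothetical names introduced for illustration, and the automaton is built by hand rather than compiled from a regexp.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// A toy DFA: one row of 256 next-state entries per state.
// (Illustrative only; this is not Pire's scanner layout.)
struct ToyDfa {
    std::vector<std::array<int, 256>> next;  // next[state][byte]
    std::vector<bool> accepting;             // accepting[state]
    int start = 0;
};

// Hand-built DFA for "the text contains the substring ab".
// State 0: nothing seen, 1: last byte was 'a', 2: "ab" seen (absorbing).
ToyDfa MakeAbDfa() {
    ToyDfa dfa;
    dfa.next.resize(3);
    dfa.accepting = {false, false, true};
    for (int c = 0; c < 256; ++c) {
        dfa.next[0][c] = (c == 'a') ? 1 : 0;
        dfa.next[1][c] = (c == 'b') ? 2 : (c == 'a') ? 1 : 0;
        dfa.next[2][c] = 2;
    }
    return dfa;
}

// Each input byte costs exactly one table lookup; no byte is revisited.
bool Accepts(const ToyDfa& dfa, const std::string& text) {
    int state = dfa.start;
    for (unsigned char c : text)
        state = dfa.next[state][c];
    return dfa.accepting[state];
}
```

Calling `Accepts(MakeAbDfa(), text)` walks the text once from left to right, which is why the cost per character stays constant no matter how complex the original expression was.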
Quick Start
=============

    #include <stdio.h>
    #include <string.h>   // strlen
    #include <iterator>   // std::back_inserter
    #include <vector>
    #include <pire/pire.h>

    Pire::NonrelocScanner CompileRegexp(const char* pattern)
    {
        // Transform the pattern from UTF-8 into UCS4
        std::vector<Pire::wchar32> ucs4;
        Pire::Encodings::Utf8().FromLocal(pattern, pattern + strlen(pattern), std::back_inserter(ucs4));

        return Pire::Lexer(ucs4.begin(), ucs4.end())
            .AddFeature(Pire::Features::CaseInsensitive())  // enable case insensitivity
            .SetEncoding(Pire::Encodings::Utf8())           // set input text encoding
            .Parse()                                        // create an FSM
            .Surround()                                     // PCRE_ANCHORED behavior
            .Compile<Pire::NonrelocScanner>();              // compile the FSM
    }

    bool Matches(const Pire::NonrelocScanner& scanner, const char* ptr, size_t len)
    {
        return Pire::Runner(scanner)
            .Begin()        // '^'
            .Run(ptr, len)  // the text
            .End();         // '$'
            // implicitly cast to bool
    }

    int main()
    {
        char re[] = "hello\\s+w.+d$";
        char str[] = "Hello world";

        Pire::NonrelocScanner sc = CompileRegexp(re);

        bool res = Matches(sc, str, strlen(str));

        printf("String \"%s\" %s \"%s\"\n", str, (res ? "matches" : "doesn't match"), re);

        return 0;
    }
1. Matches

pire/run.h:

    bool Matches(const Scanner& scanner, const char* begin, const char* end)
    {
        return Runner(scanner).Run(begin, end);
    }

README:

    bool Matches(const Pire::NonrelocScanner& scanner, const char* ptr, size_t len)
    {
        return Pire::Runner(scanner)
            .Begin()        // '^'
            .Run(ptr, len)  // the text
            .End();         // '$'
            // implicitly cast to bool
    }
Which one is correct?
If Begin() and End() are to be called, then patterns without ^ and $ match nothing:
When Begin() is called, it feeds the scanner a special begin char, moving it to dead state 1.
Compare this graph with the graph produced for the same pattern surrounded and optimised:
Does this mean that all patterns must begin with ^ and end with $? Are the Begin() and End() calls required? This should be clarified and documented.
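The behaviour reported above can be mimicked with a toy model. This is a sketch under stated assumptions, not Pire's code: it assumes that ^ and $ are translated into two extra input symbols, that Begin() and End() feed those symbols to the automaton, and that surrounding a pattern lets it skip any symbols before and after the match. The names `kBeginMark`, `kEndMark`, `ToSymbols`, `Framed`, `MatchesExact`, and `MatchesSurrounded` are hypothetical, and only literal patterns are modelled.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical symbol values for the two marks; real bytes are 0..255.
const int kBeginMark = 256;
const int kEndMark = 257;

// Translate a literal pattern into symbols: '^' and '$' become marks.
std::vector<int> ToSymbols(const std::string& pattern) {
    std::vector<int> out;
    for (unsigned char c : pattern)
        out.push_back(c == '^' ? kBeginMark : c == '$' ? kEndMark : c);
    return out;
}

// The input as the automaton sees it when Begin() and End() are called.
std::vector<int> Framed(const std::string& text) {
    std::vector<int> in{kBeginMark};
    for (unsigned char c : text)
        in.push_back(c);
    in.push_back(kEndMark);
    return in;
}

// Not surrounded: the whole framed input must equal the pattern,
// so a pattern without '^' dies on the very first symbol.
bool MatchesExact(const std::string& pattern, const std::string& text) {
    return ToSymbols(pattern) == Framed(text);
}

// Surrounded: the pattern may occur anywhere, so the marks can be skipped.
bool MatchesSurrounded(const std::string& pattern, const std::string& text) {
    const std::vector<int> p = ToSymbols(pattern);
    const std::vector<int> in = Framed(text);
    return std::search(in.begin(), in.end(), p.begin(), p.end()) != in.end();
}
```

In this model `MatchesExact("abc", "abc")` is false while `MatchesExact("^abc$", "abc")` is true, which mirrors the question raised here; `MatchesSurrounded` accepts both forms.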
2. pigrep

The pigrep program behaves like the latter Matches, calling Begin() and End(). It also surrounds its patterns. I removed the surrounding (by the way, it would be a useful option; grep has it as -x, --line-regexp) and got the following results:
$ echo -n 'abc' | pigrep 'abc'
$ echo -n 'abc' | pigrep '^abc'
$ echo -n 'abc' | pigrep 'abc$'
$ echo -n 'abc' | pigrep '^abc$'
abc
Summary of problems here:
Matches
pigrep
option equivalent to grep --line-regexp
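The difference between the default grep behaviour and grep --line-regexp can be sketched with the standard library. Note that this uses std::regex rather than Pire, purely to show the two semantics side by side; `LineContains` and `LineEquals` are hypothetical helper names.

```cpp
#include <regex>
#include <string>

// Substring semantics, like plain grep: the pattern may match anywhere.
bool LineContains(const std::string& line, const std::string& pattern) {
    return std::regex_search(line, std::regex(pattern));
}

// Whole-line semantics, like grep -x: the pattern must cover the line.
bool LineEquals(const std::string& line, const std::string& pattern) {
    return std::regex_match(line, std::regex(pattern));
}
```

A surrounded pattern corresponds to the first function and an unsurrounded one to the second, so an option that switches surrounding off would play the role of -x.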
I have a very simple regexp
std::string r;
r = "(post|get|put|delete).*http/1\\.(1|0)\r\n.*\r\n\r\n";
The test data is:
GET / HTTP/1.1
User-Agent: chrome
Host: ya.ru
Accept: */*
Proxy-Connection: Keep-Alive
It has the proper format, i.e. it contains \r\n after each line and an extra \r\n after the headers. The scanner is created with the following function:
Pire::Scanner flow::scannerFor( std::string regexp ){
    if ( !regexp.size() ) {
        throw common::error( _ERR_EMPTY_REGEXP );
    }
    std::vector<Pire::wchar32> pattern;
    Pire::Scanner s;
    try {
        Pire::Encodings::Utf8().FromLocal( regexp.c_str(), regexp.c_str() + regexp.size(), std::back_inserter(pattern) );
        s = Pire::Lexer( pattern.begin(), pattern.end() )
            .SetEncoding( Pire::Encodings::Utf8() )
            .AddFeature( Pire::Features::CaseInsensitive() )
            .Parse()
            .Surround()
            .Compile<Pire::Scanner>();
    } catch ( ... ) {
        throw common::error( _ERR_REGEXP_COMPILE
            .arg( regexp )
        );
    }
    return s;
}
When I run the scanner compiled from my regexp against the test data, it fails to match. If I comment out the line
    .SetEncoding( Pire::Encodings::Utf8() )
in the scanner creation function, the scanner starts to match.
Could you comment on this situation?
Other scanner types have this method, and SlowScanner can implement it easily. I am writing template code parametrized by scanner type, in which I need LettersCount.
MIT or Apache.
Is it possible?
$ echo test | pigrep '^test$'
test
$ echo test | pigrep '^test$' -
(stdin)test
Hi!
Thank you for such a cool library! I'm interested in packaging the library for some dependency managers. It would be much easier if you made a release on GitHub, so that a package recipe could rely on a "stable" version instead of a specific commit. For example, it is much easier to create a Conan package from a release.
I found that there have been a lot of changes since the last release (in 2013, 7 years ago). It would be great if they were released.
Thank you!
The scanner representation is not portable between systems with different byte order and potentially different word size; hence once a regexp is compiled and inlined, the resulting .cpp must be compiled for the same system where pire_inline was run. As a result, no cross-compiling is possible.
We need to invent a way to specify target platform capabilities and serialize the regexp for that platform.
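One common way to make a serialized table independent of the host byte order is to write every word in a fixed order, byte by byte. The following is a minimal sketch of that general technique only; it does not describe Pire's scanner format or the output of pire_inline, and `PutLE32` and `GetLE32` are hypothetical helpers. It also fixes the word size at 32 bits, which is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 32-bit value as four bytes, least significant byte first,
// regardless of the byte order of the machine running this code.
void PutLE32(std::vector<unsigned char>& out, std::uint32_t value) {
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<unsigned char>((value >> shift) & 0xFF));
}

// Read back a value written by PutLE32, starting at offset pos.
// The caller must ensure that pos + 4 does not exceed in.size().
std::uint32_t GetLE32(const std::vector<unsigned char>& in, std::size_t pos) {
    std::uint32_t value = 0;
    for (int i = 0; i < 4; ++i)
        value |= static_cast<std::uint32_t>(in[pos + i]) << (8 * i);
    return value;
}
```

Because both functions use shifts instead of reinterpreting memory, the bytes produced on a big-endian build host decode to the same values on a little-endian target, at the cost of a conversion step when the table is loaded.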
Hello, I am considering replacing the RE2 library with PIRE in a high-load application that works with a stream. I am stuck on the regexp syntax; could you clarify which one I need to use?
GitHub now has Downloads and Releases as two separate entities. There are no links to Downloads from the main page of the repo. "Releases" are currently simple snapshots of tags. In particular, files in releases are not autoreconf'ed, so they lack the "configure" file. Downloads are outdated.
Look in the instructions:
On *nix, from the tarball:
$ ./configure && make && sudo make install
Obviously, these instructions do not work for a tarball from Releases.
How to solve:
pire/fsm.h:
/// Creates an FSM which matches any suffix of any word current FSM matches.
void MakePrefix();
/// Creates an FSM which matches any suffix of any word current FSM matches.
void MakeSuffix();
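The two doc comments above are identical, although the method names suggest different operations. The distinction between prefixes and suffixes of a single word can be sketched as follows; this only illustrates the terms and makes no claim about what MakePrefix() and MakeSuffix() do internally, or about whether they include the empty string. `Prefixes` and `Suffixes` are hypothetical helpers.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// All prefixes of word, shortest first (including "" and word itself).
std::vector<std::string> Prefixes(const std::string& word) {
    std::vector<std::string> out;
    for (std::size_t n = 0; n <= word.size(); ++n)
        out.push_back(word.substr(0, n));
    return out;
}

// All suffixes of word, longest first (including word itself and "").
std::vector<std::string> Suffixes(const std::string& word) {
    std::vector<std::string> out;
    for (std::size_t n = 0; n <= word.size(); ++n)
        out.push_back(word.substr(n));
    return out;
}
```

For the word "abc", the prefixes are "", "a", "ab", "abc" and the suffixes are "abc", "bc", "c", "".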
yaoweibin@ubuntu:~/test/pire$ uname -a
Linux ubuntu 2.6.28-11-server #42-Ubuntu SMP Fri Apr 17 02:45:36 UTC 2009 x86_64 GNU/Linux
yaoweibin@ubuntu:~/test/pire$ g++ -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.3.3-5ubuntu4' --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --enable-mpfr --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
yaoweibin@ubuntu:~/test/pire$ make
make all-recursive
make[1]: Entering directory `/home/yaoweibin/test/pire'
Making all in pire
make[2]: Entering directory `/home/yaoweibin/test/pire/pire'
make all-am
make[3]: Entering directory `/home/yaoweibin/test/pire/pire'
/bin/bash ../ylwrap inline.lpp .c inline.cpp -- :
make[3]: *** [inline.cpp] Error 1
make[3]: Leaving directory `/home/yaoweibin/test/pire/pire'
make[2]: *** [all] Error 2
make[2]: Leaving directory `/home/yaoweibin/test/pire/pire'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/yaoweibin/test/pire'
make: *** [all] Error 2
#include <iostream>
#include <pire/pire.h>

int main() {
    Pire::Lexer lexer("abc");
    // The same lexer is parsed twice; the two scanners are expected to be equivalent.
    Pire::Scanner s1 = lexer.Parse().Compile<Pire::Scanner>();
    Pire::Scanner s2 = lexer.Parse().Compile<Pire::Scanner>();
    const char* text = "abc";
    std::cout << "abc "
        << Pire::Matches(s1, text, text + 3) << ' '
        << Pire::Matches(s2, text, text + 3) << std::endl;
    std::cout << "ab "
        << Pire::Matches(s1, text, text + 2) << ' '
        << Pire::Matches(s2, text, text + 2) << std::endl;
    std::cout << " "
        << Pire::Matches(s1, text, text + 0) << ' '
        << Pire::Matches(s2, text, text + 0) << std::endl;
}
Output:
abc 1 0
ab 0 0
0 1
Expected output:
abc 1 1
ab 0 0
0 0
What version of Unicode is used in the library?
This method is trivial to implement. It would return length of the pattern, which is useful information.
How to reproduce:
git clone https://github.com/yandex/pire
cd pire
autoreconf --install
./configure
make all check
make distcheck
tar -xf pire-0.0.5.tar.gz
cd pire-0.0.5/
./configure
make
Error:
$ make
make all-recursive
make[1]: Entering directory `.../pire/pire-0.0.5'
Making all in pire
make[2]: Entering directory `.../pire/pire-0.0.5/pire'
make[2]: *** No rule to make target `re_parser.y', needed by `re_parser.cpp'. Stop.
make[2]: Leaving directory `.../pire/pire-0.0.5/pire'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `.../pire/pire-0.0.5'
make: *** [all] Error 2
I have flex and bison installed.
OS: Ubuntu 14.04.2 LTS
Values stored in m_actions are copies of the .action members of the Transition structures stored in m_jumps.
How to build it on Windows? It always fails with the following error:
pire\re_lexer.cpp(29): fatal error C1083: Cannot open include file: 're_parser.h': No such file or directory