This project forked from gliwka/hyperscan-java

Match tens of thousands of regular expressions within milliseconds - Java bindings for Intel's hyperscan 5

hyperscan-java

hyperscan is a high-performance multiple regex matching library.

It uses hybrid automata techniques to allow simultaneous matching of large numbers (up to tens of thousands) of regular expressions and for the matching of regular expressions across streams of data.

This project is a third-party wrapper for hyperscan that lets developers integrate it into their Java (JVM) based projects.

Add it to your project

This project is available on Maven Central.

The version number consists of two parts (e.g. 5.4.0-2.0.0). The first part specifies the hyperscan version (5.4.0), the second part the version of the wrapper (2.0.0).
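As a small illustration of that scheme, the sketch below splits a version string at the first dash. The class and field names (ArtifactVersion, hyperscanVersion, wrapperVersion) are made up for this example and are not part of the wrapper.

```java
/** Hypothetical helper: splits a version such as "5.4.0-2.0.0" into its two parts. */
final class ArtifactVersion {
    final String hyperscanVersion; // part before the first '-'
    final String wrapperVersion;   // part after the first '-'

    private ArtifactVersion(String hyperscanVersion, String wrapperVersion) {
        this.hyperscanVersion = hyperscanVersion;
        this.wrapperVersion = wrapperVersion;
    }

    static ArtifactVersion parse(String version) {
        int dash = version.indexOf('-');
        // Reject strings with no dash, a leading dash, or nothing after the dash.
        if (dash <= 0 || dash == version.length() - 1) {
            throw new IllegalArgumentException("Expected <hyperscan>-<wrapper>, got: " + version);
        }
        return new ArtifactVersion(version.substring(0, dash), version.substring(dash + 1));
    }

    public static void main(String[] args) {
        ArtifactVersion v = ArtifactVersion.parse("5.4.0-2.0.0");
        // prints: hyperscan 5.4.0, wrapper 2.0.0
        System.out.println("hyperscan " + v.hyperscanVersion + ", wrapper " + v.wrapperVersion);
    }
}
```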

Maven

<dependency>
    <groupId>com.gliwka.hyperscan</groupId>
    <artifactId>hyperscan</artifactId>
    <version>5.4.0-2.0.0</version>
</dependency>

Gradle

implementation group: 'com.gliwka.hyperscan', name: 'hyperscan', version: '5.4.0-2.0.0'

sbt

libraryDependencies += "com.gliwka.hyperscan" % "hyperscan" % "5.4.0-2.0.0"

Usage

If you want the full power of the Java Regex API / full PCRE syntax and are fine with sacrificing some performance, use the PatternFilter. It takes a large list of java.util.regex.Pattern instances and uses hyperscan to filter it down to the few Patterns that are likely to match. You can then use the regular Java API to confirm those matches. This is similar to chimera, only using the standard Java API instead of libpcre.

If you need the highest performance, use the hyperscan API directly. Be aware that only a subset of the PCRE syntax is supported. Missing features include backreferences, capture groups and backtracking verbs. The matching behaviour is also a little different; see the semantics chapter of the hyperscan docs.
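To make the syntax limitation concrete, here is a minimal sketch of a pattern that depends on a backreference. It uses only java.util.regex and does not call hyperscan at all; the class and method names are invented for the example. A pattern like this is the kind that would have to go through the PatternFilter route instead of being compiled as a hyperscan Expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BackreferenceDemo {
    // A word followed by whitespace and the same word again.
    // The \1 refers back to whatever group 1 captured, which is a backreference.
    static final Pattern REPEATED_WORD = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    /** Returns the first immediately repeated word, or null if there is none. */
    static String firstRepeatedWord(String input) {
        Matcher m = REPEATED_WORD.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("this is is a test")); // prints: is
        System.out.println(firstRepeatedWord("no repeats here"));   // prints: null
    }
}
```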

Examples

Use of the PatternFilter

List<Pattern> patterns = asList(
        Pattern.compile("The number is ([0-9]+)", Pattern.CASE_INSENSITIVE),
        Pattern.compile("The color is (blue|red|orange)")
        // and thousands more
);

//not thread-safe, create per thread
PatternFilter filter = new PatternFilter(patterns);

//this list now only contains the probably matching patterns, in this case the first one
List<Matcher> matchers = filter.filter("The number is 7 the NUMber is 27");

//now we use the regular java regex api to check for matches - this is not hyperscan specific
for(Matcher matcher : matchers) {
    while (matcher.find()) {
        // will print 7 and 27
        System.out.println(matcher.group(1));
    }
}
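The filter-then-confirm idea can also be sketched with the standard library alone. The stand-in below is not how PatternFilter works internally (that uses hyperscan); it replaces the first stage with a plain substring check, on the assumption that each pattern comes with a literal every match must contain. All names in it (TwoStageMatch, CANDIDATES, prefilter, confirmedGroups) are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoStageMatch {
    // Assumption for this sketch: each pattern is paired with a lower-case
    // literal that any match of that pattern must contain.
    static final Map<Pattern, String> CANDIDATES = new LinkedHashMap<>();
    static {
        CANDIDATES.put(Pattern.compile("The number is ([0-9]+)", Pattern.CASE_INSENSITIVE), "the number is ");
        CANDIDATES.put(Pattern.compile("The color is (blue|red|orange)"), "the color is ");
    }

    // Stage 1: cheap and approximate. It may keep patterns that do not match,
    // but it never drops a pattern that does.
    static List<Matcher> prefilter(String input) {
        String lowered = input.toLowerCase(Locale.ROOT);
        List<Matcher> survivors = new ArrayList<>();
        for (Map.Entry<Pattern, String> candidate : CANDIDATES.entrySet()) {
            if (lowered.contains(candidate.getValue())) {
                survivors.add(candidate.getKey().matcher(input));
            }
        }
        return survivors;
    }

    // Stage 2: exact confirmation with the regular Java regex API.
    static List<String> confirmedGroups(String input) {
        List<String> groups = new ArrayList<>();
        for (Matcher matcher : prefilter(input)) {
            while (matcher.find()) {
                groups.add(matcher.group(1));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints: [7, 27]
        System.out.println(confirmedGroups("The number is 7 the NUMber is 27"));
        // The prefilter keeps the color pattern here, but confirmation rejects it; prints: []
        System.out.println(confirmedGroups("The color is green"));
    }
}
```

The second call shows why the confirmation step is needed: the first stage only narrows the candidate set, so a surviving pattern is a probable match, not a guaranteed one.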

Direct use of hyperscan

import com.gliwka.hyperscan.wrapper.*;

...

//we define a list containing all of our expressions
LinkedList<Expression> expressions = new LinkedList<Expression>();

//the first argument in the constructor is the regular pattern, the second one is an expression flag
//make sure you read the original hyperscan documentation to learn more about flags
//or browse the ExpressionFlag.java in this repo.
expressions.add(new Expression("[0-9]{5}", EnumSet.of(ExpressionFlag.SOM_LEFTMOST)));
expressions.add(new Expression("Test", ExpressionFlag.CASELESS));


//we precompile the expression into a database.
//you can compile single expression instances or lists of expressions

//since we're interacting with native handles always use try-with-resources or call the close method after use
try(Database db = Database.compile(expressions)) {
    //initialize scanner - one scanner per thread!
    //same here, always use try-with-resources or call the close method after use
    try(Scanner scanner = new Scanner())
    {
        //allocate scratch space matching the passed database
        scanner.allocScratch(db);


        //provide the database and the input string
        //returns a list with matches
        //synchronized method, only one execution at a time (use more scanner instances for multithreading)
        List<Match> matches = scanner.scan(db, "12345 test string");

        //matches always contain the expression causing the match and the end position of the match
        //the start position and the matched string itself are only part of a match if the
        //SOM_LEFTMOST flag is set (for more details refer to the original hyperscan documentation)
    }

    // Save the database to the file system for later use
    try(OutputStream out = new FileOutputStream("db")) {
        db.save(out);
    }

    // Later, load the database back in. This is useful for large databases that take a long time to compile.
    // You can compile them offline, save them to a file, and then quickly load them in at runtime.
    // The load has to happen on the same type of platform as the save.
    try (InputStream in = new FileInputStream("db");
         Database loadedDb = Database.load(in)) {
        // Use the loadedDb as before.
    }
}
catch (CompileErrorException ce) {
    //gets thrown during compile in case something is wrong with the expression
    //you can retrieve the expression causing the exception like this:
    Expression failedExpression = ce.getFailedExpression();
}
catch(IOException ie) {
  //IO during serializing / deserializing failed
}
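The "one scanner per thread" rule in the comments above can be sketched without the native library. The example below uses a hypothetical StandInScanner in place of the real Scanner, so none of its methods reflect the wrapper's API. Each submitted task creates and closes its own instance, which is the simplest way to guarantee that no instance is shared between threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class PerThreadResourceDemo {
    /** Hypothetical stand-in for a closeable resource that must not be shared across threads. */
    static final class StandInScanner implements AutoCloseable {
        static final AtomicInteger OPEN = new AtomicInteger(); // instances not yet closed
        StandInScanner() { OPEN.incrementAndGet(); }
        int scan(String input) { return input.length(); } // placeholder for real work
        @Override public void close() { OPEN.decrementAndGet(); }
    }

    static List<Integer> scanAll(List<String> inputs, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String input : inputs) {
                futures.add(pool.submit(() -> {
                    // The instance lives and dies inside this task, so it is never shared.
                    try (StandInScanner scanner = new StandInScanner()) {
                        return scanner.scan(input);
                    }
                }));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get()); // results come back in submission order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(scanAll(List.of("abc", "hello"), 2)); // prints: [3, 5]
        System.out.println(StandInScanner.OPEN.get());           // prints: 0
    }
}
```

Creating an instance per task trades some setup cost for simplicity. Keeping one instance per worker thread would avoid the repeated setup, but then something has to close each instance explicitly when the thread finishes.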

Native libraries

This wrapper ships with pre-compiled hyperscan binaries for Windows, Linux (glibc >= 2.12) and macOS on x86_64 CPUs. You can find the repository with the native libraries here.

Documentation

The hyperscan developer reference explains hyperscan. The javadoc is located here.

Changelog

See here.

Contributing

Feel free to raise issues or submit a pull request.

Credits

Shoutout to @eliaslevy, @krzysztofzienkiewicz and @swapnilnawale for all the great contributions.

Thanks to Intel for opensourcing hyperscan!

License

BSD 3-Clause License
