Giter Site home page Giter Site logo

aws / random-cut-forest-by-aws Goto Github PK

View Code? Open in Web Editor NEW
201.0 10.0 32.0 2.81 MB

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.

Home Page: https://github.com/aws/random-cut-forest-by-aws

License: Apache License 2.0

Java 81.52% Rust 18.48%
algorithms anomalydetection streaming

random-cut-forest-by-aws's People

Contributors

alolita avatar amazon-auto avatar amitgalitz avatar ankane avatar arunsathiya avatar cswiercz avatar dblock avatar dependabot[bot] avatar gardner avatar jdecuyper avatar jmazanec15 avatar jotok avatar kaituo avatar prateekiiest avatar sudiptoguha avatar wnbts avatar ylwu-amzn avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

random-cut-forest-by-aws's Issues

Review IUpdatable interface

This is an open-ended task to review and possibly change the abstractions used in IUpdateCoordinator and IUpdatable. The UpdateReturn class in particular was added as part of a larger restructuring change and didn't get a lot of scrutiny when it was introduced.

Clean up abstract trees

  • AbstractRandomCutTreeFields should be protected instead of public
  • AbstractCompactRandomCutTree sets the enableCache variable, even though it's already set in the super constructor
  • The second constructor in AbstractCompactRandomCutTree (the one used to restore from state) should be able to restore sequence indexes.
  • Should enableCache be changed to enableBoundingBoxCache, or is there a plant o reuse that variable for other kinds of caching?

Signed zero causes NullPointerException

Running the unit test can reproduce the issue:

RandomCutForestFunctionalTest:

    // Use this ArgumentsProvider to run a test on both single-threaded and
    // multi-threaded forests
    static class TestForestProvider2 implements ArgumentsProvider {
        @Override
        public Stream<? extends Arguments> provideArguments(ExtensionContext context) throws Exception {
            RandomCutForest forest = RandomCutForest.builder().numberOfTrees(numberOfTrees).sampleSize(sampleSize)
                    .dimensions(1).randomSeed(randomSeed).centerOfMassEnabled(true)
                    .storeSequenceIndexesEnabled(true).parallelExecutionEnabled(false).build();
            return Stream.of(forest).map(Arguments::of);
        }
    }

    @ParameterizedTest
    @ArgumentsSource(TestForestProvider2.class)
    public void testSideEffectsC(RandomCutForest forest) {
        /* the changes to score and attribution should be in sync */
        forest.update(new double[] { 0.0 });
        forest.update(new double[] { -0.0 });
    }

NullPointerException will be thrown:

java.lang.NullPointerException: null
        at com.amazon.randomcutforest.tree.Cut.isLeftOf(Cut.java:52) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.Node.isLeftOf(Node.java:134) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:348) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:272) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.TreeUpdater.update(TreeUpdater.java:82) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.SequentialForestTraversalExecutor.lambda$update$0(SequentialForestTraversalExecutor.java:38) ~[random-cut-forest-1.0.jar:?]
        at java.util.ArrayList.forEach(ArrayList.java:1507) ~[?:?]
        at com.amazon.randomcutforest.SequentialForestTraversalExecutor.update(SequentialForestTraversalExecutor.java:37) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.AbstractForestTraversalExecutor.update(AbstractForestTraversalExecutor.java:50) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.RandomCutForest.update(RandomCutForest.java:291) ~[random-cut-forest-1.0.jar:?]

RCF does not consider new double[] { 0.0 } and new double[] { -0.0 } to be equal since it uses Arrays.equals to compare them. But actually 0.0 == -0.0. These arrays should be equal. This causes RandomCutTree.addPoint to access Node.isLeftOf where cut is null.

Review naming conventions for tree update methods

  • increaseMassOfAncestorsAndItself is a little awkward
  • constructBoxInPlace is ambiguous, because construct makes it sound like a new allocation, but InPlace suggests that it's updating the existing box (which is true).
  • We have both recoompute and reCompute in different places.

Add support for single precision to state/mapper classes

Currently the point store state class has an array of doubles to store points. When single precision is enabled, this could add unnecessary bloat to serialization. We are considering two options for how to fix this:

  • (preferred) Change the array of doubles to an array of bytes and add a flag for precision to the state class.
  • Keep two arrays in the state class, one array of doubles and another array of floats, and use the applicable array during serialization.

Visitors should not depend on concrete classes

Visitors should only have access to interfaces (e.g., INodeView, IBoundingBoxView) and not concrete classes. Visitors should be completely data structure agnostic.

When closing this issue we should also remove the copyBoxToDouble() method from the IBoundingBox interface.

Test failures with OpenJDK 13

Unit tests are failing when being run under OpenJDK 13 with the following error message:

[ERROR] testImputeMissingValuesWithNoMissingValues  Time elapsed: 0.001 s  <<< ERROR!
java.lang.RuntimeException: Internal error: Failed to find the "modifiers" field in method setInternalState.
	at com.amazon.randomcutforest.RandomCutForestTest.setUp(RandomCutForestTest.java:82)
Caused by: java.lang.NoSuchFieldException: modifiers
	at com.amazon.randomcutforest.RandomCutForestTest.setUp(RandomCutForestTest.java:82)
$ java --version
openjdk 13.0.2 2020-01-14
OpenJDK Runtime Environment (build 13.0.2+8)
OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode, sharing)

Reduce or eliminate use of `null` when updating tree structure

In the original pointer based tree implementation, we used null values to indicate the absence of a parent or sibling node. In the new compact implementation, we use int and short indexes instead of pointers, and we define a constant NULL to indicate absence. Mixing null and NULL makes code hard to follow and maintain.

This task is to follow up on #103 to improve code maintainability by reducing or eliminating the use of null when updating tree states.

Move exception to main thread in ParallelForestTraversalExecutor

There are two places inside ParallelForestTraversalExecutor where a lambda being executed on a threadpool can throw an exception. We should move this exception outside the lambda to the main execution thread. This should give more predictable behavior and a better stack trace.

Fixing the caching logic

There is a bug in the bounding box caching logic that makes it so that you get different results in the anomaly score algorithm.

Add microbenchmarks to package

Add JMH microbenchmarks to the package. Maven should build an executable jar with the benchmark code. Having the benchmarking code will allow us to measure the performance impact of other changes.

Proposal: new directory structure to multiple language implementations

We are currently developing a second implementation of RCFs in C. We intend to support both the Java and C versions going forward, since the two implementations have different tradeoffs. This issue is to discuss how we should reorganize the project directory to accomodate both implementations.

Proposal 1: Root level language directories

c/
java/
python/

With this approach, the existing Maven project would be moved to a new java/ subdirectory, and we would add a top-level c/ directory to contain the C implementation. We could then continue this pattern for other language bindings. For example, the python/ directory could contain a Python project with bindings to the C library.

If we take this approach we'll have to create a new top-level README explaining the setup and possibly giving some information about RCF the algorithm. We'll also have to double-check that we can execute our github workflow from a subdirectory.

Proposal 2: Separate repositories

This proposal is to create a new repository for the C implementation. This approach would all both repos to conform to the standard project layout wherein the root directory for the project contains the build configuration files for your project.

By taking this approach we would lose some discoverability between the two implementations, and it would be less clear that the two implementations conform to a single "vision" for RCFs.

Also considered: Add the C library to a directory in the Maven project

benchmark/
core/
...
rcfc/

In this approach, the folder containing the C code would live in the Maven project, but it would be ignored by the Maven build. I think the main drawback here is that the C and Java versions would not appear as different-but-equally-good implementations. People who get to the repository via search might not even notice that there is a C implementation.

Does the RandomCutForest class calculate CollusiveDisplacement?

Hi all,

May I check whether the RandomCutForest class returns the CollusiveDisplacement score by default?

I have looked through the code and documentation and could not really figure out what get_anomaly_score is exactly returning. If it is returning Displacement, is it possible to get a code snippet on how to configure it to return CollusiveDisplacement instead?

Thank you!

[Help] Testing AnomalyScore Runner on input stream

Hi @sudiptoguha @jotok

I am currently trying to run the AnomalyScore runner using the CLI command. However, I am not able to figure out what is the exact format of the input stream/input file.

System.out.println("Reading from stdin... (Ctrl-c to exit)");

Also using through CLI as mentioned here.

https://github.com/aws/random-cut-forest-by-aws/tree/master/Java#build-command-line-cli-usage

An example would thus really help in this regard.

Thanks
Prateek

Support mapping to state class without copy

Currently, all mappers copy data when creating state classes. We should give the user the option to create a state class from a reference. This will make serialization more efficient in the case where the user is willing to take responsibility for integrity of the memory.

Standardize constructors for mapper classes

Some of the mapper classes are starting to take one or two constructors, for things like optionally saving tree state or validating the heap property in a compact sampler. Proposal: all mappers have a 0-argument constructor, and we set options using setters (see RandomCutForestMapper for an example). This approach is extensible if we decide to add new options later.

Precision setting should be an enum

We are allowing the user to decide to using single or double precision floating point numbers internally. Currently this is implemented as a boolean flag. I think this setting would make more sense if we defined a enum for precision:

public enum Precision {
  SINGLE, DOUBLE;
}
RandomCutForest forest = RandomCutForest.builder()
  .precision(Precision.SINGLE)
  .build();

I think this is more self-explanatory, and it leaves the possibility to support another level of precision (e.g. BigDecimal) if think it would be useful.

deprecate master branch

The name "master" is founded on inequality. Since it is used as the name of the most current branch, it is inevitable for contributors to use it for pulling/pushing. For current or potential contributors, this may be or may have been a source of discomfort/rejection, against Code of Conduct. Neutral names such as "main" have been proposed and used.

The RFC provides reasonable arguments against this language. Github , as well as many others , are taking the actions to end the use of the term.

Add method to update executor to perform more generic tree updates

The update executor class contains one method for updating models by submitting points to them. We recently added a setting to trees to control the fraction of bounding boxes that are stored in memory. This is a setting that we might want to change at runtime. We need a way to request a tree settings change like this. The tree executor is a good candidate for this functionality, but we'll have to figure out a way to deal with different tree implementations possibly having different settings (and hence different setter methods in their interfaces).

Review default forest parameters

Before pushing to Maven, let's review and update the default forest parameters. Current defaults:

Parameter Name Type Description Default Value
centerOfMassEnabled boolean If true, then tree nodes in the forest will compute their center of mass as part of tree update operations. false
dimensions int The number of dimensions in the input data. Required, no default value
lambda double The decay factor (lambda value) used by stream samplers in this forest. 1e-5
numberOfTrees int The number of trees in this forest. 100
outputAfter int The number of points required by stream samplers before results are returned. 0.25 * sampleSize
parallelExecutionEnabled boolean If true, then the forest will create an internal threadpool. Forest updates and traversals will be submitted to this threadpool, and individual trees will be updated or traversed in parallel. true
randomSeed long A seed value used to initialize the random number generators in this forest.
sampleSize int The sample size used by stream samplers in this forest 256
storeSequenceIndexesEnabled boolean If true, then sequence indexes (ordinals indicating when a point was added to a tree) will be stored in the forest along with poitn values. false
threadPoolSize int The number of threads to use in the internal threadpool. Number of available processors - 1
windowSize int An alternate way of specifying the lambda value. Using this parameter will set lambda to 1 / windowSize.

To discuss:

  • Given what we know about convergence of anomaly scores, 100 trees is probably to many for a default. Proposal: 50.
  • Should we change the default for parallel execution to false? From our benchmark results for 50 trees, parallel execution only consistently wins for 256 dimensional points.
  • The lambda value in this code was taken from an older implementation and not re-examined. Can we pick a better default? What guidance can we give to users to help them pick a meaningful value?
  • I like the idea of windowSize as a more user-friendly way of setting lambda, but does it have a precise mathematical interpretation that we can explain? If not, could we come up with a better choice for a user-friendly setting? For example, something like: "if you set this parameter to 100, then the probability that more than half of your in-sample points comes from the last 100 submitted points is greater than 90%".

Add SerDe tests for noncompact forest

The existing SerDe tests fail for non-compact forests. We believe the reason is that the tests aren't correctly accounting for statistical variation in the deserialized forests. This task is to review the existing tests and either fix them to handle the noncompact case or to add new tests for that case.

Configure the Maven checkstyle plugin to use the suppressions file

mvn checkstyle:check currently reports over 2000 errors. We have a checkstyle-suppressions.xml configuration file, which we reference in the checkstyle configuration in pom.xml, but it looks like checkstyle is not picking it up. We were previously able to run checkstyle with this suppressions file under a different build system.

Ready for production

Hi,

We are building a multi-cloud anomaly detector application with two deployment options for our SaaS ops. We plan to use Random Cut Forest in Sagemaker when running in AWS while use this opensource when running on-premise or private cloud due to data sensitivity issues when we cannot send data to cloud through estimator APIs when invoking Sagemaker APIs.

Would you please let me know if random-cut-forest-by-aws (this repo) is ready for production and any differences from Sagemaker in terms of functionality such as explainability and real-time streaming use cases? I saw a note [in Development] in the title and hence wondering the status and when it will be production ready. Again great work on this repo.

Thanks.

Move cleanCopy method to forest

The AbstractForestUpdateExecutor class defines the cleanCopy method, which is responsible for mapping -0.0 to +0.0 before submitting a point to the tree (and it's possible that we could add other normalizations in the future). We should move this method into RandomCutForest, so that can call it immediately when a point is submitted to the model. In addition, we can reuse the method to normalize points in traversal methods.

TreeUpdaterBenchmark fails when sequence indexes are enabled

Running TreeUpdateBenchmark methods with sequence indexes enabled results in the exception java.lang.IllegalStateException: Error in sequence index. Inconsistency in trees in delete step.

To reproduce, checkout the benchmarking branch (or mainline, after #16 us merged) and run

% java -jar target/random-cut-forest-1.0-benchmarks.jar TreeUpdaterBenchmark\.update

This will run the update benchmark using different combinations of parameters. The benchmark fails with the above exception each time storeSequenceIndexesEnabled = true is used.

Remove the Sequential interface

The Sequential interface was introduced in an early 2.0 refactoring. At that point we thought that the input to IUpdatable could be completely generic. We're walking this back now, because the Sequential class was basically a wrapper around a pair of values, and it only existed between the update call and the tree add-point method. Since all update code paths are following the same pattern, we've decided to revert to passing sequence index as a method argument instead of passing it via a short-lived object.

Intermittent failure in RandomCutForestSerDeTests

I'm seeing an occasional failuree in the RandomCutForestSerDeTests class. The tests usually pass on re-run. I'm creating this issue to track occurrences.

Example

[ERROR] Tests run: 16, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 22.914 s <<< FAILURE! - in com.amazon.randomcutforest.serialize.RandomCutForestSerDeTests
[ERROR] toJsonString{int, int, int, int, int, int, int}[9]  Time elapsed: 0.55 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <1.394635311942881> but was: <1.8229730367398675>
        at com.amazon.randomcutforest.serialize.RandomCutForestSerDeTests.toJsonString(RandomCutForestSerDeTests.java:75)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.