aws / random-cut-forest-by-aws Goto Github PK

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.

Home Page: https://github.com/aws/random-cut-forest-by-aws

License: Apache License 2.0

Java 81.52% Rust 18.48%

algorithms anomalydetection streaming

random-cut-forest-by-aws's People

Contributors

Stargazers

Watchers

random-cut-forest-by-aws's Issues

Review IUpdatable interface

This is an open-ended task to review and possibly change the abstractions used in IUpdateCoordinator and IUpdatable. The UpdateReturn class in particular was added as part of a larger restructuring change and didn't get a lot of scrutiny when it was introduced.

Review and update Javadocs

Review the javadocs in the package to fix inaccuracies and improve quality.

Remove the getNodeView method from INode

It shouldn't be a node's responsibility to return a node view for an arbitrary node in the tree.

Clean up abstract trees

AbstractRandomCutTreeFields should be protected instead of public
AbstractCompactRandomCutTree sets the enableCache variable, even though it's already set in the super constructor
The second constructor in AbstractCompactRandomCutTree (the one used to restore from state) should be able to restore sequence indexes.
Should enableCache be changed to enableBoundingBoxCache, or is there a plant o reuse that variable for other kinds of caching?

Signed zero causes NullPointerException

Running the unit test can reproduce the issue:

RandomCutForestFunctionalTest:

    // Use this ArgumentsProvider to run a test on both single-threaded and
    // multi-threaded forests
    static class TestForestProvider2 implements ArgumentsProvider {
        @Override
        public Stream<? extends Arguments> provideArguments(ExtensionContext context) throws Exception {
            RandomCutForest forest = RandomCutForest.builder().numberOfTrees(numberOfTrees).sampleSize(sampleSize)
                    .dimensions(1).randomSeed(randomSeed).centerOfMassEnabled(true)
                    .storeSequenceIndexesEnabled(true).parallelExecutionEnabled(false).build();
            return Stream.of(forest).map(Arguments::of);
        }
    }

    @ParameterizedTest
    @ArgumentsSource(TestForestProvider2.class)
    public void testSideEffectsC(RandomCutForest forest) {
        /* the changes to score and attribution should be in sync */
        forest.update(new double[] { 0.0 });
        forest.update(new double[] { -0.0 });
    }

NullPointerException will be thrown:

java.lang.NullPointerException: null
        at com.amazon.randomcutforest.tree.Cut.isLeftOf(Cut.java:52) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.Node.isLeftOf(Node.java:134) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:348) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:349) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:272) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.TreeUpdater.update(TreeUpdater.java:82) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.SequentialForestTraversalExecutor.lambda$update$0(SequentialForestTraversalExecutor.java:38) ~[random-cut-forest-1.0.jar:?]
        at java.util.ArrayList.forEach(ArrayList.java:1507) ~[?:?]
        at com.amazon.randomcutforest.SequentialForestTraversalExecutor.update(SequentialForestTraversalExecutor.java:37) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.AbstractForestTraversalExecutor.update(AbstractForestTraversalExecutor.java:50) ~[random-cut-forest-1.0.jar:?]
        at com.amazon.randomcutforest.RandomCutForest.update(RandomCutForest.java:291) ~[random-cut-forest-1.0.jar:?]

RCF does not consider new double[] { 0.0 } and new double[] { -0.0 } to be equal since it uses Arrays.equals to compare them. But actually 0.0 == -0.0. These arrays should be equal. This causes RandomCutTree.addPoint to access Node.isLeftOf where cut is null.

Remove old RandomCutTree from package

Since #103 created a new PointerTree class that extends from AbstractRandomCutTree. This class should replace the old RandomCutTree class.

Replace System.arraycopy() with array.clone()

Calling the clone() method on an array is more efficient than allocating a new array and calling System.arraycopy() to fill it.

https://stackoverflow.com/questions/46145826/why-clone-is-the-best-way-for-copying-arrays

getStore() method in PointStoreCoordinator should return IPointStore

Currently this method is hard-coded to return a PointStoreDouble.

Fix javadoc warnings

Run mvn javadoc:javadoc and fix any warnings that are printed.

Consider use of Optional in INode, INodeView interfaces

The leaf point, children, and parent are all attributes that may be missing for a particular node. We should consider using Optional instead of null to represent absent values.

Test coverage: tree package

Review naming conventions for tree update methods

increaseMassOfAncestorsAndItself is a little awkward
constructBoxInPlace is ambiguous, because construct makes it sound like a new allocation, but InPlace suggests that it's updating the existing box (which is true).
We have both recoompute and reCompute in different places.

Add support for single precision to state/mapper classes

Currently the point store state class has an array of doubles to store points. When single precision is enabled, this could add unnecessary bloat to serialization. We are considering two options for how to fix this:

(preferred) Change the array of doubles to an array of bytes and add a flag for precision to the state class.
Keep two arrays in the state class, one array of doubles and another array of floats, and use the applicable array during serialization.

Test coverage: executor package

Visitors should not depend on concrete classes

Visitors should only have access to interfaces (e.g., INodeView, IBoundingBoxView) and not concrete classes. Visitors should be completely data structure agnostic.

When closing this issue we should also remove the copyBoxToDouble() method from the IBoundingBox interface.

Test coverage: store package

Test failures with OpenJDK 13

Unit tests are failing when being run under OpenJDK 13 with the following error message:

[ERROR] testImputeMissingValuesWithNoMissingValues  Time elapsed: 0.001 s  <<< ERROR!
java.lang.RuntimeException: Internal error: Failed to find the "modifiers" field in method setInternalState.
	at com.amazon.randomcutforest.RandomCutForestTest.setUp(RandomCutForestTest.java:82)
Caused by: java.lang.NoSuchFieldException: modifiers
	at com.amazon.randomcutforest.RandomCutForestTest.setUp(RandomCutForestTest.java:82)

$ java --version
openjdk 13.0.2 2020-01-14
OpenJDK Runtime Environment (build 13.0.2+8)
OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode, sharing)

Reduce or eliminate use of `null` when updating tree structure

In the original pointer based tree implementation, we used null values to indicate the absence of a parent or sibling node. In the new compact implementation, we use int and short indexes instead of pointers, and we define a constant NULL to indicate absence. Mixing null and NULL makes code hard to follow and maintain.

This task is to follow up on #103 to improve code maintainability by reducing or eliminating the use of null when updating tree states.

Move exception to main thread in ParallelForestTraversalExecutor

There are two places inside ParallelForestTraversalExecutor where a lambda being executed on a threadpool can throw an exception. We should move this exception outside the lambda to the main execution thread. This should give more predictable behavior and a better stack trace.

[Blog] post contains two identical paragraphs

Hi @sudiptoguha
I din't manage to find you at linkedin, so I've decided to leverage this repo to report about the finding. This post contains two identical paragraphs: https://opendistro.github.io/for-elasticsearch/blog/odfe-updates/2019/11/random-cut-forests/

Thanks for your work!

Add support for sequence indexes for compact trees

Add support for storing sequence indexes in leaf nodes to compact trees.

Fixing the caching logic

There is a bug in the bounding box caching logic that makes it so that you get different results in the anomaly score algorithm.

Add microbenchmarks to package

Add JMH microbenchmarks to the package. Maven should build an executable jar with the benchmark code. Having the benchmarking code will allow us to measure the performance impact of other changes.

Create a new abstract base class for compact and non-compact trees

There is some code duplication between the two tree implementations that we could move to a base class.

Proposal: new directory structure to multiple language implementations

We are currently developing a second implementation of RCFs in C. We intend to support both the Java and C versions going forward, since the two implementations have different tradeoffs. This issue is to discuss how we should reorganize the project directory to accomodate both implementations.

Proposal 1: Root level language directories

c/
java/
python/

With this approach, the existing Maven project would be moved to a new java/ subdirectory, and we would add a top-level c/ directory to contain the C implementation. We could then continue this pattern for other language bindings. For example, the python/ directory could contain a Python project with bindings to the C library.

If we take this approach we'll have to create a new top-level README explaining the setup and possibly giving some information about RCF the algorithm. We'll also have to double-check that we can execute our github workflow from a subdirectory.

Proposal 2: Separate repositories

This proposal is to create a new repository for the C implementation. This approach would all both repos to conform to the standard project layout wherein the root directory for the project contains the build configuration files for your project.

By taking this approach we would lose some discoverability between the two implementations, and it would be less clear that the two implementations conform to a single "vision" for RCFs.

Also considered: Add the C library to a directory in the Maven project

benchmark/
core/
...
rcfc/

In this approach, the folder containing the C code would live in the Maven project, but it would be ignored by the Maven build. I think the main drawback here is that the C and Java versions would not appear as different-but-equally-good implementations. People who get to the repository via search might not even notice that there is a C implementation.

Does the RandomCutForest class calculate CollusiveDisplacement?

Hi all,

May I check whether the RandomCutForest class returns the CollusiveDisplacement score by default?

I have looked through the code and documentation and could not really figure out what get_anomaly_score is exactly returning. If it is returning Displacement, is it possible to get a code snippet on how to configure it to return CollusiveDisplacement instead?

Thank you!

Add support for center of mass to compact trees

[Help] Testing AnomalyScore Runner on input stream

Hi @sudiptoguha @jotok

I am currently trying to run the AnomalyScore runner using the CLI command. However, I am not able to figure out what is the exact format of the input stream/input file.

random-cut-forest-by-aws/Java/core/src/main/java/com/amazon/randomcutforest/runner/AnomalyScoreRunner.java

Line 45 in 8ea743b

System.out.println("Reading from stdin... (Ctrl-c to exit)");

Also using through CLI as mentioned here.

https://github.com/aws/random-cut-forest-by-aws/tree/master/Java#build-command-line-cli-usage

An example would thus really help in this regard.

Thanks
Prateek

Support mapping to state class without copy

Currently, all mappers copy data when creating state classes. We should give the user the option to create a state class from a reference. This will make serialization more efficient in the case where the user is willing to take responsibility for integrity of the memory.

Standardize constructors for mapper classes

Some of the mapper classes are starting to take one or two constructors, for things like optionally saving tree state or validating the heap property in a compact sampler. Proposal: all mappers have a 0-argument constructor, and we set options using setters (see RandomCutForestMapper for an example). This approach is extensible if we decide to add new options later.

Precision setting should be an enum

We are allowing the user to decide to using single or double precision floating point numbers internally. Currently this is implemented as a boolean flag. I think this setting would make more sense if we defined a enum for precision:

public enum Precision {
  SINGLE, DOUBLE;
}

RandomCutForest forest = RandomCutForest.builder()
  .precision(Precision.SINGLE)
  .build();

I think this is more self-explanatory, and it leaves the possibility to support another level of precision (e.g. BigDecimal) if think it would be useful.

deprecate master branch

The name "master" is founded on inequality. Since it is used as the name of the most current branch, it is inevitable for contributors to use it for pulling/pushing. For current or potential contributors, this may be or may have been a source of discomfort/rejection, against Code of Conduct. Neutral names such as "main" have been proposed and used.

The RFC provides reasonable arguments against this language. Github , as well as many others , are taking the actions to end the use of the term.

Add method to update executor to perform more generic tree updates

The update executor class contains one method for updating models by submitting points to them. We recently added a setting to trees to control the fraction of bounding boxes that are stored in memory. This is a setting that we might want to change at runtime. We need a way to request a tree settings change like this. The tree executor is a good candidate for this functionality, but we'll have to figure out a way to deal with different tree implementations possibly having different settings (and hence different setter methods in their interfaces).

Add test coverage for imputation and extrapolate methods

As noted in #9 , we are missing test coverage (both unit test coverage and functional test coverage) for the Random Cut Forest imputation and extrapolation methods.

Test coverage: state package

Review default forest parameters

Before pushing to Maven, let's review and update the default forest parameters. Current defaults:

Parameter Name	Type	Description	Default Value
centerOfMassEnabled	boolean	If true, then tree nodes in the forest will compute their center of mass as part of tree update operations.	false
dimensions	int	The number of dimensions in the input data.	Required, no default value
lambda	double	The decay factor (lambda value) used by stream samplers in this forest.	1e-5
numberOfTrees	int	The number of trees in this forest.	100
outputAfter	int	The number of points required by stream samplers before results are returned.	0.25 * sampleSize
parallelExecutionEnabled	boolean	If true, then the forest will create an internal threadpool. Forest updates and traversals will be submitted to this threadpool, and individual trees will be updated or traversed in parallel.	true
randomSeed	long	A seed value used to initialize the random number generators in this forest.
sampleSize	int	The sample size used by stream samplers in this forest	256
storeSequenceIndexesEnabled	boolean	If true, then sequence indexes (ordinals indicating when a point was added to a tree) will be stored in the forest along with poitn values.	false
threadPoolSize	int	The number of threads to use in the internal threadpool.	Number of available processors - 1
windowSize	int	An alternate way of specifying the lambda value. Using this parameter will set lambda to 1 / windowSize.

To discuss:

Given what we know about convergence of anomaly scores, 100 trees is probably to many for a default. Proposal: 50.
Should we change the default for parallel execution to false? From our benchmark results for 50 trees, parallel execution only consistently wins for 256 dimensional points.
The lambda value in this code was taken from an older implementation and not re-examined. Can we pick a better default? What guidance can we give to users to help them pick a meaningful value?
I like the idea of windowSize as a more user-friendly way of setting lambda, but does it have a precise mathematical interpretation that we can explain? If not, could we come up with a better choice for a user-friendly setting? For example, something like: "if you set this parameter to 100, then the probability that more than half of your in-sample points comes from the last 100 submitted points is greater than 90%".

Add SerDe tests for noncompact forest

The existing SerDe tests fail for non-compact forests. We believe the reason is that the tests aren't correctly accounting for statistical variation in the deserialized forests. This task is to review the existing tests and either fix them to handle the noncompact case or to add new tests for that case.

Review FindBugs report

Review and update README

Configure the Maven checkstyle plugin to use the suppressions file

mvn checkstyle:check currently reports over 2000 errors. We have a checkstyle-suppressions.xml configuration file, which we reference in the checkstyle configuration in pom.xml, but it looks like checkstyle is not picking it up. We were previously able to run checkstyle with this suppressions file under a different build system.

Ready for production

Hi,

We are building a multi-cloud anomaly detector application with two deployment options for our SaaS ops. We plan to use Random Cut Forest in Sagemaker when running in AWS while use this opensource when running on-premise or private cloud due to data sensitivity issues when we cannot send data to cloud through estimator APIs when invoking Sagemaker APIs.

Would you please let me know if random-cut-forest-by-aws (this repo) is ready for production and any differences from Sagemaker in terms of functionality such as explainability and real-time streaming use cases? I saw a note [in Development] in the title and hence wondering the status and when it will be production ready. Again great work on this repo.

Thanks.

Move validation of builder parameters to constructor

Currently some validation is happening in the initCompactFloat() method. We should put all validation in one place and fail fast if a user has selected an incompatible set of options.

Provide an example file for CLI usage

The existing README text is vague about the data format expected by the CLI runners. We should make this clear and provide an example.

Move cleanCopy method to forest

The AbstractForestUpdateExecutor class defines the cleanCopy method, which is responsible for mapping -0.0 to +0.0 before submitting a point to the tree (and it's possible that we could add other normalizations in the future). We should move this method into RandomCutForest, so that can call it immediately when a point is submitted to the model. In addition, we can reuse the method to normalize points in traversal methods.

TreeUpdaterBenchmark fails when sequence indexes are enabled

Running TreeUpdateBenchmark methods with sequence indexes enabled results in the exception java.lang.IllegalStateException: Error in sequence index. Inconsistency in trees in delete step.

To reproduce, checkout the benchmarking branch (or mainline, after #16 us merged) and run

% java -jar target/random-cut-forest-1.0-benchmarks.jar TreeUpdaterBenchmark\.update

This will run the update benchmark using different combinations of parameters. The benchmark fails with the above exception each time storeSequenceIndexesEnabled = true is used.

Remove the Sequential interface

The Sequential interface was introduced in an early 2.0 refactoring. At that point we thought that the input to IUpdatable could be completely generic. We're walking this back now, because the Sequential class was basically a wrapper around a pair of values, and it only existed between the update call and the tree add-point method. Since all update code paths are following the same pattern, we've decided to revert to passing sequence index as a method argument instead of passing it via a short-lived object.

Add a builder to compact tree classes

Intermittent failure in RandomCutForestSerDeTests

I'm seeing an occasional failuree in the RandomCutForestSerDeTests class. The tests usually pass on re-run. I'm creating this issue to track occurrences.

Example

[ERROR] Tests run: 16, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 22.914 s <<< FAILURE! - in com.amazon.randomcutforest.serialize.RandomCutForestSerDeTests
[ERROR] toJsonString{int, int, int, int, int, int, int}[9]  Time elapsed: 0.55 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: expected: <1.394635311942881> but was: <1.8229730367398675>
        at com.amazon.randomcutforest.serialize.RandomCutForestSerDeTests.toJsonString(RandomCutForestSerDeTests.java:75)

Move lambda and sample size fields into compact sampler state

Currently we use the lambda and sampler size are stored in the forest state. This could make the code awkward in the future if we want to create forests with heterogeneous samplers.

Remove generic type parameter from INodeView

Remove generic type parameter from INodeView.

aws / random-cut-forest-by-aws Goto Github PK

random-cut-forest-by-aws's People

Contributors

Stargazers

Watchers

Forkers

random-cut-forest-by-aws's Issues

Recommend Projects

Recommend Topics

Recommend Org