
Regarding isolation forest (rubixml/ml) · 5 comments · closed

rubixml commented on May 12, 2024
Regarding isolation forest


Comments (5)

andrewdalpino commented on May 12, 2024

Hi @drakon47, thanks for the great question. I have also had fairly dissimilar results with Isolation Forest in the past.

Isolation Forest is an algorithm that works on the principle of randomness. An Isolation Tree is a completely randomized decision tree - refer to the split method on the Isolator node. The logic being that anomalies will get separated into their own Cell (leaf) node earlier in the grow process and thus receive a higher isolation score.

In addition, Isolation Forest, in much the same way as Random Forest, seeks to minimize the variance of an ensemble of estimators by training each learner on a randomized subset of the training set (called a bootstrap set) and then averaging their predictions.

So it is perfectly normal to encounter results that vary from run to run due to the stochastic nature of the algorithm. To remove some of the variance, you can try adding additional estimators to the ensemble. Try 1000 or more. Having said that, it's also possible that I messed something up somewhere in the algorithm.
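
To make the isolation idea above concrete, here is a minimal Python sketch (not the Rubix ML Isolator code itself; the helper names are made up for illustration): each "tree" repeatedly picks a random feature and a random split value, and the number of splits needed to isolate an obvious outlier is, on average, much smaller than for a point inside a dense cluster. Averaging over many such trees, like adding estimators to the ensemble, stabilizes that estimate.

```python
import random

def isolation_path_length(point, data, depth=0, max_depth=50):
    """Count the purely random splits needed to isolate `point` from `data`.

    Shorter paths indicate points that are easier to separate, i.e. anomalies.
    """
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Pick a random feature and a random split value within its range,
    # mirroring a completely randomized decision tree node.
    feature = random.randrange(len(point))
    values = [row[feature] for row in data]
    lo, hi = min(values), max(values)
    if lo == hi:
        return depth  # simplified: stop if the chosen feature has no spread
    split = random.uniform(lo, hi)
    # Keep only the side of the split that still contains `point`.
    same_side = [row for row in data
                 if (row[feature] < split) == (point[feature] < split)]
    return isolation_path_length(point, same_side, depth + 1, max_depth)

# A tight cluster plus one obvious outlier.
cluster = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
outlier = [8.0, 8.0]
data = cluster + [outlier]

def average_path(point, trees=300):
    # Averaging over many random trees, like averaging many estimators,
    # reduces the variance of the path-length estimate.
    return sum(isolation_path_length(point, data) for _ in range(trees)) / trees

print("average path, typical point:", average_path(cluster[0]))
print("average path, outlier:      ", average_path(outlier))
```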

How big is your dataset?

Are you able to cross validate with ground truth labels? If so, is the accuracy good or bad?

What kind of features are you using?


drakon47 commented on May 12, 2024

Thank you, Andrew, for your response. I kept playing around with it and I have additional information that might be useful:
-I decided to try a different dataset, because mine was very homogeneous and I suspected that could be the problem.
-I ran the same experiments with the classic diabetes dataset, but I get the same results. To test the accuracy of anomaly detection, I ran the same Isolation Forest (1000 trees) multiple times on the same dataset and then compared how many records were identified as anomalies across all the runs.
-The results are very variable. For example, on the diabetes dataset I ran it 3 times. The set has 800 records. The algorithm isolates around 70 items each time (this count does not vary), but those 70 items are mostly different on each run. For example: after running the Isolation Forest 3 times separately on the same dataset, only one record shows up as an anomaly in all 3 runs. Most of them show up as an anomaly in only one of the runs.
-I understand there can be variability, but is it right to have that much? Every time I run it, the algorithm returns a completely different set of anomalies.
-So I decided to run the exact same experiment in Python, using the Isolation Forest implemented in sklearn: exact same dataset, ran the IF 3 times, then compared results. Most of the data points are repeated across the 3 runs; they are consistently identified as anomalies on all runs. There are a few differences, but very few (less than 5% of all detected anomalies). (A sketch of this comparison appears below.)
I hope my thoughts help. Sorry for my English.
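
Not drakon47's actual script, but a minimal sketch of the run-to-run comparison described above, written with scikit-learn. The data is a synthetic stand-in for the ~800-row diabetes table, and contamination=0.09 is only an assumption chosen so that roughly 70 rows are flagged per run; different random seeds stand in for the separate runs.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the ~800-row table: a dense blob of inliers
# plus a small fraction of injected outliers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(760, 9)),
               rng.uniform(-6.0, 6.0, size=(40, 9))])

def flagged(seed):
    """Return the set of row indices flagged as anomalies by one run."""
    forest = IsolationForest(n_estimators=1000, contamination=0.09,
                             random_state=seed)
    labels = forest.fit_predict(X)          # -1 = anomaly, 1 = inlier
    return set(np.flatnonzero(labels == -1))

runs = [flagged(seed) for seed in (1, 2, 3)]
common = set.intersection(*runs)
union = set.union(*runs)
print("flagged per run:", [len(r) for r in runs])
print(f"in all 3 runs: {len(common)} of {len(union)} "
      f"({100 * len(common) / len(union):.0f}% overlap)")
```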


andrewdalpino commented on May 12, 2024

Thanks for the helpful intelligence @drakon47

Both the Rubix ML and Scikit-learn implementations of Isolation Forest are based on the two original papers by Liu et al. The Rubix ML version, however, implements extended IF as described in the Garchery et al. paper, which adds categorical feature support. They SHOULD behave very similarly for continuous features, but your experiments are not confirming this.

I took another look at my work, and indeed I made a mistake in the isolation score calculation. Specifically, the calculation is supposed to divide by the total expected depth of an average search for the whole ensemble. See this commit bc72001#diff-08b1873f5aa259f041b4b08edecaee46L239
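
For reference, the normalization being fixed is the one from the anomaly score in the original Liu et al. paper: the expected path length E[h(x)] is divided by c(n), the average path length of an unsuccessful binary search tree lookup over n samples, so that scores fall between 0 and 1 with values near 1 indicating anomalies:

$$ s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}, \qquad c(n) = 2H(n - 1) - \frac{2(n - 1)}{n}, \qquad H(i) \approx \ln(i) + 0.5772156649 $$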

That said, we may not be finished just yet, although I would like you to composer install the latest dev-master to see whether the latest commit helped.

How many features were in your original dataset? The Diabetes dataset has 20, it looks like ... Are you using just continuous features, or categorical as well?

I'm wondering if we should implement max_features per learner like the Scikit-learn implementation does. I understand that in practice Isolation Forest works best with a constrained set of features.
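
For comparison, this is roughly how the Scikit-learn parameter is used; the Rubix ML equivalent is only an idea at this point, so how it would map onto the Rubix ML API is left open here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# max_features limits how many columns each tree is allowed to split on;
# bootstrap=True fits each tree on a resampled subset of the rows.
forest = IsolationForest(
    n_estimators=500,
    max_features=0.5,   # each tree draws a random half of the features
    bootstrap=True,
    random_state=42,
)

X = np.random.default_rng(0).normal(size=(800, 9))
print(forest.fit_predict(X)[:10])   # -1 marks rows scored as anomalies
```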


drakon47 commented on May 12, 2024

Excellent job! It is definitely behaving very differently now. I installed the new version and played around with both datasets, and the results are consistent with the scikit-learn comparison.
-The diabetes dataset (I used a smaller, 9-feature version: https://idealsur.com/borrar/diabetes.csv) run with 500 trees gives back a very consistent set of anomalies every time. I compared 5 runs of it and 95% of the data shows up on all runs.
-The other dataset behaves similarly; the percentage is even higher.
I haven't played around with max_features yet, but I will try it and compare so I can give you more feedback.
I love this PHP implementation of all the ML algorithms. I see great potential in it. Good work!


andrewdalpino commented on May 12, 2024

@drakon47 I'm glad that commit solved the issue, thanks again for the great bug report

Keep us up to date with your experience so far and don't hesitate to get involved if you find something else that isn't working right

I have an idea in my head as to how we'd implement max_features, so let us know if that feature would be useful - I'd be happy to work with you if you felt that was something you'd like to contribute to the project

I'm also curious to know the speed difference between the Rubix ML and Scikit-learn implementations

Thanks again for the great work, feedback, and for your help in bringing quality ML tools to the PHP language
