Comments (5)
Hi @drakon47, thanks for the great question. I have also had fairly inconsistent results with Isolation Forest in the past.
Isolation Forest is an algorithm that works on the principle of randomness. An Isolation Tree is a completely randomized decision tree - refer to the split method on the Isolator node. The logic is that anomalies will get separated into their own Cell (leaf) node earlier in the growing process and thus receive a higher isolation score.
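To illustrate the idea, here's a minimal plain-Python sketch (not the Rubix ML code) of how a single isolation tree isolates one point with random axis-aligned splits:

```python
import random

def isolation_depth(point, data, depth=0, rng=random):
    """Depth at which `point` is separated from the rest of `data` by
    random axis-aligned splits. Anomalies tend to be isolated after
    only a few splits, so a shallow depth suggests an outlier."""
    if len(data) <= 1:
        return depth
    dim = rng.randrange(len(point))               # pick a random feature
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                                  # can't split on this axis (simplification)
        return depth
    split = rng.uniform(lo, hi)                   # pick a random split value
    # recurse into whichever side of the split contains our point
    side = [row for row in data if (row[dim] < split) == (point[dim] < split)]
    return isolation_depth(point, side, depth + 1, rng)
```

Averaging this depth over many such trees, and normalizing by the expected depth, is what produces the isolation score.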
In addition, Isolation Forest, much like Random Forest, seeks to reduce the variance of an ensemble of estimators by training each learner on a randomized subset of the training set (called a bootstrap set) and then averaging their predictions.
So it is perfectly normal to see results that vary due to the stochastic nature of the algorithm. To reduce some of the variance you can try adding more estimators to the ensemble. Try 1000 or more. Having said that, it's also possible that I messed something up somewhere in the algorithm.
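To see why more estimators help, here's a toy stdlib-only simulation (not Rubix ML code) where each "tree" reports a noisy depth estimate for the same point; averaging more of them tightens the spread:

```python
import random
import statistics

def ensemble_score(n_trees, rng):
    # pretend each tree reports a noisy depth estimate around a true value of 5.0
    return statistics.mean(rng.gauss(5.0, 2.0) for _ in range(n_trees))

rng = random.Random(0)
spread_10 = statistics.stdev(ensemble_score(10, rng) for _ in range(200))
spread_1000 = statistics.stdev(ensemble_score(1000, rng) for _ in range(200))
# the 1000-tree spread is much smaller: stdev of the mean shrinks as 1/sqrt(m)
```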
How big is your dataset?
Are you able to cross validate with ground truth labels? If so, is the accuracy good or bad?
What kind of features are you using?
Thank you, Andrew, for your response. I kept playing around with it and I have additional information that might be useful:
- I decided to try a different dataset, because my original dataset was very homogeneous and I suspected that could be the problem.
- I ran the same experiments with the classic diabetes dataset and got the same results. To test the consistency of the anomaly detection, I ran the same Isolation Forest (1000 trees) multiple times on the same dataset and then compared how many records were identified as anomalies across all the runs.
- Results are very variable. For example, on the diabetes dataset I ran it 3 times. The set has 800 records. The algorithm isolates around 70 items each time (that part does not vary), but those 70 items are mostly different on each run. After running the Isolation Forest 3 separate times on the same dataset, only one record shows up as an anomaly in all 3 runs; most of them show up as an anomaly in only one run.
- I understand there can be variability, but is it right to have that much? Every time I run it, the algorithm returns a completely different set of anomalies.
- So, I decided to run the exact same experiment in Python, using the Isolation Forest implemented in scikit-learn: the exact same dataset, running the IF 3 times, then comparing the results. Most of the data points are repeated across the 3 runs; they are consistently identified as anomalies every time. There are a few differences, but very few (less than 5% of all detected anomalies).
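For anyone who wants to reproduce this comparison, a small hypothetical helper like the following (plain Python, not part of either library) can quantify the agreement between runs:

```python
def run_overlap(runs):
    """Fraction of all flagged records that were flagged in every run.
    `runs` is a list of sets of record indices marked as anomalies."""
    common = set.intersection(*runs)
    union = set.union(*runs)
    return len(common) / len(union)

# e.g. three runs that agree on records 1, 2, 3 but disagree on the rest
runs = [{1, 2, 3, 70}, {1, 2, 3, 55}, {1, 2, 3, 12}]
print(run_overlap(runs))  # 3 shared out of 6 distinct records -> 0.5
```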
I hope my thoughts help. Sorry for my English.
Thanks for the helpful intelligence @drakon47
Both the Rubix ML and Scikit-learn implementations of Isolation Forest are based on the two original papers by Liu et al. The Rubix ML version, however, implements extended IF as described in the Garchery et al. paper for the addition of categorical feature support. They SHOULD behave very similarly for continuous features, but your experiments suggest otherwise.
I took another look at my work and indeed I made a mistake in the isolation score calculation. Specifically, the calculation is supposed to divide by the total expected depth of an average search for the whole ensemble. See this commit bc72001#diff-08b1873f5aa259f041b4b08edecaee46L239
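For reference, this is the normalization from the original Liu et al. paper, sketched in plain Python (not the Rubix ML source): the average isolation depth is divided by c(n), the expected path length of an unsuccessful binary search over n samples, and the score is 2 raised to the negative of that ratio.

```python
import math

EULER = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Expected path length of an unsuccessful BST search over n samples;
    this is the normalizer for average isolation depths."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER) - 2.0 * (n - 1) / n

def anomaly_score(avg_depth, n):
    """Score near 1 -> likely anomaly; score well below 0.5 -> likely normal."""
    return 2.0 ** (-avg_depth / c(n))
```

A point whose average depth equals c(n) scores exactly 0.5, which is why getting the divisor wrong skews every score.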
Having said that, we may not be finished just yet - although I would like you to composer install the latest dev-master to see if the latest commit helped.
How many features were in your original dataset? The Diabetes set has 20 it looks like ... Are you using just continuous features, or categorical ones as well?
I'm wondering if we should implement max_features per learner like the Scikit-learn implementation does. My understanding is that in practice Isolation Forest works best with a constrained set of features.
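A sketch of what per-learner max_features might look like (hypothetical, stdlib Python, not an existing Rubix ML API - just the feature-subsampling idea similar in spirit to scikit-learn's parameter):

```python
import random

def feature_subsample(n_features, max_features, rng=random):
    """Column indices a single learner in the ensemble would train on."""
    k = min(max_features, n_features)
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(7)
# e.g. give each of 4 learners 3 of the 9 diabetes features
subsets = [feature_subsample(9, 3, rng) for _ in range(4)]
```

Each learner seeing a different random slice of the columns further decorrelates the trees, which is the same motivation as subsampling the rows.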
Excellent job! It is definitely behaving very differently now. I installed the new version and played around with both datasets, and the results are now consistent with the scikit-learn comparison.
- The diabetes dataset (I used a smaller, 9-feature version: https://idealsur.com/borrar/diabetes.csv) run with 500 trees gives back a very consistent set of data every time. I compared 5 runs of it and 95% of the data shows up in all runs.
- The other dataset behaves similarly; the percentage is even higher.
- I haven't played around with max_features yet, but I will try it and compare so I can give you more feedback.
I love this PHP implementation of all the ML algorithms. I see great potential in it. Good work!
@drakon47 I'm glad that commit solved the issue, thanks again for the great bug report
Keep us up to date with your experience so far and don't hesitate to get involved if you find something else that isn't working right
I have an idea in my head as to how we'd implement max_features, so let us know if that feature would be useful - I'd be happy to work with you if you felt that was something you'd like to contribute to the project
I'm also curious to know the speed difference between the Rubix ML and the Scikit-learn implementations
Thanks again for the great work, feedback, and for your help in bringing quality ML tools to the PHP language