Comments (9)
This is expected when the cardinality of the data in a particular dimension is too low. I think a good solution is to drop the low-cardinality dimensions from the data in advance and then run XBOS.
from xbos-anomaly-detection.
I just checked the definition of low cardinality in the database sense, which in short means that a column contains many repeated values. But suppose I have data of shape (1516385, 5), of which the following is an excerpt:
+---+-------------+------+------------+-------------+-----------------+
| id| Type|Length|Token_number|Encoding_type|Character_feature|
+---+-------------+------+------------+-------------+-----------------+
| 0| Sentence| 4014| 198| false| 136|
| 1| contextid| 90| 2| false| 15|
| 2| Sentence| 172| 11| false| 118|
| 3| String| 12| 0| true| 11|
| 4|version-style| 16| 0| false| 13|
| 5| Sentence| 339| 42| false| 110|
| 6|version-style| 16| 0| false| 13|
| 7| url_variable| 10| 2| false| 9|
| 8| url_variable| 10| 2| false| 9|
| 9| Sentence| 172| 11| false| 117|
| 10| contextid| 90| 2| false| 15|
| 11| Sentence| 170| 11| false| 114|
| 12|version-style| 16| 0| false| 13|
| 13| Sentence| 68| 10| false| 59|
| 14| String| 12| 0| true| 11|
| 15| Sentence| 173| 11| false| 118|
| 16| String| 12| 0| true| 11|
| 17| Sentence| 132| 8| false| 96|
| 18| String| 12| 0| true| 11|
| 19| contextid| 88| 2| false| 15|
+---+-------------+------+------------+-------------+-----------------+
Does this data count as low cardinality, meaning I can't use XBOS?
Cardinality is the number of unique values in the data.
Some examples:
[ 0, 0, 0, 0, 0 ] -> cardinality is 1.
[ 0, 0, 0, 1, 1 ] -> cardinality is 2.
[ 0, 0, 1, 1, 2 ] -> cardinality is 3.
[ 0, 0, 1, 2, 3 ] -> cardinality is 4.
Also, because XBOS computes (or uses) distances, you can't use text data.
In your case, you need to drop 'Type' and 'Encoding_type'.
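The definition above can be checked with a few lines of plain Python. This is only an illustrative sketch; the helper name `cardinality` is made up for this example and is not part of XBOS.

```python
def cardinality(values):
    """Return the number of distinct values in a sequence."""
    return len(set(values))


# The four example columns from the comment above.
examples = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 2],
    [0, 0, 1, 2, 3],
]

for column in examples:
    print(column, "-> cardinality is", cardinality(column))
```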
https://tylerburleigh.com/blog/working-with-categorical-features-in-ml/
"In the context of machine learning, “cardinality” refers to the number of possible values that a feature can assume. For example, the variable “US State” is one that has 50 possible values. "
Of course. Before using ML-based models, I converted the categorical features (columns) into numerical ones using label encoding, which I chose because the dataset is large, and dropped the original categorical columns (please see the last two columns after encoding):
+---+---------------+------+------------+-------------+-----------------+--------------------+------------+---------------------+
| id| Type|Length|Token_number|Encoding_type|Character_feature| Freq|Type_Encoded|Encoding_Type_Encoded|
+---+---------------+------+------------+-------------+-----------------+--------------------+------------+---------------------+
| 0| sap-contextid| 90| 2| false| 15| 1.0| 2.0| 0.0|
| 1| Sentence| 169| 11| true| 115| 0.0434355930699323| 0.0| 1.0|
| 2| url_variable| 12| 2| false| 11| 0.3768681063417741| 1.0| 0.0|
| 3|version-setting| 11| 2| true| 10| 0.08895918484530539| 6.0| 1.0|
| 4| Sentence| 722| 5| false| 117|0.004624917132551378| 0.0| 0.0|
+---+---------------+------+------------+-------------+-----------------+--------------------+------------+---------------------+
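For readers unfamiliar with label encoding, here is a minimal sketch in plain Python. It assumes, which the thread does not confirm, that indices are assigned by descending frequency with ties broken alphabetically; the function name `label_encode` and the sample values are illustrative only.

```python
from collections import Counter


def label_encode(values):
    """Map each distinct string to an integer index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    mapping = {value: index for index, value in enumerate(ordered)}
    return [mapping[v] for v in values], mapping


types = ["Sentence", "contextid", "Sentence", "String", "version-style", "Sentence"]
encoded, mapping = label_encode(types)
print(encoded)
print(mapping)
```

Note that encoding does not change a column's cardinality: a true/false column still has only two distinct values after it is encoded as 0.0/1.0.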
Here I kept the categorical columns only to showcase the encoding. With this data, XBOS didn't work with n_cluster set to more than 2, which raises the question: why? How can I force it to increase it?
Another issue is the mathematical concept behind the implementation. I'm not interested in deep math, but it would be great if you could help me understand which formula is implemented in the 2nd and 3rd functions in XBOS.py.
Regarding this, I tried to reach you by email. Maybe you haven't checked your mail, or my email went to spam because I attached a picture of the formula I assumed.
How can I force it to increase it?
You need to drop low-cardinality columns (cardinality lower than the number of clusters) before using XBOS.
For example, you cannot apply K-means clustering with k=3 to this data:
[ 0,0,0,0,1,1,1,1 ]
because its cardinality is 2, which is lower than 3.
We cannot get 3 clusters from this data.
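The limit can be seen in a minimal 1-D sketch. This is plain Python, not the XBOS code, and the three centers are an arbitrary, hypothetical choice. Identical values always land in the same cluster, so data with two distinct values can fill at most two clusters, wherever the centers sit.

```python
def assign_to_nearest(values, centers):
    """Return, for each value, the index of the closest center (1-D)."""
    return [
        min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        for v in values
    ]


data = [0, 0, 0, 0, 1, 1, 1, 1]
centers = [0.0, 0.5, 1.0]  # hypothetical centers for k=3

labels = assign_to_nearest(data, centers)
non_empty = set(labels)

print("labels:", labels)
print("non-empty clusters:", len(non_empty), "of", len(centers))
```

Here cluster 1 stays empty. Code that later looks up every cluster index, for example in a dictionary of per-cluster counts, could then hit a missing key; whether that is how XBOS.py fails has not been checked here.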
the mathematical concept behind the implementation
The math of XBOS is pretty easy.
This blog might help. ( Sorry it is written in Japanese )
https://www.scutum.jp/information/waf_tech_blog/2018/03/waf-blog-054.html
I see your point now. You mean that a binary column in my dataframe, such as Encoding_Type_Encoded in the frame above (which was encoded from true/false values), breaks the clustering and should therefore be dropped. If that is the case, the algorithm could be extended to check the cardinality of each dimension and drop such columns automatically, using something like:
print("No of unique values in each column :\n", df.nunique(axis=0, dropna=False))
In my case, for the frame above, the output is:
No of unique values in each column:
Type 7
Length 12
Token_number 7
Character_feature 12
Encoding_type 2
Freq 3
So here I should drop Encoding_type, and there is no need even to encode it as Encoding_Type_Encoded. Then any n_clusters <= 7 should run. Now it makes sense why I was getting KeyError: 7 when I set n_clusters=8: it was due to the cardinality of my first column. Do you confirm that?
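Under the rule stated earlier in the thread (a column's cardinality must not be lower than the number of clusters), the upper bound on n_clusters is the smallest cardinality among the columns actually passed to XBOS. The sketch below applies that rule to the counts printed above; the helper `max_n_clusters` is hypothetical, and the thread does not say whether Freq is among the columns used, so both cases are shown.

```python
# Unique-value counts as printed in the thread.
unique_counts = {
    "Type": 7,
    "Length": 12,
    "Token_number": 7,
    "Character_feature": 12,
    "Encoding_type": 2,
    "Freq": 3,
}


def max_n_clusters(counts, features):
    """Largest n_clusters that does not exceed any selected feature's cardinality."""
    return min(counts[name] for name in features)


without_encoding = [n for n in unique_counts if n != "Encoding_type"]
without_encoding_and_freq = [n for n in without_encoding if n != "Freq"]

print(max_n_clusters(unique_counts, without_encoding))           # Freq still included
print(max_n_clusters(unique_counts, without_encoding_and_freq))  # Freq excluded
```

With these numbers, the bound is 7 only if Freq (cardinality 3) is also left out; if Freq is included, the bound is 3.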
The math of XBOS is pretty easy. This blog might help. ( Sorry it is written in Japanese )
Thanks, I'll do my best! :D Have a nice Sunday. Sunday is Funday! :)
Do you confirm that?
Yes.
But I'm sorry, I will not implement that (automatically removing features), because:
- This XBOS Python code is just a proof of concept. It should stay tiny.
- Anyone can extend it because it is open source.
- Dropping features automatically is not a good idea. Users might not notice that features were removed.
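For anyone extending the code themselves, one way to address the last concern is to make the step explicit and visible rather than silent. The following is a sketch only, in plain Python with columns held as a dict of lists; `split_by_cardinality` is a hypothetical helper, not part of XBOS, and the sample values are taken from the encoded frame shown earlier in the thread.

```python
def split_by_cardinality(columns, n_clusters):
    """Split column names into (kept, dropped) by comparing cardinality to n_clusters."""
    kept, dropped = [], []
    for name, values in columns.items():
        if len(set(values)) >= n_clusters:
            kept.append(name)
        else:
            dropped.append(name)
    return kept, dropped


columns = {
    "Length": [90, 169, 12, 11, 722],
    "Token_number": [2, 11, 2, 2, 5],
    "Encoding_Type_Encoded": [0.0, 1.0, 0.0, 1.0, 0.0],
}

kept, dropped = split_by_cardinality(columns, n_clusters=3)
print("kept:", kept)
print("dropped (cardinality below n_clusters):", dropped)
```

Returning the dropped names lets the caller log or review them before running the detector, so nothing is removed without notice.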
By the way, if you have many low-cardinality features, I think Isolation Forest is a good solution for anomaly/outlier detection.