During my experiments I noticed that there is a limit for number of clusters

Problem with number of clusters to execute the XBOS (xbos-anomaly-detection issue, open, 9 comments)

clevilll commented on July 17, 2024

Problem with number of clusters to execute the XBOS

from xbos-anomaly-detection.

Comments (9)

Kanatoko commented on July 17, 2024

It is normal for this to happen when the cardinality of the data in a particular dimension is too low. I think a good solution is to remove the low-cardinality dimensions from the data in advance and then run XBOS.


clevilll commented on July 17, 2024

I just checked the definition of low cardinality in the database context, which in short means that a column contains many repeated values. But suppose I have data of shape (1516385, 5); here is an excerpt:

+---+-------------+------+------------+-------------+-----------------+
| id|         Type|Length|Token_number|Encoding_type|Character_feature|
+---+-------------+------+------------+-------------+-----------------+
|  0|     Sentence|  4014|         198|        false|              136| 
|  1|    contextid|    90|           2|        false|               15|
|  2|     Sentence|   172|          11|        false|              118| 
|  3|       String|    12|           0|         true|               11| 
|  4|version-style|    16|           0|        false|               13|   
|  5|     Sentence|   339|          42|        false|              110| 
|  6|version-style|    16|           0|        false|               13|  
|  7| url_variable|    10|           2|        false|                9| 
|  8| url_variable|    10|           2|        false|                9|
|  9|     Sentence|   172|          11|        false|              117| 
| 10|    contextid|    90|           2|        false|               15| 
| 11|     Sentence|   170|          11|        false|              114|
| 12|version-style|    16|           0|        false|               13|
| 13|     Sentence|    68|          10|        false|               59|
| 14|       String|    12|           0|         true|               11|
| 15|     Sentence|   173|          11|        false|              118|
| 16|       String|    12|           0|         true|               11|
| 17|     Sentence|   132|           8|        false|               96|
| 18|       String|    12|           0|         true|               11|
| 19|    contextid|    88|           2|        false|               15|
+---+-------------+------+------------+-------------+-----------------+

Does this data count as low cardinality, meaning I can't use XBOS?


Kanatoko commented on July 17, 2024

Cardinality is the number of unique values in the data.
Cardinality examples:

[ 0, 0, 0, 0, 0 ] -> cardinality is 1.
[ 0, 0, 0, 1, 1 ] -> cardinality is 2.
[ 0, 0, 1, 1, 2 ] -> cardinality is 3.
[ 0, 0, 1, 2, 3 ] -> cardinality is 4.
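
As a minimal sketch of the examples above, cardinality can be computed in Python as the size of the set of values. The helper name `cardinality` is only illustrative and is not part of XBOS.

```python
def cardinality(values):
    """Return the number of distinct values in a sequence."""
    return len(set(values))


# The four example lists from the comment above.
examples = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 2],
    [0, 0, 1, 2, 3],
]

for column in examples:
    print(column, "-> cardinality is", cardinality(column))
```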

Also, because XBOS calculates (or uses) distances, you can't use text data.
In that case, you need to drop 'Type' and 'Encoding_type'.


Kanatoko commented on July 17, 2024

https://tylerburleigh.com/blog/working-with-categorical-features-in-ml/

"In the context of machine learning, “cardinality” refers to the number of possible values that a feature can assume. For example, the variable “US State” is one that has 50 possible values. "


clevilll commented on July 17, 2024

> Cardinality is the number of unique values in the data. Cardinality examples:
>
> [ 0, 0, 0, 0, 0 ] -> cardinality is 1.
> [ 0, 0, 0, 1, 1 ] -> cardinality is 2.
> [ 0, 0, 1, 1, 2 ] -> cardinality is 3.
> [ 0, 0, 1, 2, 3 ] -> cardinality is 4.
>
> Also, because XBOS calculates (or uses) distances, you can't use text data. In that case, you need to drop 'Type' and 'Encoding_type'.

Of course. Before using ML-based models, I converted the categorical features/dimensions/columns into numerical ones using label encoding (chosen because the data is large) and dropped the original categorical columns (please see the last two columns after encoding):

+---+---------------+------+------------+-------------+-----------------+--------------------+------------+---------------------+
| id|           Type|Length|Token_number|Encoding_type|Character_feature|                Freq|Type_Encoded|Encoding_Type_Encoded|         
+---+---------------+------+------------+-------------+-----------------+--------------------+------------+---------------------+
|  0|  sap-contextid|    90|           2|        false|               15|                 1.0|         2.0|                  0.0|
|  1|       Sentence|   169|          11|         true|              115|  0.0434355930699323|         0.0|                  1.0|
|  2|   url_variable|    12|           2|        false|               11|  0.3768681063417741|         1.0|                  0.0|
|  3|version-setting|    11|           2|         true|               10| 0.08895918484530539|         6.0|                  1.0|
|  4|       Sentence|   722|           5|        false|              117|0.004624917132551378|         0.0|                  0.0|
+---+---------------+------+------------+-------------+-----------------+--------------------+------------+---------------------+

Here I didn't drop those categorical columns, just to showcase the encoding. With this data, XBOS didn't work with n_clusters greater than 2. Why is that? How can I force it to increase it?
Another issue is the mathematical concept behind the implementation. I'm not interested in deep math, but it would be great if you could help me understand which formula is implemented in the 2nd and 3rd functions in XBOS.py. I tried to reach you by email about this; maybe you haven't checked your mail, or my email went to spam because I attached a picture of the assumed formula.


Kanatoko commented on July 17, 2024

> How can I force it to increase it?

You need to drop low-cardinality columns (cardinality lower than the number of clusters) before using XBOS.
For example, you cannot apply K-means clustering with k=3 to this data:
[ 0,0,0,0,1,1,1,1 ]
because its cardinality is 2, and 2 is lower than 3.
We cannot get 3 clusters from this data.
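
A rough sketch of why this happens, assuming a tiny hand-written 1-D k-means (this is not the clustering code XBOS uses, and `kmeans_1d` is a hypothetical helper): identical values always land in the same cluster, so data with two distinct values can fill at most two clusters.

```python
def kmeans_1d(values, k, iterations=10):
    """Tiny 1-D Lloyd's algorithm; returns one cluster label per value."""
    lo, hi = min(values), max(values)
    # Spread the k initial centroids evenly over the data range (arbitrary choice).
    if k > 1:
        centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    else:
        centroids = [lo]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: move each non-empty cluster's centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels


data = [0, 0, 0, 0, 1, 1, 1, 1]
labels = kmeans_1d(data, k=3)
non_empty = len(set(labels))
print("requested clusters: 3, non-empty clusters:", non_empty)
```

One of the three requested clusters stays empty here, which is the situation the comment above describes.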

> the mathematical concept behind the implementation

The math of XBOS is pretty easy.
This blog might help. ( Sorry it is written in Japanese )
https://www.scutum.jp/information/waf_tech_blog/2018/03/waf-blog-054.html


clevilll commented on July 17, 2024

> How can I force it to increase it?
>
> You need to drop low-cardinality columns (cardinality lower than the number of clusters) before using XBOS. For example, you cannot apply K-means clustering with k=3 to this data: [ 0,0,0,0,1,1,1,1 ] because its cardinality is 2, and 2 is lower than 3. We cannot get 3 clusters from this data.

I see your point. You mean that if my dataframe has a column/dimension/feature with binary values, e.g. Encoding_Type_Encoded in the frame above (encoded from its true/false values), that column breaks the clustering and should therefore be dropped. If that is the case, the algorithm could be extended to check the cardinality of each dimension and drop the low-cardinality ones automatically, for example starting from:

print("No of unique values in each column :\n", df.nunique(axis=0, dropna=False))

which in my case, for the frame above, gives:

No of unique values in each column:
 Type                  7
Length               12
Token_number          7
Character_feature    12
Encoding_type         2
Freq                  3

So here I should drop Encoding_type, and there is no need to encode it as Encoding_Type_Encoded at all. Then n_clusters <= 7 should run. Now it makes sense why I was getting KeyError: 7 when I set n_clusters=8: it was due to the cardinality of my first column (Type, 7 unique values). Do you confirm that?

> The math of XBOS is pretty easy. This blog might help. ( Sorry it is written in Japanese )

Thanks, I do my best! :D Have a nice Sunday. Sunday is Funday! :)


Kanatoko commented on July 17, 2024

> Do you confirm that?

Yes.

But I'm sorry, I will not implement that (automatic removal of features), for these reasons:

This XBOS Python code is just a proof of concept. It should stay tiny.
Anyone can extend it because it is open source.
Dropping features automatically is not a good idea. Users might not notice that features were removed.
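
For anyone who wants this check in their own pipeline rather than inside XBOS, here is a minimal pandas sketch. `low_cardinality_columns` is a hypothetical helper, the toy frame is made up for illustration, and the removal is printed so it does not go unnoticed.

```python
import pandas as pd


def low_cardinality_columns(df, n_clusters):
    """Return the names of columns with fewer distinct values than n_clusters."""
    counts = df.nunique(axis=0, dropna=False)
    return counts[counts < n_clusters].index.tolist()


# Hypothetical toy frame, loosely modelled on the columns in this thread.
df = pd.DataFrame({
    "Length": [90, 169, 12, 11, 722, 16],
    "Token_number": [2, 11, 2, 2, 5, 0],
    "Encoding_Type_Encoded": [0.0, 1.0, 0.0, 1.0, 0.0, 0.0],
})

n_clusters = 3
to_drop = low_cardinality_columns(df, n_clusters)
print("Dropping low-cardinality columns:", to_drop)  # make the removal visible
reduced = df.drop(columns=to_drop)
```

The reduced frame can then be passed to XBOS with the same `n_clusters`; whether that is appropriate depends on how much information the dropped columns carried.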


Kanatoko commented on July 17, 2024

By the way, if you have many low cardinality features, I think that Isolation Forest is a good solution for anomaly/outlier detection.
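
A minimal sketch of that suggestion, assuming scikit-learn's `IsolationForest` and made-up toy data with two low-cardinality columns (not the data from this thread). The parameter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy data: two low-cardinality integer features (hypothetical).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 2, size=200),  # binary feature, cardinality 2
    rng.integers(0, 3, size=200),  # feature with cardinality 3
])

model = IsolationForest(n_estimators=100, random_state=0)
labels = model.fit_predict(X)    # 1 = inlier, -1 = outlier
scores = model.score_samples(X)  # lower score = more anomalous

print("rows flagged as outliers:", int((labels == -1).sum()))
```

Because Isolation Forest isolates points with random splits instead of clustering each dimension, it has no requirement that a column's cardinality reach some cluster count.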
