Comments (9)

Kanatoko avatar Kanatoko commented on July 17, 2024

This is normal when the cardinality of the data in a particular dimension is too low. I think a good solution is to remove the low-cardinality dimensions from the data in advance and then run XBOS.

from xbos-anomaly-detection.

clevilll avatar clevilll commented on July 17, 2024

I just checked the definition of low cardinality in the database sense, which in short means that a column contains many repeated values. But consider the following excerpt of my data, which has shape (1516385, 5):

+---+-------------+------+------------+-------------+-----------------+
| id|         Type|Length|Token_number|Encoding_type|Character_feature|
+---+-------------+------+------------+-------------+-----------------+
|  0|     Sentence|  4014|         198|        false|              136| 
|  1|    contextid|    90|           2|        false|               15|
|  2|     Sentence|   172|          11|        false|              118| 
|  3|       String|    12|           0|         true|               11| 
|  4|version-style|    16|           0|        false|               13|   
|  5|     Sentence|   339|          42|        false|              110| 
|  6|version-style|    16|           0|        false|               13|  
|  7| url_variable|    10|           2|        false|                9| 
|  8| url_variable|    10|           2|        false|                9|
|  9|     Sentence|   172|          11|        false|              117| 
| 10|    contextid|    90|           2|        false|               15| 
| 11|     Sentence|   170|          11|        false|              114|
| 12|version-style|    16|           0|        false|               13|
| 13|     Sentence|    68|          10|        false|               59|
| 14|       String|    12|           0|         true|               11|
| 15|     Sentence|   173|          11|        false|              118|
| 16|       String|    12|           0|         true|               11|
| 17|     Sentence|   132|           8|        false|               96|
| 18|       String|    12|           0|         true|               11|
| 19|    contextid|    88|           2|        false|               15|
+---+-------------+------+------------+-------------+-----------------+

Does this data count as low cardinality, meaning I can't use XBOS?


Kanatoko avatar Kanatoko commented on July 17, 2024

Cardinality is the number of unique values in the data.
Some examples:

[ 0, 0, 0, 0, 0 ] -> cardinality is 1.
[ 0, 0, 0, 1, 1 ] -> cardinality is 2.
[ 0, 0, 1, 1, 2 ] -> cardinality is 3.
[ 0, 0, 1, 2, 3 ] -> cardinality is 4.
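
The four examples above can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `cardinality` is an illustrative assumption and is not part of XBOS.

```python
def cardinality(values):
    """Return the number of distinct values in a sequence."""
    return len(set(values))


# The same four lists used in the examples above.
examples = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 2],
    [0, 0, 1, 2, 3],
]

for column in examples:
    print(column, "-> cardinality is", cardinality(column))
```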

Also, because XBOS calculates (or uses) distances, you can't use text data.
In that case, you need to drop 'Type' and 'Encoding_type'.
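
One possible way to drop the non-numeric columns is sketched below, using a plain dict of column lists and the first four rows of the table posted earlier in the thread. The helper name `numeric_columns` is hypothetical, and treating booleans as non-numeric is an assumption made here so that 'Encoding_type' is dropped as suggested.

```python
from numbers import Number


def numeric_columns(table):
    """Keep only the columns whose values are all numbers.

    Booleans are excluded on purpose (assumption), because bool is a
    subclass of int in Python and would otherwise pass the check.
    """
    kept = {}
    for name, values in table.items():
        if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
            kept[name] = values
    return kept


# First four rows of the posted table, stored column by column.
table = {
    "Type": ["Sentence", "contextid", "Sentence", "String"],
    "Length": [4014, 90, 172, 12],
    "Token_number": [198, 2, 11, 0],
    "Encoding_type": [False, False, False, True],
    "Character_feature": [136, 15, 118, 11],
}

print(list(numeric_columns(table)))
```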


Kanatoko avatar Kanatoko commented on July 17, 2024

https://tylerburleigh.com/blog/working-with-categorical-features-in-ml/

"In the context of machine learning, “cardinality” refers to the number of possible values that a feature can assume. For example, the variable “US State” is one that has 50 possible values. "


clevilll avatar clevilll commented on July 17, 2024

"Cardinality is the unique number of datas. Cardinality examples.
[ 0, 0, 0, 0, 0 ] -> cardinality is 1.
[ 0, 0, 0, 1, 1 ] -> cardinality is 2.
[ 0, 0, 1, 1, 2 ] -> cardinality is 3.
[ 0, 0, 1, 2, 3 ] -> cardinality is 4.
And, because XBOS calculates (or uses) distances, you can't use text data. In that case, you need to drop 'Type' and 'Encoding_type'."

Of course. Before using ML-based models, I converted the categorical features/dimensions/columns into numerical ones with label encoding (chosen because the data is large) and dropped the original categorical columns. Please see the last two columns after encoding:

+---+---------------+------+------------+-------------+-----------------+--------------------+------------+---------------------+
| id|           Type|Length|Token_number|Encoding_type|Character_feature|                Freq|Type_Encoded|Encoding_Type_Encoded|         
+---+---------------+------+------------+-------------+-----------------+--------------------+------------+---------------------+
|  0|  sap-contextid|    90|           2|        false|               15|                 1.0|         2.0|                  0.0|
|  1|       Sentence|   169|          11|         true|              115|  0.0434355930699323|         0.0|                  1.0|
|  2|   url_variable|    12|           2|        false|               11|  0.3768681063417741|         1.0|                  0.0|
|  3|version-setting|    11|           2|         true|               10| 0.08895918484530539|         6.0|                  1.0|
|  4|       Sentence|   722|           5|        false|              117|0.004624917132551378|         0.0|                  0.0|
+---+---------------+------+------------+-------------+-----------------+--------------------+------------+---------------------+

Here I didn't drop those categorical columns, just to showcase the encoding. With this data, XBOS didn't work with n_clusters greater than 2. Why is that, and how can I force it to use more clusters?
Another question concerns the mathematical concept behind the implementation. I'm not interested in deep math, but it would be great if you could help me understand which formula is implemented in the 2nd and 3rd functions in XBOS.py. I tried to reach you by email about this. Maybe you haven't checked your mail, or my email went to spam because I attached a picture of the assumed formula.
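
The label-encoding step described in this comment can be sketched in plain Python as follows. The function name `label_encode` and the choice to give the most frequent label the index 0.0 are assumptions made for illustration; the thread does not say which encoder or index order was actually used.

```python
from collections import Counter


def label_encode(values):
    """Map each distinct label to a float index, most frequent label first.

    The frequency-based order is an assumption; any consistent mapping
    from labels to numbers would also be a label encoding.
    """
    ranked = [label for label, _ in Counter(values).most_common()]
    mapping = {label: float(index) for index, label in enumerate(ranked)}
    return [mapping[v] for v in values], mapping


# Hypothetical sample of a categorical column.
types = ["Sentence", "String", "Sentence", "contextid", "Sentence", "String"]
encoded, mapping = label_encode(types)
print(mapping)
print(encoded)
```

Note that the resulting codes are only labels: the numeric gap between two codes carries no meaning, which may matter for a distance-based method.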


Kanatoko avatar Kanatoko commented on July 17, 2024

"How can I force it to increase it?"

You need to drop low-cardinality columns (those whose cardinality is lower than the number of clusters) before using XBOS.
For example, you cannot apply K-means clustering with k=3 to this data:
[ 0,0,0,0,1,1,1,1 ]
because its cardinality is 2, and 2 is lower than 3.
We cannot get 3 clusters from this data.
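
The condition in this example can be written as a small check. This is a sketch of a necessary condition only (k distinct clusters need at least k distinct values), and the helper name `can_form_k_clusters` is hypothetical rather than part of XBOS or any K-means library.

```python
def can_form_k_clusters(values, k):
    """Return True if the data has at least k distinct values.

    With fewer than k distinct values, k non-empty clusters with
    distinct centers cannot exist for one-dimensional data.
    """
    return len(set(values)) >= k


data = [0, 0, 0, 0, 1, 1, 1, 1]
print(can_form_k_clusters(data, 2))
print(can_form_k_clusters(data, 3))
```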

"the mathematical concept behind the implementation"

The math of XBOS is pretty easy.
This blog might help. ( Sorry it is written in Japanese )
https://www.scutum.jp/information/waf_tech_blog/2018/03/waf-blog-054.html


clevilll avatar clevilll commented on July 17, 2024

"How can I force it to increase it?"

"You need to drop low cardinality (lower than cluster size) columns before use XBOS. For example, you can not apply K-means clustering with k=3 on this data: [ 0,0,0,0,1,1,1,1 ] because cardinality is 2 and 2 is lower than 3. We can not get 3 clusters from this data."

I see your point. You mean that if my dataframe has a binary column/dimension/feature, such as Encoding_Type_Encoded in the frame above (encoded from its true/false values), it breaks the clustering and should therefore be dropped. If that is the case, the algorithm could check the cardinality of each dimension and drop the low-cardinality ones automatically, using something like:

print("No of unique values in each column :\n", df.nunique(axis=0, dropna=False)) 

In my case, the output for the above frame is:

No of unique values in each column:
 Type                  7
Length               12
Token_number          7
Character_feature    12
Encoding_type         2
Freq                  3

Here I should drop Encoding_type, with no need to encode it as Encoding_Type_Encoded at all. Then n_clusters <= 7 should run. Now it makes sense why I was getting KeyError: 7 when I set n_clusters=8: it was due to the cardinality of my first column. Can you confirm that?
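
The per-column check discussed here can be sketched without pandas as shown below. The data is hypothetical, the helper names `column_cardinalities` and `max_usable_clusters` are illustrative assumptions, and the sketch assumes there are no missing values (unlike `df.nunique(dropna=False)`, which counts NaN as one value).

```python
def column_cardinalities(table):
    """Return {column name: number of distinct values}."""
    return {name: len(set(values)) for name, values in table.items()}


def max_usable_clusters(table):
    """Largest cluster count that every column can support.

    This is the smallest cardinality across all columns.
    """
    return min(column_cardinalities(table).values())


# Hypothetical numeric table, stored column by column.
table = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [0, 1, 0, 1, 0, 1],
    "c": [1, 1, 2, 2, 3, 3],
}

print(column_cardinalities(table))
print(max_usable_clusters(table))
```

In this sketch the limit is set by the column with the fewest distinct values, so it changes when such a column is removed.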

"The math of XBOS is pretty easy. This blog might help. ( Sorry it is written in Japanese )"

Thanks, I do my best! :D Have a nice Sunday. Sunday is Funday! :)


Kanatoko avatar Kanatoko commented on July 17, 2024

"Do you confirm that?"

Yes.

But I'm sorry, I will not implement that (automatically removing features), for these reasons:

  1. This XBOS Python code is just a proof of concept. It should stay tiny.
  2. Anyone can extend it because it is open source.
  3. Dropping features automatically is not a good idea. Users might not notice that it happened.
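
For anyone extending the code on their own, one way to address the third point is to make the drop explicit, so the caller always sees which columns were removed. This is a sketch under stated assumptions: `drop_low_cardinality` is a hypothetical helper, it is not part of XBOS, and the table is made-up data.

```python
def drop_low_cardinality(table, n_clusters):
    """Split columns into kept ones and the names of dropped ones.

    A column is dropped when it has fewer distinct values than
    n_clusters. Returning the dropped names keeps the step visible.
    """
    kept, dropped = {}, []
    for name, values in table.items():
        if len(set(values)) < n_clusters:
            dropped.append(name)
        else:
            kept[name] = values
    return kept, dropped


# Hypothetical numeric table, stored column by column.
table = {
    "length": [90, 169, 12, 11, 722],
    "token_number": [2, 11, 2, 2, 5],
    "encoding_flag": [0, 1, 0, 1, 0],
}

kept, dropped = drop_low_cardinality(table, n_clusters=3)
print("kept:", list(kept))
print("dropped:", dropped)
```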


Kanatoko avatar Kanatoko commented on July 17, 2024

By the way, if you have many low cardinality features, I think that Isolation Forest is a good solution for anomaly/outlier detection.
