kanatoko / xbos-anomaly-detection


XBOS Anomaly Detection

Python 100.00%


xbos-anomaly-detection's Issues

Problem in spark based setup like MS Azure Databricks & its solution

I'm experimenting with XBOS. I managed to run it successfully in a Google Colab notebook, but not in the MS Azure Databricks (DB) environment, due to the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<command-1597574348978809> in <module>
      6 
      7 xbos = XBOS(n_clusters=2, max_iter=1)
----> 8 result = xbos.fit_predict(df)
      9 
     10 #for i in result:

<command-1597574348979803> in fit_predict(self, data)
     47 
     48     def fit_predict(self,data):
---> 49         self.fit(data)
     50         return self.predict(data)

<command-1597574348979803> in fit(self, data)
     25                 for k in range(self.n_clusters):
     26                     if i != k:
---> 27                         dist = abs(kmeans.cluster_centers_[i] - kmeans.cluster_centers_[k])/max_distance
     28                         effect = ratio[k]*(1/pow(self.effectiveness,dist))
     29                         cluster_score[i] = cluster_score[i]+effect

/databricks/spark/python/pyspark/sql/functions.py in abs(col)
    151     Computes the absolute value.
    152     """
--> 153     return _invoke_function_over_column("abs", col)
    154 
    155 

/databricks/spark/python/pyspark/sql/functions.py in _invoke_function_over_column(name, col)
     65     and wraps the result with :class:`~pyspark.sql.Column`.
     66     """
---> 67     return _invoke_function(name, _to_java_column(col))
     68 
     69 

/databricks/spark/python/pyspark/sql/column.py in _to_java_column(col)
     44         jcol = _create_column_from_name(col)
     45     else:
---> 46         raise TypeError(
     47             "Invalid argument, not a string or column: "
     48             "{0} of type {1}. "

TypeError: Invalid argument, not a string or column: [-5.3974359] of type <class 'numpy.ndarray'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

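After I double-checked and adapted all the needed libraries and dependencies, I realized that in the Spark-based setup the call to abs is resolved to pyspark.sql.functions.abs (see the traceback), which expects a Spark column rather than a numpy.ndarray, and this results in the TypeError.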

Solution: As the error traceback shows (line 27), replace dist = abs(kmeans..........)/max_distance with dist = np.abs(kmeans..........)/max_distance, so that the DB setup handles <class 'numpy.ndarray'> correctly.
PS: np refers to numpy (import numpy as np), which is already imported in the implementation.
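
A minimal sketch of the suggested change, assuming the cluster centers are a NumPy array of shape (n_clusters, 1) as returned by scikit-learn's KMeans. The `centers` values and the helper name `center_distance` are made up for illustration and are not part of xbos.py. Calling `np.abs` by its qualified name means the result does not depend on what the bare name `abs` happens to point to in the notebook.

```python
import numpy as np

# Hypothetical stand-in for kmeans.cluster_centers_ (shape: n_clusters x 1).
centers = np.array([[1.5], [6.9], [12.0]])
max_distance = centers.max() - centers.min()

def center_distance(centers, i, k, max_distance):
    """Normalized distance between centers i and k.

    np.abs is referenced through the numpy module, so a notebook-level
    name `abs` that points somewhere else cannot interfere.
    """
    return np.abs(centers[i] - centers[k]) / max_distance

dist = center_distance(centers, 0, 1, max_distance)
print(dist)
```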

Error occurs when no feature assign to some clusters

XBOS-anomaly-detection/xbos.py

Line 26 in 41ee780

cluster_score=assign.groupby('cluster').apply(len).apply(lambda x:x/length)

An error occurs when no data point is assigned to one of the clusters.
kmeans.predict doesn't guarantee that every cluster receives at least one point.
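
One possible way to make the per-cluster ratio robust against empty clusters is sketched below. It assumes the cluster labels are integers in `range(n_clusters)`; the helper name `cluster_ratios` is hypothetical and this is not the repository's code. `reindex` with `fill_value=0` adds an explicit zero entry for every cluster that received no points, so a later lookup by cluster index does not fail.

```python
import pandas as pd

def cluster_ratios(labels, n_clusters):
    """Share of points per cluster, with 0 for clusters that got no points."""
    counts = pd.Series(labels).value_counts()
    # Make sure every cluster id 0..n_clusters-1 is present in the index.
    counts = counts.reindex(range(n_clusters), fill_value=0)
    return counts / len(labels)

# Clusters 1 and 3 are empty in this made-up assignment.
ratios = cluster_ratios([0, 0, 2, 2, 2], n_clusters=4)
print(ratios)
```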

Problem with number of clusters to execute the XBOS

During my experiments I noticed that there is a limit on the number of clusters; exceeding it results in the following error, both on big data and on the sample you provided:

My data has shape (1516385, 8). When I run XBOS with the defaults, xbos = XBOS() (that is, n_clusters=15, effectiveness=500, max_iter=2), I get KeyError: 0, with the following message alongside the error traceback:

/content/xbos.py:25: ConvergenceWarning: Number of distinct clusters (7) found smaller than n_clusters (8). Possibly due to duplicate points in X.
  kmeans.fit(data[column].values.reshape(-1,1))

So in the end, XBOS can be executed on my data only with two clusters (n_clusters=2), which doesn't make sense. Even when I tested the simple dataset you provided here with n_clusters=8, it threw a similar KeyError, mostly KeyError: 7:

/content/xbos.py:25: ConvergenceWarning: Number of distinct clusters (7) found smaller than n_clusters (8). Possibly due to duplicate points in X.
  kmeans.fit(data[column].values.reshape(-1,1))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897             try:
-> 2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 7

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
5 frames
<ipython-input-9-6c63ff2c8311> in <module>()
      6 
      7 xbos = XBOS(n_clusters=8, max_iter=1)
----> 8 result = xbos.fit_predict(dff)
      9 #for i in result:
     10 #    print(round(i,2))

/content/xbos.py in fit_predict(self, data)
     56 
     57     def fit_predict(self,data):
---> 58         self.fit(data)
     59         return self.predict(data)

/content/xbos.py in fit(self, data)
     35                     if i != k:
     36                         dist = abs(kmeans.cluster_centers_[i] - kmeans.cluster_centers_[k])/max_distance
---> 37                         effect = ratio[k]*(1/pow(self.effectiveness,dist))
     38                         cluster_score[i] = cluster_score[i]+effect
     39 

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in __getitem__(self, key)
    880 
    881         elif key_is_scalar:
--> 882             return self._get_value(key)
    883 
    884         if is_hashable(key):

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in _get_value(self, label, takeable)
    988 
    989         # Similar to Index.get_value, but we do not fall back to positional
--> 990         loc = self.index.get_loc(label)
    991         return self.index._get_values_for_loc(self, loc, label)
    992 

/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:
-> 2900                 raise KeyError(key) from err
   2901 
   2902         if tolerance is not None:

KeyError: 7

Please let me know if I should change my configuration to run XBOS successfully. Please feel free to check out this Colab notebook and comment next to the cells for quick debugging.
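
The ConvergenceWarning above says that a column had fewer distinct clusters (7) than the requested n_clusters (8). As a hedged workaround sketch, the cluster count could be capped per column at the number of distinct values. The helper `usable_n_clusters` is hypothetical and not part of xbos.py, and this alone may not rule out empty clusters at prediction time.

```python
import numpy as np

def usable_n_clusters(values, n_clusters):
    """Cap the requested cluster count at the number of distinct values."""
    distinct = np.unique(np.asarray(values)).size
    return min(n_clusters, distinct)

# Made-up column with 7 distinct values, as in the warning above.
column = [1, 1, 2, 2, 3, 3, 3, 4, 5, 6, 7]
k = usable_n_clusters(column, n_clusters=8)
print(k)
```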

Some questions

I think there are some issues with the code as is:

First of all, I think
cluster_score=dict(assign.groupby('cluster').apply(len).apply(lambda x:x/length))

should be replaced by
cluster_score=dict(assign['cluster'].value_counts().apply(lambda x: x / length))
for i in range(self.n_clusters):
    if i not in cluster_score:
        cluster_score[i] = 0

... to prevent key errors.

Second:

for column in data.columns:
    kmeans = KMeans(n_clusters=self.n_clusters,max_iter=self.max_iter, random_state=0)
    self.kmeans[column]=kmeans
    kmeans.fit(data[column].values.reshape(-1,1))

Here you train nr_features kmeans models, on the first nr_features rows of the data. In other words, each of these models is trained on only one sample of the data. What is the motivation behind this?

Third:
sorted_centers = sorted(kmeans.cluster_centers_)
max_distance = ( sorted_centers[-1] - sorted_centers[0] )[ 0 ]

...To me, this doesn't seem to compute the maximum distance between your centers.
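
The second and third points can be checked on a small example. The sketch below uses made-up data and made-up centers (stand-ins for `kmeans.cluster_centers_`). It prints the shape of the array that a per-column model would be fitted on, then compares the sorted-centers expression (written here with an explicit sort key) against a brute-force maximum over all pairs of one-dimensional centers.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 5 rows, 2 feature columns.
data = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0],
                     "b": [10.0, 20.0, 30.0, 40.0, 50.0]})

# Second point: the array passed to fit() for one column.
column_input = data["a"].values.reshape(-1, 1)
print(column_input.shape)  # (number of rows, 1)

# Third point: made-up 1-D centers, shape (n_clusters, 1).
centers = np.array([[4.0], [1.0], [9.5]])
sorted_centers = sorted(centers, key=lambda c: c[0])
max_distance = (sorted_centers[-1] - sorted_centers[0])[0]

# Brute-force maximum distance over all pairs of centers.
pairwise_max = max(abs(a[0] - b[0]) for a in centers for b in centers)
print(max_distance, pairwise_max)
```

With centers on a single axis, the largest pairwise distance is the gap between the smallest and the largest center, so the two values agree in this example.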

Example run with error.

version
scikit-learn==0.19.1
pandas==0.22.0
numpy==1.14.1

In line 54, cluster_score seems to be 0.
