kanatoko / xbos-anomaly-detection
XBOS Anomaly Detection
I'm experimenting with XBOS. I managed to run it successfully in a Google Colab notebook, but not in an MS Azure Databricks (DB) environment, due to the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-1597574348978809> in <module>
6
7 xbos = XBOS(n_clusters=2, max_iter=1)
----> 8 result = xbos.fit_predict(df)
9
10 #for i in result:
<command-1597574348979803> in fit_predict(self, data)
47
48 def fit_predict(self,data):
---> 49 self.fit(data)
50 return self.predict(data)
<command-1597574348979803> in fit(self, data)
25 for k in range(self.n_clusters):
26 if i != k:
---> 27 dist = abs(kmeans.cluster_centers_[i] - kmeans.cluster_centers_[k])/max_distance
28 effect = ratio[k]*(1/pow(self.effectiveness,dist))
29 cluster_score[i] = cluster_score[i]+effect
/databricks/spark/python/pyspark/sql/functions.py in abs(col)
151 Computes the absolute value.
152 """
--> 153 return _invoke_function_over_column("abs", col)
154
155
/databricks/spark/python/pyspark/sql/functions.py in _invoke_function_over_column(name, col)
65 and wraps the result with :class:`~pyspark.sql.Column`.
66 """
---> 67 return _invoke_function(name, _to_java_column(col))
68
69
/databricks/spark/python/pyspark/sql/column.py in _to_java_column(col)
44 jcol = _create_column_from_name(col)
45 else:
---> 46 raise TypeError(
47 "Invalid argument, not a string or column: "
48 "{0} of type {1}. "
TypeError: Invalid argument, not a string or column: [-5.3974359] of type <class 'numpy.ndarray'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
After I double-checked and adapted all the required libraries and dependencies, I realized what happens in the Spark-based setup: as the traceback shows, the name abs resolves to pyspark.sql.functions.abs instead of the Python built-in. That function accepts only a column or a column name, so passing it a NumPy array raises the TypeError.
Solution: at the line marked in the traceback (line 27), replace dist = abs(kmeans..........)/max_distance
with dist = np.abs(kmeans..........)/max_distance
so that the DB setup handles <class 'numpy.ndarray'> correctly.
PS: np refers to import numpy as np, which is already imported in the implementation.
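A minimal sketch of the name clash, assuming the notebook ran something like a wildcard import from pyspark.sql.functions (an assumption; the import itself does not appear in the traceback). To stay runnable without Spark, column_only_abs is a hypothetical stand-in for a column-only abs, and the centers and max_distance values are made up for illustration.

```python
import builtins
import numpy as np

def column_only_abs(col):
    """Hypothetical stand-in for an abs that only accepts column names."""
    if not isinstance(col, str):
        raise TypeError(
            f"Invalid argument, not a string or column: {col} of type {type(col)}"
        )
    return f"abs({col})"

# Simulates the shadowing: the module-level name `abs` no longer
# points at the Python built-in.
abs = column_only_abs

# Made-up 1-D cluster centers, shaped like KMeans.cluster_centers_.
centers = np.array([[1.0], [6.4]])
max_distance = 5.4

try:
    abs(centers[0] - centers[1])
    shadowed_failed = False
except TypeError:
    shadowed_failed = True

# Both of these bypass the shadowed name.
dist_np = np.abs(centers[0] - centers[1]) / max_distance
dist_builtin = builtins.abs(centers[0] - centers[1]) / max_distance
```

Calling np.abs explicitly keeps the computation independent of whatever the bare name abs happens to refer to in the notebook.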
XBOS-anomaly-detection/xbos.py
Line 26 in 41ee780
The error occurs when no samples are assigned to one or more clusters.
kmeans.predict does not guarantee that every cluster receives at least one sample.
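A small sketch of how an empty cluster turns into a KeyError, and one possible way to guard against it. The labels are made up and stand in for the output of kmeans.predict; n_clusters and the variable names are hypothetical, not taken from xbos.py.

```python
import pandas as pd

n_clusters = 4  # hypothetical setting

# Made-up cluster labels in which cluster 2 received no samples.
labels = pd.Series([0, 0, 1, 3, 3, 3], name="cluster")

# Share of samples per cluster; empty clusters do not appear in the index.
ratio = labels.value_counts(normalize=True)

try:
    ratio[2]
    raised = False
except KeyError:
    raised = True

# Reindexing over all cluster ids fills the missing ones with 0.
ratio_filled = ratio.reindex(range(n_clusters), fill_value=0.0)
```

After the reindex, ratio_filled[k] is defined for every k in range(n_clusters), so a loop over all cluster ids no longer depends on every cluster being populated.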
During my experiments I noticed that there is a limit on the number of clusters, which results in the following error both on my large dataset and on the sample you provided.
The shape of my data is (1516385, 8).
When I run XBOS with the defaults, xbos = XBOS(), which means n_clusters=15, effectiveness=500, max_iter=2, I get KeyError: 0, along with the following message in the error traceback:
/content/xbos.py:25: ConvergenceWarning: Number of distinct clusters (7) found smaller than n_clusters (8). Possibly due to duplicate points in X.
kmeans.fit(data[column].values.reshape(-1,1))
So in the end, XBOS runs on my data only with two clusters (n_clusters=2), which doesn't make sense. Even on the simple dataset you provided here, configuring it with n_clusters=8 threw a similar KeyError, mostly KeyError: 7:
/content/xbos.py:25: ConvergenceWarning: Number of distinct clusters (7) found smaller than n_clusters (8). Possibly due to duplicate points in X.
kmeans.fit(data[column].values.reshape(-1,1))
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2897 try:
-> 2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 7
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
5 frames
<ipython-input-9-6c63ff2c8311> in <module>()
6
7 xbos = XBOS(n_clusters=8, max_iter=1)
----> 8 result = xbos.fit_predict(dff)
9 #for i in result:
10 # print(round(i,2))
/content/xbos.py in fit_predict(self, data)
56
57 def fit_predict(self,data):
---> 58 self.fit(data)
59 return self.predict(data)
/content/xbos.py in fit(self, data)
35 if i != k:
36 dist = abs(kmeans.cluster_centers_[i] - kmeans.cluster_centers_[k])/max_distance
---> 37 effect = ratio[k]*(1/pow(self.effectiveness,dist))
38 cluster_score[i] = cluster_score[i]+effect
39
/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in __getitem__(self, key)
880
881 elif key_is_scalar:
--> 882 return self._get_value(key)
883
884 if is_hashable(key):
/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in _get_value(self, label, takeable)
988
989 # Similar to Index.get_value, but we do not fall back to positional
--> 990 loc = self.index.get_loc(label)
991 return self.index._get_values_for_loc(self, loc, label)
992
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2898 return self._engine.get_loc(casted_key)
2899 except KeyError as err:
-> 2900 raise KeyError(key) from err
2901
2902 if tolerance is not None:
KeyError: 7
Please let me know if I should change my configuration to run XBOS successfully. Feel free to check out this Colab notebook and comment next to the cells for quick debugging.
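The ConvergenceWarning above points at columns with fewer distinct values than n_clusters. One possible workaround, sketched here under that assumption, is to cap the cluster count per column before fitting. effective_n_clusters is a hypothetical helper, not part of xbos.py, and capping may reduce the problem without fully ruling out empty clusters.

```python
import numpy as np

def effective_n_clusters(values, n_clusters):
    """Cap the requested cluster count at the number of distinct values."""
    distinct = np.unique(np.asarray(values)).size
    return max(1, min(n_clusters, distinct))

# Made-up column with only three distinct values.
column_values = [1, 1, 2, 2, 3, 3, 3]
capped = effective_n_clusters(column_values, 8)
unchanged = effective_n_clusters(range(100), 8)
```

The capped value could then be passed as n_clusters when building the per-column KMeans model, with the scoring loops iterating over that per-column count.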
I think there are some issues with the code as is:
First of all, I think
cluster_score=dict(assign.groupby('cluster').apply(len).apply(lambda x:x/length))
should be replaced by
cluster_score=dict(assign['cluster'].value_counts().apply(lambda x: x / length))
for i in range(self.n_clusters):
    if i not in cluster_score:
        cluster_score[i] = 0
... to prevent key errors.
Second:
for column in data.columns:
    kmeans = KMeans(n_clusters=self.n_clusters,max_iter=self.max_iter, random_state=0)
    self.kmeans[column]=kmeans
    kmeans.fit(data[column].values.reshape(-1,1))
Here you train nr_features kmeans models, as I read it on the first nr_features rows of the data. In other words, each of these models is trained using only one sample of the data. What is the motivation behind this?
Third:
sorted_centers = sorted(kmeans.cluster_centers_)
max_distance = ( sorted_centers[-1] - sorted_centers[0] )[ 0 ]
... To me this doesn't seem to compute the max distance between your centers.
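A quick way to check the second and third points on toy data. The frame and the center values are made up (the centers are not fitted), and data is assumed to be a pandas DataFrame as in the tracebacks above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 5 rows, 2 feature columns.
data = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 10.0, 11.0],
    "b": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Second point: data[column] selects one column across all rows,
# so the array passed to fit has shape (n_rows, 1).
column_shape = data["a"].values.reshape(-1, 1).shape

# Third point: centers shaped like KMeans.cluster_centers_ for
# one-dimensional input, i.e. (n_clusters, 1).
centers = np.array([[10.5], [2.0], [6.0]])
sorted_centers = sorted(centers)  # each row holds a single value
max_distance = (sorted_centers[-1] - sorted_centers[0])[0]

# Brute-force largest pairwise distance for comparison.
flat = centers.ravel()
pairwise_max = np.abs(flat[:, None] - flat[None, :]).max()
```

On this toy input each per-column array contains every row, and for one-dimensional centers the largest minus the smallest center equals the largest pairwise distance. The sorted-based expression would not generalize to centers with more than one dimension.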
Versions:
scikit-learn==0.19.1
pandas==0.22.0
numpy==1.14.1
In line 54, cluster_score seems to be 0.