xbos-anomaly-detection's People

Contributors

kanatoko

xbos-anomaly-detection's Issues

Problem in spark based setup like MS Azure Databricks & its solution

I'm experimenting with XBOS. I managed to run it successfully in a Google Colab notebook, but not in the MS Azure Databricks (DB) environment, where it fails with the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<command-1597574348978809> in <module>
      6 
      7 xbos = XBOS(n_clusters=2, max_iter=1)
----> 8 result = xbos.fit_predict(df)
      9 
     10 #for i in result:

<command-1597574348979803> in fit_predict(self, data)
     47 
     48     def fit_predict(self,data):
---> 49         self.fit(data)
     50         return self.predict(data)

<command-1597574348979803> in fit(self, data)
     25                 for k in range(self.n_clusters):
     26                     if i != k:
---> 27                         dist = abs(kmeans.cluster_centers_[i] - kmeans.cluster_centers_[k])/max_distance
     28                         effect = ratio[k]*(1/pow(self.effectiveness,dist))
     29                         cluster_score[i] = cluster_score[i]+effect

/databricks/spark/python/pyspark/sql/functions.py in abs(col)
    151     Computes the absolute value.
    152     """
--> 153     return _invoke_function_over_column("abs", col)
    154 
    155 

/databricks/spark/python/pyspark/sql/functions.py in _invoke_function_over_column(name, col)
     65     and wraps the result with :class:`~pyspark.sql.Column`.
     66     """
---> 67     return _invoke_function(name, _to_java_column(col))
     68 
     69 

/databricks/spark/python/pyspark/sql/column.py in _to_java_column(col)
     44         jcol = _create_column_from_name(col)
     45     else:
---> 46         raise TypeError(
     47             "Invalid argument, not a string or column: "
     48             "{0} of type {1}. "

TypeError: Invalid argument, not a string or column: [-5.3974359] of type <class 'numpy.ndarray'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

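After double-checking and adapting all the required libraries and dependencies, I realized that in the Spark-based setup the call to abs resolves to pyspark.sql.functions.abs (see the traceback above), which accepts only a column or a column name and therefore raises a TypeError when it is given a NumPy array.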

Solution: As the error traceback shows (line 27), replace dist = abs(kmeans..........)/max_distance with dist = np.abs(kmeans..........)/max_distance so that the DB setup handles <class 'numpy.ndarray'> correctly.
PS: np refers to import numpy as np, which is already imported in the implementation.
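
A minimal sketch of the name clash, for anyone who wants to reproduce it without a Spark cluster. The local `abs` function below is a hypothetical stand-in that only mimics how a column-only function rejects a NumPy array; it is not the real pyspark implementation, and the center values are made up for illustration.

```python
import builtins
import numpy as np


def abs(col):  # hypothetical stand-in for a column-only abs; shadows the built-in
    """Accept only a column name (string), the way a column function would."""
    if not isinstance(col, str):
        raise TypeError(
            f"Invalid argument, not a string or column: {col} of type {type(col)}."
        )
    return f"abs({col})"


# Made-up cluster centers with shape (n_clusters, 1), as KMeans would return them.
centers = np.array([[1.0], [6.3974359]])
max_distance = 5.3974359  # illustrative value

# The bare name now points at the shadowing function, so this call fails.
try:
    dist = abs(centers[0] - centers[1]) / max_distance
    shadowed_failed = False
except TypeError:
    shadowed_failed = True

# Being explicit about which abs is meant avoids the clash.
dist_np = np.abs(centers[0] - centers[1]) / max_distance
dist_builtin = builtins.abs(centers[0] - centers[1]) / max_distance
```

Either explicit form works on a NumPy array; `np.abs` is probably the less surprising choice here, since the operands are NumPy arrays anyway.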

Problem with number of clusters to execute the XBOS

During my experiments I noticed that there is a limit on the number of clusters, which results in the following error both on my big data and on the sample you provided:

My data has shape (1516385, 8). When I run XBOS with the defaults, xbos = XBOS() (that is, n_clusters=15, effectiveness=500, max_iter=2), I get KeyError: 0 together with the following message in the error traceback:

/content/xbos.py:25: ConvergenceWarning: Number of distinct clusters (7) found smaller than n_clusters (8). Possibly due to duplicate points in X.
  kmeans.fit(data[column].values.reshape(-1,1))

So in the end, XBOS can be executed on my data only with two clusters (n_clusters=2), which doesn't make sense. Even on the simple dataset you provide here, configuring it with n_clusters=8 throws a similar KeyError, mostly KeyError: 7:

/content/xbos.py:25: ConvergenceWarning: Number of distinct clusters (7) found smaller than n_clusters (8). Possibly due to duplicate points in X.
  kmeans.fit(data[column].values.reshape(-1,1))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897             try:
-> 2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 7

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
5 frames
<ipython-input-9-6c63ff2c8311> in <module>()
      6 
      7 xbos = XBOS(n_clusters=8, max_iter=1)
----> 8 result = xbos.fit_predict(dff)
      9 #for i in result:
     10 #    print(round(i,2))

/content/xbos.py in fit_predict(self, data)
     56 
     57     def fit_predict(self,data):
---> 58         self.fit(data)
     59         return self.predict(data)

/content/xbos.py in fit(self, data)
     35                     if i != k:
     36                         dist = abs(kmeans.cluster_centers_[i] - kmeans.cluster_centers_[k])/max_distance
---> 37                         effect = ratio[k]*(1/pow(self.effectiveness,dist))
     38                         cluster_score[i] = cluster_score[i]+effect
     39 

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in __getitem__(self, key)
    880 
    881         elif key_is_scalar:
--> 882             return self._get_value(key)
    883 
    884         if is_hashable(key):

/usr/local/lib/python3.7/dist-packages/pandas/core/series.py in _get_value(self, label, takeable)
    988 
    989         # Similar to Index.get_value, but we do not fall back to positional
--> 990         loc = self.index.get_loc(label)
    991         return self.index._get_values_for_loc(self, loc, label)
    992 

/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2898                 return self._engine.get_loc(casted_key)
   2899             except KeyError as err:
-> 2900                 raise KeyError(key) from err
   2901 
   2902         if tolerance is not None:

KeyError: 7

Please let me know if I should change my configuration to run XBOS successfully. Feel free to check out this Colab notebook and comment next to the cells for quick debugging.
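
A small sketch of what seems to be going on, under the assumption that the ConvergenceWarning above means some cluster labels were never assigned to any row. The labels below are hand-written rather than produced by KMeans, and `cap_n_clusters` is a hypothetical helper showing one possible workaround, not part of xbos.py.

```python
import numpy as np
import pandas as pd

n_clusters = 8

# Hand-written labels imitating a fit that found only 7 distinct clusters:
# label 7 never occurs.
labels = np.array([0, 0, 1, 2, 3, 4, 5, 6, 6, 6])
assign = pd.DataFrame({"cluster": labels})

# Share of rows per cluster; labels without rows get no entry at all.
ratio = assign.groupby("cluster").size() / len(assign)

# Looking up the missing label fails the same way as in the traceback.
try:
    ratio[7]
    missing_key = None
except KeyError as err:
    missing_key = err.args[0]


def cap_n_clusters(values, n_clusters):
    """Hypothetical helper: never ask for more clusters than distinct values."""
    return min(n_clusters, int(np.unique(values).size))
```

Capping the cluster count per column is only one option; filling the missing labels with a ratio of 0 would be another.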

Some questions

I think there are some issues with the code as is:

First of all, I think
cluster_score=dict(assign.groupby('cluster').apply(len).apply(lambda x:x/length))

should be replaced by
cluster_score=dict(assign['cluster'].value_counts().apply(lambda x: x / length))
for i in range(self.n_clusters):
    if i not in cluster_score:
        cluster_score[i] = 0

... to prevent key errors.
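
The replacement above could be wrapped up roughly like this. `cluster_ratios` is a hypothetical standalone function written for illustration (in xbos.py the logic lives inside `fit`), and the sample labels are made up.

```python
import pandas as pd


def cluster_ratios(assign, n_clusters):
    """Share of rows per cluster label, with 0.0 for labels that got no rows."""
    length = len(assign)
    counts = assign["cluster"].value_counts()
    ratios = {int(label): count / length for label, count in counts.items()}
    # Make sure every label in range(n_clusters) has an entry.
    for i in range(n_clusters):
        ratios.setdefault(i, 0.0)
    return ratios


# Made-up assignment in which label 2 never occurs.
assign = pd.DataFrame({"cluster": [0, 0, 1, 3]})
ratios = cluster_ratios(assign, n_clusters=4)
```

With every label present, the later lookups by cluster index no longer depend on how many distinct clusters the fit actually produced.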

Second:

for column in data.columns:
    kmeans = KMeans(n_clusters=self.n_clusters,max_iter=self.max_iter, random_state=0)
    self.kmeans[column]=kmeans
    kmeans.fit(data[column].values.reshape(-1,1))

Here you train nr_features kmeans models, on the first nr_features rows of the data. In other words, each of these models is trained on only one sample of the data. What is the motivation behind this?
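
For reference, a quick check of what that indexing yields, using a made-up two-column DataFrame. On a pandas DataFrame, `data[column]` selects a column, so each array passed to `fit` has one row per sample and a single feature.

```python
import pandas as pd

# Made-up data: 4 samples, 2 features.
data = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})

# Shape of the array each per-column model would be fitted on.
shapes = {
    column: data[column].values.reshape(-1, 1).shape for column in data.columns
}
```

So under this reading there is one model per feature, each seeing all rows of that feature rather than a single sample.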

Third:
sorted_centers = sorted(kmeans.cluster_centers_)
max_distance = ( sorted_centers[-1] - sorted_centers[0] )[ 0 ]

...To me this doesn't seem to compute the max distance between your centers.
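
A quick way to check this for the one-feature case, with made-up centers. When every model is fitted on a single column, the centers have shape (n_clusters, 1), and the largest minus the smallest center is then the same as the largest pairwise distance. This would not carry over to centers with more than one feature.

```python
from itertools import combinations

import numpy as np

# Made-up one-dimensional centers, shape (n_clusters, 1).
centers = np.array([[4.0], [-1.5], [10.0], [2.5]])

# The computation from the snippet above.
sorted_centers = sorted(centers)
max_distance = (sorted_centers[-1] - sorted_centers[0])[0]

# Brute-force maximum over all pairs of centers, for comparison.
pairwise_max = max(abs(a[0] - b[0]) for a, b in combinations(centers, 2))
```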

Example run with error.

version
scikit-learn==0.19.1
pandas==0.22.0
numpy==1.14.1

In line 54, cluster_score seems to be 0.
