marcnuth / anomalydetection
Twitter's Anomaly Detection in Pure Python
License: Apache License 2.0
Can I get your email?
I am looking into parallelizing a section of code in detect_anoms where the majority of execution time is spent:
if not one_tail:
    ares = abs(data - data.median())
elif upper_tail:
    ares = data - data.median()
else:
    ares = data.median() - data
ares = ares / data.mad()
tmp_anom_index = ares[ares.values == ares.max()].index
cand = pd.Series(data.loc[tmp_anom_index], index=tmp_anom_index)
data.drop(tmp_anom_index, inplace=True)
Is there a way to refactor the code so that ordering enforced by the for loop for the data.drop invocations is no longer needed?
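For context, the loop is order-dependent because the median and MAD are recomputed after each removal, so each drop changes the statistics for the next iteration. A single vectorized pass is possible only as an approximation that freezes those statistics. A minimal sketch (the helper name `top_k_candidates` is mine, not the library's):

```python
import pandas as pd

def top_k_candidates(data: pd.Series, k: int) -> pd.Index:
    # One vectorized pass: rank every point by |x - median| / MAD.
    # NOTE: not equivalent to the sequential ESD loop, which recomputes
    # the median and MAD after each drop -- this is an approximation.
    med = data.median()
    mad = (data - med).abs().median()
    ares = (data - med).abs() / mad
    return ares.nlargest(k).index

s = pd.Series([1.0, 1.1, 0.9, 1.0, 10.0, -8.0, 1.05])
print(sorted(top_k_candidates(s, 2)))  # [4, 5]
```

Whether the approximation is acceptable depends on how much the median/MAD shift as points are removed; for a small anomaly fraction the shift is usually small.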
Similar question here:
for i in range(1, data.size + 1, num_obs_in_period):
    start_date = data.index[i]
    # if there is at least 14 days left, subset it, otherwise subset last_date - 14 days
    end_date = start_date + datetime.timedelta(days=num_days_in_period)
    if end_date < data.index[-1]:
        all_data.append(
            data.loc[lambda x: (x.index >= start_date) & (x.index <= end_date)])
    else:
        all_data.append(
            data.loc[lambda x: x.index >= data.index[-1] - datetime.timedelta(days=num_days_in_period)])
return all_data
I am a software engineer, not a data scientist, so this may be a very naive question. :)
--John
The latest statsmodels==0.13.0 was released on October 2nd, 2021.
After that I started getting this issue in tad/anomaly_detect_ts.py:411.
decomposed = sm.tsa.seasonal_decompose(data, freq=num_obs_per_period, two_sided=False)
TypeError: seasonal_decompose() got an unexpected keyword argument 'freq'
With statsmodels==0.12.2 this was only a warning:
tad/anomaly_detect_ts.py:412: FutureWarning: the 'freq' keyword is deprecated, use 'period' instead
  data, freq=num_obs_per_period, two_sided=False)
I see in the anomaly_detect_ts the period is not set when the granularity is either 'sec' or 'ms':
elif timediff.seconds > 0:
    granularity = 'sec'
    # Aggregate data to minutely if secondly
    data = data.resample('60s', label='right').sum()
else:
    granularity = 'ms'
This causes the error reported in #11. Questions: should data at the sec and ms level (1) be resampled to the 'min' level of granularity with period=1440, (2) be unsupported, with a ValueError thrown to interrupt processing, or (3) should both behaviors be offered as a configurable option?
I personally vote for (3). @Marcnuth @triciascully, what do you think?
I've been manually running the code and setting period = 96 because the data I'm working with is at a 15-minute interval. Can you build in an option to configure the period when calling the function? People might be working with data at intervals of 10 min, 15 min, 30 min, 2 hrs, etc.
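For daily seasonality the period is just the number of observations per day, so a configurable option could compute it from the sampling interval. A sketch (`period_for_interval` is a hypothetical helper, not part of the library):

```python
def period_for_interval(minutes_per_obs: int) -> int:
    # Observations per day for a given sampling interval in minutes;
    # e.g. 15-minute data -> 1440 / 15 = 96, matching the manual setting above.
    return 24 * 60 // minutes_per_obs

for m in (10, 15, 30, 120):
    print(m, period_for_interval(m))  # 144, 96, 48, 12
```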
The primary algorithm uses the median absolute deviation in place of the standard deviation to make it more robust against anomalous points.
But this code uses pandas' mad(), which computes the mean absolute deviation, not the median absolute deviation. Both can work, but the median absolute deviation is better, in my opinion.
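To illustrate the difference on a series with one outlier (a sketch; `median_abs_deviation` is my helper, not the library's):

```python
import pandas as pd

def median_abs_deviation(s: pd.Series) -> float:
    # Median absolute deviation: median of |x - median(x)|.
    return float((s - s.median()).abs().median())

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Mean absolute deviation (what pandas' mad() computed) is dragged up
# by the outlier; the median absolute deviation is not.
mean_ad = float((s - s.mean()).abs().mean())
print(mean_ad)                   # 31.2
print(median_abs_deviation(s))   # 1.0
```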
When we call it with the plot=True parameter, like this:
data_ = pd.read_csv('analytics_test.csv', index_col='timestamp', parse_dates=True, squeeze=True, date_parser=dparserfunc)
returns this error:
UnboundLocalError: local variable 'num_days_per_line' referenced before assignment
in anomaly_detect_ts.py at line 274:
... if plot: num_days_per_line breaks x_subset_week ...
When I attempt to run the test_detect_ts I get the following error:
Traceback (most recent call last):
File "/var/development/git/hokiegeek2-forks/AnomalyDetection/tests/test_detect_ts.py", line 34, in
test_anomaly_detect_ts(data)
File "/var/development/git/hokiegeek2-forks/AnomalyDetection/tests/test_detect_ts.py", line 24, in test_anomaly_detect_ts
results = detts.anomaly_detect_ts(data, max_anoms=0.02,
AttributeError: 'function' object has no attribute 'anomaly_detect_ts'
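This error usually means the name `detts` is bound to the function rather than to the module of the same name (the file anomaly_detect_ts.py defines a function anomaly_detect_ts, so an import can yield either, depending on how it is written). A self-contained reproduction of the pattern, independent of the library:

```python
def anomaly_detect_ts(data, **kwargs):
    # Stand-in for the library function of the same name.
    return 'ok'

detts = anomaly_detect_ts  # detts is now the FUNCTION, not a module

try:
    detts.anomaly_detect_ts(None)  # module-style call on a function object
except AttributeError as e:
    print(e)  # 'function' object has no attribute 'anomaly_detect_ts'

print(detts(None))  # calling the function directly works: prints 'ok'
```

The fix in the test would be either to call the imported function directly, or to import the module explicitly and keep the attribute-style call.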
Also, definitely need more unit test coverage
The code works great in Python3 but returns no anomalies when I run it in Python2.
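One plausible culprit (an assumption, not a confirmed diagnosis): `/` on integers is floor division in Python 2 but true division in Python 3, and the granularity checks divide `timediff.seconds`. Under Python 3:

```python
# True division (Python 3) vs. what Python 2 produced for the same '/':
print(1 / 2)     # 0.5 in Python 3; 0 in Python 2
print(90 / 60)   # 1.5 in Python 3; 1 in Python 2
print(90 // 60)  # 1 in both -- '//' is explicit floor division
```

Any threshold comparison built on such a quotient can silently take a different branch in Python 2.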
Hi Everyone,
This algorithm works really well and we'll be using it in production. To that end, I wanted to modularize the code and add corresponding unit tests to make it easier to enhance this excellent implementation of the Twitter algorithm.
@Marcnuth Pull request coming soon
--John
In the referenced line, `get_data_tuple` should be `_get_data_tuple` instead.
By the definition of the test and Twitter's R implementation, all candidates considered up to the largest i such that max_R_i > lambda_i are anomalies, not just those from the iterations where max_R_i > lambda_i holds.
Simulation studies have shown that the inequality max_R_i > lambda_i can swing back and forth as the iterations progress, so this Python implementation of the ESD test may miss some anomalies.
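A sketch of the correct decision rule (`num_anomalies`, `max_R`, and `lam` are illustrative names, not the library's): the anomaly count is the largest i with max_R_i > lambda_i, so every candidate removed up to and including that iteration counts, even if the inequality failed at some iteration in between.

```python
def num_anomalies(max_R, lam):
    # Generalized ESD rule: number of anomalies = largest 1-based i such
    # that max_R[i] > lam[i]; intermediate failures do not stop the scan.
    n = 0
    for i, (r, l) in enumerate(zip(max_R, lam)):
        if r > l:
            n = i + 1
    return n

# The inequality fails at iteration 2 but holds again at iteration 3,
# so the first three candidates are all anomalies:
print(num_anomalies([5.0, 2.0, 4.0, 1.0], [3.0, 3.0, 3.0, 3.0]))  # 3
```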
Hello, I'm trying to test this library to a simple sinusoidal signal with some anomalies, but it's not working as I expected.
And this is the script:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
from anomaly_detection import anomaly_detect_ts
from datetime import datetime, timedelta

def datetime_range(start, end, delta):
    current = start
    while current < end:
        yield current
        current += delta

# Creating the base signal
Fs = 8000
f = 5
sample = 8000
now = datetime.now()
dts = [now + timedelta(hours=index) for index in range(sample)]
x = np.arange(sample)
y = np.sin(2 * np.pi * f * x / Fs)

# Now let's add some anomalies (loop variable i, so it doesn't shadow x)
for i in range(7200, 7270):
    y[i] = random.random()

# We call the library for detecting the anoms
pandas_dataframe = pd.Series(data=y, index=dts, dtype=None)
anoms = anomaly_detect_ts(pandas_dataframe,
                          max_anoms=0.1,
                          direction="both")
print(anoms)

plt.plot(dts, y)
plt.xlabel('date')
plt.ylabel('voltage(V)')
plt.show()
Do you know why it's not working, @Marcnuth ?
First of all, great work on this! I am testing this out now for use in a streaming TS anomaly detection analytic for data center monitoring. A pure Python port of Twitter's uber cool algorithm is awesome.
Found an easy-to-fix error when only_last=None:
anomaly_detect_ts.anomaly_detect_ts(raw_data, max_anoms=0.02, direction="both", plot=True)
Traceback (most recent call last):
File "", line 1, in
File "/home/kjyost/development/git/AnomalyDetection/anomaly_detection/anomaly_detect_ts.py", line 257, in anomaly_detect_ts
anom_pct = all_anoms.size / x_subset_single_day.size * 100
UnboundLocalError: local variable 'x_subset_single_day' referenced before assignment
It appears that all this block really does, outside of the if, is determine whether there are anomalies. Accordingly, I added this code:
if all_anoms.empty:
    if verbose:
        print('No anomalies detected.')
    return {
        'anoms': pd.Series(),
        'plot': None
    }
I tested this with only_last=day, hr, or None and it works as expected. Pull request coming soon.
--John
Hi,
I'm trying to use tad but it's not working.
Traceback (most recent call last):
File "tad-test.py", line 1, in <module>
import tad
ModuleNotFoundError: No module named 'tad'
tad version is 0.0.6
MS provides an R library for anomaly tagging: https://github.com/microsoft/TagAnomaly
Hi there -
When running:
anomaly_detect_vec(data, max_anoms=0.02, period=96, direction="both", plot=True)
I keep getting this:
TypeError: __verbose_if() missing 1 required positional argument: 'kwargs'
I changed the assert validation from
assert isinstance(x) == pd.Series, 'x must be pandas series'
to
assert isinstance(x, pd.Series), 'Data must be a series(Pandas.Series)'
because I kept getting an error that isinstance() requires 2 arguments, and now I'm getting the __verbose error.
I have a feeling this might be a Python 2 vs. Python 3 issue, but I'm not familiar enough with Python to really know, and I don't think debugging every new issue that pops up will work out. Any suggestions or easily identifiable fixes you can see?
After commit efbc4e1f27f82727f4f25d8725ffe16604cb6b76
I have problems with importing the library.
pip3 install tad
import tad
>ModuleNotFoundError: No module named tad
Hi, I was wondering if it is possible to update this algorithm for a seconds level of granularity?
There are several spots where Pandas DataFrame operations can be parallelized with Dask
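As a stand-in sketch for the Dask idea using only the standard library: partition the Series and run independent per-chunk work in parallel. The assumption is that the operation has no cross-chunk dependency, which is exactly what the sequential detect_anoms loop lacks.

```python
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def abs_dev_from_own_median(chunk: pd.Series) -> pd.Series:
    # Independent per-partition work: absolute deviation from the
    # partition's own median (no data shared between chunks).
    return (chunk - chunk.median()).abs()

s = pd.Series(np.arange(12, dtype=float))
chunks = np.array_split(s, 4)  # 4 partitions, Dask-style

with ThreadPoolExecutor(max_workers=4) as ex:
    parts = list(ex.map(abs_dev_from_own_median, chunks))

result = pd.concat(parts)
print(result.size)  # 12
```

With Dask itself, `dask.dataframe.from_pandas(..., npartitions=4)` plays the role of `np.array_split`, and the scheduler replaces the executor.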
The period-selection logic is set for increments greater than seconds but is missing for ms and sec. We need to enable period selection in the cases where resampling is performed for sec and ms granularity.
Getting this error in the context of TS anomaly detection on a metricbeat log stream where the timestamps are all from the same day:
results.append(anomaly_detect_ts(pd.Series(item), **self._generateParams()))
File "build/bdist.linux-x86_64/egg/anomaly_detection/anomaly_detect_ts.py", line 207, in anomaly_detect_ts
UnboundLocalError: local variable 'period' referenced before assignment
I can see in the code why this is happening: I am hitting the else clause with granularity of ms, and period is never initialized. Logging shows why:
2018-09-19 19:25:49.858000 (data.index[1]) 2018-09-19 19:25:49.180000 (data.index[0])
So from a pure computer science perspective, it's clear what's happening.
timediff = data.index[1] - data.index[0]
if timediff.days > 0:
    num_days_per_line = 7
    only_last = 'day' if only_last == 'hr' else only_last
    period = 7
    granularity = 'day'
elif timediff.seconds / 60 / 60 >= 1:
    granularity = 'hr'
    period = 24
elif timediff.seconds / 60 >= 1:
    granularity = 'min'
    period = 1440
elif timediff.seconds > 0:
    granularity = 'sec'
    # Aggregate data to minutely if secondly
    data = data.resample('60s', label='right').sum()
else:
    granularity = 'ms'
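A sketch of one way to close the gap (an assumption, not the library's current behavior): aggregate sub-minute data to minutes and always assign a period, via a hypothetical helper `choose_period` mirroring the branch structure above.

```python
import numpy as np
import pandas as pd

def choose_period(data: pd.Series):
    # Mirrors the granularity branches, but 'sec' and 'ms' now also get a
    # period (1440 = minutes per day) after resampling to minutes.
    timediff = data.index[1] - data.index[0]
    if timediff.days > 0:
        return data, 'day', 7
    if timediff.seconds >= 3600:
        return data, 'hr', 24
    if timediff.seconds >= 60:
        return data, 'min', 1440
    granularity = 'sec' if timediff.seconds > 0 else 'ms'
    data = data.resample('60s', label='right').sum()
    return data, granularity, 1440

idx = pd.date_range('2018-09-19 19:25:49', periods=600, freq='s')
s = pd.Series(np.ones(600), index=idx)
s2, granularity, period = choose_period(s)
print(granularity, period)  # sec 1440
```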
The questions:
Should I change the sampling so that anomaly_detect_ts gets data points at intervals of 1 min or more, or should I consider anomaly_detect_vec for granularity of sec?
Any guidance is very much appreciated, thanks!
--John