
anomalydetection's Introduction

Anomaly Detection for Python


Introduction

Twitter's Anomaly Detection is easy to use, but it is an R library.

Although there are some Python repos that run Twitter's anomaly detection algorithm, those libraries require R to be installed.

This repo aims to rewrite Twitter's Anomaly Detection algorithms in pure Python and provide the same functionality to users.

Install

pip3 install tad

Usage

import tad
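
A minimal end-to-end call might look like the sketch below. It assumes that tad exposes anomaly_detect_ts with the keyword arguments used in the issues further down (max_anoms, direction, plot) and that it returns a dict with an 'anoms' Series; treat it as illustrative, not as documented API.

import numpy as np
import pandas as pd
import tad

# A week of hourly toy data with one obvious spike injected at position 100.
idx = pd.date_range('2021-01-01', periods=24 * 7, freq='H')
values = np.sin(np.arange(24 * 7) * 2 * np.pi / 24)
values[100] = 10.0
data = pd.Series(values, index=idx)

# Assumed entry point, mirroring the calls shown in the issues below.
results = tad.anomaly_detect_ts(data, max_anoms=0.02, direction='both', plot=False)
print(results['anoms'])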

anomalydetection's People

Contributors

hokiegeek2, lbl1985, marcnuth


anomalydetection's Issues

No module named 'tad'

Hi,

I'm trying to use tad but it's not working.

Traceback (most recent call last):
  File "tad-test.py", line 1, in <module>
    import tad
ModuleNotFoundError: No module named 'tad'

tad version is 0.0.6

Exception with latest statsmodels library

The latest statsmodels==0.13.0 was released on October 2nd, 2021.
After upgrading, I started getting this error at tad/anomaly_detect_ts.py:411:

decomposed = sm.tsa.seasonal_decompose(data, freq=num_obs_per_period, two_sided=False)
TypeError: seasonal_decompose() got an unexpected keyword argument 'freq'

With statsmodels==0.12.2 there is already a deprecation warning:
tad/anomaly_detect_ts.py:412: FutureWarning: the 'freq' keyword is deprecated, use 'period' instead
  data, freq=num_obs_per_period, two_sided=False)
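
A likely fix (a sketch, not yet verified against the repo) is to follow the deprecation notice and pass period instead of freq, which is the keyword statsmodels >= 0.13 expects; data and num_obs_per_period are the library's own variables at that call site.

import statsmodels.api as sm

# 'freq' was removed in statsmodels 0.13; 'period' is the documented replacement.
decomposed = sm.tsa.seasonal_decompose(data, period=num_obs_per_period, two_sided=False)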

Plot=True referenced before assignment

When we call the function with the plot=True parameter, like this:
data_ = pd.read_csv('analytics_test.csv', index_col='timestamp', parse_dates=True, squeeze=True, date_parser=dparserfunc)

it returns this error:
UnboundLocalError: local variable 'num_days_per_line' referenced before assignment

in anomaly_detect_ts.py, around line 274, in the if plot: block that uses num_days_per_line, breaks and x_subset_week.
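
A workaround sketch, assuming the failure is confined to the plotting branch and that the function returns a dict with an 'anoms' Series (as the only_last=None report further down suggests): run detection with plot=False and chart the results manually.

import matplotlib.pyplot as plt
from anomaly_detection import anomaly_detect_ts  # import path as used in another report below

results = anomaly_detect_ts(data_, max_anoms=0.02, direction='both', plot=False)
anoms = results['anoms']

plt.plot(data_.index, data_.values, label='series')
plt.scatter(anoms.index, anoms.values, color='red', label='anomalies')
plt.legend()
plt.show()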

Integrate period selection and resampling

The period selection logic is implemented for increments larger than one second but is missing for ms and sec. We need to enable period selection in the cases where resampling is performed for sec and ms granularity.

Refactor loops to enable parallelization?

I am looking into parallelizing a section of code in detect_anoms where the majority of execution time is spent:

    if not one_tail:
        ares = abs(data - data.median())
    elif upper_tail:
        ares = data - data.median()
    else:
        ares = data.median() - data

    ares = ares / data.mad()

    tmp_anom_index = ares[ares.values == ares.max()].index
    cand = pd.Series(data.loc[tmp_anom_index], index=tmp_anom_index)

    data.drop(tmp_anom_index, inplace=True)

Is there a way to refactor the code so that the ordering enforced by the for loop around the data.drop invocations is no longer needed?

Similar question here:

for i in range(1, data.size + 1, num_obs_in_period):
    start_date = data.index[i]
    # if there is at least 14 days left, subset it, otherwise subset last_date - 14 days
    end_date = start_date + datetime.timedelta(days=num_days_in_period)
    if end_date < data.index[-1]:
        all_data.append(
            data.loc[lambda x: (x.index >= start_date) & (x.index <= end_date)])
    else:
        all_data.append(
            data.loc[lambda x: x.index >= data.index[-1] - datetime.timedelta(days=num_days_in_period)])
return all_data
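
One way to phrase this second loop so that each window is built independently of the others (the property a parallel map needs) is sketched below; it assumes data is a pandas Series with a DatetimeIndex and is illustrative only.

import datetime
import pandas as pd

def split_into_windows(data: pd.Series, num_obs_in_period: int, num_days_in_period: int):
    # Build each subset with no shared state, so the list comprehension could
    # be replaced by a parallel map (multiprocessing, joblib, etc.).
    last_date = data.index[-1]
    window_len = datetime.timedelta(days=num_days_in_period)

    def one_window(i):
        start_date = data.index[i]
        end_date = start_date + window_len
        if end_date < last_date:
            return data.loc[(data.index >= start_date) & (data.index <= end_date)]
        # otherwise take the trailing num_days_in_period days, as in the original
        return data.loc[data.index >= last_date - window_len]

    # stop before data.size so data.index[i] is always a valid position
    return [one_window(i) for i in range(1, data.size, num_obs_in_period)]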

I am a software engineer, not a data scientist, so this may be a very naive question. :)

--John

unit tests don't run, need more test coverage

When I attempt to run the test_detect_ts I get the following error:

Traceback (most recent call last):
  File "/var/development/git/hokiegeek2-forks/AnomalyDetection/tests/test_detect_ts.py", line 34, in
    test_anomaly_detect_ts(data)
  File "/var/development/git/hokiegeek2-forks/AnomalyDetection/tests/test_detect_ts.py", line 24, in test_anomaly_detect_ts
    results = detts.anomaly_detect_ts(data, max_anoms=0.02,
AttributeError: 'function' object has no attribute 'anomaly_detect_ts'

Also, we definitely need more unit test coverage.

modularize code

Hi Everyone,
This algorithm works really well and we'll be using it in production. To that end, I wanted to modularize the code and add corresponding unit tests to make it easier to enhance this excellent implementation of the Twitter algorithm.

@Marcnuth Pull request coming soon

--John

Library is not working as expected

Hello, I'm trying to test this library on a simple sinusoidal signal with some injected anomalies, but it's not working as I expected.

This is the sinusoidal signal:
[image: plot of the generated sinusoidal signal]

And this is the script:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
from anomaly_detection import anomaly_detect_ts
from datetime import datetime, timedelta

def datetime_range(start, end, delta):
    current = start
    while current < end:
        yield current
        current += delta

# Creating the base signal
Fs = 8000
f = 5
sample = 8000
now = datetime.now()
dts = [now + timedelta(hours=index) for index in range(sample)]
x = np.arange(sample)
y = np.sin(2 * np.pi * f * x / Fs)

# Now let's add some anomalies
for x in range(7200, 7270):
    y[x] = random.random()

# We call the library for detecting the anoms
pandas_dataframe = pd.Series(data=y, index=dts, dtype=None)
anoms = anomaly_detect_ts(pandas_dataframe,
                          max_anoms=0.1,
                          direction="both")

print(anoms)

plt.plot(dts, y)
plt.xlabel('date')
plt.ylabel('voltage(V)')
plt.show()

Do you know why it's not working, @Marcnuth ?

This ESD test implementation differs from Twitter's R implementation and from the definition of the test

By the definition of the test, and in Twitter's R implementation, all the candidates considered up to the largest i such that max_R_i > lambda_i are anomalies, not just the candidates from the individual iterations in which max_R_i > lambda_i holds.

Simulation studies have shown that the inequality max_R_i > lambda_i can swing back and forth as the iterations progress, so this Python implementation of the ESD test may miss some anomalies.
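
For comparison, here is a plain (non-seasonal, mean/standard-deviation based) generalized ESD sketch that follows the rule described above. It is illustrative only and is not the library's detect_anoms, which works on the seasonally decomposed residual with median/MAD.

import numpy as np
from scipy import stats

def generalized_esd(x, max_outliers, alpha=0.05):
    # Record every candidate, find the largest i with R_i > lambda_i, and flag
    # all candidates up to and including that i.
    x = np.asarray(x, dtype=float).copy()
    n = len(x)
    candidates, r_stats, lambdas = [], [], []
    for i in range(1, max_outliers + 1):
        dev = np.abs(x - np.nanmean(x))
        idx = int(np.nanargmax(dev))           # most extreme remaining point
        r_stats.append(dev[idx] / np.nanstd(x, ddof=1))
        candidates.append(idx)
        x[idx] = np.nan                        # remove it before the next pass
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lambdas.append((n - i) * t / np.sqrt((n - i - 1 + t ** 2) * (n - i + 1)))
    # The largest i satisfying R_i > lambda_i decides how many anomalies there
    # are; every earlier candidate is included, even if its own R_j <= lambda_j.
    num_anoms = max((i + 1 for i in range(max_outliers) if r_stats[i] > lambdas[i]), default=0)
    return candidates[:num_anoms]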

Best way to handle sec and ms level of granularity

I see that in anomaly_detect_ts the period is not set when the granularity is either 'sec' or 'ms':

elif timediff.seconds > 0:
    granularity = 'sec'
    # Aggregate data to minutely if secondly
    data = data.resample('60s', label='right').sum()
else:
    granularity = 'ms'

This causes the error reported in #11. Questions: should data at the sec and ms level (1) be resampled to 'min' granularity with period=1440, (2) be unsupported, with a ValueError thrown to interrupt processing, or (3) should both behaviors be provided as a configurable option?

I personally vote for (3). @Marcnuth @triciascully what do you think?
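
A rough shape for option (3), purely as a discussion aid (the helper and flag names here are hypothetical, not existing tad parameters):

def _handle_sub_minute(data, resample_sub_minute=True):
    # Hypothetical helper: either resample sec/ms data to minutely (option 1)
    # or refuse it with a ValueError (option 2), selected by a flag (option 3).
    if resample_sub_minute:
        # mirror the existing 'sec' branch: aggregate to minutely and use the
        # minutely period of 1440 observations per day
        resampled = data.resample('60s', label='right').sum()
        return resampled, 1440, 'min'
    raise ValueError('sec/ms granularity is not supported; '
                     'resample the series to >= 1 min intervals first')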

Issues with recent commits (args, kwargs, and more)

Hi there -

When running:

anomaly_detect_vec(data, max_anoms=0.02, period=96, direction="both", plot=True)

I keep getting this:

TypeError: __verbose_if() missing 1 required positional argument: 'kwargs'

I changed the assert validation from

assert isinstance(x) == pd.Series, 'x must be pandas series'

to

assert isinstance(x, pd.Series), 'Data must be a series(Pandas.Series)'

because I kept getting an error that isinstance() needed 2 arguments, and now I'm getting the __verbose_if error.

I have a feeling this might be a Python 2 vs. Python 3 issue, but I'm not familiar enough with Python to really know, and I don't think continuing on and trying to debug every new issue that pops up will work out... are there any suggestions or easily identifiable fixes you can see?

Add period configuration option to anomaly_detect_ts

I've been manually running the code and setting period = 96 because the data I'm working with is at a 15-minute interval. Can you build in an option to configure the period when calling the function, since people might be working with data at 10-minute, 15-minute, 30-minute, 2-hour, etc. intervals?
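
For illustration, a call with such an option might look like this; the period keyword on anomaly_detect_ts is hypothetical here, since it is exactly the configuration option being requested.

# 96 observations per day at a 15-minute interval; 'period' is the requested,
# not-yet-implemented option.
anoms = anomaly_detect_ts(data, max_anoms=0.02, direction='both', period=96)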

median absolute deviation or mean absolute deviation

The underlying algorithm uses the median absolute deviation in place of the standard deviation, to make it more robust against anomalous points.

But in this code, pandas' mad() is used, and pandas' mad() computes the mean absolute deviation, not the median absolute deviation. Both can work, but the median absolute deviation is better, in my opinion.
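
For reference, the median absolute deviation can be computed directly from a Series (a sketch only, not what the library currently does):

import pandas as pd

def median_abs_deviation(s: pd.Series) -> float:
    # Median of the absolute deviations from the median, the robust scale
    # estimate the original algorithm calls for.
    return float((s - s.median()).abs().median())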

local variable 'period' referenced before assignment

Getting this error in the context of TS anomaly detection on a metricbeat log stream where the timestamps are all from the same day:

results.append(anomaly_detect_ts(pd.Series(item), **self._generateParams()))
File "build/bdist.linux-x86_64/egg/anomaly_detection/anomaly_detect_ts.py", line 207, in anomaly_detect_ts
UnboundLocalError: local variable 'period' referenced before assignment

I can see in the code why this is happening: I am hitting the else clause with a granularity of 'ms', and period is never initialized. Logging shows why:

2018-09-19 19:25:49.858000 (data.index[1]) 2018-09-19 19:25:49.180000 (data.index[0])

So from a pure computer science perspective, it's clear what's happening.
timediff = data.index[1] - data.index[0]
if timediff.days > 0:
    num_days_per_line = 7
    only_last = 'day' if only_last == 'hr' else only_last
    period = 7
    granularity = 'day'
elif timediff.seconds / 60 / 60 >= 1:
    granularity = 'hr'
    period = 24
elif timediff.seconds / 60 >= 1:
    granularity = 'min'
    period = 1440
elif timediff.seconds > 0:
    granularity = 'sec'
    # Aggregate data to minutely if secondly
    data = data.resample('60s', label='right').sum()
else:
    granularity = 'ms'

The questions:

Should I change the sampling so that anomaly_detect_ts gets data points at intervals of 1 min or more, or should I consider anomaly_detect_vec for granularity of sec?

Any guidance is very much appreciated, thanks!

--John

Fails if only_last=None

First of all, great work on this! I am testing this out now for use in a streaming TS anomaly detection analytic for data center monitoring. A pure Python port of Twitter's uber cool algorithm is awesome.

I found an easy-to-fix error when only_last=None.

anomaly_detect_ts.anomaly_detect_ts(raw_data, max_anoms=0.02, direction="both", plot=True)
Traceback (most recent call last):
  File "", line 1, in
  File "/home/kjyost/development/git/AnomalyDetection/anomaly_detection/anomaly_detect_ts.py", line 257, in anomaly_detect_ts
    anom_pct = all_anoms.size / x_subset_single_day.size * 100
UnboundLocalError: local variable 'x_subset_single_day' referenced before assignment

It appears that all that's really being done with this variable outside of the if block is to determine whether there are any anomalies. Accordingly, I added this code:

if all_anoms.empty:
    if verbose:
        print('No anomalies detected.')

    return {
        'anoms': pd.Series(),
        'plot': None
    }

I tested this with only_last=day, hr, or None and it works as expected. Pull request coming soon.

--John
