marcnuth / anomalydetection
Twitter's Anomaly Detection in Pure Python
License: Apache License 2.0
Can I get your email?
I am looking into parallelizing a section of code in detect_anoms where the majority of execution time is spent:
if not one_tail:
    ares = abs(data - data.median())
elif upper_tail:
    ares = data - data.median()
else:
    ares = data.median() - data
ares = ares / data.mad()
tmp_anom_index = ares[ares.values == ares.max()].index
cand = pd.Series(data.loc[tmp_anom_index], index=tmp_anom_index)
data.drop(tmp_anom_index, inplace=True)
Is there a way to refactor the code so that ordering enforced by the for loop for the data.drop invocations is no longer needed?
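For context, the loop is order-dependent because the median and MAD are recomputed after each removal, so each drop changes the statistics for the next iteration. A single vectorized pass is possible only as an approximation that freezes those statistics. A minimal sketch (the helper name `top_k_candidates` is mine, not the library's):

```python
import pandas as pd

def top_k_candidates(data: pd.Series, k: int) -> pd.Index:
    # One vectorized pass: rank every point by |x - median| / MAD.
    # NOTE: not equivalent to the sequential ESD loop, which recomputes
    # the median and MAD after each drop -- this is an approximation.
    med = data.median()
    mad = (data - med).abs().median()
    ares = (data - med).abs() / mad
    return ares.nlargest(k).index

s = pd.Series([1.0, 1.1, 0.9, 1.0, 10.0, -8.0, 1.05])
print(sorted(top_k_candidates(s, 2)))  # [4, 5]
```

Whether the approximation is acceptable depends on how much the median/MAD shift as points are removed; for a small anomaly fraction the shift is usually small.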
Similar question here:
for i in range(1, data.size + 1, num_obs_in_period):
    start_date = data.index[i]
    # if there is at least 14 days left, subset it, otherwise subset last_date - 14 days
    end_date = start_date + datetime.timedelta(days=num_days_in_period)
    if end_date < data.index[-1]:
        all_data.append(
            data.loc[lambda x: (x.index >= start_date) & (x.index <= end_date)])
    else:
        all_data.append(
            data.loc[lambda x: x.index >= data.index[-1] - datetime.timedelta(days=num_days_in_period)])
return all_data
I am a software engineer, not a data scientist, so this may be a very naive question. :)
--John
The latest statsmodels==0.13.0 was released on October 2nd, 2021.
After that I started getting this issue in tad/anomaly_detect_ts.py:411.
decomposed = sm.tsa.seasonal_decompose(data, freq=num_obs_per_period, two_sided=False)
TypeError: seasonal_decompose() got an unexpected keyword argument 'freq'
With statsmodels==0.12.2 this was only a warning:
tad/anomaly_detect_ts.py:412: FutureWarning: the 'freq' keyword is deprecated, use 'period' instead
  data, freq=num_obs_per_period, two_sided=False)
I see in the anomaly_detect_ts the period is not set when the granularity is either 'sec' or 'ms':
elif timediff.seconds > 0:
    granularity = 'sec'
    # Aggregate data to minutely if secondly
    data = data.resample('60s', label='right').sum()
else:
    granularity = 'ms'
This causes the error reported in #11. Questions: should data at the sec and ms level (1) be resampled to the 'min' level of granularity with period=1440, (2) be unsupported, with a ValueError thrown to interrupt processing, or (3) should both behaviors be offered as a configurable option?
I personally vote for (3). @Marcnuth @triciascully, what do you think?
I've been manually running the code and setting period = 96 because the data I'm working with is at a 15-minute interval. Can you build in an option to configure the period when calling the function? People might be working with data at intervals of 10 min, 15 min, 30 min, 2 hrs, etc.
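For daily seasonality the period is just the number of observations per day, so a configurable option could compute it from the sampling interval. A sketch (`period_for_interval` is a hypothetical helper, not part of the library):

```python
def period_for_interval(minutes_per_obs: int) -> int:
    # Observations per day for a given sampling interval in minutes;
    # e.g. 15-minute data -> 1440 / 15 = 96, matching the manual setting above.
    return 24 * 60 // minutes_per_obs

for m in (10, 15, 30, 120):
    print(m, period_for_interval(m))  # 144, 96, 48, 12
```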
The primary algorithm uses the median absolute deviation in place of the standard deviation to make it more robust against anomalous points.
But this code uses pandas' mad(), which computes the mean absolute deviation, not the median absolute deviation. Both can work, but the median absolute deviation is better, in my opinion.
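To illustrate the difference on a series with one outlier (a sketch; `median_abs_deviation` is my helper, not the library's):

```python
import pandas as pd

def median_abs_deviation(s: pd.Series) -> float:
    # Median absolute deviation: median of |x - median(x)|.
    return float((s - s.median()).abs().median())

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Mean absolute deviation (what pandas' mad() computed) is dragged up
# by the outlier; the median absolute deviation is not.
mean_ad = float((s - s.mean()).abs().mean())
print(mean_ad)                   # 31.2
print(median_abs_deviation(s))   # 1.0
```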
When we call it with the plot=True parameter, like this:
data_ = pd.read_csv('analytics_test.csv', index_col='timestamp', parse_dates=True, squeeze=True, date_parser=dparserfunc)
returns this error:
UnboundLocalError: local variable 'num_days_per_line' referenced before assignment
in anomaly_detect_ts.py at line 274:
... if plot: num_days_per_line breaks x_subset_week ...
When I attempt to run the test_detect_ts I get the following error:
Traceback (most recent call last):
File "/var/development/git/hokiegeek2-forks/AnomalyDetection/tests/test_detect_ts.py", line 34, in
test_anomaly_detect_ts(data)
File "/var/development/git/hokiegeek2-forks/AnomalyDetection/tests/test_detect_ts.py", line 24, in test_anomaly_detect_ts
results = detts.anomaly_detect_ts(data, max_anoms=0.02,
AttributeError: 'function' object has no attribute 'anomaly_detect_ts'
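This error usually means the name `detts` is bound to the function rather than to the module of the same name (the file anomaly_detect_ts.py defines a function anomaly_detect_ts, so an import can yield either, depending on how it is written). A self-contained reproduction of the pattern, independent of the library:

```python
def anomaly_detect_ts(data, **kwargs):
    # Stand-in for the library function of the same name.
    return 'ok'

detts = anomaly_detect_ts  # detts is now the FUNCTION, not a module

try:
    detts.anomaly_detect_ts(None)  # module-style call on a function object
except AttributeError as e:
    print(e)  # 'function' object has no attribute 'anomaly_detect_ts'

print(detts(None))  # calling the function directly works: prints 'ok'
```

The fix in the test would be either to call the imported function directly, or to import the module explicitly and keep the attribute-style call.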
Also, definitely need more unit test coverage
The code works great in Python3 but returns no anomalies when I run it in Python2.
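One plausible culprit (an assumption, not a confirmed diagnosis): `/` on integers is floor division in Python 2 but true division in Python 3, and the granularity checks divide `timediff.seconds`. Under Python 3:

```python
# True division (Python 3) vs. what Python 2 produced for the same '/':
print(1 / 2)     # 0.5 in Python 3; 0 in Python 2
print(90 / 60)   # 1.5 in Python 3; 1 in Python 2
print(90 // 60)  # 1 in both -- '//' is explicit floor division
```

Any threshold comparison built on such a quotient can silently take a different branch in Python 2.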
Hi Everyone,
This algorithm works really well and we'll be using it in production. To that end, I wanted to modularize the code and add corresponding unit tests to make it easier to enhance this excellent implementation of the Twitter algorithm.
@Marcnuth Pull request coming soon
--John
In the referenced line, `get_data_tuple` should be `_get_data_tuple` instead.
By the definition of the test and Twitter's R implementation, all candidates considered up to the largest i such that max_R_i > lambda_i are anomalies, not just those from the iterations where max_R_i > lambda_i holds.
Simulation studies have shown that the inequality max_R_i > lambda_i can swing back and forth as the iterations progress, so this Python implementation of the ESD test may miss some anomalies.
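A sketch of the correct decision rule (`num_anomalies`, `max_R`, and `lam` are illustrative names, not the library's): the anomaly count is the largest i with max_R_i > lambda_i, so every candidate removed up to and including that iteration counts, even if the inequality failed at some iteration in between.

```python
def num_anomalies(max_R, lam):
    # Generalized ESD rule: number of anomalies = largest 1-based i such
    # that max_R[i] > lam[i]; intermediate failures do not stop the scan.
    n = 0
    for i, (r, l) in enumerate(zip(max_R, lam)):
        if r > l:
            n = i + 1
    return n

# The inequality fails at iteration 2 but holds again at iteration 3,
# so the first three candidates are all anomalies:
print(num_anomalies([5.0, 2.0, 4.0, 1.0], [3.0, 3.0, 3.0, 3.0]))  # 3
```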
Hello, I'm trying to test this library to a simple sinusoidal signal with some anomalies, but it's not working as I expected.
And this is the script:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
from anomaly_detection import anomaly_detect_ts
from datetime import datetime, timedelta

def datetime_range(start, end, delta):
    current = start
    while current < end:
        yield current
        current += delta

# Creating the base signal
Fs = 8000
f = 5
sample = 8000
now = datetime.now()
dts = [now + timedelta(hours=index) for index in range(sample)]
x = np.arange(sample)
y = np.sin(2 * np.pi * f * x / Fs)

# Now let's add some anomalies (loop variable i, so it doesn't shadow x)
for i in range(7200, 7270):
    y[i] = random.random()

# We call the library for detecting the anoms
pandas_dataframe = pd.Series(data=y, index=dts, dtype=None)
anoms = anomaly_detect_ts(pandas_dataframe,
                          max_anoms=0.1,
                          direction="both")
print(anoms)

plt.plot(dts, y)
plt.xlabel('date')
plt.ylabel('voltage(V)')
plt.show()
Do you know why it's not working, @Marcnuth ?
First of all, great work on this! I am testing this out now for use in a streaming TS anomaly detection analytic for data center monitoring. A pure Python port of Twitter's uber cool algorithm is awesome.
Found an easy-to-fix error when only_last=None:
anomaly_detect_ts.anomaly_detect_ts(raw_data, max_anoms=0.02, direction="both", plot=True)
Traceback (most recent call last):
File "", line 1, in
File "/home/kjyost/development/git/AnomalyDetection/anomaly_detection/anomaly_detect_ts.py", line 257, in anomaly_detect_ts
anom_pct = all_anoms.size / x_subset_single_day.size * 100
UnboundLocalError: local variable 'x_subset_single_day' referenced before assignment
It appears that all this block really does, outside of the if, is determine whether there are anomalies. Accordingly, I added this code:
if all_anoms.empty:
    if verbose:
        print('No anomalies detected.')
    return {
        'anoms': pd.Series(),
        'plot': None
    }
I tested this with only_last=day, hr, or None and it works as expected. Pull request coming soon.
--John
Hi,
I'm trying to use tad but it's not working.
Traceback (most recent call last):
File "tad-test.py", line 1, in <module>
import tad
ModuleNotFoundError: No module named 'tad'
tad version is 0.0.6
MS provides an R library for anomaly tagging: https://github.com/microsoft/TagAnomaly
Hi there -
When running:
anomaly_detect_vec(data, max_anoms=0.02, period=96, direction="both", plot=True)
I keep getting this:
TypeError: __verbose_if() missing 1 required positional argument: 'kwargs'
I changed the assert validation from
assert isinstance(x) == pd.Series, 'x must be pandas series'
to
assert isinstance(x, pd.Series), 'Data must be a series(Pandas.Series)'
because I kept getting an error that isinstance() requires 2 arguments, and now I'm getting the __verbose error.
I have a feeling this might be a Python 2 vs. Python 3 issue, but I'm not familiar enough with Python to really know, and I don't think debugging every new issue that pops up will work out. Any suggestions or easily identifiable fixes you can see?
After commit efbc4e1f27f82727f4f25d8725ffe16604cb6b76
I have problems with importing the library.
pip3 install tad
import tad
>ModuleNotFoundError: No module named tad
Hi, I was wondering if it is possible to update this algorithm for a seconds level of granularity?
There are several spots where Pandas DataFrame operations can be parallelized with Dask
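As a stand-in sketch for the Dask idea using only the standard library: partition the Series and run independent per-chunk work in parallel. The assumption is that the operation has no cross-chunk dependency, which is exactly what the sequential detect_anoms loop lacks.

```python
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def abs_dev_from_own_median(chunk: pd.Series) -> pd.Series:
    # Independent per-partition work: absolute deviation from the
    # partition's own median (no data shared between chunks).
    return (chunk - chunk.median()).abs()

s = pd.Series(np.arange(12, dtype=float))
chunks = np.array_split(s, 4)  # 4 partitions, Dask-style

with ThreadPoolExecutor(max_workers=4) as ex:
    parts = list(ex.map(abs_dev_from_own_median, chunks))

result = pd.concat(parts)
print(result.size)  # 12
```

With Dask itself, `dask.dataframe.from_pandas(..., npartitions=4)` plays the role of `np.array_split`, and the scheduler replaces the executor.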
The period-selection logic is set for increments greater than seconds but is missing for ms and sec. We need to enable period selection in the cases where resampling is performed for sec and ms granularity.
Getting this error in the context of TS anomaly detection on a metricbeat log stream where the timestamps are all from the same day:
results.append(anomaly_detect_ts(pd.Series(item), **self._generateParams()))
File "build/bdist.linux-x86_64/egg/anomaly_detection/anomaly_detect_ts.py", line 207, in anomaly_detect_ts
UnboundLocalError: local variable 'period' referenced before assignment
I can see in the code why this is happening: I am hitting the else clause with granularity of ms, and period is never initialized. Logging shows why:
2018-09-19 19:25:49.858000 (data.index[1]) 2018-09-19 19:25:49.180000 (data.index[0])
So from a pure computer science perspective, it's clear what's happening.
timediff = data.index[1] - data.index[0]
if timediff.days > 0:
    num_days_per_line = 7
    only_last = 'day' if only_last == 'hr' else only_last
    period = 7
    granularity = 'day'
elif timediff.seconds / 60 / 60 >= 1:
    granularity = 'hr'
    period = 24
elif timediff.seconds / 60 >= 1:
    granularity = 'min'
    period = 1440
elif timediff.seconds > 0:
    granularity = 'sec'
    # Aggregate data to minutely if secondly
    data = data.resample('60s', label='right').sum()
else:
    granularity = 'ms'
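A sketch of one way to close the gap (an assumption, not the library's current behavior): aggregate sub-minute data to minutes and always assign a period, via a hypothetical helper `choose_period` mirroring the branch structure above.

```python
import numpy as np
import pandas as pd

def choose_period(data: pd.Series):
    # Mirrors the granularity branches, but 'sec' and 'ms' now also get a
    # period (1440 = minutes per day) after resampling to minutes.
    timediff = data.index[1] - data.index[0]
    if timediff.days > 0:
        return data, 'day', 7
    if timediff.seconds >= 3600:
        return data, 'hr', 24
    if timediff.seconds >= 60:
        return data, 'min', 1440
    granularity = 'sec' if timediff.seconds > 0 else 'ms'
    data = data.resample('60s', label='right').sum()
    return data, granularity, 1440

idx = pd.date_range('2018-09-19 19:25:49', periods=600, freq='s')
s = pd.Series(np.ones(600), index=idx)
s2, granularity, period = choose_period(s)
print(granularity, period)  # sec 1440
```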
The questions:
Should I change the sampling so that anomaly_detect_ts gets data points at intervals of 1 min or more, or should I consider anomaly_detect_vec for granularity of sec?
Any guidance is very much appreciated, thanks!
--John