anomalydetection's Introduction

AnomalyDetection R package

AnomalyDetection is an open-source R package for detecting anomalies. It is robust, from a statistical standpoint, in the presence of seasonality and an underlying trend. The package can be used in a wide variety of contexts: for example, detecting anomalies in system metrics after a new software release or in user engagement after an A/B test, or for problems in econometrics, financial engineering, and the political and social sciences.

How the package works

The underlying algorithm, referred to as Seasonal Hybrid ESD (S-H-ESD), builds upon the Generalized ESD test for detecting anomalies. S-H-ESD can detect both global and local anomalies. This is achieved by employing time series decomposition and using robust statistical metrics, viz., median together with ESD. In addition, for long time series (say, 6 months of minutely data), the algorithm employs piecewise approximation for anomaly detection; this is rooted in the fact that trend extraction in the presence of anomalies is non-trivial.
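To make the idea concrete, here is a minimal base-R sketch of an S-H-ESD-style procedure. It is not the package's implementation: the function name shesd_sketch, the use of stats::stl for the seasonal component, and the MAD as the robust scale are assumptions made for illustration, and piecewise approximation is left out.

```r
# Illustrative S-H-ESD-style sketch (hypothetical helper, not the package's code).
shesd_sketch <- function(x, period, max_anoms = 0.02, alpha = 0.05) {
  n <- length(x)
  stopifnot(period >= 2, n > 2 * period)

  # Remove the seasonal component (STL) and the median (in place of a trend).
  seasonal <- stl(ts(x, frequency = period), s.window = "periodic")$time.series[, "seasonal"]
  resid <- as.numeric(x - seasonal - median(x))

  k <- max(1L, floor(max_anoms * n))  # upper bound on the number of anomalies
  idx <- seq_len(n)                   # indices still under consideration
  removed <- integer(0)
  num_anoms <- 0L

  for (i in seq_len(k)) {
    scale <- mad(resid[idx])
    if (scale == 0) break             # no spread left, nothing more to test
    dev <- abs(resid[idx] - median(resid[idx])) / scale
    j <- which.max(dev)
    R <- dev[j]

    # Generalized ESD critical value, with m points remaining before removal.
    m <- length(idx)
    t_crit <- qt(1 - alpha / (2 * m), df = m - 2)
    lambda <- t_crit * (m - 1) / sqrt((m - 2 + t_crit^2) * m)

    removed <- c(removed, idx[j])
    idx <- idx[-j]
    if (R > lambda) num_anoms <- i
  }
  removed[seq_len(num_anoms)]         # empty when nothing exceeds lambda
}

# Synthetic series: a 24-point cycle, small deterministic noise, one spike.
x <- 100 + 10 * sin(2 * pi * (1:240) / 24) + ((1:240 * 7) %% 5 - 2)
x[100] <- 200
anoms <- shesd_sketch(x, period = 24)
```

The returned vector holds the positions flagged as anomalous, in the order in which they were removed.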

Besides time series, the package can also be used to detect anomalies in a vector of numerical values. We have found this very useful as many times the corresponding timestamps are not available. The package provides rich visualization support. The user can specify the direction of anomalies, the window of interest (such as last day, last hour), enable/disable piecewise approximation; additionally, the x- and y-axis are annotated in a way to assist visual data analysis.

How to get started

Install the R package using the following commands on the R console:

install.packages("devtools")
devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

Call AnomalyDetectionTs to detect one or more statistically significant anomalies in an input time series. Its documentation, which details the input arguments and the output, can be viewed with the following command:

help(AnomalyDetectionTs)

Call AnomalyDetectionVec to detect one or more statistically significant anomalies in a vector of observations. Its documentation, which details the input arguments and the output, can be viewed with the following command:

help(AnomalyDetectionVec)

A simple example

To get started, we recommend using the example dataset that comes with the package. Execute the following commands:

data(raw_data)
res = AnomalyDetectionTs(raw_data, max_anoms=0.02, direction='both', plot=TRUE)
res$plot

Fig 1

From the plot, we observe that the input time series experiences both positive and negative anomalies. Furthermore, many of the anomalies in the time series are local anomalies within the bounds of the time series' seasonality (and hence cannot be detected using traditional approaches). The anomalies detected using the proposed technique are annotated on the plot. If the timestamps for the plot above were not available, anomaly detection could then be carried out using the AnomalyDetectionVec function; specifically, one can use the following command:

AnomalyDetectionVec(raw_data[,2], max_anoms=0.02, period=1440, direction='both', only_last=FALSE, plot=TRUE)

Often, anomaly detection is carried out on a periodic basis. For instance, at times, one may be interested in determining whether there was any anomaly yesterday. To this end, we support a flag only_last whereby one can subset the anomalies that occurred during the last day or last hour. Execute the following command:

res = AnomalyDetectionTs(raw_data, max_anoms=0.02, direction='both', only_last="day", plot=TRUE)
res$plot

Fig 2

From the plot, we observe that only the anomalies that occurred during the last day have been annotated. Further, the prior six days are included to expose the seasonal nature of the time series but are put in the background as the window of prime interest is the last day.

Anomaly detection for long-duration time series can be carried out by setting the longterm argument to TRUE.

Copyright and License

Copyright 2015 Twitter, Inc and other contributors

Licensed under the GPLv3

anomalydetection's People

Contributors

adamatw, akejariwal, caniszczyk, cozos, darrkj, gsee, terrytangyuan, wrathematics

anomalydetection's Issues

AnomalyDetection Installation Failed

Hey, I've had trouble installing AnomalyDetection; the output it returns is below. R version 3.1.2, Windows 7 Service Pack 1.

Code is below

devtools::install_github('twitter/AnomalyDetection')
Downloading github repo twitter/AnomalyDetection@master
Installing AnomalyDetection
"C:/R/R-3.1.2/bin/x64/R" --vanilla CMD INSTALL "C:\Users\Colin Glaes\AppData\Local\Temp\RtmpWsGLMv\devtools2464d8b3899\twitter-AnomalyDetection-4eb1baf"
--library="C:/R/R-3.1.2/library" --install-tests

  • installing source package 'AnomalyDetection' ...
    ** R
    ** data
    *** moving datasets to lazyload DB
    ** tests
    ** preparing package for lazy loading
    ** help
    *** installing help indices
    ** building package indices
    ** testing if installed package can be loaded
    *** arch - i386
    ARGUMENT 'Glaes\AppData\Local\Temp\Rtmpsnxpqp\Rin363c5c64245' ignored

Error: object 'ÿþ' not found
Execution halted
*** arch - x64
ARGUMENT 'Glaes\AppData\Local\Temp\Rtmpsnxpqp\Rin363c574477c2' ignored

Error: object 'ÿþ' not found
Execution halted
ERROR: loading failed for 'i386', 'x64'

  • removing 'C:/R/R-3.1.2/library/AnomalyDetection'
    Error: Command failed (1)

High frequency data sets anomaly detection

I am trying to perform anomaly detection on a data set with a very high frequency (more than 5/10 rows per second) whose timestamps are not consecutive (sometimes there is no row for a second).

Example: 09:23:59 2014-12-19 09:23:59 2014-12-19 09:24:00 2014-12-19 09:24:00 2014-12-19 09:24:02 2014-12-19 09:24:02 2014-12-19 09:24:02

I understand that I should use AnomalyDetectionTs to perform the detection on this type of set.

My set has 50K rows, but the function cannot compute the detection and crashes. Maybe this is also because the time series is not evenly spaced (the gap is sometimes 1 second, sometimes 0, or even 2 seconds)?

What are your recommendations to work with this type of dataset ?

Thanks,

Flow

data.frame Column Error

I created a data.frame called foo and attempted to format it exactly like raw_data, but when I set res, I get an error.

My data.frame:

head(foo)
timestamp count
1 2015-05-11 13:54:00 42748.0
2 2015-05-11 13:55:00 44152.0
3 2015-05-11 13:56:00 43642.0
4 2015-05-11 13:57:00 42544.0
5 2015-05-11 13:58:00 41627.0
6 2015-05-11 13:59:00 42138.0

Setting res, getting an error:

res = AnomalyDetectionTs(foo, max_anoms=0.02, direction='both', plot=TRUE)
Error in AnomalyDetectionTs(foo, max_anoms = 0.02, direction = "both", :
data must be a 2 column data.frame, with the first column being a set of timestamps, and the second coloumn being numeric values.

raw_data looks quite like foo:

head(raw_data)
timestamp count
1 1980-09-25 14:01:00 182.478
2 1980-09-25 14:02:00 176.231
3 1980-09-25 14:03:00 183.917
4 1980-09-25 14:04:00 177.798
5 1980-09-25 14:05:00 165.469
6 1980-09-25 14:06:00 181.878

Any idea what I'm doing wrong?

Thanks,

Steve

Documentation to describe all dependencies needed to install the R package

I am on a private network disconnected from the internet and I would like to install the R AnomalyDetection package. Installing locally on my laptop from the internet seems to pull in a bunch of other packages. It would be really useful if there was documentation on the exact packages I would need to transfer in order to install.

I'm also new to R so maybe it's possible there's some equivalent in 'install.packages()' similar to maven's "copy-dependencies" where I can put everything in a folder and tar it up.
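One possible approach, sketched under assumptions: base R's tools::package_dependencies() can resolve a recursive dependency list against a package database such as the one returned by available.packages(), and download.packages() can then fetch the files on a connected machine. The helper name offline_package_set and the toy database with made-up package names are hypothetical; the toy database only serves to let the sketch run without network access.

```r
# Resolve the full set of packages to copy for an offline install.
# `db` is a package database, e.g. the matrix returned by available.packages().
offline_package_set <- function(pkgs, db) {
  deps <- tools::package_dependencies(
    pkgs, db = db,
    which = c("Depends", "Imports", "LinkingTo"),
    recursive = TRUE
  )
  sort(unique(c(pkgs, unlist(deps, use.names = FALSE))))
}

# Toy database with made-up package names, so the sketch runs offline.
toy_db <- cbind(
  Package   = c("pkgA", "pkgB", "pkgC"),
  Depends   = c("pkgB", NA, NA),
  Imports   = c(NA, "pkgC (>= 1.0)", NA),
  LinkingTo = c(NA, NA, NA)
)
needed <- offline_package_set("pkgA", toy_db)

# On a connected machine (not run here):
# db <- available.packages()
# needed <- offline_package_set(c("devtools"), db)
# download.packages(needed, destdir = "pkgs", type = "source")
```

With a real database the result may also list packages that ship with R itself; those do not need to be transferred.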

Consider changing 'plot.new()' to 'NULL' in the output when 'plot = FALSE'

Line 287 - 291 of vec_anom_detection.R (and similarly in ts) is:

 if(plot){
    return (list(anoms = anoms, plot = xgraph))
  } else {
    return (list(anoms = anoms, plot = plot.new()))
  }

Consider changing the plot = plot.new() to plot = NULL or removing it altogether. When using in a non interactive environment (such as a Shiny app), this can run into issues with plot devices since 'plot.new()' interactively builds a plot.

The current workaround is to set plot = TRUE in the function call and then just never use the plot in your code but it would be cleaner to change this in the code directly.
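A minimal sketch of the suggested return shape, assuming a hypothetical helper build_result in place of the real function body. Note that list() keeps an element whose value is NULL, so callers can still test res$plot.

```r
# Hypothetical helper illustrating the proposed change; not the package's code.
build_result <- function(anoms, plot = FALSE, xgraph = NULL) {
  if (plot) {
    list(anoms = anoms, plot = xgraph)
  } else {
    # NULL does not touch any graphics device, unlike plot.new()
    list(anoms = anoms, plot = NULL)
  }
}

res <- build_result(data.frame(index = 3L, anoms = 42), plot = FALSE)
is.null(res$plot)
```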

AnomalyDetectionTs drops timezone from POSIXct objects and converts POSIXlt to POSIXct

AnomalyDetectionTs should keep the timestamp column of the output dataset as-is, rather than converting to POSIXct and dropping the timezone attribute:

library(AnomalyDetection)
data(raw_data)
data <- raw_data
data$timestamp <- as.POSIXct(data$timestamp)
attr(data$timestamp, "tzone") 
attr(data$timestamp, "tzone") <- "America/New_York"

res = AnomalyDetectionTs(data, max_anoms=0.002, direction='both', plot=FALSE)
attr(res$anoms$timestamp, 'tzone')

Dropping the timezone is problematic if you wish to merge the anomalies back to the main dataset:

> merge(data, res$anoms, by='timestamp')
             timestamp   count    anoms
1  1980-09-28 22:40:00 114.308 193.1036
2  1980-09-30 12:26:00 130.222 180.8990
3  1980-09-30 12:30:00 126.721 178.8220
4  1980-09-30 12:31:00 152.956 198.3260
5  1980-09-30 12:32:00 136.004 203.9010
6  1980-09-30 12:33:00 134.589 200.3090
7  1980-09-30 12:34:00 122.490 178.4910
8  1980-09-30 12:36:00 126.806 183.0180
9  1980-09-30 12:38:00 117.334 186.8230
10 1980-09-30 12:39:00 121.061 183.6600
11 1980-09-30 12:40:00 116.924 179.2760
12 1980-09-30 12:41:00 129.097 197.2830
13 1980-09-30 12:42:00 119.566 191.0970
14 1980-09-30 12:43:00 137.694 194.6700
15 1980-09-30 12:46:00 136.876 200.8160
16 1980-09-30 12:47:00 125.126 186.2350
17 1980-09-30 12:48:00 122.008 185.4210
18 1980-09-30 12:49:00 127.935 178.9580
19 1980-09-30 12:51:00 138.159 203.2310
20 1980-09-30 12:52:00 130.939 181.3540
21 1980-09-30 12:53:00 122.351 186.7780
22 1980-09-30 12:55:00 121.120 176.1250
23 1980-09-30 12:56:00 122.707 181.5140
24 1980-09-30 12:57:00 118.378 175.2610
25 1980-10-05 05:18:00 101.332  40.0000
26 1980-10-05 05:28:00 103.798 250.0000
27 1980-10-05 05:38:00 100.839  40.0000

Using AnomalyDetection in parallel or in any forked environment fails

Using AnomalyDetection in parallel across a data.frame currently fails with the following error:

Error in (function (display = \"\", width, height, pointsize, gamma, bg,  :  
    a forked child should not open a graphics device

Here is a trivial example to reproduce the problem:

library(parallel)
library(AnomalyDetection)
mclapply(as.data.frame(ts.union(BJsales, BJsales.lead)), AnomalyDetectionVec, period = 5)

Which produces the above errors.

Cannot detect anomaly with custom dataset

I keep getting the error below when trying to detect anomalies in a custom data set that contains only a list of <timestamp, integer> pairs:
res = AnomalyDetectionTs(data, max_anoms=0.1, direction='both', plot=TRUE)
Error in detect_anoms(all_data[[i]], k = max_anoms, alpha = alpha, num_obs_per_period = period, : must supply period length for time series deomosition

below is the first few lines of my data set

                     date   size
1     2014-11-09 03:39:31  19512
2     2014-11-09 03:42:20   5308
3     2014-11-09 03:46:14      0
4     2014-11-09 03:46:15   5270
5     2014-11-09 03:50:19    822
6     2014-11-09 03:52:58   5319
7     2014-11-09 03:53:23   5379
8     2014-11-09 03:53:23    266
9     2014-11-09 03:53:23     21
10    2014-11-09 03:53:23   7199
11    2014-11-09 03:53:23  15414
12    2014-11-09 03:53:23  95786
13    2014-11-09 03:53:24  12417
14    2014-11-09 03:53:26  29156
15    2014-11-09 03:53:27    462
16    2014-11-09 04:00:28      0
17    2014-11-09 04:00:29   5270
18    2014-11-09 04:01:54  51491
19    2014-11-09 04:02:05   5326
20    2014-11-09 04:06:10  47288

Error in data.frame

I am getting following error message:
Error in data.frame(timestamp = all_anoms[[1]], anoms = all_anoms[[2]], :
arguments imply differing number of rows: 1, 0

Data looks like this:
1 2014-12-28 00:00:00 46.25243
2 2014-12-28 01:00:00 43.16433
3 2014-12-28 02:00:00 40.06927
4 2014-12-28 03:00:00 39.27673
5 2014-12-28 04:00:00 40.28478
6 2014-12-28 05:00:00 47.17522
7 2014-12-28 06:00:00 56.34756
8 2014-12-28 07:00:00 66.45515

and method call is like this:
AnomalyDetectionTs(data, max_anoms=0.05, threshold = "None", direction='both', plot=FALSE, only_last = "day", e_value = TRUE)

Error when granularity is daily and only_last is null

There seems to be a bug for data with daily granularity. If gran == day, AnomalyDetectionTs does a check:

if(only_last == 'hr')

However, only_last can also be null. If it is, this check generates an error which keeps AnomalyDetectionTs from finishing:

Error in if (only_last == "hr") { : argument is of length zero

I submitted a pull request which fixes the problem by checking only_last for null first.
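The null-first check can be sketched as follows; is_last_hour is a hypothetical helper used only to isolate the condition.

```r
# `NULL == "hr"` yields logical(0), which if() rejects with
# "argument is of length zero". Testing is.null() first avoids that,
# because && short-circuits before the comparison is evaluated.
is_last_hour <- function(only_last = NULL) {
  !is.null(only_last) && only_last == "hr"
}

is_last_hour(NULL)
is_last_hour("hr")
is_last_hour("day")
```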

Also, thanks for a super package.

No more auto-print of results if "No anomalies detected."

Suppose I have a twelve-month-periodic vector vec, for which there are no anomalies. The following should not happen automatically, but currently it does.

anomaly_detection_obj <- AnomalyDetectionVec(vec, period = 12, threshold = "p95")
# [1] "No amomalies detected."

At the very least, whether to auto-print this message should be an argument in the function.

detect_anoms erroneously reports at least one anomaly, regardless of data

I think I've found a bug in detect_anoms.

Before the main loop, num_anoms is initialized to 0.

At the end of each iteration, you update num_anoms if R is greater than lambda.

Then after the loop, you return R_idx[1L:num_anoms].

So if no elements made R exceed lambda, the return value works out to R_idx[1L:0L]. But this range subscript gives you the first element, not an empty vector:

> foo = c(4,5,6,7)
> foo[1:0]
[1] 4
>

So won't it always report the most extreme value as an outlier, no matter what data you give it? (Of course the user won't see this if they've set a threshold in AnomalyDetection, but they might not do that...)
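A small illustration of the indexing behaviour and one possible fix, using seq_len(), which returns an empty index when the count is zero:

```r
foo <- c(4, 5, 6, 7)
num_anoms <- 0L

# 1L:0L counts down to c(1, 0); index 0 is dropped, so one element remains.
wrong <- foo[1L:num_anoms]

# seq_len(0) is integer(0), so the result is an empty vector.
right <- foo[seq_len(num_anoms)]
```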

Plot is empty if any point is <= 0 and y_log is used

blank_anomalylog
To reproduce:

data(raw_data)
raw_data[4000, "count"] <- 0
AnomalyDetectionTs(raw_data, y_log = F, plot = T)
AnomalyDetectionTs(raw_data, y_log = T, plot = T)

This seems to be a problem with coord_trans, as transforming the data to be plotted (after anomaly detection) and skipping add_formatted_y (it can't handle the presence of Na/NaN/Inf) will show all points with the undefined value set to 0.
anomalylog

Alternatively, if the transformation is only for ease of visualization,

AnomalyDetectionTs(raw_data, y_log = F, plot = T)$plot + scale_y_log10()

will transform the axis without changing the value of the data, which might be preferable. It will overwrite add_formatted_y, though.

I haven't submitted a fix since I'm not sure if this is desired behaviour or not, but I'd be happy to if it would be useful. And thank you for the awesome package!

Error Messages from AnomalyDetection: R_idx <- data[[1]][temp_max_id] and if(R > lam)

Hey guys,

I've run some numbers through the library and run into a few issues.

When I was running a large number of datasets through the package, I kept getting the error message below, which references line 89 of "detect_anoms.R":

Error in R_idx[i] <- data[[1]][temp_max_idx] : replacement has length zero

I noticed that the datasets which tripped the error were near-constant time series with a low number of unique values (e.g., constant 0's for 1 month, then constant 1's for 2 months), so I required a minimum number of unique values per dataset to get around it (I started at one and went up to nine). This allowed more datasets to get through, but I eventually ran into the error message below, referencing line 101, when I set the minimum at 9.

Error in if (R > lam) num_anoms <- i :
missing value where TRUE/FALSE needed

I successfully ran all my datasets after setting the minimum number of unique values to 10; however, I would like to know whether it is possible to run the package without this threshold.

Thanks!

Period not set if granularity is "sec"

Hi there,

I'm trying to run an analysis for a time series that has a granularity of 20 seconds. Unfortunately, there is a bug in the code...

In AnomalyDetectionTs, I run into the following problem:
I have tested the get_gran function, and it does return "sec". Later on, the data is aggregated to minutes, but unfortunately the variable gran is never set to "min". When period is defined, the switch statement does not check for "sec" (which is still the value of gran), so period remains null. This makes the call to detect_anoms crash with the error message "must supply period length for time series decomposition".

  # Aggregate data to minutely if secondly
  if(gran == "sec"){
    x <- format_timestamp(aggregate(x[2], format(x[1], "%Y-%m-%d %H:%M:00"), eval(parse(text="sum"))))
  }

  period = switch(gran,
                  min = 1440,
                  hr = 24,
                  # if the data is daily, then we need to bump the period to weekly to get multiple examples
                  day = 7)
  num_obs <- length(x[[2]])

  if(max_anoms < 1/num_obs){
    max_anoms <- 1/num_obs
  }

I'm pretty sure that setting gran <- "min" in the if(gran=="sec") block would fix the problem.
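The effect of the proposed fix can be sketched in isolation. period_for_gran is a hypothetical helper that wraps the same switch statement; it is not part of the package.

```r
# switch() returns NULL when no branch matches, which is what happens
# while gran is still "sec".
unfixed <- switch("sec", min = 1440, hr = 24, day = 7)

# Hypothetical helper with the suggested fix applied.
period_for_gran <- function(gran) {
  if (gran == "sec") {
    gran <- "min"  # the data has been aggregated to minutes at this point
  }
  switch(gran, min = 1440, hr = 24, day = 7)
}

fixed <- period_for_gran("sec")
```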

[Question] What's a good way to choose "longterm_period"?

This is a n00b question. Looking at the article (https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series) and code, it seems that AnomalyDetectionTs requires piecewise_median_period_weeks to be greater than or eq to 2 weeks.

If I have higher frequency data in a shorter timeline, I presume that I should use AnomalyDetectionVec and calibrate longterm_period manually?

What would be a good way to determine an optimal longterm_period and what are the drawbacks of choosing a shorter/longer period? Thanks in advance for your insights!

period problem with AnomalyDetectionTs

Hi everybody,

After successfully running the example, I created my own data set, myData, which has the same structure as raw_data. There are still two differences:

  • It contains missing values in the second column (raw_data has no missing values)
  • The timestamps cover just one day, at 15-second intervals (raw_data has a 5-day history at one-minute intervals)

It looks like:
1 1970-01-01 01:00:55 NA
2 1970-01-01 01:00:10 NA
3 1970-01-01 01:00:25 2.871
4 1970-01-01 01:00:40 2.654
5 1970-01-01 01:00:55 3.060
6 1970-01-01 01:00:10 9.074

after I run the same command like the example:

res = AnomalyDetectionTs(myData, max_anoms=0.02, direction='both', plot=TRUE)

I got the error message:

Error in detect_anoms(all_data[[i]], k = max_anoms, alpha = alpha, num_obs_per_period = period, : must supply period length for time series decomposition

How can I fix this problem?

If I don't know the period, can I still find the anomalies?

Thanks very much for the great work!

Best Regards

Conny

csv import

Hello

I might be asking a dumb question but I tried to use my brain and it failed ;-)
When trying to use the library I import data from a .csv (data_raw <- read.csv(file = file_name, header = FALSE, sep = ",")).
I get a table with two columns, one containing a timestamp and the second a value.

My problem is that whenever I try res = AnomalyDetectionTs(data_raw, max_anoms=0.02, direction='both', plot=TRUE)
or res = AnomalyDetectionVec(data_raw[1], max_anoms=0.02, direction='both', plot=TRUE)
I get Error in Summary.factor(1:339, na.rm = FALSE) : ‘max’ not meaningful for factors
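The Summary.factor message suggests that a column was read in as a factor. A minimal sketch of one way to avoid that, assuming a two-column file with the timestamp first; the column names and the inline sample rows are made up for illustration:

```r
# Made-up sample standing in for the .csv file.
csv_text <- "2015-05-11 13:54:00,42748\n2015-05-11 13:55:00,44152"

data_raw <- read.csv(text = csv_text, header = FALSE, sep = ",",
                     col.names = c("timestamp", "count"),
                     stringsAsFactors = FALSE)  # keep strings as character

# Convert explicitly so the first column is a timestamp and the second numeric.
data_raw$timestamp <- as.POSIXct(data_raw$timestamp, tz = "UTC")
data_raw$count <- as.numeric(data_raw$count)

str(data_raw)
```

With a real file, text = csv_text would be replaced by file = file_name.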

Can't install AnomalyDetection

On OS X 10.9.5
R 3.1.2

library(devtools)
devtools::install_github("twitter/AnomalyDetection")
Downloading github repo twitter/AnomalyDetection@master
Error in function (type, msg, asError = TRUE) :
SSL certificate problem: unable to get local issuer certificate

Error: "data must be a single data frame, list, or vector that holds numeric values" When data is a dataframe in correct format...

Hey everyone.

I am trying to run my dataframe through AnomalyDetectionVec(). My dataframe is a small one, for now, and I believe it is in the correct format.

Here is the dataframe:

> str(es_out)

'data.frame':   500 obs. of  2 variables:
 $ timestamp_list: POSIXct, format: "2015-07-23 04:10:56" "2015-07-23 04:10:51" "2015-07-23 04:11:11" ...
 $ in_bytes_list : int  3893 3893 2335 2319 3893 125 71 71 52 657 ...

When I try to run it through AnomalyDetectionVec(), I get an error:

> AnomalyDetectionVec(es_out, period=500, plot=TRUE, verbose=TRUE)

Error in AnomalyDetectionVec(es_out, period = 500, plot = TRUE, verbose = TRUE) : 
  data must be a single data frame, list, or vector that holds numeric values.

What is going wrong here? I cannot seem to figure it out...

Here is a dput() of my dataset and my dataframe conversion function in a pastebin, for cleanliness.

Dataset dput(): http://pastebin.com/WkY7pvwt
Dataframe conversion function: http://pastebin.com/EsAcVNbV

Any help would be greatly appreciated. As far as I can tell, my dataframe is in the correct format, but I guess it actually isn't.

Thanks!
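One thing that may be worth trying, given that the README example passes a single column (raw_data[,2]) to AnomalyDetectionVec: extract the numeric column instead of passing the whole two-column data frame. The data frame below is a small made-up stand-in with the same column names as es_out.

```r
# Made-up stand-in for es_out with the same column names and types.
es_out <- data.frame(
  timestamp_list = as.POSIXct("2015-07-23 04:10:56", tz = "UTC") + 5 * (0:5),
  in_bytes_list  = c(3893L, 3893L, 2335L, 2319L, 3893L, 125L)
)

# A plain numeric vector rather than a data frame with a timestamp column.
values <- es_out[["in_bytes_list"]]
is.numeric(values)

# `values` could then be passed as the first argument to AnomalyDetectionVec,
# with period set to the number of observations per seasonal cycle.
```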

case when max_outliers = 0

If the user is running AnomalyDetectionTs() we can assume that they are looking for outliers. Therefore, could a warning be thrown if the user sets a percentage (max_anoms) that results in max_outliers being 0?

On a similar note, I think this degenerate case demonstrates that it is safer to iterate over seq_len(max_outliers) rather than 1:max_outliers.
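A sketch of both suggestions, assuming max_outliers is derived by truncating num_obs * max_anoms. The helper names and the warning text are hypothetical.

```r
# Hypothetical helper: derive max_outliers and warn on the degenerate case.
compute_max_outliers <- function(num_obs, max_anoms) {
  max_outliers <- trunc(num_obs * max_anoms)
  if (max_outliers == 0) {
    warning("max_anoms = ", max_anoms, " allows 0 anomalies for ",
            num_obs, " observations")
  }
  max_outliers
}

# Count how many times a for loop would run over the given indices.
count_iterations <- function(indices) {
  n <- 0L
  for (i in indices) n <- n + 1L
  n
}

k <- suppressWarnings(compute_max_outliers(30, 0.02))
colon_runs <- count_iterations(1:k)         # iterates over c(1, 0)
seq_runs   <- count_iterations(seq_len(k))  # iterates over nothing
```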

Setting e_value=T causes "differing number of rows" error

Hi, great package.

However, when trying to extract the expected values from my dataset, I get this error:

## a_data holds daily count observations
> str(a_data)
'data.frame':   30 obs. of  2 variables:
 $ date  : POSIXct, format: "2013-01-15 01:00:00" "2013-01-16 01:00:00" "2013-01-17 01:00:00" ...
 $ metric: num  192 123 196 193 172 195 123 158 103 115 ...

## works
> AnomalyDetectionTs(a_data, max_anoms=0.02, direction='both')
$anoms
   timestamp anoms
1 2013-01-20   195

$plot
NULL

## error
> AnomalyDetectionTs(a_data, max_anoms=0.02, direction='both', e_value = T)
Error in data.frame(timestamp = all_anoms[[1]], anoms = all_anoms[[2]],  : 
  arguments imply differing number of rows: 1, 0

The same command works fine with the demo raw_data in the package

> AnomalyDetectionTs(raw_data, max_anoms=0.02, direction='both', e_value=T)
$anoms
              timestamp    anoms expected_value
1   1980-09-25 16:05:00  21.3510            129
2   1980-09-29 06:40:00 193.1036             97
3   1980-09-29 21:44:00 148.1740             96
...

> str(raw_data)
'data.frame':   14398 obs. of  2 variables:
 $ timestamp: POSIXlt, format: "1980-09-25 14:01:00" "1980-09-25 14:02:00" "1980-09-25 14:03:00" ...
 $ count    : num  182 176 184 178 165 ...

Here is a copy of my data used above (limited to 30 rows). The original data is 900 observations.

                  date (none)
1  2013-01-15 01:00:00    192
2  2013-01-16 01:00:00    123
3  2013-01-17 01:00:00    196
4  2013-01-18 01:00:00    193
5  2013-01-19 01:00:00    172
6  2013-01-20 01:00:00    195
7  2013-01-21 01:00:00    123
8  2013-01-22 01:00:00    158
9  2013-01-23 01:00:00    103
10 2013-01-24 01:00:00    115
11 2013-01-25 01:00:00    138
12 2013-01-26 01:00:00     95
13 2013-01-27 01:00:00    121
14 2013-01-28 01:00:00    143
15 2013-01-29 01:00:00    118
16 2013-01-30 01:00:00    110
17 2013-01-31 01:00:00    107
18 2013-02-01 01:00:00    120
19 2013-02-02 01:00:00     91
20 2013-02-03 01:00:00     93
21 2013-02-04 01:00:00    149
22 2013-02-05 01:00:00    112
23 2013-02-06 01:00:00    109
24 2013-02-07 01:00:00    109
25 2013-02-08 01:00:00     90
26 2013-02-09 01:00:00     74
27 2013-02-10 01:00:00     85
28 2013-02-11 01:00:00    113
29 2013-02-12 01:00:00    107
30 2013-02-13 01:00:00    110

Error while running from Rscript

AnomalyDetectionTs works fine in the R GUI; however, if I run it as a script I get the following error:

Error in initFields(scales = scales) :
could not find function "initRefFields"
Calls: AnomalyDetectionTs ... initialize -> initialize -> -> initFields
Execution halted

The same thing in the script worked fine in the GUI.

using anomalydetection for a time series data package but there is an error i am getting

PFB the dataset: weekly data for a metrics. Want to detect anomalies in this time series. The error I get is : Error in if (data_sigma == 0) break :
missing value where TRUE/FALSE needed
1 2013-01-01 59.94
2 2013-01-08 59.65
3 2013-01-15 61.56
4 2013-01-22 58.37
5 2013-01-29 58.07
6 2013-02-05 57.31
7 2013-02-12 58.53
8 2013-02-19 63.22
9 2013-02-26 60.21
10 2013-03-05 59.09
11 2013-03-12 57.19
12 2013-03-19 55.97
13 2013-03-26 59.96

Error in if (data_sigma == 0) break :

Below is the sample: weekly data for a metrics. Want to detect anomalies in this time series. The error I encounter while running the code is : Error in if (data_sigma == 0) break :
missing value where TRUE/FALSE needed
1 2013-01-01 59.94
2 2013-01-08 59.65
3 2013-01-15 61.56
4 2013-01-22 58.37
5 2013-01-29 58.07
6 2013-02-05 57.31
7 2013-02-12 58.53
8 2013-02-19 63.22
9 2013-02-26 60.21
10 2013-03-05 59.09
11 2013-03-12 57.19
12 2013-03-19 55.97
13 2013-03-26 59.96

License Clarification

Hello,

The DESCRIPTION file licenses this project under Apache 2.0:
https://github.com/twitter/AnomalyDetection/blob/master/DESCRIPTION#L15

However, the LICENSE file indicates that Twitter, Inc. licenses this project under GPLv3:
https://github.com/twitter/AnomalyDetection/blob/master/LICENSE

The README file specifies that Twitter, Inc. and an unspecified list of "other contributors" license this project under GPLv3 as well:
https://github.com/twitter/AnomalyDetection#copyright-and-license

Can you please clarify the license under which this project is released and the copyright owner(s)? Thanks!

Trivial anomalies are NOT detected

x = 1:5000
x[4900:4910] = 3000
AnomalyDetectionVec(x, period=1440, direction = 'both', e_value = T, plot = T)

I get the following disappointing result:
$anoms
data frame with 0 columns and 0 rows

AnomalyDetectionVec Vector needs to be periodic?

I got this error when running AnomalyDetectionVec():

Error in stl(ts(data[[2L]], frequency = num_obs_per_period), s.window = "periodic",  : 
              series is not periodic or has less than two periods

which is triggered from running this stl(ts(c(2,4,5,5,4,3)), s.window = "periodic")

It looks like this vector needs to be periodic? Any workaround for this? Let me know and I might be able to help you make changes in the code.

Thanks,
Yuan
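For context, a sketch of the condition that appears to trigger this message: stats::stl() needs a ts frequency of at least 2 and more than two full periods of data, and ts() defaults to frequency = 1. The helper has_enough_periods is hypothetical and only mirrors that check as I understand it.

```r
# Hypothetical pre-check mirroring the requirement behind the stl() error.
has_enough_periods <- function(n, period) {
  period >= 2 && n > 2 * period
}

x <- c(2, 4, 5, 5, 4, 3)
short_ok <- has_enough_periods(length(x), frequency(ts(x)))  # frequency is 1

# Three repetitions of a 6-point cycle satisfy the requirement.
longer <- rep(x, 3)
long_ok <- has_enough_periods(length(longer), 6)
fit <- stl(ts(longer, frequency = 6), s.window = "periodic")
```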

Anom detection needs at least 2 periods worth of data

str(bar)
'data.frame': 506 obs. of 2 variables:
$ timestamp: POSIXct, format: "2014-08-25 00:00:00" "2014-08-25 00:10:00" ...
$ count : num 40465895 54157589 34727655 38576160 36686470 ...

res = AnomalyDetectionTs(bar, direction='both', max_anoms=0.02, plot=TRUE)
Error in detect_anoms(all_data[[i]], k = max_anoms, alpha = alpha, num_obs_per_period = period, :
Anom detection needs at least 2 periods worth of data

What's the definition of period here? The data contains a time series for about 4 days with granularity of 10 minutes.

Posting the data frame "bar" here
https://www.dropbox.com/s/1j263k6srq18qpp/bar.Rda?dl=0

Encounter Error when do anomaly detection on a constant series

Hi, when I try anomaly detection on a constant series, there is an error. I know it's impossible to find anomalies in that kind of data; I just think it would be better to report "there is no anomaly" than to throw an error.

test <- rep(1,1000)
AnomalyDetectionVec(test, period=14, plot=T, direction='both')
Error in if (R > lam) num_anoms <- i :
missing value where TRUE/FALSE needed

Failed to Install AnomalyDetection

I am on Windows 8. In R 3.1.1 x64 I get the following error:

Warning in library(pkg_name, lib.loc = lib, character.only = TRUE, logical.return = TRUE) :
no library trees found in 'lib.loc'
Error: loading failed
Execution halted
*** arch - x64
Warning in library(pkg_name, lib.loc = lib, character.only = TRUE, logical.return = TRUE) :
no library trees found in 'lib.loc'
Error: loading failed
Execution halted
ERROR: loading failed for 'i386', 'x64'

I tried installing R in cygwin bash and got the following error:

Downloading github repo twitter/AnomalyDetection@master
Installing AnomalyDetection
Error in parse_deps(paste(deps, collapse = ",")) :
Invalid comparison operator in dependency:

(the dependency is left blank)

Is there an optimal way to do point anomaly detection ?

Does anyone know how to optimally use this to check if a given data point is an anomaly or not ? Specifically, the use case is to use 1-3 month, 1minute aggregated dataset as an input and decide if the next 1 minute datapoint is an anomaly or not. I am also interested to see if anyone has adapted this to make it an online anomaly detection engine. Appreciate any pointers. What I am doing right now is to call AnomalyDetectionTs with only_last='hr' for each and every incoming datapoint and it tends to be pretty slow.

Error when running AnomalyDetection with Rscript

Hello,

when running AnomalyDetection inside R (gui or interactive terminal) I have no errors, but when running with Rscript I've got the following error:

Error in .setupMethodsTables(fdef, initialize = TRUE) :
  trying to get slot "group" from an object of a basic class ("NULL") with no slots
Calls: AnomalyDetectionTs ... getMethodsForDispatch -> .getMethodsTable -> .setupMethodsTables
Execution halted

Thus, to fix this I had to include library(methods) in my script.

Although it is running ok with this, it is generating a Rplots.pdf file after each iteration, which may indicate the cause for the above error.

Why the software history was not kept?

Hi there,

I'm a researcher studying software evolution. As part of my current research, I'm studying the implications of open-sourcing proprietary software, for instance, whether the project succeeds in attracting newcomers. AnomalyDetection was on my list. However, I observed that the history from the time the software was developed as proprietary software was not kept after the transition to GitHub.

Knowing that software history is indispensable for developers (e.g., developers need to refer to history several times a day), I would like to ask AnomalyDetection developers the following four brief questions:

  1. Why did you decide not to keep the software history?
  2. Did the core developers face any kind of problems when trying to refer to the old history? If so, how did they solve these problems?
  3. Did the newcomers face any kind of problems when trying to refer to the old history? If so, how did they solve these problems?
  4. How did the lack of history impact software evolution? Did it place any burden on understanding and evolving the software?

Thanks in advance for your collaboration,

Gustavo Pinto, PhD
http://www.gustavopinto.org

Definition of period in AnomalyDetectionVec

Hi,

I am confused by the 'period' parameter in the function AnomalyDetectionVec. I have minute-level data for one day, which is 1440 data records in total. I want to use AnomalyDetectionVec to find anomalies in the dataset. Should I set period = 24 or period = 60? Can someone explain in more detail how the period parameter works in AnomalyDetectionVec?

Thank you
Jim
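
As a rough guide, period is the number of observations in one seasonal cycle; the example quoted elsewhere on this page uses period=1440 for minutely data with a daily pattern. The sketch below works through the arithmetic for the 1440-record case. The helper names obs_per_period and has_enough_data are hypothetical and not part of the package; the "more than two periods" rule is the one enforced by stats::stl(), which the package uses for decomposition.

```r
# Hypothetical helper: number of observations in one seasonal cycle.
obs_per_period <- function(sample_secs, season_secs) {
  stopifnot(sample_secs > 0, season_secs %% sample_secs == 0)
  season_secs / sample_secs
}

obs_per_period(60, 24 * 3600)  # minutely samples, daily cycle: 1440
obs_per_period(60, 3600)       # minutely samples, hourly cycle: 60

# stats::stl() requires strictly more than two full periods of data.
has_enough_data <- function(n_obs, period) n_obs > 2 * period

has_enough_data(1440, 1440)  # FALSE: one day cannot reveal a daily cycle
has_enough_data(1440, 60)    # TRUE: 24 hourly cycles are available
```

Under these assumptions, period = 60 treats each hour as one cycle, while period = 24 would describe a cycle that repeats every 24 minutes, which is unlikely to be what is intended for minutely data.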

Removing leading NA's and subtracting the median

Hi guys,

I came across the package which looks great. I have the following 2 questions on the code in 'detect_anoms.R':

  1. In line 51, any leading NA's are replaced by 1. Shouldn't it be 0 (zero)?

  2. In line 37, the median is subtracted from the data. In lines 72-80, the median is subtracted again. Is this correct?

I don't know the details of 'S-H-ESD' algorithm, so excuse me if I'm wrong!

Thanks!

Failed to install Anomaly Detection

OS X 10.9.2
R version 3.1.1
Here is the error message:
devtools::install_github("twitter/AnomalyDetection")
Downloading github repo twitter/AnomalyDetection@master
Error in download(dest, src, auth) : client error: (403) Forbidden

It seems another person is facing the same problem. Is there any solution?

consistent output for AnomalyDetectionTs()

If there are no anomalies detected then list(anoms = NULL, plot = NULL) is returned. Does this special case need to be made? Instead could a data frame (with 0 rows) and a plot (if plot = TRUE) still be returned for consistency?
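
Until the return value is made consistent, callers could normalize it themselves. The following is a minimal sketch: normalize_result is a hypothetical helper, and the column names timestamp and anoms are an assumption about what a non-empty result looks like. It only normalizes the anoms element; an empty plot cannot be rebuilt without the input data.

```r
# Hypothetical wrapper: always return a data frame in `anoms`.
# Assumption: a non-empty result has columns `timestamp` and `anoms`.
normalize_result <- function(res) {
  if (is.null(res$anoms) || ncol(res$anoms) == 0L) {
    res$anoms <- data.frame(
      timestamp = as.POSIXct(numeric(0), origin = "1970-01-01", tz = "UTC"),
      anoms     = numeric(0)
    )
  }
  res
}

# The special case described above becomes a zero-row data frame.
empty <- normalize_result(list(anoms = NULL, plot = NULL))
nrow(empty$anoms)
names(empty$anoms)
```

Downstream code can then call nrow(), rbind() or merge() on the result without checking for NULL first.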

Error message: Error in if (gran >= 86400) { : missing value where TRUE/FALSE needed

I'm trying to use AnomalyDetectionTs() exactly as described in the example, but with my own data.
When I execute this command:

res = AnomalyDetectionTs(my_data, max_anoms=0.02, direction='both', plot=TRUE)

I get the following error:

Error in if (gran >= 86400) { : missing value where TRUE/FALSE needed

This is what my data looks like:

str(my_data)
'data.frame': 3841 obs. of 2 variables:
$ INFO_DATE : POSIXct, format: "2015-01-11 00:01:21" "2015-01-11 00:20:55" ...
$ QUANTITY : int 5881 9565 11268 12376 12983 13454 13956 14409 15613 21024 ...

Do you have any idea how to solve this problem?
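
The str() output above shows irregularly spaced timestamps (00:01:21, then 00:20:55). One possible cause of the error, which is an assumption and not confirmed in this thread, is that the granularity inferred from the timestamp column comes out as NA, for example because some timestamps are missing. A quick diagnostic along these lines can rule that out; check_timestamps is a hypothetical helper, not a package function.

```r
# Hypothetical diagnostic for a POSIXct timestamp column.
check_timestamps <- function(ts) {
  # Seconds between consecutive points; sort() drops NA values.
  gaps <- diff(sort(as.numeric(ts)))
  list(
    n_na         = sum(is.na(ts)),
    n_duplicated = sum(duplicated(ts[!is.na(ts)])),
    is_sorted    = !is.unsorted(as.numeric(ts), na.rm = TRUE),
    gap_range    = range(gaps)
  )
}

# Toy input modeled on the timestamps shown above, with one NA added.
ts <- as.POSIXct(c("2015-01-11 00:01:21", "2015-01-11 00:20:55", NA,
                   "2015-01-11 00:40:02"), tz = "UTC")
str(check_timestamps(ts))
```

If n_na is non-zero or gap_range is wide, cleaning or aggregating the data to a regular interval before calling AnomalyDetectionTs() may be worth trying.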

Failed to install the AnomalyDetection

OS: Windows
R version: 3.2.2

When I run devtools::install_github("twitter/AnomalyDetection")

the following error is reported:
"Downloading GitHub repo twitter/AnomalyDetection@master
Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached
"
What happened?

thanks
ndoors.

Issue - period length for time series decomposition

Hello team,

I started exploring this package and I am stuck.
I have a data.frame which contains some parameter values captured every 15 minutes, hence 96 records for one day. I have data for 27 days.

I get the below error when I try to run:

> names(a)
[1] "DTime"          "Paramter"


> unique(as.Date(a$DTime))
 [1] "2016-06-27" "2016-06-28" "2016-06-29" "2016-06-30" "2016-06-09"
 [6] "2016-06-10" "2016-06-11" "2016-06-12" "2016-06-13" "2016-06-14"
[11] "2016-06-15" "2016-06-16" "2016-06-17" "2016-06-18" "2016-06-19"
[16] "2016-06-20" "2016-06-21" "2016-06-22" "2016-06-23" "2016-06-24"
[21] "2016-06-25" "2016-06-26" "2016-07-01" "2016-07-02" "2016-07-03"
[26] "2016-07-04" "2016-06-08"

> head(a)
                DTime Paramter
1 2016-06-27 00:00:00          13.03
2 2016-06-27 00:15:00           1.58
3 2016-06-27 00:30:00           1.39
4 2016-06-27 00:45:00           1.61
5 2016-06-27 01:00:00           6.99
6 2016-06-27 01:15:00           1.71

> AnomalyDetectionTs(a,   max_anoms = 0.01)
Error in detect_anoms(all_data[[i]], k = max_anoms, alpha = alpha, num_obs_per_period = period,  :
  must supply period length for time series decomposition

I tried longterm = T but it didn't help. Please let me know how to solve this.
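
The unique dates listed above are out of order (2016-06-27 comes first, 2016-06-08 last), so the rows are not sorted by time. One thing to try, sketched below on toy data, is to sort the frame and work out the number of observations per day explicitly. Whether the unsorted rows are what triggers the error is an assumption; the toy data frame only mimics the shape of the poster's data, and the AnomalyDetectionVec call is left as a comment because it needs the package installed.

```r
# Toy stand-in for the poster's data: 15-minute samples, rows shuffled.
set.seed(1)
times <- seq(as.POSIXct("2016-06-08 00:00:00", tz = "UTC"),
             by = "15 min", length.out = 96 * 27)
a <- data.frame(DTime = times, Paramter = runif(length(times)))
a <- a[sample(nrow(a)), ]

# Sort by time before any time series analysis.
a <- a[order(a$DTime), ]

# Infer the sampling step and the number of observations per day.
step_secs <- median(diff(as.numeric(a$DTime)))
period <- 86400 / step_secs   # 96 for 15-minute data

# With the package installed, the period can then be passed explicitly:
# res <- AnomalyDetectionVec(a$Paramter, max_anoms = 0.01,
#                            period = period, direction = "both")
```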

Short time series error

Hello there,

I'm trying to apply the anomaly detection function to my time series, which consists of only 60 observations and has 2 periods.

This is how I set up in the first place:

res <- AnomalyDetectionVec(data$value, max_anoms=0.4, period=29, 
                                           direction='neg', only_last=FALSE, 
                                           plot=TRUE)

# This is the output
$anoms
data frame with 0 columns and 0 rows
$plot
NULL

I don't know how, but I was able to run it the first time. Now I can't even get the graph, and all the output variables are NULL.
I already checked whether the object's class was compatible, and it matches the class of the dataset used in the example ("raw_data").

This is the data:
vector_data.txt

Diego Della Justina, PhD

Issues using daily data with the "long_term" option

I'm not sure that this package was meant to be used on daily data, as Twitter seems to be using it for very granular minutely data. But anyways, here are the issues I've encountered

Data Set: Daily timestamp/count pairs for the past two years (so around 730 rows)

With "long_term=true" and daily data (therefore "gran=day" "period = 7"), AnomalyDetectionTs will split the dataset into two week periods of 14 rows for each day. (ts_anom_detection.R, lines 168-177)

This causes two issues:

  1. detect_anoms is passed a dataset of 14 rows and num_obs_per_period of 7, which causes the STL function to throw the error "stl : series is not periodic or has less than two periods"

    stl(ts(data[[2]], frequency=num_obs_per_period), s.window="periodic", robust=TRUE)
    (detect_anoms.R, line 33)

    I think this happens for one of two reasons. One, the STL function needs the dataset to have at least 2*frequency + 1 observations, which is a given for minutely/hourly data in a two-week period, but not for daily data (14 days in two weeks). Two, it could happen when the last two-week subset is shorter than two weeks. For example, 53 weeks of data with long_term enabled will create 26 2-week intervals and 1 1-week interval - the last 1-week interval will throw "series is not periodic or has less than two periods" when passed into STL.

  2. max_anoms on two-week intervals of daily data will always end up being 0 (0.02 * 14 days = 0), unless you have a very large max_anoms. Two week periods are probably too small for daily data.
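
Both points can be reproduced with base R alone. The sketch below uses random toy data rather than the poster's dataset. The stl() behavior is that of the stats package; the trunc() call is an assumption about how the anomaly budget is rounded, chosen because it matches the "0.02 * 14 days = 0" arithmetic above.

```r
set.seed(42)
two_weeks   <- ts(rnorm(14), frequency = 7)
three_weeks <- ts(rnorm(21), frequency = 7)

# 14 observations at frequency 7 is exactly two periods, which stl() rejects.
msg <- tryCatch(
  {
    stl(two_weeks, s.window = "periodic", robust = TRUE)
    "ok"
  },
  error = function(e) conditionMessage(e)
)
msg

# Three full weeks are enough for the decomposition to run.
fit <- stl(three_weeks, s.window = "periodic", robust = TRUE)

# Assumption: the anomaly budget is truncated to a whole number of points.
trunc(0.02 * 14)   # budget for a two-week window of daily data
trunc(0.02 * 730)  # budget for the full two years
```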

Apologies if the expectation was to fix the issues and create a pull-request :), I'm not sure if the S-H-ESD is meant to be used on daily data.

-Arwin from Adroll

Suggestion: Identify and Remove Linear Trend Along with Seasonal Component

The generalized ESD method normalizes deviation from the mean based on an estimate of the population variance. If the data has an uncompensated, appreciable linear trend this is equivalent to estimating the noise in the data to be much higher than true noise in the signal and many outlying data points will be removed.

This package uses stl from the R stats library to remove the seasonal component, and identifies the trend in the data, but it doesn't remove the trend before doing the ESD analysis. My suggestion is to just use the remainder column of data_decomp for ESD analysis (optionally subtracting the median).

From https://github.com/twitter/AnomalyDetection/blob/master/R/detect_anoms.R

# -- Step 1: Decompose data. This returns a univarite remainder which will be used for anomaly detection. Optionally, we might NOT decompose.
    data_decomp <- stl(ts(data[[2L]], frequency = num_obs_per_period),
                       s.window = "periodic", robust = TRUE)

    # Remove the seasonal component, and the median of the data to create the univariate remainder
    data <- data.frame(timestamp = data[[1L]], count = (data[[2L]]-data_decomp$time.series[,"seasonal"]-median(data[[2L]])))

Here is a trivial example of the kind of issue this can cause:
Run the example
AnomalyDetectionVec(raw_data[,2], max_anoms=0.02, period=1440, direction='both', plot=TRUE)
rplot1
Add a linear trend and run again
new_data = raw_data + 0.01*(1:14398)
AnomalyDetectionVec(new_data[,2], max_anoms=0.02, period=1440, direction='both', plot=TRUE)
rplot2
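
The effect can also be sketched without the package, using only stats::stl() on a synthetic series. This is an illustration under stated assumptions, not the package's code path: the series, its period of 24 and the trend slope are made up, and the spread is summarized with sd() for simplicity.

```r
set.seed(7)
period <- 24
t <- seq_len(period * 10)
# Synthetic series: seasonal cycle + linear trend + unit-variance noise.
x <- 10 * sin(2 * pi * t / period) + 0.5 * t + rnorm(length(t))

fit <- stl(ts(x, frequency = period), s.window = "periodic", robust = TRUE)
parts <- fit$time.series

# Approach quoted above: remove the seasonal component and the median only.
seasonal_only <- x - as.numeric(parts[, "seasonal"]) - median(x)

# Suggested alternative: use the STL remainder, which also excludes the trend.
remainder <- as.numeric(parts[, "remainder"])

# The leftover trend inflates the spread that the test statistic is scaled by.
c(seasonal_only = sd(seasonal_only), remainder = sd(remainder))
```

With the trend left in, the estimated spread is dominated by the slope rather than by the noise, which is the mechanism described at the start of this suggestion.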
