stream-ad / midas

Anomaly detection on dynamic (time-evolving) graphs in a real-time, streaming manner. Detects intrusions (DoS and DDoS attacks), fraud, and fake-rating anomalies.

License: Apache License 2.0

C++ 83.26% Python 12.05% CMake 3.77% Dockerfile 0.92%
anomaly-detection fraud-detection denial-of-service intrusion-detection aaai2020

midas's Introduction

MIDAS

C++ implementation of

The old implementation lives in the OldImplementation branch. Consider it archived; it is unlikely to receive feature updates.

Features

  • Finds Anomalies in Dynamic/Time-Evolving Graphs (Intrusion Detection, Fake Ratings, Financial Fraud)
  • Detects Microcluster Anomalies (suddenly arriving groups of suspiciously similar edges e.g. DoS attack)
  • Theoretical Guarantees on False Positive Probability
  • Constant Memory (independent of graph size)
  • Constant Update Time (real-time anomaly detection to minimize harm)
  • Up to 55% more accurate and 929 times faster than the state of the art approaches
  • Experiments are performed using the following datasets:

Demo

If you use Windows:

  1. Open a Visual Studio developer command prompt (we need its toolchain)
  2. cd to the project root MIDAS/
  3. cmake -DCMAKE_BUILD_TYPE=Release -GNinja -S . -B build/release
  4. cmake --build build/release --target Demo
  5. cd to MIDAS/build/release/
  6. .\Demo.exe

If you use Linux/macOS:

  1. Open a terminal
  2. cd to the project root MIDAS/
  3. cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/release
  4. cmake --build build/release --target Demo
  5. cd to MIDAS/build/release/
  6. ./Demo

The demo runs on MIDAS/data/DARPA/darpa_processed.csv, which has 4.5M records, with the filtering core (MIDAS-F).

The scores are exported to MIDAS/temp/Score.txt; a higher score means a more anomalous record.
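
As a rough sketch of how the exported scores might be consumed, the C++11 snippet below reads them back and picks the k records with the highest scores. It assumes Score.txt holds one numeric score per line, in the same order as the data file. The names readScores and topK are illustrative and not part of MIDAS.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <numeric>
#include <vector>

// Read whitespace-separated scores (assumed layout: one score per line).
std::vector<float> readScores(std::istream& in) {
    std::vector<float> scores;
    float score;
    while (in >> score) scores.push_back(score);
    return scores;
}

// Indices of the k highest scores, most anomalous first.
// Ties are broken by record order so the result is deterministic.
std::vector<std::size_t> topK(const std::vector<float>& scores, std::size_t k) {
    assert(k <= scores.size());  // cannot return more records than exist
    std::vector<std::size_t> idx(scores.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k), idx.end(),
                      [&scores](std::size_t a, std::size_t b) {
                          return scores[a] > scores[b] ||
                                 (scores[a] == scores[b] && a < b);
                      });
    idx.resize(k);
    return idx;
}
```

Opening the score file with std::ifstream and passing it to readScores would give the input for topK. The returned indices are zero-based positions in the data file.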

All file paths are absolute and "hardcoded" by CMake. Even so, running the executable by double-clicking it is NOT recommended.

Requirements

Core

  • C++11
  • C++ standard libraries

Demo (with the experimental ROC-AUC implementation)

  • C++ standard libraries

Demo (with the sklearn ROC-AUC implementation)

  • Python 3 (MIDAS/util/EvaluateScore.py)
    • pandas: I/O
    • scikit-learn: Compute ROC-AUC

Experiment

  • (Optional) Intel TBB: Parallelization
  • (Optional) OpenMP: Parallelization

Other python utility scripts

  • Python 3
    • pandas
    • scikit-learn

Customization

Switch to sklearn ROC-AUC Implementation

In MIDAS/example/Demo.cpp:
Comment out the section "Evaluate scores (experimental)".
Uncomment the sections "Write output scores" and "Evaluate scores".

Different CMS Size / Decay Factor / Threshold

These are arguments of the cores' constructors, found at MIDAS/example/Demo.cpp:67-69.

Switch Cores

Cores are instantiated at MIDAS/example/Demo.cpp:67-69. Uncomment the one you want to use.

Custom Dataset + Demo.cpp

You need to prepare three files:

  • Meta file
    • Only includes an integer N, the number of records in the dataset
    • Use its path for pathMeta
    • E.g. MIDAS/data/DARPA/darpa_shape.txt
  • Data file
    • A header-less csv format file of shape [N,3]
    • Columns are sources, destinations, timestamps
    • Use its path for pathData
    • E.g. MIDAS/data/DARPA/darpa_processed.csv
  • Label file
    • A header-less csv format file of shape [N,1]
    • The corresponding label for data records
      • 0 means normal record
      • 1 means anomalous record
    • Use its path for pathGroundTruth
    • E.g. MIDAS/data/DARPA/darpa_ground_truth.csv
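
To make the three layouts concrete, here is a minimal C++11 sketch that renders the contents of each file from in-memory records. The Record struct and the function names are hypothetical helpers, not part of MIDAS, and the integer field types are an assumption. Writing the returned strings to the paths used for pathMeta, pathData and pathGroundTruth is left to the caller.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical in-memory record: one edge plus its ground-truth label.
struct Record {
    unsigned long src, dst, ts;
    int label;  // 0 = normal, 1 = anomalous
};

// Meta file: a single integer N, the number of records.
std::string metaText(const std::vector<Record>& records) {
    return std::to_string(records.size()) + "\n";
}

// Data file: header-less CSV of shape [N,3] (source, destination, timestamp).
std::string dataCsv(const std::vector<Record>& records) {
    std::ostringstream out;
    for (const Record& r : records)
        out << r.src << ',' << r.dst << ',' << r.ts << '\n';
    return out.str();
}

// Label file: header-less CSV of shape [N,1], one label per record.
std::string labelCsv(const std::vector<Record>& records) {
    std::ostringstream out;
    for (const Record& r : records) {
        assert(r.label == 0 || r.label == 1);  // only 0 and 1 are valid labels
        out << r.label << '\n';
    }
    return out.str();
}
```

Because all three strings come from the same vector, the row counts of the data and label files always agree with the N in the meta file.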

Custom Dataset + Custom Runner

  1. Include one of the headers MIDAS/src/NormalCore.hpp, MIDAS/src/RelationalCore.hpp or MIDAS/src/FilteringCore.hpp
  2. Instantiate the core with the required parameters
  3. Call operator() on each data record; it returns the anomaly score for that record
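
The steps above can be sketched as a small generic runner. To keep the snippet compilable without the MIDAS headers, it is written as a template over any object callable as core(src, dst, ts), and it ships with a stand-in called CountingCore. Both the stand-in and the unsigned long argument types are assumptions made for illustration; CountingCore is not a MIDAS core and its scores mean nothing.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical record type: source, destination, timestamp.
struct Edge {
    unsigned long src, dst, ts;
};

// Feed the records to the core one at a time and collect the returned scores.
// Core can be any type callable as core(src, dst, ts).
template <typename Core>
std::vector<float> scoreAll(Core& core, const std::vector<Edge>& edges) {
    std::vector<float> scores;
    scores.reserve(edges.size());
    for (const Edge& e : edges)
        scores.push_back(static_cast<float>(core(e.src, e.dst, e.ts)));
    return scores;
}

// Stand-in so the sketch builds on its own. It returns how many records it
// has seen so far; replace it with a real core from the headers listed above.
struct CountingCore {
    std::size_t seen = 0;
    float operator()(unsigned long, unsigned long, unsigned long) {
        return static_cast<float>(++seen);
    }
};
```

With the real headers included, an instantiated core would be passed to scoreAll in place of CountingCore.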

Other Files

example/

Experiment.cpp

The code we used for the experiments.
It tries to use Intel TBB or OpenMP for parallelization.
Comment out all but one runner function call in main(), because most results are exported to MIDAS/temp/Experiment.csv, together with many intermediate files.

Reproducible.cpp

Similar to Demo.cpp, but with all random parameters hardcoded so that it always produces the same result.
It lets us and other developers test whether implementations in other languages produce acceptable results.

util/

DeleteTempFile.py, EvaluateScore.py and ReproduceROC.py will show their usage and a short description when executed without any argument.

AUROC.hpp

Experimental ROC-AUC implementation in C++11. More info at this repo.

PreprocessData.py

The code to process the raw dataset into an easy-to-read format.
Datasets are always assumed to be in a folder in MIDAS/data/.
It can process the following dataset(s)

  • DARPA/darpa_original.csv -> DARPA/darpa_processed.csv, DARPA/darpa_ground_truth.csv, DARPA/darpa_shape.txt

In Other Languages

  1. Python: Rui Liu's MIDAS.Python, Ritesh Kumar's pyMIDAS
  2. Python (pybind): Wong Mun Hou's MIDAS
  3. Golang: Steve Tan's midas
  4. Ruby: Andrew Kane's midas
  5. Rust: Scott Steele's midas_rs
  6. R: Tobias Heidler's MIDASwrappeR
  7. Java: Joshua Tokle's MIDAS-Java
  8. Julia: Ashrya Agrawal's MIDAS.jl

Online Coverage

  1. ACM TechNews
  2. AIhub
  3. Hacker News
  4. KDnuggets
  5. Microsoft
  6. Towards Data Science

Citation

If you use this code for your research, please consider citing our TKDD and AAAI papers.

@article{bhatia2022realtime,
author = {Bhatia, Siddharth and Liu, Rui and Hooi, Bryan and Yoon, Minji and Shin, Kijung and Faloutsos, Christos},
title = {Real-Time Anomaly Detection in Edge Streams},
year = {2022},
issue_date = {August 2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {16},
number = {4},
issn = {1556-4681},
url = {https://doi.org/10.1145/3494564},
doi = {10.1145/3494564},
journal = {ACM Trans. Knowl. Discov. Data},
month = {jan},
articleno = {75},
numpages = {22}
}

@inproceedings{bhatia2020midas,
    title={MIDAS: Microcluster-Based Detector of Anomalies in Edge Streams},
    author={Siddharth Bhatia and Bryan Hooi and Minji Yoon and Kijung Shin and Christos Faloutsos},
    booktitle={AAAI Conference on Artificial Intelligence (AAAI)},
    year={2020}
}

midas's Issues

Go package

Hey @bhatiasiddharth, very nice project and research.

Just letting you know that there is now an implementation of MIDAS in Golang.

If you have any feedback, let me know or feel free to create an issue on the project.

I've bench-marked the AUC.py against this project using the darpa dataset and it's similar. :)

How to decide whether an edge is anomalous?

In the algorithm, on what basis do you decide whether an edge is anomalous, given its anomaly score?
(I've read the paper but couldn't find it.)

Segmentation fault: 11

Hello, I am currently trying to use MIDAS-R on a dataset, but I get this error right after running it:

$ ./midas -i ../Wednesday-14-02-2018_4GRAPH.csv -o ../scores.txt
Finished Loading Data from ../Wednesday-14-02-2018_4GRAPH.csv
Segmentation fault: 11

Here is a sample of Wednesday-14-02-2018_4GRAPH.csv file:

source,destination,time
1451698946054,901943132206,352877
1451698946054,901943132206,628353
1451698946054,901943132206,973076
1451698946054,901943132206,980110
1451698946054,901943132206,981852
103079215137,1460288880642,1518566400
1322849927169,1047972020228,1518566400
1322849927169,1047972020228,1518566400
1322849927169,1047972020228,1518566400
687194767395,1640677507073,1518566400
1236950581249,1700807049228,1518566400
1322849927169,1047972020228,1518566400
1700807049228,712964571136,1518566400
1322849927169,1047972020228,1518566400
1632087572482,1477468749825,1518566400
1597727834115,94489280524,1518566400
1236950581249,979252543497,1518566400
1580547964930,979252543497,1518566400
1322849927169,1047972020228,1518566400
1116691496960,1047972020228,1518566401
1374389534736,163208757249,1518566401
1116691496960,1047972020228,1518566401
1520418422807,575525617668,1518566401

What is wrong?

Thanks

Why source and dest must be int?

Hello,

I was wondering why source and dest need to be ints rather than strings. Strings would make more sense (to me), because source and dest are usually IP addresses.
Thanks
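
Whatever the reason for the integer interface, one common workaround is to assign each distinct string a compact integer ID before handing records to the detector. The sketch below shows that idea in C++11. NodeIndex is a hypothetical helper, not part of MIDAS, and the unsigned long ID type is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Maps string labels (e.g. IP addresses) to dense integer IDs and back.
class NodeIndex {
public:
    // Return the ID for a label, assigning the next free integer on first sight.
    unsigned long idOf(const std::string& label) {
        const auto it = ids_.find(label);
        if (it != ids_.end()) return it->second;
        const unsigned long id = static_cast<unsigned long>(labels_.size());
        ids_.emplace(label, id);
        labels_.push_back(label);
        return id;
    }

    // Reverse lookup, useful for reporting which address an ID stands for.
    const std::string& labelOf(unsigned long id) const {
        assert(id < labels_.size());  // the ID must have been issued by idOf
        return labels_[id];
    }

    std::size_t size() const { return labels_.size(); }

private:
    std::unordered_map<std::string, unsigned long> ids_;
    std::vector<std::string> labels_;
};
```

Note that this table grows with the number of distinct nodes, so it sits outside the constant-memory property listed under Features. Hashing each string directly to an integer would avoid the table at the cost of losing the reverse lookup.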

Any recommendation to normalize score?

Hi,

Thank you for implementing this wonderful AD method!

I've read through your paper and the score is calculated as
[image: score formula]

We usually use Unix timestamps to represent time, so the scores we get are usually very large. Do you have any recommendations for narrowing the value range?

Thank you!
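
One generic way to narrow the range, offered as a sketch and not as the authors' recommendation, is to compress the scores logarithmically and divide by the largest value, which maps them into [0, 1]. The function name logScale is hypothetical, and the code assumes the scores are non-negative.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Map non-negative scores into [0, 1] via log1p(score) / log1p(max score).
std::vector<float> logScale(const std::vector<float>& scores) {
    if (scores.empty()) return {};
    const float maxScore = *std::max_element(scores.begin(), scores.end());
    std::vector<float> scaled(scores.size(), 0.0f);
    if (maxScore <= 0.0f) return scaled;  // all zeros: nothing to scale
    const double denom = std::log1p(static_cast<double>(maxScore));
    for (std::size_t i = 0; i < scores.size(); ++i) {
        assert(scores[i] >= 0.0f);  // assumption: scores are never negative
        scaled[i] = static_cast<float>(
            std::log1p(static_cast<double>(scores[i])) / denom);
    }
    return scaled;
}
```

The transform is order-preserving, so apart from float rounding the ranking of records stays the same. The maximum is taken over the scores passed in, which means this variant only applies once a batch of scores is available.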

Unclear Docker volume binds for Demo

When running the Demo code on Docker, it took me a while to notice that I needed to bind both $PWD/data and $PWD/temp (if I want the raw scores) when running the container. I would suggest adding a section to the README about executing the Demo on Docker, including something like the following snippet:

docker run -it \
	--rm \
	--name midas \
	--volume $PWD/data:/MIDAS/data \
	--volume $PWD/temp:/MIDAS/temp \
	midas

Any thoughts?

1.0 Changes

Hey, I tried to summarize the changes in 1.0 that I encountered while upgrading the Ruby gem. It may be worth adding some version of this to the README to make it easier for others to upgrade. Demo.cpp was really helpful. Assuming src, dst, and times are std::vector<int>:

Version 0.1.0

#include <anom.hpp>

vector<double>* result;
result = midasR(src, dst, times, num_rows, num_buckets, factor);

Version 1.0.0

#include <RelationalCore.hpp>

size_t n = src.size();
std::vector<float> result;
result.reserve(n);

MIDAS::RelationalCore midas(num_rows, num_buckets, factor);
for (size_t i = 0; i < n; i++) {
  result.push_back(midas(src[i], dst[i], times[i]));
}

Use NormalCore for the no relations version.

Other changes:

  • the midas function takes float input and returns a float score (previously took int input and returned a double score)
  • factor is now a float instead of a double
  • there's a new FilteringCore

Production implementation

Hi, first off, this is really cool. I'm a novice coder, and for research I would like to run this on NetFlow data in real time. The only thing is that I'm unsure how it can be integrated into a live environment rather than run on a local dataset. Maybe it's a dumb question, but how should or could this be implemented?
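
One possible shape for a live setup, given here only as a sketch, is a loop that reads records from a stream as they arrive and emits a score for each one immediately. The function below assumes that some upstream process already converts flow records into lines of the form src,dst,ts with integer fields. The name scoreStream is hypothetical, and core can be any object callable as core(src, dst, ts).

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Read "src,dst,ts" lines from `in`, score each with `core`, and write one
// score per line to `out`. Malformed lines (headers, blanks) are skipped.
// Returns the number of records scored.
template <typename Core>
std::size_t scoreStream(std::istream& in, std::ostream& out, Core& core) {
    std::size_t scored = 0;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        unsigned long src = 0, dst = 0, ts = 0;
        char comma1 = 0, comma2 = 0;
        if (!(fields >> src >> comma1 >> dst >> comma2 >> ts) ||
            comma1 != ',' || comma2 != ',') {
            continue;  // not a well-formed record
        }
        out << core(src, dst, ts) << '\n';
        out.flush();  // make each score visible to downstream consumers at once
        ++scored;
    }
    return scored;
}
```

Calling it with std::cin and std::cout lets the program sit in a shell pipeline between a flow exporter and an alerting script.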

Ground truth labels for the TwitterWorldCup2014 dataset

I want to run MIDAS on the TwitterWorldCup2014 dataset,
but in the given dataset the ground truth does not contain 0/1 labels.
Instead, it contains entries like the following:

1 | Arena de Sao Paulo, Sao Paulo, Brazil | Brazil, Croatia | Marcelo | Own Goal | 6-12-2014 20:11:00 | High importance events.

Please suggest how to generate 0/1 labels, i.e. anomalous or not.
Have you already prepared ground truth labels for this dataset? If so, could you please share them?

In this dataset, there are three kinds of events:

  1. Goal
  2. Penalty
  3. Injury

What would count as an anomaly among these events?

Thanks.

Threshold Used For Experimental Results

Hi there, I was attempting to replicate your results on the Darpa dataset, but realized you didn't specify the threshold you used. I understand the threshold is user defined, but would like to know what value was used in the experimental setup. Could you please clarify how you calculate the MIDAS(R) ROC and what threshold you used?

Thanks!
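
For context, ROC-AUC is computed over all possible thresholds, so it can be evaluated from scores and labels without fixing one. The C++11 sketch below uses the rank-sum (Mann-Whitney U) formulation to illustrate this. It is not the repository's AUROC.hpp and not the sklearn implementation, and the function name rocAuc is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// ROC-AUC from the rank sum of the positive records.
// Tied scores share the average rank of their group.
double rocAuc(const std::vector<float>& scores, const std::vector<int>& labels) {
    assert(scores.size() == labels.size());
    const std::size_t n = scores.size();
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&scores](std::size_t a, std::size_t b) { return scores[a] < scores[b]; });

    double positiveRankSum = 0.0;
    std::size_t positives = 0;
    for (std::size_t i = 0; i < n;) {
        std::size_t j = i;
        while (j < n && scores[order[j]] == scores[order[i]]) ++j;  // tie group [i, j)
        const double avgRank = (i + 1 + j) / 2.0;  // mean of 1-based ranks i+1 .. j
        for (std::size_t k = i; k < j; ++k) {
            if (labels[order[k]] == 1) {
                positiveRankSum += avgRank;
                ++positives;
            }
        }
        i = j;
    }
    const std::size_t negatives = n - positives;
    assert(positives > 0 && negatives > 0);  // AUC is undefined with one class
    return (positiveRankSum - positives * (positives + 1) / 2.0) /
           (static_cast<double>(positives) * static_cast<double>(negatives));
}
```

The result is the probability that a randomly chosen anomalous record scores higher than a randomly chosen normal one, with ties counted as one half.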

Should either Dockerize or better specify dependencies

I'm running Ubuntu 18.04 and so created the following initial Dockerfile to get around the cmake version requirements that prevent my following the steps listed in the Demo section of the README:

FROM ubuntu:20.04

ENV DEBIAN_FRONTEND noninteractive

RUN apt-get update \
    && apt-get install --yes \
      build-essential \
      cmake \
      python-is-python3 \
    && apt-get clean \
    && rm --recursive --force \
      /var/lib/apt/lists/* \
      /tmp/* \
      /var/tmp/*

RUN mkdir /src
WORKDIR /src

COPY CMakeLists.txt ./
RUN mkdir --parents build/release \
    && cp CMakeLists.txt build/release/

COPY example ./example
COPY src ./src
COPY temp ./temp
COPY util ./util

RUN cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/release \
    && cmake --build build/release --target Demo

I then build it via

# Wouldn't need to use `sudo` on macOS
sudo docker build . --tag midas

and run the compiled Demo app via

sudo docker run \
  --tty \
  --interactive \
  --rm \
  --volume $PWD/data:/src/data \
  midas \
  build/release/Demo

which, when shelling out to the Python scripts, aborts with the following

Traceback (most recent call last):
  File "/src/util/EvaluateScore.py", line 20, in <module>
    from pandas import read_csv
ModuleNotFoundError: No module named 'pandas'

since pandas is not available.


To better avoid the need for local environment debugging, my personal preference would be for a known-working Dockerfile.

Ruby Library

Hey, thanks for this project and research! Just wanted to let you know there are now Ruby bindings for it. If you have any feedback, let me know or feel free to create an issue on the project.

SyntaxError : print(f"ROC-AUC{indexRun} = {auc:.4f}")

When I run Demo.py, I get the following error, which I couldn't resolve after much trying. Why is that? (I don't think it is a syntax error, and I can't find any such syntax problem either.)

Seed = 1606470101 // In case of reproduction
#Records = 4554344 // Dataset is loaded
Time = 826ms // Algorithm is finished
// Raw anomaly scores are exported to
// /home/rohit/MIDAS/MIDAS/temp/Score.txt
  File "/home/rohit/MIDAS/MIDAS/util/EvaluateScore.py", line 33
    print(f"ROC-AUC{indexRun} = {auc:.4f}")
    ^
SyntaxError: invalid syntax

The output is nevertheless written to Score.txt.

How to detect anomalous edges

Hello, I have a question.
The output is an anomaly score for each edge,
but how do I detect which edges are anomalous?

And how do I define the threshold on the anomaly score?

Thanks!
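
Since the threshold is left to the user, one generic option is to pick it from the score distribution itself, for example by flagging everything above a chosen quantile. The sketch below illustrates that approach in C++11. It is not taken from the paper or the repository, and the names quantileThreshold and flagAbove are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Score at quantile q (0 <= q <= 1), using the nearest lower rank.
// Takes the scores by value because nth_element reorders them.
float quantileThreshold(std::vector<float> scores, double q) {
    assert(!scores.empty());
    assert(q >= 0.0 && q <= 1.0);
    const std::size_t pos = static_cast<std::size_t>(q * (scores.size() - 1));
    std::nth_element(scores.begin(),
                     scores.begin() + static_cast<std::ptrdiff_t>(pos),
                     scores.end());
    return scores[pos];
}

// Mark the records whose score is strictly above the threshold.
std::vector<bool> flagAbove(const std::vector<float>& scores, float threshold) {
    std::vector<bool> flags(scores.size());
    for (std::size_t i = 0; i < scores.size(); ++i)
        flags[i] = scores[i] > threshold;
    return flags;
}
```

Which quantile is appropriate depends on how many alerts can be handled. When labeled data is available, the threshold can instead be tuned against it.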

Tagged Releases

Hey, it'd be great to add MIDAS to Homebrew so Mac users can do brew install midas. However, this requires tagged versions. What do you think of tagging releases on GitHub?

Implementation question: Should I fill in for the absent data?

Hi,
Thank you for implementing this amazing anomaly detection method!
In my implementation, I'm wondering whether I should fill in absent data.
For example, suppose the directed IP pair A to B appears at 10:00 but is absent at 11:00 and 12:00.
Should I record a count of 0 for A to B at 11:00 and 12:00?

Thank you
