Giter Site home page Giter Site logo

omnisketch's Introduction

OmniSketch

OmniSketch is a C++ framework for designing, simulating and testing sketch, a telemetry technique that aims at collecting streaming statistics. It is designed to be efficient, consistent and easy to use. Currently it is supported on Windows, Linux and MacOS.

Feature Overview

  • A wide variety of sketches are implemented with minimal code. Some complicated sketches, such as ElasticSketch, are also supported. For a full list of the sketches available, check this table.
  • An auto-testing framework that tests sketch performance and collects interested metrics with minimal configurations.
  • A generally applicable Counter Hierarchy framework that enables a tremendous amount of space saving via hierarchical counter braiding, applicable to all value-based counters.
  • Pcap Parser parses .pcap and .pcapng files to generate streaming data that interoperates with OmniSketch. Data format of a streaming data record is customizable.
  • Flexible definitions of a flowkey are available. Currently, 1-tuple, 2-tuple and 5-tuple are supported.
  • No redundant code has to be rewritten. Designer can focus mainly on the sketch algorithm, and let the framework do the rest, e.g. data parsing and testing details.
  • Since sketching algorithms vary a lot, the framework includes only the most common methods. Do not panic! You can always write inject your own code and tell the framework to run it.

Download & Build

Download from GitHub release page:

git clone --recurse-submodules https://github.com/N2-Sys/OmniSketch.git
cd OmniSketch
mkdir build; cd build

The project is built with CMake. If an additional module of PcapParser, which is capable of parsing .pcap/.pcapng files in quite a versatile manner and is interoperable with OmniSketch, is requested, please define the BUILD_PCAP_PARSER macro by

cmake .. -DBUILD_PCAP_PARSER=True

Upon defining this macro, CMake automatically checks for dependencies on PcapPlusPlus and pcap libraries, so MAKE SURE you have installed these libraries in advance. The default included directory of the PcapPlusPlus header files is /usr/local/include/pcapplusplus (which is certainly not true on Windows). To change this default searching path, provide to CMake a new argument:

cmake .. -DBUILD_PCAP_PARSER=True -DPCPP_INCLUDE_PATH=[path on your system]
# e.g. cmake .. -DBUILD_PCAP_PARSER=True -DPCPP_INCLUDE_PATH=/usr/local/include/pcapplusplus

In the simplest case, if PcapParser is not requested, the PcapPlusPlus and pcap libraries are never needed, so you can just build with

cmake ..

Gee, it saves you a lot of work!

To verify that you indeed build a runnable copy of OmniSketch, in the same directory (build/) run

ctest

If you see the line

100% tests passed, 0 tests failed out of 8

you can proceed to poke around OmniSketch and design your new sketches.

Design New Sketches

Here is an overview of how to design your own sketch in OmniSketch. For a detailed description, please check the docs.

  1. Sketch algorithm should be in src/sketch/. Suppose you add a file XXX.h to this directory.
  2. Sketch testing procedure is defined in src/sketch_test/. You should name your testing file as XXXTest.h accordingly.
  3. Add a new sketch target in CMakeLists.txt. This is done in a single line add_user_sketch(YYY XXX).
  4. Write down the sketch config in a toml file. The default config file is src/sketch_config.toml. All the sketch configs are user-defined, but typically should contain (though not required)
  • Streaming data file
  • Format of the streaming data record
  • Metrics measured during testing
  • Sketch parameters
  • Other user-defined configs

It is the user who controls what, where and how to parse in the config file. Omnisketch imposes no restrictions on how you organize the toml file and what you put in there. Fortunately, OmniSketch provides a rich set of tools to help you do it in just a couple of lines.

  1. Goto the building directory build/, cmake and make it. (CMake is needed since a new target has just been added) From the time on, if the code is modified, only make is needed.
  2. At this point, the sketch is compiled and linked. It is callable from the terminal with ./YYY -c config. If no -c option is provided, src/sketch_config.toml is assumed to be the default config file. A possible output of Count Min Sketch runs as follows:
terminal> ./CM -c ../src/sketch_config.toml
   INFO| Loading config from ../src/sketch_config.toml... @utils.cpp:62
VERBOSE| Config loaded. @utils.cpp:76
VERBOSE| Preparing test data... @data.h:718
   INFO| Loading records from ../data/records.bin... @data.h:720
VERBOSE| Records Loaded. @data.h:746
DataSet: 1090120 records with 99999 keys (../data/records.bin)
============     Count Min      ============
  Mem Footprint: 1.52646 MB
    Update Rate: 1853.95 Mpac/s
     Query Rate: 1694.9 Mpac/s
      Query ARE: 0.161831
      Query AAE: 0.252943
============================================
  1. From now on, every time your sketch is about to run on different data and formats, or to collect some new statistics, all you have to do is simply modifying the config file. If the template header should be changed, you have to make a new driver.

API Docs

Please follow this link.

Tables of Sketches

Here are a list of all the sketches implemented and tested. The last column indicates the name of generated executable drivers, which can be modified in lines starting with add_user_sketch in CMakeLists.txt. For example, suppose there is a line saying add_user_sketch(CM CMSketch). It means that one may run driver compiled from src/sketch/CMSketch.h and src/test/CMSketchTest.h by ./CM in the building directory after making it.

Status: have, checked, checked but with doubt, tested.

Status Name of the Executable
CM Sketch t CM
CH-optimized CM Sketch t CHCM
Count Sketch t CS
CU Sketch t CU
Bloom Filter t BF
counting bloom filter t
LD-sketch t
MV-sketch t
HashPipe t HP
FM-sketch(PCSA) t
Linear Counting
Kmin(KMV)
Deltoid t
Flow Radar t FR
sketch learn
elastic sketch t
univmon
nitro sketch t
reversible sketch
Mrac t
k-ary sketch t
seqHash
TwoLevel t
multi-resolution bitmap
lossy count t
space saving t
HyperLogLog t
Misra-Gries t
Fast Sketch t
CounterBraids t
HeavyKeeper d

omnisketch's People

Contributors

dromniscience avatar

Stargazers

Kaihui Gao avatar Lucas Bleme avatar  avatar  avatar  avatar Ke Liu avatar HUANG Qun avatar Valiant avatar Kira avatar  avatar deadlycat avatar chen hongyang avatar CHEN Xiang avatar EnanaShinonome avatar Qingxiu Liu avatar Jiaqi avatar JeremyGuo avatar Jintao He avatar Yuchen Song avatar

Watchers

HUANG Qun avatar JeremyGuo avatar Valiant avatar Yongtong Wu avatar Kira avatar SHENG Siyuan avatar CHEN Xiang avatar

omnisketch's Issues

Heavy hitter estimation is broken when there are multiple flows of the same size

When getting the ground truth for heavy hitter, getHeavyHitter in data.h simply takes the largest $K$ flows, and use the size of the $K$'th flow as the threshold $thr$. However there might be multiple flows of the same size as the $K$'th one, and when testing, we ask the sketch to retrieve all flows of size $\ge thr$, which will definitely result in a drop in the recall rate. This effect is significant when the flow size are distributed evenly.

A possible fix could be including all flows of size $\ge thr$ when calculating the ground truth.

Inaccurate time measurement in test.h

In test.h, we use the following macros to measure the update and query time for each sketch:

#define DEFINE_TIMERS                                                          \
  auto timer = std::chrono::microseconds::zero();                              \
  auto tick = std::chrono::steady_clock::now();                                \
  auto tock = std::chrono::steady_clock::now();
#define START_TIMER tick = std::chrono::steady_clock::now();
#define STOP_TIMER                                                             \
  tock = std::chrono::steady_clock::now();                                     \
  timer += std::chrono::duration_cast<std::chrono::microseconds>(tock - tick);
#define TIMER_RESULT static_cast<int64_t>(timer.count())

Here the timer is measured in microseconds. However, in testUpdate, we measure the time for each update separately:

  DEFINE_TIMERS;
  for (auto ptr = begin; ptr != end; ptr++) {
    START_TIMER;
    ptr_sketch->update(ptr->flowkey,
                       cnt_method == Data::InLength ? ptr->length : 1);
    STOP_TIMER;
  }
  if (metric_vec.in(Metric::RATE))
    update[Metric::RATE] = 1.0 * (end - begin) / TIMER_RESULT * 1e6;

For many sketch algorithms (e.g. CMSketch), the update operation takes far less than 1us (~200ns on my machine), and the timer result is rounded to 0us. This causes significant inaccuracy for time measurement. The same problem exists in testInsert and testQuery, etc.

Changing microseconds to nanoseconds would be a solution.

Also notice that, std::chrono measures the wall clock time instead of CPU time. We might want to switch to std::clock(), which measures the CPU time and produces a more accurate result.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.