ucbrise / confluo
Real-time Monitoring and Analysis of Data Streams
Home Page: https://ucbrise.github.io/confluo/
License: Apache License 2.0
Some parameters shared between the server and the client (e.g., TIME_RESOLUTION_NS) can become inconsistent if either the server or the client changes them. We need to add a mechanism to ensure consistency for these parameters.
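One possible mechanism, as a minimal sketch: the client fetches the server's shared parameters at connection time and fails fast on any mismatch. The function name `verify_shared_params` and the map-based representation are illustrative assumptions, not Confluo API.

```cpp
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical sketch: compare the server's advertised shared parameters
// against the client's compiled-in values and reject the connection on
// any mismatch. Parameter names here are illustrative only.
void verify_shared_params(const std::map<std::string, uint64_t>& server_params,
                          const std::map<std::string, uint64_t>& client_params) {
  for (const auto& kv : client_params) {
    auto it = server_params.find(kv.first);
    if (it == server_params.end() || it->second != kv.second) {
      throw std::runtime_error("Shared parameter mismatch: " + kv.first);
    }
  }
}
```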
Support projections for cases where not all columns are required in the output of a query.
Support Golang client and demo.
Right now, the time-period for trigger evaluation defaults to whatever the filter granularity is. We should dissociate the two and make sure the monitor thread correctly aggregates required filters at the proper time-period.
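To illustrate the dissociation, a sketch (function name and granularities are hypothetical, not the monitor thread's actual code) of aggregating fine-grained filter buckets into a coarser trigger evaluation period:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: sum per-bucket counts at filter granularity
// `bucket_ns` over one trigger period `period_ns` (a multiple of it),
// so the trigger period need not equal the filter granularity.
uint64_t aggregate_period(const std::vector<uint64_t>& bucket_counts,
                          size_t start_bucket, uint64_t bucket_ns,
                          uint64_t period_ns) {
  size_t n = static_cast<size_t>(period_ns / bucket_ns);
  uint64_t total = 0;
  for (size_t i = start_bucket; i < start_bucket + n && i < bucket_counts.size(); ++i)
    total += bucket_counts[i];
  return total;
}
```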
storage_mode is used by both the read tail and the data log, so the current in_memory mode allocates from a mempool only if the requested size matches; otherwise it calls malloc. We need a cleaner way to make these allocations.
The rpc_dialog_writer hasn't had remove_trigger, remove_filter or remove_index tested. Add test cases to the C++ writer client test to cover this functionality.
The current reflog is a monolog_exp2, which has exponentially growing buckets. This would require a memory pool for each bucket. Instead, the reflog should have exponentially growing bucket containers that have pointers to fixed size buckets. The buckets can then be allocated using a single memory pool.
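A minimal sketch of the proposed layout (not Confluo's actual reflog code): a growing container of pointers to fixed-size buckets, so that every bucket allocation has the same size and can be served by one memory pool. Here `std::vector` and `new` stand in for the bucket container and the mempool:

```cpp
#include <cstddef>
#include <vector>

// Sketch: all buckets have the same fixed size, so a single memory pool
// could serve every bucket allocation; only the pointer container grows.
template <typename T, size_t BUCKET_SIZE = 1024>
class fixed_bucket_log {
 public:
  void push_back(const T& val) {
    size_t bucket = size_ / BUCKET_SIZE;
    if (bucket == buckets_.size())
      buckets_.push_back(new T[BUCKET_SIZE]);  // stand-in for a mempool allocation
    buckets_[bucket][size_ % BUCKET_SIZE] = val;
    ++size_;
  }
  const T& at(size_t idx) const {
    return buckets_[idx / BUCKET_SIZE][idx % BUCKET_SIZE];
  }
  size_t size() const { return size_; }
  ~fixed_bucket_log() { for (T* b : buckets_) delete[] b; }
 private:
  std::vector<T*> buckets_;  // bucket container; growth handled by std::vector here
  size_t size_ = 0;
};
```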
Add (potentially modified) SQL interface for supporting queries, adding/removing filters/indexes/triggers, etc.
The current query planner has a few issues that need to be resolved:
The following methods of the monolog implementations need to be tested if they are to be used in the future:
void set(size_t idx, const T* data, size_t len);
void set_unsafe(size_t idx, const T* data, size_t len);
void get(T* data, size_t idx, size_t len);
size_t storage_size();
On Linux machines the vm.max_map_count limit is hit very quickly (the default on an EC2 c4.8xlarge instance is 65536). It must be raised manually (e.g., sysctl -w vm.max_map_count=<new-limit>) for archival to succeed, since reflogs have relatively small buckets and each bucket is memory-mapped.
Instead, the storage_allocator could memory-map larger-than-requested chunks of files and hand out the excess of the mapped range on future calls. It would also have to map at a different granularity depending on what is being mmapped (data log vs. reflog buckets); this could be determined simply from the size passed to the mmap call.
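A rough sketch of the idea, using anonymous mappings in place of file-backed ones; the class and its interface are hypothetical stand-ins for the actual storage_allocator:

```cpp
#include <cstddef>
#include <sys/mman.h>

// Sketch under assumptions: reserve one larger-than-requested mapping and
// carve sub-ranges out of it on later calls, so many small allocations
// consume a single map entry instead of one each (reducing pressure on
// vm.max_map_count). A real allocator would mmap file ranges instead.
class chunked_allocator {
 public:
  explicit chunked_allocator(size_t chunk_size) : chunk_size_(chunk_size) {}
  void* allocate(size_t size) {
    if (cur_ == nullptr || offset_ + size > chunk_size_) {
      // Map a fresh chunk; the excess beyond `size` serves future calls.
      void* m = mmap(nullptr, chunk_size_, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (m == MAP_FAILED) return nullptr;
      cur_ = static_cast<char*>(m);
      offset_ = 0;
    }
    void* p = cur_ + offset_;
    offset_ += size;
    return p;
  }
 private:
  size_t chunk_size_;
  char* cur_ = nullptr;
  size_t offset_ = 0;
};
```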
I get a compilation error when building Confluo. I have already read the other compile-error issues, but none of them matches my problem. Could you help me? Thank you very much.
error message:
In file included from /root/work/confluo/libconfluo/../libutils/utils/atomic.h:7:0,
from /root/work/confluo/libconfluo/confluo/archival/archival_utils.h:4,
from /root/work/confluo/libconfluo/src/archival/archival_utils.cc:1:
/usr/include/c++/4.8.2/atomic: In instantiation of ‘struct std::atomic<confluo::storage::encoded_ptr >’:
/root/work/confluo/libconfluo/confluo/storage/swappable_encoded_ptr.h:367:32: required from ‘class confluo::storage::swappable_encoded_ptr’
/root/work/confluo/libconfluo/src/archival/archival_utils.cc:17:42: required from here
/usr/include/c++/4.8.2/atomic:167:7: error: function ‘std::atomic<_Tp>::atomic() [with _Tp = confluo::storage::encoded_ptr]’ defaulted on its first declaration with an exception-specification that differs from the implicit declaration ‘std::atomic<confluo::storage::encoded_ptr >::atomic()’
atomic() noexcept = default;
^
make[2]: *** [libconfluo/CMakeFiles/confluo.dir/src/archival/archival_utils.cc.o] Error 1
My environment configuration:
boost 1.69
gcc (GCC) 5.3.1 20160406 (Red Hat 5.3.1-6)
I only build the RPC component; all other components are OFF.
Expose Sketch API to be able to:
Expose corresponding remote API.
Command line parser sometimes maps short options to completely unrelated arguments.
Implement count-min-sketch (supposedly more space-efficient than count-sketch)
Time-adaptive sketch: decay accuracy & storage for older values (using inflation)
Integrate with atomic multilog for offline/online approximate queries.
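For reference, a minimal count-min sketch (unrelated to Confluo's actual sketch implementation): d hash rows of w counters, where the point estimate of a key's count is the minimum over its d counters, which can only overestimate:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Minimal count-min sketch: each of the d rows hashes the key to one of
// w counters; estimates take the row-wise minimum. The per-row hash here
// is a crude mix for illustration, not a proper pairwise-independent family.
class count_min_sketch {
 public:
  count_min_sketch(size_t w, size_t d)
      : w_(w), counts_(d, std::vector<uint64_t>(w, 0)) {}
  void add(uint64_t key) {
    for (size_t i = 0; i < counts_.size(); ++i)
      counts_[i][hash(key, i)] += 1;
  }
  uint64_t estimate(uint64_t key) const {
    uint64_t est = UINT64_MAX;
    for (size_t i = 0; i < counts_.size(); ++i)
      est = std::min(est, counts_[i][hash(key, i)]);
    return est;
  }
 private:
  size_t hash(uint64_t key, size_t row) const {
    return std::hash<uint64_t>{}(key * (row + 1) + row) % w_;
  }
  size_t w_;
  std::vector<std::vector<uint64_t>> counts_;
};
```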
If possible, check in the CMakeLists.txt whether all of the dependencies for the Python client are installed. Currently, a missing Python dependency is not detected at configure time and the build fails later:
Traceback (most recent call last):
File "setup.py", line 3, in <module>
from setuptools import setup
ImportError: No module named setuptools
pyclient/CMakeFiles/pyclient_build.dir/build.make:57: recipe for target 'pyclient/CMakeFiles/pyclient_build' failed
make[2]: *** [pyclient/CMakeFiles/pyclient_build] Error 1
CMakeFiles/Makefile2:1281: recipe for target 'pyclient/CMakeFiles/pyclient_build.dir/all' failed
make[1]: *** [pyclient/CMakeFiles/pyclient_build.dir/all] Error 2
Makefile:160: recipe for target 'all' failed
make: *** [all] Error 2
Currently, data is double-copied on the server side:
This resolves the Java bug where it prints "Received n" for every RPC call.
Appending to an atomic multilog even when it has no filters and indexes may have a large memory overhead despite only having to update a single data structure (the data log). Need to profile to discover possible memory leak(s).
Archive data, filters and indexes periodically and when hitting a memory limit.
Support nullable types for records that may have missing values.
Possibly caused by atomic multilog append latency.
Currently the client must read from and write to a table using a contiguous string buffer. Add an interface to the C++ and Python RPC clients to read and write using JSON for easier usability. Validate all fields when a record is written using this interface.
The documentation does not mention any of this, and there is nothing about subscriptions in the client API.
How can Confluo be used as a pub/sub system?
These tests would check if we can successfully recover a table from an on-disk version of the data.
Store/compute all aggregates as double-precision values rather than the specific type of the aggregated attribute. This avoids overflow and similar issues for smaller types.
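A small example of the motivation (the helper functions are hypothetical): summing a 32-bit attribute in its own type wraps around, while a double accumulator does not:

```cpp
#include <cstddef>
#include <cstdint>

// Illustration: accumulating in the attribute's own (small) type wraps
// modulo 2^32, while accumulating in double does not (at some cost in
// precision for very large sums).
uint32_t sum_u32(const uint32_t* v, size_t n) {
  uint32_t s = 0;
  for (size_t i = 0; i < n; ++i) s += v[i];  // wraps modulo 2^32
  return s;
}
double sum_dbl(const uint32_t* v, size_t n) {
  double s = 0;
  for (size_t i = 0; i < n; ++i) s += v[i];  // no 32-bit wraparound
  return s;
}
```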
Add a parser for memory sizes: e.g., the values provided could be 10k, 100M, or 5g, and would be parsed to the appropriate number of bytes.
size_t configuration_params::MAX_MEMORY = dialog_conf.get<size_t>(
"max_memory", constants::DEFAULT_MAX_MEMORY);
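A possible shape for such a parser; the function name and the exact suffix rules (case-insensitive k/m/g, powers of 1024, bare numbers as bytes) are assumptions, not existing Confluo code:

```cpp
#include <cctype>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical parser: "10k" -> 10 * 1024, "100M" -> 100 * 1024^2,
// "5g" -> 5 * 1024^3; a bare number is taken as bytes.
uint64_t parse_memory_size(const std::string& s) {
  size_t pos = 0;
  uint64_t value = std::stoull(s, &pos);
  if (pos == s.size()) return value;  // plain byte count, no suffix
  if (pos + 1 != s.size()) throw std::invalid_argument("bad size: " + s);
  switch (std::tolower(static_cast<unsigned char>(s[pos]))) {
    case 'k': return value << 10;
    case 'm': return value << 20;
    case 'g': return value << 30;
    default: throw std::invalid_argument("bad suffix: " + s);
  }
}
```

The parsed value could then feed configuration entries such as max_memory above.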
The way the current radix_tree_node is written does not permit a simple mempool allocation of the entire struct.
Currently the record size for a schema is fixed; switch to dynamically sized records.
Determining if the pthread_create exist failed with the following output:
Change Dir: /root/confluo_test/confluo/build/CMakeFiles/CMakeTmp
Run Build Command:"/usr/bin/gmake" "cmTC_6011a/fast"
/usr/bin/gmake -f CMakeFiles/cmTC_6011a.dir/build.make CMakeFiles/cmTC_6011a.dir/build
gmake[1]: Entering directory `/root/confluo_test/confluo/build/CMakeFiles/CMakeTmp'
Building C object CMakeFiles/cmTC_6011a.dir/CheckSymbolExists.c.o
/usr/bin/cc -o CMakeFiles/cmTC_6011a.dir/CheckSymbolExists.c.o -c /root/confluo_test/confluo/build/CMakeFiles/CMakeTmp/CheckSymbolExists.c
Linking C executable cmTC_6011a
/usr/local/bin/cmake -E cmake_link_script CMakeFiles/cmTC_6011a.dir/link.txt --verbose=1
/usr/bin/cc -rdynamic CMakeFiles/cmTC_6011a.dir/CheckSymbolExists.c.o -o cmTC_6011a
CMakeFiles/cmTC_6011a.dir/CheckSymbolExists.c.o: In function `main':
CheckSymbolExists.c:(.text+0x16): undefined reference to `pthread_create'
collect2: error: ld returned 1 exit status
gmake[1]: *** [cmTC_6011a] Error 1
gmake[1]: Leaving directory `/root/confluo_test/confluo/build/CMakeFiles/CMakeTmp'
gmake: *** [cmTC_6011a/fast] Error 2
File /root/confluo_test/confluo/build/CMakeFiles/CMakeTmp/CheckSymbolExists.c:
/* */
#include <pthread.h>
int main(int argc, char** argv)
{
(void)argv;
#ifndef pthread_create
return ((int*)(&pthread_create))[argc];
#else
(void)argc;
return 0;
#endif
}
Determining if the function pthread_create exists in the pthreads failed with the following output:
Change Dir: /root/confluo_test/confluo/build/CMakeFiles/CMakeTmp
Run Build Command:"/usr/bin/gmake" "cmTC_f4480/fast"
/usr/bin/gmake -f CMakeFiles/cmTC_f4480.dir/build.make CMakeFiles/cmTC_f4480.dir/build
gmake[1]: Entering directory `/root/confluo_test/confluo/build/CMakeFiles/CMakeTmp'
Building C object CMakeFiles/cmTC_f4480.dir/CheckFunctionExists.c.o
/usr/bin/cc -DCHECK_FUNCTION_EXISTS=pthread_create -o CMakeFiles/cmTC_f4480.dir/CheckFunctionExists.c.o -c /usr/local/share/cmake-3.13/Modules/CheckFunctionExists.c
Linking C executable cmTC_f4480
/usr/local/bin/cmake -E cmake_link_script CMakeFiles/cmTC_f4480.dir/link.txt --verbose=1
/usr/bin/cc -DCHECK_FUNCTION_EXISTS=pthread_create -rdynamic CMakeFiles/cmTC_f4480.dir/CheckFunctionExists.c.o -o cmTC_f4480 -lpthreads
/usr/bin/ld: cannot find -lpthreads
collect2: error: ld returned 1 exit status
gmake[1]: *** [cmTC_f4480] Error 1
gmake[1]: Leaving directory `/root/confluo_test/confluo/build/CMakeFiles/CMakeTmp'
gmake: *** [cmTC_f4480/fast] Error 2
A generic parsing framework would eliminate code redundancy in parsing:
Allow filters to use a function instead of just an expression. The filter class already accepts a function in the constructor but there's currently no way for an RPC client to specify a function, so the default_filter is always used.
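A sketch of what a function-based filter could look like; the `record` type and the `apply_filter` helper are stand-ins for illustration, not Confluo's classes:

```cpp
#include <functional>
#include <vector>

// Hypothetical: a filter backed by an arbitrary predicate rather than a
// parsed expression. `record` is a stand-in type, not Confluo's record.
struct record { long value; };
using filter_fn = std::function<bool(const record&)>;

std::vector<record> apply_filter(const std::vector<record>& in,
                                 const filter_fn& f) {
  std::vector<record> out;
  for (const auto& r : in)
    if (f(r)) out.push_back(r);  // keep records the predicate accepts
  return out;
}
```

Letting an RPC client register such a predicate (rather than always falling back to the default_filter) is the open part of this issue.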
Get alerts by time and trigger name
Subscribe to alerts
Active alerts
Corresponding C++/Python client methods and tests
Hi there,
I see there's been some activity in a branch called chain-replication, but it'd be nice to know what you folks have in mind for a scale-out architecture. :)
Create a separate build process for debugging that enables the debug flag and does not enable compiler optimizations.
Currently the client exposes buffered reads and writes that assume sequential access patterns. These should be moved to a separate streaming interface.
The ad hoc filter creates an empty plan for queries on long fields with no upper bound (e.g., value >= 100). This may extend to other types as well, but that has not been confirmed.
Improve allocation using pools and file consolidation.
Hi, I'm trying to make install Confluo. However, the installation fails with errors while fetching from the Maven repository. I have a Maven proxy (mirror); where can I set it in the Confluo source?
[ 32%] Building CXX object libconfluo/CMakeFiles/confluo.dir/src/confluo_store.cc.o
In file included from /home/root/C++/confluo/libconfluo/confluo/aggregate/aggregate_info.h:7:0,
from /home/root/C++/confluo/libconfluo/confluo/aggregated_reflog.h:5,
from /home/root/C++/confluo/libconfluo/confluo/filter.h:4,
from /home/root/C++/confluo/libconfluo/confluo/archival/load_utils.h:7,
from /home/root/C++/confluo/libconfluo/confluo/atomic_multilog.h:11,
from /home/root/C++/confluo/libconfluo/confluo/confluo_store.h:8,
from /home/root/C++/confluo/libconfluo/src/confluo_store.cc:1:
/home/root/C++/confluo/libconfluo/confluo/parser/aggregate_parser.h:32:27: error: expected identifier before ‘(’ token
(std::string, agg)
^
/home/root/C++/confluo/libconfluo/confluo/parser/aggregate_parser.h:32:41: error: ‘agg’ has not been declared
(std::string, agg)
^
/home/root/C++/confluo/libconfluo/confluo/parser/aggregate_parser.h:33:23: error: ‘field_name’ has not been declared
(std::string, field_name))
^
/home/root/C++/confluo/libconfluo/confluo/parser/aggregate_parser.h:33:33: error: ‘parameter’ declared as function returning a function
(std::string, field_name))
^
/home/root/C++/confluo/libconfluo/confluo/parser/aggregate_parser.h:35:1: error: expected constructor, destructor, or type conversion before ‘namespace’
namespace confluo {
^
make[2]: *** [libconfluo/CMakeFiles/confluo.dir/src/confluo_store.cc.o] Error 1
make[1]: *** [libconfluo/CMakeFiles/confluo.dir/all] Error 2
make: *** [all] Error 2
GCC VERSION:
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC)
BOOST VERSION:
1.53.0
LOCATE:
boost: /usr/include/boost
Improve and add documentation where none exists in several header files, including but not limited to:
storage.h
monolog_linear.h
monolog_exp2.h
dialog_table.h
expression.h
tiered_index.h
The User Guide at https://ucbrise.github.io/confluo/ could use a full Python version. At this time it's completely unclear how to use Confluo from Python.
Currently, the trigger evaluation framework uses real time (more precisely, server time) to determine which time buckets to check for trigger evaluation. This can be an issue if the application supplies the timestamps and those timestamps have no correlation with server time.
If a filter or index was invalidated it shouldn't be valid again on recovery/restart.
A reader thread freeing memory does not know the id of the writer thread that allocated the pointer in the first place.
Currently, Confluo requires a fixed schema for each atomic multilog, which restricts different applications like network monitoring and diagnosis. We want to support flexible schemas (e.g., JSON) to facilitate these applications.