
bloomd's Introduction

Bloomd

Bloomd is a high-performance C server that exposes bloom filters and operations over them to networked clients. It uses a simple, human-readable ASCII protocol similar to memcached's.

Bloom filters are a type of sketching or approximate data structure. They trade exactness for efficiency of representation. Bloom filters provide a set-like abstraction, with the important caveat that they may contain false positives, meaning they may claim an element is part of the set when it was in fact never added. The rate of false positives can be tuned to meet application demands, but reducing the error rate rapidly increases the amount of memory required for the representation.

TL;DR: Bloom filters enable you to represent 1MM items with a false positive rate of 0.1% in 2.4MB of RAM.

Features

  • Scalable non-blocking core allows for many connected clients and concurrent operations
  • Implements scalable bloom filters, allowing dynamic filter sizes
  • Supports asynchronous flushes to disk for persistence
  • Supports non-disk backed bloom filters for high I/O
  • Automatically faults cold filters out of memory to save resources
  • Dead simple to start and administer
  • FAST, FAST, FAST

Install

Download and build from source:

$ git clone https://github.com/armon/bloomd.git
$ cd bloomd
$ pip install SCons  # bloomd uses the SCons build system; skip if SCons is already installed
$ scons
$ ./bloomd

This will generate some errors related to building the test code as it depends on libcheck. To build the test code successfully, do the following:

$ cd deps/check-0.9.8/
$ ./configure
$ make
# make install
# ldconfig (necessary on some Linux distros)

Then re-build bloomd. At this point, the test code should build successfully.

For CentOS or RHEL users, the kind folks from Vortex RPM have made a repo available with RPMs.

Usage

Bloomd can be configured using a file which is in INI format. Here is an example configuration file:

# Settings for bloomd
[bloomd]
tcp_port = 8673
data_dir = /mnt/bloomd
log_level = INFO
flush_interval = 300
workers = 2

Then run bloomd, pointing it to that file:

bloomd -f /etc/bloomd.conf

A full list of configuration options is below.

Clients

Here is a list of known client implementations:

Here is a list of "best-practices" for client implementations:

  • Maintain a set of open connections to the server to minimize connection time
  • Make use of the bulk operations when possible, as they are more efficient.
  • For long keys, it is better to do a client-side hash (SHA1 at least) and send the hash as the key to minimize network traffic (see the sketch after this list).
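
As a minimal sketch of that last point, here is one way to hash long keys client-side before sending them to bloomd. It assumes OpenSSL is available (compile with -lcrypto); the function name is illustrative and not part of any bloomd client API:

#include <openssl/sha.h>
#include <stdio.h>
#include <string.h>

/* Hex-encode the SHA1 of a long key; out must hold 41 bytes. The
 * 40-character digest is then sent to bloomd in place of the raw key. */
static void bloomd_hash_key(const char *key, char out[41])
{
    unsigned char digest[SHA_DIGEST_LENGTH];
    SHA1((const unsigned char *)key, strlen(key), digest);
    for (int i = 0; i < SHA_DIGEST_LENGTH; i++)
        sprintf(out + 2 * i, "%02x", digest[i]);
}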

Configuration Options

Each configuration option is documented below:

  • tcp_port : Integer, sets the tcp port to listen on. Default 8673.

  • port: Same as above. For compatibility.

  • udp_port : Integer, sets the udp port. Currently listened on but otherwise unused. Default 8674.

  • bind_address: The IP to bind to. Defaults to 0.0.0.0

  • data_dir : The data directory that is used. Defaults to /tmp/bloomd

  • log_level : The logging level that bloomd should use. One of: DEBUG, INFO, WARN, ERROR, or CRITICAL. All logs go to syslog, and stderr if that is a TTY. Default is DEBUG.

  • workers : This controls the number of worker threads that are used. Defaults to 1. If many different filters are used, it can be advantageous to increase this to the number of CPU cores. If only a few filters are used, the increased lock contention may reduce throughput, and a single worker may be better.

  • flush_interval : This is the time interval in seconds in which filters are flushed to disk. Defaults to 60 seconds. Set to 0 to disable.

  • cold_interval : If a filter is not accessed (checked or set) for this amount of time, it is eligible to be removed from memory and left only on disk. If a filter is accessed, it will automatically be faulted back into memory. Set to 3600 seconds by default (1 hour). Set to 0 to disable cold faulting.

  • in_memory : If set to 1, then all filters are in-memory ONLY by default. This means they are not persisted to disk, and are not eligible for cold fault out. Defaults to 0.

  • initial_capacity : If a create command does not provide an initial capacity for a filter, this value is used. Defaults to 100K items.

  • default_probability : If a create command does not provide a false-positive probability rate, this value is used. Defaults to 1/10K.

  • use_mmap : If set to 1, bloomd's internal buffer management is disabled, and buffers instead use a plain mmap() and rely on the kernel for all management. This increases data safety in case bloomd crashes, but has adverse effects on performance if the total memory utilization of the system is high. In general, this should be left at 0, which is the default.

  • scale_size : When a bloom filter is "scaled" up, this is the multiplier that is used. It should either be 2 or 4. Setting it to 2 will conserve memory, but is slower due to the increased number of filters that need to be checked. Defaults to 4.

  • probability_reduction : This is a subtle control value that affects the scaling of bloom filters. It should probably not be modified. Defaults to 0.9.

Protocol

By default, Bloomd will listen for TCP connections on port 8673. It uses a simple ASCII protocol that is very similar to memcached.

A command has the following syntax:

cmd [args][\r]\n

Each line starts with a command, followed by optional arguments, and ends in a newline (the carriage return is optional).

There are a total of 11 commands:

  • create - Create a new filter (a filter is a named bloom filter)
  • list - List all filters or those matching a prefix
  • drop - Drop a filter (deletes it from disk)
  • close - Close a filter (unmaps it from memory, but it remains accessible)
  • clear - Clear a filter from the lists (removes it from memory, leaves it on disk)
  • check|c - Check if a key is in a filter
  • multi|m - Check if each key in a list is in a filter
  • set|s - Set an item in a filter
  • bulk|b - Set many items in a filter at once
  • info - Gets info about a filter
  • flush - Flushes all filters or just a specified one

For the create command, the format is:

create filter_name [capacity=initial_capacity] [prob=max_prob] [in_memory=0|1]

Note:

  1. capacity must be > 10,000 (1e4, 10K)
  2. capacity should be <= 1,000,000,000 (1e9, 1G)
  3. prob must be < 0.1 (1e-1)
  4. prob should be <= 0.01 (1e-2)

Here filter_name is the name of the filter, and can contain the characters a-z, A-Z, 0-9, ., _. If an initial capacity is provided, the initial filter will be sized to store at least that many items; otherwise the configured default value will be used. If a maximum false positive probability is provided, that will be used; otherwise the configured default is used. You can optionally specify in_memory to force the filter to not be persisted to disk.

As an example:

create foobar capacity=1000000 prob=0.001

This will create a filter foobar that has a 1M initial capacity, and a 1/1000 probability of generating false positives. Valid responses are either "Done", "Exists", or "Delete in progress". The last response occurs if a filter of the same name was recently deleted, and bloomd has not yet completed the delete operation. If so, a client should retry the create in a few seconds.
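
As an illustration, a possible session exercising each response (whether the final create sees "Delete in progress" or "Done" depends on how quickly the background delete completes):

> create foobar capacity=1000000 prob=0.001
Done

> create foobar capacity=1000000 prob=0.001
Exists

> drop foobar
Done

> create foobar capacity=1000000 prob=0.001
Delete in progress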

The list command takes either no arguments or a filter-name prefix, and returns information about the matching filters. Here is an example response:

> list foo
START
foobar 0.001 1797211 1000000 0
END

With the list prefix "foo", this shows a single matching filter. The columns are the filter name, false-positive probability, storage in bytes, capacity, and item count: foobar has a 0.001 probability of false positives, about 1.8MB of storage, a current capacity of 1M items, and 0 current items. The size and capacity automatically scale as more items are added.
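
The storage figure tracks the textbook bloom filter sizing formula. As a rough check (this ignores bloomd's small on-disk header, so expect a few bytes of slack):

bits = -n * ln(p) / (ln 2)^2 = 1,000,000 * 6.9078 / 0.4805 ≈ 14,377,600 bits ≈ 1,797,200 bytes

which is within a handful of bytes of the 1797211 reported above.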

The drop, close and clear commands are like create, but take only a filter name. They return either "Done" or "Filter does not exist". clear can also return "Filter is not proxied. Close it first.", which means the filter is still in memory and not eligible to be cleared; this can be resolved by first closing the filter.

Check and set share the same syntax:

[check|set] filter_name key

Both take a filter name and a key, and return "Yes", "No", or "Filter does not exist".

The bulk and multi commands are similar to check/set but allow many keys to be set or checked at once. Keys must be separated by a space:

[multi|bulk] filter_name key_1 [key_2 [key_3 [key_N]]]

The check, multi, set and bulk commands can also be called by their aliases c, m, s and b respectively.

The info command takes a filter name, and returns information about the filter. Here is an example output:

START
capacity 1000000
checks 0
check_hits 0
check_misses 0
page_ins 0
page_outs 0
probability 0.001
sets 0
set_hits 0
set_misses 0
size 0
storage 1797211
END

The command may also return "Filter does not exist" if the filter does not exist.

The flush command may be called without any arguments, which causes all filters to be flushed. If a filter name is provided then that filter will be flushed. This will either return "Done" or "Filter does not exist".
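
Since the protocol is plain ASCII over TCP, a minimal client needs nothing beyond a socket. Here is a sketch in C that sets a single key, assuming bloomd is listening on localhost:8673 and a filter named foobar already exists (error handling trimmed for brevity):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8673);                 /* default bloomd TCP port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
        return 1;

    const char *cmd = "set foobar zipzab\n";     /* commands end in \n */
    write(fd, cmd, strlen(cmd));

    char buf[256];
    ssize_t n = read(fd, buf, sizeof(buf) - 1);  /* e.g. "Yes\n" or "No\n" */
    if (n > 0) { buf[n] = '\0'; printf("%s", buf); }
    close(fd);
    return 0;
}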

Example

Here is an example of a client flow, assuming bloomd is running on the default port using just telnet:

$ telnet localhost 8673
> list
START
END

> create foobar
Done

> check foobar zipzab
No

> set foobar zipzab
Yes

> check foobar zipzab
Yes

> multi foobar zipzab blah boo
Yes No No

> bulk foobar zipzab blah boo
No Yes Yes

> multi foobar zipzab blah boo
Yes Yes Yes

> list
START
foobar 0.000100 300046 100000 3
END

> drop foobar
Done

> list
START
END

Performance

Although extensive performance evaluations have not been done, casual testing on a 2012 MBP with pure set/check operations allows for a throughput of at least 600K ops/sec. On Linux, response times can be as low as 1.5 μs.

Bloomd also supports multi-core systems for scalability, so it is important to tune it for the given workload. The number of worker threads can be configured either in the configuration file or by providing a -w flag. This should be set to at most twice the CPU count. By default, only a single worker is used.
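
For example, to run with four workers (the config path is illustrative):

$ bloomd -f /etc/bloomd.conf -w 4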

References

Here are some related works which we make use of:


bloomd's Issues

bitmap_from_filename & bf_from_bitmap

In a C program, I would like to reuse an existing file generated by bloomd. I used the standard bloomd/telnet commands "create"/"set", which generate the flushed file bloomd.foobar/data.000.mmap:

#include <string.h>    /* strcpy */
#include <sys/stat.h>  /* stat */
/* plus the libbloom headers declaring bloom_bitmap / bloom_bloomfilter */

int retCode;
char myFile[1024];
strcpy(myFile, "bloomd.foobar/data.000.mmap");

struct stat myFileStat;
retCode = stat(myFile, &myFileStat);                /* returns 0: ok */

bloom_bitmap myBloom_bitmap;
retCode = bitmap_from_filename(myFile, myFileStat.st_size, 0, ANONYMOUS,
                               &myBloom_bitmap);    /* returns 0: ok */

bloom_bloomfilter mybloom_bloomfilter;
retCode = bf_from_bitmap(&myBloom_bitmap, 1, 0,
                         &mybloom_bloomfilter);     /* returns -1: ko */

bitmap_from_filename is OK (returns 0).
bf_from_bitmap is KO (returns -1).
Next step: check whether a string exists in the bloom_bloomfilter...
Thank you for your support.

cast-function-type error in bloomd.c

I'm working off the latest master, on Ubuntu 18.10.

scons version:

	script: v3.0.5.a56bbd8c09fb219ab8a9673330ffcd55279219d0, 2019-03-26 23:16:31, by bdeegan on kufra
	engine: v3.0.5.a56bbd8c09fb219ab8a9673330ffcd55279219d0, 2019-03-26 23:16:31, by bdeegan on kufra
$ python3 `which scons`
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
gcc -o src/bloomd/bloomd.o -c -std=c99 -D_GNU_SOURCE -Wall -Wextra -Werror -O2 -pthread -Isrc/bloomd/ -Ideps/inih/ -Ideps/libev/ -Isrc/libbloom/ src/bloomd/bloomd.c
src/bloomd/bloomd.c: In function 'main':
src/bloomd/bloomd.c:180:43: error: cast between incompatible function types from 'void (*)(worker_args *)' {aka 'void (*)(struct <anonymous> *)'} to 'void * (*)(void *)' [-Werror=cast-function-type]
         pthread_create(&threads[i], NULL, (void*(*)(void*))worker_main, &wargs);
                                           ^
cc1: all warnings being treated as errors
scons: *** [src/bloomd/bloomd.o] Error 1
scons: building terminated because of errors.

Connection drops when flush thread starts.

Hello,

I'm trying to use Bloomd and bloom-python-driver with lots of sequential checks, and I also add keys when they are not present. This process takes longer than the flush interval, and when the flush thread starts, it drops all connections to Bloomd.

I guess this is a bug, or is it made like this on purpose? Do you have any recommendation?

Thank you very much for your help in advance!

writev fails with too many filters

writev(8, [{"START\n", 6}, {"metrics:rewards:WEEKLY:2012-06-0"..., 74}, {"metrics:sessions:DAILY:2012-05-1"..., 77}, {"metrics:ach_lb:DAILY:2012-05-30T"..., 74}, {"metrics:rewards:DAILY:2012-04-19"..., 71}, {"metrics:rewards:ALL_TIME 0.00010"..., 45}, {"metrics:rewards:DAILY:2012-06-05"..., 71}, {"metrics:sessions:DAILY:2012-09-1"..., 77}, {"metrics:offers:DAILY:2012-08-07T"..., 75}, {"metrics:rewards:DAILY:2012-04-21"..., 71}, {"metrics:rewards:WEEKLY:2012-03-2"..., 74}, {"metrics:offers:WEEKLY:2012-04-09"..., 76}, {"metrics:sessions:WEEKLY:2012-10-"..., 78}, {"metrics:sessions:DAILY:2012-03-2"..., 77}, {"metrics:rewards:DAILY:2012-06-09"..., 71}, {"metrics:rewardsgifted:MONTHLY:20"..., 79}, {"metrics:sessions:DAILY:2012-08-1"..., 77}, {"metrics:ach_lb:DAILY:2012-07-26T"..., 75}, {"metrics:ach_lb:DAILY:2012-06-24T"..., 75}, {"metrics:rewardsgifted:DAILY:2012"..., 75}, {"metrics:sessions:DAILY:2012-07-0"..., 77}, {"metrics:sessions:DAILY:2012-09-0"..., 77}, {"metrics:offers:DAILY:2012-05-20T"..., 74}, {"metrics:rewards:DAILY:2012-07-16"..., 73}, {"metrics:sessions:DAILY:2012-04-1"..., 77}, {"metrics:swarm:DAILY:2012-06-30T0"..., 68}, {"metrics:ach_lb:DAILY:2012-06-16T"..., 75}, {"metrics:ach_lb:DAILY:2012-05-19T"..., 74}, {"metrics:ach_lb:DAILY:2012-08-05T"..., 75}, {"metrics:ach_lb:DAILY:2012-07-11T"..., 75}, {"metrics:sessions:DAILY:2012-06-2"..., 77}, {"metrics:offers:DAILY:2012-05-24T"..., 74}, ...], 1270) = -1 EINVAL (Invalid argument)

race condition in filtmgr_check_keys

We currently use bloomd to filter user browser history. bloomd's check performance is very good, about 300K ops/sec (24 workers), BUT we sometimes receive a SIGSEGV in bloomf_contains. Because there are so many filters, we had to move from SATA to SSD. Yesterday I found that filtmgr_check_keys has a race condition that causes the SIGSEGV: changing pthread_rwlock_rdlock(&filt->rwlock); to pthread_rwlock_wrlock(&filt->rwlock); solves the problem. Because thread_safe_fault changes sbf->filter, there is a tiny chance that sbf->filter is read in an inconsistent state.

(Forgive my poor English.)

bloomd hashmap vulnerable to DOS via predictable hash collisions

This is a pretty well-known issue with hashmap implementations; it made the rounds of major programming languages over the past few years, here's Python for an example:

http://bugs.python.org/issue13703

Anywhere untrusted user input is used for a key, collisions can be generated so that all keys land in the same bucket, yielding O(N) lookups (instead of expected O(1)). This can DOS hashmap-implementation consumers (in the case of Python, web apps; for bloomd, unwary users of bloomd).

Memory consumption optimization

Hi,

I've been using bloomd in production since yesterday and I must say I'm impressed by its stability and low CPU consumption. You did a really good job there, congrats!

However, I've got some questions regarding memory consumption. Currently bloomd memory consumption is constantly increasing (RES: 106M, SHR: 105M, VIRT 243M).

Here are my bloom filters after one day.

f1 0.000100 300046 100000 91793
f2 0.000100 105575294 34100000 13919873
f3 0.000100 300046 100000 72710
f3 0.000100 1509656 500000 291040
f4 0.000100 1509656 500000 124098

Note that they are going to keep increasing like that at a nearly constant rate. And since scalable bloom filters work by adding new bloom filters when the size ratio is reached, memory consumption will increase indefinitely.

I'm not an expert in C, but I wondered if you could update the readme to explain how the "Automatically faults cold filters out of memory to save resources" feature works, so that users can take advantage of it.

If I understand it correctly, since my filters here will never go cold (new data is added constantly), I thought maybe I could create filters with a composed name like "f{filterid}{weekoftheyear}{year}", where "{weekoftheyear}{year}" is information extracted from every datum the filters test against. That way, filters with older data could be removed from memory but still be available just in case.

Is this the right approach? What do you think?

Why does the probability value (FP) decrease from initial setting.

Thanks armon, this is a very helpful project! My lack of full understanding of scalable bloom filters is probably the cause of this issue, but I see the "default" probability that I provide when creating a new SBF change from .0000001 to .0001000 on subsequent restarts.

I am not sure whether, as successive plain bloom filters are created with the probability reduction (default .9), the overall error probability tends towards the originally specified prob value (.0000001) as we approach the capacity (in this case 1B), and whether the probability reported by the info command after a restart is the probability of the latest plain filter in the SBF array?

Repro:

$ bloomd &
$ echo "create test capacity=1000000000 prob=0.0000001" | nc 127.0.0.1 8673
$ echo "info test" | nc 127.0.0.1 8673
START
capacity 1000000000
checks 0
check_hits 0
check_misses 0
in_memory 1
page_ins 0
page_outs 0
probability 0.0000001000
sets 0
set_hits 0
set_misses 0
size 0
storage 4792529701
END

$ kill -9 $(ps | grep bloomd | awk '{print $1}')
$ bloomd &
$ echo "info test" | nc 127.0.0.1 8673
START
capacity 1000000000
checks 0
check_hits 0
check_misses 0
in_memory 1
page_ins 0
page_outs 0
probability 0.0001000000
sets 0
set_hits 0
set_misses 0
size 0
storage 4792529701
END

Syslog facilities

Hi there,

It would be nice if we could configure the syslog facilities (facility, priority and tag) in the bloomd config file. Imagine this kind of configuration:

syslog_facility=local0
syslog_tag="myprogram"
syslog_stdout_priority=info
syslog_stderr_priority=err

Is there any client lib implemented in c?

Hello, I want to access bloomd from a service written in C. Is there a client lib I could use? There is hiredis, which helps us access redis from a C service; does such a lib exist for bloomd?

Crash on filter recreate

Hi,

Bloomd works very stably for us in production when we do only adds and checks. We do hundreds of queries per second and it has never crashed.

However, when performing automated tests it crashes quite often. My guess is that it crashes when we recreate the filters between the tests.

I can easily crash bloomd with this script https://gist.github.com/4682164 by running it multiple times, one after another, until it crashes. Like this:

for i in `seq 100`; do ruby crash.rb; sleep 1; done

can not build on ARM CPU

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
gcc -o src/bloomd/art.o -c -std=c99 -D_GNU_SOURCE -Wall -Wextra -Werror -O2 -pthread -Isrc/bloomd/ -Ideps/inih/ -Ideps/libev/ -Isrc/libbloom/ src/bloomd/art.c
src/bloomd/art.c:5:23: fatal error: emmintrin.h: No such file or directory
#include <emmintrin.h>
^
compilation terminated.
scons: *** [src/bloomd/art.o] Error 1
scons: building terminated because of errors.

Can somebody tell me how to fix it?

"scons: building terminated because of errors." -Werror on MacOS 10.7

Using built-in specs.
Target: i686-apple-darwin11
Configured with: /private/var/tmp/llvmgcc42/llvmgcc42-2336.9~22/src/configure --disable-checking --enable-werror --prefix=/Applications/Xcode.app/Contents/Developer/usr/llvm-gcc-4.2 --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-prefix=llvm- --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin11 --enable-llvm=/private/var/tmp/llvmgcc42/llvmgcc42-2336.9~22/dst-llvmCore/Developer/usr/local --program-prefix=i686-apple-darwin11- --host=x86_64-apple-darwin11 --target=i686-apple-darwin11 --with-gxx-include-dir=/usr/include/c++/4.2.1
Thread model: posix
gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.9.00)

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
gcc -o src/bloomd/networking.o -c -std=c99 -D_GNU_SOURCE -O2 -pthread -Isrc/bloomd/ -Ideps/inih/ -Ideps/libev/ -Isrc/libbloom/ src/bloomd/networking.c
gcc -o src/bloomd/conn_handler.o -c -std=c99 -D_GNU_SOURCE -Wall -Wextra -Werror -O2 -pthread -Isrc/bloomd/ -Ideps/inih/ -Ideps/libev/ -Isrc/libbloom/ src/bloomd/conn_handler.c
gcc -o src/bloomd/filter.o -c -std=c99 -D_GNU_SOURCE -Wall -Wextra -Werror -O2 -pthread -Isrc/bloomd/ -Ideps/inih/ -Ideps/libev/ -Isrc/libbloom/ src/bloomd/filter.c
cc1: warnings being treated as errors
src/bloomd/filter.c: In function 'bloomf_delete':
src/bloomd/filter.c:250: warning: passing argument 3 of 'scandir' from incompatible pointer type
src/bloomd/filter.c: In function 'discover_existing_filters':
src/bloomd/filter.c:452: warning: passing argument 3 of 'scandir' from incompatible pointer type
src/bloomd/filter.c: In function 'bloomf_sbf_callback':
src/bloomd/filter.c:596: warning: passing argument 3 of 'scandir' from incompatible pointer type
scons: *** [src/bloomd/filter.o] Error 1
scons: building terminated because of errors.

I've removed -Werror from SConstruct and it did compile.

Everything besides that works. Great job!

create in_memory option is ignored

running bloomd with default options

create bar capacity=89000 prob=0.002 in_memory=0
Done

I am setting in_memory to 0, then

info bar
START
capacity 89000
checks 0
check_hits 0
check_misses 0
in_memory 1      # HERE
page_ins 0
page_outs 0
probability 0.002000
sets 0
set_hits 0
set_misses 0
size 0
storage 197730

It seems bloomd is ignoring the in_memory option. Why?

Add support for a WAL

If we add support for a write-ahead log, then external tools can be used to set up log streaming and build a sane high-availability setup.

Abandoned?

Is this project abandoned? I'm asking this question because the build for the master branch has been broken for almost a year. I think this is a really useful project and I want to use it in a production environment.

(Style nit) MurmurHash3.cpp inconsistent use of 'const'

MurmurHash3_x86_32, MurmurHash3_x86_128, and MurmurHash3_x64_128 all take a 'uint32_t seed', but only MurmurHash3_x64_128 calls it a 'const uint32_t seed'. I don't think the const really matters, this is a pass-by-value C API … and this is a super minor nit. Sorry.

Time outs [+question]

Hi there. Recently we started seeing a lot of timeouts when performing queries against a BIG set (over 200 million filters). A timeout, for us, means the query to bloomd takes longer than 250ms (we even increased the limit to 500ms and it still times out).

Are there any best practices regarding the size of the filters? Is it better to split them into multiple small ones and perform multiple queries? Are there any optimizations we can make to the queries?

Bind bloomd to localhost

I did not see a configuration option to bind bloomd to 127.0.0.1 to make it more secure on public servers.

Build errors (cc1: all warnings being treated as errors)

OS: Pop!_OS 22.04 LTS

Build Log

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
gcc -o src/bloomd/conn_handler.o -c -std=c99 -D_GNU_SOURCE -Wall -Wextra -Werror -O2 -pthread -Isrc/bloomd/ -Ideps/inih/ -Ideps/libev/ -Isrc/libbloom/ src/bloomd/conn_handler.c
src/bloomd/conn_handler.c: In function 'handle_client_connect':
src/bloomd/conn_handler.c:160:5: error: 'arg_buf_len' may be used uninitialized in this function [-Werror=maybe-uninitialized]
  160 |     handle_filt_key_cmd(handle, args, args_len, filtmgr_check_keys);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/bloomd/conn_handler.c:65:18: note: 'arg_buf_len' was declared here
   65 |     int buf_len, arg_buf_len, should_free;
      |                  ^~~~~~~~~~~
cc1: all warnings being treated as errors
scons: *** [src/bloomd/conn_handler.o] Error 1
scons: building terminated because of errors.

And once this is fixed, another error

scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
gcc -o src/bloomd/conn_handler.o -c -std=c99 -D_GNU_SOURCE -Wall -Wextra -Werror -O2 -pthread -Isrc/bloomd/ -Ideps/inih/ -Ideps/libev/ -Isrc/libbloom/ src/bloomd/conn_handler.c
gcc -o src/bloomd/bloomd.o -c -std=c99 -D_GNU_SOURCE -Wall -Wextra -Werror -O2 -pthread -Isrc/bloomd/ -Ideps/inih/ -Ideps/libev/ -Isrc/libbloom/ src/bloomd/bloomd.c
src/bloomd/bloomd.c: In function 'main':
src/bloomd/bloomd.c:180:43: error: cast between incompatible function types from 'void (*)(worker_args *)' to 'void * (*)(void *)' [-Werror=cast-function-type]
  180 |         pthread_create(&threads[i], NULL, (void*(*)(void*))worker_main, &wargs);
      |                                           ^
cc1: all warnings being treated as errors
scons: *** [src/bloomd/bloomd.o] Error 1
scons: building terminated because of errors.
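
For what it's worth, the first error goes away if arg_buf_len is initialized at its declaration. For the second, the conventional fix is to give worker_main the signature pthread_create expects instead of casting the function pointer; a sketch (the real worker body lives in src/bloomd/bloomd.c):

/* pthread entry points must have type void *(*)(void *). Accept a void*
 * and cast inside the body, rather than casting the function pointer at
 * the call site, which -Werror=cast-function-type rejects. */
static void *worker_main(void *arg) {
    worker_args *wargs = arg;
    (void)wargs;  /* ... existing worker loop ... */
    return NULL;
}

/* Call site, no cast needed: */
pthread_create(&threads[i], NULL, worker_main, &wargs);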

NodeJS Client

Just wanted to let you know of a NodeJS implementation that I recently put together, which you can see here:

https://github.com/majelbstoat/node-bloomd

Please feel free to add it to the list of known clients, if you think it suitable.

Thanks for bloomd; it's awesome! :)

rename filter command

It would be great to have server supported atomic renames.

We have a script that builds several large bloom filters, which can take a while. It would be great if we could build the filters under temporary names and then (atomically) rename all of them:

rename temp_filter_1 filter_1, temp_filter_2 filter_2, temp_filter_3 filter_3

That would allow us to build the bloom filters in the background under a pseudonym, while checks against the final filter name fail. Once the build completes, we perform the renames and the client checks suddenly begin working.

Client 1:

c filter_1 key
Filter does not exist

Client 2:

rename temp_filter_1 filter_1

Client 1:

c filter_1 key
No

Failed to accept() connection! Too many open files.

Hi, I'm running your server on one of my machines with a few clients that are trying to put some links into a filter. This is the output from the bloomd server:

Failed to accept() connection! Too many open files.
Failed to accept() connection! Too many open files.
Failed to accept() connection! Too many open files.

And my configuration of client:

client = BloomdClient(["localhost"])
bloom = client.create_filter("domains")
# --- some code and a loop here ---
bloom.add(domain)

So I have almost 10 clients trying to put items into the same filter, sometimes at the same time.
What could be the problem?
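
For what it's worth, accept() failing with "Too many open files" usually means the bloomd process has hit its file-descriptor limit, whether from many client connections or many filter files. Checking and raising the limit in the shell that starts bloomd is a reasonable first step (the value is just an example):

$ ulimit -n        # show the current limit
1024
$ ulimit -n 65536  # raise it, then start bloomd from the same shell
$ ./bloomd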

Question: cuckoo filter

Hi,

I was wondering: by using both bloom and cuckoo filters with different hash functions, and thus reducing the collision rate, it would be highly unlikely for both filters to report a false positive for the same element, and the overall size/memory footprint and performance would be better.

I.e. using both bloom and cuckoo filters with a high FPP (0.01) would lead to a much lower combined collision rate than using only a bloom or cuckoo filter with a low FPP (0.00001), and it would be faster and smaller.

Have you considered adding support for https://github.com/efficient/cuckoofilter ?

Can't compile on alpine linux

Alpine linux uses musl.

$ scons
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
gcc -o src/libbloom/bitmap.o -c -std=c99 -Wall -Werror -Wextra -O2 -D_GNU_SOURCE src/libbloom/bitmap.c
In file included from src/libbloom/bitmap.c:9:0:
/usr/include/sys/errno.h:1:2: error: #warning redirecting incorrect #include <sys/errno.h> to <errno.h> [-Werror=cpp]
 #warning redirecting incorrect #include <sys/errno.h> to <errno.h>
  ^
cc1: all warnings being treated as errors
scons: *** [src/libbloom/bitmap.o] Error 1
scons: building terminated because of errors.

Content of /usr/include/sys/errno.h

#warning redirecting incorrect #include <sys/errno.h> to <errno.h>
#include <errno.h>
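
The warning text itself points at the fix: include <errno.h> directly. A sketch of the one-line change at the top of src/libbloom/bitmap.c (line 9, per the build log above):

/* musl has no real <sys/errno.h>; the standard header is <errno.h>. */
#include <errno.h>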

Higher false positive rate than expected

Hello,

After setting up and running bloomd without a problem, I see a higher false positive rate than expected. I'm using bloom-python-driver, creating the filter with initial_capacity=1000000 and a 0.0001 max false positive probability. Then I insert and check the same URL set several times. After adding almost 110,000 URLs to the filter (in one loop), I run the script again, and this time it adds almost 40,000 of them again.

While the script runs it sometimes returns a "Client Error: Command not supported" error. I suppose this is because of some malformed URLs, so I just pass (in try..except) and continue the loop. Could that be the reason for the subsequent false positives?

When I check a key over a telnet connection, it returns "Yes". What can be the reason for this problem? Do you have any ideas?

Thank you very much for your help in advance...

Bloomd - Internal Error (too many open files)

I just got a pybloomd.BloomdError: Got response: Internal Error (from filters[fname] = client.create_filter(fname); see the code below).

I have 6 python bloomd clients that send a lot of requests to it. The data dir of bloomd currently contains 1023 bloom filters and counting.

My bloomd configuration is:

[bloomd]
tcp_port = 8673
udp_port = 8674
data_dir = /data/bloomd
log_level = INFO
cold_interval = 3600
flush_interval = 300
initial_capacity = 10001
default_probability = 0.0001
workers = 4
use_mmap = 1

Each client, does the following:

for each incoming message:
    filter = getBloomFilter(message.created_at)
    if not filter.__contains__(message.id):
        filter.add(message.id)
        ... do other things ...

with the following python code:

from datetime import datetime
from pybloomd import BloomdClient

client = BloomdClient(['localhost'])
filters = {}
serviceName = "hey"

def getBloomFilterName(created_at):
    now = datetime.utcfromtimestamp(created_at/1000.0)
    now = now.replace(microsecond=0, second=0, minute=0, hour=0)
    return "%s.%s-%s-%s" % (serviceName, now.year, now.month, now.day)


def getBloomFilter(created_at):
    fname = getBloomFilterName(created_at)

    if fname not in filters:
        filters[fname] = client.create_filter(fname)

    return filters[fname]

Multiple clients got the pybloomd.BloomdError: Got response: Internal Error within a ~50 second window while calling client.create_filter(fname). Some of them were under huge load, so I'm sure they were creating a lot of bloom filters at the same time.

It's too bad that this error does not say what the cause was, so I've got several questions and bits of feedback:

  • Is my code correct, or is there a better way to handle filter creation/retrieval?
  • The error (and its cause) was not printed to syslog, but I think it should be.
  • The returned error could be more explicit about the cause.

[Update] In fact I've seen this error in the log:

Apr  8 09:24:33 serv1 bloomd[11870]: Failed to fault in the filter 'f1.2012-8-10'. Err: -1
Apr  8 09:24:45 serv1 bloomd[11870]: Failed to scan files for filter 'f2.2012-11-15'. Too many open files
Apr  8 09:24:59 serv1 bloomd[11870]: Failed to scan files for filter 'f1.2013-1-26'. Too many open files
Apr  8 09:24:59 serv1 bloomd[11870]: Failed to fault in the filter 'f1.2013-1-26'. Err: -1

After a restart:

cat /proc/sys/fs/file-nr
1728    0   6606290

So currently the only way to fix this is to restart bloomd; I think there may be an issue (or room for improvement) in bloomd's file handling.

Again, thanks for your hard work!

bloomd as background daemon

I'm trying to get bloomd running in a cluster configuration with the data_dir on a DRBD volume synced between two nodes. Everything works fine except redirecting stdout: nohup with a redirect, daemonize, and every other approach I've tried seems to fail. Does bloomd require a terminal?

screen -d -m -L ./bloomd seems to be the only option.

Human readable protocol is hard to parse programmatically on error

Hi,

The protocol for bloomd, while very easy to read (for an English-speaker ;) ), makes it hard to program against for some reasonable use cases.

A concrete example:

I'm implementing a "set this key, and automatically create the filter if it doesn't exist, then retry" method for node-bloomd. This will be important for our use case, and for incrementally creating bloom filters on demand instead of provisioning them all up front.

The response I get back from the server, when the filter doesn't exist is "Filter does not exist". String matching against this is flaky, and does not fill me with confidence :) What if you changed it to "Filter doesn't exist"? :)

Another example: it is harder than it should be to differentiate a bulk/multi response, which is a single line of words, from an error, which is also a single line of words.

I'd like to propose that errors come back with a distinct first character, an error code, and then the human readable part. For example:

"_42: Filter does not exist"
"_17: Filter is being deleted"

This would balance the needs of being able to read the response (which is very helpful in debugging), with the ability to respond programmatically to error cases where appropriate.

I don't care about the specific syntax really, only that it be reasonably parseable in a sane manner :)

Thanks for your consideration!

Cheers,

Jamie.

in_memory = 0

I created a filter with in_memory = 0. Why does info show in_memory as 1?

create test capacity=10001 prob=0.01 in_memory=0
Array
(
    [capacity] => 10001
    [checks] => 0
    [check_hits] => 0
    [check_misses] => 0
    [in_memory] => 1
    [page_ins] => 0
    [page_outs] => 0
    [probability] => 0.010000
    [sets] => 0
    [set_hits] => 0
    [set_misses] => 0
    [size] => 0
    [storage] => 18486
)

bloomd.conf

[bloomd]
port = 8673
udp_port = 8674
scale_size = 4
flush_interval = 60
cold_interval = 3600
in_memory = 0
use_mmap = 0
workers = 1
initial_capacity = 100000
default_probability = 1e-4
probability_reduction = 0.9
data_dir = /var/lib/bloomd
log_level = INFO
