Dablooms: A Scalable, Counting, Bloom Filter

Note: this project has been mostly unmaintained for a while.

Overview

This project aims to demonstrate a novel Bloom filter implementation that can scale and that provides not only the addition of new members but also reliable removal of existing members.

Bloom filters are probabilistic data structures that provide space-efficient storage of elements at the cost of possible false positives on membership queries.

dablooms implements such a structure, taking additional metadata to classify elements in order to make an intelligent decision as to which Bloom filter an element should belong to.

Features

dablooms, in addition to the above, has several features.

  • Implemented as a static C library
  • Memory mapped
  • 4 bit counters
  • Sequence counters for clean/dirty checks
  • Python wrapper

For performance, the low-level operations are implemented in C. The filter is also memory mapped, which provides async flushing and persistence at low cost. To maintain memory efficiency, rather than using integers, or even whole bytes, as counters, we use only four-bit counters. These four-bit counters allow up to 15 items to share a counter in the map. If more than a small handful of items were sharing a counter, the Bloom filter would be overloaded (resulting in excessive false positives anyway) at any sane error rate, so there is no benefit in supporting larger counters.
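The four-bit counter layout can be pictured as two counters packed per byte. The following is an illustrative Python sketch of that packing, not dablooms' actual C code:

```python
class NibbleCounters:
    """Illustrative 4-bit counters, two per byte (not the dablooms implementation)."""

    def __init__(self, n):
        self.buf = bytearray((n + 1) // 2)   # one byte holds two counters

    def get(self, i):
        b = self.buf[i // 2]
        return (b >> 4) if i % 2 == 0 else (b & 0x0F)

    def _set(self, i, v):
        b = self.buf[i // 2]
        if i % 2 == 0:
            self.buf[i // 2] = (v << 4) | (b & 0x0F)
        else:
            self.buf[i // 2] = (b & 0xF0) | v

    def incr(self, i):
        v = self.get(i)
        if v < 15:            # saturate at 15, the 4-bit maximum
            self._set(i, v + 1)

    def decr(self, i):
        v = self.get(i)
        if v > 0:             # never underflow below zero
            self._set(i, v - 1)
```

Saturating at 15 rather than wrapping is what makes the "more than a small handful means the filter is overloaded anyway" argument safe.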

The Bloom filter also employs change sequence numbers to track operations performed on the Bloom filter. These allow the application to determine if a write might have only partially completed (perhaps due to a crash), leaving the filter in an inconsistent state. The application can thus determine if a filter is ok or needs to be recreated. The sequence number can be used to determine what a consistent but out-of-date filter missed, and bring it up-to-date.

There are two sequence numbers (and helper functions to get them): "mem_seqnum" and "disk_seqnum". The "mem" variant is useful if the user is sure the OS didn't crash, and the "disk" variant is useful if the OS might have crashed since the Bloom filter was last changed. Each value is either "0", meaning the filter is possibly inconsistent from that point of view, or a non-zero sequence number with which the filter is consistent. The "mem" variant is often non-zero, but the "disk" variant only becomes non-zero right after a (manual) flush. Flushing can be expensive (it's an fsync), so the value can be ignored if not relevant for the application. For example, if the Bloom file exists in a directory which is cleared at boot (like /tmp), then the application can safely assume that any existing file was not affected by an OS crash, and never bother to flush or check disk_seqnum. Schemes involving batching up changes are also possible.

The dablooms library is not inherently thread safe; this is the client's responsibility. Bindings are also not thread safe unless they state otherwise.

Installing

Clone the repo, or download and extract a tarball of a tagged version from GitHub. In the source tree, run make, then make install (sudo may be needed). This installs only the static and dynamic versions of the C dablooms library, "libdablooms".

To use a specific build directory, install prefix, or destination directory for packaging, specify BLDDIR, prefix, or DESTDIR to make. For example: make install BLDDIR=/tmp/dablooms/bld DESTDIR=/tmp/dablooms/pkg prefix=/usr

Look at the output of make help for more options and targets.

Also available are bindings for various other languages:

Python (pydablooms)

To install the Python bindings "pydablooms" (currently only compatible with Python 2.x), run make pydablooms, then make install_pydablooms (sudo may be needed).

To use and install for a specific version of Python installed on your system, use the PYTHON option to make. For example: make install_pydablooms PYTHON=python2.7. You can override the module install location with the PY_MOD_DIR option to make, and the BLDDIR and DESTDIR options also affect pydablooms.

The Makefile attempts to determine the python module location PY_MOD_DIR automatically. It prefers a location in /usr/local, but you can specify PY_MOD_DIR_ARG=--user to try to use the location which pip install --user would use in your HOME dir. You can instead specify PY_MOD_DIR_ARG=--system to prefer the normal/central system python module dir.

See pydablooms/README.md for more info.

Go (godablooms)

The Go bindings "godablooms" are not integrated into the Makefile. Install libdablooms first, then see godablooms/README.md.

Contributing

If you make changes to C portions of dablooms which you would like merged into the upstream repository, it would help to have your code match our C coding style. We use astyle, svn rev 353 or later, on our code, with the following options:

astyle --style=1tbs --lineend=linux --convert-tabs --preserve-date \
       --fill-empty-lines --pad-header --indent-switches           \
       --align-pointer=name --align-reference=name --pad-oper -n     <files>

Testing

To run a quick and dirty test, type make test. This test uses a list of words and defaults to /usr/share/dict/words. If your path differs, you can use the WORDS flag to specify its location, such as make test WORDS=/usr/dict/words.

This will run a simple test that iterates through a word list and adds each word to dablooms. It iterates again, removing every fifth element. Lastly, it saves the file, opens a new filter, and iterates a third time checking the existence of each word. It prints results of the true negatives, false positives, true positives, and false negatives, and the false positive rate.

The false positive rate is calculated as "false positives / (false positives + true negatives)", i.e. the fraction of real negatives that come back as false positives. This is the interesting statistic because the rate of false negatives should always be zero.
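As a one-line sanity check, the statistic above is just the fraction of actual negatives that were reported positive:

```python
def false_positive_rate(false_positives, true_negatives):
    """Rate of real negatives that the filter (wrongly) reports as positive."""
    return false_positives / (false_positives + true_negatives)
```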

The test uses a maximum error rate of .05 (5%) and an initial capacity of 100k. If the dictionary is near 500k, we should have created 4 new filters in order to scale to size.

A second test adds every other word in the list and removes no words, causing each used filter to stay at maximum capacity, which is a worst case for accuracy.

Check out the performance yourself, and check out the size of the resulting file!

Bloom Filter Basics

Bloom filters are probabilistic data structures that provide space-efficient storage of elements at the cost of occasional false positives on membership queries, i.e. a Bloom filter may state true on a query when it in fact does not contain said element. A Bloom filter is traditionally implemented as an array of M bits, where M is the size of the Bloom filter. On initialization all bits are set to zero. A filter is also parameterized by a constant k that defines the number of hash functions used to set and test bits in the filter. Each hash function should output one index into the M bits. When inserting an element x into the filter, the bits at the k indices h1(x), h2(x), ..., hk(x) are set.

In order to query a Bloom filter, say for element x, it suffices to verify if all bits in indices h1(x), h2(x), ..., hk(x) are set. If one or more of these bits is not set then the queried element is definitely not present in the filter. However, if all these bits are set, then the element is considered to be in the filter. Given this procedure, an error probability exists for positive matches, since the tested indices might have been set by the insertion of other elements.
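The insert/query procedure described above can be sketched in a few lines of Python. This is illustrative only; dablooms itself derives its indices from MurmurHash in C, whereas this sketch slices one SHA-256 digest into k indices:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: k bit positions per element, no removal."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _indices(self, x):
        # Derive k indices from a single digest (a common simplification).
        digest = hashlib.sha256(x.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add(self, x):
        for i in self._indices(x):
            self.bits[i] = 1

    def check(self, x):
        # False means definitely absent; True means "probably present".
        return all(self.bits[i] for i in self._indices(x))
```

Note that check() can only err in one direction: an element that was added is always reported present.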

Counting Bloom Filters: Solving Removals

The same property that results in false positives also makes it difficult to remove an element from the filter, as there is no easy means of discerning whether another element hashes to the same bit. Unsetting a bit that is hashed to by multiple elements would cause false negatives. Using a counter instead of a bit circumvents this issue: the counter is incremented when an element hashes to a given location, and decremented upon removal. Membership queries rely on whether a given counter is greater than zero. This reduces the exceptional space-efficiency provided by the standard Bloom filter.
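The counting variant replaces each bit with a counter, making removal possible. Again an illustrative sketch (not the dablooms implementation, which packs 4-bit counters in a memory-mapped file):

```python
import hashlib

class CountingBloom:
    """Counting Bloom filter sketch: counters instead of bits allow removal."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counts = [0] * m

    def _indices(self, x):
        digest = hashlib.sha256(x.encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add(self, x):
        for i in self._indices(x):
            self.counts[i] += 1

    def remove(self, x):
        for i in self._indices(x):
            if self.counts[i] > 0:   # guard against underflow
                self.counts[i] -= 1

    def check(self, x):
        # Membership relies on every counter being greater than zero.
        return all(self.counts[i] > 0 for i in self._indices(x))
```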

Scalable Bloom Filters: Solving Scale

Another important property of a Bloom filter is the linear relationship between its size and its storage capacity. If the maximum allowable error probability and the number of elements to store are both known, it is relatively straightforward to dimension an appropriate filter. However, it is not always possible to know a priori how many elements will need to be stored. There is a trade-off between over-dimensioning filters and suffering from a ballooning error probability as the filter fills.

Almeida, Baquero, Preguiça, and Hutchison published a paper in 2006 on Scalable Bloom Filters, which proposed building a scalable Bloom filter as, essentially, a list of Bloom filters that together act as one large Bloom filter. When greater capacity is desired, a new filter is added to the list.

Membership queries are conducted on each filter; a query is positive if the element is found in any one of the filters. Naively, this leads to a compounding error probability, since the error probability of the combined structure evaluates to:

1 - ∏(1 - Pi)

It is possible to bound this error probability by adding a reducing tightening ratio, r. As a result, the bounded error probability is represented as:

1 - ∏(1 - P0 * r^i), where r is chosen such that 0 < r < 1

Since size is simply a function of error probability and capacity, any growth function can be applied to scale the size of the Bloom filter as necessary. We found it sufficient to pick .9 for r.
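The effect of the tightening ratio can be checked numerically. This sketch assumes each filter i contributes an independent false-positive probability P0 * r^i, matching the bound above:

```python
def compound_error(p0, r, n_filters):
    """Probability that at least one of n filters reports a false positive,
    with filter i contributing error probability p0 * r**i."""
    prob_none = 1.0
    for i in range(n_filters):
        prob_none *= 1.0 - p0 * r ** i
    return 1.0 - prob_none
```

Because the per-filter errors form a geometric series, the compound error stays below p0 / (1 - r) no matter how many filters are added; with p0 = 0.05 and r = 0.9 that cap is 0.5.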

Problems with Mixing Scalable and Counting Bloom Filters

Scalable Bloom filters do not allow for the removal of elements from the filter. In addition, simply converting each Bloom filter in a scalable Bloom filter into a counting filter also poses problems. Since an element can be in any filter, and Bloom filters inherently allow for false positives, a given element may appear to be in two or more filters. If an element is inadvertently removed from a filter which did not contain it, it would introduce the possibility of false negatives.

If however, an element can be removed from the correct filter, it maintains the integrity of said filter, i.e. prevents the possibility of false negatives. Thus, a scaling, counting, Bloom filter is possible if upon additions and deletions one can correctly decide which Bloom filter contains the element.

There are several advantages to using a Bloom filter. A Bloom filter gives the application cheap, memory-efficient set operations, with no actual data stored about the given element. Rather, Bloom filters allow the application to test, with some given error probability, the membership of an item. It follows that the majority of operations performed on Bloom filters are membership queries, rather than additions and removals. Thus, for a scaling, counting Bloom filter, we can optimize for membership queries at the expense of additions and removals. This expense comes not in performance, but in additional metadata concerning an element and its relation to the Bloom filter. With the addition of some sort of identifier for an element, which does not need to be unique as long as it is fairly distributed, it is possible to correctly determine which filter an element belongs to, and thereby to maintain the integrity of a given Bloom filter with accurate additions and removals.

Enter dablooms

dablooms is one such implementation of a scaling, counting, Bloom filter that takes additional metadata during additions and deletions in the form of a (generally) monotonically increasing integer to classify elements (possibly a timestamp). This is used during additions/removals to easily determine the correct Bloom filter for an element (each filter is assigned a range). Checking an item against the Bloom filter, which is assumed to be the dominant activity, does not use the id (it works like a normal scaling Bloom filter).

dablooms is designed to scale itself using these identifiers and the given capacity. When a Bloom filter is at capacity, dablooms creates a new Bloom filter which starts at the next id after the greatest id of the previous Bloom filter. Because the identifiers monotonically increase, new elements are added to the newest Bloom filter. Note that, in theory and as implemented, nothing prevents one from adding an element to any "older" filter; you just run an increasing risk of the error probability growing beyond the bound as the filter becomes "overfilled".

You can then remove any element from any Bloom filter using the identifier to intelligently pick which Bloom filter to remove from. Consequently, as you continue to remove elements from Bloom filters that you are not continuing to add to, these Bloom filters will become more accurate.

The "id" of an element does not need to be known to check the Bloom filter, but does need to be known when the element is removed (and the same as when it was added). This might be convenient if the item already has an appropriate id (almost always increasing for new items) associated with it.
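Choosing which filter holds an element thus reduces to a range lookup over the filters' starting ids. A hedged sketch (the list of start ids and the bisect approach here are illustrative, not the dablooms internals):

```python
import bisect

def filter_index(filter_start_ids, item_id):
    """Index of the filter whose id range contains item_id.

    filter_start_ids must be sorted ascending; each filter covers ids from
    its start up to (but not including) the next filter's start, mirroring
    the "new filter starts at the next id" rule described above.
    """
    return bisect.bisect_right(filter_start_ids, item_id) - 1

# Hypothetical ranges: filters created at ids 0, 1000, and 2000.
starts = [0, 1000, 2000]
```

Additions and removals consult this mapping; membership checks never need to, which is why they stay as cheap as in a plain scaling Bloom filter.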

Example use case

There is a database with a collection of entries, and a series of items, each of which you want to look up in the database; most will have no entry in the database, but some will. Perhaps it's a database of spam links. If you use dablooms in front of the database, you can avoid needing to check the database for almost all items which won't be found in it anyway, saving a lot of time and effort. It's also much easier to distribute the Bloom filter than the entire database.

To make it work, you need to determine an "id" whenever you add to or remove from the Bloom filter. You could store the timestamp when you add the item to the database as another column in the database, and give it to scaling_bloom_add() as well. When you remove the item, you look it up in the database first and pass the timestamp stored there to scaling_bloom_remove(). The timestamps for new items will be equal or greater, and definitely greater over time. Instead of timestamps, you could also use an auto-incrementing index.

Checks against the Bloom filter don't need to know the id and should be quick. If a check comes back negative, you can be sure the item isn't in the database, and skip that query completely. If a check comes back positive, you have to query the database, because there's a slight chance that the item isn't actually in there.

Contributors

chobie, drtall, lilydjwg, mreiferson, peterjc, ploxiln, suranap, vmg, zhemao

dablooms's Issues

pydablooms compilation failing on OSX 10.8.5

I'm getting a linker error. GCC and Python versions included. Any recommendations?

(bloom)09:26:30 dablooms@master$ make
 CC  build/dablooms.o
 CC  build/murmur.o
 AR  build/libdablooms.a
 SO  build/libdablooms.1.1.dylib
 SYMLNK  build/libdablooms.dylib
 SYMLNK  build/libdablooms.1.dylib
(bloom)09:26:31 dablooms@master$ make pydablooms
 PY_BUILD build/python/pydablooms.so
clang: warning: argument unused during compilation: '-mno-fused-madd'
ld: warning: ignoring file build/libdablooms.a, file was built for archive which is not the architecture being linked (i386): build/libdablooms.a
(bloom)09:26:34 dablooms@master$
(bloom)09:27:51 dablooms@master$ gcc --version
i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

(bloom)09:27:53 dablooms@master$ python --version
Python 2.7.2
(bloom)09:27:55 dablooms@master$

make install_pydablooms ignores PREFIX argument

I have tried to install dablooms under my home directory on Linux,

$ git clone https://github.com/bitly/dablooms.git
Cloning into dablooms...
remote: Counting objects: 391, done.
remote: Compressing objects: 100% (193/193), done.
remote: Total 391 (delta 209), reused 363 (delta 193)
Receiving objects: 100% (391/391), 86.44 KiB, done.
Resolving deltas: 100% (209/209), done.
$ cd dablooms/
$ ls
godablooms  LICENSE  Makefile  phpdablooms  pydablooms  README.md  src
$ make
 CC  build/dablooms.o
 CC  build/murmur.o
 AR  build/libdablooms.a
 SO  build/libdablooms.so.1.0
 SYMLNK  build/libdablooms.so
 SYMLNK  build/libdablooms.so.1
$ make install PREFIX=$HOME
 INSTALL  /home/pjcock/lib/libdablooms.a
 INSTALL  /home/pjcock/lib/libdablooms.so.1.0
 SYMLNK  /home/pjcock/lib/libdablooms.so
 SYMLNK  /home/pjcock/lib/libdablooms.so.1
 INSTALL  /home/pjcock/include/dablooms.h

That all seems fine, but for the Python bindings the prefix seems to be ignored:

$ make install_pydablooms PYTHON=python2.7 PREFIX=$HOME
 PY_INSTALL  /usr/local/lib/python2.7/site-packages/pydablooms.so
install: cannot change permissions of `/usr/local/lib/python2.7/site-packages/': Operation not permitted
make: *** [/usr/local/lib/python2.7/site-packages/pydablooms.so] Error 1

There appears to be nothing in the Makefile to attempt to add --prefix=$PREFIX or similar to the call to setup.py (as is typically supported, see http://docs.python.org/install/index.html for an example), and furthermore attempting to call pydablooms/setup.py directly with this optional argument fails.

undefined reference to `log`, `ceil`, `pow`

trying to make this package so I can use it with golang, keep getting these errors

$ make test
 CC  build/test_dablooms.o
 CC  build/dablooms.o
 CC  build/murmur.o
 AR  build/libdablooms.a
 LD  build/test_dablooms
build/libdablooms.a(dablooms.o): In function `counting_bloom_init':
/work/dablooms/src/dablooms.c:226: undefined reference to `log'
/work/dablooms/src/dablooms.c:226: undefined reference to `ceil'
/work/dablooms/src/dablooms.c:227: undefined reference to `log'
/work/dablooms/src/dablooms.c:227: undefined reference to `ceil'
build/libdablooms.a(dablooms.o): In function `new_counting_bloom_from_scale':
/work/dablooms/src/dablooms.c:327: undefined reference to `pow'
collect2: error: ld returned 1 exit status
Makefile:124: recipe for target 'build/test_dablooms' failed
make: *** [build/test_dablooms] Error 1

some googling tells me to add -lm but make test ALL_LDFLAGS="-lm" doesn't seem to do the trick.

sorry if this is supposed to be something obvious, but I'm not too familiar with compiling C packages and the README.md doesn't mention anything along these lines.

$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
$ uname -a
Linux sakura 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

about the Makefile

Dablooms is very helpful for understanding the Bloom filter. The Makefile of dablooms is very good and professional, but it is a little too professional for beginners :)

I believe dablooms would prefer to let people concentrate on the Bloom filter itself rather than spend time reading and understanding the Makefile.

I wish there were a very simple version of the Makefile.

pydablooms compiling in Ubuntu 12.04 64bit

make pydablooms
PY_BUILD build/python/pydablooms.so
pydablooms/pydablooms.c: In function ‘Dablooms_dealloc’:
pydablooms/pydablooms.c:15:9: error: ‘Dablooms’ has no member named ‘ob_type’
pydablooms/pydablooms.c: In function ‘contains’:
pydablooms/pydablooms.c:50:5: warning: implicit declaration of function ‘PyString_Check’ [-Wimplicit-function-declaration]
pydablooms/pydablooms.c:54:32: warning: implicit declaration of function ‘PyString_AsString’ [-Wimplicit-function-declaration]
pydablooms/pydablooms.c:55:32: warning: implicit declaration of function ‘PyString_Size’ [-Wimplicit-function-declaration]
pydablooms/pydablooms.c:55:32: warning: passing argument 2 of ‘scaling_bloom_check’ makes pointer from integer without a cast [enabled by default]
src/dablooms.h:76:5: note: expected ‘const char *’ but argument is of type ‘int’
pydablooms/pydablooms.c: At top level:
pydablooms/pydablooms.c:138:5: warning: missing braces around initializer [-Wmissing-braces]
pydablooms/pydablooms.c:138:5: warning: (near initialization for ‘DabloomsType.ob_base.ob_base’) [-Wmissing-braces]
pydablooms/pydablooms.c:140:5: warning: initialization makes integer from pointer without a cast [enabled by default]
pydablooms/pydablooms.c:140:5: warning: (near initialization for ‘DabloomsType.tp_basicsize’) [enabled by default]
pydablooms/pydablooms.c:143:5: warning: initialization from incompatible pointer type [enabled by default]
pydablooms/pydablooms.c:143:5: warning: (near initialization for ‘DabloomsType.tp_print’) [enabled by default]
pydablooms/pydablooms.c:150:5: warning: initialization from incompatible pointer type [enabled by default]
pydablooms/pydablooms.c:150:5: warning: (near initialization for ‘DabloomsType.tp_as_mapping’) [enabled by default]
pydablooms/pydablooms.c:158:5: warning: initialization makes pointer from integer without a cast [enabled by default]
pydablooms/pydablooms.c:158:5: warning: (near initialization for ‘DabloomsType.tp_doc’) [enabled by default]
pydablooms/pydablooms.c:159:5: warning: initialization from incompatible pointer type [enabled by default]
pydablooms/pydablooms.c:159:5: warning: (near initialization for ‘DabloomsType.tp_traverse’) [enabled by default]
pydablooms/pydablooms.c:166:5: warning: initialization from incompatible pointer type [enabled by default]
pydablooms/pydablooms.c:166:5: warning: (near initialization for ‘DabloomsType.tp_members’) [enabled by default]
pydablooms/pydablooms.c:167:5: warning: initialization from incompatible pointer type [enabled by default]
pydablooms/pydablooms.c:167:5: warning: (near initialization for ‘DabloomsType.tp_getset’) [enabled by default]
pydablooms/pydablooms.c:174:5: warning: initialization from incompatible pointer type [enabled by default]
pydablooms/pydablooms.c:174:5: warning: (near initialization for ‘DabloomsType.tp_alloc’) [enabled by default]
pydablooms/pydablooms.c:176:5: warning: initialization from incompatible pointer type [enabled by default]
pydablooms/pydablooms.c:176:5: warning: (near initialization for ‘DabloomsType.tp_free’) [enabled by default]
pydablooms/pydablooms.c: In function ‘initpydablooms’:
pydablooms/pydablooms.c:210:9: warning: ‘return’ with no value, in function returning non-void [-Wreturn-type]
pydablooms/pydablooms.c:213:5: warning: implicit declaration of function ‘Py_InitModule3’ [-Wimplicit-function-declaration]
pydablooms/pydablooms.c:213:7: warning: assignment makes pointer from integer without a cast [enabled by default]
pydablooms/pydablooms.c:216:9: warning: ‘return’ with no value, in function returning non-void [-Wreturn-type]
pydablooms/pydablooms.c:219:5: warning: implicit declaration of function ‘PyString_FromString’ [-Wimplicit-function-declaration]
pydablooms/pydablooms.c:219:5: warning: passing argument 3 of ‘PyModule_AddObject’ makes pointer from integer without a cast [enabled by default]
/usr/include/python3.4m/modsupport.h:47:17: note: expected ‘struct PyObject *’ but argument is of type ‘int’
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
make: *** [build/python/pydablooms.so] Error 1

Linux luna 3.2.0-70-generic #105-Ubuntu SMP Wed Sep 24 19:49:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

counting_bloom_add() segfaults

While testing your counting bloom filter implementation, I got an ugly segfault (the scaling bloom filter works fine btw).

This is the snippet which segfaults for me:

https://gist.github.com/3444985

Unfortunately I don't get more information out of gdb as the following:

Program received signal SIGSEGV, Segmentation fault.
0x00000000004018bd in counting_bloom_add (bloom=0x604010, s=0x402a00 "test", len=4) at src/dablooms.c:336
336     (*bloom->header->count)++;

pydablooms compiling in OS X Mavericks 64bit

$ make pydablooms

PY_BUILD build/python/pydablooms.so
pydablooms/pydablooms.c:15:11: error: no member named 'ob_type' in 'Dablooms'
    self->ob_type->tp_free((PyObject *)self);
    ~~~~  ^
pydablooms/pydablooms.c:50:10: warning: implicit declaration of function 'PyString_Check' is invalid in C99 [-Wimplicit-function-declaration]
    if (!PyString_Check(key)) {
         ^
pydablooms/pydablooms.c:54:32: warning: implicit declaration of function 'PyString_AsString' is invalid in C99 [-Wimplicit-function-declaration]
                               PyString_AsString(key),
                               ^
pydablooms/pydablooms.c:55:37: warning: implicit declaration of function 'PyString_Size' is invalid in C99 [-Wimplicit-function-declaration]
                               PyString_Size(key));
                               ^
pydablooms/pydablooms.c:54:32: warning: incompatible integer to pointer conversion passing 'int' to parameter of type 'const char *' [-Wint-conversion]
                               PyString_AsString(key),
                               ^~~~~~~~~~~~~~~~~~~~~~
src/dablooms.h:76:61: note: passing argument to parameter 's' here
int scaling_bloom_check(scaling_bloom_t *bloom, const char *s, size_t len);
                                                            ^
pydablooms/pydablooms.c:138:5: warning: suggest braces around initialization of subobject [-Wmissing-braces]
    PyObject_HEAD_INIT(NULL)
    ^~~~~~~~~~~~~~~~~~~~~~~~
/Library/Frameworks/Python.framework/Versions/3.4/include/python3.4m/object.h:86:5: note: expanded from macro 'PyObject_HEAD_INIT'
    1, type },
    ^~~~~~~
pydablooms/pydablooms.c:140:5: warning: incompatible pointer to integer conversion initializing 'Py_ssize_t' (aka 'long') with an expression of type 'char [20]' [-Wint-conversion]
    "pydablooms.Dablooms", /* tp_name */
    ^~~~~~~~~~~~~~~~~~~~~
pydablooms/pydablooms.c:143:5: warning: incompatible pointer types initializing 'printfunc' (aka 'int (*)(PyObject *, FILE *, int)') with an expression of type 'destructor' (aka 'void (*)(PyObject *)') [-Wincompatible-pointer-types]
    Dablooms_dealloc, /* tp_dealloc */
    ^~~~~~~~~~~~~~~~
pydablooms/pydablooms.c:150:5: warning: incompatible pointer types initializing 'PyMappingMethods *' with an expression of type 'PySequenceMethods *' [-Wincompatible-pointer-types]
    &Dablooms_sequence, /* tp_as_sequence */
    ^~~~~~~~~~~~~~~~~~
pydablooms/pydablooms.c:158:5: warning: incompatible integer to pointer conversion initializing 'const char *' with an expression of type 'unsigned long' [-Wint-conversion]
    Py_TPFLAGS_DEFAULT, /* tp_flags */
    ^~~~~~~~~~~~~~~~~~
/Library/Frameworks/Python.framework/Versions/3.4/include/python3.4m/object.h:643:29: note: expanded from macro 'Py_TPFLAGS_DEFAULT'
#define Py_TPFLAGS_DEFAULT ( \
                            ^~~
pydablooms/pydablooms.c:159:5: warning: incompatible pointer types initializing 'traverseproc' (aka 'int (*)(PyObject *, visitproc, void *)') with an expression of type 'char [17]' [-Wincompatible-pointer-types]
    "Dablooms objects", /* tp_doc */
    ^~~~~~~~~~~~~~~~~~
pydablooms/pydablooms.c:166:5: warning: incompatible pointer types initializing 'struct PyMemberDef *' with an expression of type 'PyMethodDef [7]' [-Wincompatible-pointer-types]
    Dablooms_methods, /* tp_methods */
    ^~~~~~~~~~~~~~~~
pydablooms/pydablooms.c:167:5: warning: incompatible pointer types initializing 'struct PyGetSetDef *' with an expression of type 'PyMemberDef [1]' [-Wincompatible-pointer-types]
    Dablooms_members, /* tp_members */
    ^~~~~~~~~~~~~~~~
pydablooms/pydablooms.c:174:5: warning: incompatible pointer types initializing 'allocfunc' (aka 'PyObject *(*)(struct _typeobject *, Py_ssize_t)') with an expression of type 'initproc' (aka 'int (*)(PyObject *, PyObject *, PyObject *)') [-Wincompatible-pointer-types]
    Dablooms_init, /* tp_init */
    ^~~~~~~~~~~~~
pydablooms/pydablooms.c:176:5: warning: incompatible pointer types initializing 'freefunc' (aka 'void (*)(void *)') with an expression of type 'PyObject *(*)(PyTypeObject *, PyObject *, PyObject *)' [-Wincompatible-pointer-types]
    Dablooms_new, /* tp_new */
    ^~~~~~~~~~~~
pydablooms/pydablooms.c:210:9: error: non-void function 'initpydablooms' should return a value [-Wreturn-type]
        return;
        ^
pydablooms/pydablooms.c:213:9: warning: implicit declaration of function 'Py_InitModule3' is invalid in C99 [-Wimplicit-function-declaration]
    m = Py_InitModule3("pydablooms", pydablooms_methods, "Dablooms module");
        ^
pydablooms/pydablooms.c:213:7: warning: incompatible integer to pointer conversion assigning to 'PyObject *' (aka 'struct _object *') from 'int' [-Wint-conversion]
    m = Py_InitModule3("pydablooms", pydablooms_methods, "Dablooms module");
      ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
pydablooms/pydablooms.c:216:9: error: non-void function 'initpydablooms' should return a value [-Wreturn-type]
        return;
        ^
pydablooms/pydablooms.c:219:42: warning: implicit declaration of function 'PyString_FromString' is invalid in C99 [-Wimplicit-function-declaration]
    PyModule_AddObject(m, "version", PyString_FromString(dablooms_version()));
                                     ^
pydablooms/pydablooms.c:219:42: warning: incompatible integer to pointer conversion passing 'int' to parameter of type 'PyObject *' (aka 'struct _object *') [-Wint-conversion]
    PyModule_AddObject(m, "version", PyString_FromString(dablooms_version()));
                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Library/Frameworks/Python.framework/Versions/3.4/include/python3.4m/modsupport.h:47:72: note: passing argument to parameter here
PyAPI_FUNC(int) PyModule_AddObject(PyObject *, const char *, PyObject *);
                                                                       ^
18 warnings and 3 errors generated.
error: command '/usr/bin/clang' failed with exit status 1
make: *** [build/python/pydablooms.so] Error 1

$ uname -a
Darwin .local 13.4.0 Darwin Kernel Version 13.4.0: Wed Dec 17 19:05:52 PST 2014; root:xnu-2422.115.10~1/RELEASE_X86_64 x86_64

Can't build

undefined reference to pow.
It seems to be a problem with linking against the math library.

Force counts_per_func to nearest prime > counts_per_func ?

In the paper referenced for the bloom filter hashing algorithm used (https://github.com/bitly/dablooms/blob/master/src/dablooms.c#L185) (Kirsch, Mitzenmacher [2006]) - the recommended algorithm is given by:

gi(x) = h1(x) + i*h2(x) (mod p)

From my reading it seems that their theoretical estimate for collision probabilities will only hold provided that counts_per_func is prime. It is also notable that 64 bits of entropy from MurmurHash3_x64_128 is discarded without being used. I'm not sure how significant the difference is on average, but it may make sense to either:
a) force counts_per_func to the nearest prime greater than counts_per_func
or
b) use a triple hashing algorithm taking advantage of the unused entropy provided by murmur hash

error in free_counting_bloom.

There is a serious error in free_counting_bloom that causes a memory leak. Don't free bloom->bitmap directly; call free_bitmap(bloom->bitmap) instead:

int free_counting_bloom(counting_bloom_t *bloom)
{
    if (bloom != NULL) {
        free(bloom->hashes);
        bloom->hashes = NULL;
        /* was: free(bloom->bitmap); */
        free_bitmap(bloom->bitmap);
        free(bloom);
        bloom = NULL;
    }
    return 0;
}
