jbapple / libfilter Goto Github PK
View Code? Open in Web Editor NEWHigh-speed Bloom filters and taffy filters for C, C++, and Java
License: Apache License 2.0
High-speed Bloom filters and taffy filters for C, C++, and Java
License: Apache License 2.0
See the Binary Fuse Filter paper for the rationale. It increases locality.
I noticed the com.github.jbapple.libfilter.TaffyBlockFilter has a high fpp (100% at some point). Likely there is a simple bug:
private void MakeMask(long hash, int[] result) {
// The loops are done individually for easier auto-vectorization by the JIT
for (int i = 0; i < 8; ++i) {
result[0] = (int) hash * INTERNAL_HASH_SEEDS[i]; <== BUG HERE
}
for (int i = 0; i < 8; ++i) {
result[i] = result[i] >>> (32 - 5);
}
for (int i = 0; i < 8; ++i) {
result[i] = 1 << result[i];
}
}
To me it looks like it should be result[i]
instead of result[0]
.
There's no need for null values in the internal arrays used for constructing filter via peeling
Thanks for the amazing work.
Can we please also add some benchmarking results for different filters.
Trying to understand the Build latency , query latency , fpp rates and bits per value for these different filters to choose something that fits under some of opur sla requirements.
use case is upper bound 50-100 million values to be added , for a given bits/value , what would be the memory size for the filter and build latency when adding huge number of elements to the filter
libfilter is very fast.
I have about 2^36 items to add in bf, it is possible cache the bloom filter to NVMe?
Thank you for implementing this awesome filter, I have a question I would like to ask, I wonder if you can give me some pointers.
I want to use C/C++ to save the filter to SSD so that it can be load again next time, or directly use mmap
the filter to the memory instead of reading the entire filter into the memory.
grateful.
Are these filters serializable ?
can we load and serialize filters in a compatible way in different languages and platform ?
@jbapple I test the blocked bloom filter (which is https://github.com/FastFilter/fastfilter_cpp/blob/master/src/bloom/simd-block.h version) in my codes. The vtune result shows that Find
avx2 implementation's _mm256_testc_si256 opcode's performance is bad.
So, do we have other tricks to improve it ? Maybe a prefetch will help ?
Hi! I've found an important bug in TaffyBlockFilter Java realization which causes creation of new Block Filters with the same fpp.
I guess, in this line there should be x
instead of cursor
.
While i am going through some amazing work you are doing , wanted to get your thoughts on some implementation details or caveats
IIUC SBBF is same as in https://github.com/apache/kudu/blob/master/src/kudu/util/block_bloom_filter.h ? Are there any implementation difference or caveats in your particular approach (any significant perf difference) ? Bit confused around terminology here between block
and split block
so wanted to understand better if your implementation is something additional on top of classical blocked bloom filter.
I found your repo through : https://arxiv.org/pdf/2101.01719.pdf , it says for the split block bloom fpp are somewhat fixed due to pre-determined number of hash functions used . In your repo example is see double fpp = 0.0065;
, is this because we are using a fixed number of hashes and bits per value (pre-configured at the given fpp level) . Am i able to tune this for lower fpp by using more hash functions and may be more bits per value ? What is the current default bits per value set in current SBBF implementation ? which parameter is that ?
In the apache kudu project which uses block bloom filter , here : https://github.com/apache/kudu/blob/master/src/kudu/util/block_bloom_filter.h#L66 it says we are able to get 0.1% fpp with 15 bits per value , is it possible to achieve this in your split block BF implementation as well , or are we stuck to 0.4% - 19% range only ,
thanks for bearing with my naive questions as i am new into this area and still trying to figure out best filter for my case where i have strict latency requirement and need to insert ~10-100 million ids , these ids are not uuid but sequential range [0-100M] , so was trying to use roaring bitmaps before but now looking into this probabilistic filters since constructing filters is a potential bottleneck for me .
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.