
Poppy



Poppy is a Rust crate offering an efficient implementation of Bloom filters. It also includes a command-line utility (also called poppy) allowing users to effortlessly create filters with their desired capacity and false positive probability. Values can be added to the filters via standard input, facilitating the use of this tool in a pipeline workflow.

Poppy ensures cross-compatibility with the Bloom filter format used by the DCSO bloom software, but also provides its own Bloom filter implementation and format.

FAQ

Which format should I choose?

It depends on what you want to achieve. If you need cross-compatibility with the DCSO tools and library, you must choose the DCSO format. In any other scenario we advise using the Poppy format (the default), as it is more robust, faster, and provides room for customization. A comparison between the two formats and implementations can be found in this blog post. Both the library and the CLI choose the Poppy format by default. To select the DCSO format when creating a filter from the CLI, use poppy create --version 1.

How do I build the project?

Regular building

cargo build --release --bins

Building with MUSL (static binary)

# You can skip this step if you already have musl installed
rustup target add x86_64-unknown-linux-musl
# Build poppy with musl target
cargo build --release --target=x86_64-unknown-linux-musl --bins

How do I use Poppy from other languages?

In Python

Poppy comes with Python bindings, using the great PyO3 crate.

Please take a look at Poppy Bindings for further details.

Command Line Interface

Installation

To install the poppy command-line utility, run: cargo install poppy-filters

Alternatively, clone this repository and compile it from source with cargo.

Usage

Usage: poppy [OPTIONS] <COMMAND>

Commands:
  create  Create a new bloom filter
  insert  Insert data into an existing bloom filter
  check   Checks entries against an existing bloom filter
  bench   Benchmark the bloom filter. If the bloom filter behaves in an unexpected way, the benchmark fails. Input data is read from stdin
  show    Show information about an existing bloom filter
  help    Print this message or the help of the given subcommand(s)

Options:
  -v, --verbose      Verbose output
  -j, --jobs <JOBS>  The number of jobs to use when parallelization is possible. For write operations the original filter is copied into the memory of each job so you can expect the memory of the whole process to be N times the size of the filter [default: 2]
  -h, --help         Print help

Every command has its own arguments and help information. For example, to get help for the create command, run: poppy help create.

Examples

Creating an empty Bloom filter

# creating a filter with a desired capacity `-c` and false positive probability `-p`
poppy create -c 1000 -p 0.001 /path/to/output/filter.pop

# showing information about the filter we just created
poppy show /path/to/output/filter.pop

Inserting data into the filter

Data can be inserted into the filter in two ways: from stdin or from files. Reading from stdin cannot be parallelized, so if you want to insert a lot of data quickly, insert from files and set the number of CPUs to use with the -j option.

# insertion from stdin
cat data-1.txt data-2.txt | poppy insert filter.pop
# verify the number of elements in the filter
poppy show filter.pop

# insertion from files
poppy insert filter.pop data-1.txt data-2.txt
# verify the number of elements in the filter
poppy show filter.pop

# insertion from several files in parallel
poppy -j 0 insert filter.pop data-1.txt data-2.txt
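The -j help says each job works on its own copy of the filter; the reason per-job copies can simply be merged afterwards is that Bloom filters combine by bitwise OR. A minimal sketch with plain bit vectors standing in for the filter (an assumption about the strategy for illustration, not poppy's actual code):

```rust
use std::thread;

// Set / test one bit in a packed u64 bit vector.
fn set_bit(bits: &mut [u64], idx: usize) {
    bits[idx / 64] |= 1u64 << (idx % 64);
}

fn get_bit(bits: &[u64], idx: usize) -> bool {
    bits[idx / 64] & (1u64 << (idx % 64)) != 0
}

// Each "job" inserts its chunk into a private bit vector; the copies are
// then merged with bitwise OR, which is exactly the union of all inserts.
fn parallel_build(chunks: Vec<Vec<usize>>, words: usize) -> Vec<u64> {
    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            thread::spawn(move || {
                let mut bits = vec![0u64; words];
                for idx in chunk {
                    set_bit(&mut bits, idx);
                }
                bits
            })
        })
        .collect();
    let mut merged = vec![0u64; words];
    for h in handles {
        for (m, s) in merged.iter_mut().zip(h.join().unwrap()) {
            *m |= s;
        }
    }
    merged
}

fn main() {
    let bits = parallel_build(vec![vec![1, 500], vec![7, 900]], 16);
    assert!(get_bit(&bits, 1) && get_bit(&bits, 500));
    assert!(get_bit(&bits, 7) && get_bit(&bits, 900));
    assert!(!get_bit(&bits, 2));
    println!("bits set by every job survive the OR-merge");
}
```

This also explains the memory note in the -j help: with N jobs there are N private copies of the bit array alive at once.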

Creating and Inserting in one command

You can easily create a filter directly from a dataset. In this case the filter capacity is set to the number of entries in the dataset.

# this creates a new filter saved in filter.pop with all entries (one per line)
# found in .txt files under the dataset directory using available CPUs (-j 0)
poppy -j 0 create -p 0.001 /path/to/output/filter.pop /path/to/dataset/*.txt

Checking if some data is in the filter

The check operation comes in the same variants as insertion: either from stdin or from files (when you need to take advantage of parallelization). By default, every entry found in the filter is printed to stdout.

# check from stdin
cat data-1.txt data-2.txt | poppy check filter.pop

# check from files
poppy check filter.pop data-1.txt data-2.txt

# check from several files in parallel
poppy -j 0 check filter.pop data-1.txt data-2.txt

Benchmarking filter

Benchmarking a filter is an important step, as it allows you to make sure that what you get matches what you expected in terms of false positive probability. The benchmark takes data already inserted in the filter, randomly mutates entries, and checks them against the filter.

# run a benchmark against data known to be in the filter
cat data-1.txt data-2.txt | poppy bench filter.pop
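The mutate-and-check idea can be sketched as follows. As an assumption for the sake of a self-contained example, a HashSet stands in for the filter, so the measured rate is exactly 0; a real Bloom filter would return a small positive rate close to the configured p:

```rust
use std::collections::HashSet;

// For each entry known to be in the filter, produce a mutated entry that is
// (almost certainly) not a member, query it, and count the positives: the
// positive count over the query count estimates the false positive rate.
fn estimate_fp_rate(members: &[String], filter: &HashSet<String>) -> f64 {
    let mut false_positives = 0usize;
    for m in members {
        let mut bytes = m.clone().into_bytes();
        if let Some(b) = bytes.first_mut() {
            *b ^= 0xFF; // flip the first byte to leave the member set
        }
        let mutated = String::from_utf8_lossy(&bytes).into_owned();
        if filter.contains(&mutated) {
            false_positives += 1;
        }
    }
    // Note: querying 0 mutated entries yields 0.0 / 0.0 == NaN.
    false_positives as f64 / members.len() as f64
}

fn main() {
    let members: Vec<String> = vec!["alpha".into(), "beta".into(), "gamma".into()];
    let filter: HashSet<String> = members.iter().cloned().collect();
    assert_eq!(estimate_fp_rate(&members, &filter), 0.0);
    assert!(estimate_fp_rate(&[], &filter).is_nan());
    println!("estimated fp rate over {} mutated queries: 0", members.len());
}
```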

Funding

The NGSOTI project is dedicated to training the next generation of Security Operation Center (SOC) operators, focusing on the human aspect of cybersecurity. It underscores the significance of providing SOC operators with the necessary skills and open-source tools to address challenges such as detection engineering, incident response, and threat intelligence analysis. Involving key partners such as CIRCL, Restena, Tenzir, and the University of Luxembourg, the project aims to establish a real operational infrastructure for practical training. This initiative integrates academic curricula with industry insights, offering hands-on experience in cyber ranges.

NGSOTI is co-funded under Digital Europe Programme (DEP) via the ECCC (European cybersecurity competence network and competence centre).

poppy's People

Contributors: adulau, qjerome
poppy's Issues

[Question] false-positive probability behaviour and benchmark

src: https://www.misp-project.org/2024/03/25/Poppy-a-new-bloom-filter-format-and-project.html

Benchmarks
Dataset size: varies between 10MB and 7GB
Data type: sha1 strings (40 bytes wide)
False positive probability: 0.001 (0.1%)

The 10MB dataset contains around 200,000 entries.

  1. How many datasets did you test?
  2. Have you tested any smaller datasets?
  3. What values does the FPP depend on? I would guess it is defined at creation time, based on the selected capacity and the FPP. Is this FPP then an upper limit or a lower limit? Or is it more of a target that may or may not be met?
  4. As previously guessed: is there a way to estimate what the FPP will be, depending on the actually used capacity?
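For reference, in a classic Bloom filter the FPP is a design target fixed at creation time from the capacity n and probability p; it is honoured only while at most n elements are inserted and degrades beyond that. A quick Rust sketch of the textbook formulas (an assumption: poppy's exact rounding may differ):

```rust
// Textbook Bloom filter sizing:
//   m = -n * ln(p) / (ln 2)^2  (bits)      k = (m / n) * ln 2  (hashes)
fn optimal_params(n: u64, p: f64) -> (u64, u32) {
    let ln2 = std::f64::consts::LN_2;
    let m = (-(n as f64) * p.ln() / (ln2 * ln2)).ceil() as u64;
    let k = ((m as f64 / n as f64) * ln2).round().max(1.0) as u32;
    (m, k)
}

// Expected FPP once `inserted` elements are in a filter of m bits / k
// hashes: (1 - e^(-k * inserted / m))^k
fn expected_fpp(m: u64, k: u32, inserted: u64) -> f64 {
    (1.0 - (-(k as f64 * inserted as f64) / m as f64).exp()).powi(k as i32)
}

fn main() {
    let (m, k) = optimal_params(1000, 0.001);
    assert_eq!((m, k), (14378, 10));
    // At the design capacity the achieved FPP matches the target...
    assert!((expected_fpp(m, k, 1000) - 0.001).abs() < 2e-4);
    // ...but overfilling the filter pushes it well above the target.
    assert!(expected_fpp(m, k, 2000) > 0.01);
    println!("n=1000 p=0.001 -> m={m} bits, k={k} hashes");
}
```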

I used the bench option to test against a PSS-created Bloom filter:

Preparing dataset
Benchmarking filter: multi/v2/filterC_0.01.pop
	version                                  : 2
	capacity (desired number of elements)    : 5
	n (estimated number of elements)         : 5
	fpp (false positive probability)         : 0.01
	Length of data                           : 0.0B
	Size of bloom filter                     : 4.0KB

Query performance:
	dataset size: 640.0B (0MB)
	test size: 5

	condition: 90% of queried values are in filter
	query duration: 97ns
	query speed: 51546391.8 queries/s -> 6292.3 MB/s
	fp rate =   0

[...]
	condition: 60% of queried values are in filter
	query duration: 99ns
	query speed: 50505050.5 queries/s -> 6165.2 MB/s
	fp rate = NaN
[...]
  • is fp rate = NaN a bug?
  • should fp rate = xxx be a float or an integer?
  • the line condition: 90% of queried values are in filter should also contain the absolute number of queried values

bug in v2::BloomFilter when empty

contains_bytes must return false if len is 0

    pub fn contains_bytes<D: AsRef<[u8]>>(&self, data: D) -> bool {
        let it = self.index_iter().init_with_data(data);

        let h = it.bucket_hash();

        if !self.index_cache.is_empty()
            && !self
                .index_cache
                .get_nth_bit(h as usize & (self.index_cache.bit_len() - 1))
        {
            return false;
        }

        let ibucket = {
            if self.has_optimal_buckets() {
                h & (self.buckets.len() as u64 - 1)
            } else {
// Bug is here: when the filter is empty, buckets.len() is 0 and `h % 0` panics
                h % self.buckets.len() as u64
            }
        };
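A standalone sketch of what the fix presumably looks like (an assumption about intent, not the actual patch): with zero buckets, h % buckets.len() is a remainder by zero and panics in Rust, so an empty filter should short-circuit to false before any index arithmetic.

```rust
// Hypothetical simplified membership check: `buckets` stands in for the
// filter's bucket array and `h` for the bucket hash.
fn contains_hash(h: u64, buckets: &[u64]) -> bool {
    // An empty filter contains nothing -- and this guard also avoids the
    // `h % 0` panic reported in the issue.
    if buckets.is_empty() {
        return false;
    }
    let ibucket = (h % buckets.len() as u64) as usize;
    // Toy membership test; the real filter checks k bit positions here.
    buckets[ibucket] & (1u64 << (h % 64)) != 0
}

fn main() {
    assert!(!contains_hash(42, &[])); // would panic without the guard
    assert!(contains_hash(0, &[1]));
    assert!(!contains_hash(1, &[1]));
    println!("empty filter returns false instead of panicking");
}
```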

Creating filters, large files and memory usage

While running the following command to create a Poppy filter from a file of 146GB:

./poppy -v -j 2 create -p 0.001 rockyou2024.pop rockyou2024.txt

The memory is exhausted and the process is killed by the Kernel:

Out of memory: Killed process 2467222 (poppy) total-vm:17534144kB, anon-rss:14868572kB, file-rss:0kB, shmem-rss:0kB, UID:1006 pgtables:34088kB oom_score_adj:0

It seems the issue is larger than just the counting part. Creating a Poppy filter with the following
parameters kills a 16GB server: ./poppy create -c 9948575739 -p 0.001 rockyou2024.pop

I bet it's not possible to create a Bloom filter on a system with less memory than the total filter size.
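That guess is consistent with the textbook size formula; a quick sketch (an approximation of what poppy allocates, not its exact in-memory or on-disk layout):

```rust
// Approximate size of a Bloom filter with capacity n and target false
// positive probability p: m = -n * ln(p) / (ln 2)^2 bits, i.e. m / 8 bytes.
fn filter_size_bytes(n: u64, p: f64) -> u64 {
    let ln2 = std::f64::consts::LN_2;
    ((-(n as f64) * p.ln()) / (ln2 * ln2) / 8.0).ceil() as u64
}

fn main() {
    // ~9.9 billion entries at p = 0.001, as in the OOM report above:
    let size = filter_size_bytes(9_948_575_739, 0.001);
    // ~17.9 GB -- more than a 16 GB machine can hold, regardless of how
    // the input file is streamed.
    assert!(size > 16_000_000_000 && size < 19_000_000_000);
    println!("estimated filter size: {:.1} GB", size as f64 / 1e9);
}
```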

handle false positive probability correctly

#[clap(short, long, default_value = "0.01")]

I would guess that the false positive probability should be between 0 and 1, but that is neither documented nor checked after setting this parameter.
Maybe it would be useful to clamp the value (https://doc.rust-lang.org/stable/core/primitive.f64.html#method.clamp).

I misunderstood this value as a percentage, which leads to funny results.

user@system:   wc -l result_*
     37 result_0.0001.txt
     37 result_0.001.txt
     37 result_0.01.txt
     37 result_0.1.txt
 712976 result_1.0.txt
    458 result_2.0.txt
    458 result_3.0.txt
     70 result_5.0.txt
     37 result_10.0.txt
     37 result_15.0.txt
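One possible sanity check for -p (a hypothetical helper, not poppy's actual code): since a probability must lie strictly inside (0, 1), rejecting out-of-range values with an error is arguably safer than clamping, because clamping 2.0 down to 1.0 would still silently build a degenerate filter.

```rust
// Hypothetical validation for the false positive probability argument.
fn validate_fpp(p: f64) -> Result<f64, String> {
    if p.is_finite() && p > 0.0 && p < 1.0 {
        Ok(p)
    } else {
        Err(format!(
            "false positive probability must be in (0, 1), got {p}"
        ))
    }
}

fn main() {
    assert!(validate_fpp(0.001).is_ok());
    assert!(validate_fpp(1.0).is_err()); // p = 1.0 accepts everything
    assert!(validate_fpp(2.0).is_err()); // "2%" misread as 2.0
    assert!(validate_fpp(f64::NAN).is_err());
    println!("out-of-range probabilities rejected");
}
```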

different behaviour on format version

I used this bash script to evaluate IPv4 brute force for different FPRs. The dataset.txt files are two different IPv4 lists; I switched between them between runs.

wc -l ../testdata*
         5 ../testdata.txt
 255102848 ../testdataRand.txt

selective information dump:

build@build:~/bloomfilter-test$ /home/build/.cargo/bin/poppy --verbose show  v1/filterR_0.001.pop 
File: v1/filterR_0.001.pop
        version                                  : 1
        capacity (desired number of elements)    : 255102848
        n (estimated number of elements)         : 255103777
        fpp (false positive probability)         : 0.001
        Length of data                           : 0.0B
        Size of bloom filter                     : 437.2MB
build@build:~/bloomfilter-test$ /home/build/.cargo/bin/poppy --verbose show  v2/filterR_0.001.pop 
File: v2/filterR_0.001.pop
        version                                  : 2
        capacity (desired number of elements)    : 255102848
        n (estimated number of elements)         : 255041379
        fpp (false positive probability)         : 0.001
        Length of data                           : 0.0B
        Size of bloom filter                     : 437.3MB
build@build:~/bloomfilter-test$ /home/build/.cargo/bin/poppy --verbose show  v1/filter_0.001.pop 
File: v1/filter_0.001.pop
        version                                  : 1
        capacity (desired number of elements)    : 5
        n (estimated number of elements)         : 4
        fpp (false positive probability)         : 0.001
        Length of data                           : 0.0B
        Size of bloom filter                     : 16.0B
build@build:~/bloomfilter-test$ /home/build/.cargo/bin/poppy --verbose show  v2/filter_0.001.pop 
File: v2/filter_0.001.pop
        version                                  : 2
        capacity (desired number of elements)    : 5
        n (estimated number of elements)         : 5
        fpp (false positive probability)         : 0.001
        Length of data                           : 0.0B
        Size of bloom filter                     : 4.0KB
bruteforce ipv4 script
#!/usr/bin/env bash
version=1
id=R
for p in 0.000000000001 0.00000000001 0.0000000001 0.000000001 0.00000001 0.0000001 0.000001 0.00001 0.0001 0.001 0.01 0.1 1.0;
do
  echo  "${p}"
  #echo "creating filter"
  # creating a filter with a desired capacity `-c` and false positive probability `-p`
  #/home/build/.cargo/bin/poppy -j 0 create --version ${version} -p ${p} v${version}/filter_${p}.pop testdataRand.txt
  #cat testdata_rand.txt | /home/build/.cargo/bin/poppy insert filter_${p}.pop
  # showing information about the filter we just created
  /home/build/.cargo/bin/poppy show v${version}/filter${id}_${p}.pop > v${version}/result${id}_${p}.txt
  
  echo "bruteforce ipv4"
  ./ipv4 | /home/build/.cargo/bin/poppy check v${version}/filter${id}_${p}.pop 2>/dev/null | wc -l >> v${version}/result${id}_${p}.txt
done

But I got different results for version 1 and version 2. I would say the version 1 implementation is somehow broken:

testdata Random

version 1:

tabs 30; for file in v1/resultR_*; do echo -n ${file}; echo -ne "\t"; tail -n1 ${file}; done; tabs -8
v1/resultR_0.000000000001.txt 2279827179                                                                                                  
v1/resultR_0.00000000001.txt  2292496090
v1/resultR_0.0000000001.txt   2307695711
v1/resultR_0.000000001.txt    2279826672
v1/resultR_0.00000001.txt     2297276457
v1/resultR_0.0000001.txt      2319482144
v1/resultR_0.000001.txt       2279912424
v1/resultR_0.00001.txt        2307682256
v1/resultR_0.0001.txt         2348721215
v1/resultR_0.001.txt          2279909914
v1/resultR_0.01.txt           2348745809
v1/resultR_0.1.txt            2541550760
v1/resultR_1.0.txt            0

version 2:

tabs 30; for file in v2/resultR_*; do echo -n ${file}; echo -ne "\t"; tail -n1 ${file}; done; tabs -8
v2/resultR_0.000000000001.txt 255104969                                                                                                   
v2/resultR_0.00000000001.txt  255105149
v2/resultR_0.0000000001.txt   255105478
v2/resultR_0.000000001.txt    255105775
v2/resultR_0.00000001.txt     255106167
v2/resultR_0.0000001.txt      255106974
v2/resultR_0.000001.txt       255111509
v2/resultR_0.00001.txt        255150265
v2/resultR_0.0001.txt         255524983
v2/resultR_0.001.txt          259186885
v2/resultR_0.01.txt           295735913
v2/resultR_0.1.txt            669617261
v2/resultR_1.0.txt            4294967296
testdata 5 IPv4

version 1:

tabs 30; for file in v1/result_*; do echo -n ${file}; echo -ne "\t"; tail -n1 ${file}; done; tabs -8
v1/result_0.000000000001.txt  2169947840                                                                                                   
v1/result_0.00000000001.txt   2090326774
v1/result_0.0000000001.txt    2300182147
v1/result_0.000000001.txt     2237356373
v1/result_0.00000001.txt      2226226152
v1/result_0.0000001.txt       2134588848
v1/result_0.000001.txt        2162538453
v1/result_0.00001.txt         2273858034
v1/result_0.0001.txt          2350928224
v1/result_0.001.txt           1996217247
v1/result_0.01.txt            2375923684                                                                                                   
v1/result_0.1.txt             2614344677
v1/result_1.0.txt             0

version 2:

tabs 30; for file in v2/result_*; do echo -n ${file}; echo -ne "\t"; tail -n1 ${file}; done; tabs -8
v2/result_0.000000000001.txt  30                                                                                                          
v2/result_0.00000000001.txt   30
v2/result_0.0000000001.txt    30
v2/result_0.000000001.txt     30
v2/result_0.00000001.txt      30
v2/result_0.0000001.txt       30
v2/result_0.000001.txt        30
v2/result_0.00001.txt         30
v2/result_0.0001.txt          30
v2/result_0.001.txt           30
v2/result_0.01.txt            30
v2/result_0.1.txt             30
v2/result_1.0.txt             712969

filter size
stat -c "%s %n" v1/filter*
1833881816 v1/filterR_0.000000000001.pop
1681058336 v1/filterR_0.00000000001.pop
1528234856 v1/filterR_0.0000000001.pop
1375411376 v1/filterR_0.000000001.pop
1222587896 v1/filterR_0.00000001.pop
1069764416 v1/filterR_0.0000001.pop
916940936 v1/filterR_0.000001.pop
764117456 v1/filterR_0.00001.pop
611293976 v1/filterR_0.0001.pop
458470496 v1/filterR_0.001.pop
305647016 v1/filterR_0.01.pop
152823536 v1/filterR_0.1.pop
48 v1/filterR_1.0.pop
88 v1/filter_0.000000000001.pop
88 v1/filter_0.00000000001.pop
80 v1/filter_0.0000000001.pop
80 v1/filter_0.000000001.pop
72 v1/filter_0.00000001.pop
72 v1/filter_0.0000001.pop
72 v1/filter_0.000001.pop
64 v1/filter_0.00001.pop
64 v1/filter_0.0001.pop
64 v1/filter_0.001.pop
56 v1/filter_0.01.pop
56 v1/filter_0.1.pop
48 v1/filter_1.0.pop

I guess that the filter is correctly created, but I didn't check further.

stat -c "%s %n" v2/filter*
1836384312 v2/filterR_0.000000000001.pop
1682612280 v2/filterR_0.00000000001.pop
1529872440 v2/filterR_0.0000000001.pop
1376682040 v2/filterR_0.000000001.pop
1223540792 v2/filterR_0.00000001.pop
1070596152 v2/filterR_0.0000001.pop
917385272 v2/filterR_0.000001.pop
764379192 v2/filterR_0.00001.pop
611414072 v2/filterR_0.0001.pop
458494008 v2/filterR_0.001.pop
305709112 v2/filterR_0.01.pop
152834104 v2/filterR_0.1.pop
4152 v2/filterR_1.0.pop
4152 v2/filter_0.000000000001.pop
4152 v2/filter_0.00000000001.pop
4152 v2/filter_0.0000000001.pop
4152 v2/filter_0.000000001.pop
4152 v2/filter_0.00000001.pop
4152 v2/filter_0.0000001.pop
4152 v2/filter_0.000001.pop
4152 v2/filter_0.00001.pop
4152 v2/filter_0.0001.pop
4152 v2/filter_0.001.pop
4152 v2/filter_0.01.pop
4152 v2/filter_0.1.pop
4152 v2/filter_1.0.pop
