Giter Site home page Giter Site logo

pietempesti98 / bloomfilter_spark Goto Github PK

View Code? Open in Web Editor NEW
3.0 2.0 1.0 7.28 MB

Implementation of a bloom filter using the spark framework made by Stefano Dugo, Gabriele Marino, Pietro Tempesti and Benedetta Tessa

Python 100.00%
cloud-computing project-work spark mapreduce

bloomfilter_spark's Introduction

Bloom Filter - Spark

Implementation of a Bloom Filter using the Spark framework in Python. All the documentation for this project can be found here

How to run the algorithm

Note: execute all the commands below inside the bloomfilter_spark folder

Pre-installation

In order to run the algorithm on a cluster, you have to install the same python environment on all the cluster's machines.

These are the steps to create a virtual environment with all the needed packages:

  1. Only on debian/ubuntu install python venv

    sudo apt-get install python3-venv

  2. create a virtual environment and activate it

    python -m venv pyspark_venv
    source pyspark_venv/bin/activate
    
  3. install all the needed dependencies using pip

    pip3 install pyspark mmh3 venv-pack

  4. zip the virtual environment in a .tar.gz archive

    venv-pack -o pyspark_venv.tar.gz

Spark submission

export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_venv.tar.gz#environment main.py [input path] [output path] [false positive rate] 

spark-submit example

spark-submit --archives pyspark_venv.tar.gz#environment main.py data/data.txt output 0.01

Note: remember to delete the output folder before the next execution using hadoop fs -rm -r [output path]

The input file

We have tested the bloom filter with the title.ratings IMDb dataset (available here). The dataset is a .tsv file, with an header line containing the structure of the data:

tconst averageRating numVotes

We transform this file in a .txt file without the header.

tt0000001	5.7	1882
tt0000002	5.9	250
tt0000003	6.5	1663
tt0000004	5.8	163
tt0000005	6.2	2487
tt0000006	5.2	166
tt0000007	5.4	773
tt0000008	5.4	2024
tt0000009	5.3	194
tt0000010	6.9	6803
tt0000011	5.3	346
tt0000012	7.4	11692
tt0000013	5.7	1801
...             ...     ...

In the setup phase, in which we compute m for each bloom filter, we also pre-process this dataset, removing the numVotes column and rounding the values in the averageRating column. You can find the data.txt file (and a reduced version of the same file used for testing purposes data1.txt) in this folder.

Outputs

The algorithm generates two output folders: bloom_filters, with a bloom filter for each vote, and fp_rates, with the number of false positives and the false positive rate computed for each bloom filter.

You can find these folders inside the <output path> specified as input parameter; for example, if you specify output as output path, the output structure will be:

output
| 
|_ bloom_filters
|
|_ fp_rates

Output printed by the application

This is an example of the output printed by the execution of the application, using 0.01 as p value

***** results *****


vote 1 --> false positives: 12481, false positive rate: 0.01002832292147922

vote 2 --> false positives: 12872, false positive rate: 0.010385712383864707

vote 3 --> false positives: 12228, false positive rate: 0.009938215314994612

vote 4 --> false positives: 11999, false positive rate: 0.01002203360667924

vote 5 --> false positives: 11349, false positive rate: 0.009860438606627667

vote 6 --> false positives: 9953, false positive rate: 0.009960091585208669

vote 7 --> false positives: 9192, false positive rate: 0.010166252289397545

vote 8 --> false positives: 8837, false positive rate: 0.010116007262187402

vote 9 --> false positives: 11843, false positive rate: 0.010266826525048092

vote 10 --> false positives: 12337, false positive rate: 0.010021908963666214

The Hadoop implementation

You can find our implementation of the bloom filter using Hadoop in this repository

bloomfilter_spark's People

Contributors

btessa99 avatar pietempesti98 avatar

Stargazers

 avatar  avatar  avatar

Watchers

 avatar  avatar

Forkers

btessa99

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.