Giter Site home page Giter Site logo

simonprickett / python-probabilistic-data-structures Goto Github PK

View Code? Open in Web Editor NEW
13.0 3.0 2.0 1.84 MB

Exploring Probabilistic Data Structures in Python - my 2021 Pycon USA and Australia and Pycon MEA 2022 talk.

Home Page: https://simonprickett.dev/no-maybe-and-good-enough-probabilistic-data-structures-in-python/

License: MIT License

Python 100.00%
bloom-filter hyperloglog probabilistic-data-structures pycon python redis

python-probabilistic-data-structures's Introduction

No, Maybe and Close Enough!

Exploring Probabilistic Data Structures in Python - my 2021 Pycon USA and Australia talk. I've updated the code slightly for Pycon MEA Dubai 2022 - it now uses the latest redis-py Redis client.

Title Page

If you'd like to see the slides for the 2021 version of this talk, they're here (PDF). Watch the 2021 video here or read the transcript on my website. I gave a shorter version of this talk in person for Pycon MEA Dubai 2022 - watch that here.

This repository contains supporting code to run the examples from my talk. The example code uses in memory probabilistic data structures with the hyperloglog and pyprobables libraries. It also uses Redis with the RedisBloom module: this is provided for you as part of Redis Stack in a Docker container.

The two probabilistic data structures examined in this code base are:

  • HyperLogLog - an algorithm for estimating the cardinality of a set (the count distinct problem)
  • Bloom Filter - a data structure used to determine whether an element may be a member of a set

Setup

To run the example code and Redis server, you will need both Python 3 (I've tested this with 3.8.6) and Docker. Once you have these, clone the repo, create a virtual environment, and start the Docker container in the background:

$ git clone https://github.com/simonprickett/python-probabilistic-data-structures.git
$ cd python-probabilistic-data-structures
$ python3 -m venv venv
$ . venv/bin/activate
(venv) $ pip install -r requirements.txt
(venv) $ docker compose up -d

The Docker container uses port 6379. If you have another process (example: an existing Redis server instance) listening on that port, stop that process before starting the container.

Running the Examples

Counting Sheep

These examples use a Python set, then a Redis set to count sheep. They give an accurate count, but at the cost of memory usage. Moving the count to a Redis set offloads the memory usage problem to Redis, and solves the problem of allowing multiple instances of the Python code to work together to count sheep.

Counting sheep in memory with a Python set:

(venv) $ cd counting_sheep_python
(venv) $ python how_many.py
There are 6 sheep.

Have I seen this sheep in memory with a Python set:

(venv) $ python have_i_seen_this_one.py
I have seen sheep 1934.
I have not seen sheep 1283.

Counting sheep with a Redis set:

(venv) $ cd ../counting_sheep_redis
(venv) $ python how_many.py
There are 6 sheep

Have I seen this sheep with a Redis set:

(venv) $ python have_i_seen_this_one.py
I have seen sheep 1934.
I have not seen sheep 1283.

Approximating Sheep

These examples use the HyperLogLog to approximate a count of sheep seen, and the Bloom Filter to determine whether or not a particular sheep has been seen. They use both in memory Python implementations of the probabilistic data structures, and the Redis equivalents.

Approximate count with Python in memory HyperLogLog, compares count with an in memory set:

(venv) $ cd ../approximating_sheep_python
(venv) $ python how_many.py
There are 100000 sheep (set).
There are 100075 sheep (hyperloglog).

Have I (maybe) seen this sheep with Python in memory Bloom Filter:

(venv) $ python have_i_see_this_one.py
I might have seen sheep 9018.
I have not seen sheep 454991.

Approximate count with Redis HyperLogLog:

(venv) $ cd ../approximating_sheep_redis
(venv) $ python how_many.py
There are 100000 sheep (set: 4673012).
There are 99565 sheep (hyperloglog: 12366).

Have I (maybe) seen this sheep with Redis Bloom Filter:

(venv) $ python have_i_seen_this_one.py
I might have seen sheep 9018.
I have not seen sheep 454991.

python-probabilistic-data-structures's People

Contributors

simonprickett avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.