Giter Site home page Giter Site logo

dynatrace-research / set-sketch-paper Goto Github PK

View Code? Open in Web Editor NEW
48.0 2.0 5.0 24.22 MB

SetSketch: Filling the Gap between MinHash and HyperLogLog

C++ 71.67% Python 28.33%
sketch hyperloglog hyperloglog-sketches minhash minhash-lsh-algorithm minhash-sketches minhash-similarity jaccard-similarity jaccard intersection

set-sketch-paper's Introduction

SetSketch: Filling the Gap between MinHash and HyperLogLog

This repository contains the source code to reproduce all the results and figures presented in the paper "SetSketch: Filling the Gap between MinHash and HyperLogLog" which was accepted at VLDB 2021. An extended paper version that includes mathematical proofs and additional results is available on arXiv.

Abstract

MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Fast, robust, and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The presented joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or HyperMinHash, where it even performs better than the corresponding state-of-the-art estimators in many cases.

Steps to reproduce the results and figures on Windows 10

  1. Install Windows Subsystem for Linux (WSL) with Ubuntu 20.04 LTS

  2. Install required packages:

    sudo apt update && sudo apt install gradle g++ libboost-dev python3-matplotlib python3-scipy texlive texlive-latex-extra texlive-fonts-extra
    
  3. Clone repository including submodules:

    git clone --recursive https://github.com/dynatrace-research/set-sketch-paper.git
    
  4. Switch to set-sketch-paper directory:

    cd set-sketch-paper
    
  5. Run simulations (takes several hours):

    gradle runCardinalityTest runJointTest runPerformanceTest
    
  6. Generate all figures:

    gradle pdfFigures
    

set-sketch-paper's People

Contributors

oertl avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

set-sketch-paper's Issues

Questions and opinions on the the most recent MinHash and SetSketch

Hello Otmar,

I am writing to ask your opinion on the most recent MinHash, called circulant permutation MinHash as published here (https://proceedings.mlr.press/v162/li22m.html) since when we were implementing the Joint MLE estimator of SetSketch, due to the fact that the true cardinalities of set is not known (e.g., in our case, genomes, vary quite a lot), we have to use the cardinality estimator first (equation (12)) and then use it for Joint MLE (section 3.2) via the Brent’s method (we were using the golden-section search (https://en.wikipedia.org/wiki/Golden-section_search) since it is strictly concave within an interval). Our implementation and my understanding agreed that the variance of the Joint MLE estimator is limited by the RMSE of cardinality estimation (e.g., approximating sqrt(1.079/m) or worse), despite being much more space efficient. We care about small Jaccard like those around 0.01 or smaller because in genome analysis this is important range where other methods will just be too expensive. For large Jaccard, there is no doubt that Setsketch is both space efficient and accurate enough. For small Jaccard, in the end, we went to densified Minhash (the optimal densification paper, Shrivastava 2017, strictly sqrt(J*(1-J)/m)) and figured out that there is an even better one as mentioned in the beginning, strictly smaller than sqrt(J*(1-J)/m) while being computationally as efficient as Shrivastava 2017 as claimed in the paper. It could be very useful/accurate for small Jaccard. However, the second permutation, to randomly choose hash values from non-empty bins to empty bins cannot be avoided (e.g., must use the original permutation based processing as in the Broder 1998 paper, not hashing based method to approximate permutation), which means the permutation vector, can be very long in practice, e.g., 2^16 to 2^18 et.al., depending on the bucket size K when data dimension is very large, e.g., 2^30. So not so memory efficient, but only one such permutation is needed, and we need to circular shift the permutation vector K times (K is the number of buckets as in one permutation hashing), one after another, so pervious permutation can be released from memory after the shifted permutation is computed. Still, not sure how this will greatly affect computational efficiency because I tested that the circular shift of such a large permutation vector is also very expensive. Just want to hear your opinions on the circular permutation method since you are more familiar with computational optimization of those probabilistic structures. Sorry for the long description. Let me know if I am not clear.

Thank you very much,

Best,

Jianshu

HyperLogLogLog one log more for SetSketch

Hello Ertl,

I am wondering whether the new hyperlogloglog idea mentioned here: https://dl.acm.org/doi/abs/10.1145/3534678.3539246, could be applied in SetSeketch/MLE in HLL (when b=0.001), after a quick reading, it seems they just compress the skethes further more, and allow overflow but handled the overflow very carefully. Space is even smaller for this idea (40% smaller). I have not read the code carefully but it seems all hyperloglog properties are preserved.

Looking forward to your reply.

Thanks,

Jianshu

Similarity search with set-sketch-paper

Hi,

I was using HLL --cmp for some similarity searches between two data sets (Sequencing data vs. Reference Genome). I would like to try the same with set-sketch. Is there anyway to run a similarity search using set-sketch?

Thanks

best,
Ahmad

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.