dynatrace-research / set-sketch-paper Goto Github PK

SetSketch: Filling the Gap between MinHash and HyperLogLog

C++ 71.67% Python 28.33%

sketch hyperloglog hyperloglog-sketches minhash minhash-lsh-algorithm minhash-sketches minhash-similarity jaccard-similarity jaccard intersection

set-sketch-paper's Introduction

SetSketch: Filling the Gap between MinHash and HyperLogLog

This repository contains the source code to reproduce all the results and figures presented in the paper "SetSketch: Filling the Gap between MinHash and HyperLogLog" which was accepted at VLDB 2021. An extended paper version that includes mathematical proofs and additional results is available on arXiv.

Abstract

MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast comparison of sets as it allows estimating the Jaccard similarity and other joint quantities. This work presents a new data structure called SetSketch that is able to continuously fill the gap between both use cases. Its commutative and idempotent insert operation and its mergeable state make it suitable for distributed environments. Fast, robust, and easy-to-implement estimators for cardinality and joint quantities, as well as the ability to use SetSketch for similarity search, enable versatile applications. The presented joint estimator can also be applied to other data structures such as MinHash, HyperLogLog, or HyperMinHash, where it even performs better than the corresponding state-of-the-art estimators in many cases.

Steps to reproduce the results and figures on Windows 10

Install Windows Subsystem for Linux (WSL) with Ubuntu 20.04 LTS

Install required packages:

sudo apt update && sudo apt install gradle g++ libboost-dev python3-matplotlib python3-scipy texlive texlive-latex-extra texlive-fonts-extra

Clone repository including submodules:

git clone --recursive https://github.com/dynatrace-research/set-sketch-paper.git

Switch to set-sketch-paper directory:
```
cd set-sketch-paper
```

Run simulations (takes several hours):

gradle runCardinalityTest runJointTest runPerformanceTest

Generate all figures:
```
gradle pdfFigures
```

set-sketch-paper's People

Contributors

Stargazers

Watchers

Forkers

jimwangzx yuqiz2020 pombredanne jianshu93 satanson

set-sketch-paper's Issues

Questions and opinions on the the most recent MinHash and SetSketch

Hello Otmar,

I am writing to ask your opinion on the most recent MinHash, called circulant permutation MinHash as published here (https://proceedings.mlr.press/v162/li22m.html) since when we were implementing the Joint MLE estimator of SetSketch, due to the fact that the true cardinalities of set is not known (e.g., in our case, genomes, vary quite a lot), we have to use the cardinality estimator first (equation (12)) and then use it for Joint MLE (section 3.2) via the Brent’s method (we were using the golden-section search (https://en.wikipedia.org/wiki/Golden-section_search) since it is strictly concave within an interval). Our implementation and my understanding agreed that the variance of the Joint MLE estimator is limited by the RMSE of cardinality estimation (e.g., approximating sqrt(1.079/m) or worse), despite being much more space efficient. We care about small Jaccard like those around 0.01 or smaller because in genome analysis this is important range where other methods will just be too expensive. For large Jaccard, there is no doubt that Setsketch is both space efficient and accurate enough. For small Jaccard, in the end, we went to densified Minhash (the optimal densification paper, Shrivastava 2017, strictly sqrt(J*(1-J)/m)) and figured out that there is an even better one as mentioned in the beginning, strictly smaller than sqrt(J*(1-J)/m) while being computationally as efficient as Shrivastava 2017 as claimed in the paper. It could be very useful/accurate for small Jaccard. However, the second permutation, to randomly choose hash values from non-empty bins to empty bins cannot be avoided (e.g., must use the original permutation based processing as in the Broder 1998 paper, not hashing based method to approximate permutation), which means the permutation vector, can be very long in practice, e.g., 2^16 to 2^18 et.al., depending on the bucket size K when data dimension is very large, e.g., 2^30. So not so memory efficient, but only one such permutation is needed, and we need to circular shift the permutation vector K times (K is the number of buckets as in one permutation hashing), one after another, so pervious permutation can be released from memory after the shifted permutation is computed. Still, not sure how this will greatly affect computational efficiency because I tested that the circular shift of such a large permutation vector is also very expensive. Just want to hear your opinions on the circular permutation method since you are more familiar with computational optimization of those probabilistic structures. Sorry for the long description. Let me know if I am not clear.

Thank you very much,

Best,

Jianshu

HyperLogLogLog one log more for SetSketch

Hello Ertl,

I am wondering whether the new hyperlogloglog idea mentioned here: https://dl.acm.org/doi/abs/10.1145/3534678.3539246, could be applied in SetSeketch/MLE in HLL (when b=0.001), after a quick reading, it seems they just compress the skethes further more, and allow overflow but handled the overflow very carefully. Space is even smaller for this idea (40% smaller). I have not read the code carefully but it seems all hyperloglog properties are preserved.

Looking forward to your reply.

Thanks,

Jianshu

lower bound of variance for HyperLogLog sketch estimators

Similarity search with set-sketch-paper

Hi,

I was using HLL --cmp for some similarity searches between two data sets (Sequencing data vs. Reference Genome). I would like to try the same with set-sketch. Is there anyway to run a similarity search using set-sketch?

Thanks

best,
Ahmad

dynatrace-research / set-sketch-paper Goto Github PK

set-sketch-paper's Introduction

SetSketch: Filling the Gap between MinHash and HyperLogLog

Abstract

Steps to reproduce the results and figures on Windows 10

set-sketch-paper's People

Contributors

Stargazers

Watchers

Forkers

set-sketch-paper's Issues

Questions and opinions on the the most recent MinHash and SetSketch

HyperLogLogLog one log more for SetSketch

lower bound of variance for HyperLogLog sketch estimators

Similarity search with set-sketch-paper

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent