The near-set-reconciliation from pmnoxx

I like to make stuff work in Rust 🦀.

I created following repos:

projected-hash-map - implementation of projected hash map over hash set.
rust-memory-analyzer - tool used to investigate memory usage in Rust code.

I constributed to, for example:

near-set-reconciliation's People

Contributors

Watchers

near-set-reconciliation's Issues

Code review results

This is what I found after a quick review:

In blt.rs:

You should not use DefaultHasher, it doesn't guarantee consistency across program instances.
This check if i != 0 on line 130 seems to be incorrect.
Even if it is not possible to recover all elements, it may still make sense to use the elements recovered so far to reduce the number of elements that will need to be recovered in the future.
The approach to generate the list of indices (in generate_idx) is suboptimal, it may be more efficient to derive the indices from a single (sufficiently long) hash value.
The values used to derive indices and to xor with xor_hash should be independent.
Element size and hash size should probably be configurable.
Consider removing count, this will make the sketch smaller and the code a bit simpler, but add some computation overhead (which will be small with a properly chosen hash function).

Difference size estimation algorithm

The following algorithm can be used to efficiently estimate the size of the difference between two sets:

Suppose that both peers sample their sets using the same predicate that selects an element with probability p. If the size of the symmetric difference between the sets is n, then the probability that the samples will be identical is (1-p)ⁿ. This probability roughly behaves as follows: if np >> 1, this probability is close to zero, if np << 1, it is close to one, otherwise it is somewhere in between.

This can be used to estimate the size of the difference like this: the peers choose a sequence of the values p, for example, 1/2ⁱ for some range of integers i. For each p, they sample the set multiple times using different predicates that select an element with probability p, compute the hash of each of the sampled sets, and send the hashes to the other peer. Now, each peer can compare their hashes with those of the other peer. For low p, most of the hashes will be identical. For high p, most of the hashes will probably be different. For some intermediate value of p, about half of the hashes will be different. The exact fraction can be used to estimate the size more precisely.

Some implementation notes:

The hashes can be small. In fact, for many parameter values, the highest precision is achieved (for the same amount of space) by using hashes that are just one bit long.
A hash of a set can be computed by XORing hashes of its elements. With small hashes, different samples should use different element hashes.
One long hash can be interpreted as multiple short ones by taking different bits.
The same hash can also be used for element selection (but the bits used for element selection should not be the same as those used for set hashing).

Recommend Projects

pmnoxx / near-set-reconciliation Goto Github PK

near-set-reconciliation's Introduction

near-set-reconciliation's People

Contributors

Watchers

near-set-reconciliation's Issues

Code review results

Difference size estimation algorithm

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent