
Comments (7)

serge-sans-paille commented on July 24, 2024

@cbeck88 thanks a lot for sharing your thoughts. I'm a bit busy right now but I'll definitely try to answer this week.


serge-sans-paille commented on July 24, 2024

An educated guess: instead of iteratively trying new incremented values, could we use a PRNG?
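Something along these lines, perhaps (just a sketch of the idea, not what frozen does today): a constexpr step of Marsaglia's xorshift64, whose output would replace the incremented d values.

#include <cstdint>

// Sketch only: one xorshift64 step, usable in constexpr context.
// Each retry would do d = xorshift64(d) instead of d++ (d must start non-zero).
constexpr std::uint64_t xorshift64(std::uint64_t state) {
  state ^= state << 13;
  state ^= state >> 7;
  state ^= state << 17;
  return state;
}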


serge-sans-paille commented on July 24, 2024

@cbeck88 #23 is an attempt to randomize the search. I still use a fixed seed to start the search; alternatives are:

  • providing an extra template parameter to the container
  • using __LINE__
  • ??

What do you think?


serge-sans-paille commented on July 24, 2024

Do you think it would be a good idea to try to simplify elsa so that it can be run twice as fast?

I switched to a more standard hash function (the murmur3 finalizer), and it yields better performance while still providing a good distribution.
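For reference, the standard MurmurHash3 64-bit finalizer is the following; the exact way it is wired into elsa may differ in detail.

#include <cstdint>

// The standard MurmurHash3 64-bit finalizer (fmix64).
constexpr std::uint64_t murmur3_fmix64(std::uint64_t k) {
  k ^= k >> 33;
  k *= 0xff51afd7ed558ccdULL;
  k ^= k >> 33;
  k *= 0xc4ceb9fe1a85ec53ULL;
  k ^= k >> 33;
  return k;
}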


cbeck88 commented on July 24, 2024

I opened a WIP / experimental PR to show you my ideas on this, will write more later!


cbeck88 commented on July 24, 2024

@serge-sans-paille:

Here is roughly my concern: if the initial input has a pathological distribution, the algorithm will fail to complete and will time out.

Here's an example input distribution I had in mind: suppose you are using bit operations to pack multiple flags into an enum, and you want to associate a string with each bit for the purpose of diagnostic messages.

enum BitMask {
  a = 1,
  b = 2,
  c = 4,
  d = 8,
  e = 16,
  f = 32,
};

Or, for another example where I think it would go badly, suppose that all of the enum values are multiples of 32.

enum MyEnum {
  foo = 32,
  bar = 64,
  baz = 96,
  qaz = 128,
};

Now, we want to be able to associate the enums with string representations of their names. We could implement an enum2string function using a switch, which would be fast. If we also want string2enum... we could implement that as a bunch of if/else. But often, programmers want to be able to iterate over all of the valid enum values, so we end up with a bunch of extern const std::map<const MyEnum, const std::string> objects in the code. We'd like to replace all that stuff with frozen::map.
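For concreteness, the hand-written pattern I mean looks roughly like this (using MyEnum from above; purely illustrative):

// The switch-based direction is fast and easy to write:
const char* enum2string(MyEnum e) {
  switch (e) {
    case foo: return "foo";
    case bar: return "bar";
    case baz: return "baz";
    case qaz: return "qaz";
  }
  return "unknown";
}
// ...while the reverse lookup and "iterate over all values" end up as
// if/else chains and extern const std::map tables scattered around the code.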

However, what is going to happen in the pmh tables when I try to do

frozen::make_unordered_map<int, frozen::string>({{32, "foo"}, {64, "bar"}, {96, "baz"}, {128, "qaz"}, ...});

If there are only, say, 16 strings or fewer, then M is going to be maybe 16, maybe 32. Either way, the "unseeded" version of elsa returns its input unchanged, so elsa{}(32) == 32, elsa{}(64) == 64, elsa{}(96) == 96, and every one of these values mod M is zero.

So in the first pass of make_pmh_tables, where we put every item into one of the M buckets, they are all going to end up in bucket 0.
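Concretely (assuming M == 32 for an input of this size; the same thing happens with M == 16):

#include <cstddef>

constexpr std::size_t M = 32;                 // assumed table size for this example
constexpr std::size_t bucket_of(int key) {
  return static_cast<std::size_t>(key) % M;   // elsa{}(key) == key when unseeded
}
static_assert(bucket_of(32) == 0 && bucket_of(64) == 0 &&
              bucket_of(96) == 0 && bucket_of(128) == 0,
              "every key collides into bucket 0");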

Why is that bad? It's bad because in the second part of the algorithm, where we choose a random seed for each bucket, the chance that a random seed is good gets exponentially smaller as the bucket size increases.

    // Repeatedly try different values of d until we find a hash function
    // that places all items in the bucket into free slots
    std::size_t d = 1;
    cvector<std::size_t, M> slots;

    while (slots.size() < bsize) {
      auto slot = hash(key(it[bucket[slots.size()] - 1]), d) % M;

      if (values[slot] != 0 || !all_different_from(slots, slot)) {
        slots.clear();
        d++;
        continue;
      }

      slots.push_back(slot);
    }

(That is: I think the running time of this loop is exponential in bsize. There is also some dependency on how many items have already been placed, but when bsize is very large, say half the elements, I think it will always be a bad case, exponentially bad in bsize.)

If it were okay for the buckets to be arbitrarily large, then we could just skip the entire "first table" part. A "naive" way to do perfect hashing would be to choose random seeds d over and over until one of them hashes every key to a distinct slot, and then record that d as the seed. The thing is, that is not linear time -- IIUC it is exponential in n, where n is the number of elements.
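In code, that naive approach would be something like the sketch below (not frozen's algorithm; hash stands in for any seeded hash function):

#include <array>
#include <cstddef>
#include <cstdint>

// Naive approach: try seeds until one hashes all N keys into N distinct slots.
// The expected number of tries is about N^N / N!, i.e. exponential in N,
// which is exactly why the two-level scheme exists.
template <std::size_t N, class Key, class Hash>
std::uint64_t naive_perfect_seed(const std::array<Key, N>& keys, Hash hash) {
  for (std::uint64_t d = 1;; ++d) {
    std::array<bool, N> used{};
    bool ok = true;
    for (const auto& k : keys) {
      std::size_t slot = hash(k, d) % N;
      if (used[slot]) { ok = false; break; }
      used[slot] = true;
    }
    if (ok) return d;   // d hashes every key to a distinct slot
  }
}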

The linear-time perfect hashing paper, referred to from the original blog post, makes a statistical argument that, if I understand correctly, has two parts:

  • If the distribution of bucket sizes from the initial hash is "good enough", then the greedy choosing of seeds (construction of secondary table) takes place in linear time
  • With high probability, the initial distribution of bucket sizes is "good enough".

This is why they say it's linear time even though it is probabilistic -- you could start the algorithm and measure with a stopwatch, and if it doesn't succeed after 100N steps or something, you could start over, and with high probability you won't ever have to restart more than a few times.

The thing is that we don't have the luxury of starting over: if the second part fails to finish, we'll just get a compilation failure.

So to try to improve the success rate, and make sure the algorithm can handle even pathological inputs, I want to try to find a "hash looks good" condition, so we can try a few "initial" hashes and only proceed to building the second table if the bucket distribution looks good. (Since otherwise, building the second table will probably fail.)

In the commits I sent over, I messed around with the "sum of squares of bucket sizes" as a metric for success. In a "good" hash, most of the buckets should be constant-sized, or log n or so at worst -- and overall I would expect the sum of squared bucket sizes to be like O(n). In an extremely imbalanced hash, where one bucket has all the elements, the sum of squared bucket sizes will be like O(n^2). So I just chose some limit like 10n and say that anything exceeding this threshold is rejected.
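The check itself is tiny; given the bucket sizes from the first-level hash it is just (names are illustrative, not the PR's literal code):

#include <cstddef>

// Accept the first-level hash only if the sum of squared bucket sizes
// stays within a small constant factor of n.
template <class Buckets>
constexpr bool buckets_look_good(const Buckets& buckets, std::size_t n) {
  std::size_t sum_of_squares = 0;
  for (const auto& b : buckets)
    sum_of_squares += b.size() * b.size();
  return sum_of_squares <= 10 * n;   // the "10n" threshold mentioned above
}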

I think the sum-of-squares thing is not very rigorous though. The best thing would be to read the paper in detail and see what condition they use. I think the "rigorous" answer would be something like the sum of c^{bucket size} over all buckets, for some constant c. That makes sense to me because I think the running time of the second part is indeed going to be exponential in the size of each bucket, and we do it for each bucket. I'm really not sure what c is supposed to be, though.

It's still not completely rigorous, because the paper assumes a random-hash-function model and we aren't using truly random hash functions. But there's only so much you can do: we are giving the hash functions fairly large seeds (64 bits), and if the tables are usually not that big then it should behave almost the same, I think.

Also: I think that using a seed for the "initial hash" will fix all of these pathological input cases, because we should be able to quickly find a seed that has a "pretty good" bucket size distribution and then the algorithm is likely to succeed. If the sum of the squares of the bucket sizes is O(n), it means that the majority of the elements are in buckets of size O(1), and there can't be more than a few buckets as large as sqrt(n).

When I have some free time again I'm going to try creating some test cases along these lines, and see whether those tests fail to compile on master and do compile after my PR, but I haven't gotten around to it yet.


cbeck88 commented on July 24, 2024

Hi,

I opened PR #26 to show what I think is a "pathological input" on which the current master fails, and which shows the advantage of the do { ... } while loop in PR #24.

Basically, even if the inputs (deterministically) produce a bad first-round bucket distribution with the initial seed, we can detect that and try a new seed before building the second table. Or so the idea goes.
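Conceptually, the first-level part of that loop is something like this sketch (stand-in names, not the PR's literal code; seeded_hash plays the role of a seeded elsa):

#include <array>
#include <cstddef>
#include <cstdint>

// Draw first-level seeds until the bucket-size distribution passes the
// quality check, then build the second table as before.
template <std::size_t M, std::size_t N, class Key, class Hash>
std::uint64_t pick_first_level_seed(const std::array<Key, N>& keys, Hash seeded_hash) {
  std::uint64_t seed = 0;
  std::size_t sum_of_squares;
  do {
    // step the seed with a simple LCG (Knuth's MMIX constants)
    seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;
    std::array<std::size_t, M> bucket_sizes{};
    for (const auto& k : keys)
      ++bucket_sizes[seeded_hash(k, seed) % M];
    sum_of_squares = 0;
    for (auto s : bucket_sizes) sum_of_squares += s * s;
  } while (sum_of_squares > 10 * N);   // retry on a lopsided distribution
  return seed;                         // a seed whose buckets look good
}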
