
Comments (18)

martinus avatar martinus commented on July 22, 2024 1

I use a marker that counts the distance to the original bucket. I currently throw an exception when that marker would overflow. Instead, I could give the maximum value of the marker a special meaning where I switch to a slow linear search instead. That way there should be almost no performance penalty in the normal case, and I could still handle the extreme case.
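
For illustration, here is a minimal sketch of that marker idea (hypothetical names and an assumed 8-bit marker; not the actual robin_hood internals):

#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// One slot of the table. `dist` is the distance to the home bucket,
// shifted by one so that 0 can mean "empty".
struct Slot {
    std::uint8_t dist = 0;
    std::uint64_t key = 0;
};

void insert(std::vector<Slot>& slots, std::uint64_t key, std::size_t home) {
    std::uint8_t dist = 1; // in the home bucket itself
    for (std::size_t i = home;; i = (i + 1) % slots.size(), ++dist) {
        if (dist == UINT8_MAX) {
            // The marker would overflow. Today this throws; the idea above is
            // to give this value a special meaning and switch to a slow
            // linear search instead.
            throw std::overflow_error("robin hood marker overflow");
        }
        if (slots[i].dist == 0) { // empty slot: take it
            slots[i] = {dist, key};
            return;
        }
        if (slots[i].dist < dist) { // entry closer to home: displace it
            std::swap(dist, slots[i].dist);
            std::swap(key, slots[i].key);
        }
    }
}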

martinus avatar martinus commented on July 22, 2024 1

Well, it's always Robin Hood hashing, even when there are lots of collisions. Collision handling is always a linear search. When you remove the colliding elements, there will be less linear searching.

felixguendling avatar felixguendling commented on July 22, 2024

and without losing performance

If this is not possible, I could imagine a solution that lets the user choose either pure performance (throw on overflow, as it is now) or slightly lower performance with graceful error handling. Of course, if it is possible to have both (graceful overflow handling and performance), this option is obsolete.

asbai avatar asbai commented on July 22, 2024

I think we can achieve this by sacrificing time or space efficiency only when the extreme situations above occur. This ensures that we maintain optimal performance in most cases while continuing to work under extreme conditions.

asbai avatar asbai commented on July 22, 2024

I could give the maximum value of the marker a special meaning where I switch to a slow linear search instead

Does this mean that in that case all items in the container degrade to linear probing? If so, is the container likely to promote itself back to Robin Hood hashing automatically later on?

asbai avatar asbai commented on July 22, 2024

If I understand correctly, you mean there will be two linear search algorithms: the one we currently use (the fast version), and a slower one used to handle the overflow case (when the marker reaches its maximum), right?

So we can now conclude that in the new implementation, the faster linear search algorithm may be replaced by the slower one. Then:

  1. Does this replacement occur globally in the given container, or only in the parts of the container where collisions are more intense?
  2. Is there a reverse process for this replacement? I.e., if collisions in a container are no longer so intense, can the slower linear search be replaced by the fast one again automatically?

tblut avatar tblut commented on July 22, 2024

Is there any update on this? I'm currently using your excellent implementation for sparse grids. The problem is that sometimes I get an overflow exception, which unfortunately forces me back to the much slower and more memory-hungry, but perfectly stable, std::unordered_map ...
I would be very happy if you added an optional setting that makes the map 100% stable. Even if it made the map a little slower, it would still be much faster than the STL version!

martinus avatar martinus commented on July 22, 2024

What hash are you using? With a reasonably good hash this should never happen. Can you post just the hash code?

tblut avatar tblut commented on July 22, 2024

I'm using your default hash function with values computed as follows:

// map the point into grid cell coordinates (each fits in 10 bits here)
uint32_t cellX = static_cast<uint32_t>((point.x - bounds.min.x) / bounds.size.x * gridResolution);
uint32_t cellY = static_cast<uint32_t>((point.y - bounds.min.y) / bounds.size.y * gridResolution);
uint32_t cellZ = static_cast<uint32_t>((point.z - bounds.min.z) / bounds.size.z * gridResolution);
// pack the three 10-bit cell coordinates into a single 30-bit key
uint32_t cellIndex = cellX | (cellY << 10) | (cellZ << 20);

auto& entry = grid[cellIndex]; // grid is a robin_hood::unordered_map<uint32_t, ...>

Edit: I just remembered that I "fixed" the exceptions by changing my key to the code above. I used this key before, and it caused the exception more often:
cellIndex = cellX + cellY * gridResolution + cellZ * gridResolution * gridResolution

Also pretty interesting is that the overflow occurs when I simply copy all the elements from one map into another:

for (const auto& entry : source.grid) {
    target.grid[entry.first] = entry.second;
}

martinus avatar martinus commented on July 22, 2024

Can you share a reproducer? What compiler are you using, and do you compile for 64 or 32 bit?

tblut avatar tblut commented on July 22, 2024

I wrote you an email with the link to the minimal reproducer. I'm using Visual Studio 16.4.2 and only compile for 64 bit. Thanks!

martinus avatar martinus commented on July 22, 2024

Thanks for the reproducer! I could reproduce it; here's the issue:

for (const auto& entry : sourceNode.grid) {
	targetNode.grid[entry.first] = entry.second;
}

You iterate a large hashmap and insert all the entries into the targetNode.grid map. All hashmaps use the same hash function, so you're basically iterating from lowest to highest hash value.

When inserting, since targetNode.grid is small, every inserted element first causes a collision (because their hashes are so small), so eventually robin_hood throws in the towel and raises an exception. std::unordered_map doesn't throw, but insertion will be slow because it boils down to a linear search.

There is no ideal way to fix this problem. In your case, you can do this in PointCloudConverter.cpp:

void PointCloudConverter::MergeOctreesInternal(Node& targetNode, Node& sourceNode,
		uint32_t depth, uint32_t partitionDepth, uint32_t partitionIndex) {
	targetNode.grid.reserve(targetNode.grid.size() + sourceNode.grid.size());
	for (const auto& entry : sourceNode.grid) {
		targetNode.grid[entry.first] = entry.second;
	}

With that reserve() call the map already has plenty of capacity for all the data to be inserted, so it doesn't suffer from the hash collisions caused by in-hash-order insertion. With that change it works for me.

martinus avatar martinus commented on July 22, 2024

On second thought, this is really an issue of poor hash quality. If you replace robin_hood's hash_int with this code, the issue also doesn't happen (and then there is also no need for reserve()):

inline size_t hash_int(uint64_t obj) noexcept {
#if ROBIN_HOOD(HAS_UMUL128)
    uint64_t h;
    uint64_t l = detail::umul128(obj ^ UINT64_C(0xa0761d6478bd642f), UINT64_C(0xe7037ed1a0b428db), &h);
    return h ^ l;
#elif ROBIN_HOOD(BITNESS) == 32
    uint64_t const r = obj * UINT64_C(0xca4bcaa75ec3f625);
    auto h = static_cast<uint32_t>(r >> 32U);
    auto l = static_cast<uint32_t>(r);
    return h + l;
#else
    // murmurhash 3 finalizer
    uint64_t h = obj;
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccd;
    h ^= h >> 33;
    h *= 0xc4ceb9fe1a85ec53;
    h ^= h >> 33;
    return static_cast<size_t>(h);
#endif
}
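
If you don't want to patch the header, an alternative sketch (untested, StrongIntHash is just a placeholder name) is to wrap such a mix in a custom hash functor; robin_hood's maps take a Hash template parameter like the std containers do:

#include <cstddef>
#include <cstdint>
#include <robin_hood.h>

struct StrongIntHash {
    size_t operator()(uint64_t h) const noexcept {
        // murmurhash 3 finalizer, as in the fallback branch above
        h ^= h >> 33;
        h *= 0xff51afd7ed558ccd;
        h ^= h >> 33;
        h *= 0xc4ceb9fe1a85ec53;
        h ^= h >> 33;
        return static_cast<size_t>(h);
    }
};

// uint32_t keys are promoted to uint64_t when the functor is called
robin_hood::unordered_map<uint32_t, float, StrongIntHash> grid;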

tblut avatar tblut commented on July 22, 2024

Finally came around to trying your other hash function, and it seems to work for all the datasets that I have. In my opinion, a fallback to a slower linear search should still be added for absolute stability. The issue is that if my program needs to run for multiple hours, I need to be sure that it won't crash due to a very rare sequence of hashes.

martinus avatar martinus commented on July 22, 2024

You are of course right, a linear search fallback would be important to have. I don't currently have the time for this, though. But I'm at least working on an improved hash.

martinus avatar martinus commented on July 22, 2024

I believe I've (finally) got a simple & fast reproducer:

TEST_CASE("overflow_finder_simple") {
    robin_hood::unordered_set<uint64_t> set;

    for (uint64_t i = 0; i < UINT64_C(1000000); ++i) {
        auto h = robin_hood::hash<uint64_t>{}(i);
        if (UINT64_C(0) == (h & UINT64_C(0x1ff))) {
            set.insert(i);
        }
    }
}

This shows that the lower bits of the hash are very bad, which is what can lead to the overflow.

Mitigation strategy: create a very simple initial mixing hash that's part of the robin_hood table, which basically multiplies the value with a random odd number; the result is then passed on to the robin_hood hash. When an overflow happens, add an odd constant to the mix multiplier and rehash the map. (A sketch follows after the list below.)

Advantages/disadvantages of this approach:

  • No need to refactor the infobyte array, which would produce some slowdowns.
  • Only a single multiplication is needed at hashing => not much performance slowdown.
  • Allows using the hash's address as the initial multiplier, to be a bit more resilient against overflow attacks.
  • An implementation of rehashing is necessary - but it can probably be implemented as a copy of the map.
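
For illustration, a minimal sketch of that mitigation (hypothetical names; not the actual implementation):

#include <cstdint>

// Per-table mixer: multiply by an odd number before the regular hash;
// on overflow, perturb the multiplier and rehash the whole table.
struct Mixer {
    std::uint64_t multiplier = UINT64_C(0x9e3779b97f4a7c15); // any odd start value

    std::uint64_t mix(std::uint64_t x) const noexcept {
        return x * multiplier; // only a single multiplication per hash
    }

    void onOverflow() noexcept {
        multiplier += 2; // stays odd; changes every probe sequence
        // ...then rehash, e.g. by copying all entries into a fresh map
    }
};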

roncapat avatar roncapat commented on July 22, 2024

Sorry to step in, but I'm having trouble with the hashing of grid indices (so, two ints; I'm working on a discrete spatial grid). Should I be able to fix the error by swapping in the hash_int proposed above? @martinus

I'm getting map overflow exceptions. I tried the above solution, but it seems to slow down my algorithm.

Do you know of more specific and efficient hash functions for a continuous two-dimensional range of integers (i.e., grid coordinates)?

cculianu avatar cculianu commented on July 22, 2024

I have been getting this exception extremely rarely when inserting sha256-hashed data -- truly random data, really -- into the hash table (I just take the first 64 bits of the 256-bit number)!

Is this still some sort of known issue? I have seen it with so-called "good" hash functions, but extremely rarely.
