Comments (17)
Not sure if this fork wants to try this instead:
LuaJIT/LuaJIT#294
from luajit2.
We won't consider O(n) hashing in this branch either. I'd suggest avoiding Lua strings in the first place for large string buffers (for example by introducing a string.buffer builtin data structure or using FFI C buffers).
from luajit2.
Thanks for the troubleshooting @stone-wind. It is a valid concern for your specific use case; unfortunately, properly fixing it inside OpenResty would be harder.
The original string hash algorithm in LuaJIT is very fast: it has constant time complexity for strings of arbitrary length. The CRC32 variant https://github.com/openresty/luajit2 uses is intended to make hash collision attacks harder by randomizing the positions the hash function samples while calculating the hash. It is slower than the original hash function, but nevertheless still constant time.
To actually address your issue of a large number of strings with the same length and almost identical content, we have to resort to an O(n) hash algorithm, because we have no way of knowing beforehand where the differences in the strings are.
The CRC32 hash implementation does use every byte for the CRC32 calculation for sizes up to 127 bytes. So if you can change your program to read only 127 bytes of data from the socket at a time, it should not suffer this many collisions.
If that is not acceptable, another fix we may consider is allowing users to compile LuaJIT to use all bytes of the string for the CRC32 calculation, downgrading the algorithm to O(n) but making it more resilient against collisions between very similar long strings.
Also a word of caution about using LuaJIT/LuaJIT#294: we have never tested it in OpenResty and cannot guarantee it does not introduce bugs in the LuaJIT core.
from luajit2.
we have to resort to an O(n) hash algorithm
Just $0.02 but my intuition is that constant factors will be at least as important as asymptotic cost:
- Throughput: How many bytes can we hash per CPU cycle?
- Fixed cost: How many cycles is the per-string cost, excluding the per-byte cost?
Does anybody know what is the current state of the art hashing algorithm and how it performs in these ways? Or know what numbers would be acceptable for OpenResty? Or what numbers we achieve today?
I am especially wondering whether an afternoon of Googling might turn up a .h file containing a good all-round hash function that solves the problem for all workloads.
Here is the little bit of hard data I've been able to turn up right now:
- CRC32 performance is limited in hardware to 8 bytes per cycle. On Intel CPUs the 64-bit CRC32 instruction takes three cycles to complete, and the CPU can execute three of them in parallel, and so the throughput limit is 64 bits per cycle. (Source: Agner).
- Our current CRC32 implementation issues two CRC32 instructions in parallel and so it should be limited to 5.33 bytes per cycle of throughput i.e. completes max 2 x 64-bit CRC instructions every three cycles.
- Meow Hash claims to achieve 16 bytes per cycle of throughput. That would be 2x the hardware limit for CRC32 and 3x the limit of our current routine. (Just picked that one from a random google search.)
These are only upper bounds on throughput. Maybe actual performance will be more dominated by fixed costs or memory access. I'm just trying to frame the problem in quantitative terms and establish what level of performance we want and what we should realistically be able to achieve. That might help to decide whether it's better to hash whole strings or if we really have to sample them.
from luajit2.
Just an aside thought:
Could an interesting feature be string creation without interning?
I am thinking that the cost of hashing and interning strings of binary data probably outweighs the benefit. The interning would be valuable if the string were already present in the table, or was going to be used as a table key, or was going to be used for equality tests against other strings. However none of these things seems to be true for strings containing e.g. raw binary data received from a socket.
So the idea would be to create those strings using a special reserved hash value (-1) that means they have not been hashed, have not been interned, cannot be used as table keys (error), cannot be tested for equality using pointer value. However, they can be used for I/O, they can be used for string library functions, and data extracted from them will be ordinary strings.
This would introduce a certain amount of new complexity - probably too much - but it would also simplify some things e.g. not having to worry about the cost of hashing/interning buffers of binary data, and not having to account for hashing while benchmarking applications e.g. accidentally sending test data that will intern much better/worse than live traffic.
(I suppose this is the same distinction in many other languages where some string data is interned and others is not and go by names like strings, string buffers, symbols, binaries, etc.)
EDIT: Could also be a step towards being able to use string module functions to operate on strings allocated in FFI memory. I have the feeling when I write OpenResty code that really I'd prefer binary data to be stored in C memory, but that's in conflict with the idiom of operating on it using Lua string routines.
from luajit2.
@lukego Do you mean an FFI version of the cosocket API? Something like socket:recv(bufp, len)?

It is possible to implement a string module equivalent on top of an FFI char buffer; it is just plain old C work. And we can also implement an FFI version of the ngx.re API.
from luajit2.
So the idea would be to create those strings using a special reserved hash value (-1) that means they have not been hashed, have not been interned, cannot be used as table keys (error), cannot be tested for equality using pointer value. However, they can be used for I/O, they can be used for string library functions, and data extracted from them will be ordinary strings.
I already made a string buffer system that works in place of strings for some of the string library functions, and experimented with having the buffers pretend to be string objects.
I also tried to extend it to work in the FFI system so it could work with binary data.
from luajit2.
Do you mean an FFI version of the cosocket API? Something like socket:recv(bufp, len)?
That would be the direct approach and probably a better idea if it's feasible to migrate to that.
I was musing here about an indirect approach of adding a special sub-type of Lua strings that are non-interned and more suitable for binary data while mostly compatible with existing APIs. This would avoid the problem on this issue #60 because you would skip the checksum on binary network data, and that would reduce the pressure to settle for a collision-prone checksum algorithm on other strings. Everybody wins? It might be easy to implement on the VM side but definitely introduces new complexity for users.
(I'm thinking about Erlang where on the C side they had efficient strings but on the Erlang side they had inefficient ones and they invented a new type - "binary" - that worked as a good compromise.)
from luajit2.
Cool @fsfod!
from luajit2.
I have made an example hash-test.lua for testing:
local ori = "I was musing here about an indirect approach of adding a special sub-type of Lua strings that are non-interned and more suitable for binary data while mostly compatible with existing APIs. This would avoid the problem on this issue #60 because you would skip the checksum on binary network data, and that would reduce the pressure to settle for a collision-prone checksum algorithm on other strings. Everybody wins?"
local cnt = 1000
local rm = {}
for i = 1, cnt do
  rm[i] = tostring(math.random(100000000000, 1000000000000)) .. "-"
  rm[i] = rm[i] .. rm[i] .. rm[i]
end

local s = os.clock()
local pre, cur = s
local raw = {}
for i = 1, cnt do
  raw[i] = ori .. rm[i] .. "The return and break statements can only be written as the last statement of a block."
  if i > cnt - 5 then  -- time the last five iterations individually
    cur = os.clock()
    print(cur - pre)
    pre = cur
  end
end
local e = os.clock()
print("total:", e - s)
when cnt is 2000:
0.154724
0.000195
0.000195
0.00017900000000001
0.000196
total: 0.155491
when cnt is 20000:
14.531935
0.0015210000000003
0.0015210000000003
0.0014779999999988
0.0015999999999998
total: 14.53806
Obviously, collisions become more frequent as cnt grows. Then I changed the hash function and tested again:
when cnt is 1000:
0.00488
0.000135
1.2999999999999e-05
9.0000000000003e-06
2.9000000000001e-05
total: 0.005086
when cnt is 20000:
0.030095
2.4000000000003e-05
4.9999999999981e-06
3.9999999999971e-06
3.000000000003e-06
total: 0.030134
when cnt is 200000:
0.163455
2.9000000000001e-05
4.000000000004e-06
2.9999999999752e-06
3.0000000000308e-06
total: 0.163497
Obviously better; you can verify it yourself. The new hash function is wyhash. I changed lj_str_hash in lj_str_hash_x64.h:
#include "wyhash.h"

static LJ_AINLINE uint32_t lj_str_hash(const char *str, size_t len)
{
  static uint64_t seed;
  if (!seed) seed = wygrand();
  return wyhash(str, len, seed);
}
Wyhash is better than crc32_hw: it does not need SSE, it is simple (header-only, less C code), fast, and solid, with fewer collisions. Below is the performance data from SMHasher:
| Hash function | MiB/sec | cycles/hash | Quality problems |
|---|---|---|---|
| crc32 | 544.84 | 92.65 | insecure, 8589.93x collisions, distrib |
| crc32_hw | 9310.60 | 24.63 | insecure, 100% bias, collisions, distrib, machine-specific (x86 SSE4.2) |
| crc32_hw1 | 30145.28 | 30.98 | insecure, 100% bias, collisions, distrib, machine-specific (x86 SSE4.2) |
| crc64_hw | 10181.36 | 23.46 | insecure, 100% bias, collisions, distrib, machine-specific (x86_64 SSE4.2) |
| wyhash | 16528.58 | 17.67 | |
| wyhash32lo | 15933.47 | 17.77 | |
LuaJIT has done so much work on interning that I think we should add a function for strings to retrieve the hash value, so we can fully use the results. The hash value is useful, and wyhash is also a good PRNG.
from luajit2.
@lukego LuaJIT/LuaJIT#294 has been used in Tarantool's LuaJIT fork in production for several years, since this version of the patch was created. (In fact Nick, the patch co-author who helped me debug the patch and suggested some improvements, was a member of the Tarantool core team at that moment.)
https://github.com/tarantool/LuaJIT
https://tarantool.io
from luajit2.
The hash function used in the patch is not essential; you could replace it. The main part is the collision detection and the switch to O(n) hashing.
from luajit2.
I mean "dynamically switch to O(n) hashing if a long collision chain is detected". 99% of the time hashing remains constant-time (O(1)). Only if bad things happen and a lot of strings with the same hash value and length are inserted is O(n) hashing used, and only for strings with a similar hash value (filtered with a Bloom filter). And if such strings are collected by the GC, the filter is cleared as well.
from luajit2.
This looks very complex and I don't think we should go down this route. My hunch is that it would be very difficult to debug when bad things happen in such a sophisticated implementation.
from luajit2.
@agentzh did you read the patch? It is not that sophisticated.
Sorry for being noisy. I'll shut up at this point.
from luajit2.
@funny-falcon Yes, I read the patch. It's quite far-reaching; it even touches the GC part.
from luajit2.
chunk_num = 16;
chunk_sz = len / chunk_num;
chunk_sz_log2 = log2_floor(chunk_sz);

pos1 = get_random_pos_unsafe(chunk_sz_log2, 0);
pos2 = get_random_pos_unsafe(chunk_sz_log2, 1);

h1 = lj_crc32_u32(0, len ^ seed);
h2 = 0;

/* loop over 14 chunks, 2 chunks at a time */
for (i = 0, chunk_ptr = str; i < (chunk_num / 2 - 1);
     chunk_ptr += chunk_sz, i++) {
  v = *cast_uint64p(chunk_ptr + pos1);
  h1 = lj_crc32_u64(h1, v);
  v = *cast_uint64p(chunk_ptr + chunk_sz + pos2);
  h2 = lj_crc32_u64(h2, v);
}
The comment says the loop processes two chunks at a time, but each iteration advances chunk_ptr by only one chunk_sz. Is this a bug?
from luajit2.