I believe I was the first person to implement HForró, and I noticed our implementation

cc <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

HForro doesn't align with HChaCha20/HSalsa20 about forro HOT 8 OPEN

jedisct1 commented on May 25, 2024

HForro doesn't align with HChaCha20/HSalsa20

from forro.

Comments (8)

samuel-lucas6 commented on May 25, 2024

cc @ericlagergren since hash() seems to follow this implementation too.

from forro.

iagora commented on May 25, 2024

Hi! I am one of the authors of Forró, and I am happy to see the interest in Forró.

You are correct regarding the words used in the output; the only issue pertains to the ordering.

At the beginning of this year, we initiated a fork of libsodium with the aim of incorporating Forró in order to conduct performance measurements alongside the chacha implementation in libsodium, while we worked on formalizing HForró14, XForró, and similar components. The resulting implementation can be found here, and it is similar to what you have, with only a difference in the order. I believe the correct representation would be as follows:

mem.writeIntLittle(u32, out[0..4], x[4]);
mem.writeIntLittle(u32, out[4..8], x[5]);
mem.writeIntLittle(u32, out[8..12], x[6]);
mem.writeIntLittle(u32, out[12..16], x[7]);
mem.writeIntLittle(u32, out[16..20], x[12]);
mem.writeIntLittle(u32, out[20..24], x[13]);
mem.writeIntLittle(u32, out[24..28], x[14]);
mem.writeIntLittle(u32, out[28..32], x[15]);

from forro.

samuel-lucas6 commented on May 25, 2024

Hi @iagora, nice to have you here.

I understand the reasoning for your order, but I believe I am correct. From the XChaCha draft:

Once the 20 ChaCha rounds have been completed, the first 128 bits and last 128 bits of the ChaCha state (both little-endian) are concatenated, and this 256-bit subkey is returned.

And from the XSalsa20 paper (p. 5):

(x0, x5, x10, x15) is the Salsa20 constant,
(x1, x2, x3, x4, x11, x12, x13, x14) is a 256-bit key, and
(x6, x7, x8, x9) is a 128-bit nonce.

HSalsa20 then outputs the 256-bit block (z0, z5, z10, z15, z6, z7, z8, z9). The indices 0, 5, 10, 15, 6, 7, 8, 9 here were not chosen arbitrarily; the choice is important for the security proof later in this paper.

Both output the constant then nonce words, and with HChaCha20, the counter is before the nonce in the state like Forró.

My produced test vector was:

output = 9754128339bd105377908eb53d7f238e7b3732cc48383052d35fd94c943db866
key = 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f
nonce = 000000090000004a0000000031415927

from forro.

jedisct1 commented on May 25, 2024

Obviously, that doesn't make any security difference. (4,5,6,7),(12,13,14,15) might be a micro optimization on some platforms, but since the paper doesn't cover HForro, maybe being consistent with Salsa/ChaCha would indeed be better.

from forro.

iagora commented on May 25, 2024

I was more worried with the possibility that in the first round of XForró, we'd operate two words that have recently been together in the same subround, we observed that it can stiffle diffusion (for example,(4,7,6,5),(14,13,12,15) would put us in a weird position, where it would take a few rounds to recover), and Forró's whole thing is "more diffusion in less operation" (unfortunately, not less cycles in some platforms as you have seen). (4,5,6,7),(12,13,14,15) didn't create that problem, so we stuck with it during our studies, but I brought your concerns (@samuel-lucas6) to @murcoutinho, and since (6,7,14,15),(4,5,12,13) doesn't create this problem, we agreed that it's less confusing and more friendly to follow what has been established before us, as @jedisct1 pointed out, so we will adjust.

Cheers.

P.S.: And regarding performance, if any of you are interested (including @ericlagergren), I'd be happy to help. Currently, the "more diffusion in less operations" approach only guarantees better performance in in-order execution or more limited devices. Unrolling rounds work, but it seems to be limited by the decoded instructions buffer size, and it differs from microarchitecture to microarchitecture. The annoying part would be how to distribute it.

P.P.S.: Sorry about the mistaken PR at the end of may (@jedisct1).

from forro.

samuel-lucas6 commented on May 25, 2024

I was more worried with the possibility that in the first round of XForró, we'd operate two words that have recently been together in the same subround, we observed that it can stiffle diffusion

My bad, I definitely didn't understand the reasoning for your order.

I brought your concerns (samuel-lucas6) to murcoutinho, and since (6,7,14,15),(4,5,12,13) doesn't create this problem, we agreed that it's less confusing and more friendly to follow what has been established before us

Thank you for investigating that. I don't mind either way as long as it's secure and everybody is interoperable, but I agree with that logic.

P.S.: And regarding performance, if any of you are interested (including ericlagergren), I'd be happy to help

I'd happily take you up on that as I know very little about optimising implementations. However, I'm quite busy writing my dissertation at the moment.

from forro.

jedisct1 commented on May 25, 2024

On Apple M1 and with this implementation, the best performance is achieved by inlining all the rounds. Inlining only 2 decreases performance.

So, yeah, this is very CPU-dependent.

On AMD, prefetching the next cache lines can something helps a lot.

from forro.

iagora commented on May 25, 2024

I expressed myself poorly regarding the unroll, I indeed inline everything. What I manually unrolled was two executions. Say you want a 1024 bit stream, I run both counters in the same scope, so that the OOO executor can fetch from the next execution, since they'd have no direct dependency, the instruction is pretty much guaranteed to retire if it's prefetched, that's what in the paper we call Xote. When I finish work today, I'll make some time to get this code running Xote, if you'd entertain running it in your M1, I'd be curious to know if the performs better.

from forro.

HForro doesn't align with HChaCha20/HSalsa20 about forro HOT 8 OPEN

Comments (8)

Related Issues (2)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent