Giter Site home page Giter Site logo

Comments (8)

samuel-lucas6 avatar samuel-lucas6 commented on May 25, 2024

cc @ericlagergren since hash() seems to follow this implementation too.

from forro.

iagora avatar iagora commented on May 25, 2024

Hi! I am one of the authors of Forró, and I am happy to see the interest in Forró.

You are correct regarding the words used in the output; the only issue pertains to the ordering.

At the beginning of this year, we initiated a fork of libsodium with the aim of incorporating Forró in order to conduct performance measurements alongside the chacha implementation in libsodium, while we worked on formalizing HForró14, XForró, and similar components. The resulting implementation can be found here, and it is similar to what you have, with only a difference in the order. I believe the correct representation would be as follows:

mem.writeIntLittle(u32, out[0..4], x[4]);
mem.writeIntLittle(u32, out[4..8], x[5]);
mem.writeIntLittle(u32, out[8..12], x[6]);
mem.writeIntLittle(u32, out[12..16], x[7]);
mem.writeIntLittle(u32, out[16..20], x[12]);
mem.writeIntLittle(u32, out[20..24], x[13]);
mem.writeIntLittle(u32, out[24..28], x[14]);
mem.writeIntLittle(u32, out[28..32], x[15]);

from forro.

samuel-lucas6 avatar samuel-lucas6 commented on May 25, 2024

Hi @iagora, nice to have you here.

I understand the reasoning for your order, but I believe I am correct. From the XChaCha draft:

Once the 20 ChaCha rounds have been completed, the first 128 bits and last 128 bits of the ChaCha state (both little-endian) are concatenated, and this 256-bit subkey is returned.

And from the XSalsa20 paper (p. 5):

(x0, x5, x10, x15) is the Salsa20 constant,
(x1, x2, x3, x4, x11, x12, x13, x14) is a 256-bit key, and
(x6, x7, x8, x9) is a 128-bit nonce.

HSalsa20 then outputs the 256-bit block (z0, z5, z10, z15, z6, z7, z8, z9). The indices 0, 5, 10, 15, 6, 7, 8, 9 here were not chosen arbitrarily; the choice is important for the security proof later in this paper.

Both output the constant then nonce words, and with HChaCha20, the counter is before the nonce in the state like Forró.

My produced test vector was:

output = 9754128339bd105377908eb53d7f238e7b3732cc48383052d35fd94c943db866
key = 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f
nonce = 000000090000004a0000000031415927

from forro.

jedisct1 avatar jedisct1 commented on May 25, 2024

Obviously, that doesn't make any security difference. (4,5,6,7),(12,13,14,15) might be a micro optimization on some platforms, but since the paper doesn't cover HForro, maybe being consistent with Salsa/ChaCha would indeed be better.

from forro.

iagora avatar iagora commented on May 25, 2024

I was more worried with the possibility that in the first round of XForró, we'd operate two words that have recently been together in the same subround, we observed that it can stiffle diffusion (for example,(4,7,6,5),(14,13,12,15) would put us in a weird position, where it would take a few rounds to recover), and Forró's whole thing is "more diffusion in less operation" (unfortunately, not less cycles in some platforms as you have seen). (4,5,6,7),(12,13,14,15) didn't create that problem, so we stuck with it during our studies, but I brought your concerns (@samuel-lucas6) to @murcoutinho, and since (6,7,14,15),(4,5,12,13) doesn't create this problem, we agreed that it's less confusing and more friendly to follow what has been established before us, as @jedisct1 pointed out, so we will adjust.

Cheers.

P.S.: And regarding performance, if any of you are interested (including @ericlagergren), I'd be happy to help. Currently, the "more diffusion in less operations" approach only guarantees better performance in in-order execution or more limited devices. Unrolling rounds work, but it seems to be limited by the decoded instructions buffer size, and it differs from microarchitecture to microarchitecture. The annoying part would be how to distribute it.

P.P.S.: Sorry about the mistaken PR at the end of may (@jedisct1).

from forro.

samuel-lucas6 avatar samuel-lucas6 commented on May 25, 2024

I was more worried with the possibility that in the first round of XForró, we'd operate two words that have recently been together in the same subround, we observed that it can stiffle diffusion

My bad, I definitely didn't understand the reasoning for your order.

I brought your concerns (samuel-lucas6) to murcoutinho, and since (6,7,14,15),(4,5,12,13) doesn't create this problem, we agreed that it's less confusing and more friendly to follow what has been established before us

Thank you for investigating that. I don't mind either way as long as it's secure and everybody is interoperable, but I agree with that logic.

P.S.: And regarding performance, if any of you are interested (including ericlagergren), I'd be happy to help

I'd happily take you up on that as I know very little about optimising implementations. However, I'm quite busy writing my dissertation at the moment.

from forro.

jedisct1 avatar jedisct1 commented on May 25, 2024

On Apple M1 and with this implementation, the best performance is achieved by inlining all the rounds. Inlining only 2 decreases performance.

So, yeah, this is very CPU-dependent.

On AMD, prefetching the next cache lines can something helps a lot.

from forro.

iagora avatar iagora commented on May 25, 2024

I expressed myself poorly regarding the unroll, I indeed inline everything. What I manually unrolled was two executions. Say you want a 1024 bit stream, I run both counters in the same scope, so that the OOO executor can fetch from the next execution, since they'd have no direct dependency, the instruction is pretty much guaranteed to retire if it's prefetched, that's what in the paper we call Xote. When I finish work today, I'll make some time to get this code running Xote, if you'd entertain running it in your M1, I'd be curious to know if the performs better.

from forro.

Related Issues (2)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.