blake3-team / blake3
The official Rust and C implementations of the BLAKE3 cryptographic hash function.
License: Apache License 2.0
I read in the paper that BLAKE3 can produce up to 4 GB of output. In C, is this achieved by batching calls to blake3_hasher_finalize? If so, what is the largest size that function can handle?
A bit unrelated: does Rust use some kind of streaming to do this? I'm not familiar with Rust, so a simple high-level explanation of how the code does it would be great.
Thanks!
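For what it's worth, the C API in c/blake3.h exposes extended output through blake3_hasher_finalize_seek, which writes any slice of the output stream starting at a given offset, so a long output can be produced in pieces rather than in one giant buffer. A minimal sketch (the 4 KiB piece size and the loop count are arbitrary choices, not anything the library requires):

```c
#include <stdint.h>
#include <string.h>
#include "blake3.h"

int main(void) {
  blake3_hasher hasher;
  blake3_hasher_init(&hasher);
  blake3_hasher_update(&hasher, "example input", 13);

  /* Stream extended output in fixed-size pieces by advancing the seek
   * offset; finalize_seek does not consume the hasher state, so it can
   * be called repeatedly. */
  uint8_t block[4096];
  uint64_t offset = 0;
  for (int i = 0; i < 4; i++) {  /* 16 KiB of output in total */
    blake3_hasher_finalize_seek(&hasher, offset, block, sizeof(block));
    offset += sizeof(block);
    /* use block[0..4096) here, e.g. write it to a file */
  }
  return 0;
}
```

The same technique covers the Rust side conceptually: the output is a seekable stream, and callers pull from it in whatever chunk size they like.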
It'd be really nice for BLAKE3 to follow RustCrypto's digest crate, to allow BLAKE3 to substitute for other hash functions in generic code.
Hasher is a poor name here; Blake3 would be cleaner. I expect to see many instances of use blake3::Hasher as Blake3. A type alias suffices to avoid a breaking change: type Hasher = Blake3.
These traits can be implemented without a breaking change. However, it'd be cleaner to only implement digest's traits, and any other breaking changes could be batched. The release is early, and much more usage will come once the digest traits are implemented.
Great work BLAKE3-team!
I've extracted a minimal test case reproducing a very strange performance cliff in my application - strange because it can be worked around in a way that suggests that the buffering algorithm in blake3_hasher_update isn't working properly.
I'm using the C port in my app, I haven't had a chance to try and get the Rust version going and see if it has the same issue, but from reading the two the buffering algorithm looks the same to me.
To produce the problem, we need to start out by hashing something small, eg. a size_t integer (in my real application this would be the length of the following string). Then follow with a big buffer.
It's not surprising that this would be measurably slower than just hashing the big buffer without the integer first, since it requires retaining a partial block of data. What is strange is that the entire operation on the big buffer is much slower, not just the hashing of the first full block of data.
And even stranger, you can restore back essentially full performance by breaking the blake3_hasher_update call for the big block of data into two, on the boundary of 4096 bytes minus however many bytes you've already hashed. (Yes, 4096 - not BLAKE3_CHUNK_LEN, 1024, like any sensible bug would have chosen.)
In fact making more calls to blake3_hasher_update makes the whole thing 3x faster even though the same data is processed in the end. Obviously, if this is really expected, we should be doing that inside blake3_hasher_update.
Code to reproduce:
#include "blake3.h"
#include <stdlib.h>
#include <strings.h>
#include <stdio.h>
#include <time.h>
int main() {
blake3_hasher hasher;
uint8_t output[BLAKE3_OUT_LEN];
size_t reps = 1000; // arbitrary
size_t small_value = 0;
size_t buf_size = 1024*1024; // 1mb; fairly arbitrary, 256kb gives the same result, as does 1024*1024 - sizeof(size_t)
unsigned char *buf = malloc(buf_size);
bzero(buf, buf_size);
// variant 1: hash a size_t, then hash the remainder of the 1mb buffer
clock_t start1 = clock();
for (int rep = 0; rep < reps; rep++) {
blake3_hasher_init(&hasher);
blake3_hasher_update(&hasher, &small_value, sizeof(small_value));
blake3_hasher_update(&hasher, buf, buf_size);
blake3_hasher_finalize(&hasher, output, BLAKE3_OUT_LEN);
}
clock_t end1 = clock();
printf("variant 1 ran in %f, hash ", ((float)(end1 - start1))/CLOCKS_PER_SEC);
for (size_t i = 0; i < BLAKE3_OUT_LEN; i++) {
printf("%02x", output[i]);
}
printf("\n");
// variant 2: hash a size_t, then hash 4k minus the size of a size_t, then hash the remainder of the 1mb buffer
clock_t start2 = clock();
for (int rep = 0; rep < reps; rep++) {
blake3_hasher_init(&hasher);
blake3_hasher_update(&hasher, &small_value, sizeof(small_value));
size_t magic_amount_to_update = 4096 - sizeof(small_value); // <-- 4096 is the magic number; 1024 (BLAKE3_CHUNK_LEN) does not produce a significant speedup; 2048 produces a partial speedup
blake3_hasher_update(&hasher, buf, magic_amount_to_update); // <-- new line
blake3_hasher_update(&hasher, buf + magic_amount_to_update, buf_size - magic_amount_to_update); // <-- remainder of the full buffer
blake3_hasher_finalize(&hasher, output, BLAKE3_OUT_LEN);
}
clock_t end2 = clock();
printf("variant 2 ran in %f, hash ", ((float)(end2 - start2))/CLOCKS_PER_SEC);
for (size_t i = 0; i < BLAKE3_OUT_LEN; i++) {
printf("%02x", output[i]);
}
printf("\n");
return 0;
}
On my Macbook Pro I get:
variant 1 ran in 0.946282, hash 80b378d1f0bc7ee2e9c96dc30108187a03779c8dfd951047bf6c99ce88be2ff6
variant 2 ran in 0.302671, hash 80b378d1f0bc7ee2e9c96dc30108187a03779c8dfd951047bf6c99ce88be2ff6
The comment in test_vectors.json says:
For derive_key, the test input is used as the input key, and the context string is 'BLAKE3 2019-12-27 16:29:52 example context'.
BLAKE3/test_vectors/test_vectors.json
Line 2 in 80260dc
However, the test vectors were actually computed with the context string 'BLAKE3 2019-12-27 16:29:52 test vectors context'. Citations:
BLAKE3/test_vectors/src/lib.rs
Line 32 in 2fac744
BLAKE3/test_vectors/src/lib.rs
Lines 93 to 97 in 2fac744
By following the source code rather than the comment, I am able to reproduce the test vectors with my independent implementation of BLAKE3.
NEON is always supported on AArch64, so there's no need to make it conditional on the c_neon flag. I don't think build.rs can enable features, so we might have to write target_arch = "aarch64" OR feature = "c_neon" everywhere, which is doable but a bit unfortunate. (Not so different from how we have to say x86 OR x86_64 all over the place.)
I'd love to get some more information on the methodology used to produce the benchmark data in the README. In particular, which SHA-1 and SHA-256 implementations did you benchmark against, and what hardware extensions did those implementations support?
Thank you for the incredible work on BLAKE3; I'm looking forward to using it, especially the parallel hashing support.
Bitdefender is removing b3sum and preventing a successful build using cargo install b3sum.
Error:
The file \cargo-installJYtDLo\release\build\anyhow-abe5dfc7615735a9\build_script_build-abe5dfc7615735a9.exe is infected with Gen:Variant.Razy.609843 and was moved to quarantine. It is recommended that you run a System Scan to make sure your system is clean.
13556be fixed a break in SSE4.1 assembly implementations. Notably, that break was only visible in CI when we happened to get machines that didn't support AVX-512. It would be nice to add a set of features like "no_avx512"
that we could toggle in CI, to make sure we get complete coverage of all the instruction sets the machine supports.
When trying to compile with cargo install b3sum
on CentOS 7, it fails with the following error message:
cargo:warning=cc: error: unrecognized command line option ‘-mavx512f’
cargo:warning=cc: error: unrecognized command line option ‘-mavx512vl’
exit code: 1
Full output:
error: failed to run custom build command for `blake3 v0.1.0`
Caused by:
process didn't exit successfully: `/tmp/cargo-installxeuQpU/release/build/blake3-46d96c605b25f4dd/build-script-build` (exit code: 1)
--- stdout
TARGET = Some("x86_64-unknown-linux-gnu")
OPT_LEVEL = Some("3")
HOST = Some("x86_64-unknown-linux-gnu")
CC_x86_64-unknown-linux-gnu = None
CC_x86_64_unknown_linux_gnu = None
HOST_CC = None
CC = None
CFLAGS_x86_64-unknown-linux-gnu = None
CFLAGS_x86_64_unknown_linux_gnu = None
HOST_CFLAGS = None
CFLAGS = None
CRATE_CC_NO_DEFAULTS = None
DEBUG = Some("false")
CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
running: "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-m64" "-Wall" "-Wextra" "-std=c11" "-mavx512f" "-mavx512vl" "-o" "/tmp/cargo-installxeuQpU/release/build/blake3-5036c51346845a8c/out/c/blake3_avx512.o" "-c" "c/blake3_avx512.c"
cargo:warning=cc: error: unrecognized command line option ‘-mavx512f’
cargo:warning=cc: error: unrecognized command line option ‘-mavx512vl’
exit code: 1
--- stderr
error occurred: Command "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-m64" "-Wall" "-Wextra" "-std=c11" "-mavx512f" "-mavx512vl" "-o" "/tmp/cargo-installxeuQpU/release/build/blake3-5036c51346845a8c/out/c/blake3_avx512.o" "-c" "c/blake3_avx512.c" with args "cc" did not execute successfully (status code exit code: 1).
warning: build failed, waiting for other jobs to finish...
error: failed to compile `b3sum v0.1.0`, intermediate artifacts can be found at `/tmp/cargo-installxeuQpU`
Caused by:
build failed
The gcc version is:
$ cc --version
cc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Compiling with cargo install --no-default-features --features rayon b3sum
works.
Hello,
I see these warnings when using the new code and compiling for arm/aarch64 (I'm porting blake3 to openssl locally).
../../dist/crypto/blake3/blake3.c:264:8: warning: no previous prototype for 'blake3_compress_subtree_wide' [-Wmissing-prototypes]
size_t blake3_compress_subtree_wide(const uint8_t *input, size_t input_len,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3_dispatch.c:292:8: warning: no previous prototype for 'blake3_simd_degree' [-Wmissing-prototypes]
size_t blake3_simd_degree() {
^~~~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3_dispatch.c:153:5: warning: 'get_cpu_features' defined but not used [-Wunused-function]
get_cpu_features() {
^~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3_portable.c:97:6: warning: no previous prototype for 'blake3_compress_in_place_portable' [-Wmissing-prototypes]
void blake3_compress_in_place_portable(uint32_t cv[8],
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3_portable.c:113:6: warning: no previous prototype for 'blake3_compress_xof_portable' [-Wmissing-prototypes]
void blake3_compress_xof_portable(const uint32_t cv[8],
^~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3_portable.c:158:6: warning: no previous prototype for 'blake3_hash_many_portable' [-Wmissing-prototypes]
void blake3_hash_many_portable(const uint8_t *const *inputs, size_t num_inputs,
^~~~~~~~~~~~~~~~~~~~~~~~~
In function 'compress_parents_parallel',
inlined from 'compress_subtree_to_parent_node' at ../../dist/crypto/blake3/blake3.c:349:9,
inlined from 'blake3_hasher_update.part.0' at ../../dist/crypto/blake3/blake3.c:522:7:
../../dist/crypto/blake3/blake3.c:238:5: warning: 'memcpy' forming offset [33, 64] is out of the bounds [0, 32] of object 'out_array' with type 'uint8_t[32]' {aka 'unsigned char[32]'} [-Warray-bounds]
memcpy(&out[parents_array_len * BLAKE3_OUT_LEN],
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
&child_chaining_values[2 * parents_array_len * BLAKE3_OUT_LEN],
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
BLAKE3_OUT_LEN);
~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3.c: In function 'blake3_hasher_update.part.0':
../../dist/crypto/blake3/blake3.c:346:11: note: 'out_array' declared here
uint8_t out_array[MAX_SIMD_DEGREE_OR_2 * BLAKE3_OUT_LEN / 2];
^~~~~~~~~
In function 'compress_subtree_to_parent_node',
inlined from 'blake3_hasher_update.part.0' at ../../dist/crypto/blake3/blake3.c:522:7:
../../dist/crypto/blake3/blake3.c:350:5: warning: 'memcpy' forming offset [33, 64] is out of the bounds [0, 32] of object 'out_array' with type 'uint8_t[32]' {aka 'unsigned char[32]'} [-Warray-bounds]
memcpy(cv_array, out_array, num_cvs * BLAKE3_OUT_LEN);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3.c: In function 'blake3_hasher_update.part.0':
../../dist/crypto/blake3/blake3.c:346:11: note: 'out_array' declared here
uint8_t out_array[MAX_SIMD_DEGREE_OR_2 * BLAKE3_OUT_LEN / 2];
^~~~~~~~~
In function 'compress_subtree_to_parent_node',
inlined from 'blake3_hasher_update.part.0' at ../../dist/crypto/blake3/blake3.c:522:7:
../../dist/crypto/blake3/blake3.c:350:5: warning: 'memcpy' forming offset [33, 64] is out of the bounds [0, 32] of object 'out_array' with type 'uint8_t[32]' {aka 'unsigned char[32]'} [-Warray-bounds]
memcpy(cv_array, out_array, num_cvs * BLAKE3_OUT_LEN);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3.c: In function 'blake3_hasher_update.part.0':
../../dist/crypto/blake3/blake3.c:346:11: note: 'out_array' declared here
uint8_t out_array[MAX_SIMD_DEGREE_OR_2 * BLAKE3_OUT_LEN / 2];
^~~~~~~~~
In function 'compress_parents_parallel',
inlined from 'compress_subtree_to_parent_node' at ../../dist/crypto/blake3/blake3.c:349:9,
inlined from 'blake3_hasher_update.part.0' at ../../dist/crypto/blake3/blake3.c:522:7:
../../dist/crypto/blake3/blake3.c:238:5: warning: 'memcpy' offset 64 is out of the bounds [0, 32] of object 'out_array' with type 'uint8_t[32]' {aka 'unsigned char[32]'} [-Warray-bounds]
memcpy(&out[parents_array_len * BLAKE3_OUT_LEN],
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
&child_chaining_values[2 * parents_array_len * BLAKE3_OUT_LEN],
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
BLAKE3_OUT_LEN);
~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3.c: In function 'blake3_hasher_update.part.0':
../../dist/crypto/blake3/blake3.c:346:11: note: 'out_array' declared here
uint8_t out_array[MAX_SIMD_DEGREE_OR_2 * BLAKE3_OUT_LEN / 2];
^~~~~~~~~
In function 'compress_subtree_to_parent_node',
inlined from 'blake3_hasher_update.part.0' at ../../dist/crypto/blake3/blake3.c:522:7:
../../dist/crypto/blake3/blake3.c:350:5: warning: 'memcpy' forming offset [33, 96] is out of the bounds [0, 32] of object 'out_array' with type 'uint8_t[32]' {aka 'unsigned char[32]'} [-Warray-bounds]
memcpy(cv_array, out_array, num_cvs * BLAKE3_OUT_LEN);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3.c: In function 'blake3_hasher_update.part.0':
../../dist/crypto/blake3/blake3.c:346:11: note: 'out_array' declared here
uint8_t out_array[MAX_SIMD_DEGREE_OR_2 * BLAKE3_OUT_LEN / 2];
^~~~~~~~~
../../dist/crypto/blake3/blake3_neon.c:231:6: warning: no previous prototype for 'blake3_hash4_neon' [-Wmissing-prototypes]
void blake3_hash4_neon(const uint8_t *const *inputs, size_t blocks,
^~~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3_neon.c:326:6: warning: no previous prototype for 'blake3_hash_many_neon' [-Wmissing-prototypes]
void blake3_hash_many_neon(const uint8_t *const *inputs, size_t num_inputs,
^~~~~~~~~~~~~~~~~~~~~
Right now there is only the top-level rayon
feature, so multi-threading is all-or-nothing. It would be better to allow each caller to choose, and perhaps to provide other executors besides Rayon. We could keep the default governed by the rayon
feature, or we could make the default always single-threaded, not sure yet.
This might allow us to support a flag in b3sum
that controls multithreading at runtime, for example.
@pcwalton mentioned one approach here: https://twitter.com/pcwalton/status/1215349922267987970
There is no reason to require UTF-8 here. Please s/str/[u8]/. Alternatively, provide another function to avoid the breaking change. Though, as I've mentioned in #11, I'm not against a breaking change for consistency with other crates, especially this early.
https://docs.rs/blake3/0.1.0/blake3/struct.Hasher.html#method.new_derive_key
pub fn new_derive_key(context: &str) -> Self
If you run cargo clippy, it seems to make some good recommendations. Things like:
1. You multiply by 0*4 in many lines. The answer is always 0; it might be faster to replace these.
2. You multiply items by 1; the 1 isn't needed.
3. You do some masking with 0 (the answer is always 0).
4. Passing some small items by value rather than by reference might be faster.
The license file says CC0 or Apache-2.0, but the Cargo.toml only lists Apache-2.0. You can use OR to make it an option: CC0 OR Apache-2.0.
When trying to build the assembly variant of the latest master with clang 7 it fails:
$ make test_asm CC=clang-7
clang-7 -O3 -Wall -Wextra -std=c11 -pedantic -DBLAKE3_TESTING -fsanitize=address,undefined blake3.c blake3_dispatch.c blake3_portable.c main.c blake3_sse41-x86_64-unix.S blake3_avx2-x86_64-unix.S blake3_avx512-x86_64-unix.S -o blake3
blake3_sse41-x86_64-unix.S:1380:9: error: invalid operand for instruction
jne 1b
^
blake3_sse41-x86_64-unix.S:1636:9: error: invalid operand for instruction
jmp 1b
...
Clang 8 works as expected.
$ for i in {1..1000};do echo $i > f$i;done
$ hyperfine 'md5sum f{1..1000}' 'b3sum f{1..1000}'
Benchmark #1: md5sum f{1..1000}
Time (mean ± σ): 11.2 ms ± 0.6 ms [User: 5.9 ms, System: 5.2 ms]
Range (min … max): 10.3 ms … 13.5 ms 196 runs
Benchmark #2: b3sum f{1..1000}
Time (mean ± σ): 13.7 ms ± 0.8 ms [User: 6.2 ms, System: 7.3 ms]
Range (min … max): 12.8 ms … 20.7 ms 168 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Summary
'md5sum f{1..1000}' ran
1.22 ± 0.09 times faster than 'b3sum f{1..1000}'
I guess the reason is that read is more effective than mmap on small files.
Strace output (md5sum):
openat(AT_FDCWD, "f998", O_RDONLY) = 3
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=4, ...}) = 0
read(3, "998\n", 32768) = 4
read(3, "", 28672) = 0
lseek(3, 0, SEEK_CUR) = 4
close(3) = 0
Strace output (b3sum):
openat(AT_FDCWD, "f998", O_RDONLY|O_CLOEXEC) = 3
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_ALL, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=4, ...}) = 0
mmap(NULL, 4, PROT_READ, MAP_SHARED, 3, 0) = 0x7fd03e7d2000
munmap(0x7fd03e7d2000, 4) = 0
close(3) = 0
b3sum 0.1.1
md5sum (GNU coreutils) 8.31
Hello.
In main.c, memset is called on memory before the block returned by malloc is checked for NULL. One solution is to place the memset after that check.
The pull request that follows can be applied to do this.
Thank you.
P.S.
Also, out_len could be checked before calling malloc, but that's a different story.
We've added these for x86, but not for NEON. They might help with very large inputs.
gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
% make
gcc -O3 -Wall -Wextra -std=c11 -pedantic -c blake3_sse41.c -o blake3_sse41.o -msse4.1 -D BLAKE3_USE_SSE41
gcc -O3 -Wall -Wextra -std=c11 -pedantic -c blake3_avx2.c -o blake3_avx2.o -mavx2 -D BLAKE3_USE_SSE41 -D BLAKE3_USE_AVX2
gcc -O3 -Wall -Wextra -std=c11 -pedantic -c blake3_avx512.c -o blake3_avx512.o -mavx512f -mavx512vl -D BLAKE3_USE_SSE41 -D BLAKE3_USE_AVX2 -D BLAKE3_USE_AVX512
gcc -O3 -Wall -Wextra -std=c11 -pedantic blake3.c blake3_dispatch.c blake3_portable.c main.c blake3_sse41.o blake3_avx2.o blake3_avx512.o -o blake3
/tmp/ccJZjf2t.o: In function `main':
main.c:(.text.startup+0x1f9): undefined reference to `get_cpu_features'
main.c:(.text.startup+0x245): undefined reference to `g_cpu_features'
collect2: error: ld returned 1 exit status
Makefile:6: recipe for target 'all' failed
make: *** [all] Error 1
Similar story with clang version 8.0.0-3~ubuntu18.04.2 (tags/RELEASE_800/final)
gcc -shared -O3 -o libblake3.so blake3.c blake3_dispatch.c blake3_portable.c \
> blake3_sse41_x86-64_unix.S blake3_avx2_x86-64_unix.S blake3_avx512_x86-64_unix.S
/usr/bin/ld: /tmp/ccoz9HaM.o: relocation R_X86_64_PC32 against symbol `blake3_simd_degree' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Bad value
collect2: error: ld returned 1 exit status
I'm getting the error above when trying to compile the shared library. Including -fPIC solves it, but I'm not sure whether that results in the intended behavior. A small clarification on this would be good.
Currently I use this fake OID value (the next available one after the BLAKE2 family), but we probably need an official one to be defined:
1 3 6 1 4 1 1722 12 2 3 8 : BLAKE3 : blake3
Reference: http://www.oid-info.com/cgi-bin/display?oid=1+3+6+1+4+1+1722+12+2&action=display
I am a Node.js developer and would like to use this in my Node.js project. Is there any way to do that?
I'm no C coder, but perhaps we can achieve this together?
https://www.percona.com/blog/2019/04/05/writing-postgresql-extensions-is-fun-c-language/
I built a bunch of constant-time fuzzing targets for blake3 using sidefuzz
The fuzzing targets can be found here: http://github.com/phayes/sidefuzz-targets
Results are as follows:
So everything looks good, except that I couldn't properly fuzz keyed_hash in relation to the context, since it takes a string, and sidefuzz produces &[u8]s as its fuzzing inputs.
So to fuzz keyed_hash in relation to context, we would either need to resolve #13, or I would need to improve sidefuzz to provide a string input.
This probably makes sense for most callers. These both build faster and perform better. Using the fastest implementations by default will also make people less confused when they do their own benchmarks. Instead of the current "c"
feature, we can expose a "pure"
feature, for applications that want to remove the dependency on the C/assembler toolchain. In both cases, top level applications should generally be the ones to decide which implementations get included, and other libraries should generally avoid making that decision.
Would it be possible to export the fast multithreaded versions from Rust to C/C++, to compare against the single-threaded C version? I'm pretty sure many C/C++ projects would want to use the faster Rust versions.
I'd also want to compare the C versions against the rust versions in smhasher.
When trying to build the C implementation of BLAKE3 on Ubuntu 16.04 LTS (used by Travis) the build fails when compiling blake3_avx512.c:
gcc -O3 -Wall -Wextra -std=c11 -pedantic -c blake3_avx512.c -o blake3_avx512.o -mavx512f -mavx512vl
/tmp/ccaq5stm.s: Assembler messages:
/tmp/ccaq5stm.s:3763: Error: invalid register operand for `vmovdqu'
/tmp/ccaq5stm.s:3765: Error: invalid register operand for `vmovdqu'
Makefile:40: recipe for target 'blake3_avx512.o' failed
make: *** [blake3_avx512.o] Error 1
Looking at the generated assembler code, GCC generates vmovdqu %ymm17, (%rax), which is an invalid instruction as far as I can tell (VEX-encoded instead of EVEX, and thus only able to access ymm0-ymm15). So it looks to be a compiler bug. But if I add -mavx512bw, GCC instead emits vmovdqu8 %ymm17, (%rax), which compiles.
I assume that get_cpu_features() would also need to be updated if -mavx512bw is to be used, but other than that, is there any downside to using it?
This error is seen in the CI of ccache, see ccache/ccache#519.
I know that BLAKE3 hashes for 7 rounds, and I know that is believed to be secure enough, but it sounds like only just barely so. I would like to be extra careful, as I'm working on an application that needs paranoid levels of security.
It would be really nice if there were a simple way to hash for a variable number of rounds, so I could simply add a few more. Is that possible? If not, you might want to consider adding it. It fights off everyone who says BLAKE3 isn't as secure as BLAKE2, and it seems easy enough to add.
The other option is to simply double or triple hash. Is there any reason that is not as secure as 14 or 21 rounds?
When enabling this option, I get a lot of hits in the various C files:
../../dist/crypto/blake3/blake3_avx512.c:292:6: warning: no previous prototype for 'blake3_compress_xof_avx512' [-Wmissing-prototypes]
void blake3_compress_xof_avx512(const uint32_t cv[8],
^~~~~~~~~~~~~~~~~~~~~~~~~~
Would it be possible to make sure all functions are either local to a module and static or defined in a header?
b3sum runs incredibly slow.
PC:
Windows 10 Pro
i7-6700
32GB RAM
b3sum built using: cargo install b3sum (v0.1.2)
Test File: 2.88GB
Results:
md5sum: 10 seconds
b2sum: 5 seconds
b3sum: 4 Minutes 37 Seconds
Will this be useful in scenarios where hash calculations are very frequent?
When using blake2b, I get around 13 s when hashing a 941 MB file using the blake2_bin and rash crates, but about 30 s when hashing the same file using the b3sum crate. The environment is Rust 1.40 on Ubuntu 18.04 on a Core i3.
Hi, I have this compilation error for C version:
gcc -O3 -Wall -Wextra -std=c11 -pedantic -c blake3_sse41.c -o blake3_sse41.o -msse4.1
gcc -O3 -Wall -Wextra -std=c11 -pedantic -c blake3_avx2.c -o blake3_avx2.o -mavx2
gcc -O3 -Wall -Wextra -std=c11 -pedantic -c blake3_avx512.c -o blake3_avx512.o -mavx512f -mavx512vl
gcc -O3 -Wall -Wextra -std=c11 -pedantic blake3.c blake3_dispatch.c blake3_portable.c main.c blake3_sse41.o blake3_avx2.o blake3_avx512.o -o blake3
/usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: /tmp/ccCfPl6z.o: warning: relocation against `g_cpu_features' in read-only section `.text.startup'
/usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: /tmp/ccCfPl6z.o: in function `main': main.c:(.text.startup+0x22b): undefined reference to `get_cpu_features'
/usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: main.c:(.text.startup+0x275): undefined reference to `g_cpu_features'
/usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: warning: creating a DT_TEXTREL in object
collect2: error: ld returned 1 exit status
make: *** [Makefile:35: all] Error 1
Please could you provide the option to output the hash as the raw bytes?
@sneves has come up with a huge optimization for longer inputs, using memory prefetches. This issue is a note-to-self until I have time to get a proper commit together:
diff --git a/src/avx2.rs b/src/avx2.rs
index 841b6e1..a1cb3d0 100644
--- a/src/avx2.rs
+++ b/src/avx2.rs
@@ -261,6 +261,16 @@ unsafe fn transpose_msg_vecs(inputs: &[*const u8; DEGREE], block_offset: usize)
loadu(inputs[6].add(block_offset + 1 * 4 * DEGREE)),
loadu(inputs[7].add(block_offset + 1 * 4 * DEGREE)),
];
+
+ _mm_prefetch(inputs[0].add(block_offset + 0 * 4 * DEGREE + 256) as * const i8, _MM_HINT_T0);
+ _mm_prefetch(inputs[1].add(block_offset + 0 * 4 * DEGREE + 256) as * const i8, _MM_HINT_T0);
+ _mm_prefetch(inputs[2].add(block_offset + 0 * 4 * DEGREE + 256) as * const i8, _MM_HINT_T0);
+ _mm_prefetch(inputs[3].add(block_offset + 0 * 4 * DEGREE + 256) as * const i8, _MM_HINT_T0);
+ _mm_prefetch(inputs[4].add(block_offset + 0 * 4 * DEGREE + 256) as * const i8, _MM_HINT_T0);
+ _mm_prefetch(inputs[5].add(block_offset + 0 * 4 * DEGREE + 256) as * const i8, _MM_HINT_T0);
+ _mm_prefetch(inputs[6].add(block_offset + 0 * 4 * DEGREE + 256) as * const i8, _MM_HINT_T0);
+ _mm_prefetch(inputs[7].add(block_offset + 0 * 4 * DEGREE + 256) as * const i8, _MM_HINT_T0);
+
let squares = mut_array_refs!(&mut vecs, DEGREE, DEGREE);
transpose_vecs(squares.0);
transpose_vecs(squares.1);
Adding -c, --check would make b3sum a drop-in replacement for b2sum in some scripts.
Hello.
There is a series of issues in the source files that prevent the project from compiling with the Microsoft C compiler.
One possible way to resolve all of them can be found in this pull request.
Thank you.
P.S.
Please note that we're talking about compile issues here. Linking in the MSVC environment has the same issue as with gcc, which has already been discussed here.
There have been quite a few performance improvements since we originally published the paper and the big red bar chart. We should redo them.
e.g. rustc 1.44.0-nightly (f509b26a7 2020-03-18)
error[E0658]: use of unstable library feature 'stdsimd'
--> src/platform.rs:304:12
|
304 | if is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512vl") {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: see issue #27731 <https://github.com/rust-lang/rust/issues/27731> for more information
= help: add `#![feature(stdsimd)]` to the crate attributes to enable
= note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
error: aborting due to previous error
For more information about this error, try `rustc --explain E0658`.
error: could not compile `blake3`.
Problem:
b3sum --no-mmap uses a tiny read buffer (8 KB) and issues read(2) and futex(2) syscalls aggressively. I find that hashing at only 110 MB/s maxes out all 24 cores, doing about 10k syscalls per second (tail -f strace.log | pv -l >/dev/null).
top
CPU% MEM% TIME+ Command
2025 0.0 22:58.81 b3sum --no-mmap idsab.tar.zst
strace.log
futex(0x563438b1804c, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x563438b17ff0, FUTEX_WAKE_PRIVATE, 1) = 0
read(3, "\3\352\t\225\213r\371\3421\231\361\303\306\237\252\362\274\212>\33\t\f&\240u\31\252\331\310\242\265T"..., 8192) = 8192
futex(0x563438b14028, FUTEX_WAKE_PRIVATE, 2147483647) = 2
futex(0x563438b13fd0, FUTEX_WAKE_PRIVATE, 1) = 1
read(3, "9\22\373\5\17\357\35\0364\35\371{\324\220\275J~B0$\273\332\264\20\v\352\304\23\301l\316\320"..., 8192) = 8192
futex(0x563438b18048, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x563438b17ff0, FUTEX_WAKE_PRIVATE, 1) = 0
read(3, "\225/\247\216\367Q\375,E\357\305\367\323\321\2\223f\266\n\222\257\305\260qLJ\f\247\370\202\6\306"..., 8192) = 8192
futex(0x563438b1402c, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x563438b13fd0, FUTEX_WAKE_PRIVATE, 1) = 1
read(3, "7\365\30n\370K\f\265\200\363r7\35E\321\347_\226\351-\331\24\232\200\352\233\316W\20\2121\264"..., 8192) = 8192
futex(0x563438b14028, FUTEX_WAKE_PRIVATE, 2147483647) = 2
futex(0x563438b13fd0, FUTEX_WAKE_PRIVATE, 1) = 1
read(3, "\234\235\250\352\315\343*\t\215\205o(\231\230\325\f\222\257o\302A3\315z\240}\364'\327\33Wj"..., 8192) = 8192
... and so on
Environment:
cargo install b3sum (as of 2020-02-03)
cargo 1.41.0 (626f0f40e 2019-12-03)
5.4.0-9-generic
The spec and docs for derive_key are very clear that the context string should be hardcoded, without runtime input of any kind. The fundamental purpose of derive_key is domain separation, and hardcoded context strings make separation more reliable: as long as your context string is different from mine, we're guaranteed not to interact in any way.
However, many applications do want to incorporate dynamic data into their derived keys somehow. For a silly example, maybe I have a disk encryption key, and I want to derive a separate subkey for each drive on my machine. I need to use each drive label as associated data somehow. To make this more interesting, let's say that I want a separate subkey for each partition too, so there are two distinct pieces of associated data. How should I do this?
Well, here's how I would do it:
Start by using derive_key with a hardcoded context string, which describes what we're about to do. For example, "Jack's disk encrypt-o-matic 2020-01-24 11:47:57 drive + partition subkeys". The output is a 256-bit intermediate subkey.
Mix in the disk label. The most natural way to do this is with keyed_hash. The key is the output of step 1, and the input is the disk label. The output is another 256-bit intermediate subkey.
Mix in the partition label. Again we'll use keyed_hash, with the key being the output of step 2. The output here is our final derived subkey, presumably also 256 bits.
Now, what I'd like to do is generalize this sort of thing. Some observations:
I think using keyed_hash is the most elegant way to do steps 2 and 3, but I want to note that it doesn't super duper matter what we do. We could concatenate the key and the associated data and just use hash. Because of the domain separation provided in step 1, we don't have to worry about whether our approach is going to interact badly with some other approach that someone else takes. Our keys will never interact.
Concatenating things is a little tricky, because you have to worry about ambiguities at the concatenation boundary. This doesn't really matter if one of the things you're concatenating is a fixed length, but it's something to be aware of. This is one of the reasons I prefer keyed_hash: there's no possible ambiguity between what's the key and what's the input.
One real world hazard that could come up here is if the application considers some associated data "optional". For example, maybe some disk has no partitions, so we don't have a partition label to use in step 3. What we don't want to do is skip steps entirely, because that could introduce another type of ambiguity: which step did we skip? So there should be some kind of ruling about optional AD parameters. Maybe they can be empty, but not missing? We could also try to use a different context string at every step, but that seems awfully heavyweight.
So anyway, I'm thinking about generalizing the scheme above into some kind of broad advice for generating subkeys with associated data. I'll use this issue for keeping track of my thoughts. Hopefully other folks can chime in with theirs.
how can i move on with my life
error[E0658]: use of unstable library feature 'stdsimd'
--> /home/oussama/.cargo/registry/src/github.com-1ecc6299db9ec823/blake3-0.1.4/src/platform.rs:266:12
|
266 | if is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512vl") {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: for more information, see https://github.com/rust-lang/rust/issues/27731
= help: add `#![feature(stdsimd)]` to the crate attributes to enable
= note: this error originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)
error: aborting due to previous error
For more information about this error, try `rustc --explain E0658`.
error: could not compile `blake3`.
The inline assembly used by 32-bit x86 code needs to save and restore the EBX register, because position-independent code reserves EBX, unfortunately (https://ewontfix.com/18/). Since the CPUID instruction also uses EBX to return information, the inline assembly has to copy that data into another register (ESI in this case) before restoring the original EBX.
However, both the cpuid() and cpuidex() functions contain a typo: they return whatever is in the EBP register (essentially a garbage value) instead of EBX. This breaks run-time detection of AVX2 and newer CPU features.
The fix is very simple:
diff --git a/c/blake3_dispatch.c b/c/blake3_dispatch.c
index 7daf43e..b97b7c0 100644
--- a/c/blake3_dispatch.c
+++ b/c/blake3_dispatch.c
@@ -97,7 +97,7 @@ static void cpuid(uint32_t out[4], uint32_t id) {
__cpuid((int *)out, id);
#else
#if defined(__i386__) || defined(_M_IX86)
- __asm__ __volatile__("pushl %%ebx\ncpuid\nmovl %%ebp, %%esi\npopl %%ebx"
+ __asm__ __volatile__("pushl %%ebx\ncpuid\nmovl %%ebx, %%esi\npopl %%ebx"
: "=a"(out[0]), "=S"(out[1]), "=c"(out[2]), "=d"(out[3])
: "a"(id));
#else
@@ -113,7 +113,7 @@ static void cpuidex(uint32_t out[4], uint32_t id, uint32_t sid) {
__cpuidex((int *)out, id, sid);
#else
#if defined(__i386__) || defined(_M_IX86)
- __asm__ __volatile__("pushl %%ebx\ncpuid\nmovl %%ebp, %%esi\npopl %%ebx"
+ __asm__ __volatile__("pushl %%ebx\ncpuid\nmovl %%ebx, %%esi\npopl %%ebx"
: "=a"(out[0]), "=S"(out[1]), "=c"(out[2]), "=d"(out[3])
: "a"(id), "c"(sid));
#else
What are the speed and security differences between BLAKE3 and the Keccak team's KangarooTwelve (K12) and MarsupilamiFourteen (M14)?
Main page: https://keccak.team/kangarootwelve.html
Specification: https://keccak.team/files/KangarooTwelve.pdf
Base implementation: https://github.com/XKCP/XKCP/tree/master/Standalone/KangarooTwelve/Python
Rust Implementation: https://github.com/XKCP/XKCP/tree/master/Standalone/KangarooTwelve/Rust
RFC standard: https://datatracker.ietf.org/doc/draft-viguier-kangarootwelve/
As a side joke, I made this.
There once were two empires in Hashlandia:
The Keccak Empire led by Bertoni, Daemen, Peeters and VanAssche,
And the BLAKE Empire led by Neves and Aumasson.
A long time ago, there was the war of SHA3, and Keccak was victorious, but it would not last long.
In retaliation, King Neves decided to fire back with the invasion of BLAKE II,
And as King Bertoni sees his empire falling apart because of it, Kangaroo XII has been staged.
After King Neves noticed their schemes, BLAKE III is coming.
This might be a good idea for educational purposes.
The current implementations of loadu_128 and storeu_128 use memcpy to read uint32x4_t vectors of e.g. hash input, which implicitly assumes the machine byte order is little-endian. My understanding is that that's true of most but not all ARM devices. We should figure out how to either make the code endianness-independent, or how to produce a compile-time error on big-endian ARM.
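As a sketch of the endianness-independent option: instead of copying raw bytes into a vector, each 32-bit word can be parsed explicitly as little-endian, which yields the same result on either byte order. This is plain Rust for illustration (load_words_le is a hypothetical helper); the real code would need the equivalent with NEON intrinsics.

```rust
// Parse a 16-byte block into four u32 words, explicitly little-endian.
// u32::from_le_bytes gives the same result on big- and little-endian
// hosts, unlike a raw memcpy into a vector register.
fn load_words_le(block: &[u8; 16]) -> [u32; 4] {
    let mut words = [0u32; 4];
    for (i, chunk) in block.chunks_exact(4).enumerate() {
        words[i] = u32::from_le_bytes(chunk.try_into().unwrap());
    }
    words
}
```

On little-endian machines a compiler will typically reduce this to the same plain load, so the explicit form costs nothing where the memcpy version already works.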