blake3-team / blake3 Goto Github PK

View Code? Open in Web Editor NEW

4.7K 4.7K 312.0 1.75 MB

the official Rust and C implementations of the BLAKE3 cryptographic hash function

License: Apache License 2.0

Rust 27.52% C 12.09% Shell 0.15% Python 0.25% Assembly 59.51% CMake 0.48%

blake3's People

Contributors

Stargazers

Watchers

Forkers

georgio guidovranken awnumar alexw91 backwardn eticzon studyandthink sunguo slanterns-fork rurban bilalesi rdpli rbs-pli cesarb baublesbythebaynet d3vil7 hhy5277 lasgonen0 abhisheksinha1506 ellerbrock unsecureio fake-cheater isgasho saonam thomascherickal jason-2020 awesome-archive dev0p0 phayes anniyanvr shivlondon mu-l bhargavmahajan rsrrrrrr reenigneca uishi erijo tis-cv thevice devenlu binaryflesh divinity76 dowlandaiello 5l1v3r1 spoobnix mihirluthra omac777 404-not-find kamkudla cosito-bonito letoram inanna-malick codesmoker acidburn0zzz k0001 luro02 deniod xnox sanjay-a-menon jakub-zwolakowski clayne hywan c0axial embarkstudios marciopocebon daryloid sys-inventors alexxroche mkrupcale nvzqz ameg-yag holajiawei boneyard93501 rrozestw jonjonw age-rs bhgomes petemud kovacs-levent alexnakouzi catap assoorga1 paulgrandperrin swkim101 6a-62 joseffrank spkorhonen gerhobbelt mediakraken-dep jorickert goodengineer stanpall xtblock hartl3y94 megastefan22 daviessm vucoda mitchelltesla spartanjohn-117 eaconde

blake3's Issues

nightly and beta tests failing with --features="c" due to upstream bug

rust-lang/rust#68905

largest output for blake3_hasher_finalize in c

I read the paper that blake3 can do up to 4gb of output. In c, is this achieved through batching calls of blake3_hasher_finalize? if so what is the largest size that function would handle?

A bit unrelated. does Rust uses some kind of streaming for doing this? I'm not familiar with rust, so a simple high level explanation on how the code does it would be great.

Thanks!

impl digest::* for Blake3?

It'd be really nice for BLAKE3 to follow RustCrypto's digest crate to allow for BLAKE3 to substitute other functions in generic code.

~~Hasher is a poor name here, Blake3 would be cleaner. I expect to see many instances of use blake3::Hasher as Blake3. A type alias suffices to avoid a breaking change type Hasher = Blake3~~

These traits can be implemented without a breaking change. However, it'd be cleaner to only implement digest's traits and any other breaking changes could be batched. The release is early and much more usage will come with the digest traits implemented.

Great work BLAKE3-team!

Performance bug: runs 3x faster if a call to blake3_hasher_update is broken into two calls

I've extracted a minimal test case reproducing a very strange performance cliff in my application - strange because it can be worked around in a way that suggests that the buffering algorithm in blake3_hasher_update isn't working properly.

I'm using the C port in my app, I haven't had a chance to try and get the Rust version going and see if it has the same issue, but from reading the two the buffering algorithm looks the same to me.

To produce the problem, we need to start out by hashing something small, eg. a size_t integer (in my real application this would be the length of the following string). Then follow with a big buffer.

It's not surprising that this would be measurably slower than just hashing the big buffer without the integer first, since it requires retaining a partial block of data. What is strange is that the entire operation on the big buffer is much slower, not just the hashing of the first full block of data.

And even stranger, you can restore back essentially full performance by breaking the blake3_hasher_update call for the big block of data into two, on the boundary of 4096 bytes minus however many bytes you've already hashed. (Yes, 4096 - not BLAKE3_CHUNK_LEN, 1024, like any sensible bug would have chosen.)

In fact making more calls to blake3_hasher_update makes the whole thing 3x faster even though the same data is processed in the end. Obviously, if this is really expected, we should be doing that inside blake3_hasher_update.

Code to reproduce:

#include "blake3.h"
#include <stdlib.h>
#include <strings.h>
#include <stdio.h>
#include <time.h>

int main() {
  blake3_hasher hasher;
  uint8_t output[BLAKE3_OUT_LEN];
  size_t reps = 1000; // arbitrary

  size_t small_value = 0;
  size_t buf_size = 1024*1024; // 1mb; fairly arbitrary, 256kb gives the same result, as does 1024*1024 - sizeof(size_t)
  unsigned char *buf = malloc(buf_size);
  bzero(buf, buf_size);

  // variant 1: hash a size_t, then hash the remainder of the 1mb buffer
  clock_t start1 = clock();
  for (int rep = 0; rep < reps; rep++) {
    blake3_hasher_init(&hasher);
    blake3_hasher_update(&hasher, &small_value, sizeof(small_value));
    blake3_hasher_update(&hasher, buf, buf_size);
    blake3_hasher_finalize(&hasher, output, BLAKE3_OUT_LEN);
  }
  clock_t end1 = clock();
  printf("variant 1 ran in %f, hash ", ((float)(end1 - start1))/CLOCKS_PER_SEC);
  for (size_t i = 0; i < BLAKE3_OUT_LEN; i++) {
    printf("%02x", output[i]);
  }
  printf("\n");

  // variant 2: hash a size_t, then hash 4k minus the size of a size_t, then hash the remainder of the 1mb buffer
  clock_t start2 = clock();
  for (int rep = 0; rep < reps; rep++) {
    blake3_hasher_init(&hasher);
    blake3_hasher_update(&hasher, &small_value, sizeof(small_value));
    size_t magic_amount_to_update = 4096 - sizeof(small_value); // <-- 4096 is the magic number; 1024 (BLAKE3_CHUNK_LEN) does not produce a significant speedup; 2048 produces a partial speedup
    blake3_hasher_update(&hasher, buf, magic_amount_to_update); // <-- new line
    blake3_hasher_update(&hasher, buf + magic_amount_to_update, buf_size - magic_amount_to_update); // <-- remainder of the full buffer
    blake3_hasher_finalize(&hasher, output, BLAKE3_OUT_LEN);
  }
  clock_t end2 = clock();
  printf("variant 2 ran in %f, hash ", ((float)(end2 - start2))/CLOCKS_PER_SEC);
  for (size_t i = 0; i < BLAKE3_OUT_LEN; i++) {
    printf("%02x", output[i]);
  }
  printf("\n");

  return 0;
}

On my Macbook Pro I get:

variant 1 ran in 0.946282, hash 80b378d1f0bc7ee2e9c96dc30108187a03779c8dfd951047bf6c99ce88be2ff6
variant 2 ran in 0.302671, hash 80b378d1f0bc7ee2e9c96dc30108187a03779c8dfd951047bf6c99ce88be2ff6

Typo in test vector comment

The comment in test_vectors.json says:

For derive_key, the test input is used as the input key, and the context string is 'BLAKE3 2019-12-27 16:29:52 example context'.

BLAKE3/test_vectors/test_vectors.json

Line 2 in 80260dc

    
           "_comment": "Each test is an input length and three outputs, one for each of the hash, keyed_hash, and derive_key modes. The input in each case is filled with a 251-byte-long repeating pattern: 0, 1, 2, ..., 249, 250, 0, 1, ... The key used with keyed_hash is the 32-byte ASCII string given in the 'key' field below. For derive_key, the test input is used as the input key, and the context string is 'BLAKE3 2019-12-27 16:29:52 example context'. (As good practice for following the security requirements of derive_key, test runners should make that context string a hardcoded constant, and we do not provided it in machine-readable form.) Outputs are encoded as hexadecimal. Each case is an extended output, and implementations should also check that the first 32 bytes match their default-length output.",

However, the test vectors were actually computed with the context string BLAKE3 2019-12-27 16:29:52 test vectors context. Citations:

BLAKE3/test_vectors/src/lib.rs

Line 32 in 2fac744

    
           pub const TEST_CONTEXT: &str = "BLAKE3 2019-12-27 16:29:52 test vectors context";

BLAKE3/test_vectors/src/lib.rs

Lines 93 to 97 in 2fac744

    
           let mut derive_key_out = [0; OUTPUT_LEN]; 
        
           blake3::Hasher::new_derive_key(TEST_CONTEXT) 
        
               .update(&input) 
        
               .finalize_xof() 
        
               .fill(&mut derive_key_out);

By following the source code rather than the comment, I am able to reproduce the test vectors with my independent implementation of BLAKE3.

enable c_neon by default on aarch64

NEON is always supported on AArch64, so there's no need to make it conditional on the c_neon flag. I don't think build.rs can enable features, so we might have to do this with target_arch = "aarch64" OR feature = "c_neon" everywhere, which is doable but a bit unfortunate. (Not so different from how we have to say x86 OR x86_64 all over the place.)

Please clarify if sha1 and sha256 benchmarks used sha-ni extensions

I'd love to get some more information on the methodology used to produce the benchmark data in the README. In particular, which sha1 and sha256 implementations did you benchmark against, and what hardware extensions did those implementations support?

Thank you for the incredible work on BLAKE3; I'm looking forward to using it, especially the parallel hashing support.

Build Failing, Bitdefender Detects Gen:Variant.Razy.609843

Bitdefender is removing b3sum and preventing a successful build using cargo install b3sum.

Error:
The file \cargo-installJYtDLo\release\build\anyhow-abe5dfc7615735a9\build_script_build-abe5dfc7615735a9.exe is infected with Gen:Variant.Razy.609843 and was moved to quarantine. It is recommended that you run a System Scan to make sure your system is clean.

improve instruction set test coverage

13556be fixed a break in SSE4.1 assembly implementations. Notably, that break was only visible in CI when we happened to get machines that didn't support AVX-512. It would be nice to add a set of features like "no_avx512" that we could toggle in CI, to make sure we get complete coverage of all the instruction sets the machine supports.

b3sum fails to compile on CentOS 7 (unrecognized command line option)

When trying to compile with cargo install b3sum on CentOS 7, it fails with the following error message:

cargo:warning=cc: error: unrecognized command line option ‘-mavx512f’
cargo:warning=cc: error: unrecognized command line option ‘-mavx512vl’
exit code: 1

Full output:

error: failed to run custom build command for `blake3 v0.1.0`

Caused by:
  process didn't exit successfully: `/tmp/cargo-installxeuQpU/release/build/blake3-46d96c605b25f4dd/build-script-build` (exit code: 1)
--- stdout
TARGET = Some("x86_64-unknown-linux-gnu")
OPT_LEVEL = Some("3")
HOST = Some("x86_64-unknown-linux-gnu")
CC_x86_64-unknown-linux-gnu = None
CC_x86_64_unknown_linux_gnu = None
HOST_CC = None
CC = None
CFLAGS_x86_64-unknown-linux-gnu = None
CFLAGS_x86_64_unknown_linux_gnu = None
HOST_CFLAGS = None
CFLAGS = None
CRATE_CC_NO_DEFAULTS = None
DEBUG = Some("false")
CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
running: "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-m64" "-Wall" "-Wextra" "-std=c11" "-mavx512f" "-mavx512vl" "-o" "/tmp/cargo-installxeuQpU/release/build/blake3-5036c51346845a8c/out/c/blake3_avx512.o" "-c" "c/blake3_avx512.c"
cargo:warning=cc: error: unrecognized command line option ‘-mavx512f’
cargo:warning=cc: error: unrecognized command line option ‘-mavx512vl’
exit code: 1

--- stderr


error occurred: Command "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-m64" "-Wall" "-Wextra" "-std=c11" "-mavx512f" "-mavx512vl" "-o" "/tmp/cargo-installxeuQpU/release/build/blake3-5036c51346845a8c/out/c/blake3_avx512.o" "-c" "c/blake3_avx512.c" with args "cc" did not execute successfully (status code exit code: 1).



warning: build failed, waiting for other jobs to finish...
error: failed to compile `b3sum v0.1.0`, intermediate artifacts can be found at `/tmp/cargo-installxeuQpU`

Caused by:
  build failed

The gcc version is:

$ cc --version
cc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Compiling with cargo install --no-default-features --features rayon b3sum works.

blake3 compiler warnings

Hello,

I see these warnings when using the new code and compiling for arm/aarch64 (I'm porting blake3 to openssl locally).

../../dist/crypto/blake3/blake3.c:264:8: warning: no previous prototype for 'blake3_compress_subtree_wide' [-Wmissing-prototypes]
size_t blake3_compress_subtree_wide(const uint8_t *input, size_t input_len,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3_dispatch.c:292:8: warning: no previous prototype for 'blake3_simd_degree' [-Wmissing-prototypes]
size_t blake3_simd_degree() {
^~~~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3_dispatch.c:153:5: warning: 'get_cpu_features' defined but not used [-Wunused-function]
get_cpu_features() {
^~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3_portable.c:97:6: warning: no previous prototype for 'blake3_compress_in_place_portable' [-Wmissing-prototypes]
void blake3_compress_in_place_portable(uint32_t cv[8],
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3_portable.c:113:6: warning: no previous prototype for 'blake3_compress_xof_portable' [-Wmissing-prototypes]
void blake3_compress_xof_portable(const uint32_t cv[8],
^~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3_portable.c:158:6: warning: no previous prototype for 'blake3_hash_many_portable' [-Wmissing-prototypes]
void blake3_hash_many_portable(const uint8_t *const *inputs, size_t num_inputs,
^~~~~~~~~~~~~~~~~~~~~~~~~
In function 'compress_parents_parallel',
inlined from 'compress_subtree_to_parent_node' at ../../dist/crypto/blake3/blake3.c:349:9,
inlined from 'blake3_hasher_update.part.0' at ../../dist/crypto/blake3/blake3.c:522:7:
../../dist/crypto/blake3/blake3.c:238:5: warning: 'memcpy' forming offset [33, 64] is out of the bounds [0, 32] of object 'out_array' with type 'uint8_t[32]' {aka 'unsigned char[32]'} [-Warray-bounds]
memcpy(&out[parents_array_len * BLAKE3_OUT_LEN],
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
&child_chaining_values[2 * parents_array_len * BLAKE3_OUT_LEN],
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
BLAKE3_OUT_LEN);
~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3.c: In function 'blake3_hasher_update.part.0':
../../dist/crypto/blake3/blake3.c:346:11: note: 'out_array' declared here
uint8_t out_array[MAX_SIMD_DEGREE_OR_2 * BLAKE3_OUT_LEN / 2];
^~~~~~~~~
In function 'compress_subtree_to_parent_node',
inlined from 'blake3_hasher_update.part.0' at ../../dist/crypto/blake3/blake3.c:522:7:
../../dist/crypto/blake3/blake3.c:350:5: warning: 'memcpy' forming offset [33, 64] is out of the bounds [0, 32] of object 'out_array' with type 'uint8_t[32]' {aka 'unsigned char[32]'} [-Warray-bounds]
memcpy(cv_array, out_array, num_cvs * BLAKE3_OUT_LEN);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3.c: In function 'blake3_hasher_update.part.0':
../../dist/crypto/blake3/blake3.c:346:11: note: 'out_array' declared here
uint8_t out_array[MAX_SIMD_DEGREE_OR_2 * BLAKE3_OUT_LEN / 2];
^~~~~~~~~
In function 'compress_subtree_to_parent_node',
inlined from 'blake3_hasher_update.part.0' at ../../dist/crypto/blake3/blake3.c:522:7:
../../dist/crypto/blake3/blake3.c:350:5: warning: 'memcpy' forming offset [33, 64] is out of the bounds [0, 32] of object 'out_array' with type 'uint8_t[32]' {aka 'unsigned char[32]'} [-Warray-bounds]
memcpy(cv_array, out_array, num_cvs * BLAKE3_OUT_LEN);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3.c: In function 'blake3_hasher_update.part.0':
../../dist/crypto/blake3/blake3.c:346:11: note: 'out_array' declared here
uint8_t out_array[MAX_SIMD_DEGREE_OR_2 * BLAKE3_OUT_LEN / 2];
^~~~~~~~~
In function 'compress_parents_parallel',
inlined from 'compress_subtree_to_parent_node' at ../../dist/crypto/blake3/blake3.c:349:9,
inlined from 'blake3_hasher_update.part.0' at ../../dist/crypto/blake3/blake3.c:522:7:
../../dist/crypto/blake3/blake3.c:238:5: warning: 'memcpy' offset 64 is out of the bounds [0, 32] of object 'out_array' with type 'uint8_t[32]' {aka 'unsigned char[32]'} [-Warray-bounds]
memcpy(&out[parents_array_len * BLAKE3_OUT_LEN],
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
&child_chaining_values[2 * parents_array_len * BLAKE3_OUT_LEN],
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
BLAKE3_OUT_LEN);
~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3.c: In function 'blake3_hasher_update.part.0':
../../dist/crypto/blake3/blake3.c:346:11: note: 'out_array' declared here
uint8_t out_array[MAX_SIMD_DEGREE_OR_2 * BLAKE3_OUT_LEN / 2];
^~~~~~~~~
In function 'compress_subtree_to_parent_node',
inlined from 'blake3_hasher_update.part.0' at ../../dist/crypto/blake3/blake3.c:522:7:
../../dist/crypto/blake3/blake3.c:350:5: warning: 'memcpy' forming offset [33, 96] is out of the bounds [0, 32] of object 'out_array' with type 'uint8_t[32]' {aka 'unsigned char[32]'} [-Warray-bounds]
memcpy(cv_array, out_array, num_cvs * BLAKE3_OUT_LEN);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3.c: In function 'blake3_hasher_update.part.0':
../../dist/crypto/blake3/blake3.c:346:11: note: 'out_array' declared here
uint8_t out_array[MAX_SIMD_DEGREE_OR_2 * BLAKE3_OUT_LEN / 2];
^~~~~~~~~
../../dist/crypto/blake3/blake3_neon.c:231:6: warning: no previous prototype for 'blake3_hash4_neon' [-Wmissing-prototypes]
void blake3_hash4_neon(const uint8_t *const *inputs, size_t blocks,
^~~~~~~~~~~~~~~~~
../../dist/crypto/blake3/blake3_neon.c:326:6: warning: no previous prototype for 'blake3_hash_many_neon' [-Wmissing-prototypes]
void blake3_hash_many_neon(const uint8_t *const *inputs, size_t num_inputs,
^~~~~~~~~~~~~~~~~~~~~

allow explicit control over single-threaded vs multi-threaded

Right now there is only the top-level rayon feature, so multi-threading is all-or-nothing. It would be better to allow each caller to choose, and perhaps to provide other executors besides Rayon. We could keep the default governed by the rayon feature, or we could make the default always single-threaded, not sure yet.

This might allow us to support a flag in b3sum that controls multithreading at runtime, for example.

@pcwalton mentioned one approach here: https://twitter.com/pcwalton/status/1215349922267987970

Hasher::new_derive_key accepts a str, why not [u8]?

There is no reason for UTF-8 here. Please s/str/[u8]/ here. Alternatively provide another function to avoid the breaking change. Though, as I've mentioned in #11 I'm not against a breaking change for consistency with other crates, especially this early.

https://docs.rs/blake3/0.1.0/blake3/struct.Hasher.html#method.new_derive_key

pub fn new_derive_key(context: &str) -> Self

add NEON support to blake3_dispatch.c

Cargo Clippy makes some good reccomendations

If you run cargo clippy it seems to make some good recommendations. Things like:
1.You multiply 0*4 in many lines. The answer would be 0. It might be faster to replace these.
2. You multiply items by 1, you don't need the 1.
3. You do some masking with a 0 (answer is only 0)
4. Some passing of small items might be faster if one does not use a reference.

Cargo.toml license entry doesn't match LICENSE file

The license file says CC0 or Apache-2.0, but the Cargo.toml only lists Apache-2.0

You can use OR to make it an option, so CC0 OR Apache-2.0

Build fails with clang 7: invalid operand for instruction

When trying to build the assembly variant of the latest master with clang 7 it fails:

$ make test_asm CC=clang-7
clang-7 -O3 -Wall -Wextra -std=c11 -pedantic -DBLAKE3_TESTING -fsanitize=address,undefined  blake3.c blake3_dispatch.c blake3_portable.c main.c blake3_sse41-x86_64-unix.S blake3_avx2-x86_64-unix.S blake3_avx512-x86_64-unix.S -o blake3
blake3_sse41-x86_64-unix.S:1380:9: error: invalid operand for instruction
        jne 1b
        ^
blake3_sse41-x86_64-unix.S:1636:9: error: invalid operand for instruction
        jmp 1b
...

Clang 8 works as expected.

md5sum is slightly faster than b3sum on tiny files

$ for i in {1..1000};do echo $i > f$i;done
$ hyperfine 'md5sum f{1..1000}' 'b3sum f{1..1000}'
Benchmark #1: md5sum f{1..1000}
  Time (mean ± σ):      11.2 ms ±   0.6 ms    [User: 5.9 ms, System: 5.2 ms]
  Range (min … max):    10.3 ms …  13.5 ms    196 runs
 
Benchmark #2: b3sum f{1..1000}
  Time (mean ± σ):      13.7 ms ±   0.8 ms    [User: 6.2 ms, System: 7.3 ms]
  Range (min … max):    12.8 ms …  20.7 ms    168 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Summary
  'md5sum f{1..1000}' ran
    1.22 ± 0.09 times faster than 'b3sum f{1..1000}'

I guess the reason is that read is more effective than mmap on small files.

Strace output (md5sum):

openat(AT_FDCWD, "f998", O_RDONLY)      = 3
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=4, ...}) = 0
read(3, "998\n", 32768)                 = 4
read(3, "", 28672)                      = 0
lseek(3, 0, SEEK_CUR)                   = 4
close(3)                                = 0

Strace output (b3sum):

openat(AT_FDCWD, "f998", O_RDONLY|O_CLOEXEC) = 3
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_ALL, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=4, ...}) = 0
mmap(NULL, 4, PROT_READ, MAP_SHARED, 3, 0) = 0x7fd03e7d2000
munmap(0x7fd03e7d2000, 4)               = 0
close(3)                                = 0

b3sum 0.1.1
md5sum (GNU coreutils) 8.31

[C/main.c] function memset used before return of malloc was done.

Hello.

In main.c memset was call on memory before this block was checked on NULL after it get from malloc function. One of solution placed it after this check was done.

Next pull request can be applied to do this.

Thank you.

P.S.
Also out_len can be check before malloc, but that is different story.

prefetch optimizations for the NEON implementation?

We've added these for x86, but not for NEON. They might help with very large inputs.

non-test C implementation fails to build

gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0

% make
gcc -O3 -Wall -Wextra -std=c11 -pedantic -c blake3_sse41.c -o blake3_sse41.o -msse4.1 -D BLAKE3_USE_SSE41
gcc -O3 -Wall -Wextra -std=c11 -pedantic -c blake3_avx2.c -o blake3_avx2.o -mavx2 -D BLAKE3_USE_SSE41 -D BLAKE3_USE_AVX2
gcc -O3 -Wall -Wextra -std=c11 -pedantic -c blake3_avx512.c -o blake3_avx512.o -mavx512f -mavx512vl -D BLAKE3_USE_SSE41 -D BLAKE3_USE_AVX2 -D BLAKE3_USE_AVX512
gcc -O3 -Wall -Wextra -std=c11 -pedantic blake3.c blake3_dispatch.c blake3_portable.c main.c blake3_sse41.o blake3_avx2.o blake3_avx512.o -o blake3
/tmp/ccJZjf2t.o: In function `main':
main.c:(.text.startup+0x1f9): undefined reference to `get_cpu_features'
main.c:(.text.startup+0x245): undefined reference to `g_cpu_features'
collect2: error: ld returned 1 exit status
Makefile:6: recipe for target 'all' failed
make: *** [all] Error 1

Similar story with clang version 8.0.0-3~ubuntu18.04.2 (tags/RELEASE_800/final)

build error with c lib

gcc -shared -O3 -o libblake3.so blake3.c blake3_dispatch.c blake3_portable.c \
>     blake3_sse41_x86-64_unix.S blake3_avx2_x86-64_unix.S blake3_avx512_x86-64_unix.S
/usr/bin/ld: /tmp/ccoz9HaM.o: relocation R_X86_64_PC32 against symbol `blake3_simd_degree' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Bad value
collect2: error: ld returned 1 exit status

I'm getting the error above when trying to compile the share library. including -fPIC solves it, but not sure if that will result in intended behavior. a small clarification on this would be good

Add miri to CI

OID Value

Currently I use this fake OID value (next available one after blake2 family) but we probably need an official one to be defined

1 3 6 1 4 1 1722 12 2 3 8 : BLAKE3 : blake3

Reference: http://www.oid-info.com/cgi-bin/display?oid=1+3+6+1+4+1+1722+12+2&action=display

help: Is there any way to use this awesome lib in NodeJs?

I am NodeJs developer and would like to use this for my NodeJs project. Is there any way to use it?

PostgreSQL extension

I'm no C coder, but perhaps we can achieve this together?

https://www.percona.com/blog/2019/04/05/writing-postgresql-extensions-is-fun-c-language/

Constant-Time fuzzing results

I built a bunch of constant-time fuzzing targets for blake3 using sidefuzz

The fuzzing targets can be found here: http://github.com/phayes/sidefuzz-targets

Results are as follows:

hash (input): constant time ✔️
keyed_hash (input): constant time ✔️
keyed_hash (key): constant time ✔️
keyed_hash (key): constant time ✔️
reference_impl_hash (input): constant time ✔️
keyed_hash (key_material): constant time ✔️
keyed_hash (context): fuzzing-failed

So everything looks good except that I couldn't properly fuzz keyed_hash in relation to the context since it takes a string, and sidefuzz produces &[u8]s as it's fuzzing inputs.

So to fuzz keyed_hash in relation to context we would either need to resolve #13 or I would need to improve sidefuzz to have a function that provides a string.

Deny warnings in ci?

-D warnings as per https://www.reddit.com/r/rust/comments/f5xpib/psa_denywarnings_is_actively_harmful/fi1k83a

use assembly implementations by default

This probably makes sense for most callers. These both build faster and perform better. Using the fastest implementations by default will also make people less confused when they do their own benchmarks. Instead of the current "c" feature, we can expose a "pure" feature, for applications that want to remove the dependency on the C/assembler toolchain. In both cases, top level applications should generally be the ones to decide which implementations get included, and other libraries should generally avoid making that decision.

Callable from C (no_mangle pub extern as in c_neon)

Would it be possible to export the fast mt versions from Rust to C/C++, to compare against the single-threaded C version? I'm pretty sure many C/C++ projects would want to use the faster rust versions.

I'd also want to compare the C versions against the rust versions in smhasher.

Build fails on GCC 5.4 with "invalid register operand for `vmovdqu'"

When trying to build the C implementation of BLAKE3 on Ubuntu 16.04 LTS (used by Travis) the build fails when compiling blake3_avx512.c:

gcc -O3 -Wall -Wextra -std=c11 -pedantic  -c blake3_avx512.c -o blake3_avx512.o -mavx512f -mavx512vl
/tmp/ccaq5stm.s: Assembler messages:
/tmp/ccaq5stm.s:3763: Error: invalid register operand for `vmovdqu'
/tmp/ccaq5stm.s:3765: Error: invalid register operand for `vmovdqu'
Makefile:40: recipe for target 'blake3_avx512.o' failed
make: *** [blake3_avx512.o] Error 1

Looking at the generated assembler code, GCC generates vmovdqu %ymm17, (%rax) which is an invalid instruction as far as I can tell (VEX encoded instead of EVEX and can thus only access ymm0-ymm15). So it looks to be a compiler bug. But, if I add -mavx512bw GCC instead uses vmovdqu8 %ymm17, (%rax) which compiles.

I assume that get_cpu_features() should be updated to if -mavx512bw is to be used, but other than that, is there any downside with using it?

This error is seen in the CI of ccache, see ccache/ccache#519.

How to modify number of rounds we hash?

I know that Blake3 hashes for 7 rounds, and I know that is believed to be secure enough, but it sounds like just barely so. I would like to be extra careful as I'm working on an application that needs paranoid levels of security.

Would be really nice if there were a simple way to just hash for a variable number of rounds, and I could simply add a few more, is that possible? If not, you might want to consider adding it. Fights off everyone who says Blake3 isn't as secure as Blake2 and seems easy enough to add.

The other option is to simply double or triple hash.. is there any reason that is not as secure as 14 or 21 rounds?

Compiler warnings: -Wmissing-prototypes

When enabling this option I get a lot of hits in the various C code

../../dist/crypto/blake3/blake3_avx512.c:292:6: warning: no previous prototype for 'blake3_compress_xof_avx512' [-Wmissing-prototypes]
void blake3_compress_xof_avx512(const uint32_t cv[8],
^~~~~~~~~~~~~~~~~~~~~~~~~~

Would it be possible to make sure all functions are either local to a module and static or defined in a header?

b3sum incredibly slow

b3sum runs incredibly slow.

PC:
Windows 10 Pro
i7-6700
32GB RAM

b3sum built using: cargo install b3sum (v0.1.2)

Test File: 2.88GB

Results:

md5sum: 10 seconds
b2sum: 5 seconds
b3sum: 4 Minutes 37 Seconds

Add method `reset` for `Hasher` ?

Will this be useful in scenarios where hash calculations are very frequent?

b3sum has poor performance for large files on spinning disks, when multi-threading is enabled

When using blake2b I get around 13sec when hashing 941MB file using blake2_bin and rash crates but I get about 30sec when hashing the same file using b3sum crate. The environment is rust 1.40 on Ubuntu 18.04 using Core i3

Compilation error for C

Hi, I have this compilation error for C version:

gcc -O3 -Wall -Wextra -std=c11 -pedantic -c blake3_sse41.c -o blake3_sse41.o -msse4.1
gcc -O3 -Wall -Wextra -std=c11 -pedantic -c blake3_avx2.c -o blake3_avx2.o -mavx2
gcc -O3 -Wall -Wextra -std=c11 -pedantic -c blake3_avx512.c -o blake3_avx512.o -mavx512f -mavx512vl
gcc -O3 -Wall -Wextra -std=c11 -pedantic blake3.c blake3_dispatch.c blake3_portable.c main.c blake3_sse41.o blake3_avx2.o blake3_avx512.o -o blake3
/usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: /tmp/ccCfPl6z.o: warning: relocation against g_cpu_features' in read-only section .text.startup'
/usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: /tmp/ccCfPl6z.o: in function main': main.c:(.text.startup+0x22b): undefined reference to get_cpu_features'
/usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: main.c:(.text.startup+0x275): undefined reference to `g_cpu_features'
/usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: warning: creating a DT_TEXTREL in object
collect2: error: ld returned 1 exit status
make: *** [Makefile:35: all] Error 1

Allow b3sum to output raw

Please could you provide the option to output the hash as the raw bytes?

substantial prefetch optimization

@sneves has come up with a huge optimization for longer inputs, using memory prefetches. This issue is a note-to-self until I have time to get a proper commit together:

diff --git a/src/avx2.rs b/src/avx2.rs
index 841b6e1..a1cb3d0 100644
--- a/src/avx2.rs
+++ b/src/avx2.rs
@@ -261,6 +261,16 @@ unsafe fn transpose_msg_vecs(inputs: &[*const u8; DEGREE], block_offset: usize)
         loadu(inputs[6].add(block_offset + 1 * 4 * DEGREE)),
         loadu(inputs[7].add(block_offset + 1 * 4 * DEGREE)),
     ];
+    
+    _mm_prefetch(inputs[0].add(block_offset + 0 * 4 * DEGREE + 256) as * const i8, _MM_HINT_T0);
+    _mm_prefetch(inputs[1].add(block_offset + 0 * 4 * DEGREE + 256) as * const i8, _MM_HINT_T0);
+    _mm_prefetch(inputs[2].add(block_offset + 0 * 4 * DEGREE + 256) as * const i8, _MM_HINT_T0);
+    _mm_prefetch(inputs[3].add(block_offset + 0 * 4 * DEGREE + 256) as * const i8, _MM_HINT_T0);
+    _mm_prefetch(inputs[4].add(block_offset + 0 * 4 * DEGREE + 256) as * const i8, _MM_HINT_T0);
+    _mm_prefetch(inputs[5].add(block_offset + 0 * 4 * DEGREE + 256) as * const i8, _MM_HINT_T0);
+    _mm_prefetch(inputs[6].add(block_offset + 0 * 4 * DEGREE + 256) as * const i8, _MM_HINT_T0);
+    _mm_prefetch(inputs[7].add(block_offset + 0 * 4 * DEGREE + 256) as * const i8, _MM_HINT_T0);
+    
     let squares = mut_array_refs!(&mut vecs, DEGREE, DEGREE);
     transpose_vecs(squares.0);
     transpose_vecs(squares.1);

Enhancement: b3sum add -c --check flag for verification

Adding -c, --check would make b3sum a drop in replacement for b2sum in some scripts.

[MSVC] support of Microsoft C compiler is missed.

Hello.

There is a series of issues at source files that prevent to compile project at Microsoft C compiler:

including of POSIX header;
applying of unary minus operator to the unsigned value;
including of header that not applying to the all target platforms (ARM/x86/...).

One of vision how to resolve that all can be found in this request.

Thank you.

P.S.
Please note, here we talking about compile issue. Linking at MSVC environment have same issue like at gcc that already discussed here.

regenerate the graphs

There have been quite a few performance improvements since we originally published the paper and the big red bar chart. We should redo them.

build fails on recent versions of rustc nightly

e.g. rustc 1.44.0-nightly (f509b26a7 2020-03-18)

error[E0658]: use of unstable library feature 'stdsimd'
   --> src/platform.rs:304:12
    |
304 |         if is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512vl") {
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: see issue #27731 <https://github.com/rust-lang/rust/issues/27731> for more information
    = help: add `#![feature(stdsimd)]` to the crate attributes to enable
    = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

error: aborting due to previous error

For more information about this error, try `rustc --explain E0658`.
error: could not compile `blake3`.

Excessive futex syscalling with no-mmap

Problem:
b3sum --no-mmap is using a tiny read buffer size (8KB) and syscalls read(2) and futex(2) aggressively. I find that hashing at only 110 MB/s maxes out all 24 cores, doing about 10k syscalls per second (tail -f strace.log | pv -l >/dev/null).

top

CPU% MEM%   TIME+  Command                                                                                                                                                                                                           
2025  0.0 22:58.81 b3sum --no-mmap idsab.tar.zst

strace.log

futex(0x563438b1804c, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x563438b17ff0, FUTEX_WAKE_PRIVATE, 1) = 0
read(3, "\3\352\t\225\213r\371\3421\231\361\303\306\237\252\362\274\212>\33\t\f&\240u\31\252\331\310\242\265T"..., 8192) = 8192
futex(0x563438b14028, FUTEX_WAKE_PRIVATE, 2147483647) = 2
futex(0x563438b13fd0, FUTEX_WAKE_PRIVATE, 1) = 1
read(3, "9\22\373\5\17\357\35\0364\35\371{\324\220\275J~B0$\273\332\264\20\v\352\304\23\301l\316\320"..., 8192) = 8192
futex(0x563438b18048, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x563438b17ff0, FUTEX_WAKE_PRIVATE, 1) = 0
read(3, "\225/\247\216\367Q\375,E\357\305\367\323\321\2\223f\266\n\222\257\305\260qLJ\f\247\370\202\6\306"..., 8192) = 8192
futex(0x563438b1402c, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x563438b13fd0, FUTEX_WAKE_PRIVATE, 1) = 1
read(3, "7\365\30n\370K\f\265\200\363r7\35E\321\347_\226\351-\331\24\232\200\352\233\316W\20\2121\264"..., 8192) = 8192
futex(0x563438b14028, FUTEX_WAKE_PRIVATE, 2147483647) = 2
futex(0x563438b13fd0, FUTEX_WAKE_PRIVATE, 1) = 1
read(3, "\234\235\250\352\315\343*\t\215\205o(\231\230\325\f\222\257o\302A3\315z\240}\364'\327\33Wj"..., 8192) = 8192
... and so on

Environment:

Built with cargo install b3sum (as of 2020-02-03)
Stable cargo 1.41.0 (626f0f40e 2019-12-03)
ZFS on Linux
Ubuntu 20.04
Kernel 5.4.0-9-generic

an "official" recommended way to mix associated data with a derived key

The spec and docs for derive_key are very clear that the context string should be hardcoded, without runtime input of any kind. The fundamental purpose of derive_key is domain separation, and hardcoded context strings make separation more reliable: as long as your is different from mine, we're guaranteed not to interact in any way.

However, many applications do want to incorporate dynamic data into their derived keys somehow. For a silly example, maybe I have a disk encryption key, and I want to derive a separate subkey for each drive on my machine. I need to use each drive label as associated data somehow. To make this more interesting, let's say that I want a separate subkey for each partition too, so there are two distinct pieces of associated data. How should I do this?

Well, here's how I would do it:

Start with by using derive_key with a hardcoded context string, which describes what we're about to do. For example, "Jack's disk encrypt-o-matic 2020-01-24 11:47:57 drive + partition subkeys". The output is a 256-bit intermediate subkey.
Mix in the disk label. The most natural way to do this is with keyed_hash. The key is the output of step 1, and the input is the disk label. The output is another 256-bit intermediate subkey.
Mix in the partition label. Again we'll use keyed_hash, with the key being the output of step 2. The output here is our final derived subkey, presumably also 256 bits.

Now, what I'd like to do is generalize this sort of thing. Some observations:

I think using keyed_hash is the most elegant way to do steps 2 and 3, but I want to note that it doesn't super duper matter what we do. We could concatenate the key and the associated data and just use hash. Because of the domain separation provided in step 1, we don't have to worry about whether our approach is going to interact badly with some other approach that someone else takes. Our keys will never interact.

Concatenating things is a little tricky, because you have to worry about ambiguities in the concatenation boundary. This doesn't really matter if one of the things you're concatenating is a fixed length, but it's something to be aware of. This is one of the reasons I prefer keyed_hash: there's no possible ambiguity between what's the key and what's the input.

One real world hazard that could come up here is if the application considers some associated data "optional". For example, maybe some disk has no partitions, so we don't have a partition label to use in step 3. What we don't want to do is skip steps entirely, because that could introduce another type of ambiguity: which step did we skip? So there should be some kind of ruling about optional AD parameters. Maybe they can be empty, but not missing? We could also try to use a different context string at every step, but that seems awfully heavyweight.

So anyway, I'm thinking about generalizing the scheme above into some kind of broad advice for generating subkeys with associated data. I'll use this issue for keeping track of my thoughts. Hopefully other folks can chime in with theirs.

use of unstable library feature 'stdsimd'

how can i move on with my life

error[E0658]: use of unstable library feature 'stdsimd'
   --> /home/oussama/.cargo/registry/src/github.com-1ecc6299db9ec823/blake3-0.1.4/src/platform.rs:266:12
    |
266 |         if is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("avx512vl") {
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: for more information, see https://github.com/rust-lang/rust/issues/27731
    = help: add `#![feature(stdsimd)]` to the crate attributes to enable
    = note: this error originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)

error: aborting due to previous error

For more information about this error, try `rustc --explain E0658`.
error: could not compile `blake3`.

32-bit x86 CPUID code returns EBP instead of EBX

The inline assembly used by 32-bit x86 code needs to save and restore the EBX register because position independent code needs it unfortunately (https://ewontfix.com/18/). Since the CPUID instruction also uses the EBX register to return information, the inline assembly has to copy that data into another register (ESI in this case) before it restores the original EBX.

However, both the cpuid() and the cpuidex() functions contain a typo, and they return whatever is in the EBP register (essentially a random value) instead of EBX. This breaks run-time detection of AVX2 and newer CPU features.

The fix is very simple:

diff --git a/c/blake3_dispatch.c b/c/blake3_dispatch.c
index 7daf43e..b97b7c0 100644
--- a/c/blake3_dispatch.c
+++ b/c/blake3_dispatch.c
@@ -97,7 +97,7 @@ static void cpuid(uint32_t out[4], uint32_t id) {
   __cpuid((int *)out, id);
 #else
 #if defined(__i386__) || defined(_M_IX86)
-  __asm__ __volatile__("pushl %%ebx\ncpuid\nmovl %%ebp, %%esi\npopl %%ebx"
+  __asm__ __volatile__("pushl %%ebx\ncpuid\nmovl %%ebx, %%esi\npopl %%ebx"
                        : "=a"(out[0]), "=S"(out[1]), "=c"(out[2]), "=d"(out[3])
                        : "a"(id));
 #else
@@ -113,7 +113,7 @@ static void cpuidex(uint32_t out[4], uint32_t id, uint32_t sid) {
   __cpuidex((int *)out, id, sid);
 #else
 #if defined(__i386__) || defined(_M_IX86)
-  __asm__ __volatile__("pushl %%ebx\ncpuid\nmovl %%ebp, %%esi\npopl %%ebx"
+  __asm__ __volatile__("pushl %%ebx\ncpuid\nmovl %%ebx, %%esi\npopl %%ebx"
                        : "=a"(out[0]), "=S"(out[1]), "=c"(out[2]), "=d"(out[3])
                        : "a"(id), "c"(sid));
 #else

Comparison between this and KangarooTwelve and M14

What are the speed and security difference between BLAKE3 and KangarooTwelve and K14 by Team Keccak?
Main page: https://keccak.team/kangarootwelve.html
Specification: https://keccak.team/files/KangarooTwelve.pdf
Base implementation: https://github.com/XKCP/XKCP/tree/master/Standalone/KangarooTwelve/Python
Rust Implementation: https://github.com/XKCP/XKCP/tree/master/Standalone/KangarooTwelve/Rust
RFC standard: https://datatracker.ietf.org/doc/draft-viguier-kangarootwelve/

As a side joke, I made this.

There once were two empires in Hashlandia:
The Keccak Empire led by Bertoni, Daemen, Peeters and  VanAssche,
And the BLAKE Empire led by Neves and Aumasson.
A long time ago, there was a war of SHA3, and Keccak was victorious, but it will not last long.
In retaliation, King Neves decided to fire back with the invasion of BLAKE II,
And as King Bertoni sees his empire falling apart because of it, Kangaroo XII has been staged.
After King Neves noticed their schemes, BLAKE III is coming.

Python and/or JS implementations

This might be a good idea for educational purposes.

fix blake3_neon.c on big-endian ARM

The current implementations of loadu_128 and storeu_128 use memcpy to read uint32x4_t vectors of e.g. hash input, which implicitly assumes the machine byte order is little-endian. My understanding is that that's true of most but not all ARM devices. We should figure out how to either make the code endianness-independent, or how to produce a compile-time error on big-endian ARM.

	let mut derive_key_out = [0; OUTPUT_LEN];
	blake3::Hasher::new_derive_key(TEST_CONTEXT)
	.update(&input)
	.finalize_xof()
	.fill(&mut derive_key_out);