
cothan / neon-sha3_2x


NEON ARMv8 SHA3_2x: two SHA3 or SHAKE128/256 computations in one call. Used in a Post-Quantum Cryptography submission.

License: Apache License 2.0



NEON ARMv8 SHA3_2x

Update

This package now supports the ARMv8.2-SHA3 instructions. The results improve significantly when the SHA-3 instructions are used.

Apple M1

SHA-3 Enabled
2022-05-06T06:04:41-04:00
Running ./benchmark
Run on (8 X 24.1207 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x8)
Load Average: 2.07, 1.91, 1.75
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_F1600x2        156 ns          156 ns      4135039
BM_F1600          218 ns          218 ns      3166704
SHA-3 Disabled
2022-05-06T06:09:10-04:00
Running ./benchmark
Run on (8 X 24.0697 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x8)
Load Average: 2.40, 2.16, 1.90
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_F1600x2        355 ns          355 ns      1948260
BM_F1600          217 ns          217 ns      3220716

Even with the SHA-3 instructions disabled, one Keccak-F1600x2 call (355 ns) is still faster than two Keccak-F1600 calls (2 × 217 ns = 434 ns).

NEON ARMv8 Keccak2x Implementation.

Since there was no SIMD128 implementation for ARMv8, I decided to write one.

The result is not impressive, for two reasons. SHA3 uses native bitwise operations such as AND, NOT, and XOR, and each of those takes only about 1 cycle on the CPU, therefore:

  • Pipelining brings no benefit.

  • A 128-bit SIMD width gives no significant improvement over the 64-bit native register width of ARMv8. I suppose the frequency in NEON mode is lower than in scalar mode. (I don’t know the term for this, please let me know.)
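To make the 2-way layout concrete, here is a minimal portable sketch (not the repository's code) of the Keccak chi step applied to two independent states at once. The `lane2` type and both function names are hypothetical: `lane2` stands in for a 128-bit NEON register that holds one 64-bit lane from each state. On AArch64 the per-lane expression would presumably map to `veorq_u64(a, vbicq_u64(c, b))` on `uint64x2_t`.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit vector: one 64-bit Keccak lane
 * from each of two independent states. */
typedef struct { uint64_t s[2]; } lane2;

/* a ^ (~b & c), computed for both states at once. */
static lane2 chi_lane2(lane2 a, lane2 b, lane2 c)
{
    lane2 r;
    for (int i = 0; i < 2; i++)
        r.s[i] = a.s[i] ^ (~b.s[i] & c.s[i]);
    return r;
}

/* Chi on one 5-lane row of two interleaved states, in place. */
static void chi_row_x2(lane2 row[5])
{
    lane2 in[5];
    for (int x = 0; x < 5; x++)
        in[x] = row[x];          /* keep the inputs: chi reads neighbours */
    for (int x = 0; x < 5; x++)
        row[x] = chi_lane2(in[x], in[(x + 1) % 5], in[(x + 2) % 5]);
}
```

Each vector operation does the work of two scalar ones, so the ideal gain is 2x. That gain is only realised if the vector unit can issue these simple operations about as fast as the scalar unit does.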

This code could run faster than in this benchmark if:

  • The SIMD register width is wider, e.g., 256 or 512 bits

  • The frequency in NEON mode is greater than 0.5 * (scalar frequency)

What is inside this package?

NEON (ASIMD) ARMv8 implementation of:

  • KeccakP-1600

  • SHAKE128 : Absorb, Squeeze

  • SHAKE256 : Absorb, Squeeze

  • SHA3_256

  • SHA3_512

Result

System Information

Here is my benchmark on an ARMv8 Raspberry Pi running 64-bit Manjaro:

OS
Distributor ID: Manjaro-ARM
Description:    Manjaro ARM Linux
Release:        20.10
CPU
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
Vendor ID:                       ARM
Model:                           3
Model name:                      Cortex-A72
Stepping:                        r0p3
CPU max MHz:                     1900.0000
CPU min MHz:                     600.0000
Flags:                           fp asimd evtstrm crc32 cpuid

I overclocked the Raspberry Pi to 1900 MHz. The default CPU frequency is 1500 MHz.

Result

All benchmarks were run via this command:

make all
taskset 0x1 ./benchmark_SHAKE128_256_1000.bin

The taskset command pins the process to a single CPU, which avoids the cost of switching between CPUs.

Table 1. Result

Output Length   Input Length   FIPS202x2 NEON   FIPS202x2 C
-----------------------------------------------------------
           42            672              487           514
          294            336              390           413
         1008             42              586           606
         2772           1008             2228          2287
         3318            504             2230          2286
         4074           1008             3004          3099

The results above were measured over 1000 iterations, as set by #define TESTS 1000.

The full results, for 1,000 and 1,000,000 iterations, are in data/.
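As an illustration of how an iteration count like TESTS typically enters a measurement, here is a minimal sketch of a timing loop that reports the mean time per call. It is an assumption, not the repository's benchmark code: `bench_mean_ns` and `count_call` are hypothetical names, and the standard `clock()` timer is used only for portability. The actual benchmark may use a different timer or unit.

```c
#include <time.h>

#define TESTS 1000  /* iteration count, as in the README */

/* Run fn(arg) TESTS times and return the mean CPU time per call in ns. */
static double bench_mean_ns(void (*fn)(void *), void *arg)
{
    clock_t start = clock();
    for (int i = 0; i < TESTS; i++)
        fn(arg);
    clock_t end = clock();

    double total_ns = (double)(end - start) / (double)CLOCKS_PER_SEC * 1e9;
    return total_ns / TESTS;
}

/* Placeholder workload: counts how many times it was called.
 * A real run would invoke the hash function under test here. */
static void count_call(void *arg)
{
    int *calls = arg;
    (*calls)++;
}
```

Averaging over many iterations smooths out timer granularity and one-off effects such as cold caches, which is why the 1,000,000-iteration runs in data/ are likely the more stable ones.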

Graph

If the files in data/ are confusing, here are some graphs:

shake128

shake256

  • The orange line is the difference between the C reference code and the NEON implementation (C_ref - NEON)

  • The green line is the average of C_ref - NEON over 24 samples
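The two plotted series can be reproduced from the raw measurements roughly as follows. This sketch assumes that "average of 24 samples" means a trailing moving average. A per-block average would be an equally plausible reading, and the function names here are hypothetical.

```c
#include <stddef.h>

#define WINDOW 24  /* samples per average, as stated for the green line */

/* diff[i] = c_ref[i] - neon[i]: the orange series. */
static void diff_series(const double *c_ref, const double *neon,
                        double *diff, size_t n)
{
    for (size_t i = 0; i < n; i++)
        diff[i] = c_ref[i] - neon[i];
}

/* Trailing moving average over at most WINDOW samples: the green
 * series under the assumption stated above. */
static void moving_average(const double *in, double *out, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += in[i];
        if (i >= WINDOW)
            sum -= in[i - WINDOW];   /* drop the sample leaving the window */
        size_t count = (i < WINDOW) ? i + 1 : WINDOW;
        out[i] = sum / (double)count;
    }
}
```

A positive value in the difference series means the C reference measured higher than NEON at that point.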

Note that in some cases the C reference is faster than NEON. For small output lengths, NEON is about 5% faster than the C reference.

Conclusion

One call to the Keccak2x NEON version is always faster than two calls to the Keccak C version. See the bench() function.

  • If you only call Keccak once, use the C version; it’s faster.

  • If you call Keccak multiple times, use the NEON version; it saves some time.
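The rule of thumb above can be sketched as a small dispatcher that sends pairs of independent states through a 2-way routine and a leftover odd state through the scalar one. Every name and signature below is hypothetical, including the `toy_x1`/`toy_x2` stand-ins, which are not Keccak. The package's real 2-way API may take an interleaved state instead of two separate ones.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*permute_x1_fn)(uint64_t state[25]);
typedef void (*permute_x2_fn)(uint64_t state_a[25], uint64_t state_b[25]);

/* Permute n independent Keccak states: pairs go through the 2-way
 * routine, a leftover odd state goes through the scalar one. */
static void permute_many(uint64_t (*states)[25], size_t n,
                         permute_x2_fn x2, permute_x1_fn x1)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2)
        x2(states[i], states[i + 1]);
    if (i < n)
        x1(states[i]);
}

/* Toy stand-ins (NOT Keccak): tag lane 0 so the routing can be checked. */
static void toy_x1(uint64_t s[25]) { s[0] = 1; }
static void toy_x2(uint64_t a[25], uint64_t b[25]) { a[0] = 2; b[0] = 2; }
```

With five states, for example, `permute_many(states, 5, toy_x2, toy_x1)` makes two 2-way calls and one scalar call.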
