0xade1a1de / cryptopt Goto Github PK

CryptOpt: Verified Compilation with Randomized Program Search for Cryptographic Primitives

Home Page: https://0xade1a1de.github.io/CryptOpt/

License: Apache License 2.0

Dockerfile 0.01% Makefile 0.06% TypeScript 86.39% Shell 0.01% C 0.27% sed 0.02% TeX 0.50% JavaScript 0.01% Assembly 12.67% Gnuplot 0.05%

cryptopt's People

Contributors

Stargazers

Watchers

Forkers

samuel-tian jamestiotio adaphath andres-erbsen mquaresma dearborn-open-ai kanin-kearpimy john-gong-math davidnevadoc harutokimura stardev116

cryptopt's Issues

Add the cycle goal to the assemblyfile

Add the cycle goal to the assemblyfile as a comment

Memory and storage requirements?

Do you have a good sense of how much RAM and disk CryptOpt should use? I haphazardly attempted a week-long CryptOpt run and saw it reach memory exhaustion on day 2. Here's an output from the rare instance that got killed by node instead of Linux:

fiat_curve25519_solinas_square| run|  6|bs  156|#inst: 103|cycl_     12|G  26 cycl _  0|B  26 cycl _  0|L  64|l/g 2.5098| P|P[  -1/   0/   1/  -1]|D[FL/ 48/ 13/ 5
7]|47.2M(43%) 115/s                                                                                                                          
<--- Last few GCs --->                                                                                                                       
                                                                                                                                             
[212:0x67a1c50] 513566914 ms: Scavenge 3885.3 (4129.6) -> 3878.9 (4129.6) MB, 3.2 / 0.0 ms  (average mu = 0.629, current mu = 0.642) task;   
[212:0x67a1c50] 513566933 ms: Scavenge 3886.7 (4129.6) -> 3879.1 (4129.6) MB, 3.4 / 0.0 ms  (average mu = 0.629, current mu = 0.642) allocation failure; 
[212:0x67a1c50] 513566952 ms: Scavenge 3886.0 (4129.6) -> 3879.2 (4145.6) MB, 3.7 / 0.0 ms  (average mu = 0.629, current mu = 0.642) task;

Does 4GB/process look like expected memory usage to you? I could provision that much, but I can't think of a reason why it would be needed.

JS stacktrace:

``` FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory 1: 0xb6b850 node::Abort() [node] 2: 0xa806a6 [node] 3: 0xd52140 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node] 4: 0xd524e7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node] 5: 0xf2fbe5 [node] 6: 0xf420cd v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node] 7: 0xfb1124 v8::internal::ScavengeJob::Task::RunInternal() [node] 8: 0xe2187b non-virtual thunk to v8::internal::CancelableTask::Run() [node] 9: 0xbd6684 [node] 10: 0xbd9aee node::PerIsolatePlatformData::FlushForegroundTasksInternal() [node] 11: 0x1652906 [node] 12: 0x1664e44 [node] 13: 0x165326e uv_run [node] 14: 0xaafa2d node::SpinEventLoop(node::Environment*) [node] 15: 0xbb11f4 node::NodeMainInstance::Run() [node] 16: 0xb26c44 node::LoadSnapshotDataAndRun(node::SnapshotData const**, node::InitializationResult const*) [node] 17: 0xb2a83f node::Start(int, char**) [node] 18: 0x7fde10f7ed90 [/lib/x86_64-linux-gnu/libc.so.6] 19: 0x7fde10f7ee40 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6] 20: 0xaad7ee _start [node] /root/CryptOpt/CryptOpt: line 31: 212 Aborted ```

I believe here is the invocation I used:

for i in $(seq 0 "$(("$(nproc)"-1))"); do
  tmux new-window -n "c$i"
  tmux send-keys "taskset -c $i ~/CryptOpt/CryptOpt --no-proof --resultDir /mnt/results --curve curve25519_solinas --method square --framePointer save --evals $((175*60*60*24*9))" C-m
  sleep "0.$RANDOM"
done

Additionally, the results directory seems to have acquired 49GB of csv files (and some asm and json). Are these something I'd may want to look at, or perhaps I should not be collecting them at all?

Target request: x86_64 without ADX

It would be nice to use CryptOpt to generate plain x86_64 code that does not depend on the ADX extension, to serve as a fallback from CryptOpt-optimized fast assembly in distributed binaries. ~~This is a requirement for deployment in BoringSSL, and~~ I hear it may be relevant to adoption of mit-plv/fiat-crypto#1582 as well.

I am thinking of use of CryptOpt in this context as primarily an assurance benefit, though if it's decently fast still, even better.

I would be happy to do the work for adapting CryptOpt here if you think that this would be a good first project to hack on in the CryptOpt codebase.

Native usage on Debian testing? (nodejs segfault)

Hi,

I am trying to run CryptOpt on natively on Debian testing because reasons. I'd like to know whether you think this has a chance of working. The current state is that I think I have managed to get both AssemblyLine and CryptOpt to build and link. asmline runs, but running CryptOpt exits with a segmentation fault:

Last lines of ./CryptOpt output with DEBUG=1 --verbose:

executing cmd gcc -march=native -mtune=native -O3 -fPIC -shared -o /tmp/CryptOpt.cache/uczayywaxs/libcheckfunctions-s1678729657975-bfiat-p4149251.so /home/andreser/CryptOpt/dist/data/fiat-bridge/.cache/85ea43d3500c431ba0ae9e2f56985f9990aa34fa42fe048a10d26e52b40f2df2.c with opts {"shell":"/usr/bin/bash"}
./CryptOpt: line 31: 4149251 Segmentation fault      PATH="$(realpath ./bins/node/bin):${PATH}" /usr/bin/env node "./dist/CryptOpt.js" "${@}"

gdb says:

Thread 1 "node" received signal SIGSEGV, Segmentation fault.
0x00007ffff7e25fd6 in ms_initialize () from /home/andreser/CryptOpt/node_modules/measuresuite/build/Release/measuresuite.node
(gdb) bt
#0  0x00007ffff7e25fd6 in ms_initialize ()
   from /home/andreser/CryptOpt/node_modules/measuresuite/build/Release/measuresuite.node
#1  0x00007ffff7e25b1b in init ()
   from /home/andreser/CryptOpt/node_modules/measuresuite/build/Release/measuresuite.node

more backtrace


#2  0x0000000000b10d7d in v8impl::(anonymous namespace)::FunctionCallbackWrapper::Invoke(v8::FunctionCallbackInfo const&) ()
#3  0x0000000000db0230 in v8::internal::MaybeHandle v8::internal::(anonymous namespace)::HandleApiCallHelper(v8::internal::Isolate*, v8::internal::Handle, v8::internal::Handle, v8::internal::Handle, v8::internal::Handle, v8::internal::BuiltinArguments) ()
#4  0x0000000000db176f in v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*) ()
#5  0x00000000016ef579 in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_BuiltinExit ()
#6  0x00000000016734d0 in Builtins_InterpreterEntryTrampoline ()
#7  0x00002e0bd55015b9 in ?? ()
#8  0x000015cb6c393811 in ?? ()
#9  0x0000000800000000 in ?? ()
#10 0x00002e0bd5501689 in ?? ()
#11 0x00003ebecb2fef29 in ?? ()
#12 0x0000000500000000 in ?? ()
#13 0x0000000100000000 in ?? ()
#14 0x0000000100000000 in ?? ()
#15 0x0000000100000000 in ?? ()
#16 0x0000000100000000 in ?? ()
#17 0x0000000500000000 in ?? ()
#18 0x00003ebecb2fef29 in ?? ()
#19 0x000015cb6c393811 in ?? ()
#20 0x0000064757026351 in ?? ()
#21 0x0000009400000000 in ?? ()
#22 0x000027d6971beb29 in ?? ()
#23 0x0000000000000008 in ?? ()
#24 0x00003a9153c02699 in ?? ()
#25 0x0000064757026351 in ?? ()
#26 0x00007fffffff9220 in ?? ()
#27 0x0000000001670e62 in Builtins_JSConstructStubGeneric ()
Backtrace stopped: frame did not save the PC

(gdb) disas $pc


Dump of assembler code for function ms_initialize:
   0x00007ffff7e25f30 <+0>:     push   %rbp
   0x00007ffff7e25f31 <+1>:     xor    %eax,%eax
   0x00007ffff7e25f33 <+3>:     mov    %rsp,%rbp
   0x00007ffff7e25f36 <+6>:     push   %r15
   0x00007ffff7e25f38 <+8>:     mov    %ecx,%r15d
   0x00007ffff7e25f3b <+11>:    push   %r14
   0x00007ffff7e25f3d <+13>:    mov    %edx,%r14d
   0x00007ffff7e25f40 <+16>:    push   %r13
   0x00007ffff7e25f42 <+18>:    mov    %esi,%r13d
   0x00007ffff7e25f45 <+21>:    push   %r12
   0x00007ffff7e25f47 <+23>:    mov    %rdi,%r12
   0x00007ffff7e25f4a <+26>:    push   %rbx
   0x00007ffff7e25f4b <+27>:    sub    $0x8,%rsp
   0x00007ffff7e25f4f <+31>:    call   0x7ffff7e24400 
   0x00007ffff7e25f54 <+36>:    mov    %rax,(%r12)
   0x00007ffff7e25f58 <+40>:    test   %rax,%rax
   0x00007ffff7e25f5b <+43>:    je     0x7ffff7e25f6f 
   0x00007ffff7e25f5d <+45>:    mov    %r13d,%esi
   0x00007ffff7e25f60 <+48>:    mov    %rax,%rdi
   0x00007ffff7e25f63 <+51>:    mov    %rax,%rbx
   0x00007ffff7e25f66 <+54>:    call   0x7ffff7e24580 
   0x00007ffff7e25f6b <+59>:    test   %eax,%eax
   0x00007ffff7e25f6d <+61>:    je     0x7ffff7e25f88 
   0x00007ffff7e25f6f <+63>:    mov    $0x1,%eax
   0x00007ffff7e25f74 <+68>:    add    $0x8,%rsp
   0x00007ffff7e25f78 <+72>:    pop    %rbx
   0x00007ffff7e25f79 <+73>:    pop    %r12
   0x00007ffff7e25f7b <+75>:    pop    %r13
   0x00007ffff7e25f7d <+77>:    pop    %r14
   0x00007ffff7e25f7f <+79>:    pop    %r15
   0x00007ffff7e25f81 <+81>:    pop    %rbp
   0x00007ffff7e25f82 <+82>:    ret
   0x00007ffff7e25f83 <+83>:    nopl   0x0(%rax,%rax,1)
   0x00007ffff7e25f88 <+88>:    mov    %r14d,%esi
   0x00007ffff7e25f8b <+91>:    mov    %rbx,%rdi
   0x00007ffff7e25f8e <+94>:    call   0x7ffff7e24540 
   0x00007ffff7e25f93 <+99>:    test   %eax,%eax
   0x00007ffff7e25f95 <+101>:   jne    0x7ffff7e25f6f 
   0x00007ffff7e25f97 <+103>:   mov    %r15d,%esi
   0x00007ffff7e25f9a <+106>:   mov    %rbx,%rdi
   0x00007ffff7e25f9d <+109>:   call   0x7ffff7e242b0 
   0x00007ffff7e25fa2 <+114>:   test   %eax,%eax
   0x00007ffff7e25fa4 <+116>:   jne    0x7ffff7e25f6f 
   0x00007ffff7e25fa6 <+118>:   mov    %rbx,%rdi
   0x00007ffff7e25fa9 <+121>:   call   0x7ffff7e24460 
   0x00007ffff7e25fae <+126>:   test   %eax,%eax
   0x00007ffff7e25fb0 <+128>:   jne    0x7ffff7e25f6f 
   0x00007ffff7e25fb2 <+130>:   mov    %rbx,%rdi
   0x00007ffff7e25fb5 <+133>:   call   0x7ffff7e241f0 
   0x00007ffff7e25fba <+138>:   test   %eax,%eax
   0x00007ffff7e25fbc <+140>:   jne    0x7ffff7e25f6f 
   0x00007ffff7e25fbe <+142>:   mov    %rbx,%rdi
   0x00007ffff7e25fc1 <+145>:   call   0x7ffff7e246d0 
   0x00007ffff7e25fc6 <+150>:   test   %eax,%eax
   0x00007ffff7e25fc8 <+152>:   jne    0x7ffff7e25f6f 
   0x00007ffff7e25fca <+154>:   mov    %rbx,%rdi
   0x00007ffff7e25fcd <+157>:   call   0x7ffff7e24560 
   0x00007ffff7e25fd2 <+162>:   test   %eax,%eax
   0x00007ffff7e25fd4 <+164>:   jne    0x7ffff7e25f6f 
=> 0x00007ffff7e25fd6 <+166>:   movl   $0x0,0x88(%rbx)
   0x00007ffff7e25fe0 <+176>:   jmp    0x7ffff7e25f74

(gdb) info registers


rax            0x0                 0
rbx            0x80eefa3fe0        553765191648
rcx            0x7ffff7b20733      140737349027635
rdx            0xffffffffffffff00  -256
rsi            0x1000              4096
rdi            0x0                 0
rbp            0x7fffffff8e60      0x7fffffff8e60
rsp            0x7fffffff8e30      0x7fffffff8e30
r8             0xfffffff9          4294967289
r9             0x0                 0
r10            0x1                 1
r11            0x246               582
r12            0x7fffffff8e88      140737488326280
r13            0x5                 5
r14            0x1                 1
r15            0x1                 1
rip            0x7ffff7e25fd6      0x7ffff7e25fd6 
eflags         0x10246             [ PF ZF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0

node-segfault-handler log, read after the same gdb execution:


-----[ Native Stacktraces ]-----
[pc=0x00007ffff7fb569e, sp=0x00007fffffff8860] in segfault_handler(int)+0x4e
[pc=0x00007ffff7a5af90, sp=0x00007fffffff8880] in __sigaction+0x40
[pc=0x00007ffff7e25fd6, sp=0x00007fffffff8e30] in ms_initialize+0xa6
[pc=0x00007ffff7e25b1b, sp=0x00007fffffff8e70] in init+0xab
[pc=0x0000000000b10d7d, sp=0x00007fffffff8ed0] in v8impl::(anonymous namespace)::FunctionCallbackWrapper::Invoke(v8::FunctionCallbackInfo const&)+0x7d
[pc=0x0000000000db0230, sp=0x00007fffffff8f30] in v8::internal::MaybeHandle v8::internal::(anonymous namespace)::HandleApiCallHelper(v8::internal::Isolate*, v8::internal::Handle, v8::internal::Handle, v8::internal::Handle, v8::internal::Handle, v8::internal::BuiltinArguments)+0x380
[pc=0x0000000000db176f, sp=0x00007fffffff9070] in v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*)+0xaf
[pc=0x00000000016ef579, sp=0x00007fffffff90e0] in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_BuiltinExit+0x39
[pc=0x00000000016734d0, sp=0x00007fffffff90f0] in Builtins_InterpreterEntryTrampoline+0xd0
[pc=0x0000000001670e62, sp=0x00007fffffff9100] in Builtins_JSConstructStubGeneric+0x122
[pc=0x000000000178d3ff, sp=0x00007fffffff9110] in Builtins_ConstructHandler+0x2bf
[pc=0x00000000016734d0, sp=0x00007fffffff9120] in Builtins_InterpreterEntryTrampoline+0xd0
[pc=0x00000000016734d0, sp=0x00007fffffff9130] in Builtins_InterpreterEntryTrampoline+0xd0
[pc=0x00000000016734d0, sp=0x00007fffffff9140] in Builtins_InterpreterEntryTrampoline+0xd0
[pc=0x0000000001670e62, sp=0x00007fffffff9150] in Builtins_JSConstructStubGeneric+0x122
[pc=0x000000000178d3ff, sp=0x00007fffffff9160] in Builtins_ConstructHandler+0x2bf
[pc=0x00000000016734d0, sp=0x00007fffffff9170] in Builtins_InterpreterEntryTrampoline+0xd0
[pc=0x00000000016f96d7, sp=0x00007fffffff9180] in Builtins_AsyncModuleEvaluate+0x97
[pc=0x0000000001671adc, sp=0x00007fffffff9190] in Builtins_JSEntryTrampoline+0x5c
[pc=0x0000000001671803, sp=0x00007fffffff91a0] in Builtins_JSEntry+0x83
[pc=0x0000000000e9293e, sp=0x00007fffffff91b0] in v8::internal::(anonymous namespace)::Invoke(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&)+0x13e
[pc=0x0000000000e93a10, sp=0x00007fffffff9a50] in v8::internal::(anonymous namespace)::InvokeWithTryCatch(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&) [clone .constprop.84]+0x50
[pc=0x0000000000e94012, sp=0x00007fffffff9ab0] in v8::internal::Execution::TryCall(v8::internal::Isolate*, v8::internal::Handle, v8::internal::Handle, int, v8::internal::Handle*, v8::internal::Execution::MessageHandling, v8::internal::MaybeHandle*, bool)+0x62
[pc=0x00000000011c3611, sp=0x00007fffffff9b40] in v8::internal::SourceTextModule::InnerExecuteAsyncModule(v8::internal::Isolate*, v8::internal::Handle, v8::internal::Handle)+0xd1
[pc=0x00000000011c38df, sp=0x00007fffffff9ba0] in v8::internal::SourceTextModule::ExecuteAsyncModule(v8::internal::Isolate*, v8::internal::Handle)+0x1af
[pc=0x00000000011c5785, sp=0x00007fffffff9c10] in v8::internal::SourceTextModule::AsyncModuleExecutionFulfilled(v8::internal::Isolate*, v8::internal::Handle)+0x135
[pc=0x0000000000dbdc2b, sp=0x00007fffffff9ce0] in v8::internal::Builtin_CallAsyncModuleFulfilled(int, unsigned long*, v8::internal::Isolate*)+0x3b
[pc=0x00000000016ef579, sp=0x00007fffffff9d20] in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_BuiltinExit+0x39
[pc=0x000000000173fa71, sp=0x00007fffffff9d30] in Builtins_PromiseFulfillReactionJob+0x31
[pc=0x000000000169893b, sp=0x00007fffffff9d40] in Builtins_RunMicrotasks+0x27b
[pc=0x0000000001671a03, sp=0x00007fffffff9d50] in Builtins_JSRunMicrotasksEntry+0x83
[pc=0x0000000000e92e0a, sp=0x00007fffffff9d60] in v8::internal::(anonymous namespace)::Invoke(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&)+0x60a
[pc=0x0000000000e93a10, sp=0x00007fffffff9fb0] in v8::internal::(anonymous namespace)::InvokeWithTryCatch(v8::internal::Isolate*, v8::internal::(anonymous namespace)::InvokeParams const&) [clone .constprop.84]+0x50
[pc=0x0000000000e9411a, sp=0x00007fffffffa010] in v8::internal::Execution::TryRunMicrotasks(v8::internal::Isolate*, v8::internal::MicrotaskQueue*, v8::internal::MaybeHandle*)+0x5a
[pc=0x0000000000ec161a, sp=0x00007fffffffa080] in v8::internal::MicrotaskQueue::RunMicrotasks(v8::internal::Isolate*) [clone .part.50]+0x8a
[pc=0x0000000000ec19f2, sp=0x00007fffffffa130] in v8::internal::MicrotaskQueue::PerformCheckpoint(v8::Isolate*)+0x42
[pc=0x0000000000aaec29, sp=0x00007fffffffa160] in node::InternalCallbackScope::Close()+0x109
[pc=0x0000000000aaee01, sp=0x00007fffffffa1b0] in node::InternalCallbackScope::~InternalCallbackScope()+0x11
[pc=0x0000000000b70e5b, sp=0x00007fffffffa1d0] in node::fs::FileHandle::CloseReq::Resolve()+0x9b
[pc=0x0000000000b73420, sp=0x00007fffffffa250] in node::fs::FileHandle::ClosePromise()::{lambda(uv_fs_s*)#1}::_FUN(uv_fs_s*)+0x230
[pc=0x0000000000b681f4, sp=0x00007fffffffa2d0] in node::MakeLibuvRequestCallback::Wrapper(uv_fs_s*)+0x54
[pc=0x000000000164e11d, sp=0x00007fffffffa300] in uv__work_done+0x9d
[pc=0x0000000001652906, sp=0x00007fffffffa350] in uv__async_io.part.1+0x126
[pc=0x0000000001664e44, sp=0x00007fffffffa7a0] in uv__io_poll+0x494
[pc=0x000000000165326e, sp=0x00007fffffffd8b0] in uv_run+0x14e
[pc=0x0000000000aafa2d, sp=0x00007fffffffd910] in node::SpinEventLoop(node::Environment*)+0x14d
[pc=0x0000000000bb11f4, sp=0x00007fffffffd9c0] in node::NodeMainInstance::Run()+0xf4
[pc=0x0000000000b26c44, sp=0x00007fffffffda50] in node::LoadSnapshotDataAndRun(node::SnapshotData const**, node::InitializationResult const*)+0xb4
[pc=0x0000000000b2a83f, sp=0x00007fffffffdb00] in node::Start(int, char**)+0x2df
[pc=0x00007ffff7a4618a, sp=0x00007fffffffdb70] in __libc_init_first+0x8a
[pc=0x00007ffff7a46245, sp=0x00007fffffffdc10] in __libc_start_main+0x85
[pc=0x0000000000aad7ee, sp=0x00007fffffffdc60] in _start+0x2e

---[ V8 JavaScript Stacktraces ]--- at Measuresuite (file:///home/andreser/CryptOpt/node_modules/measuresuite/ts/dist/index.js:62:16) at createMS (file:///home/andreser/CryptOpt/dist/optimizer.helper.class-13535478.js:4108:23) at init (file:///home/andreser/CryptOpt/dist/optimizer.helper.class-13535478.js:4129:12) at Optimizer (file:///home/andreser/CryptOpt/dist/CryptOpt.js:4535:46) at (null) (file:///home/andreser/CryptOpt/dist/CryptOpt.js:4856:20)

$ node --version
v18.13.0

Feature request: output C with intrinsics

It'd be neat if we could have CryptOpt do optimization on assembly as it does, but then print the scheduled and register-allocated code as C to see how much worse C compilers do when given the easiest possible task. The code generated in this manner might also be preferrable to raw fiat-crypto output in cases a C implementation needs to be deployed. I don't think we should make any effort to use standard intrinsics, rather let's do something like fiat-crypto where we just define C functions that compute the same values as supported assembly instructions.

Transient build error (missing makefile dependencies?)

On a clean clone I got the following:

andreser@andreser ~/CryptOpt % make -j all
mkdir -p ./bins
Installing dependencies
curl -L https://nodejs.org/dist/v20.5.1/node-v20.5.1-linux-x64.tar.xz | tar --extract --xz --directory ./bins
cd src/bridge/fiat-bridge/data && sha256sum unsaturated_solinas word_by_word_montgomery dettman_multiplication solinas_reduction > sha256sums
/bin/sh: line 1: npm: command not found
make: *** [Makefile:37: node_modules] Error 127
make: *** Waiting for unfinished jobs....
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 23.9M  100 23.9M    0     0  12.1M      0  0:00:01  0:00:01 --:--:-- 12.1M
mv -f ./bins/node-v20.5.1-linux-x64 "/usr/local/google/home/andreser/CryptOpt/bins/node"

I was following https://github.com/0xADE1A1DE/CryptOpt/blob/main/INSTALL.md#bare-metal

emit .cfi directives

See here and #144

update copyright

Feature request: option to maintain or preserve rbp

rbp-based stack unwinding is a requirement for stack-sampling profiling in a security-conscious production deployments where debug information cannot be evaluated during sampling, e.g. Google-wide profiling.

I imagine it would be pretty easy to just get CryptOpt to entirely ignore the rbp register by ripping it out from the list of registers. It'd be nice to add a command-line flag that achieves this behavior. Perhaps we could also have an option to update rbp as in -fno-omit-frame-pointer so time spent in CryptOpt-generated routines does not get misattributed during profiling.

Again, I'd be happy to do the work given the go-ahead and high-level guidance.

Feature request: --no-redzone should avoid using the red zone

when using the red zone, the stack is shifted by 0x80. With this switch the stack will start at 0x0 (needed non-unix for example)

optimizing register allocation

CryptOpt code contains more spills than comparable handwritten code, and sometimes more than compiler-generated code. It is not clear whether this is important for performance, but I do have an implementation where CryptOpt is not winning over naive handwritten assembly and that exhibits this pattern.

What if register allocation first identified uses that can be unspilled using x86 memory operands, "allocated" temporaties for which all uses are of this form, and then allocated the remaining variables into registers based on the remaining uses. In particular, I believe this generic heuristic would do a decent job on the adx-multiplication pattern of mulx a0 b0; mulx a0 b1; mulx b0 b2 ... ; mulx a1 b0; ...: all b can stay in memory and only one a needs to be live at a time. This suggestion is based on the hypothesis that memory-operand access is cheaper than a separate load even if that load is amortized over multiple instructions; I am not sure whether this is always true.

Another guess for a tweak would be to consider would be spilling and restoring callee-saved registers using push/pop at the very beginning and end of the function.

Have you tried something like these options already?

Hardware is incompatible for (x84-64) Intel i5-8257U

Hello,

I followed the Docker installation step.
It identified that my hardware is not incompatible with the current requirement (which I believe it is).

Result with: ./CryptOpt

Then, I worked around by directly executing .dist/CryptOpt.js by node. It identified illegal hardware instruction.

My Hardware:

Mackbook Pro 2019
CPU: Intel(R) Core(TM) i5-8257U CPU @ 1.40GHz (x84-64) - Coffee Lake
Memory: 8GB

Note:

Everything ran in Docker image shell.
I did modify cycle goal, evaluations round, batch size (in the way requiring as less resources as possible). Nothing work.

Error: tried to spill OF, but didnt work. TSNH.

p256sqr2.zip

./CryptOpt --bridge manual --cFile ./p256sqr2.c --jsonFile ./p256sqr2.json --seed 2

Start on brg/symbolname>>manual/p256_sqr<< >>without proofing correct<< on cpu >>Intel(R) Xeon(R) CPU @ 2.80GHz<< writing results to>>/home/andreser/CryptOpt/results<< with seed >>4156876406132912<< for >>  200<< evaluations against CC>>gcc -march=native -mtune=native -O3<< with cycle goal>>10000<< for each measurement on host>>andreser<< with pid>>3079059<< using counter>>RDTSCP<< framePointer=>>omit<< memoryConstraints>>none<< starting @>>2023-09-11T19:23:01.677Z<<
{
  curOperation: {
    datatype: 'u64',
    name: [ 'x59', '_' ],
    operation: 'addcarryx',
    arguments: [ 'x58', 'x38', 'x40' ],
    decisions: {
      di_choose_arg: [Array],
      di_spill_location: [Array],
      di_flag: [Array],
      di_handle_flags_kk: [Array],
      di_choose_imm: [Array]
    },
    decisionsHot: []
  },
  e: Error: tried to spill OF, but didnt work. TSNH.
      at Jt (file:///home/andreser/CryptOpt/dist/CryptOpt.js:1:24964)
      at pe (file:///home/andreser/CryptOpt/dist/CryptOpt.js:1:37026)
      at $e (file:///home/andreser/CryptOpt/dist/CryptOpt.js:1:33719)
      at Ae (file:///home/andreser/CryptOpt/dist/CryptOpt.js:1:51131)
      at file:///home/andreser/CryptOpt/dist/CryptOpt.js:1:61683
      at Timeout._onTimeout (file:///home/andreser/CryptOpt/dist/CryptOpt.js:1:62312)
      at listOnTimeout (node:internal/timers:573:17)
      at process.processTimers (node:internal/timers:514:7),
  allocs: {
    '0x100000000': { datatype: 'u64', store: 'r10' },
    arg1: { datatype: 'u64[4]', store: 'rsi' },
    'calSv-r12': { datatype: 'u64', store: '[ rsp - 0x70 ]' },
    'calSv-r13': { datatype: 'u64', store: '[ rsp - 0x68 ]' },
    'calSv-r14': { datatype: 'u64', store: '[ rsp - 0x60 ]' },
    'calSv-r15': { datatype: 'u64', store: '[ rsp - 0x58 ]' },
    'calSv-rbp': { datatype: 'u64', store: '[ rsp - 0x78 ]' },
    'calSv-rbx': { datatype: 'u64', store: '[ rsp - 0x80 ]' },
    out1: { datatype: 'u64[4]', store: '[ rsp - 0x50 ]' },
    x106: { datatype: 'u64', store: 'r15' },
    x107: { datatype: 'u64', store: '[ rsp - 0x10 ]' },
    x22: { datatype: 'u64', store: '[ rsp - 0x18 ]' },
    x38: { datatype: 'u1', store: 'OF' },
    x39: { datatype: 'u64', store: '[ rsp - 0x40 ]' },
    x40: { datatype: 'u64', store: '[ rsp - 0x48 ]' },
    x41: { datatype: 'u64', store: '[ rsp - 0x20 ]' },
    x42: { datatype: 'u64', store: '[ rsp - 0x38 ]' },
    x51: { datatype: 'u64', store: 'rcx' },
    x53: { datatype: 'u64', store: 'r9' },
    x55: { datatype: 'u64', store: 'r12' },
    x57: { datatype: 'u64', store: 'r11' },
    x58: { datatype: 'u1', store: 'CF' },
    x66: { datatype: 'u64', store: 'rbp' },
    x68: { datatype: 'u1', store: 'dil' },
    x69: { datatype: 'u64', store: 'rdx' },
    x70: { datatype: 'u1', store: 'r8b' },
    x71: { datatype: 'u64', store: '[ rsp - 0x30 ]' },
    x72: { datatype: 'u64', store: '[ rsp - 0x28 ]' },
    x77: { datatype: 'u64', store: 'rbx' },
    x78: { datatype: 'u64', store: 'r13' },
    x96: { datatype: 'u64', store: 'r14' },
    x97: { datatype: 'u64', store: 'rax' }
  },
  pres: [
    '',
    ';should save OF(x38) but as it has not dependents, we just ignore it.'
  ],
  failfile: '/home/andreser/CryptOpt/results/lastFail.asm'
}
{
  curOperation: {
    datatype: 'u64',
    name: [ 'x59', '_' ],
    operation: 'addcarryx',
    arguments: [ 'x58', 'x38', 'x40' ],
    decisions: {
      di_choose_arg: [Array],
      di_spill_location: [Array],
      di_flag: [Array],
      di_handle_flags_kk: [Array],
      di_choose_imm: [Array]
    },
    decisionsHot: []
  },
  e: Error: tried to spill OF, but didnt work. TSNH.
      at Jt (file:///home/andreser/CryptOpt/dist/CryptOpt.js:1:24964)
      at pe (file:///home/andreser/CryptOpt/dist/CryptOpt.js:1:37026)
      at $e (file:///home/andreser/CryptOpt/dist/CryptOpt.js:1:33719)
      at Ae (file:///home/andreser/CryptOpt/dist/CryptOpt.js:1:51131)
      at file:///home/andreser/CryptOpt/dist/CryptOpt.js:1:61683
      at Timeout._onTimeout (file:///home/andreser/CryptOpt/dist/CryptOpt.js:1:62312)
      at listOnTimeout (node:internal/timers:573:17)
      at process.processTimers (node:internal/timers:514:7),
  allocs: {
    '0x100000000': { datatype: 'u64', store: 'r10' },
    arg1: { datatype: 'u64[4]', store: 'rsi' },
    'calSv-r12': { datatype: 'u64', store: '[ rsp - 0x70 ]' },
    'calSv-r13': { datatype: 'u64', store: '[ rsp - 0x68 ]' },
    'calSv-r14': { datatype: 'u64', store: '[ rsp - 0x60 ]' },
    'calSv-r15': { datatype: 'u64', store: '[ rsp - 0x58 ]' },
    'calSv-rbp': { datatype: 'u64', store: '[ rsp - 0x78 ]' },
    'calSv-rbx': { datatype: 'u64', store: '[ rsp - 0x80 ]' },
    out1: { datatype: 'u64[4]', store: '[ rsp - 0x50 ]' },
    x106: { datatype: 'u64', store: 'r15' },
    x107: { datatype: 'u64', store: '[ rsp - 0x10 ]' },
    x22: { datatype: 'u64', store: '[ rsp - 0x18 ]' },
    x38: { datatype: 'u1', store: 'OF' },
    x39: { datatype: 'u64', store: '[ rsp - 0x40 ]' },
    x40: { datatype: 'u64', store: '[ rsp - 0x48 ]' },
    x41: { datatype: 'u64', store: '[ rsp - 0x20 ]' },
    x42: { datatype: 'u64', store: '[ rsp - 0x38 ]' },
    x51: { datatype: 'u64', store: 'rcx' },
    x53: { datatype: 'u64', store: 'r9' },
    x55: { datatype: 'u64', store: 'r12' },
    x57: { datatype: 'u64', store: 'r11' },
    x58: { datatype: 'u1', store: 'CF' },
    x66: { datatype: 'u64', store: 'rbp' },
    x68: { datatype: 'u1', store: 'dil' },
    x69: { datatype: 'u64', store: 'rdx' },
    x70: { datatype: 'u1', store: 'r8b' },
    x71: { datatype: 'u64', store: '[ rsp - 0x30 ]' },
    x72: { datatype: 'u64', store: '[ rsp - 0x28 ]' },
    x77: { datatype: 'u64', store: 'rbx' },
    x78: { datatype: 'u64', store: 'r13' },
    x96: { datatype: 'u64', store: 'r14' },
    x97: { datatype: 'u64', store: 'rax' }
  },
  pres: [
    '',
    ';should save OF(x38) but as it has not dependents, we just ignore it.'
  ],
  failfile: '/home/andreser/CryptOpt/results/lastFail.asm'
}

Done with code: 1 (statefile: /home/andreser/CryptOpt/results/manual/p256_sqr/seed0000000000000002.json)

Wrote RES/manual/p256_sqr/seed0000000000000002.json exiting.

It is possible that the input is silly, I haven't proven anything about it yet.

Feature requrest: enforce memory contraints

re bitcoin-core/secp256k1#1329 (comment)

r and a may point to the same object, but neither can be equal to b. (...)
*/
static void secp256k1_fe_mul(secp256k1_fe *r, const secp256k1_fe *a, const secp256k1_fe *b);

The idea is that we could have three enforcement levels:

do not care at all (current default). This assumes, that all memory is distinct. I.e read a[0] after writing r[0] is fine.
level 2: assume that r and a can be equal, but must be aligned. then, writing to r[0] and subsequently reading from a[1] is valid. Reading a[0] after writing r[0] would be invalid.
level 3: do not read any memory after writing any. (except for stack). I.e. there cannot be any reads after the first write, expect if the read is from the stack.

What is going on when further optimization appears to make the program slower?

A cartoon image of CryptOpt I've had is that it rejects mutations that make the program slower. I know that this is idealistic, even the optimization trace in the paper goes down for a bit. But how is it that continuing to try more mutations seems to have a real chance of making the cycle counts in the CryptOpt output go up? I understand that there's some chance that a mutation will misleadingly appear attractive due to measurement noise, but looking at the above log I can tell at a glance that a previous program would likely perform better. What is going on here -- is the theory that something changes about the machine to make both versions run slower, or does the wrong one just get picked sometimes?