slothy-optimizer / slothy
Assembly super-optimization via constraint solving
Home Page: https://slothy-optimizer.github.io/slothy/
License: Other
Background: Address offset fixup is an important feature for software pipelining: it allows ldr/str instructions to be reordered with increments operating on the same address register. Normally, SLOTHY would not allow such reordering because it breaks data dependencies, but commutativity relations such as
ldr _, [x0]; str _, [x0], #imm == str _, [x0], #imm; ldr _, [x0, #-imm]
make it possible, given a suitable 'address offset fixup' -- hence the name.
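The relation above can be checked on a toy model. The following is a minimal sketch, not SLOTHY code: the `run` helper and the tuple encoding of instructions are assumptions made for illustration. It tracks only effective addresses and the final value of the base register, not the data held in memory.

```python
# Toy model: each access is (mnemonic, immediate offset, post-increment).
# Only addresses and the final base register are tracked, not memory contents.

def run(seq, base):
    """Return the list of (mnemonic, effective address) and the final base."""
    trace = []
    for mnemonic, offset, post_inc in seq:
        trace.append((mnemonic, base + offset))
        base += post_inc  # post-index writeback happens after the access
    return trace, base

IMM = 16

# ldr _, [x0]; str _, [x0], #imm
original = [("ldr", 0, 0), ("str", 0, IMM)]
# str _, [x0], #imm; ldr _, [x0, #-imm]   (load offset fixed up by -imm)
reordered = [("str", 0, IMM), ("ldr", -IMM, 0)]

trace_a, final_a = run(original, 0x1000)
trace_b, final_b = run(reordered, 0x1000)

# Same set of accesses and same final x0, regardless of order.
same_accesses = sorted(trace_a) == sorted(trace_b)
same_final_base = final_a == final_b
print(same_accesses, same_final_base)
```

Dropping the `-IMM` fixup in `reordered` makes the load hit `x0 + imm` instead, which is the class of bug the selfcheck currently cannot see.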
Issue: While important, address offset fixup is still somewhat experimental and poorly documented. What's more, SLOTHY's selfcheck is not aware of it and cannot catch bugs it may still have. To facilitate debugging, address offset fixup should be behind a configuration option to begin with.
Acceptance criteria:
A configuration option config.address_offset_fixup controlling the use of address offset fixup.

I wonder whether it is possible to use -c reserved_regs=[..]
to avoid assignment to v8-v15 registers.
Only the low halves of these registers must be preserved by the callee, but avoiding v8-v15 entirely is still fine in my case because my input programs use few SIMD registers.
Used script:
OUTPUTS="[hint_buffer0]"
RESERVED_REGS="[x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,sp,q8,q9,q10,q11,q12,q13,q14,q15,v8,v9,v10,v11,v12,v13,v14,v15]"
${SLOTHY_DIR}/slothy-cli Arm_AArch64 Arm_Neoverse_N1_experimental \
a.s \
-o a_opt.s \
-c outputs=$OUTPUTS -c reserved_regs=$RESERVED_REGS \
$REDIRECT_OUTPUT
Input:
ldr q27, [x1] // @slothy:reads=buffer0
umull v8.2D, v27.2S, v27.2S
mov x6, v8.d[0]
mov x12, v8.d[1]
stp x6, x12, [x1] // @slothy:writes=buffer0
The output program still uses v8:
ldr q5, [x1] //(omitting the plots here)
umull v8.2D, v5.2S, v5.2S << v8 is used
...
I tweaked SLOTHY a bit to print all input logs to info, and I could see this diagnostic message which does exclude v8-v15:
INFO:slothy-cli.split.0_68:Registers available for renaming
INFO:slothy-cli.split.0_68:- GPR available: ['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17']
INFO:slothy-cli.split.0_68:- NEON available: ['v0', 'v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v16', 'v17', 'v18', 'v19', 'v20', 'v21', 'v22', 'v23', 'v24', 'v25', 'v26', 'v27', 'v28', 'v29', 'v30', 'v31'] <----- This correctly excludes v8 ~ v15..!
... which suggests that something downstream is dropping the renaming constraint.
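To check an optimized file for such leaks mechanically, a small scan along these lines may help. This is a hypothetical helper, not part of SLOTHY: the function name and the regular expression are my own, and it assumes `//` comments and the usual v/q/d/s/h/b register views.

```python
import re

# Matches any view (v, q, d, s, h, b) of SIMD registers 8..15.
RESERVED_SIMD = re.compile(r"\b[vqdshb](?:[89]|1[0-5])\b")

def find_reserved_uses(asm_text):
    """Return [(line number, [register names])] for lines using v8-v15."""
    hits = []
    for lineno, line in enumerate(asm_text.splitlines(), start=1):
        code = line.split("//", 1)[0]  # ignore trailing comments
        regs = RESERVED_SIMD.findall(code)
        if regs:
            hits.append((lineno, regs))
    return hits

sample = """\
ldr q5, [x1] // comment mentioning v8 is ignored
umull v8.2D, v5.2S, v5.2S
mov x6, v8.d[0]
"""

for lineno, regs in find_reserved_uses(sample):
    print(lineno, regs)
```

Running it over a_opt.s after each SLOTHY invocation would flag any line where a reserved SIMD register slipped through.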
I was trying to put together an example showing how to use the splitting heuristic on the X25519 example.
I'm not trying to have optimal code (for that we of course have https://github.com/slothy-optimizer/slothy/blob/main/paper/scripts/slothy_x25519.sh), but just trying to have something that terminates in a reasonable time on a laptop and demonstrates how the splitting heuristic works.
I have the following code (to be placed in a Python script in the root dir):
import logging
import sys
from slothy import Slothy
import slothy.targets.aarch64.aarch64_neon as AArch64_Neon
import slothy.targets.aarch64.cortex_a55 as Target_CortexA55
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
arch = AArch64_Neon
target = Target_CortexA55
slothy = Slothy(arch, target)
# example
slothy.load_source_from_file("examples/naive/aarch64/X25519-AArch64-simple.s")
slothy.config.inputs_are_outputs=True
slothy.config.outputs=["x0"]
slothy.config.variable_size=True
slothy.config.constraints.stalls_first_attempt=32
slothy.config.split_heuristic = True
slothy.config.split_heuristic_stepsize = 0.1
slothy.config.split_heuristic_factor = 10
slothy.config.split_heuristic_repeat = 2
slothy.optimize(start="mainloop", end="end_label")
slothy.write_source_to_file("test.s")
Running this fails almost immediately with
INFO:slothy:Instructions in body: 958
INFO:slothy.mainloop.split.0_96:Perform internal binary search for minimal number of stalls...
INFO:slothy.mainloop.split.0_96:Attempt optimization with max 32 stalls...
INFO:slothy.mainloop.split.0_96:Objective: minimize number of stalls
INFO:slothy.mainloop.split.0_96:Invoking external constraint solver (OR-Tools CP-SAT v9.7.2996) ...
INFO:slothy.mainloop.split.0_96:[4.5895s]: Found 1 solutions so far... objective 12.0, bound 6.0 (minimize number of stalls)
INFO:slothy.mainloop.split.0_96:OPTIMAL, wall time: 6.632225 s
INFO:slothy.mainloop.split.0_96:Booleans in result: 8991
INFO:slothy.mainloop.split.0_96.selfcheck:OK!
INFO:slothy.mainloop.split.0_96:Minimum number of stalls: 12
INFO:slothy.mainloop.split.96_192:Perform internal binary search for minimal number of stalls...
INFO:slothy.mainloop.split.96_192:Attempt optimization with max 32 stalls...
ERROR:slothy.mainloop.split.96_192:Ran out of registers trying to _statically_ assign architectural registers
for input and outputs of the snippet. You should consider enabling register
renaming for input and output, by setting config.rename_{inputs,outputs}.
Alternatively, you may fix input and output registers by hand yourself.
Traceback (most recent call last):
File "/home/mjk/git/slothy-tutorial/tst.py", line 34, in <module>
slothy.optimize(start="mainloop", end="end_label")
File "/home/mjk/git/slothy-tutorial/slothy/core/slothy.py", line 249, in optimize
early, core, late, num_exceptional = Heuristics.periodic(body, logger, c)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mjk/git/slothy-tutorial/slothy/core/heuristics.py", line 310, in periodic
res = Heuristics.linear( body, logger=logger, conf=conf)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mjk/git/slothy-tutorial/slothy/core/heuristics.py", line 387, in linear
return Heuristics._split(body, logger, conf)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mjk/git/slothy-tutorial/slothy/core/heuristics.py", line 782, in _split
return Heuristics._split_inner(body, logger, c)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mjk/git/slothy-tutorial/slothy/core/heuristics.py", line 723, in _split_inner
cur_body, stalls, _ = optimize_chunks_many(idx_lst, cur_body, stalls,show_stalls=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mjk/git/slothy-tutorial/slothy/core/heuristics.py", line 683, in optimize_chunks_many
body, stalls, cur_stalls, local_perm = optimize_chunk(start_idx, end_idx, body,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mjk/git/slothy-tutorial/slothy/core/heuristics.py", line 659, in optimize_chunk
result = Heuristics.optimize_binsearch(cur_body,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mjk/git/slothy-tutorial/slothy/core/heuristics.py", line 132, in optimize_binsearch
return Heuristics.optimize_binsearch_internal(source, logger, conf, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mjk/git/slothy-tutorial/slothy/core/heuristics.py", line 234, in optimize_binsearch_internal
success = core.optimize(source, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mjk/git/slothy-tutorial/slothy/core/core.py", line 1040, in optimize
self._load_source(source, prefix_len=prefix_len, suffix_len=suffix_len)
File "/home/mjk/git/slothy-tutorial/slothy/core/core.py", line 1164, in _load_source
self._restrict_input_output_renaming()
File "/home/mjk/git/slothy-tutorial/slothy/core/core.py", line 1288, in _restrict_input_output_renaming
raise e
File "/home/mjk/git/slothy-tutorial/slothy/core/core.py", line 1281, in _restrict_input_output_renaming
v.reg = get_fresh_renaming_reg(v.ty)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mjk/git/slothy-tutorial/slothy/core/core.py", line 1273, in get_fresh_renaming_reg
raise OutOfRegisters
slothy.core.core.SlothyBase._restrict_input_output_renaming.<locals>.OutOfRegisters
If I instead do two calls to optimize to first resolve the symbolic variable names (in the way the other script does it), it works fine:
import logging
import sys
from slothy import Slothy
import slothy.targets.aarch64.aarch64_neon as AArch64_Neon
import slothy.targets.aarch64.cortex_a55 as Target_CortexA55
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
arch = AArch64_Neon
target = Target_CortexA55
slothy = Slothy(arch, target)
# example
slothy.load_source_from_file("examples/naive/aarch64/X25519-AArch64-simple.s")
slothy.config.inputs_are_outputs=True
slothy.config.outputs=["x0"]
slothy.config.constraints.functional_only = True
slothy.config.constraints.allow_reordering = False
slothy.optimize(start="mainloop", end="end_label")
slothy.config.constraints.functional_only = False
slothy.config.constraints.allow_reordering = True
slothy.config.variable_size=True
slothy.config.constraints.stalls_first_attempt=32
slothy.config.split_heuristic = True
slothy.config.split_heuristic_stepsize = 0.1
slothy.config.split_heuristic_factor = 10
slothy.config.split_heuristic_repeat = 2
slothy.optimize(start="mainloop", end="end_label")
slothy.write_source_to_file("test.s")
I'm not sure if this is a bug, or if I misunderstand how SLOTHY handles symbolic register names.
@hanno-becker, any ideas?
I ran into trouble building SLOTHY.
I get the following CMake error when I run cmake -S. -Bbuild -DBUILD_PYTHON:BOOL=ON
CMake Error at /usr/local/share/cmake-3.28/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
Could NOT find SWIG (missing: SWIG_EXECUTABLE SWIG_DIR)
Call Stack (most recent call first):
/usr/local/share/cmake-3.28/Modules/FindPackageHandleStandardArgs.cmake:600 (_FPHSA_FAILURE_MESSAGE)
/usr/local/share/cmake-3.28/Modules/FindSWIG.cmake:153 (find_package_handle_standard_args)
cmake/python.cmake:27 (find_package)
CMakeLists.txt:333 (include)
I use cmake version 3.28.0-rc5 and GNU Make 4.2.1.
How can I solve this?
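The message says CMake could not locate a SWIG executable (SWIG_EXECUTABLE and SWIG_DIR are both unset). As a first diagnostic, and only as a sketch, one can check whether a `swig` binary is on PATH at all. The `find_tool` helper is hypothetical; it relies only on the standard library's `shutil.which`.

```python
import shutil

def find_tool(name):
    """Return the full path of an executable on PATH, or None if absent."""
    return shutil.which(name)

swig_path = find_tool("swig")
if swig_path is None:
    print("swig not found on PATH; CMake's FindSWIG will likely fail too")
else:
    print("swig found at", swig_path)
```

If the binary is missing, installing SWIG through the system package manager and re-running the cmake command is the usual next step. If it is installed in a non-standard location, passing -DSWIG_EXECUTABLE=<path> to cmake is the variable the error message refers to.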