
uarch-bench

A collection of low level, fine-grained benchmarks intended to investigate micro-architectural details of a target CPU, or to precisely benchmark small functions in a repeatable manner.

Disclaimer

This project is very much a work-in-progress, and is currently in a very early state with limited documentation and testing. Pull requests and issues welcome.

Purpose

The uarch-bench project is a collection of micro-benchmarks that try to stress certain micro-architectural features of modern CPUs, plus a framework for writing such benchmarks. Using libpfc, you can accurately track the value of Intel performance counters across the benchmarked region - often with single-cycle precision.

At the moment it supports only x86, using mostly assembly and a few C++ benchmarks. In the future, I'd like to have more C or C++ benchmarks, allowing coverage (in principle) of more platforms (non-x86 assembly level benchmarks are also welcome). Of course, for any non-asm benchmark, it is possible that the compiler makes a transformation that invalidates the intent of the benchmark. You could detect this as a large difference between the C/C++ and assembly scores.

Of course, these have all the pitfalls of any microbenchmark and are not really intended to be a simple measure of the overall performance of any CPU architecture. Rather they are mostly useful to:

  1. Suss out changes between architectures. Often there are changes to a particular micro-architectural feature that can be exposed via targeted benchmarks. For example, you might be able to understand something about the behavior of the store buffer based on tests that exercise store-to-load forwarding.
  2. Understand low-level performance of various approaches to guide implementation of highly-tuned algorithms. For the vast majority of typical development tasks, the very low-level information provided by these benchmarks is essentially useless as performance guidance. For some very specific tasks, such as highly-tuned C or C++ methods or hand-written assembly, it might be useful to characterize, for example, the relative costs of aligned and unaligned accesses.
  3. Satisfy curiosity for those who care about this stuff and to collect the results from various architectures.
  4. Provide a simple, standard way to quickly do one-off tests of some small assembly or C/C++ level idioms. Often the test itself is a few lines of code, but the cost is in all the infrastructure: implementing the timing code, converting measurements to cycles, removing outliers, running the tests for various parameters, and reporting the results. This project aims to implement that infrastructure and make it easy to add your own tests (not complete!).
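To make the last point concrete, here is a minimal sketch of such a harness (all names invented, not uarch-bench's actual implementation): time the function under test, take the median across trials to reject outliers, and convert to cycles using a separately calibrated frequency.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Time one invocation of fn() in nanoseconds.
template <typename F>
double time_once_ns(F fn) {
    auto t0 = std::chrono::steady_clock::now();
    fn();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count();
}

// Run fn() several times and take the median, which rejects outliers
// (e.g., samples that caught an interrupt) better than the mean.
template <typename F>
double median_ns(F fn, int trials = 11) {
    std::vector<double> samples;
    for (int i = 0; i < trials; i++)
        samples.push_back(time_once_ns(fn));
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}

// Convert nanoseconds to cycles given a separately calibrated CPU
// frequency in GHz.
double to_cycles(double ns, double ghz) { return ns * ghz; }
```

Even this toy version shows where the effort goes: almost none of it is the benchmark itself.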

Platform support

Currently only supports x86 Linux, but Windows should arrive at some point, and one could even imagine a world with OSX support.

Prerequisites

You need some C++ compiler like g++ or clang++, but if you are interested in this project, you probably already have that. Beyond that, you need nasm, and perhaps msr-tools on Intel platforms (used as a backup method to disable turbo boost if you aren't using the intel_pstate driver). On Debian-like systems, this should do it:

sudo apt-get install nasm
sudo apt-get install msr-tools

NASM

The minimum required version of nasm is 2.12, for AVX-512 support (strictly speaking, some later versions of nasm 2.11, e.g., nasm-2.11.08, also work). If you don't have nasm installed, a suitable version (on Linux) is used automatically from the included /nasm-binaries directory.

Building

This project has submodules, so it is best cloned with the --recursive flag to pull all the submodules as well:

git clone --recursive https://github.com/travisdowns/uarch-bench

If you've already cloned it without --recursive, this should pull in the submodules:

git submodule update --init

Then just run make in the project directory. If you want to modify any of the make settings, you can do it directly in config.mk or in a newly created local file local.mk (the latter has the advantage that it is ignored by git, so you won't get merge conflicts on later pulls and won't accidentally commit your local build settings).

For more about building, see BUILDING.md.

Running

Ideally, you run ./uarch-bench.sh as root, since this provides the permissions needed to disable frequency scaling, as well as making it possible to use USE_LIBPFC=1 mode. If you don't have root or don't want to run a random project as root, you can also run it as non-root via ./uarch-bench (i.e., without the wrapper shell script), which will still work with some limitations. There is currently an open issue for making non-root use a bit smoother.

With Root

Just run ./uarch-bench.sh after building. The script will generally invoke sudo to prompt you for root credentials in order to disable frequency scaling (either using the no_turbo flag if intel_pstate governor is used, or rdmsr and wrmsr otherwise).

Without Root

You can also run the binary as ./uarch-bench directly, which doesn't require sudo, but frequency scaling won't be automatically disabled in this case (you can still separately disable it prior to running uarch-bench).

Command Line Arguments

Run uarch-bench --help to see a list and brief description of command line arguments.

Frequency Scaling

One key to more reliable measurements (especially with the timing-based counters) is to ensure that there is no frequency scaling going on.

Generally this involves disabling turbo mode (to avoid scaling above nominal) and setting the power saving mode to performance (to avoid scaling below nominal). The uarch-bench.sh script tries to do this, while restoring your previous setting after it completes.
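The decision the script makes can be sketched as follows (a C++ sketch of logic the real script implements in shell; the sysfs path is the intel_pstate case mentioned above, and treating everything else as a wrmsr fallback is an assumption):

```cpp
#include <string>

// Pick the mechanism for disabling turbo: the intel_pstate driver exposes
// a sysfs knob (write "1" to disable turbo); otherwise fall back to
// rdmsr/wrmsr (the msr-tools mentioned in the prerequisites).
std::string turbo_disable_method(const std::string& driver) {
    if (driver == "intel_pstate")
        return "/sys/devices/system/cpu/intel_pstate/no_turbo";
    return "wrmsr";
}
```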

Example Output

$ sudo ./uarch-bench.sh
Driver: intel_pstate, governor: performance
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz
Succesfully disabled turbo boost using intel_pstate/no_turbo
Using timer: clock
Welcome to uarch-bench (caa208f-dirty)
Supported CPU features: SSE3 PCLMULQDQ VMX SMX EST TM2 SSSE3 FMA CX16 SSE4_1 SSE4_2 MOVBE POPCNT AES AVX RDRND TSC_ADJ BMI1 HLE AVX2 BMI2 ERMS RTM RDSEED ADX INTEL_PT
Pinned to CPU 0
Median CPU speed: 2.193 GHz
Running benchmarks groups using timer clock

** Running group basic : Basic Benchmarks **
                               Benchmark    Cycles     Nanos
                     Dependent add chain      1.00      0.46
                   Independent add chain      0.26      0.12
                  Dependent imul 64->128      3.00      1.37
                   Dependent imul 64->64      3.00      1.37
                Independent imul 64->128      1.01      0.46
                    Same location stores      1.00      0.46
                Disjoint location stores      1.00      0.46
                Dependent push/pop chain      5.00      2.28
              Independent push/pop chain      1.00      0.46
         Simple addressing pointer chase      4.00      1.83
        Complex addressing pointer chase      5.01      2.28
Finished in 556 ms (basic)

** Running group memory/load-parallel : Parallel loads from fixed-size regions **
                               Benchmark    Cycles     Nanos
                    16-KiB parallel load      0.53      0.24
                    24-KiB parallel load      0.52      0.24
                    30-KiB parallel load      0.53      0.24
                    31-KiB parallel load      0.53      0.24
                    32-KiB parallel load      0.52      0.24
                    33-KiB parallel load      0.54      0.24
                    34-KiB parallel load      0.55      0.25
                    35-KiB parallel load      0.56      0.26
                    40-KiB parallel load      1.34      0.61
                    48-KiB parallel load      2.01      0.92
                    56-KiB parallel load      2.01      0.92
                    64-KiB parallel load      2.01      0.92
                    80-KiB parallel load      2.06      0.94
                    96-KiB parallel load      2.18      0.99
                   112-KiB parallel load      2.27      1.03
                   128-KiB parallel load      2.24      1.02
                   196-KiB parallel load      2.72      1.24
                   252-KiB parallel load      3.75      1.71
                   256-KiB parallel load      3.68      1.68
                   260-KiB parallel load      0.53      0.24
                   384-KiB parallel load      4.34      1.98
                   512-KiB parallel load      5.19      2.36
                  1024-KiB parallel load      5.64      2.57
                  2048-KiB parallel load      6.16      2.81
Finished in 7050 ms (memory/load-parallel)

** Running group memory/store-parallel : Parallel stores to fixed-size regions **
                               Benchmark    Cycles     Nanos
                   16-KiB parallel store      1.00      0.46
                   24-KiB parallel store      1.00      0.46
                   30-KiB parallel store      1.17      0.53
                   31-KiB parallel store      1.00      0.46
                   32-KiB parallel store      1.00      0.46
                   33-KiB parallel store      1.15      0.52
                   34-KiB parallel store      1.32      0.60
                   35-KiB parallel store      1.29      0.59
                   40-KiB parallel store      4.32      1.97
                   48-KiB parallel store      6.20      2.83
                   56-KiB parallel store      6.23      2.84
                   64-KiB parallel store      6.10      2.78
                   80-KiB parallel store      6.25      2.85
                   96-KiB parallel store      6.24      2.84
                  112-KiB parallel store      6.26      2.85
                  128-KiB parallel store      6.26      2.86
                  196-KiB parallel store      6.36      2.90
                  252-KiB parallel store      6.71      3.06
                  256-KiB parallel store      6.75      3.08
                  260-KiB parallel store      1.01      0.46
                  384-KiB parallel store      7.78      3.55
                  512-KiB parallel store      8.67      3.95
                 1024-KiB parallel store      9.59      4.37
                 2048-KiB parallel store      9.97      4.55
Finished in 14892 ms (memory/store-parallel)

** Running group memory/prefetch-parallel : Parallel prefetches from fixed-size regions **
                               Benchmark    Cycles     Nanos
              16-KiB parallel prefetcht0      0.50      0.23
              16-KiB parallel prefetcht1      0.50      0.23
              16-KiB parallel prefetcht2      0.50      0.23
             16-KiB parallel prefetchnta      0.50      0.23
              32-KiB parallel prefetcht0      0.50      0.23
              32-KiB parallel prefetcht1      1.98      0.90
              32-KiB parallel prefetcht2      1.99      0.91
             32-KiB parallel prefetchnta      0.50      0.23
              64-KiB parallel prefetcht0      2.00      0.91
              64-KiB parallel prefetcht1      1.90      0.86
              64-KiB parallel prefetcht2      2.00      0.91
             64-KiB parallel prefetchnta      5.90      2.69
             128-KiB parallel prefetcht0      2.26      1.03
             128-KiB parallel prefetcht1      2.04      0.93
             128-KiB parallel prefetcht2      2.04      0.93
            128-KiB parallel prefetchnta      5.91      2.69
             256-KiB parallel prefetcht0      3.66      1.67
             256-KiB parallel prefetcht1      3.49      1.59
             256-KiB parallel prefetcht2      3.49      1.59
            256-KiB parallel prefetchnta      5.85      2.67
             512-KiB parallel prefetcht0      5.25      2.39
             512-KiB parallel prefetcht1      4.90      2.23
             512-KiB parallel prefetcht2      4.90      2.24
            512-KiB parallel prefetchnta      5.77      2.63
            2048-KiB parallel prefetcht0      6.22      2.84
            2048-KiB parallel prefetcht1      5.84      2.66
            2048-KiB parallel prefetcht2      5.84      2.66
           2048-KiB parallel prefetchnta      9.43      4.30
            4096-KiB parallel prefetcht0     10.96      5.00
            4096-KiB parallel prefetcht1     10.69      4.87
            4096-KiB parallel prefetcht2     10.74      4.90
           4096-KiB parallel prefetchnta      9.58      4.37
            8192-KiB parallel prefetcht0     16.96      7.73
            8192-KiB parallel prefetcht1     16.44      7.50
            8192-KiB parallel prefetcht2     16.77      7.64
           8192-KiB parallel prefetchnta     12.27      5.59
           32768-KiB parallel prefetcht0     20.60      9.39
           32768-KiB parallel prefetcht1     20.23      9.22
           32768-KiB parallel prefetcht2     20.09      9.16
          32768-KiB parallel prefetchnta     20.22      9.22
Finished in 4492 ms (memory/prefetch-parallel)

** Running group memory/pointer-chase : Pointer-chasing **
                               Benchmark    Cycles     Nanos
  Simple addressing chase, half diffpage      6.51      2.97
Simple addressing chase, different pages      8.49      3.87
     Simple addressing chase with ALU op      6.01      2.74
                   load5 -> load4 -> alu     10.02      4.57
                   load4 -> load5 -> alu     11.03      5.03
        8 parallel simple pointer chases      4.00      1.82
      10 parallel complex pointer chases      5.16      2.35
        10 parallel mixed pointer chases      5.19      2.37
Finished in 916 ms (memory/pointer-chase)

** Running group memory/load-serial : Serial loads from fixed-size regions **
                               Benchmark    Cycles     Nanos
                     16-KiB serial loads      4.00      1.82
                     24-KiB serial loads      4.00      1.82
                     30-KiB serial loads      4.00      1.82
                     31-KiB serial loads      4.00      1.82
                     32-KiB serial loads      4.01      1.83
                     33-KiB serial loads      6.02      2.74
                     34-KiB serial loads      8.01      3.65
                     35-KiB serial loads      9.81      4.47
                     40-KiB serial loads     11.93      5.44
                     48-KiB serial loads     11.96      5.45
                     56-KiB serial loads     11.95      5.45
                     64-KiB serial loads     12.12      5.52
                     80-KiB serial loads     11.98      5.46
                     96-KiB serial loads     11.98      5.46
                    112-KiB serial loads     12.01      5.48
                    128-KiB serial loads     12.00      5.47
                    196-KiB serial loads     15.10      6.88
                    252-KiB serial loads     21.27      9.70
                    256-KiB serial loads     21.11      9.63
                    260-KiB serial loads     20.99      9.57
                    384-KiB serial loads     28.64     13.06
                    512-KiB serial loads     31.71     14.46
                   1024-KiB serial loads     38.92     17.74
                   2048-KiB serial loads     47.21     21.52
Finished in 683 ms (memory/load-serial)

** Running group bmi : BMI false-dependency tests **
                               Benchmark    Cycles     Nanos
                    dest-dependent tzcnt      3.00      1.37
                    dest-dependent lzcnt      3.00      1.37
                   dest-dependent popcnt      3.00      1.37
Finished in 190 ms (bmi)

** Running group studies/vzeroall : VZEROALL weirdness **
                               Benchmark    Cycles     Nanos
                 vpaddq zmm0, zmm0, zmm0 Skipped because hardware doesn't support required features: [AVX512F]
                 vpaddq zmm0, zmm1, zmm0 Skipped because hardware doesn't support required features: [AVX512F]
                vpaddq zmm0, zmm16, zmm0 Skipped because hardware doesn't support required features: [AVX512F]
   vpxor zmm16; vpaddq zmm0, zmm16, zmm0 Skipped because hardware doesn't support required features: [AVX512F]
                 vpaddq ymm0, ymm0, ymm0      1.00      0.46
                 vpaddq ymm0, ymm1, ymm0      1.00      0.46
                 vpaddq xmm0, xmm0, xmm0      1.00      0.46
                 vpaddq xmm0, xmm1, xmm0      1.00      0.46
                        paddq xmm0, xmm0      1.00      0.46
                        paddq xmm0, xmm1      1.00      0.46
Finished in 97 ms (studies/vzeroall)
Reverting no_turbo to 0
Succesfully restored no_turbo state: 0

uarch-bench's People

Contributors

anderspapitto, jinankjain, nemequ, olafhering, travisdowns


uarch-bench's Issues

Disable frequency scaling before launching benchmark

Before we launch the binary we can try to disable turbo mode and power-management related scaling, at least on Linux, by writing to MSRs and such. Then we restore the settings after the benchmark runs so the settings don't persist until the next boot (e.g., wasting someone's battery because we set the governor to performance).

Add arbitrary PMU counter support via libpfm4

When using libpfc mode, and by linking against libpfm, we can add support for specifying whatever counter the user wants via its perf-like string name, e.g., "ld_blocks.store_forward:u".

We can use this functionality both within the benchmark code (e.g., for benchmarks which know which events are interesting) and perhaps at the command line (e.g., when the user wants to specify which additional events they are interested in).

<vector> missing in stats.hpp

diff --git a/stats.hpp b/stats.hpp
index 284f4b11c6ae..9964ac507dcd 100644
--- a/stats.hpp
+++ b/stats.hpp
@@ -11,6 +11,7 @@
 #include <limits>
 #include <iterator>
 #include <functional>
+#include <vector>
 
 namespace Stats {

`rep stos` appearing in benchmarked region

If you take a look at the core region of the innermost method in a benchmark in the libpfc case, you find a rep stos instruction inside the timed region, as follows:

  40792a:       shl    rdx,0x20
  40792e:       or     rdx,rax
  407931:       add    QWORD PTR [rbp+0x28],rdx
  407935:       mov    rcx,0x3
  40793c:       rdpmc  
  40793e:       shl    rdx,0x20
  407942:       or     rdx,rax
  407945:       add    QWORD PTR [rbp+0x30],rdx
  407949:       lfence 
  40794c:       mov    rdi,QWORD PTR [rsp]
  407950:       mov    rsi,QWORD PTR [rsp+0x8]
  407955:       call   47f680 <dep_add_rax_rax>
  40795a:       mov    rdi,rbx
  40795d:       mov    rax,r12
  407960:       mov    ecx,0x7
  407965:       rep stos QWORD PTR es:[rdi],rax    <<< this guy
  407968:       lfence 
  40796b:       mov    rcx,0x40000000
  407972:       rdpmc  
  407974:       shl    rdx,0x20
  407978:       or     rdx,rax
  40797b:       add    QWORD PTR [rbx],rdx
  40797e:       mov    rcx,0x40000001
  407985:       rdpmc  
  407987:       shl    rdx,0x20
  40798b:       or     rdx,rax
  40798e:       add    QWORD PTR [rbx+0x8],rdx
  407992:       mov    rcx,0x40000002
  407999:       rdpmc  
  40799b:       shl    rdx,0x20

The code before and after is issuing rdpmc to read the performance counters, and the actual timed call is dep_add_rax_rax, but the presence of the rep stos is unfortunate, since it's slow, invokes microcode and so on. It's there because of:

struct LibpfcNow {
    PFC_CNT cnt[TOTAL_COUNTERS];
    ...

and

static now_t now() {
        LibpfcNow now = {};

which zero-initializes the counter array. The existing macros either add to (PFC_END, as shown above) or subtract from the array locations, so we require zero-init, since otherwise garbage would be picked up. In principle, though, each array slot is simply replaced with the current value, so this isn't necessary - we could have a new PFC_ macro which just movs in the absolute value.

In principle, the effect is cancelled out by the use of dummy_bench (or any other bench), but it would still be nice to eliminate all unnecessary code in the benchmarked region, especially rep instructions and those which modify memory.
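The zero-initialization can be seen in miniature below (a standalone sketch with illustrative names, not the real LibpfcNow): value-initializing a struct containing a counter array forces the compiler to zero the array, which it may lower to rep stos.

```cpp
#include <cstdint>

constexpr int TOTAL_COUNTERS = 7;

struct NowLike {
    int64_t cnt[TOTAL_COUNTERS];
};

// "= {}" value-initializes, so the compiler must zero all of cnt[] --
// often emitted as rep stos or a memset call for arrays of this size.
NowLike zeroed_now() {
    NowLike now = {};
    return now;
}

// If the counter-reading macro overwrote each slot (mov) instead of
// accumulating into it (add/sub), the zeroing -- and hence the rep stos
// in the timed region -- would be unnecessary.
```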

Add the ability to restrict some benchmarks to CPUs with certain ISA extensions

For example, a benchmark could be tagged with AVX2 if it needs AVX2 to run. This is a lot better than simply crashing when the user runs such a benchmark on an incompatible machine.

Among other things this prevents us from running with --test-name=* on the CI server since it doesn't always have AVX2 and crashes on some test that needs it.
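A minimal sketch of such a tag check, assuming GCC/Clang on x86 (the strings passed to __builtin_cpu_supports must be compile-time literals, hence the explicit dispatch; this is not uarch-bench's actual mechanism):

```cpp
#include <string>

// Return true if the current CPU supports the named feature tag; unknown
// tags are treated as unsupported, so the benchmark is skipped rather
// than crashing on an incompatible machine.
bool cpu_has(const std::string& tag) {
#if defined(__x86_64__) || defined(__i386__)
    if (tag == "avx2")    return __builtin_cpu_supports("avx2");
    if (tag == "avx512f") return __builtin_cpu_supports("avx512f");
#endif
    return false;
}
```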

extra-events column names could use improvement

When the user specifies --extra-events=... we need names for the column headers displaying those events. Currently we just take the first 6 characters of the event name, which sometimes isn't meaningful and often leads to duplicates when you have several related events all with the same first six characters.

We should do better. Perhaps smartly generate the name by taking characters from after the first or last underscore. This could even be dynamic: i.e., check that no abbreviations end up as duplicates based on the selected events and, if there are any, change the name generation so that they aren't.

Alternatives are possible, such as just calling the events EVT1, EVT2, etc, or adding more header rows to allow more characters, etc.
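The "take the tail after the last underscore, then disambiguate" idea can be sketched like this (invented names, one possible policy among those discussed):

```cpp
#include <set>
#include <string>
#include <vector>

// Generate short, unique column headers from event names: take the part
// after the last underscore, truncate to 6 chars, and append a digit
// when that would collide with an earlier header.
std::vector<std::string> make_headers(const std::vector<std::string>& events) {
    std::set<std::string> used;
    std::vector<std::string> out;
    for (const auto& ev : events) {
        size_t us = ev.rfind('_');
        std::string base = (us == std::string::npos) ? ev : ev.substr(us + 1);
        base = base.substr(0, 6);
        std::string name = base;
        for (int i = 2; used.count(name); i++)
            name = base.substr(0, 5) + std::to_string(i);
        used.insert(name);
        out.push_back(name);
    }
    return out;
}
```

For example, "ld_blocks.store_forward" would become "forwar" rather than "ld_blo", and two events sharing a tail still get distinct headers.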

Support libpfc as a CLOCK implementation

We should support libpfc as a type of CLOCK.

The original idea was to use libpfc for highly accurate cycle counts (as well as other interesting counters), but at the moment we are just using std::chrono::high_resolution_clock to measure time, and then converting time to cycles based on a calibrated assembly loop (see CalcCpuFreq in main.cpp).
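The calibration idea can be sketched like so (names invented; this assumes a dependent integer add retires at 1 per cycle, and uses a GCC/Clang inline-asm barrier to keep the loop from being optimized away):

```cpp
#include <chrono>
#include <cstdint>

// Estimate CPU frequency in GHz by timing a dependent add chain, which is
// assumed to execute at 1 add per cycle on modern x86.
double estimate_ghz(int64_t iters = 50000000) {
    int64_t x = 1;
    auto t0 = std::chrono::steady_clock::now();
    for (int64_t i = 0; i < iters; i++) {
        x += i;                      // each add depends on the previous one
        asm volatile("" : "+r"(x));  // compiler barrier: keep the chain alive
    }
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return iters / secs / 1e9;       // adds per second -> GHz
}
```

With cycles-per-nanosecond known, any wall-clock measurement can be converted to cycles.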

Compilation fails with -O0 and -Os.

Compiling with -O1/-O2/-O3 is fine, -O0/-Os result in the following:

g++ -g -O0 -march=native -DGIT_VERSION="4fb3130-dirty" -c -std=c++11 -o main.o main.cpp
nasm -w+all -f elf64 -l x86_methods.list x86_methods.asm
g++ -g -O0 -march=native -DGIT_VERSION="4fb3130-dirty" main.o x86_methods.o -o uarch-bench
main.o: In function 'std::shared_ptr LoadStoreGroup::make<ClockTimerT<std::chrono::_V2::system_clock>, &load16_any>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned int)':
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_COLS'
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_ROWS'
main.o: In function 'std::shared_ptr LoadStoreGroup::make<ClockTimerT<std::chrono::_V2::system_clock>, &load32_any>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned int)':
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_COLS'
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_ROWS'
main.o: In function 'std::shared_ptr LoadStoreGroup::make<ClockTimerT<std::chrono::_V2::system_clock>, &load64_any>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned int)':
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_COLS'
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_ROWS'
main.o: In function 'std::shared_ptr LoadStoreGroup::make<ClockTimerT<std::chrono::_V2::system_clock>, &load128_any>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned int)':
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_COLS'
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_ROWS'
main.o: In function 'std::shared_ptr LoadStoreGroup::make<ClockTimerT<std::chrono::_V2::system_clock>, &load256_any>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned int)':
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_COLS'
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_ROWS'
main.o: In function 'std::shared_ptr LoadStoreGroup::make<ClockTimerT<std::chrono::_V2::system_clock>, &store16_any>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned int)':
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_COLS'
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_ROWS'
main.o: In function 'std::shared_ptr LoadStoreGroup::make<ClockTimerT<std::chrono::_V2::system_clock>, &store32_any>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned int)':
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_COLS'
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_ROWS'
main.o: In function 'std::shared_ptr LoadStoreGroup::make<ClockTimerT<std::chrono::_V2::system_clock>, &store64_any>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned int)':
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_COLS'
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_ROWS'
main.o: In function 'std::shared_ptr LoadStoreGroup::make<ClockTimerT<std::chrono::_V2::system_clock>, &store128_any>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned int)':
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_COLS'
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_ROWS'
main.o: In function 'std::shared_ptr LoadStoreGroup::make<ClockTimerT<std::chrono::_V2::system_clock>, &store256_any>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, unsigned int)':
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_COLS'
/home/swinnenb/uarch-bench/main.cpp:338: undefined reference to 'LoadStoreGroup::DEFAULT_ROWS'
collect2: error: ld returned 1 exit status
Makefile:15: recipe for target 'uarch-bench' failed
make: *** [uarch-bench] Error 1

GCC version:

gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

System:

Linux 4.4.0-43-Microsoft #1-Microsoft Wed Dec 31 14:42:53 PST 2014 x86_64 x86_64 x86_64 GNU/Linux

Issues with compiling with gcc-7.3

/usr/bin/ld: x86_methods.o: relocation R_X86_64_32S against `.text' can not be used when making a PIE object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
Makefile:70: recipe for target 'uarch-bench' failed
make: *** [uarch-bench] Error 1

After googling a bit, I found that passing -no-pie as LD_FLAGS made it work.

More information:
https://wiki.ubuntu.com/SecurityTeam/PIE

Check if the clock is stable and try to wait until it is if not

Ideally the user runs uarch-bench with all frequency scaling behaviors disabled, but this is not always possible. In the case that scaling is occurring, we still want to provide a reasonable experience.

Especially on older CPUs (or even on newer CPUs still using software-driven p-state control) scaling might be a fairly slow process (i.e., 10s or 100s of milliseconds), and the CPU may transition through several p-states (frequencies) over a long(ish) period of time until it reaches max frequency. This affects all the benchmarks that run during this period. See for example these Sandy Bridge results which clearly show the frequency ramp during almost the entire test, greatly throwing off the results.
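One way to implement the waiting is a simple settle loop (a sketch with invented names; `measure` stands in for whatever frequency probe is used):

```cpp
#include <cmath>

// Re-run a measurement until two consecutive samples agree within `tol`
// (relative), or give up after max_tries and return the last sample.
template <typename F>
double wait_until_stable(F measure, double tol = 0.01, int max_tries = 50) {
    double prev = measure();
    for (int i = 0; i < max_tries; i++) {
        double cur = measure();
        if (std::fabs(cur - prev) <= tol * prev)
            return cur;   // settled: the ramp has (probably) finished
        prev = cur;
    }
    return prev;          // best effort; the caller may want to warn here
}
```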

Allow dynamically loaded benchmarks from a shared object

Rather than compiling in benchmarks, it would be cool to allow benchmarks to be dynamically loaded from a shared object, allowing decoupling of the benchmark application and default benchmarks from other benchmarks.

This would need at least the following:

  • A mechanism to load shared objects (e.g., dlopen/dlsym and friends) and enumerate the contained benchmarks.
  • An API that shared benchmark objects are written against, which would be a subset of the existing uarch-bench code.
  • Some kind of versioning mechanism so that we don't blow up when we load benchmarks compiled against an older version of the uarch-bench API after a breaking change has been made.
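The loading mechanism itself is just dlopen/dlsym, sketched below by resolving a symbol from libm (since no benchmark .so exists here; the registration entry point a real benchmark .so would export, e.g. "uarch_bench_register", is purely hypothetical):

```cpp
#include <dlfcn.h>

// Resolve a double(double) function from a shared object by name; a real
// benchmark loader would instead look up an agreed-upon registration
// entry point plus an API-version symbol for the compatibility check.
typedef double (*unary_fn)(double);

unary_fn load_symbol(const char* so, const char* name) {
    void* handle = dlopen(so, RTLD_NOW);
    if (!handle)
        return nullptr;
    return reinterpret_cast<unary_fn>(dlsym(handle, name));
}
```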

OSX support

Should be easy to support OSX, in fact perhaps it already works.

We don't have a solution for reading performance counters on OSX, and it isn't clear how much effort would be required to port the libpfc counters to that platform. It seems like PAPI won't work, but maybe we can get a driver from PCM?

uarch-bench not building due to missing libpfm4

Change 5b6777a removed the libpfm4 .tar and replaced it with a submodule pointing to the SourceForge git view for that library.

After this change, uarch-bench isn't building because the submodule doesn't seem to exist: a fresh git clone --recursive brings in the other three submodules but not libpfm4 and git submodule status and friends don't seem to know about it.

Similar issue described here.

Add "one-shot" test capability

Add the ability to do "one shot" tests which time a segment of code exactly once. This is useful when you want to test the cold-code behavior, e.g., before any predictors get warmed up.
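A rough sketch of what one-shot timing means (chrono-based and with an invented name; the real feature would presumably use the cycle counters for better resolution):

```cpp
#include <chrono>

// Time exactly one invocation: no warm-up and no averaging, so the result
// deliberately includes cold-cache and cold-predictor effects.
template <typename F>
double one_shot_ns(F fn) {
    auto t0 = std::chrono::steady_clock::now();
    fn();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count();
}
```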

Reconsider usage of sudo in uarch-bench.sh

uarch-bench.sh contains multiple calls to sudo, which obviously won't work if the system doesn't have sudo installed (a minimal Debian install doesn't come with sudo). You could test whether the script is already running as root and skip the sudo calls, though, to be quite honest, I'd just remove all instances of it altogether and instruct the user to run the script as root using their own preferred method (and you could check for this at the beginning of the script and exit if not root).

As an aside, the readme should probably mention that Linux kernel headers are needed to build the libpfc kernel module.
Also, I'm not sure how hard it would be, but being able to build a static binary could be useful for running tests on machines without the necessary build environment.

Thanks!

Allow selecting benchmarks to run using tags

Currently the benchmarks to run are organized into a hierarchy and you can select a subset to run using --test-name=PATTERN, but we should have a cross-cutting tag feature that allows you to select by tag.

For example, there could be a default tag that picks out the most interesting tests to run by default. This in particular will solve the issue where you get a ton of very specific tests (e.g., for the memory disambiguation stuff) spammed to the output.

Allow filtering tests by glob

Allow the user to specify a subset of tests using a glob against the test ID, e.g., foo/* would run all the tests in the foo group, etc.
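A sketch of how the matching could work on a POSIX host, using fnmatch(3) (the function name test_matches is hypothetical; passing FNM_PATHNAME instead of 0 would stop * from crossing / separators, which may or may not be wanted here):

```cpp
#include <fnmatch.h>  // POSIX glob matching
#include <string>

// Return true if a test ID matches the user-supplied glob pattern.
bool test_matches(const std::string& id, const std::string& pattern) {
    return fnmatch(pattern.c_str(), id.c_str(), 0) == 0;
}
```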

uarch-bench.sh doesn't restore governor if rdmsr isn't installed

Probably just move the if [[ -z $(which rdmsr) ]] check before the block that changes the governor.

The problem is the exit 1 after doing sudo sh -c "echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"

Also, a nice way to use sudo there is echo foo | sudo tee /proc/sys/.... I think you can redirect tee's output with > /dev/null and still have sudo's prompt appear on the terminal.
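Putting the two suggestions together, a hedged sketch of what the fixed section of uarch-bench.sh might look like (the set_governor helper and GOV variable are illustrative, not from the script):

```shell
GOV=${GOV:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor}

set_governor() {
    # Do the rdmsr check *before* touching the governor, so a missing
    # msr-tools install can't bail out with the governor already changed.
    if ! command -v rdmsr >/dev/null 2>&1; then
        echo "rdmsr not found (install msr-tools); governor left unchanged" >&2
        return 1
    fi
    # tee runs under sudo so the write happens as root, while > /dev/null
    # only hides tee's stdout; sudo's prompt still reaches the terminal.
    echo "$1" | sudo tee "$GOV" > /dev/null
}
```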

Calibration loop reports half of true frequency on Nehalem

As reported on RWT.

It seems to be related to the loop buffer implementation on Nehalem. Per Agner (emphasis mine):

The Core2 loop buffer works almost as a 64 bytes level-0 code cache, organized as 4 lines of 16 bytes each. A loop that can be completely contained in four aligned blocks of 16 bytes each can execute at a rate of up to 32 bytes of code per clock cycle. The four 16-bytes blocks do not even have to be consecutive. A loop that contains jumps (but not calls and returns) can still exceed the predecoder throughput if all the code in the loop can be contained in four aligned 16-bytes blocks.

The Nehalem design is slightly different. The Core2 has the loop buffer between the predecoders and the decoders, while the Nehalem has the loop buffer after the decoders. The Nehalem loop buffer can hold 28 (possibly fused) μops. The size of the loop code is limited to 256 bytes of code, or up to 8 blocks of 32 bytes each. A loop containing more than 256 bytes of code cannot use the loop buffer. There is a one-clock delay in this loop process, so that a loop containing 4*N (possibly fused) μops will take N+1 clock cycles to execute if there are no other bottlenecks elsewhere. The Core2 does not have this one-clock delay.

So it would seem that a small loop like the 2-instruction add_calibration we use will still take 2 cycles on Nehalem.

Better output for the alignment tests

There are load/store tests that exercise all 64 possible alignments within a 64B cache line, producing 64 lines of output each.

It would be much nicer if these were organized differently, e.g., 4 lines of 16 offsets each. In addition to greatly reducing the number of lines in the output, it is easier to visualize the behavior.
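A sketch of the proposed 4×16 layout (the function name and column widths are illustrative):

```cpp
#include <cstdio>
#include <string>

// Render the 64 per-offset results as 4 rows of 16 columns each,
// rather than 64 separate output lines.
std::string format_offsets(const double (&cycles)[64]) {
    std::string out;
    char buf[32];
    for (int row = 0; row < 4; row++) {
        std::snprintf(buf, sizeof(buf), "offset %2d-%2d:", row * 16, row * 16 + 15);
        out += buf;
        for (int col = 0; col < 16; col++) {
            std::snprintf(buf, sizeof(buf), " %5.1f", cycles[row * 16 + col]);
            out += buf;
        }
        out += '\n';
    }
    return out;
}
```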

Implement "delta" measurement

Currently we just measure the absolute time of the code under test like so:

static int64_t time_method(size_t loop_count) {
	auto t0 = CLOCK::now();
	METHOD(loop_count);
	auto t1 = CLOCK::now();
	return (t1 - t0).count(); // elapsed time in CLOCK ticks
}

The downside of this approach is that it includes the time for one CLOCK::now() call as well as all the overhead of METHOD(loop_count) which includes at least a call and ret and sometimes a small amount of setup overhead.

A better approach is to time the loop with two different loop_count values and use the difference in time to calculate the performance. This causes the above overheads to cancel out (the test/jump overhead inside the benchmark's loop is still present, but it is small and sometimes zero).
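A minimal sketch of the delta idea using std::chrono (the toy METHOD here stands in for the real asm benchmark; names are illustrative):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

using CLOCK = std::chrono::steady_clock;

// Toy stand-in for the real (asm) benchmark function.
void METHOD(size_t loop_count) {
    volatile uint64_t sink = 0;
    for (size_t i = 0; i < loop_count; i++) sink = sink + 1;
}

// Time base and base+delta iterations; dividing the *difference* by
// delta cancels the fixed call/timer/setup overhead from the result.
double ns_per_iter(size_t base, size_t delta) {
    auto t0 = CLOCK::now();
    METHOD(base);
    auto t1 = CLOCK::now();
    METHOD(base + delta);
    auto t2 = CLOCK::now();
    auto diff = (t2 - t1) - (t1 - t0);
    return std::chrono::duration<double, std::nano>(diff).count() / delta;
}
```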

Some "two code" events aren't supported with libpfc

If you include on the command line:

--timer=libpfc --extra-events=FRONTEND_RETIRED.L1I_MISS

You'll see this output:

Event 'FRONTEND_RETIRED.L1I_MISS' resolved to 'skl::FRONTEND_RETIRED:L1I_MISS:k=1:u=1:e=0:i=0:c=0:t=0:intx=0:intxcp=0:fe_thres=0, short name: 'FRONTE' with code 0x5301c6
Event 'FRONTEND_RETIRED.L1I_MISS' resolved to 'skl::FRONTEND_RETIRED:L1I_MISS:k=1:u=1:e=0:i=0:c=0:t=0:intx=0:intxcp=0:fe_thres=0, short name: 'FRONTE' with code 0x12
WARNING: MULTIPLE CODES FOR 'FRONTEND_RETIRED.L1I_MISS'

The problem is that the FRONTEND_RETIRED.L1I_MISS event needs programming of both the usual perf event select MSR (code 0x5301c6) as well as programming the special MSR_PEBS_FRONTEND MSR with 0x12 (the second code) to implement the L1I_MISS part. Without this, the event will do something unknown, depending on the existing value of MSR_PEBS_FRONTEND (presumably).

An apparent workaround is to use perf stat -e to record the event once - this sets the MSR_PEBS_FRONTEND MSR, which I guess gets left alone afterwards, and after that running uarch-bench in the normal way without perf appears to just work. Evidently not a good long-term solution!

Store forwarding benchmarks

Create benchmarks that investigate various aspects of store-forwarding behavior, such as:

  • store->load forwarding latency
  • misaligned load scenarios
  • store->load throughput
  • determine how far a load has to be from the aliasing store before store-forwarding apparently stops

Add continuous integration

Some type of CI would be nice.

It's clear that a lot of the non-functional behavior of uarch-bench is hardware specific: it wants to run on bare metal, and currently expects x86. The libpfc mode also wants to install a kernel driver and set certain values that only root can set in /sys, plus a few other "root-needed" things - but you can definitely run as non-root (though the experience could be better).

Just building and doing a "default" run of ./uarch-bench would be an awesome start.

I admit to being ignorant of how well cloud CI systems handle binary dependencies like nasm...

Better table printing

Right now we just hardcode some column sizes, an annoying tradeoff in many cases.

Let's properly pretty-print our tables with dynamic column sizing.

Canonical names for the tests

Each test should have a canonical name, so we can implement, for example, selection of which test(s) to run on the command line.

Support for non-x86 architectures

Seems like this would be pretty difficult, but I'd love to have something like this working on other architectures, especially ARM.

Test whether port7 can be used as a store AGU

See for example this SO answer which indicates that port7 may also be used for loads. I believe it is just stores.

We can test this using the performance counters and a loop where the stores have simple addressing modes (e.g., base-only) and the loads complex ones. If any situation should "force" some use of port7, it should be this one.

Windows support

Support Windows.

Should be straightforward other than the build. I guess we'll use cmake for that.

I don't think libpfc supports Windows but perhaps we can port that too, or else just rely on the default std::chrono based clock.

Move the CPU frequency calculation out of global init

Currently the CPU frequency calculation is part of global init:

template <typename CLOCK>
double ClockTimerT<CLOCK>::ghz = CalcCpuFreq<10000,CLOCK,1000>();

which means that the benchmark pauses to calculate this even if running something like uarch-bench --help which will never need it. We should make it lazy.
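A minimal sketch of the lazy approach using a function-local static (CalcCpuFreq here is a trivial stand-in returning a made-up value; the counter only demonstrates that the calculation runs once, on first use):

```cpp
#include <atomic>

std::atomic<int> calib_runs{0};

// Trivial stand-in for the real calibration; 3.2 is a made-up value.
double CalcCpuFreq() { ++calib_runs; return 3.2; }

// Function-local static: initialized on first use (thread-safely since
// C++11), so runs like `uarch-bench --help` never pay for calibration.
double getGHz() {
    static double ghz = CalcCpuFreq();
    return ghz;
}
```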

pfc.ko build isn't re-attempted if it fails (but libpfc.so succeeds) when re-running make

I first tried make -j8 and didn't notice right away that there were errors. libpfc failed to build, so make stopped before building uarch-bench. That's fine, but re-running make doesn't retry building libpfc.

The cause of libpfc failing to build might be a libpfc bug, though.

$ make -j8
...
cc  -g -Wall -Werror -Wextra -Wno-unused-parameter -I. -I/usr/local/src/uarch-bench/libpfm-4.8.0/lib/../include -DCONFIG_PFMLIB_DEBUG -DCONFIG_PFMLIB_OS_LINUX -D_REENTRANT -I. -fvisibility=hidden -DCONFIG_PFMLIB_ARCH_X86 -DCONFIG_PFMLIB_ARCH_X86_64 -I. -c pfmlib_intel_snb.c
rm -rf kmod.build
cp -r kmod kmod.build
cd kmod.build && make MAKEFLAGS=
make[2]: Entering directory '/usr/local/src/uarch-bench/libpfc/kmod.build'
make -C /lib/modules/4.12.8-2-ARCH/build M=/usr/local/src/uarch-bench/libpfc/kmod.build modules
make[3]: Entering directory '/usr/lib/modules/4.12.8-2-ARCH/build'
make[3]: *** No rule to make target 'modules'.  Stop.
make[3]: Leaving directory '/usr/lib/modules/4.12.8-2-ARCH/build'
make[2]: *** [Makefile:5: all] Error 2
make[2]: Leaving directory '/usr/local/src/uarch-bench/libpfc/kmod.build'
make[1]: *** [Makefile:30: pfc.ko] Error 2
make[1]: Leaving directory '/usr/local/src/uarch-bench/libpfc'
make: *** [Makefile:66: libpfc/libpfc.so] Error 2
cc  -g -Wall -Werror -Wextra -Wno-unused-parameter -I. -I/usr/local/src/uarch-bench/libpfm-4.8.0/lib/../include -DCONFIG_PFMLIB_DEBUG -DCONFIG_PFMLIB_OS_LINUX -D_REENTRANT -I. -fvisibility=hidden -DCONFIG_PFMLIB_ARCH_X86 -DCONFIG_PFMLIB_ARCH_X86_64 -I. -c pfmlib_intel_snb_unc.c
make: *** Waiting for unfinished jobs....
cc  -g -Wall -Werror -Wextra -Wno-unused-parameter -I. -I/usr/local/src/uarch-bench/libpfm-4.8.0/lib/../include -DCONFIG_PFMLIB_DEBUG -DCONFIG_PFMLIB_OS_LINUX -D_REENTRANT -I. -fvisibility=hidden -DCONFIG_PFMLIB_ARCH_X86 -DCONFIG_PFMLIB_ARCH_X86_64 -I. -c pfmlib_intel_ivb.c
cc  -g -Wall -Werror -Wextra -Wno-unused-parameter -I. -I/usr/local/src/uarch-bench/libpfm-4.8.0/lib/../include -DCONFIG_PFMLIB_DEBUG -DCONFIG_PFMLIB_OS_LINUX -D_REENTRANT -I. -fvisibility=hidden -DCONFIG_PFMLIB_ARCH_X86 -DCONFIG_PFMLIB_ARCH_X86_64 -I. -c pfmlib_intel_ivb_unc.c
cc  -g -Wall -Werror -Wextra -Wno-unused-parameter -I. -I/usr/local/src/uarch-bench/libpfm-4.8.0/lib/../include -DCONFIG_PFMLIB_DEBUG -DCONFIG_PFMLIB_OS_LI

Running make again after this built the uarch-bench binary.

But the pfc kmod isn't built, so I guess somewhere there's a missing dependency check on that?

Also I'm having trouble getting libpfc built on Arch Linux. I cded into uarch-bench/libpfc and did a make clean, then make:

rm -rf kmod.build
cp -r kmod kmod.build
cd kmod.build && make MAKEFLAGS=
make[1]: Entering directory '/usr/local/src/uarch-bench/libpfc/kmod.build'
make -C /lib/modules/4.12.8-2-ARCH/build M=/usr/local/src/uarch-bench/libpfc/kmod.build modules
make[2]: Entering directory '/usr/lib/modules/4.12.8-2-ARCH/build'
make[2]: *** No rule to make target 'modules'.  Stop.
make[2]: Leaving directory '/usr/lib/modules/4.12.8-2-ARCH/build'
make[1]: *** [Makefile:5: all] Error 2
make[1]: Leaving directory '/usr/local/src/uarch-bench/libpfc/kmod.build'
make: *** [Makefile:30: pfc.ko] Error 2

I don't see how the make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules is supposed to work, because there's no modules anywhere in libpfc or in /usr/lib/modules/4.12.8-2-ARCH/, just modules.*. Same on my Ubuntu 15.04 machine using a distro kernel. (I keep putting off updating it until I make a grand plan to put it on an SSD or something...)

And the Makefile doesn't define this target.

Disable turbo on AMD

The current solution in uarch-bench.sh to disable turbo mode (write to msr regs) only works for Intel, but we should come up with a solution for AMD.

I didn't find anything obvious, but at least on pre-Zen chips you could do it with the AMD OverDrive software, and apparently also (on Windows) by tweaking the "maximum processor state" value in the advanced power settings. So maybe the same can be done through the cpufreq interface on Linux.

Fail with diagnostic if submodules don't exist

Many users will probably just git clone followed by make and the submodules will be missing in that case, but the failure mode is a bit obscure: you'll get some weird error as you try to make a submodule that doesn't exist.

We should just check if (some) submodule exists and if not print a diagnostic recommending git submodule update --init.
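A hedged sketch of such a check (the helper name and the sentinel path in the usage line are hypothetical):

```shell
# Fail fast with a hint if a submodule's files are missing.
check_submodule() {
    if [ ! -e "$1" ]; then
        echo "error: '$1' missing - run 'git submodule update --init' first" >&2
        return 1
    fi
}
```

Usage would be something like `check_submodule libpfc/Makefile || exit 1` near the top of the Makefile or wrapper script.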

Use `intel_pstate/no_turbo` to disable turboboost in preference to msr-tools

Currently the uarch-bench.sh wrapper script uses rdmsr and wrmsr from msr-tools to disable turbo by writing the msr regs, but a cleaner way for systems using the intel_pstate CPU governor (probably most modern systems) would be to write to /sys/devices/system/cpu/intel_pstate/no_turbo instead. Simpler (no need to do everything per core) and doesn't require the user to install msr-tools.
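A sketch of what this could look like in uarch-bench.sh (the disable_turbo helper is illustrative; it falls back to the existing wrmsr-based path when intel_pstate isn't active):

```shell
NO_TURBO=${NO_TURBO:-/sys/devices/system/cpu/intel_pstate/no_turbo}

disable_turbo() {
    if [ -e "$NO_TURBO" ]; then
        # One write disables turbo for every core; no msr-tools needed.
        echo 1 | sudo tee "$NO_TURBO" > /dev/null
    else
        echo "intel_pstate not active; falling back to wrmsr" >&2
        return 1
    fi
}
```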

Benchmark registration shouldn't be coupled to `main.cpp`

Currently every file that defines benchmarks needs to declare a benchmark registration method that is called explicitly in make_benches in main.cpp, which looks like:

template <typename TIMER>
GroupList make_benches() {

    GroupList groupList;

    register_default<TIMER>(groupList);
    register_loadstore<TIMER>(groupList);
    register_mem<TIMER>(groupList);
    register_misc<TIMER>(groupList);
    register_cpp<TIMER>(groupList);
    register_vector<TIMER>(groupList);
    register_call<TIMER>(groupList);
    register_oneshot<TIMER>(groupList);

    return groupList;
}

This is unfortunate since it means that otherwise independent lists of benchmarks have to be registered in a common place (increasing merge conflicts for independent code) and it also adds another step to adding a new benchmark file.

We should allow independent registration of benchmarks: ideally, simply dropping in a new .cpp file that has benchmarks would be enough for it to be picked up. That probably means registration should use some kind of global constructor to register tests from the implementing .cpp file directly. Since the order of such calls isn't defined across translation units, we need to sort by benchmark name or something so that we have a consistent order in the benchmark list regardless of actual registration order.
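A minimal sketch of the self-registration idea (all names hypothetical): each benchmark file defines a file-scope Registrar whose constructor adds the group to a central list, and make_benches() sorts by name for a deterministic order:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

struct BenchGroup { std::string name; };

// Central list behind a function-local static, which sidesteps the
// static-initialization-order problem between translation units.
std::vector<BenchGroup>& registry() {
    static std::vector<BenchGroup> r;
    return r;
}

// Each benchmark .cpp defines one of these at file scope; its
// constructor runs before main() and registers the group.
struct Registrar {
    explicit Registrar(std::string name) { registry().push_back({std::move(name)}); }
};

// e.g. in loadstore.cpp:
static Registrar reg_loadstore("loadstore");
// e.g. in vector.cpp:
static Registrar reg_vector("vector");

// Sort by name so the list order is deterministic regardless of the
// (unspecified) cross-TU construction order.
std::vector<BenchGroup> make_benches() {
    auto groups = registry();
    std::sort(groups.begin(), groups.end(),
              [](const BenchGroup& a, const BenchGroup& b) { return a.name < b.name; });
    return groups;
}
```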

libpfc isn't building on Arch Linux

Per Peter on issue #24:

I'm having trouble getting libpfc built on Arch Linux. I cded into uarch-bench/libpfc and did a make clean, then make:

rm -rf kmod.build
cp -r kmod kmod.build
cd kmod.build && make MAKEFLAGS=
make[1]: Entering directory '/usr/local/src/uarch-bench/libpfc/kmod.build'
make -C /lib/modules/4.12.8-2-ARCH/build M=/usr/local/src/uarch-bench/libpfc/kmod.build modules
make[2]: Entering directory '/usr/lib/modules/4.12.8-2-ARCH/build'
make[2]: *** No rule to make target 'modules'.  Stop.
make[2]: Leaving directory '/usr/lib/modules/4.12.8-2-ARCH/build'
make[1]: *** [Makefile:5: all] Error 2
make[1]: Leaving directory '/usr/local/src/uarch-bench/libpfc/kmod.build'
make: *** [Makefile:30: pfc.ko] Error 2

I don't see how the make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules is supposed to work, because there's no modules anywhere in libpfc or in /usr/lib/modules/4.12.8-2-ARCH/, just modules.*. Same on my Ubuntu 15.04 machine using a distro kernel. (I keep putting off updating it until I make a grand plan to put it on an SSD or something...)

And the Makefile doesn't define this target.

Better support for non-root runs

You can currently run uarch-bench as non-root, but the experience isn't great (for example, you have to run the uarch-bench binary directly rather than use the wrapper script). We should make this better, e.g., allowing users to still use the wrapper script with NOROOT=1 or something like that.

In that case we don't try to do things that require root and warn where this may affect the stability of results.

Add benchmarks for BMI instructions

Test the latency of chains of popcnt, tzcnt, and lzcnt instructions which use the same destination register. In principle these should be independent (since the destination is write-only) and execute at the instruction throughput (1/cycle on recent Intel, better on AMD), but in practice on many recent Intel chips there is a false dependency on the output: the instruction won't execute until the destination register is ready (even though it is overwritten). See for example this discussion.

Allow "timer-coupled" benchmarks

The existing benchmark generation mechanism creates all benchmarks for all timers. It does this by abstracting the Timer methods with TIMER::now() and the benchmarked code as a bench2_f function and then using a timer-generic timing loop like:

std::array<typename TIMER::delta_t,samples> time_one(size_t loop_count, void* arg) {
    std::array<typename TIMER::delta_t,samples> result;
    for (int i = 0; i < samples; i++) {
        auto t0 = TIMER::now();
        METHOD(loop_count, arg);
        auto t1 = TIMER::now();
        result[i] = TIMER::delta(t1, t0);
    }
    return result;
}

Great.

The downside is that even if now() can be inlined (it can't always be, e.g., when it uses a std::chrono or clock_gettime type implementation), METHOD generally cannot be if it is written in asm, so this implies a function call (possibly mis-predicted, and involving reads and writes of the stack) inside the measured region.

Furthermore, the decoupling between TIMER::now() and METHOD(...) means that the ::now() method must be written in a generic way that saves the initial timestamp to t0 and then does a delta after. In the specific case of the libpfc timer, this means that it uses the PFC_END macro to save its ::now() results on an array on the stack. Given that this involves memory writes, including inside the "measured region" it potentially perturbs any benchmark that wants to carefully control that behavior, especially one-shot benchmarks.

Ideally, we would want to allow writing a "timer coupled" benchmark, which knows that it will be measured with libpfc and embeds the rdpmc calls directly in the benchmark. This avoids the function call, avoids writes to the stack, and generally lets the benchmark choose exactly what is in the measured region. The flip side is that such a benchmark is coupled to that timer: it only makes sense to run it if the timer it is written for is selected.

This issue covers making such benchmarks possible.

TravisCI keeps running even after command fails

Due to a bug or mis-design in TravisCI, the build will keep running even if a script command fails. For example, if the build fails, we'll still try to run uarch-bench.

For now, we can probably include set -e to fix this in the absence of a configuration key in TravisCI to specify fail-fast behavior.
