martinus / nanobench

Simple, fast, accurate single-header microbenchmarking functionality for C++11/14/17/20

Home Page: https://nanobench.ankerl.com

License: MIT License

Languages: CMake 7.27%, Shell 7.01%, C++ 81.78%, Python 3.07%, HTML 0.87%

Topics: benchmark, cpp, microbenchmark, single-header, single-header-lib, single-file, header-only, cpp11


nanobench's People

Contributors

chipot, cj-tommi-rantala, cozycactus, fferflo, jonas-schulze, lectem, martinus, mxmlkzdh, pr8x, tocic, tridacnid, vadi2


nanobench's Issues

Add access to mResult?

It's not entirely clear whether mResult is reachable through the public interface; it does appear necessary for printing results.
E.g.:

const Result &Bench::aggregate_result() noexcept {
    return mResult;
}

Visual Studio Version 16.8.2: example_random_number_generators.cpp fails to build

C2598 linkage specification must be at global scope nb C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\setjmp.h 24

C2624 'WyRng::umul128::__m128d': local classes cannot be used to declare 'extern' variables nb C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include\emmintrin.h 77

Manual Timing

Is there a mechanism (similar to Google Benchmark) to perform manual timing, i.e. not use the clock built into nanobench, but provide our own iteration time to nanobench?
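For reference, this is roughly what the Google Benchmark mechanism referred to above looks like; a sketch of that library's API, not of anything nanobench currently offers:

#include <benchmark/benchmark.h>
#include <chrono>

static void BM_ManualTiming(benchmark::State& state) {
    for (auto _ : state) {
        auto start = std::chrono::high_resolution_clock::now();
        // ... code under test ...
        volatile int sink = 0;
        for (int i = 0; i < 1000; ++i) sink = sink + i;
        auto end = std::chrono::high_resolution_clock::now();
        // Report our own measured duration instead of the framework's clock.
        state.SetIterationTime(std::chrono::duration<double>(end - start).count());
    }
}
BENCHMARK(BM_ManualTiming)->UseManualTime();
BENCHMARK_MAIN();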

Request: CSV output

I think it would be useful to be able to output to CSV files for some graphing in Excel.
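For what it's worth, nanobench's mustache-style render mechanism can already emit CSV via the csv() template declared in nanobench.h (it also shows up in the build log of the macOS issue below). A minimal sketch:

#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>
#include <cstdint>
#include <iostream>

int main() {
    uint64_t x = 1;
    ankerl::nanobench::Bench bench;
    bench.run("x += x", [&] { ankerl::nanobench::doNotOptimizeAway(x += x); });
    // Render everything collected so far as CSV, ready for graphing in Excel.
    bench.render(ankerl::nanobench::templates::csv(), std::cout);
}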

Doesn't compile on macOS

Works great for me on Linux but not on macOS:

  FAILED: _deps/nanobench-build/CMakeFiles/nanobench.dir/src/test/app/nanobench.cpp.o 
  ccache /usr/local/opt/ccache/libexec/c++  -I_deps/nanobench-src/src/include -isysroot /Applications/Xcode_12.4.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk -mmacosx-version-min=10.14 -MD -MT _deps/nanobench-build/CMakeFiles/nanobench.dir/src/test/app/nanobench.cpp.o -MF _deps/nanobench-build/CMakeFiles/nanobench.dir/src/test/app/nanobench.cpp.o.d -o _deps/nanobench-build/CMakeFiles/nanobench.dir/src/test/app/nanobench.cpp.o -c _deps/nanobench-src/src/test/app/nanobench.cpp
  In file included from _deps/nanobench-src/src/test/app/nanobench.cpp:2:
  Warning: _deps/nanobench-src/src/include/nanobench.h:117:15: warning: alias declarations are a C++11 extension [-Wc++11-extensions]
  using Clock = std::conditional<std::chrono::high_resolution_clock::is_steady, std::chrono::high_resolution_clock,
                ^
  Error: _deps/nanobench-src/src/include/nanobench.h:296:19: error: expected function body after function declarator
  char const* csv() noexcept;
                    ^
  Error: _deps/nanobench-src/src/include/nanobench.h:308:27: error: expected function body after function declarator
  char const* htmlBoxplot() noexcept;
                            ^
  Error: _deps/nanobench-src/src/include/nanobench.h:319:20: error: expected function body after function declarator
  char const* json() noexcept;
                     ^
  Error: _deps/nanobench-src/src/include/nanobench.h:347:7: error: function definition does not declare parameters
      T pageFaults{};
        ^
  Error: _deps/nanobench-src/src/include/nanobench.h:348:7: error: function definition does not declare parameters
      T cpuCycles{};
        ^
  Error: _deps/nanobench-src/src/include/nanobench.h:349:7: error: function definition does not declare parameters
      T contextSwitches{};
        ^
  Error: _deps/nanobench-src/src/include/nanobench.h:350:7: error: function definition does not declare parameters
      T instructions{};
        ^
  Error: _deps/nanobench-src/src/include/nanobench.h:351:7: error: function definition does not declare parameters
      T branchInstructions{};
        ^
  Error: _deps/nanobench-src/src/include/nanobench.h:352:7: error: function definition does not declare parameters
      T branchMisses{};
        ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:360:33: warning: in-class initialization of non-static data member is a C++11 extension [-Wc++11-extensions]
      std::string mBenchmarkTitle = "benchmark";
                                  ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:361:32: warning: in-class initialization of non-static data member is a C++11 extension [-Wc++11-extensions]
      std::string mBenchmarkName = "noname";
                                 ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:362:23: warning: in-class initialization of non-static data member is a C++11 extension [-Wc++11-extensions]
      std::string mUnit = "op";
                        ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:363:19: warning: in-class initialization of non-static data member is a C++11 extension [-Wc++11-extensions]
      double mBatch = 1.0;
                    ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:364:25: warning: in-class initialization of non-static data member is a C++11 extension [-Wc++11-extensions]
      double mComplexityN = -1.0;
                          ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:365:23: warning: in-class initialization of non-static data member is a C++11 extension [-Wc++11-extensions]
      size_t mNumEpochs = 11;
                        ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:366:37: warning: in-class initialization of non-static data member is a C++11 extension [-Wc++11-extensions]
      size_t mClockResolutionMultiple = static_cast<size_t>(1000);
                                      ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:367:44: warning: in-class initialization of non-static data member is a C++11 extension [-Wc++11-extensions]
      std::chrono::nanoseconds mMaxEpochTime = std::chrono::milliseconds(100);
                                             ^
  Error: _deps/nanobench-src/src/include/nanobench.h:368:30: error: function definition does not declare parameters
      std::chrono::nanoseconds mMinEpochTime{};
                               ^
  Error: _deps/nanobench-src/src/include/nanobench.h:369:14: error: function definition does not declare parameters
      uint64_t mMinEpochIterations{1};
               ^
  Error: _deps/nanobench-src/src/include/nanobench.h:370:14: error: function definition does not declare parameters
      uint64_t mEpochIterations{0}; // If not 0, run *exactly* these number of iterations per epoch.
               ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:371:22: warning: in-class initialization of non-static data member is a C++11 extension [-Wc++11-extensions]
      uint64_t mWarmup = 0;
                       ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:372:24: warning: in-class initialization of non-static data member is a C++11 extension [-Wc++11-extensions]
      std::ostream* mOut = nullptr;
                         ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:373:45: warning: in-class initialization of non-static data member is a C++11 extension [-Wc++11-extensions]
      std::chrono::duration<double> mTimeUnit = std::chrono::nanoseconds{1};
                                              ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:374:31: warning: in-class initialization of non-static data member is a C++11 extension [-Wc++11-extensions]
      std::string mTimeUnitName = "ns";
                                ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:375:35: warning: in-class initialization of non-static data member is a C++11 extension [-Wc++11-extensions]
      bool mShowPerformanceCounters = true;
                                    ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:376:22: warning: in-class initialization of non-static data member is a C++11 extension [-Wc++11-extensions]
      bool mIsRelative = false;
                       ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:381:29: warning: rvalue references are a C++11 extension [-Wc++11-extensions]
      Config& operator=(Config&&);
                              ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:383:18: warning: rvalue references are a C++11 extension [-Wc++11-extensions]
      Config(Config&&) noexcept;
                   ^
  Error: _deps/nanobench-src/src/include/nanobench.h:383:21: error: expected ';' at end of declaration list
      Config(Config&&) noexcept;
                      ^
                      ;
  Error: _deps/nanobench-src/src/include/nanobench.h:373:71: error: expected '(' for function-style cast or type construction
      std::chrono::duration<double> mTimeUnit = std::chrono::nanoseconds{1};
                                                ~~~~~~~~~~~~~~~~~~~~~~~~^
  Warning: _deps/nanobench-src/src/include/nanobench.h:391:10: warning: scoped enumerations are a C++11 extension [-Wc++11-extensions]
      enum class Measure : size_t {
           ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:407:29: warning: rvalue references are a C++11 extension [-Wc++11-extensions]
      Result& operator=(Result&&);
                              ^
  Warning: _deps/nanobench-src/src/include/nanobench.h:409:18: warning: rvalue references are a C++11 extension [-Wc++11-extensions]
      Result(Result&&) noexcept;
                   ^
  Error: _deps/nanobench-src/src/include/nanobench.h:409:21: error: expected ';' at end of declaration list
      Result(Result&&) noexcept;
                      ^
                      ;
  Error: _deps/nanobench-src/src/include/nanobench.h:415:61: error: expected ';' at end of declaration list
      ANKERL_NANOBENCH(NODISCARD) Config const& config() const noexcept;
                                                              ^
                                                              ;
  Error: _deps/nanobench-src/src/include/nanobench.h:420:60: error: expected ';' at end of declaration list
      ANKERL_NANOBENCH(NODISCARD) double sum(Measure m) const noexcept;
                                                             ^
                                                             ;
  Error: _deps/nanobench-src/src/include/nanobench.h:421:80: error: expected ';' at end of declaration list
      ANKERL_NANOBENCH(NODISCARD) double sumProduct(Measure m1, Measure m2) const noexcept;
                                                                                 ^
                                                                                 ;
  Error: _deps/nanobench-src/src/include/nanobench.h:422:64: error: expected ';' at end of declaration list
      ANKERL_NANOBENCH(NODISCARD) double minimum(Measure m) const noexcept;
                                                                 ^
                                                                 ;
  fatal error: too many errors emitted, stopping now [-ferror-limit=]

the cycles/value output doesn't show on some platforms

Hi,

I'm using nanobench in some of my projects, and everything's good.
One question though: on one of my Linux systems, running Arch Linux, I get all the cycles, IPC, branch, etc. measures displayed.
One of my coworkers uses Linux Mint / Ubuntu 18.04 and he gets none of those.

Is there a specific package to install so that the extra perf counters get picked up?

comparisons of results

Ability to store and later compare results; maybe multiple results, to create a graph of changes over time.

Some statistical analysis would be nice. Maybe output in a format that's understood by some good tool
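Until built-in comparison support exists, one workable interim approach is to render each run with the json() template (visible in the header excerpt above) and feed the files to an external diffing/graphing tool; a minimal sketch:

#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>
#include <cstdint>
#include <fstream>

int main() {
    uint64_t x = 1;
    ankerl::nanobench::Bench bench;
    bench.run("to-be-tracked", [&] { ankerl::nanobench::doNotOptimizeAway(x += x); });
    // Write machine-readable results; one file per commit/run, compared
    // later by an external script or plotting tool.
    std::ofstream out("results.json");
    bench.render(ankerl::nanobench::templates::json(), out);
}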

add Rng

Sfc64, without random seeding

Clarify use of doctest in documentation?

The documentation says

In the remaining examples, I’m using doctest as a unit test framework, which is like Catch2 - but compiles much faster. It pairs well with nanobench.

The benefits I can see from it are benchmark registration and filtering; is there anything else? It may also be useful to explicitly mention/show how it allows filtering.

On the other hand, you get the test-results table, which isn't too useful because the tests should all definitely pass. It can be suppressed by using

#define DOCTEST_CONFIG_IMPLEMENT // we supply our own main()
#include <doctest/doctest.h>

int main(int argc, char** argv) {
    doctest::Context context;
    context.setOption("out", "/dev/null");
    context.setOption("no-version", true);
    context.applyCommandLine(argc, argv);
    return context.run();
}

which may also be worth mentioning.

which one is more important, ns/op or total?

|         ns/op |   op/s | err% |       ins/op |        cyc/op |   IPC |       bra/op | miss% | total | benchmark
|--------------:|-------:|-----:|-------------:|--------------:|------:|-------------:|------:|------:|:----------
|  7,266,190.00 | 137.62 | 3.3% | 4,721,603.00 | 15,556,024.00 | 0.304 | 1,302,758.00 | 13.0% |  0.18 | hopscotch_map
| 35,033,938.00 |  28.54 | 0.9% | 4,961,470.00 | 76,638,408.00 | 0.065 | 1,097,837.00 | 23.8% |  0.44 | unordered_map
|  6,696,755.00 | 149.33 | 1.7% | 8,040,577.00 | 14,634,752.00 | 0.549 |   767,069.00 | 17.1% |  0.12 | flat_hash_map
|  7,676,762.00 | 130.26 | 2.3% | 7,126,320.00 | 16,794,536.00 | 0.424 |   774,163.00 | 17.6% |  0.09 | F14FastMap

flat_hash_map: ns/op = 6,696,755.00, total = 0.12
F14FastMap: ns/op = 7,676,762.00, total = 0.09

The two metrics rank these containers differently, hence the question.

Better documentation

  • Separate into multiple files
  • Describe mustache-like templates (with a demonstration graph)
  • Start with simple usage

pre and post actions for benching

I'd like to be able to perform some action before/after an iteration that doesn't get counted towards the runtime. For example, I want to copy an array before each iteration so it starts from a clean slate, but I also do not want to measure the runtime of the copying. A baseline-subtraction workaround is sketched below.
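Until such hooks exist, one can benchmark the untimed setup on its own and subtract it as a baseline; a minimal sketch using only the nanobench API shown elsewhere in these issues:

#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>
#include <algorithm>
#include <vector>

int main() {
    std::vector<int> pristine(10000, 42);
    std::vector<int> work(pristine.size());

    ankerl::nanobench::Bench bench;
    // Baseline: just the per-iteration setup (the copy).
    bench.run("copy only", [&] {
        std::copy(pristine.begin(), pristine.end(), work.begin());
        ankerl::nanobench::doNotOptimizeAway(work.data());
    });
    // Setup plus the operation of interest; subtracting the baseline's
    // ns/op from this result estimates the sort alone.
    bench.run("copy + sort", [&] {
        std::copy(pristine.begin(), pristine.end(), work.begin());
        std::sort(work.begin(), work.end());
        ankerl::nanobench::doNotOptimizeAway(work.data());
    });
}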

CMake integration (interface library)

There is a CMakeLists.txt in the nanobench repository, but it seems to be meant for development, not for users of the library. It would be nice if nanobench could be integrated into a project via CMake's FetchContent, or as a Git submodule (if one can't use relatively new versions of CMake). Specifically, the following CMakeLists.txt should work:

cmake_minimum_required(VERSION 3.14)

project(
  CMakeNanobenchExample
  VERSION 1.0
  LANGUAGES CXX)

include(FetchContent)

FetchContent_Declare(
  doctest
  GIT_REPOSITORY https://github.com/onqtam/doctest.git
  GIT_TAG 2.4.0
  GIT_SHALLOW TRUE)

FetchContent_Declare(
  nanobench
  GIT_REPOSITORY https://github.com/martinus/nanobench.git
  GIT_TAG v4.0.2
  GIT_SHALLOW TRUE)

FetchContent_MakeAvailable(doctest nanobench)

add_executable(MyExample my_example.cpp)
target_link_libraries(MyExample PRIVATE doctest nanobench)

Basically, what FetchContent_MakeAvailable does here is clone the repositories declared by FetchContent_Declare and then include them via add_subdirectory.

The problems are:

  1. nanobench is not an interface library. It should be easy to create a header-only library with CMake's interface-library feature (for example, see here).
  2. The CMakeLists.txt in the nanobench repository does many things, such as searching for ccache and including subdirectories for testing, which are not relevant for library users. In particular, it leads to building many test programs. To avoid this, it may help to look at the CMakeLists.txt of doctest or Catch2, which detect whether the library is being included as a subdirectory (see also here for an example).

Clarify err%

What does the err% metric measure? I check my application with unit tests, libFuzzer, and many clang flags, so I would expect this value to be zero, but it is reporting about a 5% error rate.

Does err% indicate that nanobench observed an iteration crashing? Could you please expand the documentation on what this metric means?

Raise default to 1ms runtime

In some cases this significantly improves precision with very little overhead. This has already been done in Bitcoin's usage of nanobench.
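For a single benchmark, the equivalent per-call knob would presumably be the setter matching the mMinEpochTime field in the header excerpt above; a minimal sketch, assuming a minEpochTime setter exists:

#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>
#include <chrono>
#include <cstdint>

int main() {
    ankerl::nanobench::Bench bench;
    // Request at least 1 ms of measured runtime per epoch instead of the
    // clock-resolution-based default.
    bench.minEpochTime(std::chrono::milliseconds(1));
    uint64_t x = 1;
    bench.run("x += x", [&] { ankerl::nanobench::doNotOptimizeAway(x += x); });
}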

Suggestions for benchmarking parallel data structures?

Hi, I want to do some microbenchmarking of parallel data structures, where I run a bunch of parallel tasks and then time the work they do in different threads. The problem is that I want to somehow combine the timings of the work done in each task, to avoid including the overhead of the task scheduler. Any suggestions for using nanobench in this scenario? Thanks!

Option to suppress unstable warnings at run-time?

Though it is a good thing to show warnings for unstable results (〰️ ... (Unstable with ...)), sometimes one wants to run benchmarks in an environment where execution is potentially perturbed by other processes, so results are known to be unstable; for example, in continuous integration.

So, it might be handy if an environment variable (NANOBENCH_NO_UNSTABLE_WARNING or something) could change the behaviour and suppress the warnings for unstable results.

Changing the behaviour via a command-line option would be an alternative, but I think an environment variable is more suitable for CI, and I guess it is easier to implement.
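A minimal sketch of how such a switch could be honoured inside the warning path; the name is the one proposed above, and none of this is existing nanobench code:

#include <cstdlib>

// Hypothetical check: emit the "unstable results" warning only when the
// proposed NANOBENCH_NO_UNSTABLE_WARNING environment variable is NOT set.
bool shouldWarnAboutUnstableResults() {
    return std::getenv("NANOBENCH_NO_UNSTABLE_WARNING") == nullptr;
}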

Setup with vcpkg

Trying to add nanobench via a vcpkg manifest yields bizarre results.
The vcpkg manifest dependency looks like this:

...
    "dependencies": [
        "xtensor-blas",
        "nanobench",
        {
            "name": "highfive",
            "default-features": false,
            "features": [ "xtensor" ]
        },
...

It only seems to be happy if I don't add anything to CMake at all.
If I try find_package(nanobench), or link against nanobench, it complains that the package can't be found or that I have to add to CMAKE_PREFIX_PATH, etc.

Is not setting anything up for it in CMake standard practice here, or is there some wizardry going on?

Confusion with CPU governor

Hi,

I am confused about the state of the CPU governor. nanobench reports:

Warning, results might be unstable:

  • CPU governor is '' but should be 'performance'
  • Turbo is enabled, CPU frequency will fluctuate

But I have the following configuration for my CPU frequency:

$ grep GOVERNOR /etc/init.d/cpufrequtils
#       GOVERNOR="ondemand"                                                                                                                        
GOVERNOR="performance"                                                                                                                             
        if [ -f $info ] && grep -q "\<$GOVERNOR\>" $info ; then                                                                                    
if [ -n "$GOVERNOR" ] ; then                                                                                                                       
        CPUFREQ_OPTIONS="$CPUFREQ_OPTIONS --governor $GOVERNOR"                                                                                    
                log_action_begin_msg "$DESC: Setting $GOVERNOR CPUFreq governor"

I have disabled the systemd service for ondemand, regenerated the initramfs, and rebooted.

Is nanobench correctly detecting the CPU governor state?

Maybe its parsing is mistaken. This is an EC2 virtual machine, and I don't actually know whether the CPU governor can be customized there at all. What do you think about hiding this message on virtual machines?

nanobench 4.3.0, Ubuntu 20.04 Focal Fossa, EC2 t2.micro.

how to deal with unstable results?

I'm benchmarking protobuf. There are always some warnings about unstable results on the first run, even after increasing the warmup and the minimum iterations:

TEST_CASE("varint encode benchmark"){

    nanobench::Bench b;
    b.title("varint encode")
        .warmup(100000)
        .relative(true);
    b.performanceCounters(true);

    uint8_t buf[10] = {};

    b.minEpochIterations(10000000).run("google uint32_t", [&]{
        nanobench::doNotOptimizeAway(
            google::protobuf::io::CodedOutputStream::WriteVarint32ToArray(static_cast<uint32_t>(2961488830), buf));
    });

    b.minEpochIterations(10000000).run("google uint64_t", [&]{
        nanobench::doNotOptimizeAway(
            google::protobuf::io::CodedOutputStream::WriteVarint64ToArray(static_cast<uint64_t>(-41256202580718336), buf));
    });

//     std::ofstream f{"varint encode benchmark.html"};
//     b.render(nanobench::templates::htmlBoxplot(), f);

} // TEST_CASE("varint encode benchmark")


TEST_CASE("varint decode benchmark"){

    nanobench::Bench b;
    b.title("varint decode")
        .warmup(100000)
        .relative(true);
    b.performanceCounters(true);

    std::initializer_list<uint8_t> buf32 = {0xbe, 0xf7, 0x92, 0x84, 0x0b};
    std::initializer_list<uint8_t> buf64 = {0x9b, 0xa8, 0xf9, 0xc2, 0xbb, 0xd6, 0x80, 0x85, 0xa6, 0x01};

    b.minEpochIterations(10000000).run("google uint32_t", [&]{
        uint32_t v;
        nanobench::doNotOptimizeAway(
            google::protobuf::io::CodedInputStream{buf32.begin(), (int)buf32.size()}.ReadVarint32(&v));
    });

    b.minEpochIterations(10000000).run("google uint64_t", [&]{
        uint64_t v;
        nanobench::doNotOptimizeAway(
            google::protobuf::io::CodedInputStream{buf64.begin(), (int)buf64.size()}.ReadVarint64(&v));
    });

//     std::ofstream f{"varint decode benchmark.html"};
//     b.render(nanobench::templates::htmlBoxplot(), f);

} // TEST_CASE("varint decode benchmark")

encode results for 3 runs:

| relative | ns/op |           op/s | err% | total | varint encode
|---------:|------:|---------------:|-----:|------:|:--------------
|   100.0% |  6.95 | 143,943,329.37 | 5.3% |  0.87 | 〰️ google uint32_t (Unstable with ~11,003,388.7 iters. Increase minEpochIterations to e.g. 110033887)
|    54.0% | 12.86 |  77,751,471.35 | 2.0% |  1.56 | google uint64_t

| relative | ns/op |           op/s | err% | total | varint encode
|---------:|------:|---------------:|-----:|------:|:--------------
|   100.0% |  7.26 | 137,743,666.50 | 4.8% |  0.88 | google uint32_t
|    55.0% | 13.21 |  75,702,505.04 | 1.6% |  1.62 | google uint64_t

| relative | ns/op |           op/s | err% | total | varint encode
|---------:|------:|---------------:|-----:|------:|:--------------
|   100.0% |  7.22 | 138,464,875.81 | 2.9% |  0.86 | google uint32_t
|    56.4% | 12.80 |  78,102,820.79 | 1.9% |  1.55 | google uint64_t

decode results for 3 runs:

| relative | ns/op |           op/s | err% | total | varint decode
|---------:|------:|---------------:|-----:|------:|:--------------
|   100.0% | 26.93 |  37,132,748.98 | 1.9% |  3.26 | google uint32_t
|   101.0% | 26.67 |  37,490,487.35 | 3.4% |  3.23 | google uint64_t

| relative | ns/op |           op/s | err% | total | varint decode
|---------:|------:|---------------:|-----:|------:|:--------------
|   100.0% | 26.64 |  37,543,305.23 | 1.4% |  3.23 | google uint32_t
|   102.7% | 25.93 |  38,558,548.08 | 1.2% |  3.17 | google uint64_t

| relative | ns/op |           op/s | err% | total | varint decode
|---------:|------:|---------------:|-----:|------:|:--------------
|   100.0% | 27.45 |  36,434,999.13 | 0.8% |  3.31 | google uint32_t
|   103.1% | 26.62 |  37,568,188.30 | 2.8% |  3.23 | google uint64_t

Compiled with VS2019 16.8 MSVC /O2.
Runs on Win10, [email protected].

check pyperf tunings

See if the system is properly configured. See what pyperf actually does under its hood, and then at least check that the proper values are set, much like sudo python3 -m pyperf system. Warn when something is not set.

Request: Ability to disable output

Perhaps there's already a way, but I couldn't find how to disable the output altogether. The use case is to keep benchmarking with different data sizes, so that one can programmatically determine algorithm thresholds using nanobench.
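A minimal sketch, assuming an output() setter backing the mOut member visible in the header excerpt above, plus results()/median() accessors for programmatic inspection:

#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>
#include <cstdint>

int main() {
    ankerl::nanobench::Bench bench;
    bench.output(nullptr); // no results table is printed at all
    uint64_t x = 1;
    bench.run("threshold probe", [&] { ankerl::nanobench::doNotOptimizeAway(x += x); });
    // Pull the numbers programmatically instead, e.g. the median elapsed
    // time per op, to drive a threshold search over data sizes.
    double elapsed = bench.results().front().median(
        ankerl::nanobench::Result::Measure::elapsed);
    (void)elapsed;
}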

Randomly unstable results on Alder Lake, Win 11

It can give hugely differing results (about 100 times slower) when running AVX2 code:

// Includes inferred from usage; the original snippet omitted them:
#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>
#include <immintrin.h>
#include <new>

using ankerl::nanobench::Bench; // the snippet uses Bench unqualified

int main(int argc, char ** argv)
{
	alignas(32) float res[8];
	float * mem = static_cast<float *>(operator new(256, std::align_val_t(32)));
	float * mulmem = static_cast<float *>(operator new(256, std::align_val_t(32)));

	Bench().run("simd", [&]()
	{
		__m256 simdvec_ = _mm256_loadu_ps(mem);
		__m256 simdvecmul_ = _mm256_loadu_ps(mulmem);
		simdvec_ = _mm256_mul_ps(simdvec_, simdvecmul_);
		_mm256_storeu_ps(res, simdvec_);
	});
	return static_cast<int>(res[0]); // read res so the stores aren't elided
}

Sometimes when I rebuild it reports about 0.5 ns/op, and when I relaunch it reports about 29 ns/op. I think it is related to the Windows 11 thread scheduler and/or to my processor being an Alder Lake i5-12600K with E-cores.

Google Benchmark seems to give more consistent results, about 0.23 ns.

Clarify whether CPU or wallclock time is used

In Google Benchmark docs:

If the benchmarked code itself uses threads and you want to compare it to single-threaded code, you may want to use real-time ("wallclock") measurements for latency comparisons... Without UseRealTime, CPU time is used by default.

and

By default, the CPU timer only measures the time spent by the main thread. If the benchmark itself uses threads internally, this measurement may not be what you are looking for. Instead, there is a way to measure the total CPU usage of the process, by all the threads.

Does nanobench use wallclock or CPU time? And if CPU time, for all threads or the main thread only? I'd assume wallclock, but I'm not certain. You may also want to make it configurable.
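For what it's worth, the Clock alias from nanobench.h quoted in the macOS build log above points at wall-clock time rather than per-thread CPU time. Reconstructed here for reference; the log line is truncated, so the steady_clock fallback is an assumption:

#include <chrono>
#include <type_traits>

// nanobench.h's Clock alias (as quoted in the macOS log): a steady wall
// clock, not a CPU-time clock such as CLOCK_PROCESS_CPUTIME_ID.
using Clock = std::conditional<std::chrono::high_resolution_clock::is_steady,
                               std::chrono::high_resolution_clock,
                               std::chrono::steady_clock>::type;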

[feature] single run benchmarks

All microbenchmarking software runs the code under benchmark in a loop. This leads to the CPU running with everything hot: instructions cached as microcode, branch predictions trained for 1M iterations ahead.

My feature request is to add some single-run tests, timed via rdtsc, before the common benchmark:

  • a first run, while the code is cold, produces a cold timestamp-counter reading
  • a second and third run of the same code (with data and instructions cached) produce two hot TSC readings

Then run the current benchmark. The benchmark statistics should include the cold and the two hot TSC readings for each test.

These three numbers (one cold and two hot runs) provide information about the cold-run time (the first run, when the instructions are not yet in the CPU cache) and about running cached instructions (and data). The two hot numbers allow evaluating jitter. These numbers will not be as accurate as running the code 1M times and taking the dispersion, but they provide useful information about running times without a trained predictor.
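A minimal sketch of the proposed cold/hot measurement, assuming x86 and GCC/Clang's __rdtsc intrinsic (MSVC has the same intrinsic in <intrin.h>):

#include <cstdint>
#include <cstdio>
#include <x86intrin.h> // __rdtsc

// Run `op` once and return the elapsed TSC ticks.
template <typename Op>
uint64_t timeOnce(Op&& op) {
    uint64_t start = __rdtsc();
    op();
    return __rdtsc() - start;
}

int main() {
    volatile int sink = 0;
    auto op = [&] { for (int i = 0; i < 1000; ++i) sink = sink + i; };
    uint64_t cold = timeOnce(op); // instructions/data not yet cached
    uint64_t hot1 = timeOnce(op); // warm caches and branch predictor
    uint64_t hot2 = timeOnce(op); // hot1 vs hot2 shows the jitter
    std::printf("cold=%llu hot=%llu/%llu ticks\n",
                (unsigned long long)cold,
                (unsigned long long)hot1,
                (unsigned long long)hot2);
}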

Won't compile due to undefined-reference errors

It compiles fine with g++ but fails with clang++, with e.g.:

nb.cpp:(.text+0x2f): undefined reference to `ankerl::nanobench::Config::Config()'
nb.cpp:(.text+0xcf): undefined reference to `ankerl::nanobench::Config::~Config()'
nb.cpp:(.text+0x11d): undefined reference to `ankerl::nanobench::Config::~Config()'
/tmp/nb-dff5f7.o: In function `ankerl::nanobench::Result ankerl::nanobench::Config::run<main::$_0>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, main::$_0)':
nb.cpp:(.text+0x197): undefined reference to `ankerl::nanobench::detail::IterationLogic::IterationLogic(ankerl::nanobench::Config const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)'
nb.cpp:(.text+0x1af): undefined reference to `ankerl::nanobench::detail::IterationLogic::numIters() const'
nb.cpp:(.text+0x26f): undefined reference to `ankerl::nanobench::detail::IterationLogic::add(std::chrono::duration<long, std::ratio<1l, 1000000000l> >)'
nb.cpp:(.text+0x2a2): undefined reference to `ankerl::nanobench::detail::IterationLogic::result() const'
nb.cpp:(.text+0x2de): undefined reference to `ankerl::nanobench::detail::IterationLogic::result() const'
clang: error: linker command failed with exit code 1 (use -v to see invocation)

This is likely because clang++ somehow fails to resolve the definitions of those methods.

Idea? Timing only particular sections of code

Recently I've been testing a few different sorting algorithms. The rough setup has a preallocated block of memory filled with random data. However, that means for each benchmark run after a sort, I have to scramble the data or generate more random data, and this code is shared between all the benchmarks. This roughly translates to a benchmark that really measures the cost of those two things together, rather than just the sorting algorithm on its own. Presumably the sort algorithm overwhelms the cost of generating new data, but it's difficult to gauge exactly how much generating data costs without benchmarking that part on its own.

It seems that with a few edits to add additional callbacks, it would be possible to add timings for, or even ignore, parts of the code which are not really part of the test. If a second callback doesn't make the code more unstable or slower to test, it'd probably be a handy tool for cases like this.

Might look like:

    template <typename Start, typename Op, typename End>
    ANKERL_NANOBENCH(NOINLINE)
    Bench& run(std::string const& benchmarkName, Start&& start, Op&& op, End&& end);
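To make the requested semantics concrete, here is a hand-rolled sketch of what the start/op/end split would measure; plain chrono timing, not nanobench API:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    std::vector<int> data(100000);
    std::iota(data.begin(), data.end(), 0);
    std::mt19937 rng{42};
    std::chrono::nanoseconds timed{0};
    for (int i = 0; i < 100; ++i) {
        // "start" callback: untimed per-iteration setup (the re-scramble).
        std::shuffle(data.begin(), data.end(), rng);
        // "op" callback: the only part that contributes to the timing.
        auto t0 = std::chrono::steady_clock::now();
        std::sort(data.begin(), data.end());
        timed += std::chrono::steady_clock::now() - t0;
        // "end" callback would run here, also untimed (e.g. verification).
    }
    std::printf("sort only: %.1f ns per run\n", timed.count() / 100.0);
}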
