BenchmarkTools.hpp
// This file is part of the Acts project.
//
// Copyright (C) 2019-2020 CERN for the benefit of the Acts project
//
// This Source Code Form is subject to the terms of the Mozilla Public
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at http://mozilla.org/MPL/2.0/.

#pragma once

#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <iomanip>
#include <numeric>
#include <ostream>
#include <type_traits>
#include <utility>
#include <vector>

#include "Acts/Utilities/TypeTraits.hpp"

namespace Acts {
namespace Test {

// === INTRODUCTION ===
//
// This header contains tools which can be used to avoid unrealistic
// compiler optimizations in microbenchmarks, such as elimination of unused
// benchmark computations or deduplication of benchmark loops which repeat
// the same operation N times with unchanging inputs.
//
// ### WARNING ON HARDWARE OPTIMIZATIONS ###
//
// When using these tools, you must keep in mind that compiler optimizations
// are only a part of the story, and CPUs will also perform a bunch of dynamic
// optimizations on their own, including but by no means limited to...
//
// - Moving small benchmark inputs into fast caches
// - Buffering and caching small benchmark outputs
// - Instruction-level parallelization (pipelining, superscalar
//   execution...) of supposedly unrelated benchmark iterations
// - "Turbo" overclocking of CPU cores which is only sustainable in specific
//   circumstances (short-running code sections, single-core execution...)
// - Branch predictor training for specific loops and conditionals
//
// Unfortunately, there is no way to disable those hardware optimizations in
// a way which meets even basic sanity requirements:
//
// - Easily portable to all CPU models of interest
// - Fine-grained enough to target CPU optimizations which are an
//   unrealistic artifact of the microbenchmark, without also disabling
//   optimizations which will happen in realistic usage scenarios (like
//   hot code remaining resident in the CPU instruction cache)
// - Generic enough for all benchmark inputs, outputs, and code
// - Minimal impact on overall benchmark timings
//
// You will thus need to live with the fact that any microbenchmark has
// hardware bias, which is not as egregious as compiler bias (since the CPU
// can't just delete gigantic sections of the code like the compiler does)
// but still very much significant on simple benchmarks.
//
// The proper way to detect and correct this bias is not to try to write more
// clever microbenchmarks. It is to use realistic benchmarks of full app
// workloads whenever possible, and treat microbenchmarks only as a tool for
// detecting performance regressions in known-critical components and
// easily fine-tuning the performance of these components, in a manner whose
// effectiveness will later have to be checked on a more realistic benchmark.
//
// #########################################
//
//
// === OPTIMIZATION BARRIERS ===
//
// Mark some data as read and written in the eye of the compiler, so it can't
// optimize under the assumption that the data isn't used or does not change.
//
// The current implementation has limitations that you should bear in mind:
//
// - "clobber" data which resides in CPU registers must be flushed to memory
//   and reloaded from memory afterwards if re-used. This will increase memory
//   traffic and cache footprint in a potentially unrealistic fashion.
// - Putting this optimization barrier on every iteration of a loop will
//   prevent compiler loop optimizations like autovectorization, which is
//   generally too strong. Consider putting the barrier every N loop
//   iterations instead (and checking for various N), or at the very end of
//   the loop/benchmark if the memory traffic of storing all the end results
//   doesn't bias your benchmark too much.
// - The barrier's implementation uses a very basic subset of the GCC dialect of
//   inline assembly, which, although widely supported (from a quick godbolt
//   check, in addition to GCC it works on clang, djgpp, ellcc, icc and zapcc),
//   is most notably not supported by MSVC. I have not yet found an MSVC or
//   portable equivalent which is UB-free and works even when passing in a
//   pointer to a local variable; suggestions and patches are welcome.
//
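// As an illustrative sketch (not part of the original API documentation), the
// "barrier every N iterations" pattern recommended above could look like the
// following, using the assumeWritten()/assumeRead() helpers defined below;
// `computeThing`, `num_iters` and the batch size of 64 are made-up
// placeholders:
//
// ```
// float input = 1.234f;
// float output = 0.0f;
// for (size_t i = 0; i < num_iters; ++i) {
//   output += computeThing(input);
//   if (i % 64 == 0) {
//     assumeWritten(input);  // Pretend the input may have changed...
//     assumeRead(output);    // ...and that the accumulated output was used
//   }
// }
// assumeRead(output);  // Final result must not be optimized out
// ```
//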
#ifdef __GNUC__

template <typename T>
inline void assumeAccessed(T&& clobber) {
  // This optimization barrier leverages the fact that inline ASM isn't smart,
  // and couldn't get smarter as the compiler would then need to be able to
  // parse and understand the side effects of arbitrary assembly statements,
  // which would be a ridiculous amount of work for compiler devs:
  //
  // - The compiler only performs trivial analysis of the assembly (e.g.
  //   number of instructions for inlining purposes), so it can't leverage the
  //   fact that it's empty to optimize out its inputs. It must assume that
  //   the assembly statements will use the declared inputs, emit the declared
  //   outputs, and read/write the contents of declared clobbers.
  // - A pointer to "clobber" is declared as an input to the assembly, and the
  //   "memory" clobber allows the assembly statement to read and write memory
  //   targeted by this pointer. Therefore, the compiler must compute and
  //   write down the value of "clobber" in memory, and reload it into
  //   registers if it's used later on by the program.
  // - A volatile marker is used, therefore the compiler can't optimize out
  //   the asm statement even though from its specification it should have no
  //   program-visible side effects. It also prevents the compiler from moving
  //   the asm statement out of a loop, which would be bad... But note that it
  //   does _not_ prevent it from moving within a loop iteration.
  //
  __asm__ volatile("" : : "g"(&clobber) : "memory");
}

#else

template <typename T>
void assumeAccessed(T&& clobber) {
  // FIXME: Find a reliable optimization barrier for MSVC. As a litmus test, the
  //        assembly generated for the following code should store "42" in
  //        memory twice, not once:
  //
  //        ```
  //        int x = 42;
  //        assumeAccessed(x);
  //        x = 42;
  //        assumeAccessed(&x);
  //        ```
  //
  static_assert(false, "No optimization barrier available for this compiler");
}

#endif

// Mark some data as read in the eye of the compiler, so it can't optimize out
// the source computation under the assumption that its output isn't used.
//
// Has the same caveats as assumeAccessed().
//
template <typename T>
inline void assumeRead(const T& clobber) {
  // FIXME: I don't know of a finer-grained compiler optimization barrier that
  //        1/ can be used when one only wants to fake a read and 2/ works for
  //        all inputs (not just machine types), so for now this function is
  //        only a way to clarify developer intent.
  assumeAccessed(clobber);
}

// Mark some data as written in the eye of the compiler, so it can't optimize
// out repetitive dependent computations under the assumption that the result
// will always be the same (and more generally can't assume anything about the
// value of this input beyond type-level properties).
//
// Has the same caveats as assumeAccessed().
//
template <typename T>
inline void assumeWritten(T& clobber) {
  // FIXME: I don't know of a finer-grained compiler optimization barrier that
  //        1/ can be used when one only wants to fake a write and 2/ works for
  //        all outputs (not just machine types), so for now this function is
  //        only a way to clarify developer intent.
  assumeAccessed(clobber);
}
//
//
// === MICROBENCHMARK HARNESS ===
177 
178 // Results of a microbenchmark
179 //
180 // Holds the timings of each benchmark run, and allows the user to query various
181 // statistical information about them.
182 //
184  using Duration = std::chrono::duration<double, std::nano>;
185 
187  std::vector<Duration> run_timings;

  // Total benchmark running time
  //
  // Unless your machine has been tuned for extremely stable timings (minimal
  // background activity, no CPU frequency scaling, no disk spin-down...), you
  // shouldn't give much credence to benchmarks with running times smaller than
  // a few hundreds of milliseconds.
  //
  Duration totalTime() const {
    return std::accumulate(run_timings.cbegin(), run_timings.cend(),
                           Duration());
  }

  // Robust estimator of the benchmark iteration time
  //
  // Computed as the average iteration time of the median benchmark run, this
  // estimator provides a tunable compromise between the mean and median
  // estimators of the benchmark iteration time:
  //
  // - As a mean iteration time, it can be measured with low overhead by tuning
  //   iters_per_run up until the impact of time measurement on benchmark
  //   timings is negligible. It also converges with optimal efficiency on
  //   unbiased data as iters_per_run increases.
  // - Being based on the median run timing, this estimator is also robust
  //   against outlier measurements, such as timing spikes originating from
  //   bursts of background system load. The more benchmark runs you measure,
  //   the higher the robustness.
  //
  Duration iterTimeAverage() const {
    assert(iters_per_run > 0);
    return runTimeMedian() / iters_per_run;
  }

  // Robust estimator of the standard error on the benchmark iteration time
  Duration iterTimeError() const {
    assert(iters_per_run > 0);
    return runTimeRobustStddev() / std::sqrt(iters_per_run);
  }

  // Sorted benchmark run times, used for computing outlier-robust statistics
  std::vector<Duration> sortedRunTimes() const {
    std::vector<Duration> sorted_timings = run_timings;
    std::sort(sorted_timings.begin(), sorted_timings.end());
    return sorted_timings;
  }

  // Median time per benchmark run
  Duration runTimeMedian() const {
    assert(run_timings.size() >= 1);
    const std::vector<Duration> sorted_timings = sortedRunTimes();
    const size_t midpoint = sorted_timings.size() / 2;
    if (sorted_timings.size() % 2 == 0) {
      return (sorted_timings[midpoint - 1] + sorted_timings[midpoint]) / 2;
    } else {
      return sorted_timings[midpoint];
    }
  }

  // First and third quartiles of the benchmark run timings
  std::pair<Duration, Duration> runTimeQuartiles() const {
    // Unfortunately, quartile computations on datasets whose size is not a
    // multiple of 4 are not standardized. We use an interpolation- and
    // symmetry-based definition that satisfies all commonly agreed-upon
    // properties:
    //
    // - When the dataset size is a multiple of 4, the first quartile is
    //   the mean of the last point of the first quarter of the sorted dataset
    //   (called first_point below) and the first point of the second quarter
    //   of the dataset (which is the next point). This is universally agreed
    //   upon.
    // - More generally, when the dataset size is a multiple of 2, the first
    //   quartile is the median of the first half of the sorted dataset. Most
    //   commonly used quartile definitions agree on this, but some don't.
    // - The third quartile is defined symmetrically with respect to the first
    //   one, starting from the end of the sorted dataset and going downwards.
    //
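    // For example, with 7 sorted run times, first_point = 1, offset = 1 and
    // third_point = 5, so the first quartile is
    // (3 * sorted_timings[1] + sorted_timings[2]) / 4, i.e. one quarter of the
    // way from the 2nd towards the 3rd smallest timing, and the third quartile
    // is symmetrically one quarter of the way from the 6th timing down towards
    // the 5th one.
    //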
    assert(run_timings.size() >= 2);
    const std::vector<Duration> sorted_timings = sortedRunTimes();
    const size_t first_point = (sorted_timings.size() - 2) / 4;
    const size_t offset = (sorted_timings.size() - 2) % 4;
    const size_t third_point = (sorted_timings.size() - 1) - first_point;
    if (offset == 0) {
      return {sorted_timings[first_point], sorted_timings[third_point]};
    } else {
      const auto first_quartile = ((4 - offset) * sorted_timings[first_point] +
                                   offset * sorted_timings[first_point + 1]) /
                                  4;
      const auto third_quartile = ((4 - offset) * sorted_timings[third_point] +
                                   offset * sorted_timings[third_point - 1]) /
                                  4;
      return {first_quartile, third_quartile};
    }
  }

  // Robust estimator of benchmark run timing standard deviation
  //
  // Assuming that the run time distribution is normal aside from occasional
  // outlier pollution, an outlier-robust, unbiased, and consistent estimator of
  // its standard deviation can be built from the interquartile range via
  // the formula estimated_stddev = IQR / (2 * sqrt(2) * erf^-1(1/2)): for a
  // normal distribution, the first and third quartiles lie at
  // mean -/+ sqrt(2) * erf^-1(1/2) * stddev.
  //
  Duration runTimeRobustStddev() const {
    auto [firstq, thirdq] = runTimeQuartiles();
    // The numeric constant below is erf^-1(1/2)
    return (thirdq - firstq) / (2. * std::sqrt(2.) * 0.4769362762044698733814);
  }

  // Standardized display for benchmark statistics
  friend std::ostream& operator<<(std::ostream& os,
                                  const MicroBenchmarkResult& res) {
    auto old_precision = os.precision();
    auto old_flags = os.flags();
    os << std::fixed << res.run_timings.size() << " runs of "
       << res.iters_per_run << " iteration(s), " << std::setprecision(1)
       << res.totalTime().count() / 1'000'000 << "ms total, "
       << std::setprecision(4) << res.runTimeMedian().count() / 1'000 << "+/-"
       << res.runTimeRobustStddev().count() / 1'000 << "µs per run, "
       << std::setprecision(3) << res.iterTimeAverage().count() << "+/-"
       << res.iterTimeError().count() << "ns per iteration";
    os.precision(old_precision);
    os.flags(old_flags);
    return os;
  }
};
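
// For instance, a benchmark driver could consume these statistics along the
// following lines; `runMyBenchmark()` and the 50 ns budget are made-up
// placeholders, and <iostream> would be needed in the calling code:
//
// ```
// MicroBenchmarkResult result = runMyBenchmark();
// std::cout << result << std::endl;
// if (result.iterTimeAverage() + 3 * result.iterTimeError() >
//     MicroBenchmarkResult::Duration(50.0)) {
//   std::cerr << "Iteration time budget exceeded!" << std::endl;
// }
// ```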

// Implementation details, scroll down for more public API
namespace benchmark_tools_internal {

// General iteration of microBenchmark with inputs and outputs
template <typename Callable, typename Input, typename Result>
struct MicroBenchmarkIterImpl {
  static inline void iter(const Callable& iteration, const Input& input) {
    assumeWritten(iteration);
    assumeWritten(input);
    const auto result = iteration(input);
    assumeRead(result);
  }
};

// Specialization for void(Input) functors, where there is no output
template <typename Callable, typename Input>
struct MicroBenchmarkIterImpl<Callable, Input, void> {
  static inline void iter(const Callable& iteration, const Input& input) {
    assumeWritten(iteration);
    assumeWritten(input);
    iteration(input);
  }
};

// Specialization for Result(void) functors, where there is no input
template <typename Callable, typename Result>
struct MicroBenchmarkIterImpl<Callable, void, Result> {
  static inline void iter(const Callable& iteration) {
    assumeWritten(iteration);
    const auto result = iteration();
    assumeRead(result);
  }
};

// Specialization for void() functors, where there is no input and no output
template <typename Callable>
struct MicroBenchmarkIterImpl<Callable, void, void> {
  static inline void iter(const Callable& iteration) {
    assumeWritten(iteration);
    iteration();
  }
};

template <typename T, typename I>
using call_with_input_t = decltype(std::declval<T>()(std::declval<I>()));

template <typename T>
using call_without_input_t = decltype(std::declval<T>()());

// If Callable is a callable that takes the expected input argument type, then
// this specialization will be selected...
template <typename Callable, typename Input = void>
struct MicroBenchmarkIter {
  constexpr static bool is_callable =
      concept ::exists<call_with_input_t, Callable, Input>;
  static inline void iter(const Callable& iteration, const Input* input) {
    static_assert(is_callable, "Gave callable that is not callable with input");
    if constexpr (is_callable) {
      using Result = std::invoke_result_t<Callable, const Input&>;
      MicroBenchmarkIterImpl<Callable, Input, Result>::iter(iteration, *input);
    }
  }
};

373 
374 // If Callable is a callable that takes no argument, this specialization will be
375 // picked instead of the one above...
376 template <typename Callable>
377 struct MicroBenchmarkIter<Callable, void> {
378  constexpr static bool is_callable =
379  concept ::exists<call_without_input_t, Callable>;
380 
381  static inline void iter(const Callable& iteration, const void* = nullptr) {
382  static_assert(is_callable,
383  "Gave callable that is not callable without input");
384  if constexpr (is_callable) {
385  using Result = std::invoke_result_t<Callable>;
387  }
388  }
389 };
390 
// Common logic between iteration-based and data-based microBenchmark
template <typename Callable>
MicroBenchmarkResult microBenchmarkImpl(Callable&& run, size_t iters_per_run,
                                        size_t num_runs,
                                        std::chrono::milliseconds warmup_time) {
  using Clock = std::chrono::steady_clock;

  MicroBenchmarkResult result;
  result.iters_per_run = iters_per_run;
  result.run_timings = std::vector(num_runs, MicroBenchmarkResult::Duration());

  const auto warmup_start = Clock::now();
  while (Clock::now() - warmup_start < warmup_time) {
    run();
  }

  for (size_t i = 0; i < num_runs; ++i) {
    const auto start = Clock::now();
    run();
    result.run_timings[i] = Clock::now() - start;
  }

  return result;
}

} // namespace benchmark_tools_internal

// Run a user-provided benchmark `iteration` function that takes no argument
// in batches of `iters_per_run` iterations, for `num_runs` batches, after some
// warmup period of `warmup_time` has elapsed. Return execution statistics.
//
// The output of `iteration` is marked as read, so the compiler cannot optimize
// it out except by const-propagation (i.e. when it is a function of a constexpr
// quantity). If `iteration` is a lambda which captures inputs from a higher
// scope, those are marked as written on every iteration as well, so the
// compiler cannot optimize out the iteration loop into a single iteration.
//
// Together, these precautions remove the need for manually calling assumeRead
// and assumeWritten in all simple cases where the benchmark iteration function
// ingests some inputs from an outer scope and emits the output of its
// computations as a return value.
//
// We do batched runs instead of using a single iteration loop because:
// - It allows us to provide error bars on the timing measurement, which allows
//   comparing two measurements and detecting timing jitter problems emerging
//   from e.g. excessive background system activity.
// - For short-running iteration functions, it allows you to keep benchmark runs
//   much shorter than one OS scheduling quantum (typically ~1ms), which enables
//   high-precision measurements devoid of any scheduling-induced timing jitter,
//   while still aggregating as much statistics as you want via `num_runs`.
//
// For optimal timing resolution on modern x86 CPUs, you will want to tune
// `iters_per_run` so that the median run timing is around a few tens of µs:
// - The TSC CPU clock which is most likely used by your operating system's
//   high-resolution clock has an overhead of a few tens of CPU cycles, so for
//   a ~GHz CPU clock this gives you a clock-related bias and noise of ~10ns. At
//   ~10µs run times, that only contributes ~1/1000 of the observed timings.
// - The main source of benchmark noise is that your operating system's
//   scheduler disturbs the program every few milliseconds. With run timings of
//   ~10µs, this disturbance affects less than 1% of data points, and is thus
//   effectively eliminated by our outlier-robust statistics.
//
// You shouldn't usually need to adjust the number of runs and warmup time, but
// here are some guidelines for those times where you need to:
// - `num_runs` is a somewhat delicate compromise between several concerns:
//   * Quality of error bars (many runs mean more precise error bars)
//   * Outlier rejection (many runs mean better outlier rejection)
//   * Benchmark running time (many runs take longer)
//   * Handling of short-lived background disturbances (these have a higher
//     chance of occurring in longer-lived benchmarks, but if they only take
//     a small portion of the benchmark's running time, they can be taken
//     care of by outlier rejection instead of polluting the results)
// - `warmup_time` should be chosen based on the time it takes for run timings
//   to reach a steady state on your system after a benchmark starts. Here is a
//   possible measurement protocol for that:
//   * Set up a long-running benchmark (several seconds) with no warmup.
//   * Dump benchmark run timings to a file at the end of every execution.
//   * Run this benchmark a couple of times (say, 5-10 times).
//   * Plot the resulting run timing time series against their cumulative sum
//     (which is a good approximation of the elapsed time).
//   * Note after how much elapsed time the timings typically become steady.
//

template <typename Callable>
MicroBenchmarkResult microBenchmark(
    Callable&& iteration, size_t iters_per_run, size_t num_runs = 20000,
    std::chrono::milliseconds warmup_time = std::chrono::milliseconds(2000)) {
  return benchmark_tools_internal::microBenchmarkImpl(
      [&] {
        for (size_t iter = 0; iter < iters_per_run; ++iter) {
          benchmark_tools_internal::MicroBenchmarkIter<Callable>::iter(
              iteration);
        }
      },
      iters_per_run, num_runs, warmup_time);
}
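//
// A hypothetical usage sketch for this overload (not part of this header):
// benchmark a tiny computation captured by a lambda, with 1000 iterations per
// run, and print the resulting statistics; <iostream> would be needed in the
// calling code:
//
// ```
// double x = 0.1234;
// const auto stats =
//     Acts::Test::microBenchmark([&] { return std::sqrt(x); }, 1000);
// std::cout << stats << std::endl;
// ```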

// Same idea as above, but the iteration function takes one argument, which is
// taken from the `inputs` collection. A run is one iteration through `inputs`.
//
// This variant is convenient when you want to test on random data in order to
// avoid the bias of always benchmarking on the same data, but don't want to
// "see" the cost of random number generation in your benchmark timings.
template <typename Callable, typename Input>
MicroBenchmarkResult microBenchmark(
    Callable&& iterationWithInput, const std::vector<Input>& inputs,
    size_t num_runs = 20000,
    std::chrono::milliseconds warmup_time = std::chrono::milliseconds(2000)) {
  return benchmark_tools_internal::microBenchmarkImpl(
      [&] {
        for (const auto& input : inputs) {
          benchmark_tools_internal::MicroBenchmarkIter<Callable, Input>::iter(
              iterationWithInput, &input);
        }
      },
      inputs.size(), num_runs, warmup_time);
}
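//
// A hypothetical usage sketch for this overload (not part of this header):
// pre-generate random inputs so that random number generation does not show up
// in the timings, then let each run iterate once through the dataset;
// <random>, <algorithm> and <iostream> would be needed in the calling code:
//
// ```
// std::mt19937 rng(42);
// std::uniform_real_distribution<double> dist(0., 1.);
// std::vector<double> inputs(512);
// std::generate(inputs.begin(), inputs.end(), [&] { return dist(rng); });
// const auto stats = Acts::Test::microBenchmark(
//     [](double x) { return std::sqrt(x); }, inputs);
// std::cout << stats << std::endl;
// ```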

} // namespace Test
} // namespace Acts