Benchmarking Pitfalls

Ways a benchmark number becomes a lie. From “Liar, Liar” (ECE459 L29) and Andy Georges et al. “Statistically Rigorous Java Performance Evaluation.”

  • Dead-code elimination: the compiler proves your computed value isn’t used and deletes the loop. Touch the result (black_box, volatile write, print it).
  • Constant folding: inputs are compile-time constants; compiler precomputes. Make inputs opaque to the compiler.
  • No warm-up: first runs hit cold caches, cold JIT, cold page tables. Discard warm-up iterations.
  • One sample: noise on a laptop is ±10–30%. Run many; report median + p95, not min, not “last run”.
  • CPU frequency scaling: idle CPUs downclock; a short benchmark runs on a low frequency until the governor ramps up. Pin frequency or run long enough.
  • Noisy neighbour: background process, anti-virus, Docker build. Quiesce the system.
  • Branch predictor / cache state leaking between runs: interleave A and B runs, not all A then all B.
  • Different inputs between runs: randomness matters; fix seeds.
  • Wall-clock vs CPU time confusion: parallel programs look slow on CPU-time, fast on wall-clock.
  • Microbenchmark that doesn’t match production: optimizing the microbenchmark makes the real program slower (Runway effect).

Corollary to Rule 4 of Crista’s Laws: a controlled experiment with untrustworthy measurement is no experiment at all.

From ECE459 L29 “Liar, Liar”

Police-and-prosecutor analogy: a profiler collects evidence, then builds a narrative, and both parts can lie. The helicopter-with-still-blades YouTube video: every frame is a correct sample, but the camera's frame rate matches the blade rotation, so the blades look frozen. Code samples can alias the same way.

Sampling misses what it isn’t looking at

A periodic interrupt handler whose runs consistently fall between samples is invisible to the profiler, no matter how much CPU it consumes.

mfence vs lock: attribution lie [Khu14]

Spinlock profiles said the spinlocks were cheap, yet removing them helped far more than the profile predicted.

Microbench of uncached reads + concurrency control:

No atomic/fence:   2.81e9 cycles
lock inc/dec:      3.66e9 cycles
mfence:           19.60e9 cycles

perf annotate showed mfence costing only 15% where locks cost 40%. But total runtime with mfence is over 5× higher than with locks. mfence causes a pipeline flush, and the flushed instructions get the blame rather than the mfence itself. Net effect: profilers overestimate the cost of locks and underestimate the cost of mfence.

Lesson: a lower percentage of a larger total is not a win.

Skid

The instruction pointer in a sample is where the CPU was when the PMU-overflow interrupt got handled, not where the counter actually overflowed. Blame lands on whatever ran next, often a cheap NOP.

ld r1,0x12341234   0.1%
add r2,r3          1.0%
sub r3,r4          1.0%
NOP              27.0%   ← actually the load

Modern Intel and AMD chips have hardware support for low- or no-skid sampling (Intel's PEBS, AMD's IBS); it isn't used by default (overhead? distortion in specific cases?).

Long-tail lies [Luu16]

Averages hide bimodal and long-tail distributions. A Google disk-read latency histogram showed, on top of the expected peaks, extra peaks at 250, 500, 750, and 1000 ms; p99 = 696 ms.

  • Peak 1: RAM cache.
  • Peak 2 ~3 ms: disk cache via PCIe.
  • Peak 3 ~25 ms: real seek.
  • Extra peaks at quarter-second multiples.

Cause: kernel CPU-quota throttling. A process over its quota was put to sleep until the next quarter-second boundary; if still over quota, it was slept again. Escaping the throttled phase required a sparsely loaded quarter-second to come along by chance. This was happening on 25% of Google's disk servers, about 30 minutes a day (up to 23 hours at a stretch), for 3 years.

Sampling rate is capped

Default perf on Lucene yielded no useful information; perf at its maximum rate, only slightly more. SHIM (instrumentation-based), sampling at a dramatically higher rate, was actually useful. perf samples via interrupts, and cranking the interrupt frequency up consumes the very CPU being measured. SHIM instead instruments the program itself to emit a record on function return: more expensive per event, but with no interrupt cost. DTrace and nnethercote's counts enable similar custom instrumentation.

Making counters deterministic (Rust compiler hackers)

Hardware perf counters are ~5 orders of magnitude more deterministic than wall time, but still noisy. To reduce noise:

  • Disable ASLR: randomized pointers randomize hash layouts.
  • Subtract IRQ time.
  • Profile one thread only if possible.

AMD speculates past atomics then rolls back, but doesn’t roll back perf counters. Post-Spectre there’s a hidden SpecLockMap MSR that disables that speculation.

Calling-context lies (gprof) [Kre13]

gprof combines two sources:

  • profil(): statistical IP samples, 100 Hz.
  • mcount(): exact call-graph edges (via -pg instrumentation).

Combining statistical and exact data produces bogus inferences for any function whose per-call runtime depends on its inputs and which is called from multiple contexts. Canonical failure: easy and hard each call a shared function once; the call from hard eats nearly all of that function's CPU, yet gprof divides its total time by call count and charges each caller 50%. The inference is reliable only for functions with a single caller or with constant per-call cost (e.g. rand()).

Take-away

Focus on the metric you actually care about; understand how your tool works; if a result doesn’t make sense, dig in.