Valgrind

Cachegrind

A Valgrind tool that simulates a multi-level cache and branch predictor, reporting per-line cache misses and branch mispredictions.

Why?

Hardware performance counters (perf stat) give aggregate miss rates. Cachegrind tells you which source line caused the miss.

What it reports:

  • I1, D1, LL cache refs, misses, miss rates
  • Branch counts and mispredict rates
  • Per-function and per-line breakdown

Invocation:

valgrind --tool=cachegrind ./mybin
cg_annotate cachegrind.out.<pid>        # per-source-line report

Caveats

  • Simulated, not measured: the cache model is generic. It uses your CPU’s cache sizes but does not model real coherency traffic or prefetching quirks. Good for spotting patterns, not for absolute numbers
  • Slow: typically 20 to 100x runtime. Run on a short representative workload
  • Single-threaded view: no coherency / false-sharing modelling. Use perf c2c for that

Companions:

  • Callgrind: extends Cachegrind with a call graph. Open callgrind.out.<pid> in KCachegrind for a clickable hot-path view
  • Memcheck: default Valgrind tool, finds memory bugs, unrelated to cache

From ECE459 L28

Penalty numbers [Dev15]

  • Fast-cache miss: ~10 cycles
  • Miss all the way to memory: ~200 cycles
  • Mispredicted branch: 10 to 30 cycles

Cache levels reported: I1 (L1 instruction), D1 (L1 data), LL (last level). On a CPU with three cache levels, the LL simulation uses your L3’s sizes.

Compile optimized

Counter-intuitive vs other Valgrind tools: compile with optimizations on for Cachegrind, because you want to see what happens in the released binary. Keep debug symbols (-g) so counts can still be mapped to source lines.
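A minimal sketch of that build-then-profile sequence, wrapped in a shell function. The function name build_and_profile and the file names are hypothetical, gcc is assumed as the compiler, and -O2 is only one reasonable choice of optimization level. --cachegrind-out-file fixes the output name so there is no <pid> to look up.

```shell
# build_and_profile SRC BIN
# Builds SRC with optimizations plus debug symbols, then runs it under Cachegrind.
# (Hypothetical helper for illustration; adjust compiler and flags to your project.)
build_and_profile() {
    # -O2: profile what the released binary does.
    # -g:  keep debug symbols so cg_annotate can map counts to source lines.
    gcc -O2 -g -o "$2" "$1" || return 1

    # Both simulations are requested explicitly rather than relying on defaults.
    valgrind --tool=cachegrind --cache-sim=yes --branch-sim=yes \
        --cachegrind-out-file="cachegrind.out.$2" "./$2" || return 1

    # Per-source-line report.
    cg_annotate "cachegrind.out.$2"
}

# Example (search.c is a hypothetical source file):
#   build_and_profile search.c search
```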

Invocation

valgrind --tool=cachegrind --branch-sim=yes ./search
cg_annotate cachegrind.out.<pid>

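--branch-sim=yes is needed since branch sim is off by default. On recent Valgrind releases the cache simulation is also off by default, so add --cache-sim=yes if the cache columns are missing from the output.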

Reading the results

Sample -O0 vs -O2 run of a simple search:

  • Instruction/data miss rates barely moved
  • Branch mispredict rate went up slightly (10.7% to 10.8%), but total branches dropped, so the absolute number of mispredicts, and with it the wasted cycles, went down. Net win

To reason about “did this change help”, estimate wasted cycles: (L1 misses * ~10) + (LL misses * ~200) + (mispredicts * 10 to 30), using the penalty numbers above. Sum across categories and compare before/after.
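A rough sketch of that before/after comparison, using the penalty numbers listed earlier (~10 cycles per fast-cache miss, ~200 per miss to memory, 10 to 30 per mispredict). The 20-cycle mispredict penalty is an assumed midpoint, the function name wasted_cycles is hypothetical, and the counts are made up for illustration, not taken from a real run.

```shell
# Approximate penalties in cycles. BRANCH_PENALTY is an assumed midpoint of 10 to 30.
L1_PENALTY=10
LL_PENALTY=200
BRANCH_PENALTY=20

# wasted_cycles L1_MISSES LL_MISSES MISPREDICTS
# Prints a rough estimate of cycles lost to cache misses and branch mispredicts.
wasted_cycles() {
    echo $(( $1 * L1_PENALTY + $2 * LL_PENALTY + $3 * BRANCH_PENALTY ))
}

# Hypothetical totals from two Cachegrind runs (illustrative numbers only).
before=$(wasted_cycles 1000 50 4000)
after=$(wasted_cycles 900 50 2500)

echo "before=$before after=$after saved=$(( before - after ))"
# prints: before=100000 after=69000 saved=31000
```

The estimate is only good for comparing two runs of the same program; the penalties are ballpark figures, so the absolute totals mean little on their own.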

cg_annotate output

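Per-line columns: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim.

  • Ir, I1mr, ILmr: instructions executed, I1 misses, LL instruction misses
  • Dr, D1mr, DLmr: data reads, D1 read misses, LL read misses
  • Dw, D1mw, DLmw: data writes, D1 write misses, LL write misses
  • Bc, Bcm: conditional branches executed, conditional branches mispredicted
  • Bi, Bim: indirect branches executed, indirect branches mispredicted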
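The raw columns are counts, so rates have to be derived. A small sketch, assuming the D1 miss rate is (D1mr + D1mw) / (Dr + Dw) and the mispredict rate is (Bcm + Bim) / (Bc + Bi). The helper name rate_percent is hypothetical and the counts are made-up illustration values.

```shell
# rate_percent MISSES REFS
# Prints MISSES / REFS as a percentage with one decimal place (0.0 if REFS is 0).
rate_percent() {
    awk -v m="$1" -v r="$2" 'BEGIN {
        if (r == 0) print "0.0"
        else printf "%.1f\n", 100 * m / r
    }'
}

# Hypothetical counts for one function (illustrative only).
Dr=700;  D1mr=30
Dw=300;  D1mw=10
Bc=900;  Bcm=100
Bi=100;  Bim=8

d1_rate=$(rate_percent $(( D1mr + D1mw )) $(( Dr + Dw )))
br_rate=$(rate_percent $(( Bcm + Bim )) $(( Bc + Bi )))

echo "D1 miss rate: ${d1_rate}%  mispredict rate: ${br_rate}%"
# prints: D1 miss rate: 4.0%  mispredict rate: 10.8%
```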

Use case

Very verbose. Best used to explain why a specific change helped on performance-critical code, not to answer “what should I change?”.