Cachegrind
A Valgrind tool that simulates a multi-level cache and branch predictor, reporting per-line cache misses and branch mispredictions.
Why?
Hardware performance counters (perf stat) give aggregate miss rates. Cachegrind tells you which source line caused the miss.
What it reports:
- I1, D1, LL cache refs, misses, miss rates
- Branch counts and mispredict rates
- Per-function and per-line breakdown
Invocation:
valgrind --tool=cachegrind ./mybin
cg_annotate cachegrind.out.<pid> # per-source-line report
Caveats
- Simulated, not measured: cache model is generic (matches your CPU’s sizes but not real coherency traffic or prefetching quirks). Good for patterns, not absolute numbers
- Slow: typically 20 to 100x runtime. Run on a short representative workload
- Single-threaded view: no coherency / false-sharing modelling. Use perf c2c for that
Companions:
- Callgrind: extends Cachegrind with a call graph. Open callgrind.out in KCachegrind for a clickable hot-path view
- Memcheck: default Valgrind tool, finds memory bugs, unrelated to cache
From ECE459 L28
Penalty numbers [Dev15]
- Fast-cache miss: ~10 cycles
- Miss all the way to memory: ~200 cycles
- Mispredicted branch: 10 to 30 cycles
Cache levels reported: I1, D1, LL (LL reuses your L3’s sizes for the simulation).
Compile optimized
Counter-intuitively (vs. other Valgrind tools), compile with optimizations on for Cachegrind: you want to see what happens in the released binary. Keep debug symbols (-g) so cg_annotate can map counts back to source lines.
Invocation
valgrind --tool=cachegrind --branch-sim=yes ./search
cg_annotate cachegrind.out.<pid>
--branch-sim=yes is needed since branch sim is off by default.
Reading the results
Sample -O0 vs -O2 run of a simple search:
- Instruction/data miss rates barely moved
- Branch mispredict rate barely changed (10.8% to 10.7%), but total branches dropped, so wasted cycles went down. Net win
To reason about “did this change help?”, estimate wasted cycles using the penalty numbers above: (memory misses × ~200) + (mispredicts × ~20), sum across categories, compare before/after.
cg_annotate output
Per-line columns: Ir, I1mr, ILmr, Dr, D1mr, DLmr, Dw, D1mw, DLmw, Bc, Bcm, Bi, Bim. That is: instruction refs + I1 misses + LL-i misses; then data reads and writes with the same ref/L1-miss/LL-miss pattern; then conditional branches + mispredicts and indirect branches + mispredicts.
Use case
Very verbose. Best used to explain why a specific change helped on performance-critical code, not to answer “what should I change?”.