Cachegrind
A Valgrind tool that simulates a multi-level cache and branch predictor, reporting per-line cache misses and branch mispredictions.
Why?
Hardware performance counters (perf stat) give aggregate miss rates. Cachegrind tells you which source line caused the miss.
What it reports:
- I1, D1, LL cache refs, misses, miss rates
- Branch counts and mispredict rates
- Per-function and per-line breakdown
Invocation:
valgrind --tool=cachegrind ./mybin
cg_annotate cachegrind.out.<pid> # per-source-line report
Caveats
- Simulated, not measured: cache model is generic (matches your CPU’s sizes but not real coherency traffic or prefetching quirks). Good for patterns, not absolute numbers
- Slow: typically 20 to 100x runtime. Run on a short representative workload
- Single-threaded view: no coherency / false-sharing modelling. Use perf c2c for that
Companions:
- Callgrind: extends Cachegrind with a call graph. Open callgrind.out in KCachegrind for a clickable hot-path view
- Memcheck: default Valgrind tool, finds memory bugs, unrelated to cache
From ECE459 L28
Penalty numbers [Dev15]
- Fast-cache miss: ~10 cycles
- Miss all the way to memory: ~200 cycles
- Mispredicted branch: 10 to 30 cycles
Cache levels reported: I1, D1, LL (LL reuses your L3’s sizes for the simulation).
Compile optimized
Counter-intuitively (vs. other Valgrind tools), compile with optimizations on for Cachegrind: you want to see what happens in the released binary. Keep debug symbols (-g) so cg_annotate can map counts back to source lines.
Invocation
valgrind --tool=cachegrind --branch-sim=yes ./search
cg_annotate cachegrind.out.<pid>
--branch-sim=yes is needed since branch sim is off by default.
Reading the results
Sample -O0 vs -O2 run of a simple search:
- Instruction/data miss rates barely moved
- Branch mispredict rate barely changed (10.8% to 10.7%), but total branches dropped, so wasted cycles went down. Net win
To reason about “did this change help?”, estimate wasted cycles using the penalty numbers above: (memory misses × ~200) + (mispredicts × ~20), sum across categories, compare before/after.
cg_annotate output
Per-line columns: Ir, I1mr, ILmr, Dr, D1mr, DLmr, Dw, D1mw, DLmw, Bc, Bcm, Bi, Bim. That is: instruction refs + I1 misses + LL-i misses; then data reads and writes with the same ref/L1-miss/LL-miss pattern; then conditional branches + mispredicts and indirect branches + mispredicts.
Use case
Very verbose. Best used to explain why a specific change helped on performance-critical code, not to answer “what should I change?”.