Instruction-Level Parallelism

Miss Shadow

A miss shadow is the window between a load that misses in cache being issued and the loaded value first being used, during which a modern CPU keeps executing unrelated instructions instead of stalling.

Why?

A load from DRAM takes 200–300 cycles. A naive CPU that stalls the instant it can’t find the value in cache would throw away hundreds of cycles per miss, and in cache-miss-dominated code, that’s most of the runtime. The miss shadow is how the CPU gets useful work done while waiting.

Cost of a load depends on where the value lives:

Where the value lives    Cost
L1 cache                 2–3 cycles
L2 / L3                  in between
DRAM                     200–300 cycles
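
To see why stalling on every miss can eat most of the runtime, here is a rough back-of-the-envelope sketch. The miss rate, the base cost of one cycle per instruction, and the function name stall_fraction are all illustrative assumptions, not figures from the lecture; only the 200-cycle DRAM cost comes from the table above.

```python
def stall_fraction(instructions, miss_rate, miss_penalty, base_cpi=1.0):
    """Fraction of all cycles a stall-on-miss CPU spends waiting on DRAM.

    Every parameter here is an illustrative assumption, not a measurement.
    """
    busy = instructions * base_cpi                      # cycles doing useful work
    stalled = instructions * miss_rate * miss_penalty   # cycles waiting on misses
    return stalled / (busy + stalled)


# Assumed workload: 1 instruction in 50 is a load that misses to DRAM.
frac = stall_fraction(instructions=1_000_000, miss_rate=0.02, miss_penalty=200)
print(f"{frac:.0%} of cycles spent stalled")  # prints "80% of cycles spent stalled"
```

Even with a miss on only 2% of instructions, the assumed naive CPU spends four cycles waiting for every cycle of work.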

Naive CPU: issue the load, stall until the value arrives in the destination register.

Modern CPU: keep issuing instructions. Hardware tracks “this register isn’t ready yet”; only stall when an instruction actually reads it.

ld  rax, [mem]    ; MISS: value arrives in ~200 cycles
add rbx, 16       ; runs in the shadow (doesn't touch rax)
cmp rcx, 0        ; runs in the shadow
jeq label         ; runs in the shadow (speculated)
...
mov rdx, rax      ; ← first USE of rax. Stalls if not ready.
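
The difference between the two CPUs can be sketched with a toy cycle counter. This is a simplified model under stated assumptions, not how any real pipeline is built: instructions issue in order, at most one per cycle, and a dictionary stands in for the "register isn't ready yet" tracking. The function run, the tuple layout, and the 100 filler instructions (standing in for independent work inside the shadow) are hypothetical.

```python
def run(program, stall_on_use):
    """Return the cycle at which the last result becomes available.

    program: list of (dest, sources, latency) tuples, issued in order,
    at most one per cycle.
    """
    ready = {}    # register -> cycle its value becomes available
    cycle = 0     # earliest cycle the next instruction may issue
    finish = 0
    for dest, sources, latency in program:
        # An instruction cannot issue until all of its sources are ready.
        issue = max([cycle] + [ready.get(r, 0) for r in sources])
        done = issue + latency
        if dest is not None:
            ready[dest] = done
        finish = max(finish, done)
        # Stall-on-use: move on next cycle. Naive: block until the result arrives.
        cycle = issue + 1 if stall_on_use else done
    return finish


MISS = 200  # assumed DRAM miss latency in cycles
program = (
    [("rax", [], MISS)]               # ld  rax, [mem]  (misses)
    + [("rbx", ["rbx"], 1)] * 100     # 100 independent adds (hypothetical filler)
    + [("rdx", ["rax"], 1)]           # mov rdx, rax    (first use of rax)
)

naive = run(program, stall_on_use=False)
shadowed = run(program, stall_on_use=True)
print(naive, shadowed)  # prints "301 201"
```

In this model the 100 independent instructions cost nothing extra on the stall-on-use CPU because they fit entirely inside the shadow; the naive CPU pays for the miss and then for the instructions.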

Multiple loads in flight. The CPU isn’t limited to one pending miss. Depending on the architecture, two or more loads can be outstanding at once, so a second ld inside the shadow starts its own shadow, and the two miss latencies overlap instead of stacking. This is what out-of-order (OoO) execution exploits to turn cache-miss-dominated code from “wait, wait, wait” into “wait once for several misses at once.”
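
Overlap versus stacking can be illustrated with another small model. It assumes independent loads that all miss, one load issued per cycle, and a fixed cap on outstanding misses; the name last_arrival and the cap values are hypothetical, and real hardware limits differ by architecture.

```python
import heapq


def last_arrival(num_loads, miss_latency, max_outstanding):
    """Cycle at which the final loaded value arrives.

    Assumes every load misses, loads are independent, one load issues per
    cycle, and at most max_outstanding misses are in flight at once.
    """
    in_flight = []   # min-heap of completion cycles for pending misses
    cycle = 0
    last = 0
    for _ in range(num_loads):
        if len(in_flight) == max_outstanding:
            # Every miss slot is busy: wait for the earliest miss to return.
            cycle = max(cycle, heapq.heappop(in_flight))
        done = cycle + miss_latency
        heapq.heappush(in_flight, done)
        last = max(last, done)
        cycle += 1
    return last


print(last_arrival(4, 200, max_outstanding=1))  # prints 800: latencies stack
print(last_arrival(4, 200, max_outstanding=2))  # prints 401: two overlap at a time
print(last_arrival(4, 200, max_outstanding=4))  # prints 203: wait once for all four
```

With four slots, the four misses cost barely more than one; with a single slot, the model degenerates to the naive "wait, wait, wait" case.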

From ECE459 L06.