Instruction-Level Parallelism

Out-of-Order Execution (OoO)

Out-of-order execution lets the CPU execute instructions in whatever order their operands become ready, rather than strictly in program order, and reorder the results at commit so the program still appears sequential from the outside.

Why?

Strict in-order execution forces the whole pipeline to wait on the slowest pending instruction (usually a cache miss at 200–300 cycles). That’s catastrophic in cache-miss-dominated code. OoO lets the CPU skip past the pending miss and keep chewing on independent work, ideally starting the next miss before the first finishes so their latencies overlap instead of stacking. It’s the single biggest lever for hiding memory latency.

OoO only works because three other tricks cover its flanks: register renaming (breaks false WAR/WAW dependences so independent work can run ahead), speculative execution (guesses branch outcomes so the instruction window doesn't stall at every jump), and store buffers (hold speculative stores out of L1 until they are safe to commit).

Worked example from L06: 7 instructions (x86-style pseudo-assembly) with 2 cache misses:

ld rax, rbx+16    ; MISS #1 — assume cache miss
add rbx, 16       ; ADD doesn't need rax, keeps going (renaming makes this safe)
cmp rax, 0        ; needs rax, queues until available
jeq null_chk      ; needs cmp result — speculate "not taken"
st rbx-16, rcx    ; speculative store (goes to store buffer, not L1)
ld rcx, rdx       ; MISS #2 — now 2 misses in flight, 1 speculative op
ld rax, rax+8     ; must wait for MISS #1
Scenario                     Cycles
Serialized misses (naive)    ~600
Overlapped via OoO           ~305

Misses complete at cycles 300 and 304, and all 7 instructions finish in ~305 cycles. Nearly a 2× speedup, entirely from starting miss #2 in miss #1's shadow.

Static vs dynamic ILP. Intel’s Itanium bet on the compiler doing this work ahead of time (static). x86 bet on the hardware doing it at runtime (dynamic) with bigger reorder buffers and more functional units. Dynamic won.

From ECE459 L06.