Memory Reordering
Memory reordering occurs when memory operations execute, or become visible to other threads, in a different order than the source program specifies. Covered in ECE459 L15.
Why does it happen?
Both the compiler and the CPU look for reorderings that speed up execution (e.g. filling load-delay slots). Separately, writes from one thread may become visible to another out of program order.
Two sources
- Compiler reordering: the compiler swaps unrelated instructions to fill stalls caused by slow loads
- Hardware reordering: the CPU runs the given sequence in whatever order it prefers, and writes from other threads may not be visible yet
A compiler-reordering example, delaying the println! that depends on a slow load of thing.y so the two independent adds fit into the stall:
// before
let x = thing.y;
println!("x = {}", x);
z = z + 1;
a = b + c;
// after
let x = thing.y;
z = z + 1;
a = b + c;
println!("x = {}", x);
Why compiler reordering is faster
A memory load doesn’t finish in one cycle. Reading thing.y might take ~4 cycles on an L1 hit, ~12 from L2, ~40 from L3, or hundreds from DRAM. The CPU issues the load and moves on, but the destination register isn’t populated for a while.
Before reordering:
cycle 1: load x = thing.y // load issued, takes ~4 cycles
cycle 2: println!(x) // needs x, not ready -> STALL
cycle 3: (bubble)
cycle 4: (bubble)
cycle 5: println!(x) // x finally ready
cycle 6: z = z + 1
cycle 7: a = b + c
After reordering:
cycle 1: load x = thing.y
cycle 2: z = z + 1 // independent, runs while load in flight
cycle 3: a = b + c // independent, runs while load in flight
cycle 4: (maybe 1 bubble)
cycle 5: println!(x) // x ready now
The load itself takes the same time either way; you can’t make memory faster. What changes is what the CPU does while it waits: the two adds fit into cycles that would otherwise be bubbles, so they cost no extra time.
Independence is the legality condition
z = z + 1 and a = b + c don’t touch x or thing.y, so moving them earlier doesn’t change the result. If the line were z = x + 1, the compiler couldn’t hoist it; it would just stall in the new spot.
What about the loads for b and c?
The example assumes b, c, and z are already in registers, so the adds are 1-cycle ops. If they aren’t, their loads get issued in parallel with the load of thing.y via memory-level parallelism (the cache handles multiple outstanding loads, bounded by load ports / MSHRs), so their latencies overlap instead of stacking. The only thing you can never overlap is dependent work: any instruction reading a value still in flight stalls.
What “slot” means
A slot is an instruction issue position in the pipeline. The CPU fills something there either way; the question is whether it’s useful work or a no-op bubble.
Is a slot just a cycle?
Related but not identical. A cycle is a unit of time (one clock tick); a slot is an issue position within that cycle. On a scalar in-order machine (one instruction per cycle), slot and cycle coincide. A superscalar core issues multiple instructions per cycle, so each cycle has several slots, often specialized (e.g. 2 ALU slots, 1 load slot, 1 branch slot).
The delay slot of a load is the position(s) right after the load, while the value isn’t yet available. “Filling the slot” means putting an independent instruction there instead of letting it become a stall.
On classic MIPS this was architectural: the ISA guaranteed that the instruction at PC+4 after a load ran before the load’s result was visible, and the compiler had to either find useful work for that slot or emit a nop. Modern ISAs dropped the explicit slot, but the underlying latency is still there, now handled by hardware interlocks.
Out-of-order vs in-order cores
- Modern x86, Apple Silicon, and similar: hardware reorders dynamically via the reorder buffer, so compile-time scheduling matters less, but still helps for long-latency misses where the compiler can hoist the load further than the finite reorder window reaches
- Simpler in-order cores (embedded ARM, older MIPS, many RISC-V): can’t reorder at all, so the compiler’s scheduling is the whole game; an unfilled slot is a guaranteed stall
What does not get reordered
Operations with clear data dependencies (e.g. z *= 2; z += 1;) stay in order. The compiler and hardware both see the dependency.
The dangerous case is reordering across what you think are synchronization boundaries but are not marked as such:
// before
lock(mutex)
point.x = 42;
point.y = -42;
point.z = 0;
unlock(mutex)
// after reordering
lock(mutex)
point.x = 42;
point.y = -42;
unlock(mutex)
point.z = 0; // leaked outside critical section!
The compiler sees point.z = 0 and the mutex unlock as having no data dependency, and is happy to swap them.
Hardware variance
- 386: no reordering
- Modern x86: usually no visible reordering, with specific exceptions (e.g. a store may become visible after a later load to a different address)
- ARM: weak ordering except where data dependencies exist [Pre12]
Fix with memory barriers, or atomic types with appropriate Ordering.