Memory Reordering
Memory reordering occurs when memory operations execute, or become visible to other threads, in a different order than the source program specifies. Covered in ECE459 L15.
Why does it happen?
Both the compiler and the CPU look for reorderings that speed up execution (e.g. filling load-delay slots). Separately, writes from one thread may become visible to another out of program order.
Two sources
- Compiler reordering: the compiler swaps unrelated instructions to fill stalls caused by slow loads
- Hardware reordering: the CPU runs the given sequence in whatever order it prefers, and writes from other threads may not be visible yet
A compiler-reordering example, delaying the println! that depends on a slow load of thing.y so the two independent adds fit into the stall:
// before
let x = thing.y;
println!("x = {}", x);
z = z + 1;
a = b + c;
// after
let x = thing.y;
z = z + 1;
a = b + c;
println!("x = {}", x);
Why compiler reordering is faster
A memory load doesn’t finish in one cycle. Reading thing.y might take ~4 cycles on an L1 hit, ~12 from L2, ~40 from L3, or hundreds from DRAM. The CPU issues the load and moves on, but the destination register isn’t populated for a while.
Before reordering:
cycle 1: load x = thing.y // load issued, takes ~4 cycles
cycle 2: println!(x) // needs x, not ready -> STALL
cycle 3: (bubble)
cycle 4: (bubble)
cycle 5: println!(x) // x finally ready
cycle 6: z = z + 1
cycle 7: a = b + c
After reordering:
cycle 1: load x = thing.y
cycle 2: z = z + 1 // independent, runs while load in flight
cycle 3: a = b + c // independent, runs while load in flight
cycle 4: (maybe 1 bubble)
cycle 5: println!(x) // x ready now
The load itself takes the same time either way; you can’t make memory faster. What changes is what the CPU does while it waits: the two adds fit into cycles that would otherwise be bubbles, so they cost no extra time.
Independence is the legality condition
z = z + 1 and a = b + c don’t touch x or thing.y, so moving them earlier doesn’t change the result. If the line were z = x + 1, the compiler couldn’t hoist it; it would just stall in the new spot.
What about the loads for b and c?
The example assumes b, c, and z are already in registers, so the adds are 1-cycle ops. If they aren’t, their loads get issued in parallel with the load of thing.y via memory-level parallelism (the cache handles multiple outstanding loads, bounded by load ports / MSHRs), so their latencies overlap instead of stacking. The only thing you can never overlap is dependent work: any instruction reading a value still in flight stalls.
What “slot” means
A slot is an instruction issue position in the pipeline. The CPU fills something there either way; the question is whether it’s useful work or a no-op bubble.
Is a slot just a cycle?
Related but not identical. A cycle is a unit of time (one clock tick); a slot is an issue position within that cycle. On a scalar in-order machine (one instruction per cycle), slot and cycle coincide. A superscalar core issues multiple instructions per cycle, so each cycle has several slots, often specialized (e.g. 2 ALU slots, 1 load slot, 1 branch slot).
The delay slot of a load is the position(s) right after the load, while the value isn’t yet available. “Filling the slot” means putting an independent instruction there instead of letting it become a stall.
On classic MIPS this was architectural: the ISA guaranteed that the instruction at PC+4 after a load ran before the load’s result was visible, and the compiler had to either find useful work for that slot or emit a nop. Modern ISAs dropped the explicit slot, but the underlying latency is still there, now handled by hardware interlocks.
Out-of-order vs in-order cores
- Modern x86, Apple Silicon, and similar: hardware reorders dynamically via the reorder buffer, so compile-time scheduling matters less, but still helps for long-latency misses where the compiler can hoist the load further than the finite reorder window reaches
- Simpler in-order cores (embedded ARM, older MIPS, many RISC-V): can’t reorder at all, so the compiler’s scheduling is the whole game; an unfilled slot is a guaranteed stall
What does not get reordered
Operations with clear data dependencies (e.g. z *= 2; z += 1;) stay in order. The compiler and hardware both see the dependency.
The dangerous case is reordering across what you think are synchronization boundaries but are not marked as such:
// before
lock(mutex)
point.x = 42;
point.y = -42;
point.z = 0;
unlock(mutex)
// after reordering
lock(mutex)
point.x = 42;
point.y = -42;
unlock(mutex)
point.z = 0; // leaked outside critical section!
The compiler sees point.z = 0 and the mutex unlock as having no data dependency, and is happy to swap them.
Hardware variance
- 386: no reordering
- Modern x86: usually no visible reordering, with specific exceptions (e.g. a store may become visible after a later load to a different address)
- ARM: weak ordering except where data dependencies exist [Pre12]
Fix with memory barriers, or atomic types with appropriate Ordering.