Memory Barrier

A memory barrier is an instruction that prevents memory reordering across itself. No access after the barrier becomes visible until all accesses before it have become visible. Covered in ECE459 L15.

Why do we need barriers?

Compilers and CPUs reorder loads and stores for speed. A barrier is the low-level tool that says “this happens before that”, analogous to a semaphore at a higher level.

x86 barriers

  • mfence: all loads and stores before become visible before any loads and stores after
  • sfence: all stores before become visible before all stores after
  • lfence: all loads before become visible before all loads after

An sfence only orders the stores of the CPU that executes it. A CPU that reads those values still needs its own load ordering (lfence or mfence) to observe them in that order.

Spin-wait flag

f = 0
/* thread 1 */                  /* thread 2 */
while (f == 0) /* spin */;      x = 42;
// memory fence                 // memory fence
printf("%d", x);                f = 1;

Thread 2's fence ensures x = 42 is visible before f = 1, and thread 1's fence keeps the read of x from moving ahead of the read of f, so when thread 1 escapes the spin it prints 42.
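The same handoff can be sketched in Rust with explicit fences. This is a minimal sketch, not the lecture's code: the function name spin_wait_handoff is hypothetical, and x is made an atomic (accessed with Relaxed) so the example stays in safe Rust. The two fence calls sit exactly where the "memory fence" comments are in the pseudocode.

```rust
use std::hint;
use std::sync::atomic::{fence, AtomicI32, Ordering};
use std::thread;

/// Runs the two-thread handoff once and returns the value of x
/// that the spinning thread observed. (Hypothetical helper name.)
fn spin_wait_handoff() -> i32 {
    let x = AtomicI32::new(0);
    let f = AtomicI32::new(0);

    thread::scope(|s| {
        // Thread 2: write the payload, fence, then raise the flag.
        s.spawn(|| {
            x.store(42, Ordering::Relaxed);
            fence(Ordering::Release); // the write to x stays before the flag write
            f.store(1, Ordering::Relaxed);
        });

        // Thread 1: spin on the flag, fence, then read the payload.
        let reader = s.spawn(|| {
            while f.load(Ordering::Relaxed) == 0 {
                hint::spin_loop();
            }
            fence(Ordering::Acquire); // the read of x stays after the flag read
            x.load(Ordering::Relaxed)
        });

        reader.join().unwrap()
    })
}
```

Calling spin_wait_handoff() returns 42: the release fence before the flag store pairs with the acquire fence after the flag load.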

In Rust atomics

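  • Ordering::Acquire: accesses after the operation cannot move before it
  • Ordering::Release: accesses before the operation cannot move after it
  • Ordering::SeqCst: Acquire/Release semantics plus a single total order over all SeqCst operations; this restores sequential consistency (SC)
  • Ordering::Relaxed: atomicity only, no ordering constraint and no fence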
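A short sketch of how these orderings look when attached to the operations themselves. The function names relaxed_counter and publish_with_release_acquire are illustrative assumptions, not from the lecture. The first relies only on atomicity; the second relies on a Release store pairing with an Acquire load.

```rust
use std::hint;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

/// Relaxed: every increment is atomic, but nothing is ordered relative
/// to other memory. Joining the threads (end of the scope) is what
/// makes the final count visible to the caller.
fn relaxed_counter(threads: u64, per_thread: u64) -> u64 {
    let hits = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..per_thread {
                    hits.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    hits.load(Ordering::Relaxed)
}

/// Release/Acquire: the Release store publishes the earlier payload
/// write; an Acquire load that sees `true` also sees that write.
fn publish_with_release_acquire() -> u64 {
    let payload = AtomicU64::new(0);
    let ready = AtomicBool::new(false);
    thread::scope(|s| {
        s.spawn(|| {
            payload.store(7, Ordering::Relaxed);
            ready.store(true, Ordering::Release);
        });
        let consumer = s.spawn(|| {
            while !ready.load(Ordering::Acquire) {
                hint::spin_loop();
            }
            payload.load(Ordering::Relaxed)
        });
        consumer.join().unwrap()
    })
}
```

Relaxed is enough for the counter because no other data depends on it; the publish case needs Release/Acquire because the payload read must not be observed before the flag.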

Fence vs instruction-level ordering

Acquire and Release are semantic contracts, not “emit lfence”. The compiler satisfies them two ways:

  • block its own reordering at compile time
  • pick machine instructions whose architectural semantics already carry the ordering

On AArch64 an acquire-load is typically just LDAR and a release-store is STLR, with no separate barrier instruction needed. On x86 (TSO), the only reordering the hardware allows is a later load passing an earlier store (W→R), so an ordinary load is already an acquire and an ordinary store is already a release; the compiler just refuses to reorder across the op. A real fence only shows up when the ISA’s plain instructions are too weak: for example, SeqCst on x86 needs mfence (or a locked op) at the W→R boundary, and streaming stores (MOVNT*) need sfence before a publish.

Rust doc hint

An acquire load on read-only memory can be written as a relaxed load + fence(Acquire). Same semantics, different implementation: ordering on the op itself, or a weaker op plus a fence.
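The two forms can be written side by side. This is a minimal sketch with hypothetical helper names (load_acquire, load_relaxed_then_fence); both return the same value and give the same acquire guarantee, and the second is the form suited to read-only memory.

```rust
use std::sync::atomic::{fence, AtomicU32, Ordering};

/// Ordering carried by the load itself.
fn load_acquire(cell: &AtomicU32) -> u32 {
    cell.load(Ordering::Acquire)
}

/// Weaker load followed by a standalone acquire fence.
fn load_relaxed_then_fence(cell: &AtomicU32) -> u32 {
    let value = cell.load(Ordering::Relaxed);
    fence(Ordering::Acquire); // later accesses cannot move before this point
    value
}
```

Which form compiles to fewer instructions depends on the target, so the choice between them is usually about what the memory permits rather than speed.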

Cost

Fences block reorderings the compiler and CPU would otherwise use for speed, and can stall the issuing core until its earlier accesses have completed. SC generally needs fences or equally strong instructions, which is why it is expensive. Use the weakest ordering that is still correct.