Memory Barrier

A memory barrier (also called a memory fence) is an instruction that prevents memory accesses from being reordered across it. No access after the barrier becomes visible until all accesses before it have become visible. Covered in ECE459 L15.

Why do we need barriers?

Compilers and CPUs reorder loads and stores for speed. A barrier is the low-level tool that says “this happens before that”, analogous to a semaphore at a higher level.

x86 barriers

mfence: all loads and stores before become visible before any loads and stores after
sfence: all stores before become visible before all stores after
lfence: all loads before become visible before all loads after

An sfence only orders the stores of the CPU that executes it. Another CPU still needs its own load ordering (an lfence or mfence in this model) to read them in the right order.

Spin-wait flag
f = 0
/* thread 1 */                  /* thread 2 */
while (f == 0) /* spin */;      x = 42;
// memory fence                 // memory fence
printf("%d", x);                f = 1;
The fence in thread 2 ensures x = 42 is visible before f = 1, and the fence in thread 1 keeps the read of x from moving ahead of the read of f, so when thread 1 escapes the spin it prints 42.
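
The same pattern can be sketched in Rust with explicit fences. This is a minimal sketch, not the course's code: the function name spin_wait_demo is made up for illustration, and x is an AtomicI32 accessed with Relaxed because safe Rust does not allow a plain shared mutable integer.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicI32, Ordering};
use std::thread;

/// Runs both threads of the spin-wait sketch and returns the value
/// thread 1 read from x. (Hypothetical helper, named for illustration.)
fn spin_wait_demo() -> i32 {
    let f = AtomicBool::new(false); // the flag, f = 0
    let x = AtomicI32::new(0);      // the payload

    thread::scope(|s| {
        // thread 2: write the payload, fence, then raise the flag
        s.spawn(|| {
            x.store(42, Ordering::Relaxed);   // x = 42
            fence(Ordering::Release);         // memory fence
            f.store(true, Ordering::Relaxed); // f = 1
        });

        // thread 1: spin on the flag, fence, then read the payload
        let reader = s.spawn(|| {
            while !f.load(Ordering::Relaxed) {
                std::hint::spin_loop();
            }
            fence(Ordering::Acquire);         // memory fence
            x.load(Ordering::Relaxed)
        });

        reader.join().unwrap()
    })
}
```

Calling spin_wait_demo() from main and printing the result corresponds to the printf in thread 1. The release fence before the flag store pairs with the acquire fence after the flag load, which is what makes the write to x visible to the reader.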

In Rust atomics

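Ordering::Acquire: accesses after the load cannot move before it
Ordering::Release: accesses before the store cannot move after it
Ordering::SeqCst: Acquire/Release plus one total order over all SeqCst operations, acts like a full fence, restores SC
Ordering::Relaxed: atomicity only, no ordering, no fence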
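
One place where the choice of ordering is observable is the store-buffering litmus test: each thread stores to its own variable, then loads the other one. A sketch, assuming a made-up helper name store_buffering_round. With SeqCst on every access, both threads reading 0 is ruled out. With only Release stores and Acquire loads that outcome would be allowed, since it is exactly the W→R reordering.

```rust
use std::sync::atomic::{AtomicI32, Ordering};
use std::thread;

/// One round of the store-buffering litmus test with SeqCst everywhere.
/// Returns (what thread 1 read from y, what thread 2 read from x).
/// (Hypothetical helper, named for illustration.)
fn store_buffering_round() -> (i32, i32) {
    let x = AtomicI32::new(0);
    let y = AtomicI32::new(0);

    thread::scope(|s| {
        let t1 = s.spawn(|| {
            x.store(1, Ordering::SeqCst); // W x
            y.load(Ordering::SeqCst)      // R y
        });
        let t2 = s.spawn(|| {
            y.store(1, Ordering::SeqCst); // W y
            x.load(Ordering::SeqCst)      // R x
        });
        (t1.join().unwrap(), t2.join().unwrap())
    })
}
```

Under SeqCst the four accesses fall into a single total order, so whichever store comes first in that order is seen by the other thread's load, and at least one of the two results is 1.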

Fence vs instruction-level ordering

Acquire and Release are semantic contracts, not “emit lfence”. The compiler satisfies them two ways:

block its own reordering at compile time
pick machine instructions whose architectural semantics already carry the ordering

On AArch64 an acquire-load is typically just LDAR, and a release-store is STLR, no separate barrier instruction needed. On x86 (TSO), only W→R can be reordered, so an ordinary load is already an acquire and an ordinary store is already a release; the compiler just refuses to reorder across the op. A real fence only shows up when the ISA’s plain instructions are too weak, e.g. SeqCst on x86 needs mfence (or a locked op) at the W→R boundary, and streaming stores (MOVNT*) need sfence before a publish.

Rust doc hint

The std::sync::atomic docs note that if you need an acquire load on read-only memory, you can do a relaxed load followed by fence(Acquire) instead. The synchronization you get is the same, but the implementation differs: ordering on the op itself, or a weaker op plus a fence.
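
The two forms side by side, as a sketch. The function names (ready_acquire, ready_relaxed_then_fence, publish_and_read) are made up for illustration, and the memory here is ordinary writable memory, so this only shows that both readers synchronize with the same Release store.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU32, Ordering};
use std::thread;

/// Ordering carried by the load itself.
fn ready_acquire(flag: &AtomicBool) -> bool {
    flag.load(Ordering::Acquire)
}

/// Weaker load, followed by a standalone acquire fence.
fn ready_relaxed_then_fence(flag: &AtomicBool) -> bool {
    let seen = flag.load(Ordering::Relaxed);
    fence(Ordering::Acquire);
    seen
}

/// Publishes 7 behind a Release store and reads it back through one of
/// the two reader forms. (Hypothetical helper, named for illustration.)
fn publish_and_read(use_fence: bool) -> u32 {
    let data = AtomicU32::new(0);
    let flag = AtomicBool::new(false);

    thread::scope(|s| {
        // writer: payload first, then the Release store that publishes it
        s.spawn(|| {
            data.store(7, Ordering::Relaxed);
            flag.store(true, Ordering::Release);
        });

        // reader: wait for the flag using the selected form, then read
        let reader = s.spawn(|| {
            loop {
                let ready = if use_fence {
                    ready_relaxed_then_fence(&flag)
                } else {
                    ready_acquire(&flag)
                };
                if ready {
                    break;
                }
                std::hint::spin_loop();
            }
            data.load(Ordering::Relaxed)
        });

        reader.join().unwrap()
    })
}
```

Either way, once the reader has seen the flag set it is guaranteed to see the payload written before the Release store.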

Cost

Fences block reorderings the compiler and CPU would otherwise use for speed, and can force a thread to wait until its earlier accesses are visible to other threads. SC generally needs fences (or fence-like instructions such as locked ops), which is why it is expensive. Use the weakest ordering that is still correct.

🛠️ Steven Gong
