Memory Consistency Model

A memory consistency model is the contract that specifies which orderings of memory operations are observable across threads. Covered in CS343 §10 and ECE459 L15.

Why do we need one?

Without a model, “event A happens before event B” has no meaning across threads. The model names the guarantees so locks, atomics, and fences can be reasoned about.

Relaxation models

Hardware models differ in which reorderings they allow between loads and stores to different addresses, and in whether cache updates may be delayed (so reads can return stale values):

Model                        | W→R | R→W | W→W | Lazy cache | Real hardware
Atomic Consistent (AT)       |     |     |     |            | slow/impossible (distributed)
Sequential Consistency (SC)  |     |     |     | yes        | none natively; strongest useful
Total Store Order (TSO)      | yes |     |     | yes        | x86, SPARC
Partial Store Order (PSO)    | yes |     | yes | yes        |
Weak Order (WO)              | yes | yes | yes | yes        | ARM, Alpha
Release Consistency (RC)     | yes | yes | yes | yes        | PowerPC + explicit atomic R/W syncs
  • AT: events occur instantaneously, impossible on real distributed hardware
  • SC: events are not instantaneous, so reads may be stale, but Dekker/Peterson still work
  • TSO (x86): only write-then-read can be reordered (store buffer). Most SC software still works with fences at critical W→R boundaries
  • WO (ARM): all four reorderings allowed, so software mutex requires fences
  • RC: explicit acquire/release points, hardware is free between them

Key principle

No user races + strong locks ⇒ SC semantics. Build locks with hardware atomics plus fences, protect all shared data with locks, and programmer-visible behaviour collapses back to SC even on a WO machine.

Where reorderings come from

  • Compiler: moves unrelated instructions to fill load-delay slots, and may do anything under undefined behaviour
  • Hardware: the CPU runs instructions in whatever order it finds convenient
  • Cross-thread visibility: thread A’s read can be reordered before thread B’s write becomes visible, so A sees stale data

Programming against it

Rust adopts C++'s memory model (Rust's std::sync::atomic::Ordering mirrors C++'s std::memory_order on std::atomic). The orderings are RC-style: at each atomic op you pick which reorderings are blocked, trading ordering strength for speed.

Relaxed (memory_order_relaxed)

Atomicity only, no ordering. The op itself is indivisible (no torn reads/writes), but the compiler and CPU can move unrelated accesses freely across it. No happens-before is established with other threads. Use it for independent counters:

hits.fetch_add(1, memory_order_relaxed);  // metrics counter, nobody sequences on this

Acquire (memory_order_acquire)

Applies to loads. Forbids later reads/writes from being reordered before this load. Think of it as “after I’ve loaded this value, everything below must stay below.”

Release (memory_order_release)

Applies to stores. Forbids earlier reads/writes from being reordered after this store. Think of it as “before I publish this value, everything above must actually be done.”

Acquire + Release compose into the publish/subscribe pattern that makes most lock-free code work:

data = 123;                           // (1) above the release
flag.store(1, memory_order_release);  // (2) nothing above can slip below
// ... another thread ...
while (flag.load(memory_order_acquire) == 0) {}  // (3) nothing below can slip above
assert(data == 123);                  // (4) guaranteed to see (1)

AcqRel (memory_order_acq_rel)

Both acquire and release, for read-modify-write ops like fetch_add that logically do a load then a store.

SeqCst (memory_order_seq_cst, default)

Strongest. Acquire + Release plus a single total order across all SeqCst ops on all threads. Needed when you have multiple independent flags and the code relies on their relative ordering. Expensive on weak-memory hardware (ARM inserts full fences).

Rule of thumb

Use SeqCst until profiling proves otherwise. Drop to Acquire/Release for publish/subscribe patterns. Drop to Relaxed only for counters nobody sequences on.

Real bug (Crossbeam PR #98)

A lock-free queue reported garbage in registers. The fix: the load of ready needed at least Acquire, and the store needed Release. Without them, the thread parked too early [O’C18].

Fences in practice

  • sfence / lfence / mfence: x86 store/load/full fences
  • Java volatile and C++11 std::atomic with default memory_order_seq_cst insert fences automatically
  • __asm__ __volatile__("" ::: "memory"): compiler barrier only, no hardware fence