Replication (Optimization)

Replication gives each thread its own private copy of a value so threads don’t fight over a shared cache line. The aggregate is reconstructed on demand by summing/merging the per-thread copies. It trades space (one padded copy per thread, so at least N cache lines rather than N × sizeof(T)) for eliminating false sharing and cache-coherence traffic on the hot path.

Why replicate instead of lock-free atomics?

Even a lock-free atomic counter serializes on its cache line: every increment must pull the line into the incrementing core in exclusive state, so ownership bounces between cores and coherence traffic grows with thread count. A replicated counter lets each thread write to its own line with zero contention, and pays only when someone reads the total. For metrics that are written constantly and read rarely (request counters, hit counts), replication can be orders of magnitude faster.
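To make the contrast concrete, here is a minimal sketch (thread and increment counts, and names like run_replicated, are illustrative). Both versions yield the same final total; the shared atomic bounces one cache line between every incrementing core, while the replicated slots stay core-local until the merge:

```cpp
#include <atomic>
#include <thread>
#include <vector>

constexpr int kThreads    = 4;                   // illustrative fixed thread count
constexpr int kIncrements = 100000;

// Contended version: every core fights over this one cache line.
std::atomic<long> shared_counter{0};
void shared_inc() { shared_counter.fetch_add(1, std::memory_order_relaxed); }

// Replicated version: one padded slot per thread, no line bouncing on writes.
struct alignas(64) Slot { std::atomic<long> v{0}; };
Slot slots[kThreads];

long run_replicated() {
    std::vector<std::thread> ts;
    for (int t = 0; t < kThreads; ++t)
        ts.emplace_back([t] {
            for (int i = 0; i < kIncrements; ++i)
                slots[t].v.fetch_add(1, std::memory_order_relaxed);  // private line
        });
    for (auto &th : ts) th.join();
    long s = 0;                                  // merge on the (rare) read path
    for (auto &sl : slots) s += sl.v.load(std::memory_order_relaxed);
    return s;
}
```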

Sketch

struct alignas(64) PaddedCounter {             // pad each slot to a full cache line
    std::atomic<long> v{0};                    // alignas on the struct makes the
};                                             // compiler add the tail padding
PaddedCounter counters[ NTHREADS ];            // one slot per thread
 
void inc( int tid ) { counters[tid].v.fetch_add( 1, std::memory_order_relaxed ); }
 
long total() {
    long s = 0;
    for ( auto &c : counters ) s += c.v.load( std::memory_order_relaxed );
    return s;                                  // may be slightly stale — fine for metrics
}
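One question the sketch leaves open is where tid comes from. A common approach (a sketch, assuming a capped thread count; my_slot and kMaxThreads are illustrative names) is to hand out slot indices lazily via a thread_local registration:

```cpp
#include <atomic>
#include <cassert>

constexpr int kMaxThreads = 64;                  // illustrative cap on thread count

struct alignas(64) PaddedCounter { std::atomic<long> v{0}; };
PaddedCounter counters[kMaxThreads];

std::atomic<int> next_slot{0};

// Each thread claims a slot index once, on first use.
int my_slot() {
    thread_local int slot = next_slot.fetch_add(1, std::memory_order_relaxed);
    assert(slot < kMaxThreads);                  // exceeding the cap is a bug here
    return slot;
}

void inc() { counters[my_slot()].v.fetch_add(1, std::memory_order_relaxed); }

long total() {
    long s = 0;
    for (auto &c : counters) s += c.v.load(std::memory_order_relaxed);
    return s;
}
```

Slots are never reclaimed in this sketch; real implementations that must survive thread churn recycle slot indices on thread exit.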

When to reach for it

  • Write-heavy, read-rare metrics (counters, histograms).
  • Monotonically-growing aggregates (sum, max, count).
  • Bounded thread count (per-core or per-thread slots known in advance).
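For a non-sum aggregate such as max, only the merge step changes; the shape is identical. A sketch (record_max and global_max are illustrative names):

```cpp
#include <algorithm>
#include <atomic>

constexpr int kSlots = 8;                        // illustrative per-thread slot count

struct alignas(64) MaxSlot { std::atomic<long> m{0}; };
MaxSlot max_slots[kSlots];

// Writer: raise this thread's private max; the CAS loop only retries
// against concurrent writers to the SAME slot, which the design rules out,
// so in practice it runs once.
void record_max(int tid, long sample) {
    auto &m = max_slots[tid].m;
    long cur = m.load(std::memory_order_relaxed);
    while (sample > cur &&
           !m.compare_exchange_weak(cur, sample, std::memory_order_relaxed)) {}
}

// Reader: merge by taking the max across all slots.
long global_max() {
    long best = 0;
    for (auto &s : max_slots)
        best = std::max(best, s.m.load(std::memory_order_relaxed));
    return best;
}
```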

Not for

  • Anything where readers need an exact snapshot (inventory, bank balance).
  • Structures with cross-thread invariants (a replicated size() on a queue is meaningless).

Relationship to false sharing

Replication is the cure for false sharing, but only if each copy is padded to its own cache line; otherwise adjacent slots still collide and the contention has merely been relocated. Hence the alignas(64) (or manual padding) in the sketch.
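The padding requirement can be checked at compile time. A sketch, assuming a 64-byte line (C++17 also offers std::hardware_destructive_interference_size in <new> as a portable hint, though support varies by standard library):

```cpp
#include <atomic>

struct alignas(64) PaddedCounter { std::atomic<long> v{0}; };

// alignas(64) forces both the struct's alignment and (via tail padding) its
// size up to a 64-byte multiple, so adjacent array slots cannot share a line.
static_assert(alignof(PaddedCounter) == 64, "each slot starts on its own line");
static_assert(sizeof(PaddedCounter) % 64 == 0, "adjacent slots never overlap a line");
```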