Profiling

Causal Profiling

A profiling technique that answers the question conventional profilers can’t: if I made this function faster, how much faster would the whole program get?

Why?

Traditional sampling profilers show where time is spent, but in parallel programs a hot function may not be on the critical path. Speeding it up might do nothing. Coz (Curtsinger and Berger, 2015) solves this.

Mechanism, virtual speedup:

  • Pick a line of interest
  • Instead of actually speeding it up, pause all other threads whenever that line executes, for a duration proportional to the desired speedup
  • Measure end-to-end throughput delta
  • The relative improvement of the whole program matches what you’d see from a real speedup
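The steps above can be sketched numerically. This is a toy deterministic two-thread model, not Coz's actual sampling-based bookkeeping: thread A executes the line of interest f repeatedly, thread B does independent work, and the "experiment" pauses B whenever f runs, then subtracts the inserted delay from the measured wall time.

```python
# Toy model of virtual speedup (illustrative numbers; hypothetical helpers).

def real_speedup_runtime(n_f, t_f, t_other_a, t_b, d):
    """Program runtime if f (n_f executions of t_f each, on thread A)
    were genuinely sped up by fraction d."""
    a_end = n_f * (1 - d) * t_f + t_other_a
    return max(a_end, t_b)

def virtual_speedup_prediction(n_f, t_f, t_other_a, t_b, d):
    """Coz-style experiment: leave f alone, pause thread B for d * t_f
    every time f executes, then subtract the total inserted delay from
    the measured wall time to get the predicted runtime."""
    total_delay = n_f * d * t_f
    a_end = n_f * t_f + t_other_a     # A runs unmodified
    b_end = t_b + total_delay         # B is paused while f runs
    measured = max(a_end, b_end)
    return measured - total_delay

# Case 1: f is on the critical path. A real 50% speedup and the
# virtual experiment predict the same 10.0s runtime (down from 20.0s).
assert real_speedup_runtime(20, 1.0, 0.0, 10.0, 0.5) == 10.0
assert virtual_speedup_prediction(20, 1.0, 0.0, 10.0, 0.5) == 10.0

# Case 2: f is NOT on the critical path (thread B dominates). A real
# speedup changes nothing, and the virtual experiment predicts exactly that.
assert real_speedup_runtime(10, 1.0, 0.0, 20.0, 0.5) == 20.0
assert virtual_speedup_prediction(10, 1.0, 0.0, 20.0, 0.5) == 20.0
```

Case 2 is the payoff: a conventional profiler would flag f as hot, but the causal experiment correctly reports zero end-to-end benefit.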

Output: a plot of “virtual speedup applied” vs “program speedup achieved” for each candidate hotspot. Flat lines are not worth optimizing; steep lines are.

What it catches

In parallel code, lock contention, synchronization, and cache coherency bottlenecks don’t show up as obviously hot functions. Coz can identify “speeding up X would help” even when X is a small fraction of sampled time.

Extension: SCoz (system-wide causal profiling) for scenarios Coz can’t instrument cleanly.

From ECE459 L28

The core insight [CB15]

Speeding up function f by factor d is equivalent to pausing everything else for d * t(f) whenever f runs. Relative timings are identical, so implement the “speedup” by inserting pauses into all other threads. This is virtual speedup (no source change required).
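The “relative timings are identical” claim can be checked directly: at the instant f completes, every other thread has made exactly the progress it would have made under a real speedup. A minimal arithmetic check, with illustrative numbers:

```python
# Equivalence check (toy numbers): compare another thread's progress
# at the moment f completes, under real vs virtual speedup.
t_f, d = 4.0, 0.25
other_rate = 3.0  # units of work per second on some other thread

# Real speedup: f finishes at (1-d)*t_f; the other thread ran the whole time.
progress_real = other_rate * (1 - d) * t_f

# Virtual speedup: f finishes at t_f, but the other thread was paused
# for d*t_f of that window, so it only ran for t_f - d*t_f.
progress_virtual = other_rate * (t_f - d * t_f)

assert progress_real == progress_virtual == 9.0
```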

Naive subtraction fails

You cannot “just subtract 10% of work()’s time from the total.” Speeding work() could increase lock contention, reorder dependencies, etc. You need simulation.
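One concrete failure mode of naive subtraction, as a toy two-thread model (illustrative numbers; real failures also include lock contention and dependency reordering, which this model doesn't capture): work() accounts for 40% of sampled CPU time but sits off the critical path.

```python
# Naive subtraction over-predicts when work() is off the critical path.
t_work = 8.0    # time thread A spends in work()  (40% of sampled time)
t_b    = 12.0   # independent work on thread B (the real bottleneck)
total  = max(t_work, t_b)                 # 12.0 end-to-end

d = 0.5  # imagine making work() 50% faster
naive_prediction = total - d * t_work     # subtract half of work()'s time
actual = max((1 - d) * t_work, t_b)       # thread B still takes 12.0

assert naive_prediction == 8.0   # subtraction promises a 33% win
assert actual == 12.0            # reality: no improvement at all
```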

Graph of outcomes

For each candidate line, Coz plots virtual speedup applied vs program-level speedup achieved. Shapes:

  • Linear: uniformly helps
  • Capped: helps up to a point, then the candidate line stops being the bottleneck
  • Flat: code is not on the critical path, optimization does nothing
  • Negative: optimization makes things worse (speeding a thread increases lock contention or adds to the critical path). Sometimes recovers at higher speedups
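The “capped” shape can be reproduced on the same kind of toy model: sweep the virtual speedup d and compute the program-level speedup. Illustrative numbers only, and note this simple model cannot produce the negative shape (it has no locks or dependencies).

```python
# Sweep virtual speedup d and compute program-level speedup: thread A
# is all f, thread B is fixed other work. The curve rises while f is
# the bottleneck, then flattens once thread B limits the program.
t_f, t_b = 10.0, 6.0

def program_speedup(d):
    baseline = max(t_f, t_b)
    improved = max((1 - d) * t_f, t_b)
    return (baseline - improved) / baseline

# Rises linearly while f is the bottleneck...
assert program_speedup(0.2) == 0.2
# ...then caps at 40% once thread B (6.0s) becomes the critical path.
assert program_speedup(0.4) == 0.4
assert program_speedup(0.6) == 0.4
assert program_speedup(1.0) == 0.4
```

A flat line is the degenerate case of this curve (capped at zero): the candidate never bounds the critical path at any d.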

“Speeding up X would help” does not mean X actually can be sped up, nor that it can be sped up arbitrarily. After a real change, re-run the profile against a new baseline.

Overhead [CB15]

~17.6% total overhead: 2.6% startup (processing debug info), 4.8% sampling, 10.2% from the pauses inserted into other threads.

Limits

Works when all threads live on one machine under your control. Distributed systems would need significant extension (coordinating pauses across servers).

SCoz [AKNJ21]

Extends Coz for (a) multi-process apps and (b) OS-as-bottleneck. Coz can’t help when the kernel is the bottleneck since it can’t pause the kernel.

SCoz moves from thread-based pausing to core-based pausing: one profiler thread pinned per core calling ndelay with preemption disabled. Needs kernel support, which limits where you can apply it.