Profile-Guided Optimization

A compilation strategy that uses runtime profile data to inform static compiler decisions. Microsoft calls it POGO. Same idea as “feedback-directed optimization.”

Why?

Static analysis has to guess branches, hot arms, and virtual call targets. A bad guess (e.g., a fallback that predicts wrong 95% of the time) is expensive. Real execution data replaces the guess.

Three-stage flow:

  1. Instrumented build: compile with counters on every edge/branch
  2. Training run: execute on a representative workload, collect .profraw data
  3. Optimized build: recompile feeding the profile back to the compiler

With actual branch frequencies, the compiler can:

  • Lay out basic blocks so taken branches are fall-through, improving I-cache locality
  • Inline aggressively along hot paths
  • Outline cold code (error handling) into a separate section that does not pollute I-cache
  • Choose branch hints aligned with actual direction
  • Specialize virtual calls on observed types (devirtualization)
  • Register allocate with real pressure, not estimates
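Two of these layout decisions can be approximated by hand in stable Rust, which makes the idea concrete: `#[cold]` marks a function as rarely called so the compiler outlines it away from the hot loop. This is a sketch of what PGO automates; the function names here are illustrative.

```rust
// Manually approximating cold-code outlining, which PGO would do
// automatically once the profile shows the error path is never taken.

#[cold]
#[inline(never)]
fn report_error(code: u32) -> String {
    // Cold path: kept out of the hot loop's code, so the loop body
    // stays small and I-cache friendly.
    format!("unexpected input: {code}")
}

fn process(values: &[u32]) -> Result<u64, String> {
    let mut sum = 0u64;
    for &v in values {
        if v > 1_000_000 {
            return Err(report_error(v));
        }
        sum += u64::from(v);
    }
    Ok(sum)
}

fn main() {
    assert_eq!(process(&[1, 2, 3]), Ok(6));
    assert!(process(&[1, 2_000_000]).is_err());
}
```

The difference with PGO: the compiler discovers which paths are cold from the profile, instead of relying on the programmer to annotate them.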

Typical gain: 5 to 20% on realistic workloads, more on branch-heavy code.

Post-link variant: LLVM BOLT, an optimizer that rearranges the already-compiled binary using profile data. Facebook uses it on HHVM, Google on Chrome.

Catch

The training workload has to match production. A bad profile can make things worse.

From ECE459 L27

Three motivating cases for the compiler to guess [Ast13a]:

  • if a < b { ... } else { ... }: which branch? If the fallback guesses wrong 95% of the time, bad
  • Virtual call on a trait object (thing.greet()): if only one impl is live, devirtualize. With two, guess
  • match x { 0..=10 => ..., 11..=100 => ..., _ => ... }: which arm is hot? Reorder arms in likelihood order
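The trait-object case above can be sketched as follows. A call through `dyn Trait` is an indirect call; if the profile shows only one impl is ever live, the compiler can replace it with a direct (and inlinable) call. Names (`Greeter`, `English`, `French`) are illustrative, not from the lecture.

```rust
// A call through a trait object is indirect: the compiler cannot see the
// target statically. If profiling shows one concrete type dominates,
// PGO lets the compiler speculate on it (devirtualization).

trait Greeter {
    fn greet(&self) -> String;
}

struct English;
impl Greeter for English {
    fn greet(&self) -> String { "hello".to_string() }
}

struct French;
impl Greeter for French {
    fn greet(&self) -> String { "bonjour".to_string() }
}

fn greet_all(gs: &[Box<dyn Greeter>]) -> Vec<String> {
    // Indirect call through the vtable on every iteration.
    gs.iter().map(|g| g.greet()).collect()
}

fn main() {
    let gs: Vec<Box<dyn Greeter>> = vec![Box::new(English), Box::new(French)];
    assert_eq!(greet_all(&gs), vec!["hello".to_string(), "bonjour".to_string()]);
}
```

With two live impls, as here, the compiler can still guard a speculative direct call to the common one and fall back to the vtable otherwise.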

Hand-writing branch hints is manual and brittle. The HotSpot JIT (and .NET's) fixes this at runtime, swapping predictions when observation says so. Rust has no such runtime; the compiler runs once.

Three probe types during instrumented run [Ast13a]

  • Function entry: how many times a function was called
  • Edge: transitions taken (which branch of an if)
  • Value: histograms over a value (match input distribution)
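The three probe types are just counters the instrumented build bumps as it runs; approximated by hand they look like this. The struct and field names are illustrative, not LLVM's internal representation.

```rust
// Hand-rolled approximation of the three probe types the instrumented
// build inserts: function-entry counts, edge counts, and value histograms.

use std::collections::HashMap;

#[derive(Default)]
struct Probes {
    entries: u64,              // function-entry probe: call count
    taken: u64,                // edge probe: how often the `if` was taken
    not_taken: u64,            // edge probe: the other branch
    values: HashMap<u32, u64>, // value probe: histogram of inputs
}

fn classify(x: u32, p: &mut Probes) -> &'static str {
    p.entries += 1;
    *p.values.entry(x).or_insert(0) += 1;
    if x < 10 {
        p.taken += 1;
        "small"
    } else {
        p.not_taken += 1;
        "large"
    }
}

fn main() {
    let mut p = Probes::default();
    for x in [1, 2, 2, 50] {
        classify(x, &mut p);
    }
    assert_eq!(p.entries, 4);
    assert_eq!((p.taken, p.not_taken), (3, 1));
    assert_eq!(p.values[&2], 2);
}
```

The real instrumentation writes these counters out as the .profraw file when the program exits.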

Rust workflow

# 1. Instrumented build: counters compiled in
rustc -Cprofile-generate=/tmp/pgo-data -O ./main.rs
# 2. Training runs: each writes .profraw data into /tmp/pgo-data
./main mydata1.csv
./main mydata2.csv
./main mydata3.csv
# 3. Merge the raw profiles, then recompile with the merged profile
llvm-profdata merge -o merged.profdata /tmp/pgo-data
rustc -Cprofile-use=./merged.profdata -O ./main.rs

.profraw is raw, furiously-jotted data. llvm-profdata cleans it up into .profdata, and you can merge multiple runs with per-run weights. Training data can be checked into source control and re-used until the code diverges significantly [Ast13a].

Training guidance

Match real-world usage, not unit-test coverage. Hitting every bell and whistle equally teaches the compiler that nothing is important. Focus time on performance-critical paths. The program runs slower during training because of probes.

Optimizations the data enables [Ast13b]

Full/partial inlining, function layout, speed-vs-size decision, basic block layout, code separation, virtual call speculation, switch expansion, data separation, loop unrolling.

Speed/size: compile hot methods for speed, cold for size. Guideline: under 5% of methods should be compiled for speed.

Inlining is the biggest win. Decisions are call-graph-path-sensitive: foo called from bar behaves differently than foo called from baz. Inline foo into bar if it pays; skip inlining foo into baz if it only bloats.

Page locality: pack hot call chains onto the same page so the callee is not a page away from the caller. “Dead” code (never invoked in training, not necessarily statically unreachable) goes in its own cold block.

Spec2k gains [Ast13b]

  App            sjeng   gobmk   perl    povray  gcc
  Inlined edges  50%     53%     25%     79%     65%
  Page locality  97%     75%     85%     98%     80%
  Speed gain     8.5%    6.6%    14.9%   36.9%   7.9%