Profile-Guided Optimization
A compilation strategy that uses runtime profile data to inform static compiler decisions. Microsoft calls it POGO; it also goes by "feedback-directed optimization."
Why?
Static analysis has to guess branch directions, hot match arms, and virtual call targets. A bad guess (e.g., a fallback path that is wrong 95% of the time) is expensive. Real execution data replaces the guess.
Three-stage flow:
- Instrumented build: compile with counters on every edge/branch
- Training run: execute on a representative workload, collecting `.profraw` data
- Optimized build: recompile, feeding the profile back to the compiler
With actual branch frequencies, the compiler can:
- Lay out basic blocks so taken branches are fall-through, improving I-cache locality
- Inline aggressively along hot paths
- Outline cold code (error handling) into a separate section that does not pollute I-cache
- Choose branch hints aligned with actual direction
- Specialize virtual calls on observed types (devirtualization)
- Register allocate with real pressure, not estimates
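Devirtualization is the least obvious item on this list, so here is a minimal sketch (hypothetical trait and type names). The call through the trait object is dynamic dispatch; if training runs only ever observe one concrete type behind it, profile data lets the compiler speculate on that type and inline the call, keeping a fallback for the general case.

```rust
// Hypothetical devirtualization candidate: `greet` is a vtable call,
// but a profile may show only one impl is ever live at this call site.
trait Greeter {
    fn greet(&self) -> String;
}

struct English;
impl Greeter for English {
    fn greet(&self) -> String {
        "hello".to_string()
    }
}

struct French;
impl Greeter for French {
    fn greet(&self) -> String {
        "bonjour".to_string()
    }
}

fn main() {
    // Virtual dispatch through a trait object: the static compiler
    // cannot know which impl is behind `g` without profile data.
    let g: Box<dyn Greeter> = Box::new(English);
    assert_eq!(g.greet(), "hello");
}
```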
Typical gain: 5 to 20% on realistic workloads, more on branch-heavy code.
Post-link variant: LLVM BOLT, a post-link optimizer that rearranges an already-compiled binary using profile data. Facebook uses it on HHVM, Google on Chrome.
Catch
The training workload has to match production. A bad profile can make things worse.
From ECE459 L27
Three motivating cases for the compiler to guess [Ast13a]:
- `if a < b { ... } else { ... }`: which branch is taken? If the compiler's fallback guess is wrong 95% of the time, performance suffers.
- Virtual call on a trait object (`thing.greet()`): if only one impl is live, devirtualize. With two, speculate on the more common one.
- `match x { 0..10 => ..., 11..100 => ..., _ => ... }`: which arm is hot? Reorder arms in likelihood order.
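The first and third cases fit in a few lines (hypothetical names; the point is what the static compiler cannot know):

```rust
// Without a profile, the compiler cannot know which branch or arm
// dominates at runtime; with one, it can put the hot path first.
fn bucket(x: u32) -> &'static str {
    // A value profile over `x` tells the compiler which arm to
    // lay out as the fall-through case.
    match x {
        0..=9 => "small",
        10..=99 => "medium",
        _ => "large",
    }
}

fn main() {
    let (a, b) = (1, 2);
    // Which branch is hot? Only runtime data can say.
    let side = if a < b { "then" } else { "else" };
    assert_eq!(side, "then");
    assert_eq!(bucket(42), "medium");
}
```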
Hand-writing branch hints is manual. HotSpot JIT (and .NET) fix this at runtime, swapping predictions when observation says so. Rust has no such runtime; the compiler runs once.
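Rust does expose one coarse manual hint on stable: the `#[cold]` attribute marks a function as rarely called, a hand-written version of the layout decision PGO derives automatically. A minimal sketch (hypothetical function names):

```rust
// `#[cold]` tells the compiler this function is rarely called, so
// callers can move the error path out of line; PGO reaches the same
// decision from measured frequencies instead of an annotation.
#[cold]
fn handle_error(code: i32) -> String {
    format!("error {code}")
}

fn process(input: i32) -> String {
    if input >= 0 {
        // Hot path: laid out as fall-through.
        format!("ok {input}")
    } else {
        // Cold path: the call target is marked #[cold].
        handle_error(input)
    }
}

fn main() {
    assert_eq!(process(5), "ok 5");
    assert_eq!(process(-1), "error -1");
}
```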
Three probe types during instrumented run [Ast13a]
- Function entry: how many times a function was called
- Edge: transitions taken (which branch of an `if`)
- Value: histograms over a value (a `match` input distribution)
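As a toy illustration of what the instrumented build records, here is a hand-rolled version of the three probes (made-up names; the real counters are inserted by the compiler, not written by you):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Toy stand-ins for compiler-inserted probes: a function-entry counter,
// two edge counters for an `if`, and a small value histogram for a `match`.
static ENTRIES: AtomicU64 = AtomicU64::new(0);
static EDGE_THEN: AtomicU64 = AtomicU64::new(0);
static EDGE_ELSE: AtomicU64 = AtomicU64::new(0);
static VALUE_HIST: [AtomicU64; 3] =
    [AtomicU64::new(0), AtomicU64::new(0), AtomicU64::new(0)];

fn probe_demo(x: u32) {
    ENTRIES.fetch_add(1, Ordering::Relaxed); // function-entry probe
    if x < 10 {
        EDGE_THEN.fetch_add(1, Ordering::Relaxed); // edge probe: then-arm
    } else {
        EDGE_ELSE.fetch_add(1, Ordering::Relaxed); // edge probe: else-arm
    }
    // Value probe: bucket the match input into a histogram.
    let bucket = match x {
        0..=9 => 0usize,
        10..=99 => 1,
        _ => 2,
    };
    VALUE_HIST[bucket].fetch_add(1, Ordering::Relaxed);
}

fn main() {
    for x in [3, 42, 7, 1000, 55] {
        probe_demo(x);
    }
    // 5 calls; 3 and 7 took the then-edge; 42 and 55 fell in bucket 1.
    assert_eq!(ENTRIES.load(Ordering::Relaxed), 5);
    assert_eq!(EDGE_THEN.load(Ordering::Relaxed), 2);
    assert_eq!(VALUE_HIST[1].load(Ordering::Relaxed), 2);
}
```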
Rust workflow
```shell
# 1. Instrumented build
rustc -Cprofile-generate=/tmp/pgo-data -O ./main.rs
# 2. Training runs on representative inputs
./main mydata1.csv
./main mydata2.csv
./main mydata3.csv
# 3. Merge the raw profiles, then rebuild with the profile
llvm-profdata merge -o merged.profdata /tmp/pgo-data
rustc -Cprofile-use=./merged.profdata -O ./main.rs
```
`.profraw` is raw, furiously-jotted data. `llvm-profdata` cleans it up into `.profdata`, and you can merge multiple runs with per-run weights. Training data can be checked into source control and re-used until the code diverges significantly [Ast13a].
Training guidance
Match real-world usage, not unit-test coverage. Hitting every bell and whistle equally teaches the compiler that nothing is important. Focus time on performance-critical paths. The program runs slower during training because of probes.
Optimizations the data enables [Ast13b]
Full/partial inlining, function layout, speed-vs-size decision, basic block layout, code separation, virtual call speculation, switch expansion, data separation, loop unrolling.
Speed/size: compile hot methods for speed, cold for size. Guideline: under 5% of methods should be compiled for speed.
Inlining is the biggest win. Decisions are call-graph-path-sensitive: foo called from bar behaves differently than foo called from baz, so the compiler can inline foo into bar where it pays and skip inlining foo into baz where it would only bloat.
Page locality: pack hot call chains onto the same page so the callee is not a page away from the caller. “Dead” code (never invoked in training, not necessarily statically unreachable) goes in its own cold block.
Spec2k gains [Ast13b]
| App | sjeng | gobmk | perl | povray | gcc |
|---|---|---|---|---|---|
| Inlined edges | 50% | 53% | 25% | 79% | 65% |
| Page locality | 97% | 75% | 85% | 98% | 80% |
| Speed gain | 8.5% | 6.6% | 14.9% | 36.9% | 7.9% |