Link Time Optimization

From L18. Whole-program / interprocedural optimization performed at link time, when the entire program is visible. As late as 2004, gcc’s only interprocedural optimization was inlining, and it was ad hoc [Hub14]. Modern compilers can do much more; the central challenge is scalability.

Rust: enable via Cargo.toml:

[profile.release]
lto = true
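Cargo accepts more than a boolean here. A sketch of the common settings (these key values are standard Cargo profile options):

```toml
[profile.release]
# "fat" LTO: optimize all crates' IR as a single unit (same as `true`)
lto = "fat"
# "thin" LTO: a parallel variant that trades a little optimization
# quality for much better build times
# lto = "thin"
```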

How it works

Three phases:

  1. Local generation (parallelizable) — compile each unit to IR. Keep IR compact for whole-program use.
  2. Whole-program analysis (hard to parallelize) — build the call graph, make transform decisions. Possibly partition the program.
  3. Local transformations (parallelizable) — apply decisions per IR file, emit object code.
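The three phases can be sketched as a toy pipeline. The "IR" here is just a list of (function, callees) pairs, and the only whole-program decision is marking leaf functions for inlining; all names are invented for illustration:

```rust
use std::collections::HashSet;

// Phase 1 output: per-unit "IR" — each function with the functions it calls.
type Ir = Vec<(String, Vec<String>)>;

// Phase 2: whole-program analysis over every unit's IR at once.
// The decision here is trivial: mark any function that calls nothing.
fn analyze(units: &[Ir]) -> HashSet<String> {
    units
        .iter()
        .flatten()
        .filter(|(_, callees)| callees.is_empty())
        .map(|(name, _)| name.clone())
        .collect()
}

// Phase 3: apply the global decisions unit-by-unit (parallelizable).
fn transform(unit: &Ir, inline: &HashSet<String>) -> Vec<String> {
    unit.iter()
        .map(|(name, callees)| {
            let inlined: Vec<_> =
                callees.iter().filter(|c| inline.contains(*c)).collect();
            format!("{name}: inlines {inlined:?}")
        })
        .collect()
}

fn main() {
    // Two "translation units", as phase 1 would emit them.
    let a: Ir = vec![("main".into(), vec!["helper".into()])];
    let b: Ir = vec![("helper".into(), vec![])];
    let decisions = analyze(&[a.clone(), b.clone()]);
    for line in transform(&a, &decisions) {
        println!("{line}");
    }
}
```

Note that phase 2 is the only step that needs every unit's IR in memory at once, which is exactly why it is the hard phase to parallelize.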

Transformations

Global decisions, local transformations:

  • Devirtualization
  • Dead variable / dead function elimination
  • Field reordering, struct splitting/reorganization
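Devirtualization in miniature: once the whole program is visible, a dynamic call whose receiver can only have one concrete type may be replaced by a direct (and then inlinable) call. A hand-written before/after with invented types:

```rust
trait Shape {
    fn area(&self) -> f64;
}

struct Square(f64);
impl Shape for Square {
    fn area(&self) -> f64 { self.0 * self.0 }
}

// Virtual dispatch: every call goes through the vtable behind `dyn Shape`.
fn total_area_dyn(shapes: &[Box<dyn Shape>]) -> f64 {
    shapes.iter().map(|s| s.area()).sum()
}

// What devirtualization effectively produces when analysis proves every
// element is a Square: a direct call with no vtable indirection.
fn total_area_direct(shapes: &[Square]) -> f64 {
    shapes.iter().map(Square::area).sum()
}

fn main() {
    let dynamic: Vec<Box<dyn Shape>> =
        vec![Box::new(Square(2.0)), Box::new(Square(3.0))];
    let direct = vec![Square(2.0), Square(3.0)];
    assert_eq!(total_area_dyn(&dynamic), total_area_direct(&direct));
    println!("both paths agree: {}", total_area_direct(&direct)); // 13
}
```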

Global decisions, global transformations:

  • Cross-module inlining
  • Virtual function inlining
  • Interprocedural constant propagation
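Interprocedural constant propagation, done by hand: if every call site (possibly in another module) passes a constant, the callee can be specialized for it. A toy before/after; the module and function names are invented:

```rust
mod mathlib {
    // As written: general over `base`.
    pub fn power(base: u64, exp: u32) -> u64 {
        base.pow(exp)
    }

    // After propagating the constant base = 2 from the only call site,
    // the body can be rewritten into a shift.
    pub fn power_of_two(exp: u32) -> u64 {
        1u64 << exp
    }
}

fn main() {
    // The only caller anywhere passes the constant 2; with whole-program
    // visibility, LTO can prove this and specialize the callee.
    for exp in 0..8 {
        assert_eq!(mathlib::power(2, exp), mathlib::power_of_two(exp));
    }
    println!("specialized and general versions agree");
}
```

Without LTO the compiler sees `power` and its caller in separate translation units and must keep the general version.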

Scalability

Chromium, the Linux kernel, and Firefox run to tens of millions of lines. Getting all of the IR into memory for the analysis phase is the first hurdle. Any analysis that scales worse than linearly in program size becomes impractical at that scale. Partitioning the call graph helps: keep closely related functions in the same partition and use cheap approximations for distant ones.
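A minimal sketch of one partitioning idea: treat the call graph as undirected and make each connected component a partition, so functions that reach each other are analyzed together while unrelated clusters proceed in parallel. The graph below is a toy (real partitioners also balance partition sizes):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Group functions into partitions: each connected component of the
// (undirected) call graph becomes one partition.
fn partition(nodes: &[&str], calls: &[(&str, &str)]) -> Vec<HashSet<String>> {
    let mut adj: HashMap<&str, Vec<&str>> = HashMap::new();
    for &(a, b) in calls {
        adj.entry(a).or_default().push(b);
        adj.entry(b).or_default().push(a);
    }
    let mut seen: HashSet<&str> = HashSet::new();
    let mut parts = Vec::new();
    for &start in nodes {
        if seen.contains(start) {
            continue;
        }
        // Breadth-first search collects one component.
        let mut comp = HashSet::new();
        let mut queue = VecDeque::from([start]);
        while let Some(f) = queue.pop_front() {
            if !seen.insert(f) {
                continue;
            }
            comp.insert(f.to_string());
            queue.extend(adj.get(f).into_iter().flatten().copied());
        }
        parts.push(comp);
    }
    parts
}

fn main() {
    // Two clusters that never call each other can be
    // optimized in parallel, one partition per cluster.
    let nodes = ["main", "parse", "render", "log"];
    let calls = [("main", "parse"), ("render", "log")];
    let parts = partition(&nodes, &calls);
    assert_eq!(parts.len(), 2);
    println!("{} partitions", parts.len());
}
```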

gcc’s progress per version [Hub15]:

  • 4.5 — initial LTO.
  • 4.6 — parallelization, call-graph partitioning. Bottleneck: streaming in types and declarations.
  • 4.7–4.9 — better build times and memory use (“chasing unnecessary data away”).

Impact

gcc’s LTO delivers roughly a 3–5% performance improvement, which compiler experts consider a good result. It also lets developers stop factoring code into translation units by hand for the optimizer’s benefit and leave that to the compiler (like going from a manual to an automatic transmission).

LLVM’s LTO can optimize across source languages: a program containing both C and Rust is optimized across the language boundary. Details: [LLV17], [Die09].

Background video: https://www.youtube.com/watch?v=j2sND8ATjsE