Compiler Optimization

Compiler optimization is the set of transformations a compiler applies to produce code with the same observable behaviour as the source, but faster. Covered in ECE459 L18.

Why rely on the compiler?

“Optimization” is a misnomer: compilers do not produce optimal code, just better code. But automatic transformations are usually worth more than hand-rolling them, because hand optimization costs your time and hurts readability. Rust can even force certain optimizations, such as inlining with #[inline(always)], overriding the compiler’s own judgment.

The contract: execute a program with the same behaviour as yours, faster. The contract is void if there is undefined behaviour.

rustc levels

rustc does a little vectorization itself; most optimization happens in the LLVM backend. -C opt-level mostly sets inline limits and passes the level down:

  • 0: no optimizations, also turns on cfg(debug_assertions)
  • 1: basic
  • 2: some
  • 3: all
  • "s": binary size
  • "z": size, also turns off loop vectorization
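For cargo projects, the level can also be set per profile in Cargo.toml instead of passing -C opt-level to rustc directly; a minimal sketch:

```toml
[profile.dev]
opt-level = 0      # default for dev builds; cfg(debug_assertions) is on

[profile.release]
opt-level = 3      # or "s" / "z" to favour binary size
```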

For fast release binaries: cargo build --release. Enable LTO in Cargo.toml:

[profile.release]
lto = true

Use https://godbolt.org/ with flags such as -C overflow-checks=n to inspect the generated assembly.

Scalar optimizations

  • Constant folding: i = 1024 * 1024 becomes i = 1048576. Always enabled
  • Common Subexpression Elimination (CSE): compute c + d once when it appears twice with unchanged operands. Enabled at level 1
  • Constant propagation: move a constant from its definition to its use if unchanged in between
  • Copy propagation: same for variable copies, often run after CSE. Given let y = x; let z = w + y, propagate to z = w + x (or z = 3 + x with constant prop). C complicates both with pointers; Rust’s uniqueness makes them easier, though LLVM may not exploit Rust’s guarantees
  • Dead code elimination: remove code the compiler proves won’t run. Undecidable in general, compilers do their best
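A toy sketch (the function and values are ours, not from the notes) that writes out by hand what these passes do automatically:

```rust
fn scalar_opts(c: i32, d: i32) -> i32 {
    // Constant folding: the compiler replaces 1024 * 1024 with 1048576
    // at compile time; no multiply happens at runtime.
    let i = 1024 * 1024;

    // Common subexpression elimination: c + d appears twice with
    // unchanged operands, so the compiler computes it once and reuses it.
    let a = c + d;
    let b = (c + d) * 2;

    // Dead code elimination: this branch provably never runs,
    // so the compiler removes it entirely.
    if false {
        return -1;
    }

    i + a + b
}

fn main() {
    println!("{}", scalar_opts(3, 4)); // 1048576 + 7 + 14 = 1048597
}
```

Compiling this at -C opt-level=2 on godbolt shows the folded constant and a single addition for c + d.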

Low-level

  • #[cold]: Rust attribute marking a rarely-called function (e.g. panic). Rust doesn’t directly expose likely / unlikely
  • Architecture-specific: -C target-cpu=native, -C target-feature unlock instructions like SSE4.2. Good for local/cloud builds, bad for code you ship [Wil20]
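A sketch of #[cold] in use (function names are ours): marking the rarely-taken error path tells the optimizer to lay out the hot path for fall-through.

```rust
// Rarely called: the compiler may move it out of line and
// optimize the caller's hot path for the common case.
#[cold]
fn handle_rare_error(code: i32) -> String {
    format!("rare error: {code}")
}

fn process(value: i32) -> String {
    if value >= 0 {
        // Hot path: expected to be the fall-through branch.
        format!("ok: {value}")
    } else {
        handle_rare_error(value)
    }
}

fn main() {
    println!("{}", process(7));
    println!("{}", process(-1));
}
```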

See LTO and Alias Analysis.

Devirtualization: convert virtual calls to direct calls so the CPU doesn’t read the vtable or guess the branch. “Rapid Type Analysis” can sometimes observe that only one implementing type is instantiated.
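An illustrative sketch (names are ours): the dyn call below goes through the vtable, but since Square is the only type implementing Shape that is ever instantiated, an analysis like Rapid Type Analysis could justify replacing it with a direct call.

```rust
trait Shape {
    fn area(&self) -> f64;
}

struct Square(f64);

impl Shape for Square {
    fn area(&self) -> f64 {
        self.0 * self.0
    }
}

fn total_area(shapes: &[Box<dyn Shape>]) -> f64 {
    // Virtual dispatch: each call reads the vtable — unless the
    // optimizer proves Square is the only Shape and devirtualizes.
    shapes.iter().map(|s| s.area()).sum()
}

fn main() {
    let shapes: Vec<Box<dyn Shape>> = vec![Box::new(Square(2.0)), Box::new(Square(3.0))];
    println!("{}", total_area(&shapes)); // 4 + 9 = 13
}
```

In Rust you can also sidestep the question entirely with generics (static dispatch), which monomorphize to direct calls.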

Inlining: #[inline] hints, #[inline(always)] forces, #[inline(never)] forbids. Benefits: no call overhead, and it enables context-sensitive optimizations at the call site. Costs: a bigger binary (more i-cache misses), harder debugging (you can’t set a breakpoint on a function that no longer exists), and changing an inlined library function breaks binary compatibility with its callers.
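The three attributes side by side (the function bodies are ours, chosen only to give each attribute a plausible home):

```rust
// Hint: consider inlining, including across crate boundaries.
#[inline]
fn add(a: i32, b: i32) -> i32 {
    a + b
}

// Force inlining past the compiler's own cost heuristics.
#[inline(always)]
fn square(x: i32) -> i32 {
    x * x
}

// Keep a real call, e.g. to preserve a breakpoint target.
#[inline(never)]
fn slow_path(x: i32) -> i32 {
    x - 1
}

fn main() {
    println!("{}", add(square(3), slow_path(10))); // 9 + 9 = 18
}
```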

Tail Recursion Elimination: replaces a tail call with a goto. Mandatory in some functional languages, not mandatory in C/C++/Rust. A tail-recursive fibonacci_lr avoids stack growth when the compiler applies it.
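A tail-recursive Fibonacci in the accumulator style the notes call fibonacci_lr (this particular signature is our sketch): the recursive call is the last action, so the compiler may rewrite it as a jump — though Rust, unlike some functional languages, does not guarantee it.

```rust
fn fibonacci_lr(n: u64, a: u64, b: u64) -> u64 {
    if n == 0 {
        a
    } else {
        // Tail position: nothing happens after the call returns,
        // so the optimizer can replace the call with a goto.
        fibonacci_lr(n - 1, b, a + b)
    }
}

fn main() {
    println!("{}", fibonacci_lr(10, 0, 1)); // 55
}
```

Contrast with the naive fib(n-1) + fib(n-2), where the first call is not in tail position and the stack must grow.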

Prior concepts

See also the easyperf book: https://book.easyperf.net/perf_book