Compiler Optimization
Compiler optimization is the set of transformations a compiler applies to produce code with the same observable behaviour as the source, but faster. Covered in ECE459 L18.
Why rely on the compiler?
“Optimization” is a misnomer: compilers do not produce optimal code, just better code. But automatic transformations are usually worth more than hand-rolling them, because hand optimization costs your time and hurts readability. Rust can even force certain optimizations (inlining) past the compiler’s own judgment.
The contract: execute a program with the same behaviour as yours, faster. The contract is void if there is undefined behaviour.
rustc levels
rustc does a little vectorization itself; most optimization happens in the LLVM backend. -C opt-level mostly sets inline limits and passes the level down:
- 0: no optimizations; also turns on `cfg(debug_assertions)`
- 1: basic optimizations
- 2: some optimizations
- 3: all optimizations
- "s": optimize for binary size
- "z": optimize for size; also turns off loop vectorization
For fast release binaries: `cargo build --release`. Enable LTO in Cargo.toml:
[profile.release]
lto = true

Use https://godbolt.org/ with -C overflow-checks=n to investigate the generated assembly.
Scalar optimizations
- Constant folding: `i = 1024 * 1024` becomes `i = 1048576`. Always enabled
- Common Subexpression Elimination (CSE): compute `c + d` once when it appears twice with unchanged operands. Enabled at level 1
- Constant propagation: move a constant from its definition to its use if unchanged in between
- Copy propagation: same for variable copies, often run after CSE. Given `let y = x; let z = w + y`, propagate to `z = w + x` (or `z = 3 + x` with constant propagation). C complicates both with pointers; Rust's uniqueness guarantees make them easier, though LLVM may not exploit Rust's guarantees
- Dead code elimination: remove code the compiler proves won't run. Undecidable in general; compilers do their best
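The scalar transforms above can be sketched as a before/after pair. This is an illustrative example, not output from any particular compiler: `before` is written the way a programmer might, and `after` shows what the code looks like once folding, CSE, and copy propagation have run.

```rust
// Sketch of scalar optimizations: `before` as written, `after` as transformed.
fn before(w: i64, c: i64, d: i64) -> (i64, i64, i64) {
    let i = 1024 * 1024; // constant folding: computed at compile time
    let a = c + d;       // CSE: c + d is computed once...
    let b = (c + d) * 2; // ...and reused instead of recomputed
    let y = w;           // copy propagation: uses of y become uses of w
    let z = w + y;       // becomes w + w
    (i + a, b, z)
}

fn after(w: i64, c: i64, d: i64) -> (i64, i64, i64) {
    let i = 1048576;     // folded
    let t = c + d;       // single shared subexpression
    (i + t, t * 2, w + w) // copy propagated; y is dead and eliminated
}

fn main() {
    // Same observable behaviour, as the optimization contract requires.
    assert_eq!(before(3, 4, 5), after(3, 4, 5));
    println!("before and after agree");
}
```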
Low-level
- `#[cold]`: Rust attribute marking a rarely-called function (e.g. `panic`). Rust doesn't directly expose `likely`/`unlikely`
- Architecture-specific: `-C target-cpu=native` and `-C target-feature` unlock instructions like SSE4.2. Good for local/cloud builds, bad for code you ship [Wil20]
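A minimal sketch of `#[cold]`: the error path is split into its own marked function so the compiler can lay it out away from the hot path. The function names here are illustrative, not from the notes.

```rust
// Hypothetical example of #[cold] on a failure path.
#[cold]
fn fail(msg: &str) -> ! {
    panic!("{msg}");
}

fn parse_positive(s: &str) -> i32 {
    match s.parse::<i32>() {
        Ok(n) if n > 0 => n,
        // Cold path: the compiler treats this branch as unlikely.
        _ => fail("expected a positive integer"),
    }
}

fn main() {
    assert_eq!(parse_positive("42"), 42);
    println!("ok");
}
```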
Interprocedural and link-time
See LTO and Alias Analysis.
Devirtualization: convert virtual calls to direct calls so the CPU doesn’t read the vtable or guess the branch. “Rapid Type Analysis” can sometimes observe that only one implementing type is instantiated.
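A sketch of what devirtualization does, using an invented `Shape` trait: `virtual_call` dispatches through the vtable, and `direct_call` shows the direct call the optimizer can substitute once it observes that only one implementing type is ever instantiated.

```rust
// Illustrative trait and types; not from the notes.
trait Shape {
    fn area(&self) -> f64;
}

struct Square(f64);
impl Shape for Square {
    fn area(&self) -> f64 {
        self.0 * self.0
    }
}

// Indirect call: the CPU must load the vtable entry before calling.
fn virtual_call(s: &dyn Shape) -> f64 {
    s.area()
}

// What devirtualization yields when Square is the only instantiated type.
fn direct_call(s: &Square) -> f64 {
    s.area()
}

fn main() {
    let sq = Square(3.0);
    assert_eq!(virtual_call(&sq), direct_call(&sq));
}
```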
Inlining: #[inline] hints, #[inline(always)] forces, #[inline(never)] forbids. Benefits: no call overhead, enables context-sensitive optimizations. Costs: bigger binary (more i-cache misses), harder debugging (can’t breakpoint a function that no longer exists), and changing an inlined library function breaks binary compat with callers.
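The three inlining attributes look like this in source; the function bodies are placeholders chosen for illustration.

```rust
// Hint: makes the function a cross-crate inlining candidate.
#[inline]
fn square(x: u64) -> u64 {
    x * x
}

// Force: inlined even when the compiler's heuristics would decline.
#[inline(always)]
fn add1(x: u64) -> u64 {
    x + 1
}

// Forbid: keeps a real call frame, e.g. to preserve a breakpoint target.
#[inline(never)]
fn slow_path(x: u64) -> u64 {
    x.wrapping_mul(2654435761)
}

fn main() {
    assert_eq!(add1(square(6)), 37);
    println!("{}", slow_path(1));
}
```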
Tail Recursion Elimination: replaces a tail call with a goto. Mandatory in some functional languages, not mandatory in C/C++/Rust. A tail-recursive fibonacci_lr avoids stack growth when the compiler applies it.
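The notes name a tail-recursive `fibonacci_lr`; the accumulator signature below is an assumed reconstruction. The recursive call is the last action in the function, so the compiler may rewrite it as a jump, though Rust does not guarantee tail-call elimination.

```rust
// Assumed shape of the tail-recursive fibonacci_lr from the notes:
// `prev` and `curr` accumulate the last two Fibonacci numbers.
fn fibonacci_lr(n: u64, prev: u64, curr: u64) -> u64 {
    if n == 0 {
        curr
    } else {
        // Tail position: no work remains after the call, so it can become a goto.
        fibonacci_lr(n - 1, curr, prev + curr)
    }
}

fn fib(n: u64) -> u64 {
    if n == 0 { 0 } else { fibonacci_lr(n - 1, 0, 1) }
}

fn main() {
    assert_eq!(fib(10), 55);
    println!("fib(10) = {}", fib(10));
}
```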
Prior concepts
- Static Single-Assignment Form: underlying representation that enables many of these transforms
- data flow analysis, e.g. constant propagation
- Register Allocation
See also the easyperf book: https://book.easyperf.net/perf_book