Performance Playbook
A map of where performance goes and what you do about it. Organized by the layer the cost lives at, from fundamental laws down to deployment. Covers ECE459 L01-L35.
How do I use this?
Start by finding where your program is slow (profile first), then jump to that section. The laws at the top tell you what’s even achievable before you start.
Fundamental limits
You can’t optimize past these; know them before you start:
- Amdahl’s Law, serial fraction caps speedup, 5% serial = 20x max no matter how many cores
- Gustafson’s Law, the optimistic counterpoint, scale the problem with cores
- Little’s Law, L = λW, ties throughput, latency, and in-flight work
- Crista’s Five Laws, the meta-principles
- Scalability, what “more resources help” actually means
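The first three laws are just arithmetic; a minimal Rust sketch (function names are mine, not from the course):

```rust
/// Amdahl's Law: speedup on n cores when a fraction `serial` of the work
/// cannot be parallelized. As n → ∞, speedup caps at 1/serial.
fn amdahl_speedup(serial: f64, n: f64) -> f64 {
    1.0 / (serial + (1.0 - serial) / n)
}

/// Little's Law: items in the system L = arrival rate λ × time in system W.
fn littles_law(arrival_rate: f64, time_in_system: f64) -> f64 {
    arrival_rate * time_in_system
}

fn main() {
    // 5% serial caps speedup near 20x, no matter how many cores you add.
    println!("1024 cores, 5% serial: {:.1}x", amdahl_speedup(0.05, 1024.0));
    // 100 req/s, each spending 0.25 s in the system → 25 requests in flight.
    println!("in-flight: {}", littles_law(100.0, 0.25));
}
```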
Single-core CPU
Making one thread go faster; the hardware already does most of this for you:
- ILP: pipelining, dual issue, OoO, register renaming
- Branch prediction + speculative execution, and their security costs (Spectre, Meltdown)
- CPU performance walls, why clocks stopped getting faster
- Value speculation, guess the result, verify later
- SIMD / data parallelism, one instruction, many lanes
- Compiler optimizations: loop opts, alias analysis
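A feel for branches vs. data parallelism: the same reduction written with a data-dependent branch and in a branch-free form. The branch-free version is the kind of loop the compiler can auto-vectorize with SIMD compare + mask, though whether it actually does depends on target and optimization flags; check the assembly.

```rust
/// Branchy: one data-dependent branch per element. On random input the
/// predictor misses often and the pipeline stalls.
fn sum_positive_branchy(data: &[i32]) -> i64 {
    let mut sum = 0i64;
    for &x in data {
        if x >= 0 {
            sum += x as i64;
        }
    }
    sum
}

/// Branch-free: the condition becomes a 0/1 multiplier, so there is
/// nothing to mispredict and each lane can be computed independently.
fn sum_positive_branchless(data: &[i32]) -> i64 {
    data.iter().map(|&x| (x as i64) * (x >= 0) as i64).sum()
}

fn main() {
    let data: Vec<i32> = (0..1_000).map(|i| if i % 3 == 0 { -i } else { i }).collect();
    // Same answer either way; only the execution profile differs.
    assert_eq!(sum_positive_branchy(&data), sum_positive_branchless(&data));
    println!("{}", sum_positive_branchy(&data));
}
```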
Memory hierarchy
Cache is usually where your time actually goes:
- Cache coherence + MESI + snoopy cache
- False sharing, two threads writing near each other trash the cache line
- Write-back vs write-through
- Memory consistency, SC is the ideal, hardware gives you weaker
- Memory barriers, force ordering when you need it
- Memory reordering, what the compiler and CPU rearrange by default
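False sharing in miniature, assuming a 64-byte cache line (common on x86; verify for your CPU). Both layouts compute the same counts; only the unpadded one bounces a cache line between cores:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Unpadded: `a` and `b` sit on the same 64-byte cache line, so two threads
// hammering them separately still invalidate each other's line.
struct Unpadded {
    a: AtomicU64,
    b: AtomicU64,
}

// Aligning each counter to 64 bytes puts them on separate lines,
// removing the coherence traffic.
#[repr(align(64))]
struct Padded(AtomicU64);

fn main() {
    let u = Unpadded { a: AtomicU64::new(0), b: AtomicU64::new(0) };
    let p = [Padded(AtomicU64::new(0)), Padded(AtomicU64::new(0))];
    thread::scope(|s| {
        s.spawn(|| for _ in 0..1_000 { u.a.fetch_add(1, Ordering::Relaxed); });
        s.spawn(|| for _ in 0..1_000 { u.b.fetch_add(1, Ordering::Relaxed); });
        s.spawn(|| for _ in 0..1_000 { p[0].0.fetch_add(1, Ordering::Relaxed); });
        s.spawn(|| for _ in 0..1_000 { p[1].0.fetch_add(1, Ordering::Relaxed); });
    });
    println!("unpadded: {} bytes, padded pair: {} bytes",
             std::mem::size_of::<Unpadded>(),
             std::mem::size_of::<[Padded; 2]>());
}
```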
Concurrency
Using multiple cores on shared state, see Concurrency Problems for the failure taxonomy:
- Lock granularity, coarse is simple, fine is fast, pick wrong and suffer
- Lock convoys, threads queue behind a popular lock and stall
- Deadlock and the four conditions
- Reentrancy, can a function safely call itself?
- Atomics + CAS
- Lock-free and wait-free, plus the ABA hazard
- STM, optimistic concurrency at language level
- Thread pools, don’t spawn threads per task
- GIL, why Python threading is a lie
- Functional programming, immutability sidesteps most of this
- Data dependencies on the critical path
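The CAS retry loop is the core pattern behind the lock-free structures above (and where the ABA hazard lives once you swap pointers instead of counters). A sketch with std atomics:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Lock-free add via compare-and-swap: read the current value, try to
/// install current + delta, and retry if another thread raced us.
fn cas_add(counter: &AtomicU64, delta: u64) {
    let mut cur = counter.load(Ordering::Relaxed);
    // On failure, compare_exchange_weak hands back the fresh value,
    // so we retry from there until our swap wins.
    while let Err(actual) =
        counter.compare_exchange_weak(cur, cur + delta, Ordering::AcqRel, Ordering::Relaxed)
    {
        cur = actual;
    }
}

fn main() {
    let counter = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| for _ in 0..1_000 { cas_add(&counter, 1); });
        }
    });
    // No lock anywhere, yet no increments are lost.
    println!("{}", counter.load(Ordering::Relaxed));
}
```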
Async and I/O
Hiding latency when the bottleneck isn’t CPU:
- Non-blocking I/O, don’t block a thread on a syscall
- Futures and tokio
- Rate limiting, protect downstream services
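Tokio is an external crate, so here is the underlying mechanism with std only: a socket marked non-blocking, where `accept` returns immediately instead of parking the thread. An async runtime wraps exactly this in futures plus an event loop. (The helper name `poll_accept` is mine.)

```rust
use std::io::ErrorKind;
use std::net::TcpListener;

/// One non-blocking poll of the listener: returns right away either way.
fn poll_accept(listener: &TcpListener) -> std::io::Result<&'static str> {
    match listener.accept() {
        Ok(_) => Ok("connection ready"),
        // WouldBlock means "nothing yet; go do other work and come back",
        // which is the whole point of not blocking on the syscall.
        Err(e) if e.kind() == ErrorKind::WouldBlock => Ok("would block, not blocked"),
        Err(e) => Err(e),
    }
}

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0")?; // ephemeral port
    listener.set_nonblocking(true)?; // accept() no longer parks the thread
    println!("{}", poll_accept(&listener)?);
    Ok(())
}
```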
Rust as a performance language
Why the course teaches it, safety without GC overhead:
- Ownership + borrowing + lifetimes
- Fearless concurrency, compiler enforces no data races
- Arc + Mutex for shared state
- Unsafe when you need to break out
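The Arc + Mutex pattern in one small, runnable example; drop either piece and the compiler rejects the program rather than let a data race through:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Arc gives shared ownership across threads; Mutex gives exclusive
/// access to the data inside. Unsynchronized sharing simply won't compile.
fn parallel_count(threads: usize, per_thread: u64) -> u64 {
    let total = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let total = Arc::clone(&total); // bump the refcount, move the clone in
            thread::spawn(move || {
                for _ in 0..per_thread {
                    *total.lock().unwrap() += 1; // lock, bump, unlock on drop
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let n = *total.lock().unwrap();
    n
}

fn main() {
    println!("{}", parallel_count(4, 1_000)); // always 4000, never a torn count
}
```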
Reduced-resource computation
Sometimes the win is doing less work:
- Trade accuracy for time, approximate is often fine
- Early phase termination, stop when enough workers are done
- Loop perforation, skip iterations
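Loop perforation in its simplest form: compute a mean from every 7th element instead of all of them, trading a small accuracy loss for roughly 7x less work (a sketch; real perforation is applied by the compiler or runtime, not by hand):

```rust
/// Exact mean: touches every element.
fn mean_exact(data: &[f64]) -> f64 {
    data.iter().sum::<f64>() / data.len() as f64
}

/// Perforated mean: skip all but every `stride`-th element.
/// Fine when the data is roughly homogeneous; bias creeps in when the
/// stride resonates with structure in the data.
fn mean_perforated(data: &[f64], stride: usize) -> f64 {
    let sampled: Vec<f64> = data.iter().step_by(stride).copied().collect();
    sampled.iter().sum::<f64>() / sampled.len() as f64
}

fn main() {
    let data: Vec<f64> = (0..10_000).map(|i| (i % 100) as f64).collect();
    println!("exact:      {:.2}", mean_exact(&data));
    println!("perforated: {:.2}", mean_perforated(&data, 7)); // ~7x less work
}
```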
Offload to GPU
Massive parallelism for throughput-bound work:
- GPU programming + CUDA kernels
- Heterogeneous programming, CPU + GPU together
- Applications: password cracking, bitcoin mining, LLMs
Scale out
When one machine isn’t enough:
- Clusters + cloud
- Load balancing
- Architecture: monolith vs microservices
- N+1 query problem, classic microservice tax
- Query optimization + query cost + join elimination
- Queueing theory: M/M/1, M/M/k, ergodicity
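The M/M/1 numbers are worth internalizing: pushing utilization from 80% to 95% of capacity quadruples the mean time in system. A small sketch (function name is mine):

```rust
/// M/M/1 queue: Poisson arrivals at rate λ, exponential service at rate μ,
/// one server. Stable only when utilization ρ = λ/μ < 1; otherwise the
/// queue grows without bound.
fn mm1_wait(lambda: f64, mu: f64) -> Option<f64> {
    let rho = lambda / mu;
    if rho >= 1.0 {
        return None; // not ergodic: no steady state exists
    }
    Some(1.0 / (mu - lambda)) // mean time in system, W = 1/(μ − λ)
}

fn main() {
    // A server that handles 10 req/s:
    println!("{:?}", mm1_wait(8.0, 10.0));  // ρ = 0.80 → W = 0.5 s
    println!("{:?}", mm1_wait(9.5, 10.0));  // ρ = 0.95 → W = 2.0 s
    println!("{:?}", mm1_wait(10.0, 10.0)); // ρ = 1.00 → unstable
}
```

This is why "run everything at 100% utilization" is a latency disaster, and why load balancing and scaling out buy headroom, not just raw throughput.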
Measurement (do this first!)
You can’t optimize what you can’t measure:
- Profiling + aggregate counters
- Bottleneck analysis, find the slowest device
- Causal profiling, how much would it actually help if I sped this part up?
- Load testing
- Cachegrind, cache-miss profiling
- POGO (profile-guided optimization), feed the profile back to the compiler
- Benchmarking pitfalls + Miri
- Self-optimizing software + dynamic recompilation
- Accidentally quadratic, the bug you never see until N gets big
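An accidentally-quadratic in the wild, as a sketch: deduplicating with `Vec::contains` hides a linear scan inside the loop, so it passes every small test and then falls over when N gets big.

```rust
use std::collections::HashSet;

/// O(n²): `contains` is a linear scan, and it runs once per element.
/// Invisible at n = 100, brutal at n = 1_000_000.
fn dedup_quadratic(items: &[u32]) -> Vec<u32> {
    let mut out = Vec::new();
    for &x in items {
        if !out.contains(&x) {
            out.push(x);
        }
    }
    out
}

/// O(n): same result, but membership is an O(1)-average hash lookup.
fn dedup_linear(items: &[u32]) -> Vec<u32> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for &x in items {
        if seen.insert(x) {
            out.push(x);
        }
    }
    out
}

fn main() {
    let items = [3, 1, 3, 2, 1];
    // Identical output; wildly different scaling.
    println!("{:?}", dedup_quadratic(&items));
    println!("{:?}", dedup_linear(&items));
}
```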
Deploy and operate
Performance doesn’t stop at the binary:
Quick “my program is slow, what do I do”
- CPU-bound single thread: profile, look at compiler flags, SIMD, branch hints
- CPU-bound multi-core: finer locks, lock-free structures, thread pool
- Memory-bound: check false sharing, cache-friendly layout, cachegrind
- I/O-bound: async, batching, rate limit downstream
- Contention / waiting: queueing theory, load balancing, scale out
- Embarrassingly parallel: GPU or cluster
- “Good enough” tolerated: loop perforation, approximate