Performance Playbook
A map of where performance goes and what you do about it. Organized by the layer the cost lives at, from fundamental laws down to deployment. Covers ECE459 L01-L35.
How do I use this?
Start by finding where your program is slow (profile first), then jump to that section. The laws at the top tell you what’s even achievable before you start.
Fundamental limits
You can’t optimize past these; know them before you start:
- Amdahl’s Law, serial fraction caps speedup, 5% serial = 20x max no matter how many cores
- Gustafson’s Law, the optimistic counterpoint, scale the problem with cores
- Little’s Law, L = λW, ties throughput, latency, and in-flight work
- Crista’s Five Laws, the meta-principles
- Scalability, what “more resources help” actually means
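The first three laws are just arithmetic; a minimal Rust sketch (function names are mine, not from the course):

```rust
/// Amdahl's Law: speedup on n cores when a fraction `serial` of the work
/// cannot be parallelized. As n → ∞, speedup caps at 1/serial.
fn amdahl_speedup(serial: f64, n: f64) -> f64 {
    1.0 / (serial + (1.0 - serial) / n)
}

/// Little's Law: items in the system L = arrival rate λ × time in system W.
fn littles_law(arrival_rate: f64, time_in_system: f64) -> f64 {
    arrival_rate * time_in_system
}

fn main() {
    // 5% serial caps speedup near 20x, no matter how many cores you add.
    println!("1024 cores, 5% serial: {:.1}x", amdahl_speedup(0.05, 1024.0));
    // 100 req/s, each spending 0.25 s in the system → 25 requests in flight.
    println!("in-flight: {}", littles_law(100.0, 0.25));
}
```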
Single-core CPU
Making one thread go faster; the hardware already does most of this for you:
- ILP: pipelining, dual issue, OoO, register renaming
- Branch prediction + speculative execution, and their security costs (Spectre, Meltdown)
- CPU performance walls, why clocks stopped getting faster
- Value speculation, guess the result, verify later
- SIMD / data parallelism, one instruction, many lanes
- Compiler optimizations: loop opts, alias analysis
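A feel for branches vs. data parallelism: the same reduction written with a data-dependent branch and in a branch-free form. The branch-free version is the kind of loop the compiler can auto-vectorize with SIMD compare + mask, though whether it actually does depends on target and optimization flags; check the assembly.

```rust
/// Branchy: one data-dependent branch per element. On random input the
/// predictor misses often and the pipeline stalls.
fn sum_positive_branchy(data: &[i32]) -> i64 {
    let mut sum = 0i64;
    for &x in data {
        if x >= 0 {
            sum += x as i64;
        }
    }
    sum
}

/// Branch-free: the condition becomes a 0/1 multiplier, so there is
/// nothing to mispredict and each lane can be computed independently.
fn sum_positive_branchless(data: &[i32]) -> i64 {
    data.iter().map(|&x| (x as i64) * (x >= 0) as i64).sum()
}

fn main() {
    let data: Vec<i32> = (0..1_000).map(|i| if i % 3 == 0 { -i } else { i }).collect();
    // Same answer either way; only the execution profile differs.
    assert_eq!(sum_positive_branchy(&data), sum_positive_branchless(&data));
    println!("{}", sum_positive_branchy(&data));
}
```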
Memory hierarchy
Cache is usually where your time actually goes:
- Cache coherence + MESI + snoopy cache
- False sharing, two threads writing near each other trash the cache line
- Write-back vs write-through
- Memory consistency, SC is the ideal, hardware gives you weaker
- Memory barriers, force ordering when you need it
- Memory reordering, what the compiler and CPU rearrange by default
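False sharing in miniature, assuming a 64-byte cache line (common on x86; verify for your CPU). Both layouts compute the same counts; only the unpadded one bounces a cache line between cores:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Unpadded: `a` and `b` sit on the same 64-byte cache line, so two threads
// hammering them separately still invalidate each other's line.
struct Unpadded {
    a: AtomicU64,
    b: AtomicU64,
}

// Aligning each counter to 64 bytes puts them on separate lines,
// removing the coherence traffic.
#[repr(align(64))]
struct Padded(AtomicU64);

fn main() {
    let u = Unpadded { a: AtomicU64::new(0), b: AtomicU64::new(0) };
    let p = [Padded(AtomicU64::new(0)), Padded(AtomicU64::new(0))];
    thread::scope(|s| {
        s.spawn(|| for _ in 0..1_000 { u.a.fetch_add(1, Ordering::Relaxed); });
        s.spawn(|| for _ in 0..1_000 { u.b.fetch_add(1, Ordering::Relaxed); });
        s.spawn(|| for _ in 0..1_000 { p[0].0.fetch_add(1, Ordering::Relaxed); });
        s.spawn(|| for _ in 0..1_000 { p[1].0.fetch_add(1, Ordering::Relaxed); });
    });
    println!("unpadded: {} bytes, padded pair: {} bytes",
             std::mem::size_of::<Unpadded>(),
             std::mem::size_of::<[Padded; 2]>());
}
```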
Concurrency
Using multiple cores on shared state, see Concurrency Problems for the failure taxonomy:
- Lock granularity, coarse is simple, fine is fast, pick wrong and suffer
- Lock convoys, threads queue behind a popular lock and stall
- Deadlock and the four conditions
- Reentrancy, can a function safely call itself?
- Atomics + CAS
- Lock-free and wait-free, plus the ABA hazard
- STM, optimistic concurrency at language level
- Thread pools, don’t spawn threads per task
- GIL, why Python threading is a lie
- Functional programming, immutability sidesteps most of this
- Data dependencies on the critical path
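The CAS retry loop is the core pattern behind the lock-free structures above (and where the ABA hazard lives once you swap pointers instead of counters). A sketch with std atomics:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Lock-free add via compare-and-swap: read the current value, try to
/// install current + delta, and retry if another thread raced us.
fn cas_add(counter: &AtomicU64, delta: u64) {
    let mut cur = counter.load(Ordering::Relaxed);
    // On failure, compare_exchange_weak hands back the fresh value,
    // so we retry from there until our swap wins.
    while let Err(actual) =
        counter.compare_exchange_weak(cur, cur + delta, Ordering::AcqRel, Ordering::Relaxed)
    {
        cur = actual;
    }
}

fn main() {
    let counter = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| for _ in 0..1_000 { cas_add(&counter, 1); });
        }
    });
    // No lock anywhere, yet no increments are lost.
    println!("{}", counter.load(Ordering::Relaxed));
}
```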
Async and I/O
Hiding latency when the bottleneck isn’t CPU:
- Non-blocking I/O, don’t block a thread on a syscall
- Futures and tokio
- Rate limiting, protect downstream services
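Tokio is an external crate, so here is the underlying mechanism with std only: a socket marked non-blocking, where `accept` returns immediately instead of parking the thread. An async runtime wraps exactly this in futures plus an event loop. (The helper name `poll_accept` is mine.)

```rust
use std::io::ErrorKind;
use std::net::TcpListener;

/// One non-blocking poll of the listener: returns right away either way.
fn poll_accept(listener: &TcpListener) -> std::io::Result<&'static str> {
    match listener.accept() {
        Ok(_) => Ok("connection ready"),
        // WouldBlock means "nothing yet; go do other work and come back",
        // which is the whole point of not blocking on the syscall.
        Err(e) if e.kind() == ErrorKind::WouldBlock => Ok("would block, not blocked"),
        Err(e) => Err(e),
    }
}

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0")?; // ephemeral port
    listener.set_nonblocking(true)?; // accept() no longer parks the thread
    println!("{}", poll_accept(&listener)?);
    Ok(())
}
```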
Rust as a performance language
Why the course teaches it, safety without GC overhead:
- Ownership + borrowing + lifetimes
- Fearless concurrency, compiler enforces no data races
- Arc + Mutex for shared state
- Unsafe when you need to break out
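The Arc + Mutex pattern in one small, runnable example; drop either piece and the compiler rejects the program rather than let a data race through:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Arc gives shared ownership across threads; Mutex gives exclusive
/// access to the data inside. Unsynchronized sharing simply won't compile.
fn parallel_count(threads: usize, per_thread: u64) -> u64 {
    let total = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let total = Arc::clone(&total); // bump the refcount, move the clone in
            thread::spawn(move || {
                for _ in 0..per_thread {
                    *total.lock().unwrap() += 1; // lock, bump, unlock on drop
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let n = *total.lock().unwrap();
    n
}

fn main() {
    println!("{}", parallel_count(4, 1_000)); // always 4000, never a torn count
}
```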
Reduced-resource computation
Sometimes the win is doing less work:
- Trade accuracy for time, approximate is often fine
- Early phase termination, stop when enough workers are done
- Loop perforation, skip iterations
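Loop perforation in its simplest form: compute a mean from every 7th element instead of all of them, trading a small accuracy loss for roughly 7x less work (a sketch; real perforation is applied by the compiler or runtime, not by hand):

```rust
/// Exact mean: touches every element.
fn mean_exact(data: &[f64]) -> f64 {
    data.iter().sum::<f64>() / data.len() as f64
}

/// Perforated mean: skip all but every `stride`-th element.
/// Fine when the data is roughly homogeneous; bias creeps in when the
/// stride resonates with structure in the data.
fn mean_perforated(data: &[f64], stride: usize) -> f64 {
    let sampled: Vec<f64> = data.iter().step_by(stride).copied().collect();
    sampled.iter().sum::<f64>() / sampled.len() as f64
}

fn main() {
    let data: Vec<f64> = (0..10_000).map(|i| (i % 100) as f64).collect();
    println!("exact:      {:.2}", mean_exact(&data));
    println!("perforated: {:.2}", mean_perforated(&data, 7)); // ~7x less work
}
```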
Offload to GPU
Massive parallelism for throughput-bound work:
- GPU programming + CUDA kernels
- Heterogeneous programming, CPU + GPU together
- Applications: password cracking, bitcoin mining, LLMs
Scale out
When one machine isn’t enough:
- Clusters + cloud
- Load balancing
- Architecture: monolith vs microservices
- N+1 query problem, classic microservice tax
- Query optimization + query cost + join elimination
- Queueing theory: M/M/1, M/M/k, ergodicity
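The M/M/1 numbers are worth internalizing: pushing utilization from 80% to 95% of capacity quadruples the mean time in system. A small sketch (function name is mine):

```rust
/// M/M/1 queue: Poisson arrivals at rate λ, exponential service at rate μ,
/// one server. Stable only when utilization ρ = λ/μ < 1; otherwise the
/// queue grows without bound.
fn mm1_wait(lambda: f64, mu: f64) -> Option<f64> {
    let rho = lambda / mu;
    if rho >= 1.0 {
        return None; // not ergodic: no steady state exists
    }
    Some(1.0 / (mu - lambda)) // mean time in system, W = 1/(μ − λ)
}

fn main() {
    // A server that handles 10 req/s:
    println!("{:?}", mm1_wait(8.0, 10.0));  // ρ = 0.80 → W = 0.5 s
    println!("{:?}", mm1_wait(9.5, 10.0));  // ρ = 0.95 → W = 2.0 s
    println!("{:?}", mm1_wait(10.0, 10.0)); // ρ = 1.00 → unstable
}
```

This is why "run everything at 100% utilization" is a latency disaster, and why load balancing and scaling out buy headroom, not just raw throughput.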
Measurement (do this first!)
You can’t optimize what you can’t measure:
- Profiling + aggregate counters
- Bottleneck analysis, find the slowest device
- Causal profiling, how much would it actually help if I sped this part up?
- Load testing
- Cachegrind, cache-miss profiling
- POGO (profile-guided optimization), feed the profile back to the compiler
- Benchmarking pitfalls + Miri
- Self-optimizing software + dynamic recompilation
- Accidentally quadratic, the bug you never see until N gets big
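An accidentally-quadratic in the wild, as a sketch: deduplicating with `Vec::contains` hides a linear scan inside the loop, so it passes every small test and then falls over when N gets big.

```rust
use std::collections::HashSet;

/// O(n²): `contains` is a linear scan, and it runs once per element.
/// Invisible at n = 100, brutal at n = 1_000_000.
fn dedup_quadratic(items: &[u32]) -> Vec<u32> {
    let mut out = Vec::new();
    for &x in items {
        if !out.contains(&x) {
            out.push(x);
        }
    }
    out
}

/// O(n): same result, but membership is an O(1)-average hash lookup.
fn dedup_linear(items: &[u32]) -> Vec<u32> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for &x in items {
        if seen.insert(x) {
            out.push(x);
        }
    }
    out
}

fn main() {
    let items = [3, 1, 3, 2, 1];
    // Identical output; wildly different scaling.
    println!("{:?}", dedup_quadratic(&items));
    println!("{:?}", dedup_linear(&items));
}
```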
Deploy and operate
Performance doesn’t stop at the binary:
Quick “my program is slow, what do I do”
- CPU-bound single thread: profile, look at compiler flags, SIMD, branch hints
- CPU-bound multi-core: finer locks, lock-free structures, thread pool
- Memory-bound: check false sharing, cache-friendly layout, cachegrind
- I/O-bound: async, batching, rate limit downstream
- Contention / waiting: queueing theory, load balancing, scale out
- Embarrassingly parallel: GPU or cluster
- “Good enough” tolerated: loop perforation, approximate