Profiling
Measuring where a program spends time or resources, so optimization targets the actual bottleneck. Rule 0: “don’t guess, measure.”
Why?
Intuition about what is fast is usually wrong, at the macro and micro level. Knuth: “premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
Two styles:
- Sampling profilers: interrupt periodically (~ms) and record the current call stack. Low overhead, statistical. Examples: perf, pprof, py-spy, Xcode Instruments, samply
- Instrumentation profilers: insert probes at function entry/exit. Exact counts, higher overhead, distorts timings on small functions. Examples: gprof, callgrind, DTrace
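A toy instrumentation profiler is easy to sketch in Python: sys.setprofile installs a probe that fires at every function entry/exit, so counts are exact but overhead is paid on every call (the workload/helper names are made up for illustration):

```python
import sys
from collections import defaultdict

calls = defaultdict(int)

def probe(frame, event, arg):
    # Fires on every Python function entry ("call") and exit ("return").
    if event == "call":
        calls[frame.f_code.co_name] += 1

def helper():
    pass

def workload():
    for _ in range(1000):
        helper()

sys.setprofile(probe)   # install entry/exit probes
workload()
sys.setprofile(None)    # remove probes

print(calls["helper"])  # exact count: 1000
```

A sampling profiler would instead interrupt on a timer and see helper only in proportion to the time spent there, never an exact count.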
Dimensions:
- CPU time: where cycles go. Start here
- Wall time: includes I/O waits. Often more useful for latency-bound services
- Memory: allocations, peak usage, leaks (heaptrack, valgrind massif)
- Cache/branch events: perf stat counters for cache misses, branch mispredicts, IPC
- System calls: strace/dtruss
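The CPU-time vs wall-time split can be seen directly from Python's two clocks; a sleep stands in for an I/O wait that costs wall-clock time but almost no CPU:

```python
import time

wall_start = time.perf_counter()   # wall time
cpu_start = time.process_time()    # CPU time of this process

time.sleep(0.2)                    # stand-in for an I/O wait

wall = time.perf_counter() - wall_start   # ~0.2 s
cpu = time.process_time() - cpu_start     # ~0 s: sleeping burns no cycles

print(f"wall ~{wall:.2f}s, cpu ~{cpu:.3f}s")
```

A CPU profiler attributes only the second number; a latency-bound service cares about the first.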
Output forms:
- Flat profile: function → time, no caller context
- Call graph: who calls whom, inclusive/exclusive time
- Flame graph: stack depth on y, width = time. Best for seeing hot paths at a glance
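The step that turns sampled stacks into a flame graph is a fold: identical stacks merge, and the merged count becomes the bar's width. A minimal sketch, with made-up sampled stacks (root first, one list per sampling interrupt):

```python
from collections import Counter

samples = [
    ["main", "parse", "read"],
    ["main", "parse", "read"],
    ["main", "render"],
    ["main", "parse", "lex"],
]

# Fold: join each stack into one line, count duplicates.
folded = Counter(";".join(stack) for stack in samples)

for stack, count in folded.most_common():
    print(f"{stack} {count}")   # widest bar: main;parse;read 2
```

This "folded" format (stack;stack;stack count) is what flame-graph renderers consume.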
Caveats
Inlining, tail calls, and JITs confuse attribution. Profile a realistic workload. Release build differs from debug. Multithreaded profiling needs per-thread views.
From ECE459 L24
Four lenses on program behaviour [Sit21]:
- Logs: timestamped messages + attributes. The one tool to keep if forced to pick one
- Counters: stored count of events (cache misses, foo() calls, logins)
- Profiles: higher-level overviews of where time goes in aggregate
- Traces: recordings of event sequences, how the program reached its current state
Logging
Log input, stages, output. Include the reason for a decision, not just the outcome. Levels (error/warn/info/debug) let readers filter.
Timestamp precision matches your timescales (seconds vs µs vs ns). Use UTC: DST makes a minute happen twice going back, and time zones split teams. Don’t do this on a Mars rover.
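Both points fit in a small sketch with Python's logging module: levels let readers filter, and pointing the formatter's converter at time.gmtime makes timestamps UTC (the "checkout" logger name and messages are hypothetical):

```python
import logging
import time
from io import StringIO

buf = StringIO()
handler = logging.StreamHandler(buf)
fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s",
                        datefmt="%Y-%m-%dT%H:%M:%SZ")
fmt.converter = time.gmtime          # timestamps in UTC, not local time
handler.setFormatter(fmt)

log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)           # readers filter: debug is dropped

log.info("card declined: reason=expired")   # reason, not just outcome
log.debug("raw gateway response ...")        # filtered out at INFO

print(buf.getvalue())
```

Flipping the level to DEBUG during an investigation turns the second line back on without touching call sites.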
Over-logging hurts
Three failure modes: drowning out useful info, leaking PII (bank details in web-team logs, a real Privacy Commissioner of Canada case), and making tracing overhead significant.
Overhead ballpark [Sit21]:
- Trace every conditional branch: ~20x slowdown. Dev only
- Timestamp each function entry/exit: 1.2 to 1.5x. Sometimes prod, temporarily
- Timestamp each syscall entry/exit: <1%. Fine for prod always
- Valgrind-style memory trace: huge slowdown, checks init, bounds, leak origin stack
- Disk / network traces: tiny overhead (both slow vs CPU)
- Lock tracing: log every lock/unlock, focus on contended locks since waits are the interesting part
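The lock-tracing idea can be sketched as a wrapper that times each acquire and records only waits above a threshold, so uncontended lock/unlock pairs never reach the log (TracedLock and its parameters are made up):

```python
import threading
import time

class TracedLock:
    def __init__(self, name, threshold=0.001):
        self._lock = threading.Lock()
        self.name = name
        self.threshold = threshold   # waits shorter than this are ignored
        self.contended = []          # (thread name, seconds waited)

    def __enter__(self):
        start = time.perf_counter()
        self._lock.acquire()
        waited = time.perf_counter() - start
        if waited >= self.threshold:   # only contended acquires are logged
            self.contended.append((threading.current_thread().name, waited))
        return self

    def __exit__(self, *exc):
        self._lock.release()

lock = TracedLock("stats")

def worker():
    with lock:
        time.sleep(0.05)   # hold long enough to force contention

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads: t.start()
for t in threads: t.join()

print(len(lock.contended))   # number of threads that had to wait
```

The first thread in acquires a free lock and is never logged; the waiters are, with how long they blocked.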
Trace data volume
20 GB/s memory bandwidth, 64 B cache lines, 8 B per trace entry gives 2.5 GB/s of trace data. Fills an 8 TB drive in under an hour. Either trace less, or trace for less time.
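The back-of-envelope above as arithmetic (one 8-byte trace entry per 64-byte cache line moved):

```python
bandwidth = 20e9    # bytes/s of memory traffic
line = 64           # bytes per cache line
entry = 8           # bytes per trace entry

trace_rate = bandwidth / line * entry   # bytes of trace per second
print(trace_rate / 1e9)                 # 2.5 GB/s of trace data

disk = 8e12                             # 8 TB drive
print(disk / trace_rate / 60)           # ~53 minutes to fill it
```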
Aggregates need context
Counters and averages mean nothing without a baseline. Login avg = 0.75s: good or bad? If the previous PR had 0.5s, you regressed, unless login was deliberately slowed to harden against brute force.
Average response 1.27s vs a 10s deadline sounds fine, until the max is 15s. Use max, p95, p99 for hard deadlines. Averages also hide burstiness: avg 7 req/s can be evenly spaced or a single burst.
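The tail-vs-average point, on a hypothetical 100-sample latency distribution with a nearest-rank percentile:

```python
import math

# 90 fast requests, 8 slow, 2 that blow a 10 s deadline.
latencies = sorted([1.0] * 90 + [5.0] * 8 + [15.0] * 2)

def percentile(xs, q):
    # Nearest-rank percentile on sorted data.
    return xs[math.ceil(q / 100 * len(xs)) - 1]

avg = sum(latencies) / len(latencies)
print(avg)                        # 1.6 — looks fine vs a 10 s deadline
print(percentile(latencies, 95))  # 5.0
print(percentile(latencies, 99))  # 15.0 — deadline blown
print(max(latencies))             # 15.0
```

The average never hints at the two 15 s outliers; p99 and max do.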
Aggregation window
Choice of window matters. Purchases/hour looks very different Sunday night vs Monday noon.
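Bucketing the same made-up event stream by hour vs by day shows how the window changes the picture:

```python
from collections import Counter

# Hypothetical purchase events as (day, hour) pairs.
events = [("sun", 23)] * 2 + [("mon", 12)] * 40 + [("mon", 13)] * 35

per_hour = Counter(events)                   # window = one hour
per_day = Counter(day for day, _ in events)  # window = one day

print(per_hour[("sun", 23)], per_hour[("mon", 12)])  # 2 vs 40
print(per_day["sun"], per_day["mon"])                # 2 vs 75
```

Sunday 11 pm and Monday noon differ by 20x per hour; daily totals smooth that out entirely.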
Visualization
Analog speedometers make acceleration obvious where digital ones don’t, the same reason graphs beat tables for trends. Dashboards collect graphs (trends, histograms) so many metrics stay digestible. Less useful when optimizing one specific thing.