Profiling

Measuring where a program spends time or resources, so optimization targets the actual bottleneck. Rule 0: “don’t guess, measure.”

Why?

Intuition about what is fast is usually wrong, at both the macro and micro levels. Knuth: “premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”

Two styles:

  • Sampling profilers: interrupt periodically (~ms) and record the current call stack. Low overhead, statistical. Examples: perf, pprof, py-spy, Xcode Instruments, samply
  • Instrumentation profilers: insert probes at function entry/exit. Exact counts, higher overhead, distorts timings on small functions. Examples: gprof, callgrind, DTrace

Dimensions:

  • CPU time: where cycles go. Start here
  • Wall time: includes I/O waits. Often more useful for latency-bound services
  • Memory: allocations, peak usage, leaks (heaptrack, valgrind massif)
  • Cache/branch events: perf stat counters for cache misses, branch mispredicts, IPC
  • System calls: strace / dtruss

Output forms:

  • Flat profile: function to time
  • Call graph: who calls whom, inclusive/exclusive time
  • Flame graph: stack depth on y, width = time. Best for seeing hot paths at a glance

Caveats

Inlining, tail calls, and JITs confuse attribution. Profile a realistic workload. Release build differs from debug. Multithreaded profiling needs per-thread views.

From ECE459 L24

Four lenses on program behaviour [Sit21]:

  • Logs: timestamped messages + attributes. The one tool to keep if forced to pick one
  • Counters: stored count of events (cache misses, foo() calls, logins)
  • Profiles: higher-level overviews of where time goes in aggregate
  • Traces: recordings of event sequences, how the program reached its current state

Logging

Log input, stages, output. Include the reason for a decision, not just the outcome. Levels (error/warn/info/debug) let readers filter.

Timestamp precision should match your timescales (seconds vs µs vs ns). Use UTC: the DST fall-back makes an hour of local timestamps repeat, and time zones split distributed teams. Don’t do this on a Mars rover.

Over-logging hurts

Three failure modes: drowning out useful info, leaking PII (bank details in web-team logs, a real Privacy Commissioner of Canada case), and making tracing overhead significant.

Overhead ballpark [Sit21]:

  • Trace every conditional branch: ~20x slowdown. Dev only
  • Timestamp each function entry/exit: 1.2 to 1.5x. Sometimes prod, temporarily
  • Timestamp each syscall entry/exit: <1%. Fine for prod always
  • Valgrind-style memory trace: huge slowdown, checks init, bounds, leak origin stack
  • Disk / network traces: tiny overhead (both slow vs CPU)
  • Lock tracing: log every lock/unlock, focus on contended locks since waits are the interesting part

Trace data volume

20 GB/s memory bandwidth ÷ 64 B cache lines × 8 B per trace entry = 2.5 GB/s of trace data. That fills an 8 TB drive in under an hour. Either trace less, or trace for less time.

Aggregates need context

Counters and averages mean nothing without a baseline. Login averages 0.75s: good or bad? If it was 0.5s before the previous PR, you regressed, unless login was deliberately slowed to harden against brute force.

Average response 1.27s vs a 10s deadline sounds fine, until the max is 15s. Use max, p95, p99 for hard deadlines. Averages also hide burstiness: avg 7 req/s can be evenly spaced or a single burst.

Aggregation window

Choice of window matters. Purchases/hour looks very different Sunday night vs Monday noon.

Visualization

Analog speedometers make acceleration obvious where digital ones don’t, for the same reason graphs beat tables for trends. Dashboards collect graphs (trends, histograms) so many metrics stay digestible. Less useful when optimizing one specific thing.