Profiling

Aggregate Counters

Metrics that summarize many events into a single number over a time window: requests/sec, error rate, p99 latency, bytes transmitted.

Why?

Cheaper than per-event logs, visible in dashboards, alertable. The backbone of observability before you open a profiler.

Typical counter types:

  • Counter: monotonically increasing. Requests served, bytes sent. Rate = change in counter / elapsed time
  • Gauge: point-in-time value that can go up or down. Memory in use, queue depth
  • Histogram / summary: latency distribution, bucketed (histogram) or as precomputed quantiles (summary). Lets you estimate p50, p95, p99 and see where the tail lives
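
A minimal sketch of how the first and third types turn into numbers on a dashboard. The function names, the restart handling, and the bucket layout are assumptions made for illustration and do not mirror any particular library's API.

```python
import math

def rate(prev_value, prev_time, curr_value, curr_time):
    """Per-second rate from two samples of a monotonically increasing counter."""
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # Assumption: a counter that went backwards was reset to zero
        # (e.g. the process restarted), so everything it shows is new.
        delta = curr_value
    return delta / elapsed

def bucket_quantile(q, upper_bounds, counts):
    """Upper bound of the bucket that contains quantile q (0 < q <= 1).

    upper_bounds: sorted finite bucket upper bounds.
    counts: per-bucket counts, with one extra final entry for the
            open-ended overflow bucket.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("empty histogram")
    target = q * total
    cumulative = 0
    for bound, count in zip(list(upper_bounds) + [math.inf], counts):
        cumulative += count
        if cumulative >= target:
            return bound
    return math.inf

# Hypothetical latency histogram: buckets <=0.1s, <=0.5s, <=1.0s, 1s+
bounds = [0.1, 0.5, 1.0]
counts = [900, 80, 15, 5]
print(rate(100, 0.0, 160, 60.0))              # 60 requests in 60 s
print(bucket_quantile(0.99, bounds, counts))  # bucket holding the p99
```

A result of inf means the requested quantile landed in the open-ended last bucket, so the histogram can only say "more than the largest bound".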

Pitfalls

  • Averages lie: mean latency is almost useless when the p99 is what users feel
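  • Aggregation across servers: merging histograms with identical bucket boundaries is fine; averaging per-server averages (or percentiles) isn’t, unless each average is weighted by its request count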
  • Long-tail truncation: if the last bucket is “1s+”, you can’t tell a 1s p99 from a 60s one
  • Cardinality explosion: labeling by user_id or URL path creates one time series per distinct value, which gets expensive to store and query
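
The aggregation pitfall can be checked with a few lines of arithmetic. This is a sketch with made-up numbers for two hypothetical servers; the helper names are illustrative only.

```python
def merge_histograms(a, b):
    """Merge two histograms that share the same bucket boundaries."""
    if len(a) != len(b):
        raise ValueError("histograms must use identical buckets")
    return [x + y for x, y in zip(a, b)]

def weighted_mean(means_and_counts):
    """Combine per-server means correctly by weighting each by its request count."""
    total = sum(count for _, count in means_and_counts)
    return sum(mean * count for mean, count in means_and_counts) / total

# Hypothetical servers: A is fast and busy, B is slow and nearly idle.
server_a = (0.1, 900)   # (mean latency in seconds, request count)
server_b = (1.0, 100)

naive = (server_a[0] + server_b[0]) / 2       # about 0.55 s, overstates latency
correct = weighted_mean([server_a, server_b])  # about 0.19 s

# Bucket counts add directly because both servers use the same buckets.
merged = merge_histograms([880, 15, 5], [20, 50, 30])
print(naive, correct, merged)
```

The naive figure treats both servers as equally loaded, which is why it drifts toward the slow, idle one.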

Standard stacks: Prometheus + Grafana (metrics), OpenTelemetry (instrumentation), StatsD (legacy).

From ECE459 L24

Counters are cheap, so counting every occurrence of an event is plausible where full tracing isn’t. Space cost is tiny vs a trace.

The simplest aggregate is the sum. “Average response time per request” is an aggregate (sum divided by count over a window), useful because no human can take in 50 000 individual response times. Other typical aggregates: request count, requests by type, error rate, p95/max response time.

Context is mandatory

A 0.75s login is bad if the baseline was 0.5s, fine if it was 2.0s, and intentional if the delay exists to slow brute-force attacks.

Averages mislead. An average of 1.27s under a 10s deadline looks fine, but the max could still be 15s, so some fraction of requests miss the deadline. For hard deadlines look at max / p95 / p99, not the mean.
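
A small sketch of that situation. The sample is invented so that the mean comes out at 1.27s with a 15s worst case; the nearest-rank percentile helper is one simple definition among several.

```python
import math
import statistics

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

DEADLINE_S = 10.0

# Invented sample: 97 fast requests, two slow ones, one very slow one.
latencies = [1.0] * 97 + [7.5, 7.5, 15.0]

mean = statistics.mean(latencies)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
worst = max(latencies)
missed = sum(1 for t in latencies if t > DEADLINE_S) / len(latencies)

print(f"mean={mean:.2f}s p95={p95}s p99={p99}s max={worst}s missed={missed:.0%}")
```

In this sample even p99 stays under the deadline while the max does not, so for a truly hard deadline the max and the fraction of misses are the numbers to watch.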

Averages also hide burstiness: the same 7 req/s average can come from evenly spaced, clustered, or bursty arrivals. Queueing theory is needed to reason about arrivals.
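
To make the burstiness point concrete, here is a sketch with two synthetic arrival traces that share the same 7 req/s average over 10 seconds. The traces and the helper name are assumptions for illustration.

```python
from collections import Counter

def per_second_counts(arrival_times, duration_s):
    """Number of arrivals in each one-second slot of the window."""
    counts = Counter(int(t) for t in arrival_times)
    return [counts.get(second, 0) for second in range(duration_s)]

DURATION_S = 10

# 70 arrivals spread evenly across 10 s versus 70 arrivals in the first second.
even = [i / 7 for i in range(70)]
bursty = [i / 70 for i in range(70)]

even_counts = per_second_counts(even, DURATION_S)
bursty_counts = per_second_counts(bursty, DURATION_S)

print(sum(even_counts) / DURATION_S, max(even_counts))      # average, peak
print(sum(bursty_counts) / DURATION_S, max(bursty_counts))  # average, peak
```

Both traces report the same average, but the peak per-second load differs by a factor of ten, which is what determines queue build-up.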

Window choice

Full-execution windows fit one-shot tasks (“video encoded in 3h21m, so 210.9 fps”). Services need shorter windows, but Sunday-night purchases/hour is not comparable to Monday-noon.
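
One way to keep service windows comparable is to bucket events by weekday and hour, so that a Sunday-night hour is only ever compared with other Sunday-night hours. A minimal sketch, assuming purchase timestamps are available as datetime objects; the function name and sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime

def purchases_per_slot(timestamps):
    """Count events per (weekday, hour) slot; weekday 0 is Monday, 6 is Sunday."""
    return Counter((ts.weekday(), ts.hour) for ts in timestamps)

# Hypothetical purchases: two on a Sunday night, one on a Monday at noon.
purchases = [
    datetime(2024, 1, 7, 22, 15),
    datetime(2024, 1, 7, 22, 40),
    datetime(2024, 1, 8, 12, 5),
]

slots = purchases_per_slot(purchases)
print(slots[(6, 22)], slots[(0, 12)])
```

Counts from the same slot in different weeks can then be compared directly.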