Bottleneck Analysis
Identifying the single resource that caps throughput or dominates latency: the thing to fix, as opposed to everything that is merely “slow.”
Why?
In any system one resource is always the bottleneck. Making anything else faster delivers nothing. See Amdahl’s Law.
The four usual suspects (USE method, Brendan Gregg):
- CPU: utilization high, context switches, run queue length
- Memory: RSS high, swap in/out, major page faults, OOM kills
- Disk: IOPS saturated, queue depth high, high iowait
- Network: bandwidth saturated, retransmits, high latency, dropped packets
Tools:
- Linux: top/htop, vmstat, iostat, pidstat, sar, nicstat, ss, perf, bpftrace
- traceroute/mtr for network path bottlenecks
- APM (Datadog, New Relic) for in-app latency attribution
Method:
- Measure utilization of each resource under realistic load
- The saturated one is the bottleneck
- Fix it (scale up, optimize, cache), then re-profile. The bottleneck will have moved
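The measure-each-resource loop can be sketched as a one-shot sweep over raw /proc counters (a rough illustration, not a replacement for the tools above; the device filters are assumptions):

```shell
#!/bin/sh
# Rough USE-style sweep: one saturation signal per resource, from /proc.

# CPU: run-queue length vs. core count (procs_running includes this sampler).
running=$(awk '/^procs_running/ { print $2 }' /proc/stat)
cores=$(nproc)
echo "CPU: run queue $running on $cores cores"

# Memory: cumulative swap-in/swap-out counters since boot.
awk '/^pswpin|^pswpout/ { print "Memory:", $1, $2 }' /proc/vmstat

# Disk: cumulative ms spent doing I/O, per device (field 13 of /proc/diskstats).
awk '$3 !~ /loop|ram/ { print "Disk:", $3, "io_ms=" $13 }' /proc/diskstats

# Network: dropped receive packets per interface (field 5 after the name).
awk -F'[: ]+' 'NR > 2 { print "Net:", $2, "rx_drop=" $6 }' /proc/net/dev
```

These are cumulative counters, so for rates you would sample twice and diff; that is essentially what vmstat/iostat/nicstat do for you.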
Common mistake
Optimizing the part that is easy or familiar instead of the part that is slow. Improving a component off the critical path delivers nothing.
From ECE459 L26
“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” (Sherlock Holmes)
Most profiling focuses on CPU, but check the assumption first. Candidates: CPU, memory, disk, network, locks. No required order.
The “bottleneck” may not be a perf problem at all. Android “app not responding” dialogs are often heavy work on the UI thread [LVVLP15], i.e. a bug.
CPU: load averages
uptime / top shows 1/5/15-min load averages.
Bridge-lane analogy [And15]
One core = one lane.
- 0.00 to 0.99: under capacity, no delay
- 1.00: exactly at capacity
- 2.00: bridge full plus equal queue waiting
On a quad-core, 3.00 means 3 lanes used, 75% utilization. Rules of thumb:
- Consistent load > 0.70 per core: investigate
- Consistent >= 1.00: serious
- >= 5.00: red alert
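The per-core normalization is the part people forget; it can be sketched directly from /proc/loadavg (which is where uptime and top read it):

```shell
#!/bin/sh
# Express the 1-minute load average per core, per the bridge-lane analogy.
read load1 _ < /proc/loadavg   # first field is the 1-minute load average
cores=$(nproc)
per_core=$(awk -v l="$load1" -v c="$cores" 'BEGIN { printf "%.2f", l / c }')
echo "1-min load $load1 over $cores cores = $per_core per core"
```

A per-core value of 3.00 on a quad-core is the "3 of 4 lanes used" case; the same 3.00 on a single core means a queue twice the size of the bridge.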
Memory and disk
GC language? Look for frequent or long GC runs [LVVLP15]. RAM “full” is not itself bad (free RAM is wasted RAM); swapping is bad.
Diagnose swapping with vmstat 5, watching the si / so (swap in/out) columns. All zeros: swap is not the bottleneck. Nonzero every interval: big problem.
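vmstat's si/so columns come from cumulative kernel counters, so the same check can be sketched by sampling /proc/vmstat twice (a minimal sketch; a real check would convert the diff to pages/s):

```shell
#!/bin/sh
# Sample the cumulative swap-in/swap-out counters twice, one second apart.
snap() { awk '/^pswpin|^pswpout/ { print $1, $2 }' /proc/vmstat; }
before=$(snap)
sleep 1
after=$(snap)
echo "before: $before"
echo "after:  $after"
# Identical values across the interval => no swap traffic right now.
```

If the two snapshots match, swapping is not your bottleneck, regardless of how "full" RAM looks.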
Page faults via ps -eo min_flt,maj_flt,cmd: minor faults are resolved in memory (e.g., the page is already resident or shared with another process), major faults require a read from disk. These counters are lifetime totals, not rates.
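ps reads those counters from each process's /proc/<pid>/stat; pulling them for a single process can be sketched like this (fields 10 and 12, per proc(5)):

```shell
#!/bin/sh
# Lifetime minor/major fault counts for one process (here: the reader, awk).
minflt=$(awk '{ print $10 }' /proc/self/stat)
majflt=$(awk '{ print $12 }' /proc/self/stat)
echo "minor faults: $minflt, major faults: $majflt"
```

A process with a huge and growing majflt count is paging its working set in from disk, which is exactly the "RAM full in a way that matters" signal.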
Disk: iostat -dx /dev/sda 5. The %util column shows device saturation. 100% means disk is the limit. iotop (root) shows which process.
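What %util measures can be sketched from /proc/diskstats: field 13 is cumulative milliseconds the device spent with I/O in flight, so busy-ms over a 1000 ms window approximates %util (the device picked here is just the first non-virtual one; substitute your own):

```shell
#!/bin/sh
# Approximate iostat's %util for one device over a one-second window.
io_ms() { awk -v d="$1" '$3 == d { print $13 }' /proc/diskstats; }
dev=$(awk '$3 !~ /loop|ram/ { print $3; exit }' /proc/diskstats)
if [ -n "$dev" ]; then
  t0=$(io_ms "$dev")
  sleep 1
  t1=$(io_ms "$dev")
  echo "$dev: $((t1 - t0)) ms busy out of 1000 ms"
else
  echo "no block devices visible"
fi
```

Close to 1000 ms busy per second is the iostat %util = 100% case: the disk is the limit.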
Network
nload: live current/avg/min/max/total throughput. Even below nominal link speed you can be network-limited due to intermediate hardware (power-line Ethernet, wireless with walls/interference).
Latency is not bandwidth (JZ)
Hong Kong users on a Frankfurt backend: fine bandwidth, awful latency.
traceroute shows per-hop latency. Latency has a speed-of-light floor: a NY-to-Lyon ping of ~73 ms already runs at ~84% of the speed of light in fibre.
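The ~84% figure checks out arithmetically, assuming a great-circle distance of roughly 6,150 km and light in fibre at about 2/3 c (~200,000 km/s):

```shell
#!/bin/sh
# Round-trip physical floor: 2 * distance / speed of light in fibre.
# 6150 km and 200,000 km/s are assumed figures, not from the source.
floor_ms=$(awk 'BEGIN { printf "%.1f", 2 * 6150 / 200000 * 1000 }')
echo "physical floor: ${floor_ms} ms of the measured ~73 ms"
```

61.5 / 73 is about 84%: only ~12 ms of that ping is addressable by routing or hardware, and no optimization beats physics.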
Packet loss forces retransmits and stalls. Could be environmental, or a dying device.
Locks
No general-purpose user-space lock-tracing tool exists; the POSIX pthreads spec includes no lock tracing [Sit21].
Symptom: unexpectedly low CPU usage that I/O wait does not explain; many threads are probably blocked on locks. Workaround: instrument the code yourself, logging "entering A / in A / leaving A". Intel VTune claims to find lock contention (costs money, vendor-specific); perf lock does not really capture user-space contention.
Deadlocks are a correctness issue, not performance. Use Helgrind from the Valgrind suite.