Bottleneck Analysis
Identifying the single resource that caps throughput or dominates latency: the thing to fix, as opposed to everything that is merely “slow.”
Why?
In any system one resource is always the bottleneck. Making anything else faster delivers nothing. See Amdahl’s Law.
The four usual suspects (USE method, Brendan Gregg):
- CPU: utilization high, context switches, run queue length
- Memory: RSS high, swap in/out, major page faults, OOM kills
- Disk: IOPS saturated, queue depth high, high iowait
- Network: bandwidth saturated, retransmits, high latency, dropped packets
Tools:
- Linux: top/htop, vmstat, iostat, pidstat, sar, nicstat, ss, perf, bpftrace
- traceroute/mtr for network path bottlenecks
- APM (Datadog, New Relic) for in-app latency attribution
Method:
- Measure utilization of each resource under realistic load
- The saturated one is the bottleneck
- Fix it (scale up, optimize, cache), then re-profile. The bottleneck will have moved
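The measure-each-resource loop can be sketched as a one-shot sweep over raw /proc counters (a rough illustration, not a replacement for the tools above; the device filters are assumptions):

```shell
#!/bin/sh
# Rough USE-style sweep: one saturation signal per resource, from /proc.

# CPU: run-queue length vs. core count (procs_running includes this sampler).
running=$(awk '/^procs_running/ { print $2 }' /proc/stat)
cores=$(nproc)
echo "CPU: run queue $running on $cores cores"

# Memory: cumulative swap-in/swap-out counters since boot.
awk '/^pswpin|^pswpout/ { print "Memory:", $1, $2 }' /proc/vmstat

# Disk: cumulative ms spent doing I/O, per device (field 13 of /proc/diskstats).
awk '$3 !~ /loop|ram/ { print "Disk:", $3, "io_ms=" $13 }' /proc/diskstats

# Network: dropped receive packets per interface (field 5 after the name).
awk -F'[: ]+' 'NR > 2 { print "Net:", $2, "rx_drop=" $6 }' /proc/net/dev
```

These are cumulative counters, so for rates you would sample twice and diff; that is essentially what vmstat/iostat/nicstat do for you.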
Common mistake
Optimizing the part that is easy or familiar instead of the part that is slow. Improving a component off the critical path delivers nothing.
From ECE459 L26
“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” (Sherlock Holmes)
Most profiling focuses on CPU, but check the assumption first. Candidates: CPU, memory, disk, network, locks. No required order.
The “bottleneck” may not be a perf problem at all. Android “app not responding” dialogs are often heavy work on the UI thread [LVVLP15], i.e. a bug.
CPU: load averages
uptime / top shows 1/5/15-min load averages.
Bridge-lane analogy [And15]
One core = one lane.
- 0.00 to 0.99: under capacity, no delay
- 1.00: exactly at capacity
- 2.00: bridge full plus equal queue waiting
On a quad-core, 3.00 means 3 lanes used, 75% utilization. Rules of thumb:
- Consistent load > 0.70 per core: investigate
- Consistent >= 1.00: serious
- >= 5.00: red alert
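The per-core normalization is the part people forget; it can be sketched directly from /proc/loadavg (which is where uptime and top read it):

```shell
#!/bin/sh
# Express the 1-minute load average per core, per the bridge-lane analogy.
read load1 _ < /proc/loadavg   # first field is the 1-minute load average
cores=$(nproc)
per_core=$(awk -v l="$load1" -v c="$cores" 'BEGIN { printf "%.2f", l / c }')
echo "1-min load $load1 over $cores cores = $per_core per core"
```

A per-core value of 3.00 on a quad-core is the "3 of 4 lanes used" case; the same 3.00 on a single core means a queue twice the size of the bridge.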
Memory and disk
GC language? Look for frequent or long GC runs [LVVLP15]. RAM “full” is not itself bad (free RAM is wasted RAM); swapping is bad.
Diagnose swapping with vmstat 5, watching the si / so (swap in/out) columns. All zeros: swap is not the bottleneck. Nonzero every interval: big problem.
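vmstat's si/so columns come from cumulative kernel counters, so the same check can be sketched by sampling /proc/vmstat twice (a minimal sketch; a real check would convert the diff to pages/s):

```shell
#!/bin/sh
# Sample the cumulative swap-in/swap-out counters twice, one second apart.
snap() { awk '/^pswpin|^pswpout/ { print $1, $2 }' /proc/vmstat; }
before=$(snap)
sleep 1
after=$(snap)
echo "before: $before"
echo "after:  $after"
# Identical values across the interval => no swap traffic right now.
```

If the two snapshots match, swapping is not your bottleneck, regardless of how "full" RAM looks.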
Page faults via ps -eo min_flt,maj_flt,cmd: minor faults are resolved in memory (e.g., the page is already resident or shared with another process), major faults require a read from disk. These counters are lifetime totals, not rates.
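ps reads those counters from each process's /proc/<pid>/stat; pulling them for a single process can be sketched like this (fields 10 and 12, per proc(5)):

```shell
#!/bin/sh
# Lifetime minor/major fault counts for one process (here: the reader, awk).
minflt=$(awk '{ print $10 }' /proc/self/stat)
majflt=$(awk '{ print $12 }' /proc/self/stat)
echo "minor faults: $minflt, major faults: $majflt"
```

A process with a huge and growing majflt count is paging its working set in from disk, which is exactly the "RAM full in a way that matters" signal.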
Disk: iostat -dx /dev/sda 5. The %util column shows device saturation. 100% means disk is the limit. iotop (root) shows which process.
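What %util measures can be sketched from /proc/diskstats: field 13 is cumulative milliseconds the device spent with I/O in flight, so busy-ms over a 1000 ms window approximates %util (the device picked here is just the first non-virtual one; substitute your own):

```shell
#!/bin/sh
# Approximate iostat's %util for one device over a one-second window.
io_ms() { awk -v d="$1" '$3 == d { print $13 }' /proc/diskstats; }
dev=$(awk '$3 !~ /loop|ram/ { print $3; exit }' /proc/diskstats)
if [ -n "$dev" ]; then
  t0=$(io_ms "$dev")
  sleep 1
  t1=$(io_ms "$dev")
  echo "$dev: $((t1 - t0)) ms busy out of 1000 ms"
else
  echo "no block devices visible"
fi
```

Close to 1000 ms busy per second is the iostat %util = 100% case: the disk is the limit.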
Network
nload: live current/avg/min/max/total throughput. Even below nominal link speed you can be network-limited due to intermediate hardware (power-line Ethernet, wireless with walls/interference).
Latency is not bandwidth (JZ)
Hong Kong users on a Frankfurt backend: fine bandwidth, awful latency.
traceroute shows per-hop latency. Latency has a speed-of-light floor: a NY-to-Lyon ping of ~73 ms already runs at ~84% of the speed of light in fibre.
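The ~84% figure checks out arithmetically, assuming a great-circle distance of roughly 6,150 km and light in fibre at about 2/3 c (~200,000 km/s):

```shell
#!/bin/sh
# Round-trip physical floor: 2 * distance / speed of light in fibre.
# 6150 km and 200,000 km/s are assumed figures, not from the source.
floor_ms=$(awk 'BEGIN { printf "%.1f", 2 * 6150 / 200000 * 1000 }')
echo "physical floor: ${floor_ms} ms of the measured ~73 ms"
```

61.5 / 73 is about 84%: only ~12 ms of that ping is addressable by routing or hardware, and no optimization beats physics.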
Packet loss forces retransmits and stalls. Could be environmental, or a dying device.
Locks
No general-purpose user-space lock-tracing tool exists; the POSIX pthreads spec includes no lock tracing [Sit21].
Symptom: unexpectedly low CPU usage that I/O wait does not explain; many threads are probably blocked on locks. Workaround: instrument the code yourself, logging "entering A / in A / leaving A". Intel VTune claims to find lock contention (costs money, vendor-specific); perf lock does not really capture user-space contention.
Deadlocks are a correctness issue, not performance. Use Helgrind from the Valgrind suite.