Lock Convoy
A lock convoy is a degenerate form of lock contention where several same-priority threads line up behind a short-held lock and the system spends all its CPU on context switches instead of useful work. Covered in ECE459 L12.
Why does this happen?
- A takes the lock and does a tiny bit of work, but its quantum expires before releasing
- Scheduler parks A, wakes B
- B tries to take the lock, but A still owns it → B blocks
- Same thing happens to C, D, E as they get scheduled
- Eventually A runs again and releases the lock; the next waiter acquires it, and we’re back to step 1
Everyone keeps waking up only to immediately re-block. The CPU budget goes to context switches, not work.
The tell is a lock with waiting threads but no apparent owner. You caught the system mid-handover.
Why "convoy"?
Named after the “boxcar problem” [Ost04]. A train engine nudges one car forward a tiny bit, stops, then the next car moves, stops. Each thread inches forward and stops; nobody makes real progress.
Can't you avoid this easily by releasing the lock before sleeping?
The thread isn’t choosing to sleep; it’s being preempted by the scheduler partway through the critical section. User code has no way to say “I’m about to lose my quantum, let me release first.” The preemption happens whether you like it or not.
The real fix is to make the critical section short enough that it almost never spans a quantum boundary (the mitigations below), or use an unfair lock so the still-running thread can re-acquire instead of forcing a handover.
Isn't there a mechanism that releases locks on preemption?
No, not for ordinary user-space mutexes. Related mechanisms exist but don’t do what you’d hope:
- Kernel spinlocks (Linux): taking one disables preemption on that CPU. The holder can’t be preempted, which is only viable because kernel critical sections last microseconds
- Priority inheritance: boosts a low-priority holder when a high-priority waiter shows up, so it finishes faster. It doesn’t release the lock
- POSIX robust mutexes: auto-release on owner process death, for crash recovery, not preemption
- Rust’s `MutexGuard` drop: releases on normal scope exit. The stack (with the guard) is preserved across preemption precisely so lock ownership survives

So for user-space code: preemption keeps the lock held. That’s why all the mitigations target “don’t hold it across a quantum boundary”.
Preventing the convoy from forming
Shrink the critical section so the lock is never held long enough to span a quantum boundary. Four concrete mitigations [Loh05]:
- Sleep: non-offender threads call `sleep()` to yield. A band-aid, and thread switches are expensive
- Share: a reader-writer lock if reads dominate, or split the critical section
- Cache: shrink the critical section to just copy shared data out, then operate on the local copy
- Trylock: try the lock non-blockingly; on failure yield and retry a few times before falling back to a blocking acquire. Even a retry limit of 2 lets two threads recover without forming a convoy
Fair vs unfair locks
Whether the OS hands the lock directly to a waiter or leaves it unowned affects convoy behaviour:
- Windows XP (fair): on release, the lock is handed to waiter B. B becomes owner while still descheduled. If B has to wait behind many threads before running, the lock is held the whole time
- Windows Vista+ (unfair): on release, the lock is unowned. Whoever the scheduler picks next can grab it. Small starvation risk, mitigated by a priority boost on unblock
Unfair locks break convoys by letting the already-running thread re-acquire instead of forcing a context switch.
Related pathologies
- Thundering herd: a broadcast wakes many threads; most immediately re-block. Prefer waking one at a time
- Lost wakeup: if you only `notify_one()`, a thread may fail to pass the baton. Prefer `notify_all` on `Condvar`