Nonlocal Block

The Nonlocal Block drops self-attention into a CNN so any spatio-temporal location can attend to any other.

Starting from a 3D feature map $f \in \mathbb{R}^{C \times T \times H \times W}$:

1. Use three $1 \times 1 \times 1$ convolutions to produce $Q$, $K$, and $V$.
2. Flatten the $(T, H, W)$ dimensions into $N = THW$ tokens.
3. Compute attention weights between every pair of locations (e.g. a softmax over $QK^\top$).
4. Apply those weights to $V$.
5. Project back to $C$ channels and add the residual $f$.
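The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes softmax attention weights, applies no $1/\sqrt{d}$ scaling, and skips batching and normalization. The names `nonlocal_block`, `w_q`, `w_k`, `w_v`, `w_out`, and `C_inner` are made up for this example. Each $1 \times 1 \times 1$ convolution is written as a matrix multiply over the channel axis, since it applies the same channel-mixing weights at every location.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(f, w_q, w_k, w_v, w_out):
    """f: (C, T, H, W); w_q, w_k, w_v: (C_inner, C); w_out: (C, C_inner)."""
    C, T, H, W = f.shape
    tokens = f.reshape(C, T * H * W)      # flatten (T, H, W) into N tokens: (C, N)

    # A 1x1x1 convolution is the same linear map over channels at every location.
    q = w_q @ tokens                      # (C_inner, N)
    k = w_k @ tokens                      # (C_inner, N)
    v = w_v @ tokens                      # (C_inner, N)

    scores = q.T @ k                      # (N, N): similarity of every pair of locations
    attn = softmax(scores, axis=-1)       # each row sums to 1

    y = v @ attn.T                        # (C_inner, N): y[:, i] = sum_j attn[i, j] * v[:, j]
    out = w_out @ y                       # project back to C channels: (C, N)
    return f + out.reshape(C, T, H, W), attn   # residual connection

# Hypothetical sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
C, C_inner, T, H, W = 8, 4, 2, 3, 3
f = rng.standard_normal((C, T, H, W))
w_q, w_k, w_v = (rng.standard_normal((C_inner, C)) * 0.1 for _ in range(3))
w_out = np.zeros((C, C_inner))            # zero output projection: block starts as identity

z, attn = nonlocal_block(f, w_q, w_k, w_v, w_out)
```

Note that the attention matrix is $N \times N$ with $N = THW$, so memory grows quadratically with the number of spatio-temporal locations. With `w_out` set to zero, as in the usage above, the block returns its input unchanged, which is one way to insert it into an existing network without disturbing its initial behavior.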

Structurally this is the same idea as self-attention, just applied over video or feature-map positions instead of text tokens, and using convolutions for the projections.
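To make the correspondence concrete, here is one way to write the block, assuming the softmax form of the attention weights. $X \in \mathbb{R}^{N \times C}$ with $N = THW$ is the flattened feature map, and $W_q, W_k, W_v, W_o$ are names introduced here for the four projection matrices:

$$
Q = X W_q, \quad K = X W_k, \quad V = X W_v, \qquad Z = X + \operatorname{softmax}\!\left(Q K^\top\right) V W_o
$$

A $1 \times 1 \times 1$ convolution applies the same channel-mixing matrix at every location, so it acts as a per-token linear layer on the rows of $X$. Under that reading, the only differences from text self-attention are what the tokens represent and how the result is reshaped back to $(C, T, H, W)$.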

Action Classification
SlowFast
VideoMAE

🛠️ Steven Gong
