Nonlocal Block

The Nonlocal Block inserts a self-attention operation into a CNN so that every spatio-temporal location can attend to every other location.

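Starting from a 3D (time, height, width) feature map:

  1. Use three convolutions to produce query, key, and value maps
  2. Flatten the temporal and spatial dimensions into tokens, one per location
  3. Compute attention weights across all locations from the queries and keys
  4. Apply those weights to the values
  5. Project back to the input's number of channels and add the residual (the input feature map)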
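
The five steps can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: it assumes the three convolutions are pointwise (1x1x1), so each one reduces to a per-location linear map, and it assumes softmax-normalized dot-product weights. The names nonlocal_block, w_q, w_k, w_v, and w_out are hypothetical and chosen here for readability.

```python
import math

def matvec(weight, vec):
    """Apply an (out x in) matrix to one channel vector.
    A pointwise (1x1x1) convolution does exactly this at every location."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nonlocal_block(x, w_q, w_k, w_v, w_out):
    """x is a nested list shaped [T][H][W][C]; the result has the same shape.
    w_q, w_k, w_v are (C' x C) matrices and w_out is (C x C'); all four are
    assumed stand-ins for learned pointwise convolution weights."""
    n_t, n_h, n_w = len(x), len(x[0]), len(x[0][0])

    # Step 2 (done first for convenience): flatten T*H*W locations into tokens.
    tokens = [x[t][h][w] for t in range(n_t) for h in range(n_h) for w in range(n_w)]

    # Step 1: three pointwise projections give queries, keys, and values.
    q = [matvec(w_q, tok) for tok in tokens]
    k = [matvec(w_k, tok) for tok in tokens]
    v = [matvec(w_v, tok) for tok in tokens]

    out_tokens = []
    for i, tok in enumerate(tokens):
        # Step 3: attention weights of location i over all locations.
        scores = [sum(a * b for a, b in zip(q[i], k_j)) for k_j in k]
        weights = softmax(scores)
        # Step 4: weighted sum of the values.
        y = [sum(wt * v_j[c] for wt, v_j in zip(weights, v)) for c in range(len(v[0]))]
        # Step 5: project back to C channels and add the residual.
        z = matvec(w_out, y)
        out_tokens.append([z_c + x_c for z_c, x_c in zip(z, tok)])

    # Restore the [T][H][W][C] layout in the same order used for flattening.
    it = iter(out_tokens)
    return [[[next(it) for _ in range(n_w)] for _ in range(n_h)] for _ in range(n_t)]
```

One property worth noting in this sketch: if w_out is all zeros, the block returns its input unchanged, because only the residual path contributes. The attention matrix has one row and one column per location, so the cost grows quadratically with T*H*W.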

Structurally this is the same idea as self-attention, just applied over video or feature-map positions instead of text tokens, and using convolutions for the projections.
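
The correspondence can be written in matrix form. The block below is a sketch under assumptions: softmax-normalized dot-product weights, pointwise projections, and symbols (X, W_q, W_k, W_v, W_o, N) introduced here for illustration rather than taken from the text above.

```latex
% X: flattened feature map, one row per location, N = T \cdot H \cdot W rows and C columns.
\begin{aligned}
Q &= X W_q, \qquad K = X W_k, \qquad V = X W_v, \\
Y &= \operatorname{softmax}\!\left(Q K^{\top}\right) V
    && \text{(softmax taken over each row)}, \\
Z &= Y W_o + X
    && \text{(output projection plus residual)}.
\end{aligned}
```

Transformer-style attention typically also divides the scores by the square root of the key dimension before the softmax; that scaling is omitted in this sketch.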
