Nonlocal Block
The Nonlocal Block drops self-attention into a CNN so any spatio-temporal location can attend to any other.
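Starting from a 3D feature map with channels C and spatio-temporal dimensions T × H × W:
- Use three convolutions to produce queries, keys, and values
- Flatten the T × H × W dimensions into tokens, one per location
- Compute attention weights across all locations
- Apply those weights to the values
- Project back to the input channel count and add the residual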
Structurally this is the same idea as self-attention, just applied over video or feature-map positions instead of text tokens, and using convolutions for the projections.
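The steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation. It assumes the three convolutions are 1×1×1, so each one reduces to a matrix multiply over the channel axis at every location. It also assumes the attention weights are a plain softmax over dot products with no scaling factor. The names `nonlocal_block`, `w_q`, `w_k`, `w_v`, `w_out` and the inner channel count are hypothetical, chosen only for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(x, w_q, w_k, w_v, w_out):
    """x: (C, T, H, W). w_q, w_k, w_v: (C_inner, C). w_out: (C, C_inner).

    Assumption: each weight matrix stands in for a 1x1x1 convolution.
    """
    c, t, h, w = x.shape
    tokens = x.reshape(c, t * h * w)       # flatten T, H, W into N tokens: (C, N)

    q = w_q @ tokens                       # queries, (C_inner, N)
    k = w_k @ tokens                       # keys,    (C_inner, N)
    v = w_v @ tokens                       # values,  (C_inner, N)

    scores = q.T @ k                       # (N, N): every location vs. every other
    attn = softmax(scores, axis=-1)        # each row sums to 1

    y = v @ attn.T                         # y[:, i] = sum_j attn[i, j] * v[:, j]
    out = (w_out @ y).reshape(c, t, h, w)  # project back to C channels
    return x + out, attn                   # residual connection

# Usage with random data (shapes are arbitrary, for illustration only).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 2, 4, 4))      # C=8, T=2, H=4, W=4
w_q, w_k, w_v = (0.1 * rng.standard_normal((4, 8)) for _ in range(3))
w_out = np.zeros((8, 4))                   # zero projection: block acts as identity
z, attn = nonlocal_block(x, w_q, w_k, w_v, w_out)
```

The attention matrix has one row and one column per location (here 2 × 4 × 4 = 32), so memory grows quadratically with the number of positions. With a zero output projection, the residual path passes the input through unchanged, which makes it easy to check that the block preserves the input shape.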
Related: