Sparse Attention
Sparse attention is a technique used in transformer models (like GPT, BERT, etc.) to reduce the computational cost and memory usage of the self-attention mechanism, which otherwise scales quadratically with sequence length.
Instead of attending to all other tokens, each token attends to only a subset of tokens, selected by a predefined or learned sparsity pattern.
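As a concrete illustration, here is a minimal sketch of one common predefined pattern, a sliding window, where each token attends only to neighbors within a fixed distance. It assumes PyTorch; the window size, tensor shapes, and the `sliding_window_attention` helper are illustrative choices, not taken from the text, and for clarity the mask is applied to a dense score matrix rather than using a true sparse kernel.

```python
# Sliding-window sparse attention: a minimal sketch, assuming PyTorch.
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int = 2):
    """Each token attends only to tokens within `window` positions of itself."""
    seq_len, d = q.shape[-2], q.shape[-1]

    # Dense score matrix for clarity; a production kernel would avoid
    # materializing the full (seq, seq) matrix to get the memory savings.
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (..., seq, seq)

    # Predefined sparsity pattern: a band of width 2*window+1 around the diagonal.
    idx = torch.arange(seq_len)
    allowed = (idx[None, :] - idx[:, None]).abs() <= window  # (seq, seq) bool mask

    # Disallowed positions get -inf so softmax assigns them zero weight.
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 8 tokens with 16-dimensional heads.
q = k = v = torch.randn(8, 16)
out = sliding_window_attention(q, k, v, window=2)
print(out.shape)  # torch.Size([8, 16])
```

Other patterns (strided, block-local, or a few global tokens combined with a local window) follow the same idea: the mask changes, but the mechanism of restricting which key positions each query can see stays the same.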