Sparse Attention
Sparse attention is a technique used in transformer models (like GPT, BERT, etc.) to reduce the computational cost and memory usage of the self-attention mechanism, which otherwise scales quadratically with sequence length.
Instead of attending to all other tokens, each token attends to only a subset of tokens, selected by a predefined or learned sparsity pattern.
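As a concrete illustration, here is a minimal sketch of one common predefined pattern, a sliding window, where each token attends only to neighbors within a fixed distance. It assumes PyTorch; the window size, tensor shapes, and the `sliding_window_attention` helper are illustrative choices, not taken from the text, and for clarity the mask is applied to a dense score matrix rather than using a true sparse kernel.

```python
# Sliding-window sparse attention: a minimal sketch, assuming PyTorch.
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int = 2):
    """Each token attends only to tokens within `window` positions of itself."""
    seq_len, d = q.shape[-2], q.shape[-1]

    # Dense score matrix for clarity; a production kernel would avoid
    # materializing the full (seq, seq) matrix to get the memory savings.
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (..., seq, seq)

    # Predefined sparsity pattern: a band of width 2*window+1 around the diagonal.
    idx = torch.arange(seq_len)
    allowed = (idx[None, :] - idx[:, None]).abs() <= window  # (seq, seq) bool mask

    # Disallowed positions get -inf so softmax assigns them zero weight.
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: 8 tokens with 16-dimensional heads.
q = k = v = torch.randn(8, 16)
out = sliding_window_attention(q, k, v, window=2)
print(out.shape)  # torch.Size([8, 16])
```

Other patterns (strided, block-local, or a few global tokens combined with a local window) follow the same idea: the mask changes, but the mechanism of restricting which key positions each query can see stays the same.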