

May 09, 2025, 1 min read

Attention

Masked Attention

https://www.youtube.com/watch?v=bCz4OMemCcA

How is masked attention implemented? You use a lower-triangular mask: positions on or below the diagonal stay at 0, and everything above the diagonal is set to $-\infty$, so each token can only attend to itself and earlier tokens.

  • Andrej Karpathy shows how this is implemented

$$\text{MaskedAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$

  • You just add the Causal Attention Mask $M$ inside the softmax
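A minimal PyTorch sketch of this formula (the function name, tensor shapes, and on-the-fly mask construction are my own assumptions for illustration):

```python
import math
import torch
import torch.nn.functional as F

def masked_attention(Q, K, V):
    """Causal (masked) scaled dot-product attention.

    Q, K, V: tensors of shape (batch, T, d_k).
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, T, T)
    T = scores.size(-1)
    # Additive causal mask M: 0 on/below the diagonal, -inf above it
    M = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
    return F.softmax(scores + M, dim=-1) @ V
```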

The causal mask looks like this:

$$M = \begin{bmatrix} 0 & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 \end{bmatrix}$$
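To build this mask in code, one common trick (and, if I recall the Karpathy video correctly, the one used there) is to start from a lower-triangular matrix of ones and fill the zero entries with $-\infty$:

```python
import torch

T = 4  # sequence length, matching the 4x4 example above
tril = torch.tril(torch.ones(T, T))  # 1s on/below the diagonal, 0s above
M = torch.zeros(T, T).masked_fill(tril == 0, float("-inf"))
print(M)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```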
