

May 09, 2025, 1 min read

Attention

Masked Attention

https://www.youtube.com/watch?v=bCz4OMemCcA

How is masked attention implemented? You use a lower-triangular mask: positions on or below the diagonal stay at 0, and everything above the diagonal is set to $-\infty$, so each token can only attend to itself and earlier tokens.

  • Andrej Karpathy shows how this is implemented

$$\text{MaskedAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$

  • You just add the Causal Attention Mask $M$ inside the softmax
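A minimal PyTorch sketch of this formula (the function name, tensor shapes, and on-the-fly mask construction are my own assumptions for illustration):

```python
import math
import torch
import torch.nn.functional as F

def masked_attention(Q, K, V):
    """Causal (masked) scaled dot-product attention.

    Q, K, V: tensors of shape (batch, T, d_k).
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, T, T)
    T = scores.size(-1)
    # Additive causal mask M: 0 on/below the diagonal, -inf above it
    M = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
    return F.softmax(scores + M, dim=-1) @ V
```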

The causal mask looks like this:

$$M = \begin{bmatrix} 0 & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 \end{bmatrix}$$
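To build this mask in code, one common trick (and, if I recall the Karpathy video correctly, the one used there) is to start from a lower-triangular matrix of ones and fill the zero entries with $-\infty$:

```python
import torch

T = 4  # sequence length, matching the 4x4 example above
tril = torch.tril(torch.ones(T, T))  # 1s on/below the diagonal, 0s above
M = torch.zeros(T, T).masked_fill(tril == 0, float("-inf"))
print(M)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```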
