KV Cache

This is a really good explanation from someone walking through LLaMA:

Notice that when a new token is appended, the only new entries of QK^T are the last row and the last column. Everything else was already computed at earlier steps.

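But actually, the causal mask removes the new right column before the softmax (those entries are earlier tokens attending to the new, future token), so we don't even need to compute it. Its one surviving entry, the diagonal, is already part of the bottom row. So we only need the bottom row, which is the new token's query multiplied by the keys of every token so far. That is why K needs to be cached.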
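
To make the row/column argument concrete, here is the product written in block form. The notation is an assumption introduced for this sketch: Q_t and K_t stack the queries and keys of the first t tokens as rows, and q, k are the new token's query and key as column vectors.

```latex
Q_{t+1} K_{t+1}^{\top}
= \begin{bmatrix} Q_t \\ q^{\top} \end{bmatrix}
  \begin{bmatrix} K_t^{\top} & k \end{bmatrix}
= \begin{bmatrix}
    Q_t K_t^{\top} & Q_t k \\
    q^{\top} K_t^{\top} & q^{\top} k
  \end{bmatrix}
```

The top-left block was computed at earlier steps. The top-right block Q_t k is entirely above the diagonal, so the causal mask discards it. What remains is the bottom row, [q^T K_t^T, q^T k] = q^T K_{t+1}^T, which needs only the new query and the cached keys plus the new key.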

This is actually the best explanation:

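The same logic applies to the value matrix: the new token's output is the softmax-weighted sum over the rows of V for every token so far, so V needs to be cached too.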
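
A minimal NumPy sketch of the idea, assuming a single attention head, no batching, and the usual 1/sqrt(d) scaling (none of these details come from the explanations above). The names `full_causal_attention`, `cached_decode`, `Wq`, `Wk`, and `Wv` are made up for illustration. It checks that decoding one token at a time with cached K and V gives the same outputs as recomputing full masked attention.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(X, Wq, Wk, Wv):
    # Recompute everything: the full QK^T, then mask the upper triangle.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V

def cached_decode(X, Wq, Wk, Wv):
    # Process one token at a time, keeping K and V rows from earlier steps.
    d = Wk.shape[1]
    K_cache = np.empty((0, d))
    V_cache = np.empty((0, Wv.shape[1]))
    outputs = []
    for x in X:
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        K_cache = np.vstack([K_cache, k])  # append the new key
        V_cache = np.vstack([V_cache, v])  # append the new value
        # Only the bottom row of QK^T: new query against all cached keys.
        weights = softmax(q @ K_cache.T / np.sqrt(d))
        outputs.append(weights @ V_cache)
    return np.stack(outputs)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 tokens, embedding size 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))

out_full = full_causal_attention(X, Wq, Wk, Wv)
out_cached = cached_decode(X, Wq, Wk, Wv)
assert np.allclose(out_full, out_cached)
```

Note that Q is never cached: each step uses only the newest query, while every earlier key and value is reused. Growing the cache with `np.vstack` is for readability only; a real implementation would more likely preallocate the buffers.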

🛠️ Steven Gong