KV Cache

https://www.youtube.com/watch?v=80bIUggRJf4

This is a really good explanation by someone who was explaining LLaMA:

https://medium.com/@joaolages/kv-caching-explained-276520203249

Cursor team explaining KV cache: https://www.youtube.com/watch?v=PncVSWbxdWU

  • Each newly generated token adds one new row to the query, key and value matrices

Notice that when a token is added, the only new entries in QK^T are the new column and the new row; every other entry is unchanged from the previous step.

  • BUT ACTUALLY, the causal mask removes the new right column (everything above the diagonal) before softmax is applied, because earlier tokens cannot attend to the new one. So we don't even need to compute the right column, only the bottom row, which is the new Q row multiplied by K^T (so K needs to be cached)
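
A quick way to convince yourself of the point above is to check it numerically. This is only a sketch with NumPy: the sizes, the random data, and all variable names are made up for illustration and are not taken from the linked explanations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                  # head dimension (arbitrary)
Q_old = rng.normal(size=(3, d))        # queries of the 3 tokens already processed
K_old = rng.normal(size=(3, d))        # keys of those tokens (this is the cache)
q_new = rng.normal(size=(1, d))        # query row for the new token
k_new = rng.normal(size=(1, d))        # key row for the new token

# Naive approach: rebuild the whole 4x4 score matrix and apply the causal mask.
Q_full = np.vstack([Q_old, q_new])
K_full = np.vstack([K_old, k_new])
scores_full = Q_full @ K_full.T
causal = np.tril(np.ones((4, 4), dtype=bool))   # token i sees tokens j <= i
masked_full = np.where(causal, scores_full, -np.inf)

# Cached approach: append the new key, then compute only the bottom row.
K_cache = np.vstack([K_old, k_new])
new_row = q_new @ K_cache.T                     # shape (1, 4)

# The bottom row is identical either way.
row_matches = np.allclose(new_row[0], masked_full[-1])
# The rest of the new right column is fully masked, so computing it is wasted work.
right_col_masked = bool(np.all(np.isneginf(masked_full[:-1, -1])))
print(row_matches, right_col_masked)
```

Note that the old queries are never used in the cached approach, which is why only K (and V) get cached and Q does not.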

This is actually the best explanation:

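The same logic applies to the value matrix: the new output row is the softmaxed bottom row of scores multiplied by V, which needs the value rows of all earlier tokens, so V needs to be cached too.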

  • This is such a great visual explanation
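
Putting the key and value sides together, a single-head decode loop with a KV cache might look roughly like the sketch below. It assumes standard scaled dot-product attention (the division by sqrt(d) is not mentioned in these notes), and `decode_step` and `full_causal_attention` are hypothetical helper names, not functions from any library.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(Q, K, V):
    # Reference: recompute everything with a causal mask.
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    return softmax(scores) @ V

def decode_step(q_new, k_new, v_new, K_cache, V_cache):
    # Append the new key/value rows to the caches.
    K_cache = np.vstack([K_cache, k_new])
    V_cache = np.vstack([V_cache, v_new])
    d = q_new.shape[-1]
    # Only the bottom row of the score matrix is computed. No mask is needed,
    # because the cache holds no future tokens.
    weights = softmax(q_new @ K_cache.T / np.sqrt(d))
    return weights @ V_cache, K_cache, V_cache

rng = np.random.default_rng(1)
n, d = 5, 8                                   # arbitrary sizes
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
outputs = []
for t in range(n):                            # one token at a time
    out, K_cache, V_cache = decode_step(
        Q[t:t + 1], K[t:t + 1], V[t:t + 1], K_cache, V_cache
    )
    outputs.append(out)

incremental = np.vstack(outputs)
reference = full_causal_attention(Q, K, V)
print(np.allclose(incremental, reference))
```

Each step does work proportional to the current sequence length instead of its square, at the cost of keeping K and V for every past token in memory.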