AI Inference

LLM inference optimization: architecture, the KV cache, and FlashAttention
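A quick concrete picture of the KV cache helps here: during autoregressive decoding, each new token's query attends over the keys and values of every token generated so far, so caching those K/V tensors turns each step into a single append instead of a full recompute. Below is a minimal single-head NumPy sketch; projections are omitted and all shapes are illustrative assumptions, not any particular library's API.

import numpy as np

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    # q: (1, d) for the newest token; k, v: (t, d) for all tokens so far.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)            # (1, t)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ v                       # (1, d)

class KVCache:
    # Append-only store of past keys/values, so each decode step
    # attends over all previous tokens without recomputing them.
    def __init__(self, d):
        self.k = np.empty((0, d))
        self.v = np.empty((0, d))

    def append(self, k_new, v_new):
        self.k = np.vstack([self.k, k_new])
        self.v = np.vstack([self.v, v_new])
        return self.k, self.v

# Without the cache, step t would re-project K/V for all t previous tokens;
# with it, each step projects only the newest token and appends.
d = 64
cache = KVCache(d)
rng = np.random.default_rng(0)
for step in range(5):
    x = rng.standard_normal((1, d))  # hidden state of the newest token
    k, v = cache.append(x, x)        # Wk/Wv projections skipped for brevity
    out = attention(x, k, v)
print(out.shape)  # (1, 64)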

Fundamental Concepts

AI Inference Libraries

All of these libraries do quantization. What about pruning?
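Unlike quantization, which keeps every weight but stores it in fewer bits, pruning removes weights outright. As a minimal sketch of what that looks like in practice, here is unstructured magnitude pruning with PyTorch's torch.nn.utils.prune utilities; the layer size and the 30% sparsity level are arbitrary choices for illustration.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 512)

# Zero out the 30% of weights with the smallest absolute value (L1 magnitude).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Pruning is first applied through a mask; prune.remove() bakes the zeros
# into the weight tensor and drops the mask bookkeeping.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")  # ~30%

One caveat: unstructured zeros like these only turn into real speedups on kernels or hardware that exploit sparsity; structured pruning (dropping whole heads, channels, or layers) is what most inference stacks can actually accelerate, which may be part of why these libraries lead with quantization.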

Some great resources:

Related: Latency Numbers Every Programmer Should Know