AI Inference
LLM inference optimization: architecture, KV cache, and FlashAttention
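
A minimal sketch of the KV cache idea, in plain NumPy with made-up shapes (single head, one query token per decode step, all names hypothetical): during autoregressive decoding, each step computes keys/values only for the new token and appends them to a cache, so earlier tokens' K/V are never recomputed.

```python
import numpy as np

def attention(q, K, V):
    # Single query against t cached keys/values: q (d,), K (t, d), V (t, d).
    scores = K @ q / np.sqrt(q.shape[-1])   # (t,)
    w = np.exp(scores - scores.max())       # stable softmax over cached positions
    w /= w.sum()
    return w @ V                            # (d,)

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = rng.standard_normal((3, d, d))  # toy projection matrices

K_cache = np.empty((0, d))  # grows by one row per decoded token
V_cache = np.empty((0, d))

x = rng.standard_normal(d)  # embedding of the current token
for step in range(5):
    # Only the NEW token's key/value are computed; earlier rows are reused.
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    out = attention(x @ Wq, K_cache, V_cache)
    x = out                 # stand-in for the next token's embedding
```

The point of the cache: per-step cost is one K/V projection plus attention over `t` cached rows, instead of reprojecting the whole prefix every step.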
Fundamental Concepts
AI Inference Libraries
All of these libraries support quantization. What about pruning?
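
On the quantization-vs-pruning distinction, a toy NumPy sketch (the 90% sparsity level and the naive int8 scheme are arbitrary assumptions, not any library's method): pruning zeroes out low-magnitude weights entirely, while quantization keeps every weight at reduced precision.

```python
import numpy as np

W = np.random.randn(256, 256)

# Unstructured magnitude pruning: zero the 90% of weights with the
# smallest absolute value (the sparsity level is an assumption).
sparsity = 0.9
threshold = np.quantile(np.abs(W), sparsity)
W_pruned = np.where(np.abs(W) < threshold, 0.0, W)

# Naive symmetric int8 quantization for comparison: every weight
# survives, just stored at lower precision.
scale = np.abs(W).max() / 127
W_int8 = np.round(W / scale).astype(np.int8)
W_dequant = W_int8.astype(np.float32) * scale

print(f"pruned zeros: {np.mean(W_pruned == 0):.2%}")
print(f"quantization max error: {np.abs(W - W_dequant).max():.4f}")
```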
Some great resources:
- https://fleetwood.dev/posts/domain-specific-architectures
- Found via https://x.com/fleetwood___/status/1898464628180742618
- which Bilal quoted: https://x.com/bilaltwovec/status/1898938339895493065
Found while looking into "Latency numbers every programmer should know"