AI Inference
LLM inference optimization: architecture, KV cache, and FlashAttention
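
A minimal sketch of the KV cache idea, in plain NumPy with made-up shapes (single head, one query token per decode step, all names hypothetical): during autoregressive decoding, each step computes keys/values only for the new token and appends them to a cache, so earlier tokens' K/V are never recomputed.

```python
import numpy as np

def attention(q, K, V):
    # Single query against t cached keys/values: q (d,), K (t, d), V (t, d).
    scores = K @ q / np.sqrt(q.shape[-1])   # (t,)
    w = np.exp(scores - scores.max())       # stable softmax over cached positions
    w /= w.sum()
    return w @ V                            # (d,)

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = rng.standard_normal((3, d, d))  # toy projection matrices

K_cache = np.empty((0, d))  # grows by one row per decoded token
V_cache = np.empty((0, d))

x = rng.standard_normal(d)  # embedding of the current token
for step in range(5):
    # Only the NEW token's key/value are computed; earlier rows are reused.
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    out = attention(x @ Wq, K_cache, V_cache)
    x = out                 # stand-in for the next token's embedding
```

The point of the cache: per-step cost is one K/V projection plus attention over `t` cached rows, instead of reprojecting the whole prefix every step.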
Fundamental Concepts
AI Inference Libraries
All of these libraries support quantization. What about pruning?
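
On the quantization-vs-pruning distinction, a toy NumPy sketch (the 90% sparsity level and the naive int8 scheme are arbitrary assumptions, not any library's method): pruning zeroes out low-magnitude weights entirely, while quantization keeps every weight at reduced precision.

```python
import numpy as np

W = np.random.randn(256, 256)

# Unstructured magnitude pruning: zero the 90% of weights with the
# smallest absolute value (the sparsity level is an assumption).
sparsity = 0.9
threshold = np.quantile(np.abs(W), sparsity)
W_pruned = np.where(np.abs(W) < threshold, 0.0, W)

# Naive symmetric int8 quantization for comparison: every weight
# survives, just stored at lower precision.
scale = np.abs(W).max() / 127
W_int8 = np.round(W / scale).astype(np.int8)
W_dequant = W_int8.astype(np.float32) * scale

print(f"pruned zeros: {np.mean(W_pruned == 0):.2%}")
print(f"quantization max error: {np.abs(W - W_dequant).max():.4f}")
```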
Some great resources:
- https://fleetwood.dev/posts/domain-specific-architectures
- Found via https://x.com/fleetwood___/status/1898464628180742618
- which Bilal quoted: https://x.com/bilaltwovec/status/1898938339895493065
Found while looking into "Latency numbers every programmer should know"