LLM Inference Optimization Part 2 — KV Cache Optimization
KV Cache quantization (int8/int4), PCA compression (KVTC), and PagedAttention (vLLM). Hands-on memory reduction code and scenario-based configuration guide.

In Part 1, we covered the structure of Attention and how KV Cache works. In this part, we look at practical techniques for optimizing the KV Cache itself, with code.
Even when model weights are quantized, the KV Cache is usually left in fp16. Because its size grows linearly with context length and batch size, at long contexts it commonly consumes more than half of total VRAM. We cover three approaches to this problem: KV Cache quantization (int8/int4), PCA compression (KVTC), and PagedAttention (vLLM).
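To make the memory pressure concrete, here is a rough sizing sketch. The helper name `kv_cache_bytes` and the model shape (32 layers, 32 KV heads, head dimension 128, roughly a 7B-class model without grouped-query attention) are assumptions chosen for illustration, not figures from this series.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    The leading factor 2 accounts for storing both keys and values
    for every layer, KV head, and token. bytes_per_elem=2 means fp16.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


GIB = 1024 ** 3

# Hypothetical model shape (assumed for illustration only).
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 32, 32, 128

for seq_len in (4096, 32768, 131072):
    size = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, seq_len)
    print(f"{seq_len:>7} tokens: {size / GIB:.1f} GiB in fp16")
```

Under these assumptions each token costs 0.5 MiB of cache, so a single 32K-token sequence needs about 16 GiB. For comparison, 7 billion weights stored at 4 bits take roughly 3.5 GB, which is why the cache can end up dominating VRAM once the weights are quantized.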
1. KV Cache Quantization
How It Works
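As a minimal sketch of the general idea, assuming symmetric per-token absmax scaling (one common scheme, not necessarily the one a given inference engine uses): each key or value vector is stored as int8 values plus one floating-point scale, and is multiplied back by that scale when attention reads it. The function names and toy values below are illustrative and do not come from any library.

```python
def quantize_int8(vec):
    """Symmetric absmax quantization of one vector.

    Returns a list of ints in [-127, 127] and the float scale
    needed to map them back to approximate original values.
    """
    absmax = max(abs(x) for x in vec)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in vec]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate floats from int8 values and their scale."""
    return [v * scale for v in q]


# Toy values standing in for one token's key vector (assumed data).
key = [0.12, -1.5, 0.75, 3.0]

q, scale = quantize_int8(key)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(key, restored))

print("int8 values:", q)
print("scale:", scale)
print("max abs error:", max_err)
```

Storing one byte per element instead of two halves the cache footprint, at the cost of one extra scale per vector and a rounding error of at most half a quantization step per element.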