AI & ML Nature Is Weird

AI models would rather have a blurry view of your whole conversation than a perfect view of only half of it.

April 14, 2026

Original Paper

Quantization Dominates Rank Reduction for KV-Cache Compression

Samuel Salfati

arXiv · 2604.11501

The Takeaway

The paper shows that quantization (reducing numerical precision while keeping every dimension) consistently beats rank reduction (discarding dimensions outright) for KV-cache compression at matched storage budgets. The intuition: removing even one dimension can flip which token a query attends to, whereas quantization noise stays bounded by the quantization step size.
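A toy sketch of that intuition (illustrative only, not the paper's code): for a cache of random key vectors, per-entry quantization error is provably bounded by half the quantization step, while the error from discarding a single principal direction has no such bound.

```python
# Toy comparison of how rank reduction vs. quantization perturb cached keys.
# Dimensions and data are arbitrary assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 8                        # head dim, number of cached keys
K = rng.standard_normal((n, d))    # pretend KV-cache key matrix

# Rank reduction: reconstruct K with its top singular direction removed.
U, S, Vt = np.linalg.svd(K, full_matrices=False)
K_lowrank = (U[:, 1:] * S[1:]) @ Vt[1:]

# Quantization: keep all d dimensions at 4-bit precision (per-tensor scale).
scale = np.abs(K).max() / 7
K_int4 = np.clip(np.round(K / scale), -8, 7) * scale

err_rank = np.abs(K - K_lowrank).max()   # unbounded in general
err_quant = np.abs(K - K_int4).max()     # provably <= scale / 2
print(f"max error, rank reduction: {err_rank:.3f}")
print(f"max error, 4-bit quant:    {err_quant:.3f} (bound: {scale/2:.3f})")
```

The quantization bound holds by construction: rounding to a grid with step `scale` moves each entry by at most `scale / 2`. The low-rank error, by contrast, equals the discarded singular value's contribution and can dominate individual dot products.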

From the abstract

We compare two strategies for compressing the KV cache in transformer inference: rank reduction (discard dimensions) and quantization (keep all dimensions, reduce precision). At matched storage budgets across five models (124M-14B, MHA and GQA), we find that quantization consistently outperforms rank reduction by 4-364 PPL depending on model and compression level. The gap persists even when rank reduction is combined with quantization in hybrid baselines, and it grows with GQA aggressiveness. […]
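To make "matched storage budgets" concrete, here is the back-of-the-envelope accounting (the head dimension of 128 is an assumed example, not taken from the paper): halving the dimensions at fp16 and keeping all dimensions at int8 both land on the same byte count.

```python
# Storage cost of one cached key vector per token per head.
# d = 128 is an illustrative assumption, not a figure from the paper.
def bytes_per_vector(dims: int, bits: int) -> float:
    return dims * bits / 8

full   = bytes_per_vector(128, 16)  # fp16, all dims
ranked = bytes_per_vector(64, 16)   # rank reduction: half the dims at fp16
quant  = bytes_per_vector(128, 8)   # quantization: all dims at int8

print(full, ranked, quant)  # 256.0 128.0 128.0 -> both strategies hit 2x compression
```

At this matched 2x budget the paper's comparison is apples-to-apples: the only question is which kind of error, fewer dimensions or coarser values, hurts perplexity more.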