AI & ML Efficiency Breakthrough

KVSculpt moves beyond simple eviction/merging to optimize unconstrained KV pairs in continuous space for extreme cache compression.

March 31, 2026

Original Paper

KVSculpt: KV Cache Compression as Distillation

Bo Jiang, Sian Jin

arXiv · 2603.27819

The Takeaway

By treating KV cache compression as a continuous optimization/distillation problem rather than a discrete selection problem, KVSculpt achieves a 4x greater reduction in KL divergence. This significantly improves the quality of long-context LLM inference at high compression ratios.
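To make the discrete-vs-continuous distinction concrete, here is a minimal numpy sketch. It is not KVSculpt's actual algorithm (the paper frames compression as distillation and measures KL divergence); for tractability this toy uses squared error on attention outputs over a set of calibration queries, and only optimizes the value vectors while keeping merged keys fixed. All names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

# Toy full cache: n KV pairs of dimension d, compressed to m "virtual" pairs.
n, m, d, nq = 64, 16, 32, 128
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
Q = rng.normal(size=(nq, d))           # calibration queries
O_full = attention(Q, K, V)            # teacher: output with the full cache

# Discrete baseline: average groups of consecutive pairs (a crude "merge").
Kc = K.reshape(m, n // m, d).mean(axis=1)
Vc_merge = V.reshape(m, n // m, d).mean(axis=1)
err_merge = np.linalg.norm(attention(Q, Kc, Vc_merge) - O_full)

# Continuous step: keep Kc fixed, but solve for the Vc that best reproduces
# the teacher's outputs in least squares -- the values are now free vectors
# in continuous space, no longer anchored to the original cache entries.
A = softmax(Q @ Kc.T / np.sqrt(d))     # student attention weights
Vc_opt, *_ = np.linalg.lstsq(A, O_full, rcond=None)
err_opt = np.linalg.norm(attention(Q, Kc, Vc_opt) - O_full)

print(err_merge, err_opt)              # the continuous fit has lower error
```

Because the merged values are one feasible point of the least-squares problem, the optimized error can never exceed the merge baseline's; the gap shows what freeing the cache entries from discrete anchors buys.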

From the abstract

KV cache compression is critical for efficient long-context LLM inference. Approaches that reduce the per-pair footprint -- quantization and low-rank decomposition -- are orthogonal to those that reduce the sequence length of the cache. Along the sequence-length dimension, existing methods range from pure eviction -- selecting which KV pairs to keep -- to merging, which combines similar pairs into fewer ones. Both remain anchored to the original cache entries. We propose KVSculpt, which moves to
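The two ends of the sequence-length spectrum named in the abstract, eviction and merging, can be illustrated with a short numpy sketch. These are generic illustrations, not the paper's specific baselines: the eviction scoring rule here is a heavy-hitter-style heuristic (in the spirit of methods like H2O), and the merging rule simply averages consecutive pairs.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d, m = 32, 16, 8                    # compress n pairs down to m
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
Q = rng.normal(size=(64, d))           # calibration queries

# Eviction: keep the m pairs that receive the most attention mass.
# The result is a strict subset of the original cache entries.
scores = softmax(Q @ K.T / np.sqrt(d)).sum(axis=0)
idx = np.sort(np.argsort(scores)[-m:])
K_evict, V_evict = K[idx], V[idx]

# Merging: collapse groups of consecutive pairs into their mean.
# The result is a convex combination of original entries.
K_merge = K.reshape(m, n // m, d).mean(axis=1)
V_merge = V.reshape(m, n // m, d).mean(axis=1)

print(K_evict.shape, K_merge.shape)    # both yield m = 8 pairs
```

Both outputs stay anchored to the original cache, either as a subset or as averages of it, which is exactly the constraint the abstract says KVSculpt relaxes.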