VQKV uses Vector Quantization to achieve over 80% KV cache compression with almost zero loss in model performance.
March 18, 2026
Original Paper
VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization
arXiv · 2603.16435
The Takeaway
The memory footprint of the KV cache is a primary bottleneck for deploying long-context LLMs. This training-free method enables up to 4.3x longer generation lengths on existing hardware, making it immediately useful for production environments.
From the abstract
The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches for KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to simultaneously achieve high compression ratios and high reconstruction fidelity. We propose VQKV, a novel, training-free method introducing vector quantization (VQ) to obtain highly compressed KV representations while
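To make the vector-quantization idea concrete, here is a minimal sketch of compressing a toy "KV cache" of key vectors with a shared codebook: fit a small codebook with plain k-means, store one codeword index per vector, and reconstruct by lookup. All names (`build_codebook`, `compress`, `decompress`) and parameter choices are illustrative assumptions, not VQKV's actual algorithm, which likely uses a more sophisticated quantizer to reach its reported compression ratios.

```python
import numpy as np

def build_codebook(vectors, k, iters=20, seed=0):
    """Fit a k-entry codebook with plain k-means.
    (Illustrative only; VQKV's codebook construction may differ.)"""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest codeword (Euclidean distance).
        d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = vectors[assign == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook

def compress(vectors, codebook):
    """Replace each d-dim vector with the index of its nearest codeword."""
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1).astype(np.uint8)  # one byte per vector

def decompress(codes, codebook):
    """Reconstruct approximate vectors by codebook lookup."""
    return codebook[codes]

# Toy "KV cache": 1024 key vectors with head dimension 64, stored in float16.
keys = np.random.default_rng(1).standard_normal((1024, 64)).astype(np.float16)
codebook = build_codebook(keys.astype(np.float32), k=256)
codes = compress(keys.astype(np.float32), codebook)
recon = decompress(codes, codebook)

orig_bytes = keys.nbytes                                   # 1024 * 64 * 2
comp_bytes = codes.nbytes + codebook.astype(np.float16).nbytes
print(f"compression ratio: {orig_bytes / comp_bytes:.1f}x")
```

With one codeword per whole vector, the index array is tiny but the shared codebook dominates storage; high-ratio schemes in the literature typically split each vector into sub-vectors and quantize each piece separately (product quantization) to push fidelity and ratio further.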