Unifies KV cache compression and sparse attention into a single 1-bit indexing structure, eliminating the need for external metadata or predictors.
March 17, 2026
Original Paper
Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys
arXiv · 2603.14224
The Takeaway
The method addresses the primary bottleneck in long-context LLM inference by folding sparsity prediction and compression into the same hardware-friendly bit-vector. This allows significantly larger batch sizes and context lengths with minimal memory and runtime overhead.
From the abstract
The KV cache in self-attention has emerged as a major bottleneck in long-context and large-batch inference for LLMs. Existing approaches often treat sparsity prediction and compression as separate modules, relying on auxiliary index structures to select relevant tokens, and on complex quantization schemes to reduce memory usage. This fragmented design introduces redundant overhead and limits scalability. In this paper, we propose a novel paradigm: treating the compressed key representation not merely […]
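The paper's exact indexing scheme is not spelled out in this excerpt, but the core idea, reusing a 1-bit key compression as the sparsity index itself, can be sketched. The following is a minimal illustration, not the authors' implementation: keys are stored as sign bits, and the same bits are used to rank tokens by approximate agreement with the query before running full attention on the selected subset. All function names (`binarize`, `predict_topk`, `sparse_attention`) are hypothetical.

```python
import numpy as np

def binarize(K):
    # 1-bit "compression": keep only the sign of each key dimension.
    # Shape (T, d), values in {0, 1}. This stand-in ignores the scale
    # information a real scheme would also store.
    return (K > 0).astype(np.int8)

def predict_topk(q, K_bits, k):
    # Reuse the stored bits as a sparsity index: sign agreement between
    # the query and each key is a cheap proxy for the dot-product ranking
    # (computable with XOR + popcount on real hardware).
    q_bits = (q > 0).astype(np.int8)
    agreement = (K_bits == q_bits).sum(axis=1)  # Hamming similarity
    return np.argsort(-agreement)[:k]

def sparse_attention(q, K, V, k):
    # Select the k most promising tokens from the 1-bit index,
    # then run exact softmax attention over that subset only.
    idx = predict_topk(q, binarize(K), k)
    scores = K[idx] @ q / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[idx]
```

Note the design point the digest emphasizes: no external metadata or learned predictor is needed, because the bits that save memory double as the retrieval structure.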