Accelerates sparse attention by 75% by reusing lightning indexer decisions across layers, tackling the hidden bottleneck in production-grade LLMs.
arXiv · March 13, 2026 · 2603.12201
Why it matters
In models like DeepSeek, the sparse attention indexer itself often retains quadratic complexity; IndexCache exploits cross-layer redundancy to remove this overhead. This is a practical, training-aware optimization that directly reduces serving costs for long-context agentic workflows.
From the abstract
Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity…
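The mechanism in the excerpt can be sketched in a few lines: a cheap indexer scores every key per query (the $O(L^2)$ step), keeps the top-k token indices, and core attention then runs only over those k tokens per query ($O(Lk)$). Reusing one layer's indices in later layers, as IndexCache proposes, skips the repeated indexer pass. This is a minimal illustrative sketch, not the paper's implementation; all names (`lightning_index`, `sparse_attention`) and shapes are assumptions.

```python
# Hedged sketch of top-k sparse attention with cross-layer reuse of
# indexer decisions. Illustrative only; names and shapes are assumed.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lightning_index(q_idx, k_idx, k):
    """O(L^2) indexer: score every key for every query, keep top-k indices."""
    scores = q_idx @ k_idx.T                      # (L, L) score matrix
    return np.argsort(-scores, axis=-1)[:, :k]   # (L, k) selected token ids

def sparse_attention(q, keys, v, topk_idx):
    """O(L*k) core attention restricted to the pre-selected top-k tokens."""
    L, d = q.shape
    out = np.empty_like(q)
    for i in range(L):
        sel = topk_idx[i]                         # reuse cached indices
        w = softmax(q[i] @ keys[sel].T / np.sqrt(d))
        out[i] = w @ v[sel]
    return out

rng = np.random.default_rng(0)
L, d, k = 16, 8, 4
# Run the quadratic indexer once at an early layer...
topk_idx = lightning_index(rng.normal(size=(L, d)),
                           rng.normal(size=(L, d)), k)
# ...then reuse its decisions in subsequent layers, skipping the O(L^2) pass.
for _layer in range(3):
    q, keys, v = (rng.normal(size=(L, d)) for _ in range(3))
    out = sparse_attention(q, keys, v, topk_idx)
```

The cost model behind the excerpt falls out directly: scoring is L queries times L keys, while core attention is L queries times only the k cached tokens each.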