Accelerates sparse attention by 75% by reusing lightning indexer decisions across layers, tackling the hidden bottleneck in production-grade LLMs.
arXiv · March 13, 2026 · 2603.12201
Why it matters
In models like DeepSeek, the sparse attention indexer itself often retains quadratic complexity; IndexCache exploits cross-layer redundancy to remove this overhead. This is a practical, training-aware optimization that directly reduces serving costs for long-context agentic workflows.
From the abstract
Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity…
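The mechanism in the excerpt can be sketched in a few lines: a cheap indexer scores every key per query (the $O(L^2)$ step), keeps the top-k token indices, and core attention then runs only over those k tokens per query ($O(Lk)$). Reusing one layer's indices in later layers, as IndexCache proposes, skips the repeated indexer pass. This is a minimal illustrative sketch, not the paper's implementation; all names (`lightning_index`, `sparse_attention`) and shapes are assumptions.

```python
# Hedged sketch of top-k sparse attention with cross-layer reuse of
# indexer decisions. Illustrative only; names and shapes are assumed.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lightning_index(q_idx, k_idx, k):
    """O(L^2) indexer: score every key for every query, keep top-k indices."""
    scores = q_idx @ k_idx.T                      # (L, L) score matrix
    return np.argsort(-scores, axis=-1)[:, :k]   # (L, k) selected token ids

def sparse_attention(q, keys, v, topk_idx):
    """O(L*k) core attention restricted to the pre-selected top-k tokens."""
    L, d = q.shape
    out = np.empty_like(q)
    for i in range(L):
        sel = topk_idx[i]                         # reuse cached indices
        w = softmax(q[i] @ keys[sel].T / np.sqrt(d))
        out[i] = w @ v[sel]
    return out

rng = np.random.default_rng(0)
L, d, k = 16, 8, 4
# Run the quadratic indexer once at an early layer...
topk_idx = lightning_index(rng.normal(size=(L, d)),
                           rng.normal(size=(L, d)), k)
# ...then reuse its decisions in subsequent layers, skipping the O(L^2) pass.
for _layer in range(3):
    q, keys, v = (rng.normal(size=(L, d)) for _ in range(3))
    out = sparse_attention(q, keys, v, topk_idx)
```

The cost model behind the excerpt falls out directly: scoring is L queries times L keys, while core attention is L queries times only the k cached tokens each.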