Achieves up to 14.4x higher decoding throughput in long-context LLMs via a training-free framework that reuses sparse attention memory within semantic spans, refreshing it at span boundaries.
arXiv · March 13, 2026 · 2603.12038
Why it matters
By exploiting the stability of attention patterns within semantically coherent spans, it provides a practical way to drastically reduce the cost of long-context and agentic workloads without requiring model retraining or fine-tuning.
From the abstract
Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense slow steps that refresh the sparse attention support at semantic boundaries.
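To make the slow-fast split concrete, here is a minimal, illustrative Python sketch of the control flow: dense "slow" steps compute full attention over the whole history and cache the top-k support, while "fast" steps attend only over that cached support until a boundary (here, naively, sentence-final punctuation) triggers a refresh. Everything in it (the single-head toy cache, the TOP_K size, the is_semantic_boundary heuristic, all function names) is an assumption for illustration, not the paper's implementation.

```python
# Illustrative sketch only: one attention head over a toy KV cache.
# TOP_K, is_semantic_boundary, slow_step, and fast_step are assumed
# names, not the paper's API.
import numpy as np

TOP_K = 8   # size of the cached sparse attention support (assumed)
D = 64      # head dimension (assumed)

rng = np.random.default_rng(0)

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def is_semantic_boundary(token: str) -> bool:
    # Toy boundary detector: sentence-final punctuation. How SFI
    # detects span boundaries in practice may differ.
    return token in {".", "!", "?"}

def slow_step(q, K, V):
    # Dense step: full attention over the entire history, then refresh
    # the sparse support with the most-attended positions.
    weights = softmax(K @ q)
    support = np.argsort(weights)[-TOP_K:]
    return weights @ V, support

def fast_step(q, K, V, support):
    # Fast step: attend only over the cached support, so per-step cost
    # is O(TOP_K) rather than O(history length).
    weights = softmax(K[support] @ q)
    return weights @ V[support]

# Simulated decode loop over a pre-filled history of T cached rows.
T = 128
K_cache = rng.standard_normal((T, D))
V_cache = rng.standard_normal((T, D))
support = None

for tok in ["The", "model", "decodes", ".", "Then", "it", "continues"]:
    q = rng.standard_normal(D)
    if support is None or is_semantic_boundary(tok):
        out, support = slow_step(q, K_cache, V_cache)  # occasional dense step
    else:
        out = fast_step(q, K_cache, V_cache, support)  # frequent sparse step
    # In a real decoder, `out` feeds the next layer and the new token's
    # key/value rows are appended to the cache.
```

The sketch shows only the scheduling idea: because the dominant attention support is stable within a span, most steps can skip the full history, and the occasional dense step keeps the cached support from drifting.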