Achieves up to 14.4x higher decoding throughput in long-context LLMs via a training-free framework that reuses sparse attention memory within semantic spans, refreshing it at span boundaries.
arXiv · March 13, 2026 · 2603.12038
Why it matters
By exploiting the stability of attention patterns within semantically coherent spans, it provides a practical way to drastically reduce the cost of long-context and agentic workloads without requiring model retraining or fine-tuning.
From the abstract
Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense slow steps that refresh the sparse attention support at semantic boundaries.
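To make the slow-fast split concrete, here is a minimal, illustrative Python sketch of the control flow: dense "slow" steps compute full attention over the whole history and cache the top-k support, while "fast" steps attend only over that cached support until a boundary (here, naively, sentence-final punctuation) triggers a refresh. Everything in it (the single-head toy cache, the TOP_K size, the is_semantic_boundary heuristic, all function names) is an assumption for illustration, not the paper's implementation.

```python
# Illustrative sketch only: one attention head over a toy KV cache.
# TOP_K, is_semantic_boundary, slow_step, and fast_step are assumed
# names, not the paper's API.
import numpy as np

TOP_K = 8   # size of the cached sparse attention support (assumed)
D = 64      # head dimension (assumed)

rng = np.random.default_rng(0)

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def is_semantic_boundary(token: str) -> bool:
    # Toy boundary detector: sentence-final punctuation. How SFI
    # detects span boundaries in practice may differ.
    return token in {".", "!", "?"}

def slow_step(q, K, V):
    # Dense step: full attention over the entire history, then refresh
    # the sparse support with the most-attended positions.
    weights = softmax(K @ q)
    support = np.argsort(weights)[-TOP_K:]
    return weights @ V, support

def fast_step(q, K, V, support):
    # Fast step: attend only over the cached support, so per-step cost
    # is O(TOP_K) rather than O(history length).
    weights = softmax(K[support] @ q)
    return weights @ V[support]

# Simulated decode loop over a pre-filled history of T cached rows.
T = 128
K_cache = rng.standard_normal((T, D))
V_cache = rng.standard_normal((T, D))
support = None

for tok in ["The", "model", "decodes", ".", "Then", "it", "continues"]:
    q = rng.standard_normal(D)
    if support is None or is_semantic_boundary(tok):
        out, support = slow_step(q, K_cache, V_cache)  # occasional dense step
    else:
        out = fast_step(q, K_cache, V_cache, support)  # frequent sparse step
    # In a real decoder, `out` feeds the next layer and the new token's
    # key/value rows are appended to the cache.
```

The sketch shows only the scheduling idea: because the dominant attention support is stable within a span, most steps can skip the full history, and the occasional dense step keeps the cached support from drifting.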