MAC-Attention achieves 14x attention-phase speedups and reduces KV cache accesses by 99% for long-context LLMs by reusing computation from semantically similar queries.
April 2, 2026
Original Paper
MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation
arXiv · 2604.00235
The Takeaway
Long-context inference is currently bottlenecked by the KV-cache memory wall. This method drastically reduces the I/O required for decoding while preserving full fidelity and full access to the cache, enabling 128K+ context lengths on much more modest hardware without the quality degradation of eviction-based methods.
From the abstract
Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or selection/eviction, which restricts what remains accessible, and both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that
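The core reuse idea can be sketched in a few lines. This is a toy illustration, not the paper's algorithm: it implements only the "match" step (the abstract cuts off before describing amend/complete), using cosine similarity between the current query and recently seen queries to decide whether a cached attention output can be reused instead of re-reading the KV cache. The class name, threshold, and capacity are hypothetical choices for the sketch.

```python
import numpy as np

def full_attention(q, K, V):
    # Standard scaled dot-product attention for a single query vector.
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

class MatchReuseCache:
    """Toy match-and-reuse cache (illustrative sketch, not MAC-Attention itself)."""

    def __init__(self, threshold=0.98, capacity=32):
        self.threshold = threshold  # min cosine similarity to count as a match
        self.capacity = capacity    # how many recent (query, output) pairs to keep
        self.queries = []           # unit-normalized recent queries
        self.outputs = []           # their attention outputs

    def attend(self, q, K, V):
        """Return (output, reused): reuse a cached output if a recent query matches."""
        qn = q / np.linalg.norm(q)
        if self.queries:
            sims = np.stack(self.queries) @ qn
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                # Match: skip the full KV-cache read and reuse the stored output.
                return self.outputs[best], True
        # No match: fall back to full attention and remember the result.
        out = full_attention(q, K, V)
        self.queries.append(qn)
        self.outputs.append(out)
        if len(self.queries) > self.capacity:
            self.queries.pop(0)
            self.outputs.pop(0)
        return out, False
```

A near-duplicate query then hits the cache and skips the KV read entirely; the real method presumably amends such reused results rather than returning them verbatim, which is where the fidelity preservation would come from.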