MAC-Attention achieves 14x attention-phase speedups and reduces KV cache accesses by 99% for long-context LLMs by reusing computation from semantically similar queries.
April 2, 2026
Original Paper
MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation
arXiv · 2604.00235
The Takeaway
Long-context inference is currently bottlenecked by the KV-cache memory wall. This method drastically reduces the I/O required for decoding while preserving full fidelity and full access to the cache, enabling 128K+ context lengths on much more modest hardware without the quality degradation of eviction-based methods.
From the abstract
Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or selection/eviction, which restricts what remains accessible, and both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that
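The core reuse idea can be sketched in a few lines. This is a toy illustration, not the paper's algorithm: it implements only the "match" step (the abstract cuts off before describing amend/complete), using cosine similarity between the current query and recently seen queries to decide whether a cached attention output can be reused instead of re-reading the KV cache. The class name, threshold, and capacity are hypothetical choices for the sketch.

```python
import numpy as np

def full_attention(q, K, V):
    # Standard scaled dot-product attention for a single query vector.
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

class MatchReuseCache:
    """Toy match-and-reuse cache (illustrative sketch, not MAC-Attention itself)."""

    def __init__(self, threshold=0.98, capacity=32):
        self.threshold = threshold  # min cosine similarity to count as a match
        self.capacity = capacity    # how many recent (query, output) pairs to keep
        self.queries = []           # unit-normalized recent queries
        self.outputs = []           # their attention outputs

    def attend(self, q, K, V):
        """Return (output, reused): reuse a cached output if a recent query matches."""
        qn = q / np.linalg.norm(q)
        if self.queries:
            sims = np.stack(self.queries) @ qn
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                # Match: skip the full KV-cache read and reuse the stored output.
                return self.outputs[best], True
        # No match: fall back to full attention and remember the result.
        out = full_attention(q, K, V)
        self.queries.append(qn)
        self.outputs.append(out)
        if len(self.queries) > self.capacity:
            self.queries.pop(0)
            self.outputs.pop(0)
        return out, False
```

A near-duplicate query then hits the cache and skips the KV read entirely; the real method presumably amends such reused results rather than returning them verbatim, which is where the fidelity preservation would come from.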