Attention Residuals replace fixed-weight residual connections with softmax attention over preceding layers to prevent hidden-state dilution in deep LLMs.
arXiv · March 17, 2026 · 2603.15031
The Takeaway
Modern LLMs suffer from uncontrolled hidden-state growth and diluted layer contributions as they scale deeper. This paper provides a drop-in replacement for standard residuals that allows layers to selectively aggregate information, resulting in more stable training and better performance in large-scale production models like Kimi.
From the abstract
Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights.
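To make the idea concrete, here is a minimal NumPy sketch of the mechanism the abstract describes: instead of summing all preceding layer outputs with unit weights, the current layer forms a query, scores every stored output, and aggregates them with a softmax. The projection names (`query_proj`, `key_proj`) and the exact scoring rule are assumptions for illustration, not the paper's precise formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn_residual(layer_outputs, query_proj, key_proj):
    """Aggregate preceding layer outputs with softmax attention
    instead of a fixed-unit-weight residual sum.

    layer_outputs: list of (d,) arrays, outputs of layers 0..l
    query_proj, key_proj: (d, d_k) matrices (learned in practice;
    hypothetical names for this sketch)
    """
    H = np.stack(layer_outputs)           # (l+1, d) all stored outputs
    q = layer_outputs[-1] @ query_proj    # query from the newest output
    K = H @ key_proj                      # keys from every stored output
    scores = K @ q / np.sqrt(K.shape[-1])
    weights = softmax(scores)             # input-dependent, sums to 1
    return weights @ H                    # selectively aggregated state

# Toy demo: 4 layer outputs of width 8.
rng = np.random.default_rng(0)
d = 8
outs = [rng.standard_normal(d) for _ in range(4)]
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))
h = attn_residual(outs, Wq, Wk)
```

Note the contrast with a standard residual stream, which would compute `sum(outs)` regardless of the input; here the weights adapt per token, and if all scores tie (e.g. a zero key projection) the rule falls back to a uniform average rather than an unbounded sum.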