AI & ML Breaks Assumption

Replacing the linear Query projection in Transformers with a nonlinear residual MLP significantly improves performance with minimal parameter growth.

arXiv · March 17, 2026 · 2603.13381

Marko Karbevski

The Takeaway

This breaks the long-standing design assumption that attention projections must be linear. By adding a small nonlinear bottleneck to the Query path, models outperform baselines at a cost of only 12.5% additional parameters, opening a new avenue for architectural optimization.

From the abstract

Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_Q$ may be set to the identity without noticeable performance deterioration. This is possible because attention depends on $X$ only through the products $XW_Q, XW_K, XW_V$, allowing basis transformations to be absorbed by adjacent layers and propagated through the network. We replace $W_Q \in \mathbb{R}^{d \times d}$ with a nonlinear residual of the form $Q(X) = X + f_\theta(X)$, where $f_\theta$ is a small nonlinear bottleneck MLP.
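The residual query $Q(X) = X + f_\theta(X)$ can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the bottleneck width $r$, the `tanh` nonlinearity, and the weight initialization here are all assumptions, chosen only to show the shape of the construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8  # model width and bottleneck width (hypothetical sizes)

# Parameters of the bottleneck MLP f_theta: d -> r -> d.
W1 = rng.normal(scale=0.02, size=(d, r))
W2 = rng.normal(scale=0.02, size=(r, d))

def f_theta(X):
    """Small nonlinear bottleneck map (tanh stands in for the paper's activation)."""
    return np.tanh(X @ W1) @ W2

def query(X):
    """Nonlinear residual query Q(X) = X + f_theta(X), replacing the linear X @ W_Q."""
    return X + f_theta(X)

X = rng.normal(size=(10, d))  # a sequence of 10 token embeddings
Q = query(X)
```

Note the parameter economy: a bottleneck of width $r$ costs $2dr$ parameters, versus the $d^2$ of a full $W_Q$, so small $r$ keeps the growth modest while introducing nonlinearity into the Query path.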