Reveals that linearized attention never converges to the NTK limit in practice, explaining its unique 'influence malleability' compared to standard networks.
arXiv · March 16, 2026 · 2603.13085
Why it matters
It challenges the conventional use of kernel frameworks to explain attention, showing that this non-convergence is the source of both attention's power and its specific vulnerability to training-time adversarial attacks.
From the abstract
Understanding the theoretical foundations of attention mechanisms remains challenging due to their complex, non-linear dynamics. This work reveals a fundamental trade-off in the learning dynamics of linearized attention. For a linearized attention mechanism with an exact correspondence to a data-dependent, Gram-induced kernel, both empirical and theoretical analyses within the Neural Tangent Kernel (NTK) framework show that linearized attention does not converge to its infinite-width NTK limit, …
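To make the abstract's setup concrete, here is a minimal, hypothetical sketch (not the paper's code) of the quantity under study: the empirical NTK of a softmax-free, linearized attention layer Attn(X) = (X Wq)(X Wk)ᵀ(X Wv). The kernel entry Θ(X1, X2) is the inner product of parameter gradients of a scalar readout; because these gradients depend on the data through the Gram matrix X Xᵀ, the kernel is data-dependent, which is the property the paper connects to non-convergence toward the fixed infinite-width NTK.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8  # tokens, feature dimension (illustrative sizes)

def attn_scalar(X, params):
    """Scalar readout of a linearized (softmax-free) attention layer."""
    Wq, Wk, Wv = params
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return np.sum((Q @ K.T) @ V)

def grad_params(X, params, eps=1e-5):
    """Finite-difference gradient of the scalar readout w.r.t. all weights."""
    flat = np.concatenate([p.ravel() for p in params])
    g = np.zeros_like(flat)
    base = attn_scalar(X, params)
    for i in range(flat.size):
        bumped = flat.copy()
        bumped[i] += eps
        # rebuild the parameter list from the bumped flat vector
        ps, off = [], 0
        for p in params:
            ps.append(bumped[off:off + p.size].reshape(p.shape))
            off += p.size
        g[i] = (attn_scalar(X, ps) - base) / eps
    return g

# NTK-style 1/sqrt(d) initialization of the three projection matrices
params = [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]
X1 = rng.normal(size=(n, d))
X2 = rng.normal(size=(n, d))

# Empirical NTK entry: Theta(X1, X2) = <grad_theta f(X1), grad_theta f(X2)>
theta = grad_params(X1, params) @ grad_params(X2, params)
print(theta)
```

In the infinite-width NTK regime this kernel would stay (approximately) frozen at its initialization value during training; the paper's claim is that for linearized attention the finite-width kernel keeps moving with the data instead.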