Softmax normalization mathematically mandates the creation of attention sinks to serve as 'null states' when models need to ignore input.
arXiv · March 13, 2026 · 2603.11487
Why it matters
This paper provides the first formal proof that attention sinks (the concentration of attention on fixed tokens) are a structural necessity of softmax self-attention. It demonstrates that moving to non-normalized ReLU attention eliminates sinks, offering a clear path for designing more efficient long-context architectures without artificial 'anchor' tokens.
From the abstract
Transformers often display an attention sink: probability mass concentrates on a fixed, content-agnostic position. We prove that computing a simple trigger-conditional behavior necessarily induces a sink in softmax self-attention models. Our results formalize a familiar intuition: normalization over a probability simplex must force attention to collapse onto a stable anchor to realize a default state (e.g., when the model needs to ignore the input). We instantiate this with a concrete task: when …
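The mechanism the abstract describes can be illustrated with a small numerical sketch (an assumed example, not code from the paper): softmax weights must sum to 1, so even uniformly low scores cannot express "attend to nothing", whereas un-normalized ReLU attention can.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Scores for 4 tokens the model "wants" to ignore.
scores = np.array([-10.0, -10.0, -10.0, -10.0])

# Softmax: normalization over the probability simplex forces the
# weights to sum to 1, so attention mass must land somewhere.
w_softmax = softmax(scores)
print(w_softmax.sum())   # exactly 1.0: "ignore everything" is inexpressible

# Workaround: a content-agnostic anchor (sink) position with a fixed
# score absorbs the mass instead of the real tokens.
scores_with_sink = np.concatenate(([0.0], scores))
w_sink = softmax(scores_with_sink)
print(w_sink[0])         # ~0.9998: mass collapses onto the sink

# Un-normalized ReLU attention: weights need not sum to 1, so negative
# scores yield zero total attention, a genuine null state with no sink.
w_relu = np.maximum(scores, 0.0)
print(w_relu.sum())      # 0.0
```

The sink row shows why a fixed, content-agnostic anchor emerges under softmax, and the ReLU row shows why dropping the simplex constraint removes the need for one.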