A geometric fix for Rotary Positional Embeddings (RoPE) allows Transformers to generalize to long inputs out of the box by preserving 'sink token' functionality.
March 20, 2026
Original Paper
Frayed RoPE and Long Inputs: A Geometric Perspective
arXiv · 2603.18017
The Takeaway
The authors identify that RoPE's failure at long input lengths is caused by the breakdown of key/query cluster separation, which destroys the model's ability to use sink tokens for attention avoidance. Their proposed modification, RoPE-ID, enables better length extrapolation in standard models with minimal overhead, changing how positional encoding is handled for long-context tasks.
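To make the "rotation" at issue concrete, here is a minimal NumPy sketch of the standard RoPE formulation (not the paper's RoPE-ID modification, whose details are not given in this summary). It shows how each channel pair of a query/key vector is rotated by an angle proportional to position, so that at positions far beyond the training length the angles leave the range the model ever saw during training; the `base` and dimension values below are illustrative assumptions.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply standard RoPE to a 1-D vector x at integer position pos.

    Pairs (x[2i], x[2i+1]) are rotated by angle pos * base**(-2i/d),
    the usual RoPE frequency schedule. Rotation preserves vector norm;
    only relative angles between positions carry information.
    """
    d = x.shape[0]
    freqs = base ** (-np.arange(0, d, 2) / d)   # per-pair rotation frequencies
    angles = pos * freqs                        # angle grows linearly with position
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

# Low-frequency pairs barely rotate within a training window of, say,
# 2048 tokens, but at 8x that length their angles exceed anything seen
# in training -- the "out of distribution" rotation the paper analyzes.
d, train_len = 64, 2048
freqs = 10000.0 ** (-np.arange(0, d, 2) / d)
print("slowest pair's max angle at train_len:", train_len * freqs[-1])
print("slowest pair's max angle at 8x length:", 8 * train_len * freqs[-1])
```

Because the slow channels never complete a full rotation within the training window, a model can rely on their near-fixed orientation (e.g., to keep sink-token keys separable from content keys); once those angles drift past the trained range, that geometric separation can break down, which is the failure mode the takeaway describes.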
From the abstract
Rotary Positional Embedding (RoPE) is a widely adopted technique for encoding position in language models which, while effective, causes performance breakdown when input length exceeds training length. Prior analyses correctly assert that long inputs cause channels to rotate "out of distribution," but it is not clear how this extra rotation relates to or causes pathological behavior. Through empirical and theoretical analysis we advance a unified geometric understanding of attention behavior with