Applying Rotary Positional Embeddings (RoPE) to only 10% of hidden dimensions is sufficient for full model convergence, enabling 10x memory savings in positional caches.
arXiv · March 13, 2026 · 2603.11611
Why it matters
It challenges the standard practice of applying RoPE to all dimensions. The finding that a small fraction of rotary dimensions provides enough signal for stability and performance allows for significant architectural simplification and memory optimization in long-context transformers.
From the abstract
Rotary Positional Embedding (RoPE) is a common choice in transformer architectures for encoding relative positional information. Although earlier work has examined omitting RoPE in specific layers, the effect of varying the fraction of hidden dimensions that receive rotary transformations remains largely unexplored. This design choice can yield substantial memory savings, which become especially significant at long context lengths. We find up to 10x memory savings over the standard RoPE cache.
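The idea can be sketched as follows: rotate only the first fraction of each head's dimensions and pass the rest through unchanged, so the cos/sin position cache only needs to cover the rotary slice. This is a minimal NumPy illustration, not the paper's implementation; the function name, interleaved pairing convention, and `rotary_frac` parameter are assumptions for the sketch.

```python
import numpy as np

def partial_rope(x, positions, rotary_frac=0.1, base=10000.0):
    """Apply rotary embeddings to only a fraction of the head dimensions.

    x: array of shape (seq_len, head_dim), e.g. queries or keys for one head.
    positions: integer positions of shape (seq_len,).
    Dimensions beyond the rotary slice pass through unrotated, so the
    positional cache covers rotary_dim entries instead of head_dim.
    """
    head_dim = x.shape[-1]
    rotary_dim = int(head_dim * rotary_frac)
    rotary_dim -= rotary_dim % 2  # rotations act on pairs of dimensions
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]

    # Standard RoPE frequency schedule, but only over the rotary slice.
    inv_freq = 1.0 / base ** (np.arange(0, rotary_dim, 2) / rotary_dim)
    angles = np.outer(positions, inv_freq)        # (seq_len, rotary_dim/2)
    cos, sin = np.cos(angles), np.sin(angles)     # this is the smaller cache

    # Rotate each (even, odd) dimension pair by its position-dependent angle.
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = np.empty_like(x_rot)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return np.concatenate([rotated, x_pass], axis=-1)
```

With `rotary_frac=0.1` the cached cos/sin tables have width `head_dim // 10` rather than `head_dim`, which is where the roughly 10x cache reduction cited above comes from; the attention computation itself is unchanged.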