Applying Rotary Positional Embeddings (RoPE) to only 10% of hidden dimensions is sufficient for full model convergence, enabling 10x memory savings in positional caches.
arXiv · March 13, 2026 · 2603.11611
Why it matters
It challenges the standard practice of applying RoPE to all dimensions. The finding that a small fraction of rotary dimensions provides enough signal for stability and performance allows for significant architectural simplification and memory optimization in long-context transformers.
From the abstract
Rotary Positional Embedding (RoPE) is a common choice in transformer architectures for encoding relative positional information. Although earlier work has examined omitting RoPE in specific layers, the effect of varying the fraction of hidden dimensions that receive rotary transformations remains largely unexplored. This design choice can yield substantial memory savings, which become especially significant at long context lengths. We find up to 10x memory savings over the standard RoPE cache.
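The idea can be sketched as follows: rotate only the first fraction of each head's dimensions and pass the rest through unchanged, so the cos/sin position cache only needs to cover the rotary slice. This is a minimal NumPy illustration, not the paper's implementation; the function name, interleaved pairing convention, and `rotary_frac` parameter are assumptions for the sketch.

```python
import numpy as np

def partial_rope(x, positions, rotary_frac=0.1, base=10000.0):
    """Apply rotary embeddings to only a fraction of the head dimensions.

    x: array of shape (seq_len, head_dim), e.g. queries or keys for one head.
    positions: integer positions of shape (seq_len,).
    Dimensions beyond the rotary slice pass through unrotated, so the
    positional cache covers rotary_dim entries instead of head_dim.
    """
    head_dim = x.shape[-1]
    rotary_dim = int(head_dim * rotary_frac)
    rotary_dim -= rotary_dim % 2  # rotations act on pairs of dimensions
    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]

    # Standard RoPE frequency schedule, but only over the rotary slice.
    inv_freq = 1.0 / base ** (np.arange(0, rotary_dim, 2) / rotary_dim)
    angles = np.outer(positions, inv_freq)        # (seq_len, rotary_dim/2)
    cos, sin = np.cos(angles), np.sin(angles)     # this is the smaller cache

    # Rotate each (even, odd) dimension pair by its position-dependent angle.
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = np.empty_like(x_rot)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return np.concatenate([rotated, x_pass], axis=-1)
```

With `rotary_frac=0.1` the cached cos/sin tables have width `head_dim // 10` rather than `head_dim`, which is where the roughly 10x cache reduction cited above comes from; the attention computation itself is unchanged.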