AI & ML Scaling Insight

Provides the first sharp theoretical characterization of why spectral optimizers like Muon drastically outperform SGD in storage capacity and scaling for language models.

March 30, 2026

Original Paper

Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

Juno Kim, Eshaan Nichani, Denny Wu, Alberto Bietti, Jason D. Lee

arXiv · 2603.26554

The Takeaway

Muon has become the default optimizer for many high-end diffusion and language models; this paper explains why its performance saturates only at larger critical batch sizes and why it recovers stored associations faster than SGD, allowing practitioners to predict its gains across model scales.
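To make "spectral" concrete, here is a minimal sketch, not taken from the paper, contrasting a plain SGD step with a Muon-style update that orthogonalizes the gradient matrix. Muon itself approximates this orthogonalization with Newton-Schulz iterations and adds momentum; the exact SVD below, and the function names, are illustrative simplifications.

```python
import numpy as np

def sgd_step(W, grad, lr=0.1):
    # Plain SGD: move along the raw gradient, so directions with large
    # singular values dominate the update.
    return W - lr * grad

def muon_style_step(W, grad, lr=0.1):
    # Spectral update: keep the gradient's singular vectors but set every
    # singular value to 1, so all directions in its span move at equal speed.
    # (Muon approximates this orthogonalization with Newton-Schulz iterations
    # and adds momentum; exact SVD is used here only for clarity.)
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return W - lr * (U @ Vt)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((4, 6))
    g = rng.standard_normal((4, 6))
    # The spectral step moves with all singular values equal to lr (0.1 here).
    print(np.linalg.svd(W - muon_style_step(W, g))[1])
```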

From the abstract

Spectral optimizers such as Muon have recently shown strong empirical performance in large-scale language model training, but the source and extent of their advantage remain poorly understood. We study this question through the linear associative memory problem, a tractable model for factual recall in transformer-based models. In particular, we go beyond orthogonal embeddings and consider Gaussian inputs and outputs, which allows the number of stored associations to greatly exceed the embedding dimension.
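As a toy instance of the linear associative memory problem described above, the sketch below stores Gaussian input-output pairs in a single weight matrix and scores recall by nearest-neighbor retrieval. The sizes, training loop, and scoring rule are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 64, 128  # embedding dimension and number of associations (N > d)

# Gaussian inputs and outputs, echoing the paper's setting
# (sizes and training details here are illustrative, not the paper's).
X = rng.standard_normal((N, d))
Y = rng.standard_normal((N, d))

# Fit a linear map W so that x_i @ W ~= y_i via full-batch gradient
# descent on the mean squared recall error.
W = np.zeros((d, d))
lr = 0.1
for _ in range(1000):
    grad = X.T @ (X @ W - Y) / N  # gradient of 0.5 * mean ||x W - y||^2
    W -= lr * grad

# Score recall: an association counts as stored if x_i @ W is more
# similar to its own y_i than to any other stored output.
sims = (X @ W) @ Y.T
accuracy = np.mean(sims.argmax(axis=1) == np.arange(N))
print(f"top-1 recall with N={N} > d={d}: {accuracy:.1%}")
```

With Gaussian rather than orthogonal embeddings, retrieval in this weaker nearest-neighbor sense can succeed even though N exceeds d, which is why the paper's setting permits capacity beyond the embedding dimension.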