Proposes spectral clipping to stabilize LLM training by addressing 'spectral spikes' in stochastic gradient noise that adaptive optimizers like AdamW fail to handle.
arXiv · March 17, 2026 · 2603.14315
The Takeaway
By enforcing spectral-norm constraints using efficient Newton-Schulz iterations instead of expensive SVD, this framework improves validation loss across multiple optimizers (AdamW, AdEMAMix, etc.). It addresses a specific scaling bottleneck where dominant singular values in gradients can destabilize large-scale training.
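The mechanism can be illustrated with a minimal NumPy sketch: spectral clipping maps each singular value σ to min(σ, s), and the identity min(σ, s) = (σ + s − |σ − s|)/2 lets this be computed from polar factors approximated by the cubic Newton-Schulz iteration, with no SVD. The function names, step counts, and normalization below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def msign(M, steps=30):
    """Approximate the orthogonal polar factor U V^T of M with the cubic
    Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X. Scaling by the
    Frobenius norm puts every singular value in (0, 1], inside the
    iteration's convergence region (0, sqrt(3))."""
    X = M / np.linalg.norm(M)  # Frobenius norm: SVD-free upper bound on sigma_max
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def spectral_clip(W, s, steps=30):
    """Clip every singular value of W to at most s, using only matrix
    multiplies: min(sigma, s) = (sigma + s - |sigma - s|) / 2 applied
    spectrally. Writing W = U S V^T and B = W - s U V^T = U (S - sI) V^T,
    the term U |S - sI| V^T equals msign(B W^T) @ B, because
    B W^T = U (S - sI) S U^T is symmetric with the signs of S - sI."""
    O = msign(W, steps)                 # ~ U V^T
    B = W - s * O                       # ~ U (S - sI) V^T
    abs_B = msign(B @ W.T, steps) @ B   # ~ U |S - sI| V^T
    return 0.5 * (W + s * O - abs_B)
```

In a training loop, `spectral_clip` would be applied to the optimizer update (or the weights) each step; the same `msign` routine is the workhorse of spectral optimizers like Muon, so the clipping adds only a couple of extra Newton-Schulz passes per matrix.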
From the abstract
While spectral-based optimizers like Muon operate directly on the spectrum of updates, standard adaptive methods such as AdamW do not account for the global spectral structure of weights and gradients, leaving them vulnerable to two empirical issues in large language model (LLM) training: (i) the optimizer updates can have large spectral norms, potentially destabilizing training and degrading generalization; (ii) stochastic gradient noise can exhibit sparse spectral spikes, with a few dominant singular values …