AI & ML Scaling Insight

HyperP provides the first hyperparameter transfer laws for hypersphere optimization, ensuring stable scaling for models using the Muon optimizer.

March 31, 2026

Original Paper

Rethinking Language Model Scaling under Transferable Hypersphere Optimization

Liliang Ren, Yang Liu, Yelong Shen, Weizhu Chen

arXiv · 2603.28743

The Takeaway

As practitioners move toward matrix-based optimizers like Muon, HyperP provides the mathematical framework needed to transfer optimal learning rates across width, depth, and MoE granularity. It enables bounded, stable training and high compute efficiency at scales where standard AdamW often becomes unstable.
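
The transfer law itself lives in the paper; as a rough illustration of the workflow it enables, the sketch below tunes a learning rate on a small proxy width and reuses it at a larger width via a generic power-law rule. The rule shape, the exponent `alpha`, and the widths are placeholder assumptions, not HyperP's actual law.

```python
# Illustrative width-wise learning-rate transfer. The power-law form, alpha,
# and the widths are placeholder assumptions; HyperP's actual transfer law is
# derived in the paper and not reproduced here.

def transfer_lr(base_lr: float, base_width: int, target_width: int,
                alpha: float = 1.0) -> float:
    """Reuse a learning rate tuned at a small proxy width at a larger width,
    assuming a generic rule lr_target = lr_base * (base/target) ** alpha."""
    return base_lr * (base_width / target_width) ** alpha

# Tune once on a cheap proxy model, then scale up (hypothetical numbers).
proxy_lr = 3e-3
large_model_lr = transfer_lr(proxy_lr, base_width=256, target_width=4096)
print(large_model_lr)  # 3e-3 * (256 / 4096) = 1.875e-4 with alpha = 1
```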

From the abstract

Scaling laws for large language models depend critically on the optimizer and parameterization. Existing hyperparameter transfer laws are mainly developed for first-order optimizers, and they do not structurally prevent training instability at scale. Recent hypersphere optimization methods constrain weight matrices to a fixed-norm hypersphere, offering a promising alternative for more stable scaling. We introduce HyperP (Hypersphere Parameterization), the first framework for transferring optimal …
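
To make the "fixed-norm hypersphere" constraint concrete, here is a minimal PyTorch sketch that rescales every 2D weight matrix back onto a sphere of fixed Frobenius norm after an optimizer step. Plain SGD stands in for Muon, and projection-by-rescaling is an illustrative assumption, not the paper's exact parameterization.

```python
import torch
from torch import nn

def project_to_hypersphere(module: nn.Module, radius: float = 1.0) -> None:
    """Rescale each 2D weight matrix so that ||W||_F == radius.

    Illustrative only: a simple rescaling projection, with the radius chosen
    arbitrarily; HyperP's actual parameterization is defined in the paper.
    """
    with torch.no_grad():
        for p in module.parameters():
            if p.ndim == 2:
                p.mul_(radius / (p.norm() + 1e-12))

model = nn.Linear(512, 512, bias=False)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)  # stand-in for Muon
project_to_hypersphere(model)            # start on the fixed-norm sphere

x = torch.randn(8, 512)
loss = model(x).pow(2).mean()
loss.backward()
opt.step()
project_to_hypersphere(model)            # project back after the update
```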