Derives closed-form power-law scaling rules for hyperparameters such as learning rate and batch size from modern optimization theory, rather than from expensive empirical sweeps.
arXiv · March 18, 2026 · 2603.15958
The Takeaway
Provides a principled framework for transferring hyperparameters across token budgets and batch sizes. This reduces the compute overhead of tuning large-scale runs by replacing empirical trial-and-error with theoretically grounded scaling rules.
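To make the idea concrete, here is a minimal sketch of what such a power-law transfer rule could look like in code. The functional form, exponent values, and function name below are illustrative assumptions for demonstration only, not the closed-form rules derived in the paper.

```python
# Hypothetical sketch: scale a learning rate tuned on a small proxy run to a
# larger batch size and token budget via an assumed power law
# lr ∝ (batch ratio)^a * (token ratio)^b. The exponents are placeholders,
# not the paper's derived values.

def transfer_lr(base_lr: float, base_batch: int, base_tokens: int,
                target_batch: int, target_tokens: int,
                batch_exp: float = 0.5, horizon_exp: float = -0.3) -> float:
    """Extrapolate a tuned base learning rate to a new (batch, tokens) setting."""
    return (base_lr
            * (target_batch / base_batch) ** batch_exp
            * (target_tokens / base_tokens) ** horizon_exp)

# Example: hyperparameters tuned on a small run, transferred to a larger one.
lr_large = transfer_lr(base_lr=3e-4, base_batch=256, base_tokens=10_000_000,
                       target_batch=4096, target_tokens=1_000_000_000)
print(f"transferred learning rate: {lr_large:.2e}")
```

The point of such a rule is that, once the exponents are fixed by theory, no additional hyperparameter sweep is needed at the target scale.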
From the abstract
Hyperparameter transfer has become an important component of modern large-scale training recipes. Existing methods, such as muP, primarily focus on transfer between model sizes, with transfer across batch sizes and training horizons often relying on empirical scaling rules informed by insights from timescale preservation, quadratic proxies, and continuous-time approximations. We study hyperparameter scaling laws for modern first-order optimizers through the lens of recent convergence bounds for