Derives closed-form power-law scaling rules for hyperparameters such as learning rate and batch size from modern optimization theory, rather than from expensive empirical sweeps.
arXiv · March 18, 2026 · 2603.15958
The Takeaway
Provides a principled framework for transferring hyperparameters across token budgets and batch sizes. This reduces the compute overhead of tuning large-scale runs by replacing empirical trial-and-error with theoretically grounded scaling rules.
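To make the idea concrete, here is a minimal sketch of what such a power-law transfer rule could look like in code. The functional form, exponent values, and function name below are illustrative assumptions for demonstration only, not the closed-form rules derived in the paper.

```python
# Hypothetical sketch: scale a learning rate tuned on a small proxy run to a
# larger batch size and token budget via an assumed power law
# lr ∝ (batch ratio)^a * (token ratio)^b. The exponents are placeholders,
# not the paper's derived values.

def transfer_lr(base_lr: float, base_batch: int, base_tokens: int,
                target_batch: int, target_tokens: int,
                batch_exp: float = 0.5, horizon_exp: float = -0.3) -> float:
    """Extrapolate a tuned base learning rate to a new (batch, tokens) setting."""
    return (base_lr
            * (target_batch / base_batch) ** batch_exp
            * (target_tokens / base_tokens) ** horizon_exp)

# Example: hyperparameters tuned on a small run, transferred to a larger one.
lr_large = transfer_lr(base_lr=3e-4, base_batch=256, base_tokens=10_000_000,
                       target_batch=4096, target_tokens=1_000_000_000)
print(f"transferred learning rate: {lr_large:.2e}")
```

The point of such a rule is that, once the exponents are fixed by theory, no additional hyperparameter sweep is needed at the target scale.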
From the abstract
Hyperparameter transfer has become an important component of modern large-scale training recipes. Existing methods, such as muP, primarily focus on transfer between model sizes, with transfer across batch sizes and training horizons often relying on empirical scaling rules informed by insights from timescale preservation, quadratic proxies, and continuous-time approximations. We study hyperparameter scaling laws for modern first-order optimizers through the lens of recent convergence bounds for