AI & ML Scaling Insight

Spectral Edge Dynamics (SED) provides an early-warning signal for grokking, predicting generalization up to 1,700 steps before it occurs.

arXiv · March 18, 2026 · 2603.15678

Yongzhong Xu

The Takeaway

The paper identifies a universal three-phase pattern in the SVD of parameter updates, giving practitioners a tool to detect structure in training and to predict performance shifts in large-scale transformer training.
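As a rough illustration of the measurement, the sketch below stacks a window of flattened parameter updates into a matrix, takes its singular values, and places the spectral edge at the largest consecutive ratio sigma_k / sigma_{k+1}, the criterion given in the abstract. The function names, the window handling, and the synthetic data are assumptions made for this example; the paper's own preprocessing, window length, and parameter selection are not reproduced here.

```python
import numpy as np


def spectral_edge(updates):
    """Locate the spectral edge of one window of parameter updates.

    updates: array of shape (window, n_params), one flattened update per row.
    Returns (k, ratio, singular_values), where k is the 1-indexed number of
    directions before the largest consecutive singular value ratio.
    """
    s = np.linalg.svd(np.asarray(updates, dtype=float), compute_uv=False)
    if s.size < 2 or s[0] == 0.0:
        raise ValueError("need at least two updates and a nonzero window")
    # Drop numerically zero singular values so the ratio stays finite.
    s = s[s > 1e-12 * s[0]]
    if s.size < 2:
        return 1, float("inf"), s
    ratios = s[:-1] / s[1:]            # sigma_k / sigma_{k+1}
    k = int(np.argmax(ratios)) + 1     # convert to 1-indexed k
    return k, float(ratios[k - 1]), s


def rolling_spectral_edge(deltas, window):
    """Apply spectral_edge to every length-`window` slice of `deltas`."""
    deltas = np.asarray(deltas, dtype=float)
    return [
        spectral_edge(deltas[t - window:t])[:2]
        for t in range(window, len(deltas) + 1)
    ]


def make_synthetic_updates(steps=30, n_params=200, rank=3, noise=1e-3, seed=0):
    """Hypothetical test data: `rank` coherent directions plus small noise."""
    rng = np.random.default_rng(seed)
    directions, _ = np.linalg.qr(rng.standard_normal((n_params, rank)))
    coeffs = rng.standard_normal((steps, rank))
    return coeffs @ directions.T + noise * rng.standard_normal((steps, n_params))


if __name__ == "__main__":
    deltas = make_synthetic_updates()
    k, ratio, _ = spectral_edge(deltas[:10])
    print(f"spectral edge at k={k}, ratio={ratio:.1f}")
```

On the synthetic data the edge falls at the number of planted directions. For a real model, each row would be the flattened difference between consecutive checkpoints, and tracking k and the ratio over training is what would expose any phase structure.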

From the abstract

Despite hundreds of millions of parameters, transformer training trajectories evolve within only a few coherent directions. We introduce Spectral Edge Dynamics (SED) to measure this structure: rolling-window SVD of parameter updates reveals a sharp boundary, the spectral edge, between coherent optimization directions and stochastic noise, identified by the maximum consecutive singular value ratio $\sigma_k/\sigma_{k+1}$. Across a 51M-parameter TinyStories model (4 seeds) and GP