Spectral Edge Dynamics (SED) provides an early-warning signal for grokking, predicting generalization up to 1,700 steps before it occurs.
arXiv · March 18, 2026 · 2603.15678
The Takeaway
The paper identifies a universal three-phase pattern in the SVD of parameter updates, giving practitioners a tool to detect structure in training and anticipate performance shifts in large-scale transformer training.
From the abstract
Despite hundreds of millions of parameters, transformer training trajectories evolve within only a few coherent directions. We introduce Spectral Edge Dynamics (SED) to measure this structure: rolling-window SVD of parameter updates reveals a sharp boundary -- the spectral edge -- between coherent optimization directions and stochastic noise, identified by the maximum consecutive singular value ratio $\sigma_k/\sigma_{k+1}$. Across a 51M-parameter TinyStories model (4 seeds) and GP
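
The spectral-edge criterion named in the abstract, the index k that maximizes $\sigma_k/\sigma_{k+1}$ over a rolling window of parameter updates, can be sketched in NumPy as below. This is a minimal illustration, not the authors' implementation: the function names (`spectral_edge`, `rolling_spectral_edge`, `synthetic_updates`), the use of consecutive checkpoint differences as "updates", the window handling, and the `eps` cutoff are all assumptions made for the example. The paper may define or normalize these quantities differently.

```python
import numpy as np


def spectral_edge(updates, eps=1e-12):
    """Return (k, ratio, singular_values) for one window of updates.

    updates: array of shape (W, P), one flattened parameter update per row.
    k is the 1-based index that maximizes sigma_k / sigma_{k+1}.
    """
    sigma = np.linalg.svd(updates, compute_uv=False)  # sorted descending
    sigma = sigma[sigma > eps * sigma[0]]  # drop numerically zero values
    if sigma.size < 2:
        raise ValueError("need at least two nonzero singular values")
    ratios = sigma[:-1] / sigma[1:]
    k = int(np.argmax(ratios)) + 1
    return k, float(ratios[k - 1]), sigma


def rolling_spectral_edge(checkpoints, window):
    """Slide a window over consecutive checkpoint differences.

    checkpoints: array of shape (T, P) of flattened parameter snapshots.
    Returns a list of (end_step, k, ratio) tuples, one per window position.
    """
    updates = np.diff(checkpoints, axis=0)  # shape (T - 1, P)
    results = []
    for end in range(window, updates.shape[0] + 1):
        k, ratio, _ = spectral_edge(updates[end - window:end])
        results.append((end, k, ratio))
    return results


def synthetic_updates(steps=10, n_params=200, rank=3, noise=1e-3, seed=0):
    """Hypothetical test data: a few strong directions plus weak noise."""
    rng = np.random.default_rng(seed)
    u, _ = np.linalg.qr(rng.normal(size=(steps, rank)))
    v, _ = np.linalg.qr(rng.normal(size=(n_params, rank)))
    strengths = np.linspace(5.0, 3.0, rank)  # singular values of the signal
    signal = (u * strengths) @ v.T
    return signal + noise * rng.normal(size=(steps, n_params))


if __name__ == "__main__":
    k, ratio, sigma = spectral_edge(synthetic_updates())
    print(f"spectral edge at k={k}, ratio={ratio:.1f}")
```

On the synthetic data, the three planted directions have singular values near 5, 4, and 3 while the noise floor sits far below them, so the largest consecutive ratio falls between the third and fourth singular values. For a real model the (W, P) matrix is very wide; one common workaround, again an assumption rather than something stated in the source, is to take the eigenvalues of the small W × W Gram matrix `updates @ updates.T` instead of a full SVD.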