Grokking is driven by a norm-driven representational phase transition with a predictable scaling law.
arXiv · March 17, 2026 · 2603.13331
The Takeaway
The paper provides the first quantitative theory of the delay between memorization and generalization, showing that it scales inversely with both weight decay and learning rate. This turns grokking from a mysterious phenomenon into a controllable aspect of training dynamics.
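A minimal sketch of what that inverse scaling implies in practice. The constant `C` and the helper `predicted_grokking_delay` are hypothetical illustrations, not the paper's actual formula; the only assumption taken from the source is that the delay scales as 1/(weight decay × learning rate).

```python
def predicted_grokking_delay(weight_decay: float, lr: float, C: float = 1.0) -> float:
    """Hypothetical delay (in training steps) between memorization and
    generalization, under the claimed inverse scaling in weight decay and lr.
    C is an unspecified problem-dependent constant."""
    return C / (weight_decay * lr)

# Doubling weight decay (or learning rate) should halve the predicted delay.
d1 = predicted_grokking_delay(weight_decay=1e-2, lr=1e-3)
d2 = predicted_grokking_delay(weight_decay=2e-2, lr=1e-3)
print(d1 / d2)  # 2.0
```

This is only a back-of-the-envelope reading of the scaling claim: it says nothing about the transition itself, only that stronger regularization or faster learning should shorten the wait.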
From the abstract
Grokking is the sudden generalization that appears long after a model has perfectly memorized its training data. Although this phenomenon has been widely observed, there is still no quantitative theory explaining the length of the delay between memorization and generalization. Prior work has noted that weight decay plays an important role, but no result derives tight bounds for the delay or explains its scaling behavior. We present a first-principles theory showing that grokking arises from a norm-driven representational phase transition.