Grokking is driven by a norm-driven representational phase transition with a predictable scaling law.
arXiv · March 17, 2026 · 2603.13331
The Takeaway
The paper provides the first quantitative theory of the delay between memorization and generalization, showing that it scales inversely with both weight decay and learning rate. This turns grokking from a mysterious phenomenon into a controllable aspect of training dynamics.
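A minimal sketch of what that inverse scaling implies in practice. The constant `C` and the helper `predicted_grokking_delay` are hypothetical illustrations, not the paper's actual formula; the only assumption taken from the source is that the delay scales as 1/(weight decay × learning rate).

```python
def predicted_grokking_delay(weight_decay: float, lr: float, C: float = 1.0) -> float:
    """Hypothetical delay (in training steps) between memorization and
    generalization, under the claimed inverse scaling in weight decay and lr.
    C is an unspecified problem-dependent constant."""
    return C / (weight_decay * lr)

# Doubling weight decay (or learning rate) should halve the predicted delay.
d1 = predicted_grokking_delay(weight_decay=1e-2, lr=1e-3)
d2 = predicted_grokking_delay(weight_decay=2e-2, lr=1e-3)
print(d1 / d2)  # 2.0
```

This is only a back-of-the-envelope reading of the scaling claim: it says nothing about the transition itself, only that stronger regularization or faster learning should shorten the wait.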
From the abstract
Grokking is the sudden generalization that appears long after a model has perfectly memorized its training data. Although this phenomenon has been widely observed, there is still no quantitative theory explaining the length of the delay between memorization and generalization. Prior work has noted that weight decay plays an important role, but no result derives tight bounds for the delay or explains its scaling behavior. We present a first-principles theory showing that grokking arises from a norm-driven representational phase transition.