AI & ML · Breaks Assumption

Grokking is not the discovery of a new algorithm, but the sharpening of one already latent in the model during the memorization phase.

March 26, 2026

Original Paper

Latent Algorithmic Structure Precedes Grokking: A Mechanistic Study of ReLU MLPs on Modular Arithmetic

Anand Swaroop

arXiv · 2603.23784

The Takeaway

This mechanistic study challenges the prevailing view of grokking as a sudden phase transition. It shows that the generalizing circuit exists early in training as near-binary square waves, implying that researchers should focus on 'sharpening' latent structures rather than waiting for their 'discovery' during training.
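
To make the setup concrete, here is a minimal sketch of the kind of experiment the paper studies: a two-layer ReLU MLP trained on modular addition with one-hot inputs, a small training fraction, and weight decay, the regime in which grokking is typically observed. The modulus, width, optimizer settings, and split below are illustrative assumptions, not the paper's configuration.

```python
# Minimal grokking sketch: ReLU MLP on (a + b) mod P.
# All hyperparameters here (P, HIDDEN, lr, weight decay, split) are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

P, HIDDEN = 97, 256
torch.manual_seed(0)

# Every (a, b) pair, labeled (a + b) mod P, with concatenated one-hot inputs.
a, b = torch.meshgrid(torch.arange(P), torch.arange(P), indexing="ij")
a, b = a.flatten(), b.flatten()
x = torch.cat([F.one_hot(a, P), F.one_hot(b, P)], dim=1).float()
y = (a + b) % P

# Small train fraction plus weight decay: the regime where grokking shows up.
perm = torch.randperm(P * P)
train, val = perm[: P * P // 3], perm[P * P // 3:]

model = nn.Sequential(nn.Linear(2 * P, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(50_000):
    opt.zero_grad()
    F.cross_entropy(model(x[train]), y[train]).backward()
    opt.step()
    if step % 1_000 == 0:
        with torch.no_grad():
            acc = (model(x[val]).argmax(-1) == y[val]).float().mean().item()
        print(f"step {step:6d}  val acc {acc:.3f}")  # rises long after train accuracy saturates
```

Under this kind of setup, training accuracy saturates early while validation accuracy climbs much later; the paper's claim is that the square-wave structure in the first-layer weights is already forming during that apparent plateau.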

From the abstract

Grokking, the phenomenon where the validation accuracy of neural networks on modular addition of two integers rises long after the training data has been memorized, has been characterized in previous work as producing sinusoidal input-weight distributions in transformers and multi-layer perceptrons (MLPs). We find empirically that ReLU MLPs in our experimental setting instead learn near-binary square-wave input weights, where intermediate-valued weights appear exclusively near sign-change boundaries [...]
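
The square-wave claim is straightforward to probe. The sketch below, with an assumed near-binary tolerance, scores one hidden unit's first-layer weight row over one operand's P one-hot inputs: the fraction of weights close to the ±peak values, and the fraction of intermediate-valued weights that sit adjacent to a sign-change boundary. On a trained model like the one above, you would pass slices such as model[0].weight[unit, :P].

```python
# Hedged probe for square-wave structure in first-layer weights.
# The near-binary tolerance (tol) is an assumed threshold, not from the paper.
import torch

def square_wave_stats(w_row: torch.Tensor, tol: float = 0.2):
    """Score one hidden unit's weights over one operand's P one-hot inputs."""
    peak = w_row.abs().max()
    near_binary = w_row.abs() > (1 - tol) * peak       # close to +peak or -peak
    flips = torch.sign(w_row[:-1]) != torch.sign(w_row[1:])
    boundary = torch.zeros_like(near_binary)
    boundary[:-1] |= flips                             # mark both sides of each
    boundary[1:] |= flips                              # sign-change boundary
    intermediate = ~near_binary
    frac_binary = near_binary.float().mean().item()
    # Square-wave prediction: intermediate weights occur only at boundaries.
    frac_inter_at_boundary = (
        (intermediate & boundary).sum() / intermediate.sum().clamp(min=1)
    ).item()
    return frac_binary, frac_inter_at_boundary

# Demo on a synthetic near-binary square wave with soft edges.
t = torch.linspace(0, 4 * torch.pi, 97)
w = torch.tanh(12 * torch.sin(t))    # plateaus at ±1, transitions at sign flips
print(square_wave_stats(w))          # high frac_binary, boundary-localized rest
```

A high first score with the second score near one is what the abstract's description predicts: almost all weights at the plateaus, with the few intermediate values confined to the edges of each sign flip.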