Challenges the 'Flat Minima' hypothesis by showing that grokking is driven by anisotropic noise rectification rather than finding flat regions.
arXiv · March 17, 2026 · 2603.15492
The Takeaway
It identifies 'Spectral Gating' as the mechanism behind grokking, providing a new theoretical framework for how adaptive optimizers like AdamW navigate sharp manifolds to reach generalization. This challenges the common intuition that flatter minima are always better for algorithmic tasks.
From the abstract
Standard optimization theories struggle to explain grokking, where generalization occurs long after training convergence. While geometric studies attribute this to slow drift, they often overlook the interaction between the optimizer's noise structure and landscape curvature. This work analyzes AdamW dynamics on modular arithmetic tasks, revealing a "Spectral Gating" mechanism that regulates the transition from memorization to generalization. We find that AdamW operates as a variance-gated stoc…
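The "variance-gated" intuition can be illustrated outside the paper's setting. A minimal sketch, assuming only the standard Adam-style second-moment normalization (this is not the paper's code, and `avg_effective_step` is a hypothetical helper): coordinates whose stochastic gradients fluctuate more accumulate a larger second-moment estimate, so their effective step size is gated down relative to quieter coordinates with the same mean gradient.

```python
import math
import random

def avg_effective_step(grad_fn, lr=1e-3, beta2=0.999, eps=1e-8, steps=5000):
    """Average per-step magnitude of an Adam-style update lr*|g|/sqrt(v_hat)
    for one coordinate whose stochastic gradient is drawn from grad_fn."""
    v, total = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn()
        v = beta2 * v + (1 - beta2) * g * g      # second-moment EMA
        v_hat = v / (1 - beta2 ** t)             # bias correction
        total += lr * abs(g) / (math.sqrt(v_hat) + eps)
    return total / steps

random.seed(0)
# Two coordinates with the same mean gradient (1.0) but different noise scales:
low_noise = lambda: 1.0 + random.gauss(0, 0.1)
high_noise = lambda: 1.0 + random.gauss(0, 10.0)

s_low = avg_effective_step(low_noise)
s_high = avg_effective_step(high_noise)
# The noisy coordinate's effective step is suppressed by its larger v_hat,
# i.e. the update is "gated" by per-coordinate gradient variance.
print(s_low > s_high)
```

This only demonstrates the generic variance-normalization effect in Adam/AdamW; the paper's claim concerns how this gating interacts with landscape curvature (the spectrum) during grokking, which this toy does not model.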