AI & ML Breaks Assumption

Challenges the 'Flat Minima' hypothesis by showing that grokking is driven by anisotropic noise rectification rather than by convergence to flat regions.

arXiv · March 17, 2026 · 2603.15492

Pratyush Acharya, Habish Dhakal

The Takeaway

The paper identifies 'Spectral Gating' as the mechanism behind grokking, providing a new theoretical framework for how adaptive optimizers like AdamW navigate sharp manifolds to reach generalization. This challenges the common intuition that flatter minima are always better for algorithmic tasks.

From the abstract

Standard optimization theories struggle to explain grokking, where generalization occurs long after training convergence. While geometric studies attribute this to slow drift, they often overlook the interaction between the optimizer's noise structure and landscape curvature. This work analyzes AdamW dynamics on modular arithmetic tasks, revealing a "Spectral Gating" mechanism that regulates the transition from memorization to generalization. We find that AdamW operates as a variance-gated stoc…
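The abstract's "variance-gated" framing can be seen in AdamW's update rule itself: the effective per-parameter step, lr · m̂ / (√v̂ + ε), shrinks along directions where the gradient is noisy, because the second-moment estimate v̂ absorbs the variance. A minimal numpy sketch of this gating effect (illustrative only; the paper's actual Spectral Gating analysis on modular arithmetic is not reproduced here, and the gradient distributions below are made up for the demo):

```python
import numpy as np

def adamw_moments(m, v, g, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam-style moment update; returns new moments and the
    effective per-parameter step (weight decay omitted for clarity)."""
    m = b1 * m + (1 - b1) * g          # first moment (EMA of gradient)
    v = b2 * v + (1 - b2) * g**2       # second moment (EMA of squared gradient)
    m_hat = m / (1 - b1**t)            # bias correction
    v_hat = v / (1 - b2**t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
    return m, v, step

rng = np.random.default_rng(0)
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 501):
    # Two directions with the same mean gradient (1.0) but very
    # different noise levels: index 0 is quiet, index 1 is noisy.
    g = np.array([1.0 + 0.01 * rng.normal(),
                  1.0 + 5.0 * rng.normal()])
    m, v, step = adamw_moments(m, v, g, t)

# The noisy direction accumulates a large v and so receives a much
# smaller effective step: gradient variance acts as a gate on movement.
print("effective steps:", step)
print("second moments: ", v)
```

In the quiet direction the step settles near the nominal learning rate, while the high-variance direction is throttled by its inflated √v̂ despite having the same mean gradient, which is the sense in which the update is "variance-gated".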