Sparsity (MoE and GQA) is found to act as a critical regulator for variance propagation, mitigating the 'curse of depth' in LLMs.
arXiv · March 17, 2026 · 2603.15389
The Takeaway
The paper provides a mechanistic explanation for why sparse models utilize deeper layers more effectively than dense models do. This insight leads to a practical recipe for training very deep models without the performance saturation typically seen in later layers.
From the abstract
Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation
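The variance-accumulation mechanism described in the abstract can be illustrated with a toy simulation (not from the paper): a Pre-LN residual stream `x <- x + f(LN(x))` accumulates variance roughly linearly with depth, and zeroing out a fraction of each sublayer's output, a crude stand-in for MoE-style sparsity where only some experts fire per token, slows that growth. The `sparsity` knob and random linear sublayers here are illustrative assumptions, not the authors' setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 256, 64

def layer_norm(x):
    # Normalize the residual stream to zero mean, unit variance.
    return (x - x.mean()) / (x.std() + 1e-6)

def run(depth, sparsity=1.0):
    # Pre-LN residual update: x <- x + mask * (W @ LN(x)).
    # `sparsity` is the fraction of sublayer outputs kept active,
    # a toy proxy for MoE routing only a few experts per token.
    x = rng.normal(size=d)
    variances = []
    for _ in range(depth):
        W = rng.normal(size=(d, d)) / np.sqrt(d)  # random sublayer
        out = W @ layer_norm(x)
        mask = rng.random(d) < sparsity           # drop inactive units
        x = x + out * mask
        variances.append(x.var())
    return variances

dense = run(depth, sparsity=1.0)
sparse = run(depth, sparsity=0.25)
print(f"final variance, dense:  {dense[-1]:.1f}")
print(f"final variance, sparse: {sparse[-1]:.1f}")
```

In the dense run the variance of the residual stream grows by about one unit per layer, so LN must rescale later inputs ever more aggressively and each block's relative contribution shrinks, the near-identity behavior the abstract refers to; the sparse run accumulates variance far more slowly.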