AI & ML Scaling Insight

Sparsity (MoE and GQA) is found to act as a critical regulator of variance propagation, mitigating the 'curse of depth' in LLMs.

arXiv · March 17, 2026 · 2603.15389

Dilxat Muhtar, Xinyuan Song, Sebastian Pokutta, Max Zimmer, Nico Pelleriti, Thomas Hofmann, Shiwei Liu

The Takeaway

The paper provides a mechanistic explanation for why sparse models utilize deeper layers more effectively than dense models. This insight leads to a practical recipe for training very deep models without the performance saturation typically seen in later layers.

From the abstract

Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation …
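The variance-accumulation mechanism the abstract describes can be illustrated with a toy numpy sketch (not the paper's method): in a Pre-LN residual stream, each block adds an output of roughly constant variance to a residual whose variance keeps growing, so each deep block's relative contribution shrinks toward identity. Here random linear maps stand in for attention/MLP blocks; the dimensions and depth are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 512, 32
x = rng.standard_normal(d)  # residual stream activation

def layer_norm(v):
    # simple LayerNorm without learned scale/shift
    return (v - v.mean()) / v.std()

ratios = []
for _ in range(depth):
    # stand-in for an attention/MLP block: a random linear map applied to
    # the normalized input, so the block output variance stays roughly constant
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    f = W @ layer_norm(x)
    # relative contribution of this block to the residual stream
    ratios.append(np.linalg.norm(f) / np.linalg.norm(x))
    x = x + f  # Pre-LN residual update

# Residual variance grows roughly linearly with depth, so each block's
# relative contribution shrinks -- deep blocks approach near-identity behavior.
print(f"Var(x) after {depth} layers: {x.var():.1f}")
print(f"block/residual norm ratio, first vs last layer: "
      f"{ratios[0]:.2f} vs {ratios[-1]:.2f}")
```

The printed ratio falls with depth, which is the under-utilization the paper attributes to Pre-LN variance growth; the paper's claim is that sparsity (MoE, GQA) tempers this growth.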