Introduces a robust framework for optimal Mixture-of-Experts (MoE) architecture design across six orders of magnitude in compute.
March 24, 2026
Original Paper
Holistic Scaling Laws for Optimal Mixture-of-Experts Architecture Optimization
arXiv · 2603.21862
The Takeaway
It decomposes the massive 16-dimensional MoE architectural search space into manageable phases, giving practitioners precise formulas that map any compute budget to optimal parameter counts, layer types, and expert counts. This addresses a major gap in modern LLM training, where MoE configurations are often chosen heuristically rather than optimally.
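To make the budget-to-configuration idea concrete, here is a minimal, purely illustrative sketch of what such a mapping looks like in code. The power-law form and every coefficient (`k_n`, `a_n`, `k_e`, `a_e`) are placeholder assumptions standing in for fitted constants, not the paper's actual formulas.

```python
import math

# Illustrative only: placeholder scaling-law-style mapping from a training
# compute budget C (FLOPs) to an MoE configuration. Coefficients are made up
# and stand in for constants a framework like this would fit empirically.

def optimal_moe_config(compute_flops: float) -> dict:
    """Map a compute budget to an illustrative MoE configuration."""
    # Power-law form N_opt = k_n * C^a_n, the usual shape in scaling-law work.
    k_n, a_n = 0.09, 0.50          # placeholder coefficients, not from the paper
    total_params = k_n * compute_flops ** a_n

    # Expert count modeled the same way, rounded to the nearest power of two.
    k_e, a_e = 0.002, 0.18         # placeholder coefficients, not from the paper
    num_experts = 2 ** round(math.log2(max(2.0, k_e * compute_flops ** a_e)))

    return {
        "compute_flops": compute_flops,
        "total_params": total_params,
        "num_experts": num_experts,
    }

if __name__ == "__main__":
    # Sweep several budgets, echoing the paper's multi-order-of-magnitude range.
    for budget in (1e20, 1e22, 1e24):
        print(optimal_moe_config(budget))
```

The point of the sketch is the interface, not the numbers: a practitioner plugs in a compute budget and reads off a candidate configuration, rather than hand-tuning expert counts and layer choices.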
From the abstract
Scaling laws for Large Language Models govern macroscopic resource allocation, yet translating them into precise Mixture-of-Experts (MoE) architectural configurations remains an open problem due to the combinatorially vast design space. Existing MoE scaling studies are constrained by experimental budgets to either augment scaling formulas with extra MoE variables, risking unreliable fits, or fix all non-MoE factors, ignoring global interactions. We propose a reusable framework for holistic MoE architecture optimization…