Introduces a robust framework for optimal Mixture-of-Experts (MoE) architecture design across six orders of magnitude in compute.
March 24, 2026
Original Paper
Holistic Scaling Laws for Optimal Mixture-of-Experts Architecture Optimization
arXiv · 2603.21862
The Takeaway
It decomposes the massive 16-dimensional MoE architectural search space into manageable phases, giving practitioners precise formulas that map any compute budget to optimal parameter counts, layer types, and expert counts. This addresses a major gap in modern LLM training, where MoE configurations are often chosen heuristically rather than optimally.
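To make the budget-to-configuration idea concrete, here is a minimal, purely illustrative sketch of what such a mapping looks like in code. The power-law form and every coefficient (`k_n`, `a_n`, `k_e`, `a_e`) are placeholder assumptions standing in for fitted constants, not the paper's actual formulas.

```python
import math

# Illustrative only: placeholder scaling-law-style mapping from a training
# compute budget C (FLOPs) to an MoE configuration. Coefficients are made up
# and stand in for constants a framework like this would fit empirically.

def optimal_moe_config(compute_flops: float) -> dict:
    """Map a compute budget to an illustrative MoE configuration."""
    # Power-law form N_opt = k_n * C^a_n, the usual shape in scaling-law work.
    k_n, a_n = 0.09, 0.50          # placeholder coefficients, not from the paper
    total_params = k_n * compute_flops ** a_n

    # Expert count modeled the same way, rounded to the nearest power of two.
    k_e, a_e = 0.002, 0.18         # placeholder coefficients, not from the paper
    num_experts = 2 ** round(math.log2(max(2.0, k_e * compute_flops ** a_e)))

    return {
        "compute_flops": compute_flops,
        "total_params": total_params,
        "num_experts": num_experts,
    }

if __name__ == "__main__":
    # Sweep several budgets, echoing the paper's multi-order-of-magnitude range.
    for budget in (1e20, 1e22, 1e24):
        print(optimal_moe_config(budget))
```

The point of the sketch is the interface, not the numbers: a practitioner plugs in a compute budget and reads off a candidate configuration, rather than hand-tuning expert counts and layer choices.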
From the abstract
Scaling laws for Large Language Models govern macroscopic resource allocation, yet translating them into precise Mixture-of-Experts (MoE) architectural configurations remains an open problem due to the combinatorially vast design space. Existing MoE scaling studies are constrained by experimental budgets to either augment scaling formulas with extra MoE variables, risking unreliable fits, or fix all non-MoE factors, ignoring global interactions. We propose a reusable framework for holistic MoE architecture optimization…