AI & ML Paradigm Shift

Expert Threshold (ET) routing replaces the standard top-k token-choice router with an independent per-expert thresholding mechanism, achieving 1.6x faster training convergence.

arXiv · March 13, 2026 · 2603.11535

Hanchi Sun, Yixin Liu, Yonghui Wu, Lichao Sun

Why it matters

By allowing tokens to be routed to a dynamic number of experts based on an exponential moving average (EMA) threshold, this method eliminates the need for auxiliary load-balancing losses and is fully causal. This represents a more scalable and flexible architecture for future Mixture-of-Experts (MoE) models.

From the abstract

Token-choice Mixture-of-Experts (TC-MoE) routes each token to a fixed number of experts, limiting dynamic computation allocation and requiring auxiliary losses to maintain load balance. We propose Expert Threshold (ET) routing, where each expert maintains an exponential moving average (EMA) threshold estimated from the global token distribution. At both training and inference, each token is independently routed to an expert if its score exceeds the expert's threshold, enabling dynamic computation …
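
The core mechanism can be sketched in a few lines: instead of taking the top-k experts per token, each token activates every expert whose per-expert threshold its router score exceeds, and each threshold is updated as an EMA of a batch statistic. This is an illustrative sketch, not the paper's implementation; the `et_route` function, the `ema_decay` value, and the use of the per-expert mean score as the threshold estimator are all assumptions (the paper estimates thresholds from the global token distribution).

```python
import numpy as np

def et_route(scores, thresholds, ema_decay=0.99):
    """Illustrative Expert Threshold (ET) routing step.

    scores:     (tokens, experts) router affinity scores
    thresholds: (experts,) per-expert EMA thresholds
    Returns a boolean routing mask and the updated thresholds.
    """
    # Each token is routed independently: it activates every expert
    # whose threshold its score exceeds (no fixed top-k count).
    mask = scores > thresholds  # (tokens, experts)

    # Update each expert's threshold as an EMA of a batch statistic.
    # The per-expert mean score here is a stand-in; the paper's exact
    # estimator over the global token distribution may differ.
    batch_stat = scores.mean(axis=0)
    new_thresholds = ema_decay * thresholds + (1 - ema_decay) * batch_stat
    return mask, new_thresholds

rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 4))  # 8 tokens, 4 experts
thresholds = np.zeros(4)
mask, thresholds = et_route(scores, thresholds)
# A token may activate anywhere from 0 to 4 experts,
# so per-token compute varies dynamically.
```

Because the decision for each (token, expert) pair depends only on that token's score and the running threshold, routing needs no auxiliary balancing loss and remains causal at inference.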