MoE-Sieve reduces Mixture-of-Experts LoRA fine-tuning parameters and training time by ~70% by adapting only the most frequently activated 'hot' experts.
March 26, 2026
Original Paper
MoE-Sieve: Routing-Guided LoRA for Efficient MoE Fine-Tuning
arXiv · 2603.24044
The Takeaway
MoE-Sieve challenges the practice of applying LoRA to all experts in an MoE model, showing that adapting 'cold' experts is often counterproductive and adds gradient noise. This significantly lowers the hardware barrier for fine-tuning massive MoE models like Mixtral or DeepSeek without sacrificing performance.
From the abstract
Standard LoRA fine-tuning of Mixture-of-Experts (MoE) models applies adapters to every expert, yet our profiling shows that per-layer expert routing is highly skewed: a small subset of experts handles most tokens in each layer, while many others are rarely activated ("cold"). We propose MoE-Sieve, a simple routing-guided framework for LoRA fine-tuning, and pair it with a systematic profiling study of expert routing across architectures and tasks. The method is simple: profile routing counts on a
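The core idea from the abstract, profiling per-layer routing counts and then restricting LoRA to the experts that handle most tokens, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the top-k routing simulation, and the coverage-threshold selection rule are all assumptions made here for clarity.

```python
import numpy as np

def profile_routing(router_logits, top_k=2):
    """Count how many tokens each expert receives under top-k routing.

    router_logits: (num_tokens, num_experts) array of gate scores.
    Returns an array of per-expert token counts.
    """
    # Indices of the top_k highest-scoring experts for each token.
    top_experts = np.argsort(router_logits, axis=1)[:, -top_k:]
    return np.bincount(top_experts.ravel(),
                       minlength=router_logits.shape[1])

def select_hot_experts(counts, coverage=0.8):
    """Pick the smallest set of experts covering `coverage` of routed tokens.

    The 80% coverage threshold is a hypothetical selection rule chosen
    for this sketch; the paper may use a different criterion.
    """
    order = np.argsort(counts)[::-1]          # experts, busiest first
    cum = np.cumsum(counts[order])            # cumulative token coverage
    n_hot = int(np.searchsorted(cum, coverage * cum[-1]) + 1)
    return sorted(order[:n_hot].tolist())

# Toy example with skewed routing: experts 0 and 1 dominate.
logits = np.array([[5, 4, 1, 0],
                   [5, 4, 0, 1],
                   [5, 1, 4, 0],
                   [4, 5, 1, 0]], dtype=float)
counts = profile_routing(logits, top_k=2)     # [4, 3, 1, 0]
hot = select_hot_experts(counts, coverage=0.8)  # [0, 1]
```

In a real fine-tuning run, LoRA adapters would then be attached only to the experts in `hot` for each layer, leaving the cold experts frozen.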