ExFusion enables Transformer models to gain the capacity of Mixture-of-Experts during training while remaining a standard dense model for deployment.
March 31, 2026
Original Paper
ExFusion: Efficient Transformer Training via Multi-Experts Fusion
arXiv · 2603.27965
The Takeaway
ExFusion 'upcycles' dense FFNs into multi-expert layers during training and then fuses them back into a single unified expert for inference. This lets practitioners capture MoE performance gains without the memory overhead or deployment complexity of actual MoE architectures.
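The train-then-fuse idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the expert count, the perturbation standing in for training, and the uniform weight averaging used as the fusion rule are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, num_experts = 8, 32, 4

# Start from a dense FFN weight matrix and "upcycle" it into
# multiple experts, each initialized as a copy of the dense layer.
W_dense = rng.normal(size=(d_model, d_ff))
experts = [W_dense.copy() for _ in range(num_experts)]

# During training the experts would diverge; simulate that here
# with small random perturbations (stand-in for gradient updates).
for W in experts:
    W += 0.01 * rng.normal(size=W.shape)

# Fuse the experts back into one dense FFN for deployment. The
# paper's exact fusion rule isn't given here; uniform averaging
# is a hypothetical placeholder.
W_fused = np.mean(np.stack(experts), axis=0)

# The deployed model keeps the original dense footprint.
assert W_fused.shape == W_dense.shape
```

The key property the sketch shows: whatever happens during multi-expert training, inference sees a single weight matrix of the original dense shape, so serving cost is unchanged.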
From the abstract
Mixture-of-Experts (MoE) models substantially improve performance by increasing the capacity of dense architectures. However, directly training MoE models requires considerable computational resources and introduces extra overhead in parameter storage and deployment. Therefore, it is critical to develop an approach that leverages the multi-expert capability of MoE to enhance performance while incurring minimal additional cost. To this end, we propose a novel pre-training approach, termed ExFusion.