AI & ML Efficiency Breakthrough

ExFusion enables Transformer models to gain the capacity of Mixture-of-Experts during training while remaining a standard dense model for deployment.

March 31, 2026

Original Paper

ExFusion: Efficient Transformer Training via Multi-Experts Fusion

Jiacheng Ruan, Daize Dong, Xiaoye Qu, Tong Zhu, Ting Liu, Yuzhuo Fu, Yu Cheng, Suncheng Xiang

arXiv · 2603.27965

The Takeaway

ExFusion "upcycles" dense FFNs into multi-expert layers during training, then fuses them back into a single unified expert for inference. This lets practitioners capture MoE performance gains without the memory overhead or deployment complexity of a true MoE architecture.
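The upcycle-then-fuse idea can be illustrated with a minimal numpy sketch. Here the experts are created by perturbing a dense FFN's weights, and fusion is a simple parameter average; both the perturbation and the averaging rule are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 32, 4

# A dense FFN layer, represented by its two weight matrices.
base_in = rng.normal(size=(d_model, d_ff))
base_out = rng.normal(size=(d_ff, d_model))

# "Upcycle": replicate the dense FFN into several experts
# (small noise added so the copies can diverge during training).
experts = [
    (base_in + 0.01 * rng.normal(size=base_in.shape),
     base_out + 0.01 * rng.normal(size=base_out.shape))
    for _ in range(n_experts)
]

def fuse(experts):
    """Fuse trained experts back into one dense FFN by averaging
    their parameters (hypothetical fusion rule)."""
    w_in = np.mean([w for w, _ in experts], axis=0)
    w_out = np.mean([w for _, w in experts], axis=0)
    return w_in, w_out

fused_in, fused_out = fuse(experts)

# The fused layer has exactly the dense FFN's parameter count,
# so deployment sees a standard dense model.
print(fused_in.shape, fused_out.shape)
```

The point of the sketch is the shape check at the end: however many experts exist during training, the deployed layer is a single dense FFN of the original size.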

From the abstract

Mixture-of-Experts (MoE) models substantially improve performance by increasing the capacity of dense architectures. However, directly training MoE models requires considerable computational resources and introduces extra overhead in parameter storage and deployment. Therefore, it is critical to develop an approach that leverages the multi-expert capability of MoE to enhance performance while incurring minimal additional cost. To this end, we propose a novel pre-training approach, termed ExFusion.