AI & ML Efficiency Breakthrough

Eliminates the 2.5x latency penalty of dynamic adapters in LLMs via pre-gating and fused CUDA kernels.

arXiv · March 13, 2026 · 2603.11873

Qiyang Li, Rui Kong, Yuchen Li, Hengyi Cai, Shuaiqiang Wang, Linghe Kong, Guihai Chen, Dawei Yin

Why it matters

Mixture-of-Experts with adapters usually suffers from fragmented kernel launches, because each layer's routing decision triggers its own small adapter computations. AdaFuse's "decide-once, apply-everywhere" routing collapses this into a single fused switching operation, making dynamic sparse models as fast as static backbones for the first time.
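The idea can be illustrated with a minimal sketch. Here, routing is resolved once up front and the resulting expert selection is reused by every layer, so no per-layer gating work remains to fragment into separate kernel launches. All names (`pre_gate`, `apply_adapters`, `forward`) are illustrative, not the paper's actual API, and the "fusion" is only suggested by hoisting the decision out of the loop.

```python
# Hypothetical sketch of "decide-once, apply-everywhere" routing.
# Not the paper's implementation: in AdaFuse the per-layer work would be
# fused CUDA kernels; here plain Python stands in for the structure.

def pre_gate(token_scores, top_k=2):
    """Decide once: pick the top-k adapter indices before the layer stack."""
    ranked = sorted(range(len(token_scores)), key=lambda i: -token_scores[i])
    return ranked[:top_k]

def apply_adapters(hidden, selected, adapter_scales):
    """Apply only the pre-selected adapters (toy scalar adapters).

    Because the selection is fixed for the whole forward pass, this loop
    is a candidate for a single fused operation on GPU.
    """
    for idx in selected:
        hidden = [h + adapter_scales[idx] * h for h in hidden]
    return hidden

def forward(hidden, token_scores, layers):
    # Routing decided once...
    selected = pre_gate(token_scores)
    # ...then reused by every layer: no per-layer gating launches.
    for adapter_scales in layers:
        hidden = apply_adapters(hidden, selected, adapter_scales)
    return hidden
```

The contrast with conventional dynamic adapters is that the gate would sit *inside* the layer loop, forcing a fresh routing decision (and its attendant kernel launches) at every layer.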

From the abstract

The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at a steep cost: despite a minimal increase in computational load, inference latency often skyrockets, slowing decoding by more than 2.5x. Through a fine-grained performance analysis, we pinpoint the primary bottleneck not in the computation…