Eliminates the 2.5x latency penalty of dynamic adapters in LLMs via pre-gating and fused CUDA kernels.
arXiv · March 13, 2026 · 2603.11873
Why it matters
Mixture-of-Experts with adapters usually suffers from fragmented kernel launches. AdaFuse's 'decide-once, apply-everywhere' routing allows for a single fused switching operation, making dynamic sparse models as fast as static backbones for the first time.
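The contrast between per-token dispatch and decide-once routing can be sketched in a few lines. This is an illustrative NumPy toy, not AdaFuse's implementation: the router, the pooled gating decision, and the LoRA shapes here are all assumptions chosen to show the pattern, with the fused CUDA kernel stand-in being a single merged matmul.

```python
import numpy as np

# Illustrative sketch of "decide-once, apply-everywhere" pre-gating.
# All names and shapes are hypothetical; the fused CUDA kernel is
# approximated by one dense matmul over a pre-merged weight.

d, r, n_experts, seq = 8, 2, 4, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))                # frozen base weight
A = rng.standard_normal((n_experts, d, r))     # LoRA down-projections
B = rng.standard_normal((n_experts, r, d))     # LoRA up-projections
router_w = rng.standard_normal((d, n_experts)) # router weights

x = rng.standard_normal((seq, d))              # token activations

def dynamic_dispatch(x):
    """Route every token separately: one adapter lookup and matmul
    per token -- the fragmented kernel-launch pattern."""
    out = np.empty_like(x)
    for t in range(len(x)):
        e = int(np.argmax(x[t] @ router_w))    # per-token decision
        out[t] = x[t] @ (W + A[e] @ B[e])
    return out

def pre_gated(x):
    """Decide once (here: from a pooled sequence summary), fold the
    chosen adapter into the base weight, then apply one fused matmul."""
    e = int(np.argmax(x.mean(axis=0) @ router_w))  # decide once
    W_fused = W + A[e] @ B[e]                      # merge adapter into W
    return x @ W_fused                             # single dense matmul

y_dynamic = dynamic_dispatch(x)
y_fused = pre_gated(x)
```

The point of the sketch is structural: `pre_gated` issues one large matrix multiply regardless of sequence length, whereas `dynamic_dispatch` pays a per-token routing and adapter cost, which on a GPU shows up as many small kernel launches.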
From the abstract
The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at a steep cost: despite minimal increases in computational load, the inference latency often skyrockets, leading to decoding speeds slowing by over 2.5 times. Through a fine-grained performance analysis, we pinpoint the primary bottleneck not in the computation …