AI & ML Efficiency Breakthrough

Modality-level disaggregation enables cost-optimal MLLM serving across heterogeneous GPUs over commodity PCIe, bypassing the need for expensive NVLink interconnects.

arXiv · March 16, 2026 · 2603.12707

Donglin Yu

Why it matters

By partitioning models at the vision-to-language boundary, this approach reduces cross-device transfer volume by a factor of O(L), where L is the transformer depth, compared to standard KV-cache disaggregation. Practitioners can achieve up to 40% cost savings by using mixed-tier GPU clusters (e.g., combining older and newer GPUs) without significant latency penalties.

From the abstract

Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound. We show that under standard transformer KV caching, the modality boundary (between vision encoder and language model) minimizes cross-device transfer among all partition points that preserve standard stage-based execution. Partitioning here reduces transfer complexity from $O(L \cdot s_{\text{ctx}})$ bytes (GB-scale KV cache) […]
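
The transfer-volume argument can be checked with back-of-envelope arithmetic: a KV-cache partition must move two tensors (K and V) per layer per context token, while a modality-boundary partition moves only the projected vision-token embeddings once. The sketch below uses illustrative 7B-class model shapes chosen for this example, not figures from the paper.

```python
# Back-of-envelope comparison of cross-device transfer volume for two
# MLLM partition points. All model shapes below are illustrative
# assumptions, not values taken from the paper.

def kv_cache_bytes(n_layers, seq_len, n_kv_heads, head_dim, dtype_bytes=2):
    """KV-cache disaggregation: one K and one V tensor per layer must
    cross devices, so volume scales as O(L * s_ctx)."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * dtype_bytes

def modality_boundary_bytes(n_vision_tokens, hidden_dim, dtype_bytes=2):
    """Vision-to-language boundary: only the projected vision-token
    embeddings cross devices, independent of language-model depth L."""
    return n_vision_tokens * hidden_dim * dtype_bytes

# Hypothetical 7B-class shapes (fp16): 32 layers, 4K context,
# grouped-query attention with 8 KV heads of dim 128.
L, s_ctx, kv_heads, head_dim, hidden = 32, 4096, 8, 128, 4096
n_vis = 576  # e.g. a ViT patch grid flattened into tokens

kv = kv_cache_bytes(L, s_ctx, kv_heads, head_dim)
mb = modality_boundary_bytes(n_vis, hidden)
print(f"KV-cache transfer:          {kv / 1e9:.2f} GB")
print(f"Modality-boundary transfer: {mb / 1e6:.2f} MB")
print(f"Reduction factor:           {kv / mb:.0f}x")
```

Even at these modest shapes the gap is two orders of magnitude (roughly 0.5 GB versus under 5 MB per request), which is why the modality-boundary cut stays practical over commodity PCIe.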