AI & ML Practical Magic

A hidden copy engine inside NVIDIA chips can make the cost of coordinating AI models effectively zero.

April 24, 2026

Original Paper

FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training

arXiv · 2604.19654

The Takeaway

Distributing a large AI model across multiple chips creates a communication bottleneck that slows down training. This new method offloads the coordination work to a specialized part of the hardware that usually sits idle: the NVLink Copy Engine on NVIDIA's Hopper GPUs, a DMA unit that can move data between GPUs without consuming any of the main compute cycles (SM cycles). That makes the heavy communication required to balance Mixture-of-Experts models nearly free, so data centers can train larger and more complex models without the usual coordination penalty.
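To see why the copy engine is "free," it helps to look at how CUDA exposes it. A minimal sketch (illustrative only, not code from the paper): an asynchronous copy issued on its own stream is serviced by a dedicated DMA copy engine, so it proceeds in parallel with a compute kernel running on the SMs. All names and sizes below are hypothetical.

```cuda
#include <cuda_runtime.h>

// Stand-in for expert computation that keeps the SMs busy.
__global__ void compute_kernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *work, *send_buf, *recv_buf;
    cudaMalloc(&work, n * sizeof(float));
    cudaMalloc(&send_buf, n * sizeof(float));
    cudaMallocHost(&recv_buf, n * sizeof(float));  // pinned memory, needed for true async overlap

    cudaStream_t compute_s, copy_s;
    cudaStreamCreate(&compute_s);
    cudaStreamCreate(&copy_s);

    // Compute proceeds on the SMs on one stream ...
    compute_kernel<<<(n + 255) / 256, 256, 0, compute_s>>>(work, n);

    // ... while the copy engine moves load-balancing data on another,
    // consuming no SM cycles.
    cudaMemcpyAsync(recv_buf, send_buf, n * sizeof(float),
                    cudaMemcpyDeviceToHost, copy_s);

    cudaStreamSynchronize(compute_s);
    cudaStreamSynchronize(copy_s);
    return 0;
}
```

A device-to-host copy is shown for simplicity; the intra-node GPU-to-GPU transfers the paper exploits would instead go over NVLink (e.g. via `cudaMemcpyPeerAsync` with peer access enabled), but the overlap principle is the same: the copy is handled by a separate engine, so the expert computation never stalls for it.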

From the abstract

Fine-grained, per-micro-batch load balancing is essential for efficient Mixture-of-Experts (MoE) training, yet every prior dynamic scheduling scheme pays for it with extra communication that is hard to hide, especially on modern bulk-transfer backends such as DeepEP. We make a simple but consequential observation: on the NVIDIA Hopper architecture, the NVLink Copy Engine can move data between intra-node GPUs without consuming any SM cycles, effectively providing a nearly free communication channel.