A predictive scheduling system for multi-agent workflows that optimizes serving across heterogeneous LLM clusters (mixing large and small models).
March 24, 2026
Original Paper
Chimera: Latency- and Performance-Aware Multi-agent Serving for Heterogeneous LLMs
arXiv · 2603.22206
The Takeaway
Most serving systems assume homogeneous clusters of identical hardware and model replicas; Chimera instead uses semantic routing and load balancing to exploit heterogeneous deployments that mix large and small models, reducing latency by up to 2.4x. This matters for enterprises running mixed-tier model pipelines.
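To make the idea concrete, here is a minimal, purely illustrative sketch of routing across a heterogeneous pool: each request carries a difficulty signal (standing in for semantic routing), and the scheduler trades output quality against latency and current load. All names, the scoring formula, and the weights are assumptions for illustration; they are not Chimera's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    latency_ms: float   # expected per-call latency
    quality: float      # relative output quality in [0, 1]
    queue: int = 0      # outstanding requests (load)

def route(replicas, difficulty, latency_weight=0.5):
    """Pick a replica by trading quality shortfall against latency and load.

    Illustrative only: 'difficulty' stands in for a semantic-routing
    signal; the scoring function is a hypothetical stand-in.
    """
    def score(r):
        # Penalize quality shortfall on hard tasks; always penalize
        # expected latency, scaled up by the replica's current queue.
        quality_gap = max(0.0, difficulty - r.quality)
        return quality_gap + latency_weight * r.latency_ms * (1 + r.queue) / 1000
    best = min(replicas, key=score)
    best.queue += 1  # account for the newly assigned request
    return best

replicas = [
    Replica("small-8b", latency_ms=120, quality=0.6),
    Replica("large-70b", latency_ms=900, quality=0.95),
]
```

Under this toy scoring, an easy workflow stage (low difficulty) lands on the small, fast model, while a hard stage falls through to the large one; the queue term then spreads bursts of similar requests across replicas instead of piling them onto a single tier.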
From the abstract
Multi-agent applications often execute complex tasks as multi-stage workflows, where each stage is an LLM call whose output becomes part of the context for subsequent steps. Existing LLM serving systems largely assume homogeneous clusters with identical model replicas. This design overlooks the potential of heterogeneous deployments, where models of different sizes and capabilities enable finer trade-offs between latency and performance. However, heterogeneity introduces new challenges in scheduling…