A predictive scheduling system for multi-agent workflows that optimizes serving across heterogeneous LLM clusters (mixing large and small models).
March 24, 2026
Original Paper
Chimera: Latency- and Performance-Aware Multi-agent Serving for Heterogeneous LLMs
arXiv · 2603.22206
The Takeaway
Most serving systems assume homogeneous clusters of identical hardware and model replicas; Chimera instead uses semantic routing and load balancing to exploit heterogeneous deployments that mix large and small models, reducing latency by up to 2.4x. This matters for enterprises running mixed-tier model pipelines.
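To make the idea concrete, here is a minimal, purely illustrative sketch of routing across a heterogeneous pool: each request carries a difficulty signal (standing in for semantic routing), and the scheduler trades output quality against latency and current load. All names, the scoring formula, and the weights are assumptions for illustration; they are not Chimera's actual algorithm.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    latency_ms: float   # expected per-call latency
    quality: float      # relative output quality in [0, 1]
    queue: int = 0      # outstanding requests (load)

def route(replicas, difficulty, latency_weight=0.5):
    """Pick a replica by trading quality shortfall against latency and load.

    Illustrative only: 'difficulty' stands in for a semantic-routing
    signal; the scoring function is a hypothetical stand-in.
    """
    def score(r):
        # Penalize quality shortfall on hard tasks; always penalize
        # expected latency, scaled up by the replica's current queue.
        quality_gap = max(0.0, difficulty - r.quality)
        return quality_gap + latency_weight * r.latency_ms * (1 + r.queue) / 1000
    best = min(replicas, key=score)
    best.queue += 1  # account for the newly assigned request
    return best

replicas = [
    Replica("small-8b", latency_ms=120, quality=0.6),
    Replica("large-70b", latency_ms=900, quality=0.95),
]
```

Under this toy scoring, an easy workflow stage (low difficulty) lands on the small, fast model, while a hard stage falls through to the large one; the queue term then spreads bursts of similar requests across replicas instead of piling them onto a single tier.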
From the abstract
Multi-agent applications often execute complex tasks as multi-stage workflows, where each stage is an LLM call whose output becomes part of the context for subsequent steps. Existing LLM serving systems largely assume homogeneous clusters with identical model replicas. This design overlooks the potential of heterogeneous deployments, where models of different sizes and capabilities enable finer trade-offs between latency and performance. However, heterogeneity introduces new challenges in scheduling…