Batch-level query routing for LLMs allows for strict cost and capacity control that per-query methods cannot achieve.
March 31, 2026
Original Paper
Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints
arXiv · 2603.26796
The Takeaway
Most LLM routers decide on a per-query basis, failing under adversarial or non-uniform batching. This framework jointly optimizes model assignment for an entire batch, improving accuracy by up to 24% while respecting hard GPU resource and budget constraints.
From the abstract
We study the problem of routing queries to large language models (LLMs) under cost, GPU resource, and concurrency constraints. Prior per-query routing methods often fail to control batch-level cost, especially under non-uniform or adversarial batching. To address this, we propose a batch-level, resource-aware routing framework that jointly optimizes model assignment for each batch while respecting cost and model capacity limits. We further introduce a robust variant that accounts for uncertainty …
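To make the batch-level setting concrete, here is a minimal, illustrative sketch of joint assignment under a batch budget and per-model capacity limits. It is not the paper's algorithm: it uses a simple greedy upgrade heuristic, and the score/cost/capacity inputs are assumptions standing in for whatever predictors the framework actually uses.

```python
from itertools import product

def route_batch(scores, costs, capacities, budget):
    """Greedy batch-level routing sketch (illustrative only).

    scores[q][m]  -- predicted accuracy of model m on query q
    costs[m]      -- per-query cost of model m
    capacities[m] -- max queries model m can serve in this batch
    budget        -- total spend allowed for the whole batch

    Start every query on the cheapest model, then repeatedly apply the
    single upgrade with the best accuracy-gain-per-extra-cost that still
    fits the budget and the target model's capacity.
    """
    n_q, n_m = len(scores), len(costs)
    cheapest = min(range(n_m), key=lambda m: costs[m])
    assign = [cheapest] * n_q
    spent = n_q * costs[cheapest]
    load = [0] * n_m
    load[cheapest] = n_q
    if spent > budget or load[cheapest] > capacities[cheapest]:
        raise ValueError("batch infeasible even on the cheapest model")
    while True:
        best = None  # (gain per extra cost, query, model)
        for q, m in product(range(n_q), range(n_m)):
            cur = assign[q]
            extra = costs[m] - costs[cur]
            gain = scores[q][m] - scores[q][cur]
            if gain <= 0 or extra <= 0:
                continue  # only consider strict upgrades
            if spent + extra > budget or load[m] >= capacities[m]:
                continue  # would break the batch budget or capacity
            if best is None or gain / extra > best[0]:
                best = (gain / extra, q, m)
        if best is None:
            return assign  # no feasible upgrade remains
        _, q, m = best
        spent += costs[m] - costs[assign[q]]
        load[assign[q]] -= 1
        load[m] += 1
        assign[q] = m
```

Because the budget and capacity checks are applied to the batch as a whole, a single expensive query cannot blow past the limits the way it can under purely per-query routing.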