Batch-level query routing for LLMs allows for strict cost and capacity control that per-query methods cannot achieve.
March 31, 2026
Original Paper
Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints
arXiv · 2603.26796
The Takeaway
Most LLM routers decide on a per-query basis, failing under adversarial or non-uniform batching. This framework jointly optimizes model assignment for an entire batch, improving accuracy by up to 24% while respecting hard GPU resource and budget constraints.
From the abstract
We study the problem of routing queries to large language models (LLMs) under cost, GPU resource, and concurrency constraints. Prior per-query routing methods often fail to control batch-level cost, especially under non-uniform or adversarial batching. To address this, we propose a batch-level, resource-aware routing framework that jointly optimizes model assignment for each batch while respecting cost and model capacity limits. We further introduce a robust variant that accounts for uncertainty …
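To make the batch-level setting concrete, here is a minimal, illustrative sketch of joint assignment under a batch budget and per-model capacity limits. It is not the paper's algorithm: it uses a simple greedy upgrade heuristic, and the score/cost/capacity inputs are assumptions standing in for whatever predictors the framework actually uses.

```python
from itertools import product

def route_batch(scores, costs, capacities, budget):
    """Greedy batch-level routing sketch (illustrative only).

    scores[q][m]  -- predicted accuracy of model m on query q
    costs[m]      -- per-query cost of model m
    capacities[m] -- max queries model m can serve in this batch
    budget        -- total spend allowed for the whole batch

    Start every query on the cheapest model, then repeatedly apply the
    single upgrade with the best accuracy-gain-per-extra-cost that still
    fits the budget and the target model's capacity.
    """
    n_q, n_m = len(scores), len(costs)
    cheapest = min(range(n_m), key=lambda m: costs[m])
    assign = [cheapest] * n_q
    spent = n_q * costs[cheapest]
    load = [0] * n_m
    load[cheapest] = n_q
    if spent > budget or load[cheapest] > capacities[cheapest]:
        raise ValueError("batch infeasible even on the cheapest model")
    while True:
        best = None  # (gain per extra cost, query, model)
        for q, m in product(range(n_q), range(n_m)):
            cur = assign[q]
            extra = costs[m] - costs[cur]
            gain = scores[q][m] - scores[q][cur]
            if gain <= 0 or extra <= 0:
                continue  # only consider strict upgrades
            if spent + extra > budget or load[m] >= capacities[m]:
                continue  # would break the batch budget or capacity
            if best is None or gain / extra > best[0]:
                best = (gain / extra, q, m)
        if best is None:
            return assign  # no feasible upgrade remains
        _, q, m = best
        spent += costs[m] - costs[assign[q]]
        load[assign[q]] -= 1
        load[m] += 1
        assign[q] = m
```

Because the budget and capacity checks are applied to the batch as a whole, a single expensive query cannot blow past the limits the way it can under purely per-query routing.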