Adaptive VLM Routing reduces inference costs for Computer Use Agents by up to 78% with negligible accuracy loss.
arXiv · March 16, 2026 · 2603.12823
Why it matters
Current agents route every UI action to an expensive model such as GPT-4o; this framework instead uses a lightweight routing layer that escalates only difficult tasks to the costly model. It offers a practical blueprint for deploying high-reliability agents at a fraction of the current token cost.
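The core idea — try a cheap model first and escalate only when it seems unsure — can be sketched as a confidence-threshold router. This is a minimal illustration, not the paper's implementation: the model stubs, the `Prediction` type, and the fixed confidence values are all hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    action: str        # grounded tool call, e.g. a click at pixel coordinates
    confidence: float  # model's self-estimated probability the grounding is correct

def cheap_model(screenshot: bytes, instruction: str) -> Prediction:
    # Stand-in for a small, inexpensive VLM.
    return Prediction(action="click(120, 340)", confidence=0.92)

def expensive_model(screenshot: bytes, instruction: str) -> Prediction:
    # Stand-in for a large, costly VLM (a GPT-4o-class model).
    return Prediction(action="click(118, 338)", confidence=0.99)

def route(screenshot: bytes, instruction: str, threshold: float = 0.8) -> Prediction:
    """Try the cheap model first; escalate only when its confidence is low."""
    pred = cheap_model(screenshot, instruction)
    if pred.confidence >= threshold:
        return pred  # easy case: cheap model is confident enough
    return expensive_model(screenshot, instruction)  # hard case: escalate
```

With most UI steps handled by the cheap model, only the low-confidence minority incurs the large model's token cost — which is where savings on the order reported by the paper would come from.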
From the abstract
Computer Use Agents (CUAs) translate natural-language instructions into Graphical User Interface (GUI) actions such as clicks, keystrokes, and scrolls by relying on a Vision-Language Model (VLM) to interpret screenshots and predict grounded tool calls. However, grounding accuracy varies dramatically across VLMs, while current CUA systems typically route every action to a single fixed model regardless of difficulty. We propose Adaptive VLM Routing (AVR), a framework that inserts a lightweight […]