Achieves a 98x speedup in LLM request routing on AMD hardware by combining Flash Attention with prompt compression, enabling long-context classification without a dedicated GPU.
arXiv · March 16, 2026 · 2603.12646
Why it matters
This is a significant gain for production deployments: system-level safety and domain routers can handle 32K-token contexts with minimal latency overhead while freeing expensive VRAM for core model inference.
From the abstract
System-level routers that intercept LLM requests for safety classification, domain routing, and PII detection must be both fast and operationally lightweight: they should add minimal latency to every request, yet not require a dedicated GPU -- an expensive resource better used for LLM inference itself. When the router co-locates on the same GPU as vLLM serving instances, standard attention's $O(n^2)$ memory makes long-context classification (8K--32K tokens) impossible: at 8K tokens, three concur…
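The abstract's $O(n^2)$ memory argument can be made concrete with a back-of-envelope sketch. The head count (32) and fp16 precision below are illustrative assumptions, not figures from the paper: materializing the full $n \times n$ attention score matrix per head at 8K tokens already consumes gigabytes per request, which is why a co-located router cannot use standard attention at these context lengths.

```python
def attn_scores_bytes(seq_len: int, n_heads: int = 32, bytes_per_elem: int = 2) -> int:
    """Memory needed to materialize the full n x n attention score
    matrix for one request (assumed: 32 heads, fp16 elements)."""
    return n_heads * seq_len ** 2 * bytes_per_elem

n = 8192
per_request = attn_scores_bytes(n)      # 4 GiB for a single 8K request
three_concurrent = 3 * per_request      # ~12 GiB for three concurrent requests

print(per_request / 2**30)              # 4.0
print(three_concurrent / 2**30)         # 12.0
```

Flash Attention sidesteps this by computing attention in tiles and never materializing the $n \times n$ matrix, reducing the activation memory to $O(n)$ -- which is what makes co-locating a long-context router beside vLLM feasible at all.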