TARo introduces a learnable token-level router that steers frozen LLMs toward structured reasoning at test-time without retraining.
arXiv · March 20, 2026 · 2603.18411
The Takeaway
TARo extends test-time alignment from simple preference optimization to complex mathematical and clinical reasoning. Its ability to generalize from small to large backbones without retraining makes it a powerful, lightweight tool for enhancing reasoning in frozen production models.
From the abstract
Large language models (LLMs) exhibit strong reasoning capabilities but typically require expensive post-training to reach high performance. Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning. To bridge this gap, we propose Token-level Adaptive Routing (TARo), which steers frozen LLMs toward structured reasoning entirely at inference time. Specifically, we first train reward models on step-wise mathemat…
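The excerpt cuts off before the method details, but the general idea behind this family of test-time alignment approaches, re-weighting a frozen model's next-token distribution with a small learned scorer at decode time, can be sketched in a toy form. Everything below is illustrative: the token vocabulary, the router bonuses, and the combination rule are assumptions for the sketch, not the authors' actual implementation.

```python
import math

# Toy stand-in for a frozen LM: fixed next-token logits for one decoding step.
# In the real setting these come from the backbone model, which is never updated.
frozen_lm_logits = {"answer": 2.0, "therefore": 1.5, "lol": 1.8}

# Toy stand-in for a learned token-level router: a per-token bonus that favors
# structured-reasoning tokens. These values are made up for illustration.
router_bonus = {"answer": 0.0, "therefore": 1.0, "lol": -1.0}

def route_next_token(logits, bonus, beta=1.0):
    """Combine frozen-LM logits with router scores; return (token, probs)."""
    combined = {t: v + beta * bonus.get(t, 0.0) for t, v in logits.items()}
    z = sum(math.exp(v) for v in combined.values())
    probs = {t: math.exp(v) / z for t, v in combined.items()}
    return max(probs, key=probs.get), probs

token, probs = route_next_token(frozen_lm_logits, router_bonus)
print(token)  # → therefore
```

Without the router, the frozen model's top token here would be "answer" (logit 2.0); the router's bonus shifts the combined distribution toward the reasoning-style token "therefore" (2.5 after re-weighting) while leaving the backbone's weights untouched, which is the essence of steering at inference time.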