Reveals that 'reasoning' gains in fine-tuned LLMs may be artifacts of task familiarity rather than improved capability.
arXiv · March 16, 2026 · 2603.12875
Why it matters
Using test-time RL to align base models to benchmark formats, the authors find that the performance gap between base models and their SFT/RLVR-tuned counterparts largely vanishes. This challenges the widely held belief that reinforcement learning with verifiable rewards (RLVR) builds substantial new reasoning capability, suggesting instead that it mostly aligns models to specific task structures.
From the abstract
Direct evaluation of LLMs on benchmarks can be misleading because comparatively strong performance may reflect task familiarity rather than capability. The train-before-test approach controls for task familiarity by giving each model task-relevant training before evaluation, originally through supervised finetuning. However, suitable training data is often hard to come by, and evaluation results vary with the data chosen. In this paper, we propose a two-stage test-time reinforcement learning (RL) approach that aligns models to the benchmark's task format without requiring curated training data.
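To make the core claim concrete, here is a minimal, self-contained sketch of a format-alignment reward loop. It is not the authors' two-stage method: the templates, `format_reward` function, and toy categorical policy (standing in for a real LLM) are all hypothetical illustrations. It shows how an RL loop whose reward checks only answer *format* can lift benchmark-style scores without teaching the model anything about the task itself.

```python
# Illustrative sketch only (not the paper's code); all names are hypothetical.
# A tiny REINFORCE loop rewards a toy "policy" purely for emitting the
# benchmark's answer format -- alignment, not new capability.
import math
import random

random.seed(0)

TEMPLATES = [
    "The answer is {x}.",  # benchmark-aligned surface format
    "{x}",                 # bare answer
    "I think it's {x}",    # chatty format
]

def format_reward(completion: str) -> float:
    """1.0 iff the completion matches the benchmark's surface format.
    The reward checks format only, never answer correctness, so it
    cannot inject new task knowledge."""
    return 1.0 if completion.startswith("The answer is ") and completion.endswith(".") else 0.0

class ToyPolicy:
    """Stand-in for an LLM: a categorical distribution over templates."""
    def __init__(self, n: int):
        self.theta = [0.0] * n  # one logit per template

    def probs(self) -> list[float]:
        z = [math.exp(t) for t in self.theta]
        s = sum(z)
        return [p / s for p in z]

    def sample(self) -> int:
        r, acc = random.random(), 0.0
        for i, p in enumerate(self.probs()):
            acc += p
            if r <= acc:
                return i
        return len(self.theta) - 1

def test_time_rl(policy: ToyPolicy, steps: int = 500, lr: float = 0.5) -> None:
    """REINFORCE on the format reward: the policy learns the benchmark's
    format at test time without ever seeing a labeled answer."""
    for _ in range(steps):
        i = policy.sample()
        reward = format_reward(TEMPLATES[i].format(x="?"))
        # gradient of log pi(i) w.r.t. theta[j] is (1[j == i] - p_j)
        p = policy.probs()
        for j in range(len(policy.theta)):
            policy.theta[j] += lr * reward * ((1.0 if j == i else 0.0) - p[j])

policy = ToyPolicy(len(TEMPLATES))
test_time_rl(policy)
print({t: round(p, 3) for t, p in zip(TEMPLATES, policy.probs())})
# Probability mass concentrates on "The answer is {x}."
```

Running the sketch prints a distribution concentrated on the benchmark-aligned template: the policy's "score" on format-matching rises even though it learned nothing about answering, which is the sense in which test-time RL aligns rather than teaches.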