Reveals that 'reasoning' gains in fine-tuned LLMs may be artifacts of task familiarity rather than improved capability.
arXiv · March 16, 2026 · 2603.12875
Why it matters
Using test-time RL to align base models to benchmark formats, the authors find that the performance gap between base models and their SFT/RLVR-tuned counterparts largely vanishes. This challenges the widely held belief that reinforcement learning with verifiable rewards (RLVR) builds substantial new reasoning capability, suggesting instead that it mostly aligns models to specific task structures.
From the abstract
Direct evaluation of LLMs on benchmarks can be misleading because comparatively strong performance may reflect task familiarity rather than capability. The train-before-test approach controls for task familiarity by giving each model task-relevant training before evaluation, originally through supervised finetuning. However, suitable training data is often hard to come by, and evaluation results vary with the data chosen. In this paper, we propose a two-stage test-time reinforcement learning (RL) approach that aligns models to the benchmark's task format without requiring curated training data.
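To make the core claim concrete, here is a minimal, self-contained sketch of a format-alignment reward loop. It is not the authors' two-stage method: the templates, `format_reward` function, and toy categorical policy (standing in for a real LLM) are all hypothetical illustrations. It shows how an RL loop whose reward checks only answer *format* can lift benchmark-style scores without teaching the model anything about the task itself.

```python
# Illustrative sketch only (not the paper's code); all names are hypothetical.
# A tiny REINFORCE loop rewards a toy "policy" purely for emitting the
# benchmark's answer format -- alignment, not new capability.
import math
import random

random.seed(0)

TEMPLATES = [
    "The answer is {x}.",  # benchmark-aligned surface format
    "{x}",                 # bare answer
    "I think it's {x}",    # chatty format
]

def format_reward(completion: str) -> float:
    """1.0 iff the completion matches the benchmark's surface format.
    The reward checks format only, never answer correctness, so it
    cannot inject new task knowledge."""
    return 1.0 if completion.startswith("The answer is ") and completion.endswith(".") else 0.0

class ToyPolicy:
    """Stand-in for an LLM: a categorical distribution over templates."""
    def __init__(self, n: int):
        self.theta = [0.0] * n  # one logit per template

    def probs(self) -> list[float]:
        z = [math.exp(t) for t in self.theta]
        s = sum(z)
        return [p / s for p in z]

    def sample(self) -> int:
        r, acc = random.random(), 0.0
        for i, p in enumerate(self.probs()):
            acc += p
            if r <= acc:
                return i
        return len(self.theta) - 1

def test_time_rl(policy: ToyPolicy, steps: int = 500, lr: float = 0.5) -> None:
    """REINFORCE on the format reward: the policy learns the benchmark's
    format at test time without ever seeing a labeled answer."""
    for _ in range(steps):
        i = policy.sample()
        reward = format_reward(TEMPLATES[i].format(x="?"))
        # gradient of log pi(i) w.r.t. theta[j] is (1[j == i] - p_j)
        p = policy.probs()
        for j in range(len(policy.theta)):
            policy.theta[j] += lr * reward * ((1.0 if j == i else 0.0) - p[j])

policy = ToyPolicy(len(TEMPLATES))
test_time_rl(policy)
print({t: round(p, 3) for t, p in zip(TEMPLATES, policy.probs())})
# Probability mass concentrates on "The answer is {x}."
```

Running the sketch prints a distribution concentrated on the benchmark-aligned template: the policy's "score" on format-matching rises even though it learned nothing about answering, which is the sense in which test-time RL aligns rather than teaches.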