Reveals that reinforcement learning from verifiable rewards (RLVR) fails to improve general QA because models learn reward 'shortcuts', and proposes START to fix it.
March 24, 2026
Original Paper
RLVR Training of LLMs Does Not Improve Thinking Ability for General QA: Evaluation Method and a Simple Solution
arXiv · 2603.20799
The Takeaway
The paper identifies a critical limitation of 'O1-style' reasoning RL: models learn to game the verifiable reward without developing the high-quality thinking needed for non-verifiable tasks. The proposed START method decouples thinking training from response training, offering a blueprint for improving reasoning in general-purpose LLMs.
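As a rough illustration of what "decoupling thinking and response training" could look like, here is a minimal Python sketch that splits a generation into its thinking span and its response span and scores each with a separate signal. The `<think>` tags, the scorer callables, and the mixing weight `alpha` are illustrative assumptions, not START's actual objective.

```python
"""Sketch: separate training signals for thinking vs. response spans.

Assumptions (not from the paper): the thinking span is delimited by
<think>...</think> tags, the thinking is scored by some quality judge,
and the response is scored by a verifier or preference model.
"""

from dataclasses import dataclass

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

@dataclass
class Segments:
    thinking: str
    response: str

def split_generation(text: str) -> Segments:
    """Split a generation into its thinking span and final response."""
    start = text.index(THINK_OPEN) + len(THINK_OPEN)
    end = text.index(THINK_CLOSE)
    return Segments(thinking=text[start:end].strip(),
                    response=text[end + len(THINK_CLOSE):].strip())

def decoupled_reward(seg: Segments, think_score, answer_score,
                     alpha: float = 0.5) -> float:
    """Combine separate signals for the two spans instead of assigning one
    scalar reward to the whole sequence, so a correct final answer alone
    cannot reward a shortcut (low-quality) thinking trace."""
    return alpha * think_score(seg.thinking) + (1 - alpha) * answer_score(seg.response)

# Toy usage with stand-in scorers (placeholders for a judge and a verifier).
gen = "<think>2 + 2 is 4 by basic addition.</think>The answer is 4."
seg = split_generation(gen)
reward = decoupled_reward(seg,
                          think_score=lambda t: float(len(t) > 0),  # placeholder judge
                          answer_score=lambda a: float("4" in a))   # placeholder verifier
print(seg, reward)
```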
From the abstract
Reinforcement learning from verifiable rewards (RLVR) stimulates the thinking processes of large language models (LLMs), substantially enhancing their reasoning abilities on verifiable tasks. It is often assumed that similar gains should transfer to general question answering (GQA), but this assumption has not been thoroughly validated. To assess whether RLVR automatically improves LLM performance on GQA, we propose a Cross-Generation evaluation framework that measures the quality of intermediate thinking […]
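On one plausible reading of "Cross-Generation", the framework isolates thinking quality by letting a fixed answerer model complete the final answer from another model's thinking trace, so that answer quality reflects the thinking rather than the answerer. The sketch below implements that reading; the prompt formats, the `Generate` stand-in for an LLM call, and the judge are all assumptions, not the paper's exact protocol.

```python
"""Sketch: scoring a model's intermediate thinking via a fixed answerer.

Assumption: a model's thinking is good if a *different*, fixed model can
turn it into a correct answer. `Generate` is a hypothetical stand-in for
any prompt -> completion LLM call.
"""

from typing import Callable

Generate = Callable[[str], str]  # prompt -> completion

def extract_thinking(thinker: Generate, question: str) -> str:
    """Elicit only the thinking trace from the model under evaluation."""
    return thinker(f"Question: {question}\nThink step by step:\n")

def cross_generation_answer(answerer: Generate, question: str, thinking: str) -> str:
    """A fixed answerer produces the final answer from someone else's thinking."""
    return answerer(f"Question: {question}\nReasoning: {thinking}\nFinal answer:")

def thinking_quality(thinker: Generate, answerer: Generate,
                     qa_pairs, judge: Callable[[str, str], float]) -> float:
    """Average judged answer quality over a QA set; higher means the
    thinker's intermediate reasoning was more useful."""
    scores = []
    for question, reference in qa_pairs:
        thinking = extract_thinking(thinker, question)
        answer = cross_generation_answer(answerer, question, thinking)
        scores.append(judge(answer, reference))
    return sum(scores) / len(scores)

# Toy usage with stub models and an exact-match judge.
stub_thinker: Generate = lambda p: "France's capital is Paris."
stub_answerer: Generate = lambda p: "Paris"
judge = lambda ans, ref: float(ref.lower() in ans.lower())
print(thinking_quality(stub_thinker, stub_answerer,
                       [("What is the capital of France?", "Paris")], judge))
```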