AI & ML New Capability

SCRL introduces the first negative supervision mechanism for Test-Time Reinforcement Learning, preventing LLMs from reinforcing 'consensus lies'.

March 23, 2026

Original Paper

What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time

Dong Yan, Jian Liang, Yanbo Wang, Shuo Lu, Ran He, Tieniu Tan

arXiv · 2603.19880

The Takeaway

Current test-time scaling methods rely on majority voting, which fails when models are collectively wrong. SCRL adds entropy-gated negative labeling to prune incorrect trajectories, making reasoning models significantly more robust during inference-time compute scaling.
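The entropy gate described above can be sketched in a few lines. This is a minimal illustration of one plausible reading, assuming a fixed entropy threshold and simple ±1 rewards; the function names, threshold value, and sign conventions are assumptions for illustration, not details from the paper.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy of the sampled answer distribution (in nats)."""
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(answers).values())

def entropy_gated_labels(answers, gate=0.5):
    """Illustrative entropy-gated labeling (hypothetical threshold `gate`).

    Low entropy  -> concentrated answers: trust the consensus as a
                    positive pseudo-label (standard TTRL behavior).
    High entropy -> dispersed answers: weak consensus, so assign a
                    negative label to prune likely-wrong trajectories
                    instead of reinforcing them.
    """
    consensus, _ = Counter(answers).most_common(1)[0]
    if answer_entropy(answers) <= gate:
        return [1.0 if a == consensus else 0.0 for a in answers]
    return [-1.0 if a == consensus else 0.0 for a in answers]
```

With a unanimous sample the gate stays closed and the consensus is rewarded; with four distinct answers the entropy exceeds the threshold and the (unreliable) consensus trajectory is penalized.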

From the abstract

Test-Time Reinforcement Learning (TTRL) enables Large Language Models (LLMs) to enhance reasoning capabilities on unlabeled test streams by deriving pseudo-rewards from majority voting consensus. However, existing TTRL methods rely exclusively on positive pseudo-labeling strategies. Such reliance becomes vulnerable under challenging scenarios where answer distributions are highly dispersed, resulting in weak consensus that inadvertently reinforces incorrect trajectories as supervision signals.
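The majority-voting pseudo-reward the abstract refers to can be sketched as follows. This is a generic TTRL-style baseline, not the paper's implementation; the function name is illustrative.

```python
from collections import Counter

def majority_vote_reward(answers):
    """TTRL-style pseudo-reward (illustrative): take the majority-vote
    answer over sampled rollouts as the pseudo-label, then reward each
    rollout 1.0 if it matches the consensus and 0.0 otherwise."""
    consensus, _ = Counter(answers).most_common(1)[0]
    return consensus, [1.0 if a == consensus else 0.0 for a in answers]

# A dispersed sample shows the failure mode the abstract warns about:
# a wrong answer can still win the vote and get reinforced.
consensus, rewards = majority_vote_reward(["42", "42", "17", "42"])
```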