A self-supervised RLVR method that escapes the 'spurious majority' trap by using a temporary unlearning process for exploration.
arXiv · March 18, 2026 · 2603.16223
The Takeaway
Unsupervised reinforcement learning often converges on popular but incorrect answers (majority vote bias). By using an 'explorer' model with temporary unlearning to generate diverse signals, this method allows for scalable reasoning improvements in LLMs without requiring labeled data or external supervision.
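The majority-vote trap the paper targets can be illustrated with a minimal sketch of TTRL-style label-free rewarding, where the most frequent sampled answer becomes the pseudo-label. The function name and data here are illustrative, not the paper's DCRL algorithm:

```python
from collections import Counter

def majority_vote_reward(sampled_answers):
    """Label-free reward in the style of TTRL: the most frequent sampled
    answer is taken as the pseudo-label, and each sample is rewarded for
    agreeing with it. (Illustrative sketch, not the paper's DCRL method.)"""
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in sampled_answers]
    return pseudo_label, rewards

# A spurious but popular answer dominates: here the wrong answer "12"
# collects all the reward, so training reinforces the dominant mode
# and the minority (possibly correct) answer "7" is suppressed.
samples = ["12", "12", "12", "7", "12", "7"]
label, rewards = majority_vote_reward(samples)
```

Once such a dominant mode forms, the reward signal only strengthens it; this is the feedback loop the explorer's temporary unlearning is meant to break by re-diversifying the sampled answers.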
From the abstract
Current label-free RLVR approaches for large language models (LLMs), such as TTRL and Self-reward, have demonstrated effectiveness in improving the performance of LLMs on complex reasoning tasks. However, these methods rely heavily on accurate pseudo-label estimation and converge on spurious yet popular answers, thereby becoming trapped in a dominant mode and limiting further improvement. To address this, we propose Dual Consensus Reinforcement Learning (DCRL), a novel self-supervised training method…