A self-supervised RLVR method that escapes the 'spurious majority' trap by using a temporary unlearning process for exploration.
arXiv · March 18, 2026 · 2603.16223
The Takeaway
Unsupervised reinforcement learning often converges on popular but incorrect answers (majority vote bias). By using an 'explorer' model with temporary unlearning to generate diverse signals, this method allows for scalable reasoning improvements in LLMs without requiring labeled data or external supervision.
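The majority-vote trap the paper targets can be illustrated with a minimal sketch of TTRL-style label-free rewarding, where the most frequent sampled answer becomes the pseudo-label. The function name and data here are illustrative, not the paper's DCRL algorithm:

```python
from collections import Counter

def majority_vote_reward(sampled_answers):
    """Label-free reward in the style of TTRL: the most frequent sampled
    answer is taken as the pseudo-label, and each sample is rewarded for
    agreeing with it. (Illustrative sketch, not the paper's DCRL method.)"""
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in sampled_answers]
    return pseudo_label, rewards

# A spurious but popular answer dominates: here the wrong answer "12"
# collects all the reward, so training reinforces the dominant mode
# and the minority (possibly correct) answer "7" is suppressed.
samples = ["12", "12", "12", "7", "12", "7"]
label, rewards = majority_vote_reward(samples)
```

Once such a dominant mode forms, the reward signal only strengthens it; this is the feedback loop the explorer's temporary unlearning is meant to break by re-diversifying the sampled answers.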
From the abstract
Current label-free RLVR approaches for large language models (LLMs), such as TTRL and Self-reward, have demonstrated effectiveness in improving the performance of LLMs on complex reasoning tasks. However, these methods rely heavily on accurate pseudo-label estimation and converge on spurious yet popular answers, thereby becoming trapped in a dominant mode and limiting further improvement. To address this, we propose Dual Consensus Reinforcement Learning (DCRL), a novel self-supervised training method…