Provides a geometric 'manifold envelopment' framework to explain why unsupervised RL for mathematical reasoning often collapses and how to stabilize it.
arXiv · March 18, 2026 · 2603.16578
The Takeaway
Practitioners trying to scale reasoning without expensive ground-truth labels often run into policy collapse. This paper identifies the specific foundational logical priors required for successful unsupervised training and offers a diagnostic tool for predicting training stability.
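The paper's own diagnostic is geometric ("manifold envelopment"); as a loose, hypothetical illustration of what a label-free stability diagnostic can look like in practice, the sketch below monitors the entropy of a policy's sampled-answer distribution per prompt. A distribution that concentrates on one answer over training is a cheap early warning of collapse. Everything here is an assumption for illustration, not the paper's method:

```python
import math
from collections import Counter

def answer_entropy(sampled_answers: list[str]) -> float:
    """Shannon entropy (in nats) of the empirical answer distribution.

    A steadily falling entropy across training steps is a warning sign
    of policy collapse: the policy is concentrating on a single answer
    regardless of whether that answer is correct.
    """
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    # Equivalent to -sum(p * log(p)); written with log(n/c) to avoid -0.0.
    return sum((c / n) * math.log(n / c) for c in counts.values())

# Healthy exploration: varied answers -> higher entropy.
print(answer_entropy(["42", "41", "7", "42"]))   # ~1.04
# Near-collapse: one answer dominates -> entropy drops to 0.
print(answer_entropy(["42", "42", "42", "42"]))  # 0.0
```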
From the abstract
Although outcome-based reinforcement learning (RL) significantly advances the mathematical reasoning capabilities of Large Language Models (LLMs), its reliance on computationally expensive ground-truth annotations imposes a severe scalability bottleneck. Unsupervised RL guided by intrinsic rewards offers a scalable alternative, yet it suffers from opaque training dynamics and catastrophic instability, such as policy collapse and reward hacking. In this paper, we first design and evaluate a suite …
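The intrinsic rewards the abstract refers to are label-free training signals; one widely used instance is self-consistency, which rewards each sampled answer for agreeing with the majority vote over the policy's own samples. A minimal sketch of that idea follows (this is a common baseline, not the paper's specific reward suite; all names are illustrative):

```python
from collections import Counter

def self_consistency_reward(sampled_answers: list[str]) -> list[float]:
    """Reward each sampled answer by its agreement with the group's majority.

    Answers are assumed to be already normalized, e.g. canonicalized
    final numeric answers extracted from chains of thought.
    """
    counts = Counter(sampled_answers)
    majority_answer = counts.most_common(1)[0][0]
    # Reward 1.0 for matching the majority vote, 0.0 otherwise. Note the
    # failure mode: a degenerate policy that always emits the same (wrong)
    # answer maximizes this reward -- one route to the policy collapse and
    # reward hacking the abstract describes.
    return [1.0 if a == majority_answer else 0.0 for a in sampled_answers]

# Example: 8 samples for one prompt; "42" wins the vote.
rewards = self_consistency_reward(["42", "42", "41", "42", "7", "42", "42", "41"])
print(rewards)  # [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
```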