Provides a geometric 'manifold envelopment' framework to explain why unsupervised RL for mathematical reasoning often collapses and how to stabilize it.
arXiv · March 18, 2026 · 2603.16578
The Takeaway
Practitioners trying to scale reasoning without expensive ground-truth labels often run into policy collapse. This paper identifies the specific foundational logical priors required for successful unsupervised training and offers a diagnostic tool for predicting training stability.
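The paper's own diagnostic is geometric ("manifold envelopment"); as a loose, hypothetical illustration of what a label-free stability diagnostic can look like in practice, the sketch below monitors the entropy of a policy's sampled-answer distribution per prompt. A distribution that concentrates on one answer over training is a cheap early warning of collapse. Everything here is an assumption for illustration, not the paper's method:

```python
import math
from collections import Counter

def answer_entropy(sampled_answers: list[str]) -> float:
    """Shannon entropy (in nats) of the empirical answer distribution.

    A steadily falling entropy across training steps is a warning sign
    of policy collapse: the policy is concentrating on a single answer
    regardless of whether that answer is correct.
    """
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    # Equivalent to -sum(p * log(p)); written with log(n/c) to avoid -0.0.
    return sum((c / n) * math.log(n / c) for c in counts.values())

# Healthy exploration: varied answers -> higher entropy.
print(answer_entropy(["42", "41", "7", "42"]))   # ~1.04
# Near-collapse: one answer dominates -> entropy drops to 0.
print(answer_entropy(["42", "42", "42", "42"]))  # 0.0
```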
From the abstract
Although outcome-based reinforcement learning (RL) significantly advances the mathematical reasoning capabilities of Large Language Models (LLMs), its reliance on computationally expensive ground-truth annotations imposes a severe scalability bottleneck. Unsupervised RL guided by intrinsic rewards offers a scalable alternative, yet it suffers from opaque training dynamics and catastrophic instability, such as policy collapse and reward hacking. In this paper, we first design and evaluate a suite …
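The intrinsic rewards the abstract refers to are label-free training signals; one widely used instance is self-consistency, which rewards each sampled answer for agreeing with the majority vote over the policy's own samples. A minimal sketch of that idea follows (this is a common baseline, not the paper's specific reward suite; all names are illustrative):

```python
from collections import Counter

def self_consistency_reward(sampled_answers: list[str]) -> list[float]:
    """Reward each sampled answer by its agreement with the group's majority.

    Answers are assumed to be already normalized, e.g. canonicalized
    final numeric answers extracted from chains of thought.
    """
    counts = Counter(sampled_answers)
    majority_answer = counts.most_common(1)[0][0]
    # Reward 1.0 for matching the majority vote, 0.0 otherwise. Note the
    # failure mode: a degenerate policy that always emits the same (wrong)
    # answer maximizes this reward -- one route to the policy collapse and
    # reward hacking the abstract describes.
    return [1.0 if a == majority_answer else 0.0 for a in sampled_answers]

# Example: 8 samples for one prompt; "42" wins the vote.
rewards = self_consistency_reward(["42", "42", "41", "42", "7", "42", "42", "41"])
print(rewards)  # [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
```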