AI & ML Paradigm Shift

HAPO resolves the advantage collapse problem in sparse-reward RL for reasoning models using a Thompson-sampled hindsight mechanism.

arXiv · March 13, 2026 · 2603.11321

Yuning Wu, Ke Wang, Devin Chen, Kai Wei

Why it matters

This is highly relevant for post-training 'reasoning' LLMs. By anchoring to teacher demonstrations only when rollouts fail, and annealing that signal over training, HAPO lets models surpass teacher performance without the instability of pure on-policy RL.
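The failure-gated anchoring idea can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the function `anchored_group`, the reward convention (0 = fail, 1 = success), and the linear annealing schedule are all assumptions made for the example.

```python
def anchored_group(rollout_rewards, step, anneal_steps=1000):
    """Sketch of failure-gated anchoring (hypothetical, not HAPO's exact rule).

    If any on-policy rollout succeeded, train purely on-policy.
    If every rollout failed, splice a teacher demonstration (reward 1.0)
    into the group, with a loss weight annealed linearly toward zero so
    the teacher signal disappears as training progresses.
    """
    if max(rollout_rewards) > 0:
        return rollout_rewards, 0.0  # at least one success: stay on-policy
    anchor_weight = max(0.0, 1.0 - step / anneal_steps)
    if anchor_weight > 0.0:
        # All-failure group: the teacher demo restores reward variance,
        # so group-relative advantages are no longer all zero.
        return rollout_rewards + [1.0], anchor_weight
    return rollout_rewards, 0.0  # anchoring fully annealed away

# Early in training, an all-failure group gets the teacher anchor:
print(anchored_group([0, 0, 0, 0], step=0))     # -> ([0, 0, 0, 0, 1.0], 1.0)
# A group with any success never sees the teacher:
print(anchored_group([1, 0, 0, 0], step=0))     # -> ([1, 0, 0, 0], 0.0)
# After annealing, even all-failure groups stay on-policy:
print(anchored_group([0, 0, 0, 0], step=1000))  # -> ([0, 0, 0, 0], 0.0)
```

The point of the gate is that the teacher's off-policy distribution only enters the objective when the policy has nothing to learn from its own samples, and the annealed weight removes that bias once the policy can succeed on its own.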

From the abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for post-training reasoning models. However, group-based methods such as Group Relative Policy Optimization (GRPO) face a critical dilemma in sparse-reward settings: pure Reinforcement Learning (RL) suffers from advantage collapse and high-variance gradient estimation, while mixed-policy optimization introduces persistent distributional bias. To resolve this dilemma, we introduce Hindsight-Anchored Policy Optimization (HAPO) …
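The "advantage collapse" the abstract refers to is easy to see in the standard group-relative advantage computation. Below is a generic sketch of that computation (the function name and group sizes are illustrative, not from the paper): when every rollout in a group receives the same sparse reward, the group mean equals every reward and all advantages are exactly zero, so the group contributes no gradient signal.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: (reward - group mean) / group std.

    Generic illustration of the GRPO-style estimator. When the group's
    reward variance is zero (e.g. every rollout fails under a sparse
    0/1 reward), all advantages collapse to zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # advantage collapse: no learning signal
    return [(r - mean) / std for r in rewards]

# Sparse-reward failure case: no rollout succeeds, advantages all vanish.
print(grpo_advantages([0, 0, 0, 0]))  # -> [0.0, 0.0, 0.0, 0.0]
# One success is enough to produce a non-degenerate signal:
print(grpo_advantages([1, 0, 0, 0]))
```

All-success groups collapse the same way, but on hard reasoning tasks the all-failure case dominates, which is the dilemma the paper targets.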