Proposes Modulated Hazard-aware Policy Optimization (MHPO) to solve the instability and mode collapse common in GRPO-based reinforcement learning.
March 19, 2026
Original Paper
MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning
arXiv · 2603.16929
The Takeaway
Instead of the hard clipping used in current LLM alignment pipelines, MHPO regulates policy shifts with a differentiable 'Log-Fidelity Modulator' combined with hazard functions borrowed from survival analysis. This provides a mathematically stable alternative for large-scale RL training, preventing the abrupt policy erosion that currently plagues reasoning-heavy models.
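To make the idea concrete, here is a minimal sketch of what a smooth, log-space modulator could look like. The paper's exact formula is not reproduced here; the function name, the Gaussian-in-log-space form, and the sharpness parameter `kappa` are all illustrative assumptions, not MHPO's actual definition.

```python
import math

def log_fidelity_modulator(ratio: float, kappa: float = 4.0) -> float:
    """Hypothetical smooth modulator (not the paper's exact form):
    attenuates the importance ratio as its log-deviation from 1 grows,
    instead of hard-clipping it. `kappa` is an assumed sharpness knob."""
    log_dev = math.log(ratio)
    # Gaussian-style attenuation in log-space: equals 1.0 at ratio == 1,
    # decays smoothly as the ratio drifts, with no kink anywhere.
    return math.exp(-kappa * log_dev ** 2)

def modulated_objective(ratio: float, advantage: float,
                        kappa: float = 4.0) -> float:
    # Everywhere differentiable in `ratio`, unlike
    # clip(ratio, 1 - eps, 1 + eps) * advantage.
    return log_fidelity_modulator(ratio, kappa) * ratio * advantage
```

At `ratio == 1` the modulator is exactly 1, so the objective reduces to the unregularized `ratio * advantage`; large deviations are suppressed continuously rather than cut off at a boundary.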
From the abstract
Regulating the importance ratio is critical for the training stability of Group Relative Policy Optimization (GRPO) based frameworks. However, prevailing ratio control methods, such as hard clipping, suffer from non-differentiable boundaries and vanishing gradient regions, failing to maintain gradient fidelity. Furthermore, these methods lack a hazard-aware mechanism to adaptively suppress extreme deviations, leaving the optimization process vulnerable to abrupt policy shifts. To address these challenges…
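The abstract's complaint about vanishing gradients can be seen in a few lines: for a generic PPO/GRPO-style clipped surrogate (not the paper's exact loss), the gradient with respect to the importance ratio is exactly zero once the ratio leaves the clip band, as a finite-difference check shows.

```python
def clipped_objective(ratio: float, advantage: float,
                      eps: float = 0.2) -> float:
    # Generic hard-clip surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A).
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

def grad_wrt_ratio(f, ratio: float, advantage: float,
                   h: float = 1e-6) -> float:
    # Central finite difference, just to expose the dead zone.
    return (f(ratio + h, advantage) - f(ratio - h, advantage)) / (2 * h)
```

Inside the band (e.g. `ratio = 1.0`) the gradient is the advantage itself; outside it (e.g. `ratio = 1.5` with `eps = 0.2` and positive advantage) the gradient is zero, so those samples stop contributing to the update, which is the dead region a differentiable modulator is meant to remove.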