Proposes Modulated Hazard-aware Policy Optimization (MHPO) to solve the instability and mode collapse common in GRPO-based reinforcement learning.
March 19, 2026
Original Paper
MHPO: Modulated Hazard-aware Policy Optimization for Stable Reinforcement Learning
arXiv · 2603.16929
The Takeaway
Instead of the hard clipping used in current LLM alignment pipelines, MHPO regulates policy shifts with a differentiable 'Log-Fidelity Modulator' combined with hazard functions borrowed from survival analysis. This provides a mathematically stable alternative for large-scale RL training, preventing the abrupt policy erosion that currently plagues reasoning-heavy models.
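To make the idea concrete, here is a minimal sketch of what a smooth, log-space modulator could look like. The paper's exact formula is not reproduced here; the function name, the Gaussian-in-log-space form, and the sharpness parameter `kappa` are all illustrative assumptions, not MHPO's actual definition.

```python
import math

def log_fidelity_modulator(ratio: float, kappa: float = 4.0) -> float:
    """Hypothetical smooth modulator (not the paper's exact form):
    attenuates the importance ratio as its log-deviation from 1 grows,
    instead of hard-clipping it. `kappa` is an assumed sharpness knob."""
    log_dev = math.log(ratio)
    # Gaussian-style attenuation in log-space: equals 1.0 at ratio == 1,
    # decays smoothly as the ratio drifts, with no kink anywhere.
    return math.exp(-kappa * log_dev ** 2)

def modulated_objective(ratio: float, advantage: float,
                        kappa: float = 4.0) -> float:
    # Everywhere differentiable in `ratio`, unlike
    # clip(ratio, 1 - eps, 1 + eps) * advantage.
    return log_fidelity_modulator(ratio, kappa) * ratio * advantage
```

At `ratio == 1` the modulator is exactly 1, so the objective reduces to the unregularized `ratio * advantage`; large deviations are suppressed continuously rather than cut off at a boundary.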
From the abstract
Regulating the importance ratio is critical for the training stability of Group Relative Policy Optimization (GRPO) based frameworks. However, prevailing ratio control methods, such as hard clipping, suffer from non-differentiable boundaries and vanishing gradient regions, failing to maintain gradient fidelity. Furthermore, these methods lack a hazard-aware mechanism to adaptively suppress extreme deviations, leaving the optimization process vulnerable to abrupt policy shifts. To address these challenges…
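The abstract's complaint about vanishing gradients can be seen in a few lines: for a generic PPO/GRPO-style clipped surrogate (not the paper's exact loss), the gradient with respect to the importance ratio is exactly zero once the ratio leaves the clip band, as a finite-difference check shows.

```python
def clipped_objective(ratio: float, advantage: float,
                      eps: float = 0.2) -> float:
    # Generic hard-clip surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A).
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

def grad_wrt_ratio(f, ratio: float, advantage: float,
                   h: float = 1e-6) -> float:
    # Central finite difference, just to expose the dead zone.
    return (f(ratio + h, advantage) - f(ratio - h, advantage)) / (2 * h)
```

Inside the band (e.g. `ratio = 1.0`) the gradient is the advantage itself; outside it (e.g. `ratio = 1.5` with `eps = 0.2` and positive advantage) the gradient is zero, so those samples stop contributing to the update, which is the dead region a differentiable modulator is meant to remove.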