AI & ML · Breaks Assumption

Argues that probability gradients are a better optimization primitive for RL training than the standard log-probability gradients, and proposes a new optimization method (DGPO) to resolve the divergence that afflicts soft clipping.

arXiv · March 17, 2026 · 2603.14389

Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Chaowen Hu, Cong Qin, Zekai Shao, Binbin Zheng, Lu Pan, Ke Zeng

The Takeaway

As practitioners increasingly use RL with Verifiable Rewards (RLVR) to boost LLM reasoning, standard training often becomes unstable because the log-probability gradient blows up as token probabilities approach zero. Swapping the optimization primitive from the log-probability gradient to the probability gradient permits deeper exploration and more stable training for reasoning models.
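
Why the primitive matters is visible in a one-line chain-rule identity (standard calculus, not an equation quoted from the paper): the log-probability gradient is the probability gradient rescaled by $1/\pi_\theta$, and that prefactor is unbounded for low-probability tokens.

$$\nabla_\theta \log \pi_\theta(y_t \mid y_{<t}) \;=\; \frac{1}{\pi_\theta(y_t \mid y_{<t})}\,\nabla_\theta \pi_\theta(y_t \mid y_{<t})$$

As $\pi_\theta \to 0$ the $1/\pi_\theta$ factor diverges, so per-token updates can explode exactly where exploration pushes the model into rare tokens; the raw probability gradient $\nabla_\theta \pi_\theta$ carries no such factor.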

From the abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed a leap in Large Language Model (LLM) reasoning, yet its optimization dynamics remain fragile. Standard algorithms like GRPO enforce stability via "hard clipping", which inadvertently stifles exploration by discarding gradients of tokens outside the trust region. While recent "soft clipping" methods attempt to recover these gradients, they suffer from a critical challenge: relying on the log-probability gradient ($\nabla_\theta \log \pi_\theta$) …
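
For readers who want the mechanics, here is a minimal PyTorch-style sketch of the hard-clipped surrogate used by PPO/GRPO-family methods (the function name and eps value are illustrative; this is the generic clipped objective, not the paper's DGPO). Wherever the min selects the clamped term, the token contributes a constant to the loss, so its gradient is exactly zero: the "discarded gradients" the abstract describes.

    import torch

    def hard_clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
        # Importance ratio r = pi_new / pi_old, computed in log space for stability.
        ratio = torch.exp(logp_new - logp_old)
        unclipped = ratio * advantage
        # torch.clamp returns a constant outside [1 - eps, 1 + eps], so tokens
        # pushed beyond the trust region in the objective's favored direction
        # contribute zero gradient ("hard clipping").
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
        # Pessimistic lower bound, maximized during training.
        return torch.min(unclipped, clipped).mean()

Soft-clipping variants replace the clamp with a smooth function so clipped tokens keep a nonzero gradient; the abstract's point is that doing this on top of the log-probability gradient inherits the $1/\pi_\theta$ blow-up for near-zero-probability tokens, which is the divergence DGPO targets.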