Argues that probability gradients are superior to standard log-probability gradients for RL training, proposing a new optimization method (DGPO) to address the divergence that arises under soft clipping.
arXiv · March 17, 2026 · 2603.14389
The Takeaway
As practitioners increasingly use RL with Verifiable Rewards (RLVR) to boost LLM reasoning, standard training often becomes unstable as token probabilities approach zero. The paper proposes replacing log-probability gradients with probability gradients; this change in the optimization primitive allows for deeper exploration and more stable training for high-reasoning models.
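The instability near zero probability follows from the chain rule: $\nabla_\theta \log \pi_\theta = (1/\pi_\theta)\,\nabla_\theta \pi_\theta$, so a log-probability gradient carries a $1/\pi$ factor that blows up as $\pi \to 0$, while the probability gradient carries no such factor. A minimal numeric sketch of that scale factor (illustrative only, not code from the paper):

```python
def log_prob_grad_scale(pi: float) -> float:
    """Multiplier that turns a probability gradient into a
    log-probability gradient for a token sampled with probability pi:
    grad log(pi) = (1 / pi) * grad pi."""
    return 1.0 / pi

# The 1/pi factor diverges as the token probability shrinks,
# which is where log-probability-gradient training destabilizes.
for pi in (1e-1, 1e-3, 1e-6):
    print(f"pi={pi:g}  log-prob gradient scale={log_prob_grad_scale(pi):g}")
```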
From the abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed a leap in Large Language Model (LLM) reasoning, yet its optimization dynamics remain fragile. Standard algorithms like GRPO enforce stability via "hard clipping", which inadvertently stifles exploration by discarding gradients of tokens outside the trust region. While recent "soft clipping" methods attempt to recover these gradients, they suffer from a critical challenge: relying on the log-probability gradient ($\nabla_\theta \log \pi_\theta$) […]
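The "hard clipping" the abstract describes can be seen in the PPO/GRPO-style clipped surrogate: whenever the clipped branch of $\min\!\big(r A,\ \mathrm{clip}(r, 1-\epsilon, 1+\epsilon)\,A\big)$ is active, the derivative with respect to the importance ratio $r$ is exactly zero, so that token contributes no gradient. A hedged sketch of this per-token derivative (generic clipped-surrogate behavior, not the paper's implementation; `eps` is the usual clip range):

```python
def clipped_surrogate_grad(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Derivative w.r.t. `ratio` of the hard-clipped surrogate
    min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).
    Returns A when the unclipped branch is active, 0.0 when the
    clipped branch is active (the gradient is discarded)."""
    unclipped = ratio * advantage
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps)) * advantage
    if unclipped <= clipped:
        return advantage   # unclipped branch: gradient flows
    return 0.0             # clipped branch: constant in ratio, zero gradient

# A token far outside the trust region with positive advantage gets
# no gradient, even though exploring it might help:
print(clipped_surrogate_grad(ratio=2.5, advantage=1.0))   # clipped -> 0.0
print(clipped_surrogate_grad(ratio=1.0, advantage=1.0))   # in-region -> 1.0
```

This zeroed gradient for out-of-region tokens is precisely what soft-clipping methods try to recover, at the cost of the divergence issue the abstract goes on to describe.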