Policy Improvement Reinforcement Learning (PIRL) shifts the training objective from reward maximization to explicit maximization of policy progress across iterations.
April 2, 2026
Original Paper
Policy Improvement Reinforcement Learning
arXiv · 2604.00860
The Takeaway
By moving from open-loop to closed-loop RL, PIRL prevents the training collapse and drift common in RLVR (Reinforcement Learning with Verifiable Rewards). The result is more stable, self-correcting post-training for reasoning models.
From the abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize policies based on instantaneous group-level or batch-level statistics without ever verifying whether the resulting update actually improved the model. This open-loop design -- updating in isolation at each step, guided only by within-group (batch) reward signals -- means […]
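The open-loop vs. closed-loop distinction the abstract draws can be made concrete with a minimal sketch. This is not PIRL's actual algorithm (the paper's method is not reproduced here); it only illustrates the general idea of verifying an update against a fixed probe set before accepting it. The names `evaluate`, `closed_loop_step`, and `probe_set` are hypothetical:

```python
def evaluate(policy, probe_set):
    """Mean verifiable reward of a policy on a fixed probe set
    (hypothetical stand-in for an RLVR reward check)."""
    return sum(policy(x) for x in probe_set) / len(probe_set)

def closed_loop_step(policy, update_fn, probe_set):
    """One closed-loop iteration: apply a candidate update, but keep it
    only if it measurably improves the policy on the probes.
    An open-loop method would accept the candidate unconditionally."""
    before = evaluate(policy, probe_set)
    candidate = update_fn(policy)
    after = evaluate(candidate, probe_set)
    if after >= before:
        return candidate, after   # verified improvement: accept
    return policy, before         # otherwise reject, preventing drift

# Toy usage: a degenerate "policy" that returns a constant reward.
probes = [1, 2, 3]
base = lambda x: 0.0
good = closed_loop_step(base, lambda p: (lambda x: 1.0), probes)
bad = closed_loop_step(base, lambda p: (lambda x: -1.0), probes)
```

Here `good` keeps the improved candidate, while `bad` rejects the regressing one and retains the original policy, which is the self-correcting behavior the takeaway describes.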