Delightful Policy Gradient uses 'delight' (advantage × surprisal) to fix learning from stale or buggy data in distributed RL.
March 24, 2026
Original Paper
Delightful Distributed Policy Gradient
arXiv · 2603.20521
The Takeaway
It solves the core problem of negative learning from high-surprisal failures in off-policy distributed training. By gating updates based on 'delight,' it outperforms standard importance sampling and is significantly more robust to actor bugs and reward corruption.
From the abstract
Distributed reinforcement learning trains on data from stale, buggy, or mismatched actors, producing actions with high surprisal (negative log-probability) under the learner's policy. The core difficulty is not surprising data per se, but *negative learning from surprising data*. High-surprisal failures can dominate the update direction despite carrying little useful signal, while high-surprisal successes reveal opportunities the current policy would otherwise miss. The *Delightful Policy Gradient*…
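The mechanism described above can be sketched in a few lines. Note this is an illustrative reading, not the paper's actual algorithm: the excerpt only tells us that delight is advantage × surprisal and that updates are gated on it, so the `delight_gate` function and its threshold-at-zero rule are assumptions for demonstration.

```python
import math

def delight_gate(logp_learner, advantage):
    """Hypothetical sketch of delight gating; the paper's exact rule may differ.

    surprisal = -log pi_learner(a|s); delight = advantage * surprisal.
    A high-surprisal failure (advantage < 0) yields strongly negative
    delight and is masked out; a high-surprisal success (advantage > 0)
    yields positive delight and is kept, since it reveals an opportunity
    the current policy would otherwise miss.
    """
    surprisal = -logp_learner      # >= 0, since log-probabilities are <= 0
    delight = advantage * surprisal
    keep = delight >= 0.0          # gate out negative-delight samples
    return keep, delight

# Toy off-policy batch: learner log-probs and empirical advantages.
batch = [
    (math.log(0.9),   1.0),   # on-policy success: kept
    (math.log(0.01),  2.0),   # high-surprisal success: kept
    (math.log(0.01), -2.0),   # high-surprisal failure: gated out
]
kept = [delight_gate(lp, a)[0] for lp, a in batch]
print(kept)  # -> [True, True, False]
```

Under this reading, the gate is what distinguishes delight from plain importance sampling: an importance weight would shrink all high-surprisal samples uniformly, while the sign of the advantage here decides whether a surprising sample is discarded or retained.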