Delightful Policy Gradient uses 'delight' (advantage × surprisal) to fix learning from stale or buggy data in distributed RL.
March 24, 2026
Original Paper
Delightful Distributed Policy Gradient
arXiv · 2603.20521
The Takeaway
It solves the core problem of negative learning from high-surprisal failures in off-policy distributed training. By gating updates based on 'delight,' it outperforms standard importance sampling and is significantly more robust to actor bugs and reward corruption.
From the abstract
Distributed reinforcement learning trains on data from stale, buggy, or mismatched actors, producing actions with high surprisal (negative log-probability) under the learner's policy. The core difficulty is not surprising data per se, but *negative learning from surprising data*. High-surprisal failures can dominate the update direction despite carrying little useful signal, while high-surprisal successes reveal opportunities the current policy would otherwise miss. The *Delightful Policy Gradient*…
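The mechanism described above can be sketched in a few lines. Note this is an illustrative reading, not the paper's actual algorithm: the excerpt only tells us that delight is advantage × surprisal and that updates are gated on it, so the `delight_gate` function and its threshold-at-zero rule are assumptions for demonstration.

```python
import math

def delight_gate(logp_learner, advantage):
    """Hypothetical sketch of delight gating; the paper's exact rule may differ.

    surprisal = -log pi_learner(a|s); delight = advantage * surprisal.
    A high-surprisal failure (advantage < 0) yields strongly negative
    delight and is masked out; a high-surprisal success (advantage > 0)
    yields positive delight and is kept, since it reveals an opportunity
    the current policy would otherwise miss.
    """
    surprisal = -logp_learner      # >= 0, since log-probabilities are <= 0
    delight = advantage * surprisal
    keep = delight >= 0.0          # gate out negative-delight samples
    return keep, delight

# Toy off-policy batch: learner log-probs and empirical advantages.
batch = [
    (math.log(0.9),   1.0),   # on-policy success: kept
    (math.log(0.01),  2.0),   # high-surprisal success: kept
    (math.log(0.01), -2.0),   # high-surprisal failure: gated out
]
kept = [delight_gate(lp, a)[0] for lp, a in batch]
print(kept)  # -> [True, True, False]
```

Under this reading, the gate is what distinguishes delight from plain importance sampling: an importance weight would shrink all high-surprisal samples uniformly, while the sign of the advantage here decides whether a surprising sample is discarded or retained.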