Challenges the standard practice of running PPO for many epochs ('deeper' optimization) by proving that consensus aggregation of 'wider' parallel runs is 8x more sample efficient than additional epochs.
arXiv · March 16, 2026 · 2603.12596
Why it matters
It identifies that additional PPO epochs consume trust-region budget on 'wasteful' Fisher-orthogonal residuals. By instead optimizing parallel replicates and aggregating them in natural parameter space, practitioners can achieve significantly higher stability and performance without extra environment interactions.
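To make "aggregating in natural parameter space" concrete, here is a minimal sketch for Gaussian policy heads: instead of averaging means and variances directly, each replicate's distribution is converted to exponential-family natural parameters, those are averaged, and the result is converted back. The function name and API are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def consensus_gaussian(means, variances):
    """Aggregate K parallel Gaussian policy heads by averaging in natural
    parameter space: eta1 = mu / sigma^2, eta2 = -1 / (2 sigma^2).
    Illustrative sketch only; the paper's aggregation rule may differ."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    # Map each replicate to natural parameters of the Gaussian family.
    eta1 = means / variances
    eta2 = -0.5 / variances
    # Consensus = mean in natural-parameter coordinates (axis 0 = replicates).
    e1, e2 = eta1.mean(axis=0), eta2.mean(axis=0)
    # Map the consensus point back to (mean, variance).
    var = -0.5 / e2
    mu = e1 * var
    return mu, var
```

For identical replicates this reduces to the identity; for disagreeing replicates it weights each mean by its precision, unlike a naive average of means and variances.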
From the abstract
Proximal policy optimization (PPO) approximates the trust region update using multiple epochs of clipped SGD. Each epoch may drift further from the natural gradient direction, creating path-dependent noise. To understand this drift, we use Fisher information geometry to decompose policy updates into signal (the natural gradient projection) and waste (the Fisher-orthogonal residual that consumes trust region budget without first-order surrogate improvement). Empirically, signal saturates but
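The signal/waste split described in the abstract can be sketched as a projection under the Fisher metric: the natural gradient is F⁻¹g, and an update Δθ is projected onto that direction using the Fisher inner product ⟨a, b⟩_F = aᵀFb, with the residual being Fisher-orthogonal. The function below is an assumed small-scale illustration (explicit Fisher matrix), not the paper's implementation.

```python
import numpy as np

def fisher_decompose(delta, grad, fisher):
    """Split an update `delta` into 'signal' (its Fisher-metric projection
    onto the natural-gradient direction F^{-1} grad) and 'waste' (the
    Fisher-orthogonal residual). Illustrative sketch with a dense Fisher."""
    nat = np.linalg.solve(fisher, grad)        # natural gradient F^{-1} g
    inner = lambda a, b: a @ fisher @ b        # Fisher inner product <a,b>_F
    coef = inner(delta, nat) / inner(nat, nat)
    signal = coef * nat                        # component along natural grad
    waste = delta - signal                     # Fisher-orthogonal residual
    return signal, waste
```

By construction `signal + waste == delta` and `waste` has zero Fisher inner product with the natural-gradient direction, matching the abstract's claim that the residual consumes trust-region budget without first-order surrogate improvement.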