This paper proves that reward hacking is a structural equilibrium of optimized AI agents, not a bug, and provides a computable 'distortion index' to predict it.
March 31, 2026
Original Paper
Reward Hacking as Equilibrium under Finite Evaluation
arXiv · 2603.28063
The Takeaway
It mathematically demonstrates that under-investment in unmeasured quality dimensions is inevitable as systems scale. This provides a formal theoretical foundation for AI safety and a vulnerability assessment procedure for practitioners to use before deploying agentic models.
From the abstract
We prove that under five minimal axioms -- multi-dimensional quality, finite evaluation, effective optimization, resource finiteness, and combinatorial interaction -- any optimized AI agent will systematically under-invest effort in quality dimensions not covered by its evaluation system. This result establishes reward hacking as a structural equilibrium, not a correctable bug, and holds regardless of the specific alignment method (RLHF, DPO, Constitutional AI, or others) or evaluation architect