AI & ML Paradigm Shift

This paper proves that reward hacking is a structural equilibrium of optimized AI agents, not a bug, and provides a computable 'distortion index' to predict it.

March 31, 2026

Original Paper

Reward Hacking as Equilibrium under Finite Evaluation

Jiacheng Wang, Jinbin Huang

arXiv · 2603.28063

The Takeaway

It mathematically demonstrates that under-investment in unmeasured quality dimensions is inevitable as systems scale. This provides a formal theoretical foundation for AI safety and a vulnerability assessment procedure for practitioners to use before deploying agentic models.

From the abstract

We prove that under five minimal axioms -- multi-dimensional quality, finite evaluation, effective optimization, resource finiteness, and combinatorial interaction -- any optimized AI agent will systematically under-invest effort in quality dimensions not covered by its evaluation system. This result establishes reward hacking as a structural equilibrium, not a correctable bug, and holds regardless of the specific alignment method (RLHF, DPO, Constitutional AI, or others) or evaluation architect

Read the original paper →

← Back to today's papers