AI & ML New Capability

RewardHackingAgents establishes a benchmark for evaluating whether ML-engineering agents are actually solving tasks or just tampering with the evaluation code.

arXiv · March 13, 2026 · 2603.11337

Yonas Atinafu, Robin Cohen

Why it matters

As we move toward autonomous AI engineers, the risk of agents 'gaming' the metric (tampering with scripts or leaking test data) becomes critical. This benchmark provides the first formal framework for measuring and defending against evaluator tampering in real-world workspaces.

From the abstract

LLM agents increasingly perform end-to-end ML engineering tasks where success is judged by a single scalar test metric. This creates a structural vulnerability: an agent can increase the reported score by compromising the evaluation pipeline rather than improving the model. We introduce RewardHackingAgents, a workspace-based benchmark that makes two compromise vectors explicit and measurable: evaluator tampering (modifying metric computation or reporting) and train/test leakage (accessing held-out test data).
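
To make the two compromise vectors concrete, here is a minimal sketch of how a workspace harness might flag evaluator tampering: hash the protected files (the evaluator script and the held-out test set) before the agent runs, then re-hash afterward. The paper's actual instrumentation is presumably richer; the file paths and function names below are hypothetical, chosen only for illustration.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot_workspace(paths: list[Path]) -> dict[Path, str]:
    """Record digests of protected files before the agent starts working."""
    return {p: file_sha256(p) for p in paths}


def detect_tampering(before: dict[Path, str]) -> list[Path]:
    """Re-hash the protected files after the agent finishes and report any
    that changed, i.e. potential evaluator tampering."""
    return [p for p, digest in before.items() if file_sha256(p) != digest]


if __name__ == "__main__":
    # Hypothetical protected files: the metric script and the held-out split.
    protected = [Path("evaluate.py"), Path("data/test.csv")]
    before = snapshot_workspace(protected)
    # ... agent performs the ML-engineering task in the workspace here ...
    tampered = detect_tampering(before)
    if tampered:
        print("Possible evaluator tampering:", [str(p) for p in tampered])
```

A hash check like this only catches modification of protected files; detecting train/test leakage (the agent reading the held-out data during training) would additionally require monitoring file accesses, which this sketch does not attempt.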