AI models are failing 'elite' tests because some of the test questions are impossible to answer correctly.
March 26, 2026
Original Paper
The Arrival of AGI? When Expert Personas Exceed Expert Benchmarks
SSRN · 6343278
The Takeaway
The paper argues that AI performance isn't just about a model's intelligence, but about systemic flaws in the gold-standard tests we use to measure it. High-performing models are actually penalized for correct reasoning that contradicts a benchmark's flawed 'ground truth'.
From the abstract
Do expert personas improve language model performance? The Wharton Generative AI Lab reports that they do not, broadcasting to millions via social media the recommendation that practitioners abandon a technique recommended by Anthropic, Google, and OpenAI. We demonstrate that this null finding was structurally predictable. Five core mechanisms precluded detection before data collection began: baseline contamination elevating the starting point to near-ceiling, system prompt hierarchy subordinati…