The tests we use to rank the world's best AI coders are so bad that the AI can pass even when its code doesn't actually work.
April 3, 2026
Original Paper
Are Benchmark Tests Strong Enough? Mutation-Guided Diagnosis and Augmentation of Regression Suites
arXiv · 2604.01518
The Takeaway
Over 75% of the test cases are flawed, allowing an AI to look much smarter than it actually is. As a result, the reported capabilities of 'expert' AI coding agents are systematically inflated and unreliable.
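The core failure mode is easy to see in miniature: a regression suite that never probes a certain input region will accept a plausible patch that is wrong on exactly that region. The sketch below is a hypothetical illustration of this idea (the functions and tests are invented for this article, not taken from the paper or from SWE-bench):

```python
def patched_clamp(x, lo, hi):
    """A plausible but semantically incorrect patch: it ignores the lower bound."""
    return min(x, hi)

def correct_clamp(x, lo, hi):
    """The intended behavior: restrict x to the interval [lo, hi]."""
    return max(lo, min(x, hi))

def weak_suite(clamp):
    # Only probes values inside or above the range -- never below lo,
    # so it cannot distinguish the wrong patch from the correct one.
    return clamp(5, 0, 10) == 5 and clamp(15, 0, 10) == 10

def augmented_suite(clamp):
    # Adds one case below the lower bound, closing the blind spot.
    return weak_suite(clamp) and clamp(-3, 0, 10) == 0

print(weak_suite(patched_clamp))       # True: the wrong patch slips through
print(augmented_suite(patched_clamp))  # False: the stronger suite rejects it
print(augmented_suite(correct_clamp))  # True: the real fix still passes
```

In mutation-testing terms, `patched_clamp` is a semantically altered variant that "survives" the weak suite; a surviving variant is the signal that a test like `clamp(-3, 0, 10) == 0` is missing.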
From the abstract
Benchmarks driven by test suites, notably SWE-bench, have become the de facto standard for measuring the effectiveness of automated issue-resolution agents: a generated patch is accepted whenever it passes the accompanying regression tests. In practice, however, insufficiently strong test suites can admit plausible yet semantically incorrect patches, inflating reported success rates. We introduce STING, a framework for targeted test augmentation that uses semantically altered program variants as