The tests we use to rank the world's best AI coders are so bad that the AI can pass even when its code doesn't actually work.
April 3, 2026
Original Paper
Are Benchmark Tests Strong Enough? Mutation-Guided Diagnosis and Augmentation of Regression Suites
arXiv · 2604.01518
The Takeaway
Over 75% of the test cases are flawed, allowing an AI to look much smarter than it actually is. As a result, the reported capabilities of 'expert' AI coding agents are systematically inflated and unreliable.
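The core failure mode is easy to see in miniature: a regression suite that never probes a certain input region will accept a plausible patch that is wrong on exactly that region. The sketch below is a hypothetical illustration of this idea (the functions and tests are invented for this article, not taken from the paper or from SWE-bench):

```python
def patched_clamp(x, lo, hi):
    """A plausible but semantically incorrect patch: it ignores the lower bound."""
    return min(x, hi)

def correct_clamp(x, lo, hi):
    """The intended behavior: restrict x to the interval [lo, hi]."""
    return max(lo, min(x, hi))

def weak_suite(clamp):
    # Only probes values inside or above the range -- never below lo,
    # so it cannot distinguish the wrong patch from the correct one.
    return clamp(5, 0, 10) == 5 and clamp(15, 0, 10) == 10

def augmented_suite(clamp):
    # Adds one case below the lower bound, closing the blind spot.
    return weak_suite(clamp) and clamp(-3, 0, 10) == 0

print(weak_suite(patched_clamp))       # True: the wrong patch slips through
print(augmented_suite(patched_clamp))  # False: the stronger suite rejects it
print(augmented_suite(correct_clamp))  # True: the real fix still passes
```

In mutation-testing terms, `patched_clamp` is a semantically altered variant that "survives" the weak suite; a surviving variant is the signal that a test like `clamp(-3, 0, 10) == 0` is missing.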
From the abstract
Benchmarks driven by test suites, notably SWE-bench, have become the de facto standard for measuring the effectiveness of automated issue-resolution agents: a generated patch is accepted whenever it passes the accompanying regression tests. In practice, however, insufficiently strong test suites can admit plausible yet semantically incorrect patches, inflating reported success rates. We introduce STING, a framework for targeted test augmentation that uses semantically altered program variants as