Shows that LLM agent capability (pass@1) and reliability (consistency across repeated attempts) diverge systematically, with frontier models often exhibiting the highest 'meltdown' rates.
April 1, 2026
Original Paper
Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents
arXiv · 2603.29231
The Takeaway
The paper introduces a 'Reliability Science' framework that explains why high-performing models can still fail in production environments. Crucially, it finds that memory scaffolds often hurt long-horizon performance, and that model rankings invert by multiple positions when models are evaluated on reliability rather than simple success rates.
From the abstract
Existing benchmarks measure capability -- whether a model succeeds on a single attempt -- but production deployments require reliability -- consistent success across repeated attempts on tasks of varying duration. We show these properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to this divergence. We introduce a reliability science framework for long-horizon LLM agents with four metrics: Reliability Decay Curve (RDC), Variance Amplification
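The capability/reliability gap described in the abstract can be sketched with a toy simulation. The helper names and the all-of-k reliability definition below are assumptions for illustration, not the paper's exact metric definitions; the point is only that a model with high single-attempt success can still show steep decay when required to succeed on every repeat.

```python
import random

def pass_at_1(attempts):
    """Capability proxy: mean single-attempt success rate over all runs."""
    flat = [a for task in attempts for a in task]
    return sum(flat) / len(flat)

def reliability_at_k(attempts, k):
    """Reliability proxy (assumed definition): fraction of tasks
    solved on ALL of the first k repeated attempts."""
    return sum(all(task[:k]) for task in attempts) / len(attempts)

random.seed(0)
# Simulate 200 tasks x 8 independent attempts, each succeeding with
# probability 0.9: pass@1 stays near 0.9, while all-of-k reliability
# decays roughly like 0.9**k as k grows.
attempts = [[random.random() < 0.9 for _ in range(8)] for _ in range(200)]
print(round(pass_at_1(attempts), 2))
print([round(reliability_at_k(attempts, k), 2) for k in (1, 2, 4, 8)])
```

Plotting `reliability_at_k` against k gives a decay curve in the spirit of the paper's Reliability Decay Curve, and comparing two simulated models on pass@1 versus reliability@8 shows how rankings can invert.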