Shows that LLM agent capability (pass@1) and reliability (consistency across repeated attempts) diverge systematically, with frontier models often exhibiting the highest 'meltdown' rates.
April 1, 2026
Original Paper
Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents
arXiv · 2603.29231
The Takeaway
The paper introduces a 'Reliability Science' framework that explains why high-performing models can still fail in production environments. Crucially, it finds that memory scaffolds often hurt long-horizon performance, and that model rankings invert by multiple positions when models are evaluated on reliability rather than simple success rates.
From the abstract
Existing benchmarks measure capability -- whether a model succeeds on a single attempt -- but production deployments require reliability -- consistent success across repeated attempts on tasks of varying duration. We show these properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to this divergence. We introduce a reliability science framework for long-horizon LLM agents with four metrics: Reliability Decay Curve (RDC), Variance Amplification
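The capability/reliability gap described in the abstract can be sketched with a toy simulation. The helper names and the all-of-k reliability definition below are assumptions for illustration, not the paper's exact metric definitions; the point is only that a model with high single-attempt success can still show steep decay when required to succeed on every repeat.

```python
import random

def pass_at_1(attempts):
    """Capability proxy: mean single-attempt success rate over all runs."""
    flat = [a for task in attempts for a in task]
    return sum(flat) / len(flat)

def reliability_at_k(attempts, k):
    """Reliability proxy (assumed definition): fraction of tasks
    solved on ALL of the first k repeated attempts."""
    return sum(all(task[:k]) for task in attempts) / len(attempts)

random.seed(0)
# Simulate 200 tasks x 8 independent attempts, each succeeding with
# probability 0.9: pass@1 stays near 0.9, while all-of-k reliability
# decays roughly like 0.9**k as k grows.
attempts = [[random.random() < 0.9 for _ in range(8)] for _ in range(200)]
print(round(pass_at_1(attempts), 2))
print([round(reliability_at_k(attempts, k), 2) for k in (1, 2, 4, 8)])
```

Plotting `reliability_at_k` against k gives a decay curve in the spirit of the paper's Reliability Decay Curve, and comparing two simulated models on pass@1 versus reliability@8 shows how rankings can invert.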