AI & ML Scaling Insight

Shows that task-completion scores can hide substantial differences in how well LLM agents track intermediate state, and proposes a 'Working Memory Fidelity' probe (WMF-AM) as a more predictive metric.

March 31, 2026

Original Paper

Beyond Completion: Probing Cumulative State Tracking to Predict LLM Agent Performance

Dengzhe Hou, Lingyu Jiang, Deng Li, Zirui Li, Fangzhou Lin, Kazunori D Yamada

arXiv · 2603.27343

The Takeaway

The paper identifies cumulative state tracking under load as the primary bottleneck for LLM agents and provides a calibrated, no-scratchpad probe that predicts performance across 20 open-weight models better than traditional task-completion benchmarks.
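To make "cumulative state tracking" concrete, here is a minimal Python sketch of what one probe item of this kind could look like. It is an illustration only, not the authors' WMF-AM implementation: the function names (`make_probe`, `render_prompt`, `accuracy`), the add/subtract operation set, the value ranges, and the prompt wording are all assumptions made for this example.

```python
import random


def make_probe(num_steps, seed=0):
    """Build one hypothetical probe item.

    Returns (start, steps, target): a start value, a list of
    (operation, operand) updates, and the true final value.
    All ranges here are illustrative assumptions.
    """
    rng = random.Random(seed)
    start = rng.randint(1, 20)
    state = start
    steps = []
    for _ in range(num_steps):
        op = rng.choice(["add", "subtract"])
        operand = rng.randint(1, 9)
        # Track the running total so the ground truth is known.
        state = state + operand if op == "add" else state - operand
        steps.append((op, operand))
    return start, steps, state


def render_prompt(start, steps):
    """Turn a probe item into a prompt that forbids written working."""
    lines = [f"Start with {start}."]
    for op, operand in steps:
        lines.append(f"{op.capitalize()} {operand}.")
    lines.append("Reply with only the final number, no working.")
    return "\n".join(lines)


def accuracy(answers, targets):
    """Fraction of items where the model's answer equals the target."""
    if not targets:
        raise ValueError("targets must not be empty")
    correct = sum(1 for a, t in zip(answers, targets) if a == t)
    return correct / len(targets)
```

Raising `num_steps` increases the number of updates the model must carry without a scratchpad, which is one plausible reading of "under load"; how the paper calibrates that load is not stated in this summary.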

From the abstract

Task-completion rate is the standard proxy for LLM agent capability, but models with identical completion scores can differ substantially in their ability to track intermediate state. We introduce Working Memory Fidelity-Active Manipulation (WMF-AM), a calibrated no-scratchpad probe of cumulative arithmetic state tracking, and evaluate it on 20 open-weight models (0.5B-35B, 13 families) against a released deterministic 10-task agent battery. In a pre-specified, Bonferroni-corrected analysis, WMF