Shows that task-completion rate alone can fail to distinguish agent capabilities, and proposes Working Memory Fidelity-Active Manipulation (WMF-AM) as a more predictive probe.
March 31, 2026
Original Paper
Beyond Completion: Probing Cumulative State Tracking to Predict LLM Agent Performance
arXiv · 2603.27343
The Takeaway
The paper identifies cumulative state tracking under load as the primary bottleneck for LLM agents and provides a calibrated, no-scratchpad probe that predicts agent performance across 20 open-weight models better than task-completion rate does.
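To make the idea of a cumulative state-tracking probe concrete, here is a minimal Python sketch. It is not the paper's WMF-AM protocol: the prompt wording, the operation set, and the names `make_probe`, `probe_accuracy`, and `oracle` are all assumptions made for illustration. The model is asked for the final value only, with no intermediate working, and accuracy is measured at a chosen number of sequential updates (`depth`).

```python
import random
from typing import Callable, Tuple


def make_probe(depth: int, rng: random.Random) -> Tuple[str, int]:
    """Build one prompt with `depth` sequential updates (hypothetical format).

    Returns the prompt text and the true final value.
    """
    value = rng.randint(1, 20)
    lines = [f"Start with {value}."]
    for _ in range(depth):
        n = rng.randint(1, 9)
        if rng.choice(["add", "subtract"]) == "add":
            value += n
            lines.append(f"Add {n}.")
        else:
            value -= n
            lines.append(f"Subtract {n}.")
    # No-scratchpad condition: only the final number is requested.
    lines.append("Reply with only the final number.")
    return "\n".join(lines), value


def probe_accuracy(answer_fn: Callable[[str], str], depth: int,
                   trials: int = 50, seed: int = 0) -> float:
    """Fraction of probes at a given depth that `answer_fn` answers exactly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        prompt, truth = make_probe(depth, rng)
        try:
            correct += int(answer_fn(prompt).strip()) == truth
        except ValueError:
            pass  # a non-numeric reply counts as wrong
    return correct / trials


def oracle(prompt: str) -> str:
    """Reference solver standing in for a model; it parses the prompt exactly."""
    total = 0
    for line in prompt.splitlines():
        words = line.rstrip(".").split()
        if words[0] == "Start":
            total = int(words[-1])
        elif words[0] == "Add":
            total += int(words[1])
        elif words[0] == "Subtract":
            total -= int(words[1])
    return str(total)
```

In practice `answer_fn` would wrap a call to the model under test, and sweeping `depth` would show how accuracy changes as the number of updates to track grows.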
From the abstract
Task-completion rate is the standard proxy for LLM agent capability, but models with identical completion scores can differ substantially in their ability to track intermediate state. We introduce Working Memory Fidelity-Active Manipulation (WMF-AM), a calibrated no-scratchpad probe of cumulative arithmetic state tracking, and evaluate it on 20 open-weight models (0.5B-35B, 13 families) against a released deterministic 10-task agent battery. In a pre-specified, Bonferroni-corrected analysis, WMF