Identifies 'silent commitment failure,' a failure mode in which some model architectures produce confident, incorrect outputs with no detectable warning signal before execution.
March 24, 2026
Original Paper
Silent Commitment Failure in Instruction-Tuned Language Models: Evidence of Governability Divergence Across Architectures
arXiv · 2603.21415
The Takeaway
The paper introduces 'governability' and reveals that a model's ability to signal its own errors is largely fixed during pretraining rather than fine-tuning. This challenges the assumption that safety and error detection can be easily 'added' to any model via post-training.
From the abstract
As large language models are deployed as autonomous agents with tool execution privileges, a critical assumption underpins their security architecture: that model errors are detectable at runtime. We present empirical evidence that this assumption fails for two of three instruction-following models evaluated for conflict detection. We introduce governability -- the degree to which a model's errors are detectable before output commitment and correctable once detected -- and demonstrate that it varies …
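The excerpt defines governability qualitatively but does not give the paper's formal metric. As a rough illustration, the definition suggests a score combining two conditional rates: how often an error is flagged before output commitment, and how often a flagged error is then fixed. The sketch below is one plausible operationalization under that assumption; the decomposition, the class fields, and the function name are illustrative, not the paper's method.

```python
from dataclasses import dataclass


@dataclass
class AgentTrial:
    """One tool-use trial: did the model err, flag it, and fix it?"""
    erred: bool      # the model's output was incorrect
    flagged: bool    # an error signal was emitted before output commitment
    corrected: bool  # the error was fixed after being detected


def governability(trials: list[AgentTrial]) -> float:
    """Illustrative score: P(detect | error) * P(correct | detected).

    This decomposition is an assumption for illustration only; the
    paper's exact formulation is not given in the excerpt above.
    """
    errors = [t for t in trials if t.erred]
    if not errors:
        return 1.0  # no errors observed; vacuously governable
    detected = [t for t in errors if t.flagged]
    if not detected:
        return 0.0  # every error was a silent commitment failure
    p_detect = len(detected) / len(errors)
    p_correct = sum(t.corrected for t in detected) / len(detected)
    return p_detect * p_correct


# Under this framing, a 'silent commitment failure' is a trial with
# erred=True and flagged=False: a confident, incorrect output with no
# pre-commitment warning signal, so it never enters the correction path.
```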