Provides causal evidence that LLMs use internal confidence signals to drive behavioral decisions like abstention, rather than just as a side-effect of output generation.
March 24, 2026
Original Paper
Causal Evidence that Language Models use Confidence to Drive Behavior
arXiv · 2603.22161
The Takeaway
This challenges the view of LLMs as purely surface-level predictors by demonstrating a 'metacognitive' two-stage process in which internal confidence causally drives behavior. For practitioners building agents, it supports using internal uncertainty signals as triggers for human-in-the-loop escalation.
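As a minimal sketch of that escalation pattern: the snippet below aggregates per-token log-probabilities into a confidence score and routes low-confidence generations to a human. The aggregation (mean token probability), the function names, and the 0.6 threshold are all illustrative assumptions, not the paper's method; a real deployment would calibrate the threshold on held-out data.

```python
import math

def mean_token_confidence(token_logprobs):
    """Average per-token probability; a simple proxy for model confidence.

    `token_logprobs` is a list of log-probabilities, one per generated
    token (as exposed by APIs that return logprobs). Mean probability is
    a common heuristic aggregation, not the paper's internal signal.
    """
    if not token_logprobs:
        return 0.0
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

def should_escalate(token_logprobs, threshold=0.6):
    """Route to a human when aggregate confidence falls below threshold.

    The 0.6 cutoff is a placeholder; calibrate it per task.
    """
    return mean_token_confidence(token_logprobs) < threshold

# A confident answer vs. an uncertain one (probs ~0.9-0.98 vs ~0.14-0.41).
confident = [-0.05, -0.1, -0.02]
uncertain = [-1.2, -2.0, -0.9]
print(should_escalate(confident))  # False: answer directly
print(should_escalate(uncertain))  # True: hand off to a human
```

Any monotone aggregation (minimum token probability, perplexity) could be swapped in; the design point is only that the decision to abstain or escalate is made from the confidence signal, separately from generating the answer itself.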
From the abstract
Metacognition -- the ability to assess one's own cognitive performance -- is documented across species, with internal confidence estimates serving as a key signal for adaptive behavior. While confidence can be extracted from Large Language Model (LLM) outputs, whether models actively use these signals to regulate behavior remains a fundamental question. We investigate this through a four-phase abstention task. Phase 1 established internal confidence estimates in the absence of an abstention option …