Provides causal evidence that LLMs use internal confidence signals to drive behavioral decisions like abstention, rather than just as a side-effect of output generation.
March 24, 2026
Original Paper
Causal Evidence that Language Models use Confidence to Drive Behavior
arXiv · 2603.22161
The Takeaway
This challenges the view of LLMs as purely surface-level predictors by demonstrating a 'metacognitive' two-stage process in which internal confidence causally drives behavior. For practitioners building agents, it supports using internal uncertainty signals as triggers for human-in-the-loop escalation.
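As a minimal sketch of that escalation pattern: the snippet below aggregates per-token log-probabilities into a confidence score and routes low-confidence generations to a human. The aggregation (mean token probability), the function names, and the 0.6 threshold are all illustrative assumptions, not the paper's method; a real deployment would calibrate the threshold on held-out data.

```python
import math

def mean_token_confidence(token_logprobs):
    """Average per-token probability; a simple proxy for model confidence.

    `token_logprobs` is a list of log-probabilities, one per generated
    token (as exposed by APIs that return logprobs). Mean probability is
    a common heuristic aggregation, not the paper's internal signal.
    """
    if not token_logprobs:
        return 0.0
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

def should_escalate(token_logprobs, threshold=0.6):
    """Route to a human when aggregate confidence falls below threshold.

    The 0.6 cutoff is a placeholder; calibrate it per task.
    """
    return mean_token_confidence(token_logprobs) < threshold

# A confident answer vs. an uncertain one (probs ~0.9-0.98 vs ~0.14-0.41).
confident = [-0.05, -0.1, -0.02]
uncertain = [-1.2, -2.0, -0.9]
print(should_escalate(confident))  # False: answer directly
print(should_escalate(uncertain))  # True: hand off to a human
```

Any monotone aggregation (minimum token probability, perplexity) could be swapped in; the design point is only that the decision to abstain or escalate is made from the confidence signal, separately from generating the answer itself.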
From the abstract
Metacognition -- the ability to assess one's own cognitive performance -- is documented across species, with internal confidence estimates serving as a key signal for adaptive behavior. While confidence can be extracted from Large Language Model (LLM) outputs, whether models actively use these signals to regulate behavior remains a fundamental question. We investigate this through a four-phase abstention task. Phase 1 established internal confidence estimates in the absence of an abstention option …