The 'Reasoning Contamination Effect' shows that Chain-of-Thought (CoT) reasoning disrupts a model's internal confidence signal, degrading calibration.
March 27, 2026
Original Paper
Closing the Confidence-Faithfulness Gap in Large Language Models
arXiv · 2603.25052
The Takeaway
Researchers found that the model's internal accuracy signal and its verbalized confidence are encoded along orthogonal directions in activation space, and that reasoning misaligns them further. This provides a mechanistic explanation for why smarter models often sound more overconfident, and it suggests a steering-based fix.
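To make the orthogonality claim concrete, here is a minimal, self-contained sketch of the probing setup: fit one linear probe on hidden activations to predict answer correctness and another to predict verbalized confidence, then compare the probe directions. The synthetic activations, dimensions, and variable names are illustrative stand-ins, not the paper's data or code.

```python
# Sketch: two linear probes on (synthetic) residual-stream activations.
# A cosine similarity near zero between their weight vectors is the kind of
# evidence the paper uses to argue the two signals occupy orthogonal directions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 512, 2000

# Hypothetical latent directions: one drives correctness, an orthogonal one
# drives verbalized confidence (stand-ins for whatever the real model encodes).
acc_dir = rng.normal(size=d_model)
acc_dir /= np.linalg.norm(acc_dir)
conf_dir = rng.normal(size=d_model)
conf_dir -= (conf_dir @ acc_dir) * acc_dir
conf_dir /= np.linalg.norm(conf_dir)

X = rng.normal(size=(n, d_model))                              # fake activations
y_correct = (X @ acc_dir + 0.3 * rng.normal(size=n)) > 0       # answered correctly?
y_high_conf = (X @ conf_dir + 0.3 * rng.normal(size=n)) > 0    # verbalized high confidence?

probe_acc = LogisticRegression(max_iter=1000).fit(X, y_correct)
probe_conf = LogisticRegression(max_iter=1000).fit(X, y_high_conf)

w_acc = probe_acc.coef_.ravel()
w_conf = probe_conf.coef_.ravel()
cos = w_acc @ w_conf / (np.linalg.norm(w_acc) * np.linalg.norm(w_conf))
print(f"cosine(correctness probe, confidence probe) = {cos:.3f}")  # near 0 -> orthogonal
```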
From the abstract
Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationship governing this behavior remains poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another -- a finding consistent …
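The steering side of the method can be sketched in the same spirit. Contrastive activation addition builds a steering vector from the mean activation difference between two contrasting sets of examples and adds it to a layer's output at inference time. The toy module, layer choice, and steering strength below are assumptions for illustration only, not the paper's actual setup.

```python
# Sketch of CAA steering: steer = mean(calibrated activations) - mean(overconfident
# activations), injected into a layer's output via a forward hook.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64

# Toy "transformer block" standing in for one layer of a real LLM.
block = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model))

# Contrast sets: activations cached from prompts eliciting calibrated vs. overconfident
# answers (random tensors here; in practice these come from the model itself).
acts_calibrated = torch.randn(32, d_model)
acts_overconfident = torch.randn(32, d_model)

# CAA steering vector and strength (a tunable hyperparameter).
steer = acts_calibrated.mean(dim=0) - acts_overconfident.mean(dim=0)
alpha = 4.0

def add_steering(module, inputs, output):
    # Forward hook: shift this layer's output along the contrast direction.
    return output + alpha * steer

handle = block[-1].register_forward_hook(add_steering)

x = torch.randn(1, d_model)      # stand-in for a hidden state entering this layer
steered = block(x)               # output nudged toward the "calibrated" direction
handle.remove()
unsteered = block(x)

shift = (steered - unsteered) @ (steer / steer.norm())
print("shift along steering direction:", shift.item())
```

In this framing, the "fix" is simply generating with the hook active, so verbalized confidence is pushed back toward the direction the probes associate with being well calibrated.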