AI & ML Breaks Assumption

Chain-of-thought (CoT) reasoning in vision-language models systematically degrades the reliability of uncertainty estimates, making models dangerously overconfident.

arXiv · March 18, 2026 · 2603.16728

Robert Welch, Emir Konuk, Kevin Smith

The Takeaway

Practitioners often assume better reasoning leads to better reliability, but this work shows that token probabilities reflect consistency with the model's own reasoning trace rather than actual correctness. For high-stakes deployments, this means standard uncertainty quantification methods become invalid when CoT is enabled.
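One common way to check whether confidence scores are reliable is the expected calibration error (ECE): bin predictions by confidence and measure the gap between average confidence and actual accuracy in each bin. The sketch below is not from the paper; the numbers are hypothetical, illustrating the failure mode described above, where answers produced after a reasoning trace carry uniformly high token probabilities regardless of correctness.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between confidence and accuracy
    over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # bin weight * calibration gap
    return ece

# Hypothetical post-CoT predictions: high confidence on every answer,
# but only half are correct -> large calibration error.
conf_cot = [0.95, 0.92, 0.97, 0.90]
correct_cot = [1, 0, 0, 1]
print(round(expected_calibration_error(conf_cot, correct_cot), 3))  # 0.485
```

A well-calibrated model keeps this number near zero; the paper's claim is that enabling CoT pushes it up even when raw accuracy improves.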

From the abstract

Vision-language models (VLMs) are increasingly deployed in high-stakes settings where reliable uncertainty quantification (UQ) is as important as predictive accuracy. Extended reasoning via chain-of-thought (CoT) prompting or reasoning-trained models has become ubiquitous in modern VLM pipelines, yet its effect on UQ reliability remains poorly understood. We show that reasoning consistently degrades the quality of most uncertainty estimates, even when it improves task accuracy. We identify impli