Step-by-step thinking makes an AI worse at figuring out where objects are located in a photo.
Chain-of-Thought prompting causes multimodal models to hallucinate visual details that match their text-based expectations. While the technique helps with math and logic, it consistently degrades spatial reasoning: the model begins to trust its own written description of the scene more than the actual pixels in the image. The result shows that thinking out loud is not a universal recipe for better AI performance, and that vision tasks need prompting methods that keep text logic from overriding visual evidence.
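One concrete way to see the effect described above is to run the same spatial question under three conditions: a direct answer, a Chain-of-Thought answer, and a no-image control that reveals how much the model leans on text priors. The sketch below is illustrative only; the prompt wording, the `ask_vlm` helper, the model name, and the OpenAI-compatible client are assumptions, not the paper's actual evaluation harness.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumption: an OpenAI-compatible vision endpoint and API key

DIRECT_PROMPT = "{q}\nAnswer with the option letter only."
COT_PROMPT = "{q}\nThink step by step, then give the option letter."


def ask_vlm(question: str, image_path: str | None) -> str:
    """Send one question (optionally with an image) to a vision-language model."""
    content = [{"type": "text", "text": question}]
    if image_path is not None:
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append(
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}}
        )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any vision-capable chat model
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content


def evaluate_item(question: str, image_path: str) -> dict:
    """Run one spatial-reasoning item under the three conditions."""
    return {
        # Baseline: answer directly from the image.
        "direct": ask_vlm(DIRECT_PROMPT.format(q=question), image_path),
        # CoT condition: written reasoning that can override visual evidence.
        "cot": ask_vlm(COT_PROMPT.format(q=question), image_path),
        # No-image control: if accuracy barely drops without the picture,
        # the model is answering from text-based expectations, not pixels.
        "no_image": ask_vlm(DIRECT_PROMPT.format(q=question), None),
    }
```

If the `cot` condition underperforms `direct` while `no_image` stays close to both, that is the degradation pattern the paper reports.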
Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
arXiv · 2604.16060
Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT)-based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that