AI & ML Breaks Assumption

Identifies that extended reasoning in multimodal LLMs causes 'attention dispersion,' where models progressively lose focus on visual inputs as the reasoning chain lengthens.

arXiv · March 17, 2026 · 2603.14184

Ruiying Peng, Xueyu Wu, Jing Lei, Lu Hou, Yuanzheng Ma, Xiaohui Li

The Takeaway

The paper challenges the assumption that more 'thought' steps necessarily improve multimodal performance, revealing a perceptual trade-off. The proposed training-free VRGA framework lets practitioners maintain visual grounding during complex reasoning without expensive fine-tuning.

From the abstract

Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi-step reasoning, the model's visual attention becomes scattered and drifts away from question-relevant regions, effectively "losing focus" on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning […]
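The dispersion effect the abstract describes can be illustrated with a small sketch: track what fraction of a generated token's attention mass lands on image tokens at each reasoning step. The setup below is synthetic (random logits with a drift term standing in for a real model's attention), and the helper name `visual_attention_mass` is hypothetical, not from the paper.

```python
import numpy as np

def visual_attention_mass(attn: np.ndarray, visual_idx: np.ndarray) -> float:
    """Fraction of a token's total attention mass placed on image tokens."""
    return float(attn[visual_idx].sum() / attn.sum())

# Synthetic demo: 100 context tokens, the first 40 are image patches.
rng = np.random.default_rng(0)
n_tokens, n_visual = 100, 40
visual_idx = np.arange(n_visual)

# Simulate dispersion: as the reasoning chain lengthens, attention
# logits on visual tokens are progressively down-weighted relative
# to the growing text context.
masses = []
for step in range(10):
    logits = rng.normal(size=n_tokens)
    logits[:n_visual] -= 0.3 * step          # visual focus decays with step
    attn = np.exp(logits) / np.exp(logits).sum()  # softmax attention weights
    masses.append(visual_attention_mass(attn, visual_idx))

print([round(m, 3) for m in masses])  # visual attention mass declines over steps
```

In a real MLLM one would extract the attention maps of generated tokens from the model's forward pass rather than simulate them; the diagnostic quantity (share of attention on image tokens versus reasoning step) is the same.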