Detects and mitigates Vision-Language Model hallucinations at inference time by analyzing visual attention entropy rather than text outputs.
arXiv · March 18, 2026 · 2603.16558
The Takeaway
Most hallucination-mitigation methods focus on the text side and attribute errors to strong language priors, but this paper shows that uncertainty is often visible in the visual attention maps themselves. It provides a training-free reliability score and an attention-adjustment method that reduce hallucinations in real-world scenarios such as robotics.
From the abstract
Large Vision-Language Models (LVLMs) achieve strong performance on many multimodal tasks, but object hallucinations severely undermine their reliability. Most existing studies focus on the text modality, attributing hallucinations to overly strong language priors and insufficient visual grounding. In contrast, we observe that abnormal attention patterns within the visual modality can also give rise to hallucinated objects. Building on this observation, we propose Segmentation-based Attention Entropy […]
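To make the entropy idea concrete, here is a minimal sketch of a reliability signal computed from attention over visual tokens. This is not the paper's exact formulation: the function names, the [0, 1] normalization, and the mean aggregation are illustrative assumptions, and the segmentation step implied by the method's name is omitted.

```python
import torch

def attention_entropy(attn_weights: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Shannon entropy of attention distributions over visual tokens.

    attn_weights: (num_queries, num_visual_tokens), rows summing to 1.
    Returns: (num_queries,) entropy per query token, in nats.
    """
    p = attn_weights.clamp_min(eps)
    return -(p * p.log()).sum(dim=-1)

def reliability_score(attn_weights: torch.Tensor) -> float:
    """Map mean attention entropy onto [0, 1]: diffuse (high-entropy)
    attention over the image suggests weaker visual grounding, so it
    yields a lower score. Heuristic sketch, not the paper's metric."""
    num_visual_tokens = attn_weights.shape[-1]
    max_entropy = torch.log(torch.tensor(float(num_visual_tokens)))
    mean_entropy = attention_entropy(attn_weights).mean()
    return float(1.0 - mean_entropy / max_entropy)

# Example: uniform attention over 64 visual tokens is maximally diffuse,
# so this heuristic scores it near 0 (least reliable).
uniform = torch.full((4, 64), 1.0 / 64)
print(reliability_score(uniform))  # ~0.0
```

Because the score depends only on attention weights already produced during the forward pass, it can be read out at inference time without any retraining, which is what makes the approach training-free.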