SAGE mitigates multimodal hallucinations by monitoring 'attention sinks' and dynamically modulating self-attention during the decoding process.
March 31, 2026
Original Paper
SAGE: Sink-Aware Grounded Decoding for Multimodal Hallucination Mitigation
arXiv · 2603.27898
The Takeaway
Unlike post-hoc verification or expensive retraining, SAGE intervenes in real time by detecting when the model over-attends to semantically weak tokens (attention sinks). It offers a training-free way to improve VLM reliability by grounding generation in visual features only when that over-attention is detected.
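As a rough intuition for the mechanism, the idea can be sketched as follows. This is a hypothetical simplification, not the paper's implementation: the function name, the fixed `sink_threshold`, and the proportional redistribution rule are all illustrative assumptions; the actual method operates inside the model's self-attention layers during decoding.

```python
import numpy as np

def sink_aware_rescale(attn, visual_idx, sink_idx,
                       sink_threshold=0.4, alpha=0.5):
    """Illustrative sketch (not SAGE's actual algorithm) of sink-aware
    attention modulation for one decoding step.

    attn:           1-D attention distribution over context tokens (sums to 1).
    visual_idx:     indices of visual (image) tokens.
    sink_idx:       indices of suspected attention-sink tokens.
    sink_threshold: hypothetical trigger level for intervention.
    alpha:          hypothetical fraction of the excess sink mass to move.
    """
    attn = np.asarray(attn, dtype=float)
    sink_mass = attn[sink_idx].sum()
    if sink_mass <= sink_threshold:
        return attn  # attention looks grounded; leave it untouched

    # Shift part of the excess sink mass onto visual tokens, proportionally.
    excess = alpha * (sink_mass - sink_threshold)
    out = attn.copy()
    out[sink_idx] -= excess * attn[sink_idx] / sink_mass
    vis = attn[visual_idx]
    out[visual_idx] += excess * vis / vis.sum()
    return out / out.sum()  # renormalize to a valid distribution
```

The key design point this toy version captures is conditionality: attention is redistributed toward visual tokens only when sink mass crosses a threshold, rather than uniformly reweighting every step.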
From the abstract
Large vision-language models (VLMs) frequently suffer from hallucinations, generating content that is inconsistent with visual inputs. Existing methods typically address this problem through post-hoc filtering, additional training objectives, or external verification, but they do not intervene during the decoding process when hallucinations arise. In this work, we introduce SAGE, a Sink-Aware Grounded Decoding framework that mitigates hallucinations by dynamically modulating self-attention during decoding.