Vision models will ignore a picture of a cat and claim it is a dog if the word 'dog' is written over the image.
April 23, 2026
Original Paper
When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
arXiv · 2604.17375
The Takeaway
Text overlays can hijack the perception of modern vision-language models. When the visual content and the overlaid text disagree, the model almost always believes the text. This is a dangerous blind spot for security and safety: a simple label could trick a vision system into ignoring a clear hazard. It shows that these models prioritize symbolic information over raw sensory evidence. Fixing this requires training models that trust what they see over what is written on top of it. We cannot assume an AI sees the world as it really is when labels are present.
From the abstract
Recent advances in Vision-Language Models (VLMs) have substantially enhanced their performance across multimodal video understanding benchmarks spanning temporal, action, object, and spatial understanding. However, we identify a critical yet overlooked issue: when embedded on-screen text contradicts the visual scene, existing VLMs systematically hallucinate, prioritizing the overlay's textual semantics over the actual visual content. We define this phenomenon as Text Overlay-Induced Hallucination (TOIH).
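The attack described above is easy to reproduce in spirit: draw a contradictory label onto an image before showing it to a model. The sketch below, using Pillow, is a minimal illustration of that probe; the function name `add_text_overlay` and the gray placeholder image are assumptions for demonstration, not the paper's actual benchmark pipeline.

```python
from PIL import Image, ImageDraw

def add_text_overlay(image: Image.Image, label: str) -> Image.Image:
    """Return a copy of `image` with `label` drawn across the top,
    mimicking on-screen text that contradicts the visual content."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    # Black caption band with large white text, like a video overlay.
    draw.rectangle([(0, 0), (out.width, 40)], fill="black")
    draw.text((10, 10), label, fill="white")
    return out

# A stand-in for a real cat photo (solid gray placeholder).
cat = Image.new("RGB", (256, 256), "gray")
# Caption the "cat" image with the misleading label 'dog'.
probe = add_text_overlay(cat, "dog")
# `probe` would then be sent to a VLM with a question like
# "What animal is in this image?" to test whether the text hijacks vision.
```

In a real test, the placeholder would be an actual photograph, and the model's answer on the captioned versus uncaptioned image would be compared to measure the hallucination.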