Multimodal AI models are often "functionally blind," guessing what is in an image based on the text instead of actually looking at it.
April 24, 2026
Original Paper
The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm
arXiv · 2604.20665
The Takeaway
Even the most advanced vision-language models frequently fall back on linguistic patterns to answer questions about images. This functional blindness means the model isn't actually reasoning over the visual data it receives; it is using its knowledge of how people talk to make a highly educated guess. That creates a serious trust problem, because the model can miss a critical visual detail that contradicts its text-based assumptions. We cannot assume that an AI sees the world just because it can describe it, and relying on these models for safety-critical visual tasks is currently a major risk.
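One intuitive way to see what "functional blindness" means in practice is a text-only ablation: ask the same question with and without the image and check whether the answer changes. The sketch below is ours, not the paper's evaluation protocol; `answer_fn` is a hypothetical wrapper around whatever VLM you are testing.

```python
from typing import Callable, Optional, Sequence


def blindness_probe(
    answer_fn: Callable[[str, Optional[bytes]], str],
    questions: Sequence[str],
    images: Sequence[bytes],
) -> float:
    """Fraction of questions whose answer is unchanged when the image is removed.

    A high score suggests the model is answering from linguistic priors
    rather than from the visual input, i.e. it is "functionally blind"
    on these questions. (Illustrative probe; not the paper's benchmark.)
    """
    unchanged = 0
    for question, image in zip(questions, images):
        with_image = answer_fn(question, image)
        without_image = answer_fn(question, None)  # text-only ablation
        if with_image.strip().lower() == without_image.strip().lower():
            unchanged += 1
    return unchanged / len(questions)


# A stub that ignores the image entirely scores 1.0, i.e. perfectly "blind".
stub = lambda question, image: "yes"
print(blindness_probe(stub, ["Is the traffic light red?"], [b"<image bytes>"]))
```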
From the abstract
The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery, but its foundation operates on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue they do not. Instead, a profound crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness…
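The "Vision Encoder-Projector-LLM" paradigm the abstract names is the standard monolithic VLM recipe: a vision encoder turns the image into patch embeddings, a small projector maps those into the LLM's token space, and the LLM attends to them alongside the text. A minimal sketch of that pipeline in PyTorch, with `vision_encoder` and `llm` as stand-in modules (names and structure are our assumption, not the paper's implementation):

```python
import torch
import torch.nn as nn


class EncoderProjectorLLM(nn.Module):
    """Minimal sketch of the Vision Encoder-Projector-LLM paradigm."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a ViT, often frozen
        self.projector = nn.Linear(vision_dim, llm_dim)  # in practice often a small MLP
        self.llm = llm                                   # decoder-only language model

    def forward(self, pixel_values: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        patch_embeds = self.vision_encoder(pixel_values)  # (B, num_patches, vision_dim)
        visual_tokens = self.projector(patch_embeds)      # (B, num_patches, llm_dim)
        # Visual tokens are simply prepended to the text tokens; nothing forces
        # the LLM to attend to them, which is where the trust problem the
        # abstract describes can arise.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs)
```

Because the image enters only as a handful of projected tokens, the language model is free to lean on text priors and downweight the visual evidence; that architectural freedom is the crux of the trustworthiness argument.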