AI & ML Nature Is Weird

VLMs fail at simple counting because their language layers 'talk' them into ignoring the visual evidence.

April 15, 2026

Original Paper

Counting to Four is still a Chore for VLMs

arXiv · 2604.10039

The Takeaway

This study demonstrates that Vision-Language Models don't fail at counting because they can't 'see'—they fail because the text-based priors override the visual data. Essentially, the model 'sees' four items but its language training 'tells' it there should be two, and it goes with the text. This reveals a bizarre internal conflict in multimodal AI. It suggests that scaling up the visual encoder won't fix basic errors as long as the language model remains overconfident. To build better vision systems, we have to find a way to let the 'eyes' win the argument over the 'mouth.'
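The dynamic described above can be pictured as two signals being added together before the model commits to an answer. The following toy sketch is not from the paper; the logit values and the simple additive combination are assumptions chosen purely to illustrate how a confident language prior can outvote weaker but correct visual evidence.

```python
import math

def softmax(logits):
    """Convert a dict of logits into a dict of probabilities."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

# Hypothetical logits over candidate counts. The visual pathway mildly
# favors the true count ("4"), while the language prior strongly favors
# a more "typical" count ("2").
visual = {"2": 0.5, "3": 0.8, "4": 1.5}
prior  = {"2": 4.0, "3": 1.0, "4": 0.5}

# If the model simply sums the two signals, the overconfident prior
# dominates, even though the visual evidence alone picks the right answer.
combined = {k: visual[k] + prior[k] for k in visual}
probs = softmax(combined)
answer = max(probs, key=probs.get)
print(answer)  # "2" — the prior wins the argument over the evidence
```

Run on the visual logits alone, the same argmax returns "4", which is the sense in which the model "sees" the right answer before its language side talks it out of it.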

From the abstract

Vision-language models (VLMs) have achieved impressive performance on complex multimodal reasoning tasks, yet they still fail on simple grounding skills such as object counting. Existing evaluations mostly assess only final outputs, offering limited insight into where these failures arise inside the model. In this work, we present an empirical study of VLM counting behavior through both behavioral and mechanistic analysis. We introduce COUNTINGTRICKS, a controlled evaluation suite of simple sha…