AI & ML · Breaks Assumption

Identifies a structural 'affordance gap' in Vision-Language Models, showing that they fail at embodied scene understanding regardless of model scale or prompt engineering.

March 30, 2026

Original Paper

The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding

Gillian Rosenberg, Skylar Stadhard, Bruce C. Hansen, Michelle R. Greene

arXiv · 2603.26589

The Takeaway

The paper challenges the distributional hypothesis, the claim that images and text are sufficient for full visual cognition, by showing that current VLMs lack critical knowledge of physical interaction (affordances), knowledge that is essential for robotics and spatial AI.

From the abstract

What information is sufficient to learn the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptual knowledge underlying visual cognition. Vision-language models (VLMs) are trained on massive paired text-image corpora but lack embodied experience, making them an ideal test of the distributional hypothesis. We report two experiments comparing descriptions generated by 18 VLMs to those of over […]
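
The abstract describes comparing VLM-generated scene descriptions with human ones. As an illustration only, here is a minimal Python sketch of one way such a comparison could be scored, using sentence embeddings and cosine similarity; the encoder choice, example texts, and scoring metric are assumptions for the sketch, not the authors' actual protocol.

```python
# Minimal sketch (assumed method, not the paper's): score how closely a
# VLM's scene description matches human descriptions of the same scene.
# Requires: pip install sentence-transformers numpy
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

vlm_description = "A kitchen counter with a kettle and two mugs."  # hypothetical VLM output
human_descriptions = [  # hypothetical human descriptions, affordance-focused
    "A counter where you could set down groceries and make tea.",
    "A kitchen surface; the kettle handle looks easy to grasp.",
]

# Encode all descriptions into a shared embedding space.
texts = [vlm_description] + human_descriptions
emb = model.encode(texts, normalize_embeddings=True)

# Cosine similarity of the VLM description to each human description
# (a dot product suffices because the embeddings are L2-normalized).
sims = emb[1:] @ emb[0]
for i, s in enumerate(sims):
    print(f"human_{i}: {float(s):.3f}")

# Consistently low similarity on affordance-heavy descriptions would be
# one crude signal of the 'affordance gap' the paper identifies.
print("mean similarity:", float(np.mean(sims)))
```

This is a deliberately simple stand-in: the paper's experiments compare 18 VLMs against a large human sample, and any real analysis would need controlled prompts and a validated similarity measure rather than a single off-the-shelf embedding model.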