Reveals that state-of-the-art MLLMs fail to maintain stable spatial representations under simple counterfactual viewpoint changes.
March 24, 2026
Original Paper
CVT-Bench: Counterfactual Viewpoint Transformations Reveal Unstable Spatial Representations in Multimodal LLMs
arXiv · 2603.21114
The Takeaway
The paper demonstrates that high single-view spatial accuracy in MLLMs is misleading; models frequently violate 360-degree cycle consistency. This is a critical insight for anyone using MLLMs for spatial reasoning or robotics where viewpoint stability is essential.
From the abstract
Multimodal large language models (MLLMs) achieve strong performance on single-view spatial reasoning tasks, yet it remains unclear whether they maintain stable spatial state representations under counterfactual viewpoint changes. We introduce a controlled diagnostic benchmark that evaluates relational consistency under hypothetical camera orbit transformations without re-rendering images. Across 100 synthetic scenes and 6,000 relational queries, we measure viewpoint consistency and 360° cycle agreement …
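To make the 360° cycle-agreement idea concrete, here is a minimal sketch of how such a check could work for ego-centric relations under 90° camera orbits. This is an illustration under assumed conventions, not the paper's actual protocol; the relation names, the orbit mapping, and the function names are all hypothetical.

```python
# Hedged sketch: 360° cycle-consistency check for spatial relations
# under hypothetical 90° camera orbits. The mapping below is an
# assumed convention for illustration, not the benchmark's definition.

# How each ego-centric relation transforms after the camera orbits
# 90° counterclockwise around the scene.
ORBIT_90 = {
    "left of": "behind",
    "behind": "right of",
    "right of": "in front of",
    "in front of": "left of",
}

def transform(relation: str, quarter_turns: int) -> str:
    """Apply quarter_turns successive 90° orbits to a relation."""
    for _ in range(quarter_turns % 4):
        relation = ORBIT_90[relation]
    return relation

def cycle_consistent(model_answers: list[str]) -> bool:
    """model_answers[k] is the model's predicted relation after k
    quarter-turn orbits. The sequence is cycle-consistent iff every
    answer equals the ground-truth transform of the first answer,
    so the 360° answer returns to the starting relation."""
    base = model_answers[0]
    return all(
        ans == transform(base, k) for k, ans in enumerate(model_answers)
    )

# A consistent model: the relation rotates through the cycle and returns.
print(cycle_consistent(
    ["left of", "behind", "right of", "in front of", "left of"]))  # True
# An inconsistent model: the prediction breaks after the second orbit.
print(cycle_consistent(
    ["left of", "behind", "left of", "in front of", "left of"]))  # False
```

A model can score well on each single view in isolation yet still fail this kind of check, which is the instability the benchmark is designed to expose.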