This physics-informed VLM framework improves physics-grounded anomaly detection AUROC from 66.9% to 96.7%.
arXiv · March 17, 2026 · 2603.15237
The Takeaway
Current VLMs excel at appearance but fail at physical dynamics (e.g., irregular rotations). By decomposing causal reasoning into multi-turn dialogues with structured physical priors, this paper demonstrates a massive jump in a model's ability to understand mechanical constraints.
From the abstract
Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or violated mechanical motions. We introduce a physics-informed instruction tuning framework that explicitly encodes object propertie