STEVO-Bench reveals that current 'video world models' fail to simulate physical processes when the camera looks away or lights go out.
arXiv · March 16, 2026 · 2603.13215
Why it matters
It challenges the notion that video generative models are true world models, offering benchmark evidence that they fail to decouple state evolution (such as ice melting) from visual observation.
From the abstract
Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate "worlds" via 2D frame observations. Can these generated "worlds" evolve regardless of observation? To probe this question, we design a benchmark to evaluate whether video world models can decouple state evolution from observation. Our benchmark, STEVO-Bench, applies observation control to evolving processes via instructions of occluder insertion, turning off the light,