AI & ML · Breaks Assumption

STEVO-Bench reveals that current 'video world models' fail to simulate physical processes when the camera looks away or lights go out.

arXiv · March 16, 2026 · 2603.13215

Ziqi Ma, Mengzhan Liufu, Georgia Gkioxari

Why it matters

STEVO-Bench systematically tests the notion that video generative models are true world models, and provides evidence that they fail to decouple state evolution (such as ice melting) from visual observation.

From the abstract

Evolutions in the world, such as water pouring or ice melting, happen regardless of being observed. Video world models generate "worlds" via 2D frame observations. Can these generated "worlds" evolve regardless of observation? To probe this question, we design a benchmark to evaluate whether video world models can decouple state evolution from observation. Our benchmark, STEVO-Bench, applies observation control to evolving processes via instructions of occluder insertion, turning off the light,