Video models perform reasoning during the diffusion denoising steps rather than sequentially across video frames.
March 18, 2026
Original Paper
Demystifying Video Reasoning
arXiv · 2603.16870
The Takeaway
The paper uncovers a 'Chain-of-Steps' mechanism: video models explore candidate solutions during the early denoising steps and converge on an answer in the later ones. This insight enables training-free strategies that improve model reasoning by ensembling or otherwise manipulating the denoising process.
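To make the explore-then-converge idea concrete, here is a minimal toy sketch of one way an ensembling strategy over denoising steps could look. It is not the paper's method. The function names (`toy_denoise_step`, `ensemble_denoise`), the linear update rule, the fixed target, and all parameter values are assumptions chosen for illustration; a real video diffusion model would replace the toy step with its own denoiser and scheduler.

```python
import random


def toy_denoise_step(latent, step, num_steps, target):
    """Hypothetical stand-in for one denoising step of a real model.

    Moves the latent a fraction of the way toward a fixed target; the
    fraction grows so that the final step lands on the target.
    """
    alpha = 1.0 / (num_steps - step)
    return [z + alpha * (t - z) for z, t in zip(latent, target)]


def ensemble_denoise(num_seeds=4, num_steps=10, explore_steps=3, dim=8, seed=0):
    """Toy ensembling: explore with several seeds, merge, then converge."""
    rng = random.Random(seed)
    target = [1.0] * dim  # assumed "solution" the toy denoiser is pulled toward

    # Start from several independent noise samples (one per seed).
    latents = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_seeds)]

    # Early steps: each candidate is denoised independently (exploration).
    for step in range(explore_steps):
        latents = [toy_denoise_step(z, step, num_steps, target) for z in latents]

    # Merge the candidates by averaging their latents element-wise.
    merged = [sum(vals) / num_seeds for vals in zip(*latents)]

    # Later steps: denoise the single merged latent (convergence).
    for step in range(explore_steps, num_steps):
        merged = toy_denoise_step(merged, step, num_steps, target)
    return merged


if __name__ == "__main__":
    print(ensemble_denoise())
```

Averaging latents is only one possible merge rule, picked here because it is simple; selecting or voting among candidates would fit the same skeleton.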
From the abstract
Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative a