AI & ML · New Capability

Wan-R1 applies Group Relative Policy Optimization (GRPO) to flow-based video models, using verifiable rewards to unlock spatial reasoning.

March 31, 2026

Original Paper

Wan-R1: Verifiable-Reinforcement Learning for Video Reasoning

Ming Liu, Yunbei Zhang, Shilong Liu, Liwen Wang, Wensheng Zhang

arXiv · 2603.27866

The Takeaway

Wan-R1 brings the 'R1' reinforcement learning paradigm to video generation, sidestepping reward hacking in multimodal models by rewarding verifiable, objective task metrics. This lets video models actually solve complex 3D mazes and navigation tasks rather than just generate pretty pixels.
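To make the "verifiable" part concrete, here is a minimal sketch of what an objective maze reward could look like. The grid encoding, function name, and the assumption that a trajectory has already been decoded from the generated video are all illustrative; the paper's actual reward implementation may differ.

```python
import numpy as np

def verifiable_maze_reward(trajectory, maze, goal):
    """Binary, objectively checkable reward: 1.0 only if the rollout stays
    on open cells and its final position reaches the goal.

    trajectory: (T, 2) array of (row, col) positions decoded from the video.
    maze:       2D bool array, True where a cell is a wall.
    goal:       (row, col) target cell.
    (Names and the decoding step are illustrative assumptions.)
    """
    trajectory = np.asarray(trajectory, dtype=int)
    for r, c in trajectory:
        # Any step off the grid or into a wall is an automatic failure.
        if not (0 <= r < maze.shape[0] and 0 <= c < maze.shape[1]) or maze[r, c]:
            return 0.0
    # Success is defined purely by reaching the goal cell, so there is no
    # learned proxy score for the policy to game.
    return 1.0 if tuple(trajectory[-1]) == tuple(goal) else 0.0
```

Because the signal comes from checking the task outcome directly rather than from a learned preference model, reward hacking has nothing to latch onto.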

From the abstract

Video generation models produce visually coherent content but struggle with tasks requiring spatial reasoning and multi-step planning. Reinforcement learning (RL) offers a path to improve generalization, but its effectiveness in video reasoning hinges on reward design -- a challenge that has received little systematic study. We investigate this problem by adapting Group Relative Policy Optimization (GRPO) to flow-based video models and training them on maze-solving and robotic navigation tasks.
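For context, the core of GRPO is to sample a group of rollouts per prompt and standardize each reward against the group's own statistics, replacing a learned value critic. A minimal sketch of that advantage computation follows; it is independent of the flow-matching specifics the paper adapts and is not drawn from the authors' code.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each rollout's reward is normalized by the
    mean and std of its own group, so no separate value network is needed.

    rewards: (G,) array of verifiable rewards for G rollouts of one prompt.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four rollouts of the same maze prompt, two of which solve it.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # [ 1. -1.  1. -1.]
```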