AI & ML New Capability

VAMPO optimizes visual dynamics in video models using policy gradients to fix precision-critical errors in robotic manipulation.

March 23, 2026

Original Paper

VAMPO: Policy Optimization for Improving Visual Dynamics in Video Action Models

Zirui Ge, Pengxiang Ding, Baohua Yin, Qishen Wang, Zhiyong Xie, Yemin Wang, Jinbo Wang, Hengtao Li, Runze Suo, Wenxuan Song, Han Zhao, Shangke Lyu, Zhaoxin Fan, Haoang Li, Ran Cheng, Cheng Chi, Huibin Ge, Yaozhi Luo, Donglin Wang

arXiv · 2603.19370

The Takeaway

Standard diffusion-based video predictors are trained with likelihood-style objectives, which reward globally plausible frames but often miss subtle contact physics. By treating denoising as a sequential decision process and optimizing for latent expert rewards, this framework produces video predictions that are physically grounded enough for robot control.
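To make the framing concrete, here is a minimal, hypothetical sketch of "denoising as a sequential decision process" with a score-function (REINFORCE) policy gradient. A 1-D latent is refined over a few steps by a Gaussian policy, and a terminal reward scores the final latent. The linear policy, the `TARGET` value, and the distance-based reward are all illustrative assumptions, not components of the actual VAMPO method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumed, not from the paper): each of T denoising steps
# is an action; the policy mean is theta * (TARGET - x), i.e. a step
# toward a stand-in "physically correct" final latent.
T, TARGET = 5, 1.0

def rollout(theta, sigma=0.3):
    """One denoising trajectory: per-step states, actions, terminal reward."""
    x = rng.normal(0.0, 1.0)              # initial noisy latent
    states, actions = [], []
    for _ in range(T):
        mean = theta * (TARGET - x)       # policy mean: step toward target
        a = mean + sigma * rng.normal()   # stochastic denoising action
        states.append(x)
        actions.append(a)
        x = x + a                         # apply the refinement
    return states, actions, -abs(x - TARGET)  # reward on the final latent

def reinforce_grad(theta, batch=32, sigma=0.3):
    """REINFORCE gradient estimate for theta, with a mean-reward baseline."""
    trajs = [rollout(theta, sigma) for _ in range(batch)]
    baseline = np.mean([r for _, _, r in trajs])
    grad = 0.0
    for states, actions, r in trajs:
        # d/dtheta of log N(a; theta*(TARGET-s), sigma^2), summed over steps
        score = sum((a - theta * (TARGET - s)) * (TARGET - s) / sigma**2
                    for s, a in zip(states, actions))
        grad += (r - baseline) * score
    return grad / batch

theta = 0.5
theta += 0.01 * reinforce_grad(theta)     # one policy-gradient update
```

The point of the sketch is the structure: because every denoising step is an action in a trajectory, a reward on the final prediction can be pushed back through the whole denoising chain with an ordinary policy gradient, rather than through a likelihood objective.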

From the abstract

Video action models are an appealing foundation for Vision-Language-Action systems because they can learn visual dynamics from large-scale video data and transfer this knowledge to downstream robot control. Yet current diffusion-based video predictors are trained with likelihood-surrogate objectives, which encourage globally plausible predictions without explicitly optimizing the precision-critical visual dynamics needed for manipulation. This objective mismatch often leads to subtle errors in …