AI & ML Breaks Assumption

Fast-WAM shows that World Action Models do not actually need to generate future 'imagination' frames at test time to achieve state-of-the-art performance in embodied control.

arXiv · March 18, 2026 · 2603.16666

Tianyuan Yuan, Zibin Dong, Yicheng Liu, Hang Zhao

The Takeaway

The paper shows that while video co-training is essential during the learning phase, the expensive test-time video denoising used by existing WAMs is unnecessary. By removing this step, the authors cut inference latency roughly 4x, to 190 ms, making high-performance world-model-based control viable for real-time robotics.

From the abstract

World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance. […]
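To make the latency argument concrete, here is a minimal, hypothetical sketch (not the paper's code) of why the imagine-then-execute paradigm is expensive: each control step pays for one model call per denoising iteration before an action is decoded, whereas skipping test-time imagination costs a single forward pass. The function names and the step count are illustrative assumptions.

```python
def imagine_then_execute(num_denoise_steps: int) -> int:
    """Count model forward passes for one control step in an
    imagine-then-execute WAM: iterative video denoising, then
    decoding an action from the imagined frames."""
    passes = 0
    for _ in range(num_denoise_steps):  # iterative video denoising
        passes += 1
    passes += 1  # decode an action from the imagined frames
    return passes


def execute_directly() -> int:
    """One forward pass: predict the action without generating frames."""
    return 1


if __name__ == "__main__":
    k = 10  # illustrative denoising-step count, not from the paper
    print(imagine_then_execute(k), "vs", execute_directly())
```

With 10 denoising steps the imagining policy makes 11 model calls per control step versus 1 for the direct policy, which is the kind of gap behind the reported 4x latency reduction.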