AI & ML Efficiency Breakthrough

Synthetic videos of simple geometric shapes are more effective than massive real-world datasets for teaching video-language models fundamental temporal reasoning.

March 19, 2026

Original Paper

Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos

Songtao Jiang, Sibo Song, Chenyi Zhou, Yuan Wang, Ruizhe Chen, Tongkun Guan, Ruilin Luo, Yan Zhang, Zhihang Tang, Yuchong Sun, Hang Zhang, Zhibo Yang, Shuai Bai, Junyang Lin, Zuozhu Liu

arXiv · 2603.17693

The Takeaway

The paper demonstrates that models trained on just 7.7K synthetic samples encoding 'temporal primitives' (speed, direction, state) outperform models trained on much larger real-world datasets such as Video-R1. This shifts the focus from scale to structural curriculum in video reasoning, showing that abstract primitives transfer to real scenarios better than noisy real data does.
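To make the idea concrete, here is a minimal sketch (not the authors' actual pipeline) of how a synthetic clip of a simple geometric shape can carry a ground-truth temporal label: a dot moves at a constant speed in a fixed direction, and the label records those primitives. Answering "how fast, and which way?" requires comparing frames, not inspecting any single keyframe. All names and parameters here are illustrative assumptions.

```python
import numpy as np

def make_motion_clip(num_frames=16, size=64, speed=3, direction=(1, 0), radius=5):
    """Render a toy clip: a bright square dot translating at constant speed.

    Returns (frames, label), where the label encodes the ground-truth
    temporal primitives (speed, direction). No single frame reveals them;
    they are only recoverable by integrating across time.
    """
    frames = np.zeros((num_frames, size, size), dtype=np.uint8)
    dx, dy = direction
    x0, y0 = size // 4, size // 2  # start position
    for t in range(num_frames):
        cx = x0 + t * speed * dx
        cy = y0 + t * speed * dy
        # Paint the dot, clipped to the frame boundary.
        frames[t,
               max(0, cy - radius):cy + radius,
               max(0, cx - radius):cx + radius] = 255
    label = {"speed": speed, "direction": direction}
    return frames, label

frames, label = make_motion_clip()
```

A curriculum built this way can sample speeds, directions, and state changes (e.g. the dot appearing or vanishing) programmatically, which is what makes it cheap to scale and free of the keyframe shortcuts the paper criticizes in real-video datasets.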

From the abstract

The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; and (2) training data genera[…]