AI & ML Scaling Insight

Video fine-tuning consistently degrades static image understanding in multimodal LLMs, revealing a zero-sum trade-off between spatial and temporal capabilities.

March 19, 2026

Original Paper

Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

Linghao Zhang, Jungang Li, Yonghua Hei, Sicheng Tao, Song Dai, Yibo Yan, Zihao Dongfang, Weiting Liu, Chenxi Qin, Hanqian Li, Xin Zou, Jiahao Zhang, Shuhang Xun, Haiyun Jiang, Xuming Hu

arXiv · 2603.17541

The Takeaway

The paper identifies a 'spatial cost' to temporal gains, showing that increasing the video frame budget during SFT often harms performance on fine-grained image benchmarks. This should change how practitioners approach joint image-video training, and it suggests adaptive 'Hybrid-Frame' strategies are necessary (a minimal sketch of one such strategy follows).
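To make the idea concrete, here is a minimal sketch of one plausible hybrid-frame schedule: occasionally collapse a video example to a single frame so the model keeps practicing fine-grained spatial understanding during Video-SFT. The function name and the knobs `max_frames` and `image_mix_ratio` are illustrative assumptions, not the paper's actual recipe.

```python
import random

def sample_frame_budget(is_video: bool,
                        max_frames: int = 32,
                        image_mix_ratio: float = 0.3) -> int:
    """Toy 'hybrid-frame' schedule (hypothetical, not from the paper).

    With probability `image_mix_ratio`, a video clip is trained as a
    single-frame (spatial) example; otherwise it receives a multi-frame
    (temporal) budget.
    """
    if not is_video:
        return 1  # static images always use a single frame
    if random.random() < image_mix_ratio:
        return 1  # spend this step on spatial understanding
    return random.randint(2, max_frames)  # spend the temporal budget

# Example: decide the frame budget for one training sample
budget = sample_frame_budget(is_video=True)
print(f"frames to sample: {budget}")
```

The design intuition, per the takeaway, is that mixing single-frame examples back into Video-SFT counteracts the spatial degradation that a fixed, large frame budget appears to cause.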

From the abstract

Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame …
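For readers unfamiliar with the term, a "frame budget" is simply how many frames of a clip the model sees per training example. Below is a sketch of the common uniform-sampling scheme used to realize such a budget; the paper's actual pipeline may differ, and the function name is an assumption for illustration.

```python
def uniform_frame_indices(num_frames_in_clip: int, budget: int) -> list[int]:
    """Pick `budget` evenly spaced frame indices from a clip.

    A standard sampling scheme for Video-SFT data preparation:
    a larger budget captures more temporal detail at higher cost.
    """
    if budget >= num_frames_in_clip:
        return list(range(num_frames_in_clip))
    step = num_frames_in_clip / budget
    return [int(i * step) for i in range(budget)]

# An 8-frame budget over a 300-frame clip:
print(uniform_frame_indices(300, 8))  # [0, 37, 75, 112, 150, 187, 225, 262]
```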