Video fine-tuning consistently degrades static image understanding in multimodal LLMs, revealing a zero-sum trade-off between spatial and temporal capabilities.
March 19, 2026
Original Paper
Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models
arXiv · 2603.17541
The Takeaway
The paper identifies a 'spatial cost' to temporal gains, showing that increasing video frame budgets during SFT often harms performance on fine-grained image benchmarks. This finding changes how practitioners should approach joint image-video training and suggests that adaptive 'Hybrid-Frame' strategies are necessary.
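To make the 'Hybrid-Frame' idea concrete, here is a minimal sketch of what such a batch sampler might look like: static images are interleaved with videos subsampled to varying frame budgets, so temporal supervision does not crowd out spatial supervision. The `hybrid_frame_batch` function, its `image_ratio` and `frame_budgets` parameters, and the uniform subsampling rule are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch of a "Hybrid-Frame" SFT batch sampler.
# All names and parameters below are illustrative assumptions.
import random
from dataclasses import dataclass

@dataclass
class Sample:
    frames: list  # decoded frames; a single-element list for static images
    text: str     # instruction/response text

def hybrid_frame_batch(
    image_pool: list[Sample],
    video_pool: list[Sample],
    batch_size: int = 8,
    image_ratio: float = 0.5,                      # fraction of image samples per batch
    frame_budgets: tuple[int, ...] = (8, 16, 32),  # candidate per-video budgets
) -> list[Sample]:
    """Mix image and video samples, varying the per-sample frame budget."""
    batch = []
    for _ in range(batch_size):
        if random.random() < image_ratio:
            # Static image: preserves fine-grained spatial supervision.
            batch.append(random.choice(image_pool))
        else:
            sample = random.choice(video_pool)
            budget = random.choice(frame_budgets)
            # Uniformly subsample the video's frames down to the chosen budget.
            step = max(1, len(sample.frames) // budget)
            batch.append(Sample(frames=sample.frames[::step][:budget],
                                text=sample.text))
    return batch
```

Randomizing the frame budget per sample, rather than fixing one large budget, is one plausible way to act on the paper's finding that larger video frame budgets during SFT come at a spatial cost.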
From the abstract
Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame …