AI & ML Scaling Insight

Context-aware Visual Fine-tuning (CoVFT) allows a 7B MLLM to outperform its 13B counterpart by resolving optimization conflicts in vision encoders.

March 24, 2026

Original Paper

CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models

Nan Zhou, Huiqun Wang, Yaoyan Zheng, Di Huang

arXiv · 2603.21077

The Takeaway

The paper answers the 'frozen vs. fine-tuned' vision-encoder debate by showing that vanilla fine-tuning is unstable because its updates are context-agnostic. By routing visual features through a Contextual Mixture-of-Experts, practitioners can unlock significantly better performance from smaller vision-language models without increasing parameter counts.
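The paper's exact architecture is not reproduced here, but the core idea of context-conditioned routing can be sketched. Below is a minimal, hypothetical contextual mixture-of-experts layer: gating weights are computed from a context embedding (e.g. a pooled instruction vector) rather than from the visual tokens themselves, and they mix the outputs of several expert projections over the visual features. All names and dimensions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class ContextualMoE:
    """Hypothetical sketch: route visual tokens through K expert
    projections, with gates conditioned on a context embedding
    (the 'context-aware' part), not on the visual tokens."""

    def __init__(self, d_vis, d_ctx, n_experts, seed=0):
        rng = np.random.default_rng(seed)
        # one projection matrix per expert: (K, d_vis, d_vis)
        self.experts = rng.normal(0.0, 0.02, (n_experts, d_vis, d_vis))
        # gating network maps context vector -> K routing logits
        self.gate = rng.normal(0.0, 0.02, (d_ctx, n_experts))

    def __call__(self, vis_tokens, ctx):
        # gates: (K,) -- one routing decision per input context
        gates = softmax(ctx @ self.gate)
        # expert_outs: (K, n_tokens, d_vis)
        expert_outs = np.einsum('td,kde->kte', vis_tokens, self.experts)
        # convex combination of expert outputs, weighted by the gates
        return np.einsum('k,kte->te', gates, expert_outs)

moe = ContextualMoE(d_vis=8, d_ctx=4, n_experts=3)
out = moe(np.ones((5, 8)), np.ones(4))   # 5 visual tokens, one context
```

In this sketch, different contexts select different mixtures of experts, so conflicting update directions can be absorbed by separate experts instead of fighting over one shared encoder, which is one plausible reading of how such a design resolves optimization conflicts.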

From the abstract

Multimodal large language models (MLLMs) achieve remarkable progress in cross-modal perception and reasoning, yet a fundamental question remains unresolved: should the vision encoder be fine-tuned or frozen? Despite the success of models such as LLaVA and Qwen-VL, inconsistent design choices and heterogeneous training setups hinder a unified understanding of visual fine-tuning (VFT) in MLLMs. Through a configuration-aligned benchmark, we find that existing VFT methods fail to consistently outperform …