Context-aware Visual Fine-tuning (CoVFT) allows a 7B MLLM to outperform its 13B counterpart by resolving optimization conflicts in vision encoders.
March 24, 2026
Original Paper
CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models
arXiv · 2603.21077
The Takeaway
The paper tackles the 'frozen vs. fine-tuned' vision-encoder debate, showing that vanilla fine-tuning is unstable because its updates are context-agnostic. By routing updates through a Contextual Mixture-of-Experts, practitioners can unlock significantly better performance from smaller vision-language models without increasing parameter counts.
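To make the idea concrete, here is a minimal sketch of context-aware expert routing: vision features are updated by a residual mixture of small expert adapters, with the gate conditioned on a text-context embedding rather than the image alone. All names (`contextual_moe`, `W_gate`, the adapter shapes) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 16, 4

# Hypothetical expert adapters: each a small residual linear map on vision features.
experts = [rng.normal(scale=0.02, size=(d, d)) for _ in range(n_experts)]

# Gating weights conditioned on the *context* embedding (e.g., pooled text),
# not on the vision features alone -- the "context-aware" part.
W_gate = rng.normal(scale=0.1, size=(d, n_experts))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def contextual_moe(vision_feat, context_emb):
    """Route a vision feature through experts weighted by the text context."""
    gate = softmax(context_emb @ W_gate)                      # (n_experts,)
    delta = sum(g * (E @ vision_feat) for g, E in zip(gate, experts))
    return vision_feat + delta                                # residual update

v = rng.normal(size=d)   # one vision token
c = rng.normal(size=d)   # one context embedding
out = contextual_moe(v, c)
```

Because only the gate and the small adapters are trained, each context steers a different combination of the same fixed expert pool, which is one plausible way to keep parameter counts flat while avoiding a single context-agnostic update direction.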
From the abstract
Multimodal large language models (MLLMs) achieve remarkable progress in cross-modal perception and reasoning, yet a fundamental question remains unresolved: should the vision encoder be fine-tuned or frozen? Despite the success of models such as LLaVA and Qwen-VL, inconsistent design choices and heterogeneous training setups hinder a unified understanding of visual fine-tuning (VFT) in MLLMs. Through a configuration-aligned benchmark, we find that existing VFT methods fail to consistently outperform …