AI & ML Scaling Insight

Discovers that language-centric training in Multimodal LLMs actively degrades their internal visual representation quality.

March 24, 2026

Original Paper

Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models

Enguang Wang, Qiang Wang, Yuanchen Wu, Ke Yan, Xinbin Yuan, Shouhong Ding, Xialei Liu, Ming-Ming Cheng

arXiv · 2603.20808

The Takeaway

This paper identifies 'visual representation degradation' as a primary reason MLLMs underperform on purely visual tasks. It introduces Predictive Regularization (PRe) to preserve visual features during training, offering a way to scale multimodal models without sacrificing core visual competence.
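The paper's actual PRe objective is not reproduced here, but the core idea of a feature-preservation regularizer can be sketched. The snippet below is a hypothetical illustration, not the authors' implementation: an auxiliary loss that penalizes drift between the mid-layer visual token features inside the LLM and the frozen features produced by the original vision encoder, using per-patch cosine distance. The function name and shapes are assumptions for the sake of the example.

```python
import numpy as np

def feature_preservation_loss(mid_layer_feats, initial_feats, eps=1e-8):
    """Hypothetical regularizer: mean cosine distance between each
    mid-layer visual token feature and its frozen initial counterpart.
    Both inputs have shape (num_patches, dim)."""
    a = mid_layer_feats / (np.linalg.norm(mid_layer_feats, axis=-1, keepdims=True) + eps)
    b = initial_feats / (np.linalg.norm(initial_feats, axis=-1, keepdims=True) + eps)
    cos_sim = np.sum(a * b, axis=-1)      # per-patch similarity in [-1, 1]
    return float(np.mean(1.0 - cos_sim))  # 0 when features are fully preserved

# Perfectly preserved features incur no penalty; degraded (noisy) ones do.
rng = np.random.default_rng(0)
init = rng.standard_normal((16, 64))
print(feature_preservation_loss(init, init))        # ~0: no degradation
print(feature_preservation_loss(init + rng.standard_normal((16, 64)), init))  # > 0
```

Adding such a term to the standard language-modeling loss would, in principle, push training toward solutions that keep the visual representation intact, which is the trade-off the paper targets.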

From the abstract

While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost of their language-driven training on internal visual foundational competence remains unclear. In this paper, we conduct a detailed diagnostic analysis to unveil a pervasive issue: visual representation degradation in MLLMs. Specifically, we find that compared to the initial visual features, the visual representations in the middle layers of the LLM exhibit degradation in both global function and patch structure.