AI & ML · Breaks Assumption

Identifies that the 'reasoning tax' in vision-language fine-tuning is caused by lost access to depth-wise representations and fixes it with a lightweight adapter.

March 30, 2026

Original Paper

Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation

Yiming Ren, Yujiu Yang, Junjie Wang

arXiv · 2603.26330

The Takeaway

The paper shows that fine-tuning for perception often breaks the model's ability to retrieve reasoning-critical features from different layers. By adding Input-Adaptive Depth Aggregation (IADA), practitioners can improve both perception and reasoning simultaneously, avoiding the typical trade-off.
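The mechanism can be pictured as a small gating adapter that assigns each input its own mixing weights over the per-layer hidden states. This is a minimal sketch, not the paper's implementation: the shapes, the gate design (a single linear map read from the top-layer state), and all names here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: L layers, batch of B tokens, hidden size D.
L, B, D = 12, 4, 64
hidden_states = rng.standard_normal((L, B, D))  # per-layer representations

# Hypothetical lightweight gating adapter: maps the top-layer state to one
# weight per depth, so each input mixes the layers differently.
W_gate = rng.standard_normal((D, L)) * 0.02

def input_adaptive_depth_aggregation(states):
    top = states[-1]                 # (B, D): top-layer state drives the gate
    weights = softmax(top @ W_gate)  # (B, L): input-dependent depth weights
    # Convex combination over depth: (B, L) x (L, B, D) -> (B, D)
    return np.einsum('bl,lbd->bd', weights, states)

out = input_adaptive_depth_aggregation(hidden_states)
print(out.shape)  # (4, 64)
```

Because the weights are a softmax, the aggregated state is a convex combination of the layer states, so the adapter can only reweight depth-wise information, not invent new features.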

From the abstract

Supervised fine-tuning (SFT) on visual instruction data often improves perceptual capabilities in vision-language models (VLMs) while degrading reasoning performance, creating a persistent reasoning tax during post-training. We investigate whether this degradation is related to disrupted access to depth-wise representations, and find that even fixed cross-depth aggregation substantially restores reasoning, suggesting that preserved cross-depth access is an important missing factor in VLM fine-tuning.
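The "fixed cross-depth aggregation" the abstract contrasts against can be pictured as constant per-layer weights shared by every input — here a uniform average, under the same assumed shapes as above; this is an illustrative sketch, not the paper's configuration.

```python
import numpy as np

# Hypothetical shapes: L layers, batch of B tokens, hidden size D.
L, B, D = 12, 4, 64
states = np.random.default_rng(1).standard_normal((L, B, D))

# Fixed aggregation: one constant weight per layer (here uniform),
# identical for all inputs -- no gating network involved.
fixed_weights = np.full(L, 1.0 / L)
fixed_out = np.einsum('l,lbd->bd', fixed_weights, states)  # (B, D)
```

The abstract's point is that even this input-independent mixing restores much of the lost reasoning, which is what implicates cross-depth access, rather than any particular adaptive mechanism, as the missing ingredient.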