The 'Scaffold Effect' reveals that Vision-Language Models in clinical settings often fabricate reasoning based on prompt framing rather than actual visual data.
March 31, 2026
Original Paper
The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation
arXiv · 2603.28387
The Takeaway
The study finds that merely mentioning MRI availability in a prompt causes models to 'see' signals that aren't there, leading to false performance gains. This challenges the validity of current multimodal benchmarks in high-stakes domains and highlights a severe form of modality collapse.
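The framing artifact described above can be made concrete with a minimal ablation: run the same model on the same cases under a plain prompt and a prompt that merely mentions MRI availability, and measure the accuracy gap. The sketch below is purely illustrative and not from the paper; `scaffold_gain` and `mock_predict` are hypothetical names, and the mock model deliberately ignores the image so that any "gain" comes from the prompt wording alone.

```python
def scaffold_gain(predict, cases, plain_prompt, scaffold_prompt):
    """Accuracy difference between a scaffold-framed and a plain prompt.

    On data with no individual-level diagnostic signal, a positive gain
    suggests the model is reacting to the framing, not the image.
    """
    def accuracy(prompt):
        correct = sum(predict(prompt, image) == label for image, label in cases)
        return correct / len(cases)

    return accuracy(scaffold_prompt) - accuracy(plain_prompt)


# Hypothetical mock model: it never looks at the image. Mentioning "MRI"
# in the prompt flips its default answer to the majority class.
def mock_predict(prompt, image):
    return 1 if "MRI" in prompt else 0


# Ten signal-free cases with a 70/30 label imbalance and no real images.
cases = [(None, 1)] * 7 + [(None, 0)] * 3

gain = scaffold_gain(
    mock_predict, cases,
    plain_prompt="Classify the patient.",
    scaffold_prompt="An MRI scan is available. Classify the patient.",
)
print(round(gain, 2))  # prints 0.4: a 40-point "gain" with zero image use
```

Here the apparent improvement is entirely an artifact of prompt framing interacting with label imbalance, which is the failure mode the paper's signal-free cohorts are designed to expose.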
From the abstract
Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to