AI & ML · Breaks Assumption

The 'Scaffold Effect' reveals that Vision-Language Models in clinical settings often fabricate reasoning based on prompt framing rather than actual visual data.

March 31, 2026

Original Paper

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

Doan Nam Long Vu, Simone Balloccu

arXiv · 2603.28387

The Takeaway

The study finds that merely mentioning MRI availability in a prompt causes models to 'see' signals that aren't there, leading to false performance gains. This challenges the validity of current multimodal benchmarks in high-stakes domains and highlights a severe form of modality collapse.

From the abstract

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to …
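The evaluation design described above can be illustrated with a minimal framing ablation: score the same model on the same cases under two prompts that differ only in whether they mention MRI availability. Since the MRI carries no reliable individual-level signal, any accuracy shift between the two framings is a prompt artifact rather than evidence integration. This is a hypothetical sketch, not the authors' code; the prompt wording and the `predict` callback are assumptions.

```python
def framing_ablation(model, images, labels, predict):
    """Compare accuracy under a neutral prompt vs. a 'scaffolded' prompt
    that mentions MRI availability. `predict(model, image, prompt)` is a
    user-supplied callback returning a 0/1 prediction (hypothetical API)."""
    neutral = "Classify this case as positive or negative."
    scaffold = ("A structural MRI scan is available for this case. "
                "Classify this case as positive or negative.")

    def accuracy(prompt):
        # Fraction of cases where the model's prediction matches the label.
        correct = sum(predict(model, img, prompt) == y
                      for img, y in zip(images, labels))
        return correct / len(labels)

    # If the imaging carries no diagnostic signal, these two numbers should
    # not differ systematically; a gap is the "scaffold effect".
    return {"neutral": accuracy(neutral), "scaffold": accuracy(scaffold)}
```

In practice one would run this over a held-out cohort with several prompt paraphrases per condition, so that a framing-driven gap can be separated from ordinary prompt-sensitivity noise.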