Re-evaluating high-profile medical AI safety claims reveals that reported triage failures were artifacts of the 'exam-style' evaluation format rather than model incapacity.
arXiv · March 13, 2026 · 2603.11413
Why it matters
This is a significant methodological correction to the AI safety literature. It demonstrates that forced-choice (A/B/C/D) evaluation scaffolds can mask correct model reasoning and produce misleading safety conclusions, necessitating a shift toward naturalistic testing in high-stakes domains.
From the abstract
Ramaswamy et al. reported in Nature Medicine that ChatGPT Health under-triages 51.6% of emergencies, concluding that consumer-facing AI triage poses safety risks. However, their evaluation used an exam-style protocol (forced A/B/C/D output, knowledge suppression, and suppression of clarifying questions) that differs fundamentally from how consumers use health chatbots. We tested five frontier LLMs (GPT-5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Flash, Gemini 3.1 Pro) on a 17
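To make the methodological contrast concrete, here is a minimal sketch of the two prompting scaffolds the abstract distinguishes. The vignette text and the query_model placeholder are illustrative assumptions, not material from the study; the exam-style prompt forces a bare letter answer and forbids clarifying questions, while the naturalistic prompt mirrors how a consumer might actually ask.

```python
# Sketch only: contrasts an exam-style forced-choice scaffold with a
# naturalistic consumer-style prompt. query_model is a hypothetical
# placeholder for whatever chat-completion client you use.

def query_model(prompt: str) -> str:
    """Placeholder for a call to a chat model; wire up a real client here."""
    raise NotImplementedError("connect this to your model API")

# Illustrative vignette (not from the paper's dataset).
VIGNETTE = (
    "A 58-year-old has sudden crushing chest pain radiating to the left arm, "
    "sweating, and shortness of breath for the past 20 minutes."
)

# Exam-style scaffold: single-letter output, no reasoning, no clarifying questions.
exam_prompt = (
    f"{VIGNETTE}\n"
    "Choose the triage level. Respond with exactly one letter and nothing else.\n"
    "A) Self-care at home\n"
    "B) See a doctor within a week\n"
    "C) Urgent care within 24 hours\n"
    "D) Call emergency services now"
)

# Naturalistic scaffold: the same scenario phrased as a consumer question,
# leaving the model free to reason, ask follow-ups, and give safety-netting advice.
natural_prompt = (
    f"My father is having these symptoms right now: {VIGNETTE} "
    "What should we do?"
)

if __name__ == "__main__":
    for label, prompt in [("exam-style", exam_prompt), ("naturalistic", natural_prompt)]:
        print(f"--- {label} ---")
        print(prompt)
        # print(query_model(prompt))  # uncomment once a model client is connected
```

Under this framing, the paper's claim is that the same underlying model can fail the first scaffold while handling the second safely, so the scaffold, not the model, drives the reported under-triage rate.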