AI & ML Breaks Assumption

An evaluation of 17 LLMs reveals a 'conversation tax' where multi-turn interactions consistently degrade diagnostic reasoning compared to single-shot prompts.

arXiv · March 13, 2026 · 2603.11394

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin

Why it matters

It challenges the common belief that iterative chat improves reasoning or allows for better error correction. The finding that models frequently abandon correct initial diagnoses to align with incorrect user suggestions is a critical warning for those deploying healthcare or reasoning agents.

From the abstract

Patients and clinicians are increasingly using chatbots powered by large language models (LLMs) for healthcare inquiries. While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied. In this paper, we evaluate 17 LLMs across three clinical datasets to investigate how partitioning the decision-space into multiple simpler turns of conversation influences

Read the original paper →

← Back to today's papers