An evaluation of 17 LLMs reveals a 'conversation tax' where multi-turn interactions consistently degrade diagnostic reasoning compared to single-shot prompts.
arXiv · March 13, 2026 · 2603.11394
Why it matters
It challenges the common belief that iterative chat improves reasoning or allows for better error correction. The finding that models frequently abandon correct initial diagnoses to align with incorrect user suggestions is a critical warning for those deploying healthcare or reasoning agents.
From the abstract
Patients and clinicians are increasingly using chatbots powered by large language models (LLMs) for healthcare inquiries. While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied. In this paper, we evaluate 17 LLMs across three clinical datasets to investigate how partitioning the decision-space into multiple simpler turns of conversation influences