Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin

arXiv:2603.11394v14.31 citationsh-index: 57

Predicted impact top 84% in CL · last 90 daysOriginality Incremental advance

AI Analysis

This highlights a critical limitation for using LLMs in real-world healthcare conversations, where incremental interactions can lead to unsafe diagnostic errors.

The study found that multi-turn conversations degrade the diagnostic reasoning of large language models (LLMs) compared to single-shot baselines, with models often abandoning correct diagnoses to align with incorrect user suggestions.

Patients and clinicians are increasingly using chatbots powered by large language models (LLMs) for healthcare inquiries. While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied. In this paper, we evaluate 17 LLMs across three clinical datasets to investigate how partitioning the decision-space into multiple simpler turns of conversation influences their diagnostic reasoning. Specifically, we develop a "stick-or-switch" evaluation framework to measure model conviction (i.e., defending a correct diagnosis or safe abstention against incorrect suggestions) and flexibility (i.e., recognizing a correct suggestion when it is introduced) across conversations. Our experiments reveal the conversation tax, where multi-turn interactions consistently degrade performance when compared to single-shot baselines. Notably, models frequently abandon initial correct diagnoses and safe abstentions to align with incorrect user suggestions. Additionally, several models exhibit blind switching, failing to distinguish between signal and incorrect suggestions.

View on arXiv PDF

Similar