This Treatment Works, Right? Evaluating LLM Sensitivity to Patient Question Framing in Medical QA
This work addresses inconsistent LLM responses in high-stakes medical settings, where variations in phrasing can lead to unreliable advice. The contribution is incremental, since it builds on known issues of prompt sensitivity.
The study investigates how sensitive large language models (LLMs) are to patient query phrasing in medical question answering. Evaluating 6,614 query pairs across eight LLMs, it finds that positively- and negatively-framed question pairs are significantly more likely to produce contradictory conclusions than same-framing pairs, and that this effect is amplified in multi-turn conversations.
Patients are increasingly turning to large language models (LLMs) with medical questions that are complex and difficult to articulate clearly. However, LLMs are sensitive to prompt phrasings and can be influenced by the way questions are worded. Ideally, LLMs should respond consistently regardless of phrasing, particularly when grounded in the same underlying evidence. We investigate this through a systematic evaluation in a controlled retrieval-augmented generation (RAG) setting for medical question answering (QA), where expert-selected documents are used rather than retrieved automatically. We examine two dimensions of patient query variation: question framing (positive vs. negative) and language style (technical vs. plain language). We construct a dataset of 6,614 query pairs grounded in clinical trial abstracts and evaluate response consistency across eight LLMs. Our findings show that positively- and negatively-framed pairs are significantly more likely to produce contradictory conclusions than same-framing pairs. This framing effect is further amplified in multi-turn conversations, where sustained persuasion increases inconsistency. We find no significant interaction between framing and language style. Our results demonstrate that LLM responses in medical QA can be systematically influenced through query phrasing alone, even when grounded in the same evidence, highlighting the importance of phrasing robustness as an evaluation criterion for RAG-based systems in high-stakes settings.
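The core comparison described above (contradiction rates for opposite-framing pairs versus same-framing pairs) can be illustrated with a minimal sketch. The abstract does not state how contradictions were labeled or which significance test was used, so everything here is an assumption for illustration: the `PairResult` record, the `"opposite"`/`"same"` labels, the toy counts, and the choice of a pooled two-proportion z-test are hypothetical and are not the authors' method or data.

```python
import math
from dataclasses import dataclass


@dataclass
class PairResult:
    """One evaluated query pair (hypothetical record layout)."""
    framing: str          # "opposite" (positive vs. negative) or "same"
    contradictory: bool   # True if the two responses reach opposing conclusions


def contradiction_rate(results, framing):
    """Return (rate, count) of contradictory pairs within one framing group."""
    group = [r for r in results if r.framing == framing]
    if not group:
        raise ValueError(f"no pairs with framing {framing!r}")
    return sum(r.contradictory for r in group) / len(group), len(group)


def two_proportion_z_test(p1, n1, p2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Toy data, invented purely for illustration (not results from the paper).
results = (
    [PairResult("opposite", i < 30) for i in range(100)]
    + [PairResult("same", i < 10) for i in range(100)]
)

rate_opp, n_opp = contradiction_rate(results, "opposite")
rate_same, n_same = contradiction_rate(results, "same")
z, p_value = two_proportion_z_test(rate_opp, n_opp, rate_same, n_same)
print(f"opposite={rate_opp:.2f} same={rate_same:.2f} z={z:.2f} p={p_value:.4f}")
```

A real analysis would likely need to account for the same underlying evidence and the same model appearing in many pairs (for example with a mixed-effects model or a paired test); the unpaired z-test is used here only because it keeps the sketch short.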