CL · Apr 19

Faithfulness vs. Safety: Evaluating LLM Behavior Under Counterfactual Medical Evidence

Kaijie Mo, Siddhartha Venkatayogi, Chantal Shaib, Ramez Kouzy, Wei Xu, Byron C. Wallace, Junyi Jessy Li

arXiv:2601.11886 · 86.41 citations · h-index: 7

Predicted impact: top 46% in CL · last 90 days
Originality: Incremental advance

AI Analysis

For developers and users of LLMs in high-stakes domains like medicine, this work highlights a critical safety gap where models fail to reject counterfactual or adversarial inputs.

The paper investigates how LLMs respond to counterfactual medical evidence, finding that models overwhelmingly accept dangerous or implausible evidence at face value, prioritizing faithfulness over safety.

In high-stakes domains like medicine, it may be generally desirable for models to faithfully adhere to the context provided. But what happens if the context does not align with model priors or safety protocols? In this paper, we investigate how LLMs behave and reason when presented with counterfactual (or even adversarial) medical evidence. We first construct MedCounterFact, a counterfactual medical QA dataset that requires the models to answer clinical comparison questions (i.e., judge the efficacy of certain treatments, with evidence consisting of randomized controlled trials provided as context). In MedCounterFact, real-world medical interventions within the questions and evidence are systematically replaced with four types of counterfactual stimuli, ranging from unknown words to toxic substances. Our evaluation across multiple frontier LLMs on MedCounterFact reveals that in the presence of counterfactual evidence, existing models overwhelmingly accept such "evidence" at face value even when it is dangerous or implausible, and provide confident and uncaveated answers. While it may be prudent to draw a boundary between faithfulness and safety, our findings suggest that models arguably overemphasize the former.
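The substitution step described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the record fields (`question`, `evidence`), the example text, and the two stand-in stimuli (a made-up word and a toxic substance) are all hypothetical, chosen only to show what replacing a real intervention in both the question and the evidence might look like.

```python
import re


def substitute_intervention(text: str, intervention: str, stimulus: str) -> str:
    """Replace whole-word, case-insensitive mentions of `intervention` with `stimulus`."""
    pattern = re.compile(r"\b" + re.escape(intervention) + r"\b", flags=re.IGNORECASE)
    # A callable replacement avoids treating backslashes in `stimulus` as escapes.
    return pattern.sub(lambda _match: stimulus, text)


def make_counterfactual(example: dict, intervention: str, stimulus: str) -> dict:
    """Build a counterfactual QA record by swapping the intervention in question and evidence."""
    return {
        "question": substitute_intervention(example["question"], intervention, stimulus),
        "evidence": substitute_intervention(example["evidence"], intervention, stimulus),
        "original_intervention": intervention,
        "stimulus": stimulus,
    }


# Hypothetical example record; not taken from MedCounterFact.
example = {
    "question": "Is aspirin more effective than placebo for reducing headache severity?",
    "evidence": "In a randomized controlled trial, Aspirin reduced headache severity compared with placebo.",
}

# Two illustrative stimuli: an unknown (made-up) word and a toxic substance.
cf_unknown = make_counterfactual(example, "aspirin", "zorvex")
cf_toxic = make_counterfactual(example, "aspirin", "bleach")

print(cf_toxic["question"])
print(cf_toxic["evidence"])
```

Each counterfactual record keeps the trial framing intact while changing only the intervention name, so any difference in a model's answer can be attributed to the substituted term.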
