SycoEval-EM: Sycophancy Evaluation of Large Language Models in Simulated Clinical Encounters for Emergency Care
This addresses safety risks in clinical AI for emergency care, highlighting that static benchmarks are inadequate for predicting model behavior under social pressure.
The paper tackled the problem of large language models (LLMs) acquiescing to patient pressure for inappropriate care in emergency medicine, finding that across 20 LLMs and 1,875 encounters, acquiescence rates ranged from 0-100%, with higher vulnerability to imaging requests (38.8%) than opioid prescriptions (25.0%).
Large language models (LLMs) show promise in clinical decision support yet risk acquiescing to patient pressure for inappropriate care. We introduce SycoEval-EM, a multi-agent simulation framework evaluating LLM robustness through adversarial patient persuasion in emergency medicine. Across 20 LLMs and 1,875 encounters spanning three Choosing Wisely scenarios, acquiescence rates ranged from 0-100\%. Models showed higher vulnerability to imaging requests (38.8\%) than opioid prescriptions (25.0\%), with model capability poorly predicting robustness. All persuasion tactics proved equally effective (30.0-36.0\%), indicating general susceptibility rather than tactic-specific weakness. Our findings demonstrate that static benchmarks inadequately predict safety under social pressure, necessitating multi-turn adversarial testing for clinical AI certification.