AI · Jun 20, 2025

The MedPerturb Dataset: What Non-Content Perturbations Reveal About Human and Clinical LLM Decision Making

Abinitha Gourabathina, Yuexing Hao, Walter Gerych, Marzyeh Ghassemi

arXiv:2506.17163v1 · 4 citations · h-index: 7

Originality: Incremental advance

AI Analysis

This work addresses the need for better evaluation frameworks for the safe deployment of medical LLMs in clinical settings. The advance is incremental, since it builds on existing robustness concerns.

The paper introduces the MedPerturb dataset to evaluate the clinical robustness of medical LLMs. It finds that LLMs are more sensitive to gender and style perturbations, while human annotators are more sensitive to format changes, which highlights how the two differ in decision-making under real-world variability.

Clinical robustness is critical to the safe deployment of medical Large Language Models (LLMs), but key questions remain about how LLMs and humans may differ in response to the real-world variability typified by clinical settings. To address this, we introduce MedPerturb, a dataset designed to systematically evaluate medical LLMs under controlled perturbations of clinical input. MedPerturb consists of clinical vignettes spanning a range of pathologies, each transformed along three axes: (1) gender modifications (e.g., gender-swapping or gender-removal); (2) style variation (e.g., uncertain phrasing or colloquial tone); and (3) format changes (e.g., LLM-generated multi-turn conversations or summaries). With MedPerturb, we release a dataset of 800 clinical contexts grounded in realistic input variability, outputs from four LLMs, and three human expert reads per clinical context. We use MedPerturb in two case studies to reveal how shifts in gender identity cues, language style, or format reflect diverging treatment selections between humans and LLMs. We find that LLMs are more sensitive to gender and style perturbations while human annotators are more sensitive to LLM-generated format perturbations such as clinical summaries. Our results highlight the need for evaluation frameworks that go beyond static benchmarks to assess the similarity between human clinician and LLM decisions under the variability characteristic of clinical settings.
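One way to picture the comparison the abstract describes is a per-axis "flip rate": how often a reader's treatment selection on a perturbed vignette differs from that same reader's selection on the unperturbed one. The sketch below is only an illustration under stated assumptions. The `Read` record, its field names, the `"baseline"` axis label, and the `flip_rate` function are all hypothetical; they are not the released dataset schema or the authors' metric, and the toy records are invented solely to exercise the code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """One decision on one clinical context (hypothetical schema)."""
    vignette_id: str  # assumed identifier of the underlying vignette
    axis: str         # "baseline", "gender", "style", or "format" (assumed labels)
    reader: str       # e.g. "llm" or "human" (assumed labels)
    decision: str     # treatment selection, as an opaque label


def flip_rate(reads):
    """Per (reader, axis): fraction of perturbed reads whose decision differs
    from the same reader's baseline decision on the same vignette."""
    baseline = {
        (r.vignette_id, r.reader): r.decision
        for r in reads
        if r.axis == "baseline"
    }
    flips = defaultdict(int)
    totals = defaultdict(int)
    for r in reads:
        if r.axis == "baseline":
            continue
        key = (r.vignette_id, r.reader)
        if key not in baseline:
            continue  # no unperturbed read to compare against
        totals[(r.reader, r.axis)] += 1
        if r.decision != baseline[key]:
            flips[(r.reader, r.axis)] += 1
    return {k: flips[k] / totals[k] for k in totals}


# Invented toy records, not drawn from MedPerturb.
toy_reads = [
    Read("v1", "baseline", "llm", "A"),
    Read("v1", "gender", "llm", "B"),
    Read("v1", "format", "llm", "A"),
    Read("v1", "baseline", "human", "A"),
    Read("v1", "gender", "human", "A"),
    Read("v1", "format", "human", "B"),
]
rates = flip_rate(toy_reads)
```

Comparing `rates[("llm", axis)]` with `rates[("human", axis)]` for each axis gives a simple side-by-side view of sensitivity. A real analysis would also need to handle multiple reads per context (the abstract mentions three human expert reads and four LLMs), which this sketch leaves out.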

