CL · Feb 4

Evaluating the Presence of Sex Bias in Clinical Reasoning by Large Language Models

arXiv:2602.04392v1 · h-index: 3
AI Analysis

This paper addresses potential bias amplification in healthcare AI systems, which could affect patient care and safety. The contribution is incremental: it builds on existing concerns about bias in LLMs.

The study evaluated sex bias in clinical reasoning by large language models (LLMs) using clinician-authored vignettes. ChatGPT, Claude, DeepSeek, and Gemini all exhibited significant, model-specific sex-assignment skews: at temperature 0.5, female assignment ranged from 36% (Gemini) to 70% (ChatGPT).

Large language models (LLMs) are increasingly embedded in healthcare workflows for documentation, education, and clinical decision support. However, these systems are trained on large text corpora that encode existing biases, including sex disparities in diagnosis and treatment, raising concerns that such patterns may be reproduced or amplified. We systematically examined whether contemporary LLMs exhibit sex-specific biases in clinical reasoning and how model configuration influences these behaviours. We conducted three experiments using 50 clinician-authored vignettes spanning 44 specialties in which sex was non-informative to the initial diagnostic pathway. Four general-purpose LLMs were evaluated: ChatGPT (gpt-4o-mini), Claude 3.7 Sonnet, Gemini 2.0 Flash and DeepSeek-chat. All models demonstrated significant sex-assignment skew, with predicted sex differing by model. At temperature 0.5, ChatGPT assigned female sex in 70% of cases (95% CI 0.66-0.75), DeepSeek in 61% (0.57-0.65) and Claude in 59% (0.55-0.63), whereas Gemini showed a male skew, assigning female sex in 36% of cases (0.32-0.41). Contemporary LLMs exhibit stable, model-specific sex biases in clinical reasoning. Permitting abstention reduces explicit labelling but does not eliminate downstream diagnostic differences. Safe clinical integration requires conservative and documented configuration, specialty-level clinical data auditing, and continued human oversight when deploying general-purpose models in healthcare settings.
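
To make the reported skews concrete, the sketch below shows one common way to put a 95% interval around a female-assignment proportion and to check whether it excludes parity (50%). It uses the Wilson score interval, which is an assumption: the abstract does not say which interval method the authors used. The sample size `HYPOTHETICAL_N` is also invented for illustration and is not taken from the paper, so the printed intervals should not be expected to reproduce the published ones.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


def excludes_parity(successes: int, n: int) -> bool:
    """True when the interval does not contain 0.5, i.e. a skew toward one sex."""
    lo, hi = wilson_interval(successes, n)
    return not (lo <= 0.5 <= hi)


# ASSUMPTION: the number of sex assignments per model is not given in the
# abstract; 400 is a made-up value used only to illustrate the calculation.
HYPOTHETICAL_N = 400

# Female-assignment rates as reported in the abstract (temperature 0.5).
reported_female_rate = {"ChatGPT": 0.70, "DeepSeek": 0.61, "Claude": 0.59, "Gemini": 0.36}

for model, rate in reported_female_rate.items():
    females = round(rate * HYPOTHETICAL_N)
    lo, hi = wilson_interval(females, HYPOTHETICAL_N)
    direction = "female" if rate > 0.5 else "male"
    flag = f"{direction} skew" if excludes_parity(females, HYPOTHETICAL_N) else "no clear skew"
    print(f"{model}: {rate:.0%} female, interval ({lo:.2f}, {hi:.2f}), {flag}")
```

Under these assumptions, an interval lying entirely above 0.5 indicates a female skew and one lying entirely below indicates a male skew, which is the pattern the abstract describes for Gemini versus the other three models.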

