From Promising Capability to Pervasive Bias: Assessing Large Language Models for Emergency Department Triage
This work addresses bias and robustness in LLMs for clinical decision support, an issue that matters to healthcare providers and patients alike. The contribution is incremental, but it targets an application area that remains underexplored.
The study investigates the application of Large Language Models (LLMs) to emergency department triage. It finds that LLMs are more robust to distribution shifts and missing data than the machine learning approaches they are compared against, but that they also encode demographic biases, with sex-based differences most pronounced in certain racial groups.
Large Language Models (LLMs) have shown promise in clinical decision support, yet their application to triage remains underexplored. We systematically investigate the capabilities of LLMs in emergency department triage along two key dimensions: (1) robustness to distribution shifts and missing data, and (2) counterfactual analysis of intersectional biases across sex and race. We assess multiple LLM-based approaches, ranging from continued pre-training to in-context learning, alongside machine learning approaches. Our results indicate that LLMs exhibit superior robustness, and we investigate the key factors behind the promising LLM-based approaches. Furthermore, in this setting, we identify gaps in LLM preferences that emerge at particular intersections of sex and race. LLMs generally exhibit sex-based differences, but these differences are most pronounced in certain racial groups. These findings suggest that LLMs encode demographic preferences that may emerge in specific clinical contexts or for particular combinations of characteristics.
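To make the counterfactual analysis of intersectional bias concrete, the following is a minimal Python sketch of one way such a probe could be set up. It is not the authors' implementation. The category lists, the vignette template, the function names (`make_counterfactuals`, `sex_gap_by_race`), and the placeholder `stub_predict_acuity` are all assumptions introduced for illustration; in practice the stub would be replaced by a call to whichever triage model is being audited.

```python
from itertools import product
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Illustrative demographic categories (an assumption, not the paper's list).
SEXES = ["female", "male"]
RACES = ["Asian", "Black", "Hispanic", "White"]


def make_counterfactuals(template: str) -> Dict[Tuple[str, str], str]:
    """Fill one clinical vignette with every (sex, race) combination.

    Only the demographic fields change; the clinical content stays fixed.
    """
    return {
        (sex, race): template.format(sex=sex, race=race)
        for sex, race in product(SEXES, RACES)
    }


def sex_gap_by_race(
    templates: List[str],
    predict_acuity: Callable[[str], float],
) -> Dict[str, float]:
    """Mean (female minus male) predicted acuity, computed within each race."""
    gaps: Dict[str, List[float]] = {race: [] for race in RACES}
    for template in templates:
        variants = make_counterfactuals(template)
        scores = {key: predict_acuity(text) for key, text in variants.items()}
        for race in RACES:
            gaps[race].append(scores[("female", race)] - scores[("male", race)])
    return {race: mean(values) for race, values in gaps.items()}


def stub_predict_acuity(text: str) -> float:
    """Hypothetical placeholder for a model call.

    Returns a constant numeric acuity score (the scale is an assumption),
    so every gap is zero until a real model is substituted.
    """
    return 3.0


# Hypothetical vignette; the demographic slots are the only varying fields.
templates = [
    "{sex} {race} patient, 54 years old, chest pain for 2 hours, "
    "heart rate 110, SpO2 94%."
]
gaps = sex_gap_by_race(templates, stub_predict_acuity)
print(gaps)
```

Reporting the sex gap separately for each race, rather than pooling across races, is what allows differences that appear only at specific intersections to be seen. A pooled average could hide a gap that is concentrated in one group.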