CL · Feb 18, 2025

Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare

arXiv:2502.13319v2 · 8 citations · h-index: 11 · EMNLP
Originality: Synthesis-oriented
AI Analysis

This paper addresses demographic bias in LLMs used for healthcare, which could improve fairness in clinical applications. Its contribution is incremental, however, because it applies existing interpretability methods to a new domain.

The researchers use mechanistic interpretability to identify and manipulate sociodemographic representations in large language models (LLMs) for healthcare. They find that gender information is localized in MLP layers and can be patched at inference time to alter generated clinical vignettes and downstream predictions such as depression risk. Race is represented in a more distributed way and can be intervened upon only to a degree.

We know from prior work that LLMs encode social biases, and that this manifests in clinical tasks. In this work we adopt tools from mechanistic interpretability to unveil sociodemographic representations and biases within LLMs in the context of healthcare. Specifically, we ask: Can we identify activations within LLMs that encode sociodemographic information (e.g., gender, race)? We find that gender information is highly localized in MLP layers and can be reliably manipulated at inference time via patching. Such interventions can surgically alter generated clinical vignettes for specific conditions, and also influence downstream clinical predictions which correlate with gender, e.g., patient risk of depression. We find that representation of patient race is somewhat more distributed, but can also be intervened upon, to a degree. To our knowledge, this is the first application of mechanistic interpretability methods to LLMs for healthcare.
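The patching intervention described in the abstract can be illustrated with a minimal sketch. This is not the authors' code or model: it assumes a toy residual stack of random MLP blocks in NumPy as a stand-in for an LLM, and the names `forward`, `x_source`, `x_target`, and `effects` are hypothetical. The idea shown is generic activation patching: cache an MLP output from a "source" run, substitute it into a "target" run at one layer, and measure how much the final output moves.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, N_LAYERS = 8, 16, 3  # toy sizes chosen for illustration, not from the paper

# Random weights for a toy residual stack of MLP blocks (stand-in for an LLM).
W_in = [rng.normal(size=(D, H)) / np.sqrt(D) for _ in range(N_LAYERS)]
W_out = [rng.normal(size=(H, D)) / np.sqrt(H) for _ in range(N_LAYERS)]


def mlp(layer, h):
    """One ReLU MLP block."""
    return np.maximum(h @ W_in[layer], 0.0) @ W_out[layer]


def forward(x, patch=None, cache=None):
    """Run the toy model.

    patch: optional dict {layer: array} whose values replace that layer's MLP output.
    cache: optional dict that is filled with each layer's unpatched MLP output.
    """
    h = x.copy()
    for layer in range(N_LAYERS):
        out = mlp(layer, h)
        if cache is not None:
            cache[layer] = out
        if patch is not None and layer in patch:
            out = patch[layer]  # the intervention: swap in the cached activation
        h = h + out  # residual connection
    return h


# Hypothetical inputs standing in for two prompts that differ in one attribute.
x_source = rng.normal(size=D)
x_target = rng.normal(size=D)

# 1) Cache MLP outputs on the source run.
src_cache = {}
forward(x_source, cache=src_cache)

# 2) Patch one layer at a time into the target run and record the output shift.
clean = forward(x_target)
effects = []
for layer in range(N_LAYERS):
    patched = forward(x_target, patch={layer: src_cache[layer]})
    effects.append(float(np.linalg.norm(patched - clean)))

for layer, effect in enumerate(effects):
    print(f"layer {layer}: output shift {effect:.3f}")
```

In a real study, the per-layer shift would be replaced by a task metric (for example, the change in probability of a gendered token or a clinical label), and layers with large effects would be candidates for where the attribute is encoded.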
