CY · AI · Apr 19

PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations

arXiv:2604.17359 · 18.1 · h-index: 4

Predicted impact: top 92% in CY · last 90 days
Originality: Incremental advance

AI Analysis

For researchers and clinicians using LLMs for patient simulation, this work shows that current models fail to represent real population health distributions, risking pathologization of ordinary distress or erasure of genuine need.

PsychBench audits the epidemiological fidelity of LLM mental health simulations across 28,800 profiles from four models and finds a coherence-fidelity dissociation: models produce clinically plausible individuals but misrepresent population distributions. Variance is compressed by 14-62%, 36.66% of cases cross diagnostic thresholds between runs, and calibration biases are systematic (e.g., depression is overestimated by 3.6-6.1 points for most groups but underestimated by 5.42 points for transgender women).

Large language models are increasingly deployed to simulate patients for clinical training, research, and mental health tools, yet population-level validity remains largely untested. We introduce PsychBench, the first epidemiological audit of LLM patient simulation: 28,800 profiles from four frontier models (GPT-4o-mini, DeepSeek-V3, Gemini-3-Flash, GLM-4.7) evaluated against NHANES and NESARC-III baselines across 120 intersectional cohorts. The central finding is a coherence-fidelity dissociation: models produce clinically plausible individuals while misrepresenting the populations they are drawn from. Variance compression ranges from 14 percent (GLM-4.7) to 62 percent (DeepSeek-V3), eliminating the distributional tails of clinical reality. Despite test-retest correlations above r = 0.90, 36.66 percent of cases cross diagnostic thresholds between runs. Symptom correlation matrices diverge across demographic groups beyond split-half noise, with divergence for transgender populations three to five times larger than divergence across racial groups. Calibration bias is systematic and asymmetric. Models overestimate depression severity for most groups by 3.6 to 6.1 points (Cohen's d = 1.13 to 1.91), consistent with training on clinical corpora with elevated base rates. For transgender women the direction inverts: models capture only 8 to 46 percent of documented minority stress elevation, yielding a residual of -5.42 points (d = -1.55). Models also attribute irritability to Black men and fatigue to women beyond matched controls, encoding racialized and gendered assumptions. Patterns replicate across US and Chinese architectures, indicating failures tied to current training paradigms rather than isolated implementations. For most users, LLM mental health tools risk pathologizing ordinary distress; for transgender users, they risk algorithmic erasure of genuine need. The patients look right. They do not represent real populations.
