CL · Aug 3, 2023
The Capability of Large Language Models to Measure Psychiatric Functioning
Isaac R. Galatzer-Levy, Daniel McDuff, Vivek Natarajan et al.
The current work investigates the capability of large language models (LLMs) explicitly trained on large corpora of medical knowledge (Med-PaLM 2) to predict psychiatric functioning from patient interviews and clinical descriptions without being trained to do so. To assess this, n = 145 depression and n = 115 PTSD assessments and n = 46 clinical case studies across high-prevalence/high-comorbidity disorders (depressive, anxiety, psychotic, trauma and stress, and addictive disorders) were analyzed using prompts to extract estimated clinical scores and diagnoses. Results demonstrate that Med-PaLM 2 is capable of assessing psychiatric functioning across a range of psychiatric conditions, with the strongest performance being the prediction of depression scores based on standardized assessments (accuracy range = 0.80-0.84), which were statistically indistinguishable from human clinical raters, t(1,144) = 1.20; p = 0.23. Results show the potential for general clinical language models to flexibly predict psychiatric risk based on free descriptions of functioning from both patients and clinicians.
CL · Dec 23, 2025
Adversarial Training for Failure-Sensitive User Simulation in Mental Health Dialogue Optimization
Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis et al. · cambridge
Realistic user simulation is crucial for training and evaluating task-oriented dialogue (TOD) systems, yet creating simulators that accurately replicate human behavior remains challenging. A key property of effective simulators is their ability to expose failure modes of the systems they evaluate. We present an adversarial training framework that iteratively improves user simulator realism through a competitive dynamic between a generator (user simulator) and a discriminator. Applied to mental health support chatbots, our approach demonstrates that fine-tuned simulators dramatically outperform zero-shot base models at surfacing system issues, and adversarial training further enhances diversity, distributional alignment, and predictive validity. The resulting simulator achieves a strong correlation between simulated and real failure occurrence rates across diverse chatbot configurations while maintaining low distributional divergence of failure modes. Discriminator accuracy decreases drastically after three adversarial iterations, suggesting improved realism. These results provide evidence that adversarial training is a promising approach for creating realistic user simulators in mental health support TOD domains, enabling rapid, reliable, and cost-effective system evaluation before deployment.
HC · Feb 24
Talking to a Human as an Attitudinal Barrier: A Mixed Methods Evaluation of Stigma, Access, and the Appeal of AI Mental Health Support
Caitlin A. Stamatis, Emma C. Wolfe, Matteo Malgaroli et al.
Background: Many people who could benefit from therapy do not receive it. Conversational AI is increasingly used for mental health support, yet it is unclear which barriers AI helps mitigate. We examined whether evaluation-sensitive (shame/stigma) and structural barriers (cost/coverage/access) to psychotherapy predict perceived helpfulness of an AI mental health conversational tool (Ash), and whether effects differ by prior therapy experience or user engagement. Methods: Participants (n=395) rated Ash's helpfulness (1-5) and described barriers to therapy. Open-text responses were coded for shame/stigma, access, and cost/coverage themes. Linear regressions examined associations between barriers and perceived helpfulness, adjusting for demographics and mental health, with moderation by therapy experience. Results: Shame/stigma (B=.45, p<.001) and access barriers (B=.31, p=.020) predicted higher perceived helpfulness but cost/coverage did not (B=.13, p=.262). Prior therapy experience moderated the shame effect (interaction B=.56, p=.036): shame predicted higher helpfulness among therapy-experienced users (Δ=.62, p<.001) but not therapy-naive users (Δ=.03, p=.877). Among therapy-experienced participants (n=258), shame/stigma (B=.75, p<.001) and access barriers (B=.51, p=.006) predicted rating Ash more favorably. Access barriers predicted higher engagement (IRR=1.64, p<.001) and cost/coverage barriers predicted 70% more sessions (IRR=1.70, p<.001). Shame/stigma was not associated with total sessions (IRR=.80, p=.094). Conclusions: AI mental health support was perceived as most helpful by users facing shame/stigma and access barriers, particularly for therapy-experienced individuals. Access and cost barriers were most predictive of usage intensity, suggesting unmet needs. Findings highlight the importance of aligning AI tools for emotional support with user-reported barriers.
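The abstract above reads an incidence rate ratio of 1.70 as "70% more sessions". As a minimal sketch of that arithmetic (the function name is hypothetical and not from the paper), an IRR converts to a percent change in the expected count as follows:

```python
def irr_to_percent_change(irr: float) -> float:
    """Convert an incidence rate ratio (IRR) to a percent change in the expected count."""
    return (irr - 1.0) * 100.0


# IRR values as reported in the abstract above.
reported_irrs = {
    "access barriers": 1.64,
    "cost/coverage barriers": 1.70,
    "shame/stigma": 0.80,  # reported as not statistically significant (p=.094)
}

for barrier, irr in reported_irrs.items():
    change = irr_to_percent_change(irr)
    print(f"{barrier}: IRR={irr:.2f} -> {change:+.0f}% sessions")
```

An IRR below 1 maps to a negative percent change, which is why the non-significant shame/stigma estimate (IRR=.80) would correspond to roughly 20% fewer sessions if taken at face value.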
CL · Feb 17
Language Markers of Emotion Flexibility Predict Depression and Anxiety Treatment Outcomes
Benjamin Brindle, George A. Bonanno, Thomas Derrick Hull et al.
Predicting treatment non-response for anxiety and depression is challenging, in part because of sparse symptom assessments in real-world care. We examined whether passively captured, fine-grained emotions serve as linguistic markers of treatment outcomes by analyzing 12 weeks of de-identified teletherapy transcripts from 12,043 U.S. patients with moderate-to-severe anxiety and depression symptoms. A transformer-based small language model extracted patients' emotions at the talk-turn level; a state-space model (VISTA-SSM) clustered subgroups based on emotion dynamics over time and produced temporal networks. Two groups emerged: an improving group (n=8,230) and a non-response group (n=3,813) showing increased odds of symptom deterioration, and lower likelihood of clinically significant improvement. Temporal networks indicated that sadness and fear exerted most influence on emotion dynamics in non-responders, whereas improving patients showed balanced joy, sadness, and neutral expressions. Findings suggest that linguistic markers of emotional inflexibility can serve as scalable, interpretable, and theoretically grounded indicators for treatment risk stratification.
CL · May 20, 2024
Can AI Relate: Testing Large Language Model Response for Mental Health Support
Saadia Gabriel, Isha Puri, Xuhai Xu et al.
Large language models (LLMs) are already being piloted for clinical use in hospital systems like NYU Langone, Dana-Farber and the NHS. A proposed deployment use case is psychotherapy, where an LLM-powered chatbot can treat a patient undergoing a mental health crisis. Deployment of LLMs for mental health response could hypothetically broaden access to psychotherapy and provide new possibilities for personalizing care. However, recent high-profile failures, like damaging dieting advice offered by the Tessa chatbot to patients with eating disorders, have led to doubt about their reliability in high-stakes and safety-critical settings. In this work, we develop an evaluation framework for determining whether LLM response is a viable and ethical path forward for the automation of mental health treatment. Our framework measures equity in empathy and adherence of LLM responses to motivational interviewing theory. Using human evaluation with trained clinicians and automatic quality-of-care metrics grounded in psychology research, we compare the responses provided by peer-to-peer responders to those provided by a state-of-the-art LLM. We show that LLMs like GPT-4 use implicit and explicit cues to infer patient demographics like race. We then show that there are statistically significant discrepancies between patient subgroups: Responses to Black posters consistently have lower empathy than for any other demographic group (2%-13% lower than the control group). Promisingly, we do find that the manner in which responses are generated significantly impacts the quality of the response. We conclude by proposing safety guidelines for the potential deployment of LLMs for mental health response.
LG · Feb 17
Multi-Objective Alignment of Language Models for Personalized Psychotherapy
Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna et al.
Mental health disorders affect over 1 billion people worldwide, yet access to care remains limited by workforce shortages and cost constraints. While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety. We survey 335 individuals with lived mental health experience to collect preference rankings across therapeutic dimensions, then develop a multi-objective alignment framework using direct preference optimization. We train reward models for six criteria -- empathy, safety, active listening, self-motivated change, trust/rapport, and patient autonomy -- and systematically compare multi-objective approaches against single-objective optimization, supervised fine-tuning, and parameter merging. Multi-objective DPO (MODPO) achieves superior balance (77.6% empathy, 62.6% safety) compared to single-objective optimization (93.6% empathy, 47.8% safety), and therapeutic criteria outperform general communication principles by 17.2%. Blinded clinician evaluation confirms MODPO is consistently preferred, with LLM-evaluator agreement comparable to inter-clinician reliability.
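For readers unfamiliar with direct preference optimization, the sketch below shows the standard single-objective DPO loss for one preference pair. It is an illustration only: the abstract does not specify how the multi-objective variant combines the six criteria, so nothing here should be read as the authors' MODPO implementation. The function name, argument names, and the `beta` default are assumptions.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. `beta` (assumed default) scales
    how strongly the policy is pushed away from the reference.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities, for illustration only.
print(dpo_loss(logp_chosen=-1.0, logp_rejected=-3.0,
               ref_logp_chosen=-2.0, ref_logp_rejected=-2.0))
```

When the policy and reference agree exactly, the margin is zero and the loss equals log 2; the loss falls below that as the policy shifts probability toward the preferred response.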
HC · Apr 30
Engagement Phenotypes for a Sample of 102,684 AI Mental Health Chatbot Users and Dose-Response Associations with Clinical Outcomes
Emma C. Wolfe, Ting Su, Olivier Tieleman et al.
Background: Conversational AI chatbots are emerging as scalable mental health tools, but little is known about real-world engagement or its relationship to clinical outcomes. Objective: To characterize engagement phenotypes among users of Ash, a purpose-built AI mental health chatbot, and examine associations with clinical change and working alliance. Methods: K-means clustering across eight behavioral features identified engagement phenotypes among 102,684 users. Subsamples completed the PHQ-9 (n=298), GAD-7 (n=298), and MSPSS (social support; n=194) at baseline and 3 weeks; 11,437 users completed the baseline Working Alliance Inventory (WAI). Results: Five engagement phenotypes emerged: Early Dropouts (52.2%), Power Users (1.6%), Intensive Users (4.1%), Weekly Users (25.3%), and a novel Concentrated User pattern (16.8%); across users, 66.9% had at least one overnight session (9pm-5am). Significant pre-post improvements occurred in depression (d = -0.51), anxiety (d = -0.57), and social support (d = 0.22). An observed dose-response gradient in self-reported depression improvement was replicated in a larger sample with model-predicted PHQ-9 (n = 23,813; Power Users d = -0.54; Early Dropouts d = -0.13). Higher working alliance predicted depression improvement and moderated the engagement-social support relationship. Conclusions: Engagement with AI mental health tools is multidimensional, and different clinical outcomes respond to different dimensions of use. Findings caution against treating session counts as a primary engagement metric and offer naturalistic evidence for the clinical value of purpose-built conversational AI.
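The pre-post effect sizes above are reported as Cohen's d. The abstract does not say which variant was used, so the sketch below assumes one common choice for paired data: the mean of the post-minus-pre differences divided by the standard deviation of those differences. The function name and the scores are hypothetical and are not data from the study.

```python
from statistics import mean, stdev


def cohens_d_paired(pre: list[float], post: list[float]) -> float:
    """Paired-samples Cohen's d: mean change divided by the SD of the change.

    This is one of several pre-post variants and is an assumption here;
    other variants standardize by the baseline SD or a pooled SD instead.
    """
    if len(pre) != len(post):
        raise ValueError("pre and post must have the same length")
    diffs = [after - before for before, after in zip(pre, post)]
    return mean(diffs) / stdev(diffs)


# Hypothetical PHQ-9 style scores for six users (not study data).
baseline = [14, 12, 16, 10, 15, 13]
week_3 = [10, 11, 12, 9, 13, 8]

d = cohens_d_paired(baseline, week_3)
print(f"d = {d:.2f}")  # negative d means scores dropped, i.e. symptoms improved
```

Under this sign convention a negative d on a symptom scale such as the PHQ-9 indicates improvement, while a positive d on a scale such as social support indicates a gain, matching the direction of the values reported above.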