79.0CLMay 26
Keyphrase Generative Representation of Youth Crisis Conversations Beyond Static TaxonomiesAbeer Badawi, Will Aitken, Lydia Sequeira et al.
Crisis Responders (CRs) rapidly assess thousands of youth SMS conversations each year to identify mental health concerns and guide support. Yet youth distress is increasingly expressed through evolving and context-specific language that often does not fit fixed-label taxonomies. This work analyzed 703,975 de-identified Kids Help Phone conversations (2018-2023) and expanded KHP's 19-label issue taxonomy into a 39-label hierarchical schema. We then introduce Keyphrase Generative Representation (KGR), a constrained LLM generating concise, conversation-specific keyphrases, evaluated across 129 conversations and 387 expert annotations. The expanded taxonomy achieved expert consensus reliability, with an accuracy of 0.96, and expert review found that 81% of keyphrases accurately reflected content and 74% improved clarity. KGR surfaced identity-linked themes absent from the fixed taxonomy, including immigration problems and caregiver burden, and supported a topic-retrieval workflow that increased accuracy from 0.25 to 0.70 (+0.45) over the manual analyst process. KGR marks a shift toward hybrid, interpretable generative representations that extend crisis response beyond static taxonomies to surface emerging and culturally grounded patterns of youth distress.
AIJan 26Code
Assessing the Quality of Mental Health Support in LLM Responses through Multi-Attribute Human EvaluationAbeer Badawi, Md Tahmid Rahman Laskar, Elahe Rahimi et al.
The escalating global mental health crisis, marked by persistent treatment gaps, availability, and a shortage of qualified therapists, positions Large Language Models (LLMs) as a promising avenue for scalable support. While LLMs offer potential for accessible emotional assistance, their reliability, therapeutic relevance, and alignment with human standards remain challenging to address. This paper introduces a human-grounded evaluation methodology designed to assess LLM generated responses in therapeutic dialogue. Our approach involved curating a dataset of 500 mental health conversations from datasets with real-world scenario questions and evaluating the responses generated by nine diverse LLMs, including closed source and open source models. More specifically, these responses were evaluated by two psychiatric trained experts, who independently rated each on a 5 point Likert scale across a comprehensive 6 attribute rubric. This rubric captures Cognitive Support and Affective Resonance, providing a multidimensional perspective on therapeutic quality. Our analysis reveals that LLMs provide strong cognitive reliability by producing safe, coherent, and clinically appropriate information, but they demonstrate unstable affective alignment. Although closed source models (e.g., GPT-4o) offer balanced therapeutic responses, open source models show greater variability and emotional flatness. We reveal a persistent cognitive-affective gap and highlight the need for failure aware, clinically grounded evaluation frameworks that prioritize relational sensitivity alongside informational accuracy in mental health oriented LLMs. We advocate for balanced evaluation protocols with human in the loop that center on therapeutic sensitivity and provide a framework to guide the responsible design and clinical oversight of mental health oriented conversational AI.
CLOct 21, 2025Code
When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM EvaluationAbeer Badawi, Elahe Rahimi, Md Tahmid Rahman Laskar et al.
Evaluating Large Language Models (LLMs) for mental health support is challenging due to the emotionally and cognitively complex nature of therapeutic dialogue. Existing benchmarks are limited in scale, reliability, often relying on synthetic or social media data, and lack frameworks to assess when automated judges can be trusted. To address the need for large-scale dialogue datasets and judge reliability assessment, we introduce two benchmarks that provide a framework for generation and evaluation. MentalBench-100k consolidates 10,000 one-turn conversations from three real scenarios datasets, each paired with nine LLM-generated responses, yielding 100,000 response pairs. MentalAlign-70k}reframes evaluation by comparing four high-performing LLM judges with human experts across 70,000 ratings on seven attributes, grouped into Cognitive Support Score (CSS) and Affective Resonance Score (ARS). We then employ the Affective Cognitive Agreement Framework, a statistical methodology using intraclass correlation coefficients (ICC) with confidence intervals to quantify agreement, consistency, and bias between LLM judges and human experts. Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance. Our contributions establish new methodological and empirical foundations for reliable, large-scale evaluation of LLMs in mental health. We release the benchmarks and codes at: https://github.com/abeerbadawi/MentalBench/
MMOct 26, 2024
A Novel Multimodal System to Predict Agitation in People with Dementia Within Clinical Settings: A Proof of ConceptAbeer Badawi, Somayya Elmoghazy, Samira Choudhury et al.
Dementia is a neurodegenerative condition that combines several diseases and impacts millions around the world and those around them. Although cognitive impairment is profoundly disabling, it is the noncognitive features of dementia, referred to as Neuropsychiatric Symptoms (NPS), that are most closely associated with a diminished quality of life. Agitation and aggression (AA) in people living with dementia (PwD) contribute to distress and increased healthcare demands. Current assessment methods rely on caregiver intervention and reporting of incidents, introducing subjectivity and bias. Artificial Intelligence (AI) and predictive algorithms offer a potential solution for detecting AA episodes in PwD when utilized in real-time. We present a 5-year study system that integrates a multimodal approach, utilizing the EmbracePlus wristband and a video detection system to predict AA in severe dementia patients. We conducted a pilot study with three participants at the Ontario Shores Mental Health Institute to validate the functionality of the system. The system collects and processes raw and digital biomarkers from the EmbracePlus wristband to accurately predict AA. The system also detected pre-agitation patterns at least six minutes before the AA event, which was not previously discovered from the EmbracePlus wristband. Furthermore, the privacy-preserving video system uses a masking tool to hide the features of the people in frames and employs a deep learning model for AA detection. The video system also helps identify the actual start and end time of the agitation events for labeling. The promising results of the preliminary data analysis underscore the ability of the system to predict AA events. The ability of the proposed system to run autonomously in real-time and identify AA and pre-agitation symptoms without external assistance represents a significant milestone in this research field.
AIDec 26, 2024
Leveraging Self-Training and Variational Autoencoder for Agitation Detection in People with Dementia Using Wearable SensorsAbeer Badawi, Somayya Elmoghazy, Samira Choudhury et al.
Dementia is a neurodegenerative disorder that has been growing among elder people over the past decades. This growth profoundly impacts the quality of life for patients and caregivers due to the symptoms arising from it. Agitation and aggression (AA) are some of the symptoms of people with severe dementia (PwD) in long-term care or hospitals. AA not only causes discomfort but also puts the patients or others at potential risk. Existing monitoring solutions utilizing different wearable sensors integrated with Artificial Intelligence (AI) offer a way to detect AA early enough for timely and adequate medical intervention. However, most studies are limited by the availability of accurately labeled datasets, which significantly affects the efficacy of such solutions in real-world scenarios. This study presents a novel comprehensive approach to detect AA in PwD using physiological data from the Empatica E4 wristbands. The research creates a diverse dataset, consisting of three distinct datasets gathered from 14 participants across multiple hospitals in Canada. These datasets have not been extensively explored due to their limited labeling. We propose a novel approach employing self-training and a variational autoencoder (VAE) to detect AA in PwD effectively. The proposed approach aims to learn the representation of the features extracted using the VAE and then uses a semi-supervised block to generate labels, classify events, and detect AA. We demonstrate that combining Self-Training and Variational Autoencoder mechanism significantly improves model performance in classifying AA in PwD. Among the tested techniques, the XGBoost classifier achieved the highest accuracy of 90.16\%. By effectively addressing the challenge of limited labeled data, the proposed system not only learns new labels but also proves its superiority in detecting AA.
HCFeb 21, 2025
Position: Beyond Assistance -- Reimagining LLMs as Ethical and Adaptive Co-Creators in Mental Health CareAbeer Badawi, Md Tahmid Rahman Laskar, Jimmy Xiangji Huang et al.
This position paper argues for a fundamental shift in how Large Language Models (LLMs) are integrated into the mental health care domain. We advocate for their role as co-creators rather than mere assistive tools. While LLMs have the potential to enhance accessibility, personalization, and crisis intervention, their adoption remains limited due to concerns about bias, evaluation, over-reliance, dehumanization, and regulatory uncertainties. To address these challenges, we propose two structured pathways: SAFE-i (Supportive, Adaptive, Fair, and Ethical Implementation) Guidelines for ethical and responsible deployment, and HAAS-e (Human-AI Alignment and Safety Evaluation) Framework for multidimensional, human-centered assessment. SAFE-i provides a blueprint for data governance, adaptive model engineering, and real-world integration, ensuring LLMs align with clinical and ethical standards. HAAS-e introduces evaluation metrics that go beyond technical accuracy to measure trustworthiness, empathy, cultural sensitivity, and actionability. We call for the adoption of these structured approaches to establish a responsible and scalable model for LLM-driven mental health support, ensuring that AI complements, rather than replaces, human expertise.