Abeer Badawi

h-index7

5papers

23citations

Novelty42%

AI Score44

Ranked #49,976 of 194,257 authors (top 26%)#2,987 in AI (top 24%)

5 Papers

6.0AIJan 26Code

Assessing the Quality of Mental Health Support in LLM Responses through Multi-Attribute Human Evaluation

Abeer Badawi, Md Tahmid Rahman Laskar, Elahe Rahimi et al.

The escalating global mental health crisis, marked by persistent treatment gaps, availability, and a shortage of qualified therapists, positions Large Language Models (LLMs) as a promising avenue for scalable support. While LLMs offer potential for accessible emotional assistance, their reliability, therapeutic relevance, and alignment with human standards remain challenging to address. This paper introduces a human-grounded evaluation methodology designed to assess LLM generated responses in therapeutic dialogue. Our approach involved curating a dataset of 500 mental health conversations from datasets with real-world scenario questions and evaluating the responses generated by nine diverse LLMs, including closed source and open source models. More specifically, these responses were evaluated by two psychiatric trained experts, who independently rated each on a 5 point Likert scale across a comprehensive 6 attribute rubric. This rubric captures Cognitive Support and Affective Resonance, providing a multidimensional perspective on therapeutic quality. Our analysis reveals that LLMs provide strong cognitive reliability by producing safe, coherent, and clinically appropriate information, but they demonstrate unstable affective alignment. Although closed source models (e.g., GPT-4o) offer balanced therapeutic responses, open source models show greater variability and emotional flatness. We reveal a persistent cognitive-affective gap and highlight the need for failure aware, clinically grounded evaluation frameworks that prioritize relational sensitivity alongside informational accuracy in mental health oriented LLMs. We advocate for balanced evaluation protocols with human in the loop that center on therapeutic sensitivity and provide a framework to guide the responsible design and clinical oversight of mental health oriented conversational AI.

17.0CLOct 21, 2025Code

When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation

Abeer Badawi, Elahe Rahimi, Md Tahmid Rahman Laskar et al.

Evaluating Large Language Models (LLMs) for mental health support is challenging due to the emotionally and cognitively complex nature of therapeutic dialogue. Existing benchmarks are limited in scale, reliability, often relying on synthetic or social media data, and lack frameworks to assess when automated judges can be trusted. To address the need for large-scale dialogue datasets and judge reliability assessment, we introduce two benchmarks that provide a framework for generation and evaluation. MentalBench-100k consolidates 10,000 one-turn conversations from three real scenarios datasets, each paired with nine LLM-generated responses, yielding 100,000 response pairs. MentalAlign-70k}reframes evaluation by comparing four high-performing LLM judges with human experts across 70,000 ratings on seven attributes, grouped into Cognitive Support Score (CSS) and Affective Resonance Score (ARS). We then employ the Affective Cognitive Agreement Framework, a statistical methodology using intraclass correlation coefficients (ICC) with confidence intervals to quantify agreement, consistency, and bias between LLM judges and human experts. Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance. Our contributions establish new methodological and empirical foundations for reliable, large-scale evaluation of LLMs in mental health. We release the benchmarks and codes at: https://github.com/abeerbadawi/MentalBench/

1.2MMOct 26, 2024

A Novel Multimodal System to Predict Agitation in People with Dementia Within Clinical Settings: A Proof of Concept

Abeer Badawi, Somayya Elmoghazy, Samira Choudhury et al.

Dementia is a neurodegenerative condition that combines several diseases and impacts millions around the world and those around them. Although cognitive impairment is profoundly disabling, it is the noncognitive features of dementia, referred to as Neuropsychiatric Symptoms (NPS), that are most closely associated with a diminished quality of life. Agitation and aggression (AA) in people living with dementia (PwD) contribute to distress and increased healthcare demands. Current assessment methods rely on caregiver intervention and reporting of incidents, introducing subjectivity and bias. Artificial Intelligence (AI) and predictive algorithms offer a potential solution for detecting AA episodes in PwD when utilized in real-time. We present a 5-year study system that integrates a multimodal approach, utilizing the EmbracePlus wristband and a video detection system to predict AA in severe dementia patients. We conducted a pilot study with three participants at the Ontario Shores Mental Health Institute to validate the functionality of the system. The system collects and processes raw and digital biomarkers from the EmbracePlus wristband to accurately predict AA. The system also detected pre-agitation patterns at least six minutes before the AA event, which was not previously discovered from the EmbracePlus wristband. Furthermore, the privacy-preserving video system uses a masking tool to hide the features of the people in frames and employs a deep learning model for AA detection. The video system also helps identify the actual start and end time of the agitation events for labeling. The promising results of the preliminary data analysis underscore the ability of the system to predict AA events. The ability of the proposed system to run autonomously in real-time and identify AA and pre-agitation symptoms without external assistance represents a significant milestone in this research field.

4.2AIDec 26, 2024

Leveraging Self-Training and Variational Autoencoder for Agitation Detection in People with Dementia Using Wearable Sensors

Abeer Badawi, Somayya Elmoghazy, Samira Choudhury et al.

Dementia is a neurodegenerative disorder that has been growing among elder people over the past decades. This growth profoundly impacts the quality of life for patients and caregivers due to the symptoms arising from it. Agitation and aggression (AA) are some of the symptoms of people with severe dementia (PwD) in long-term care or hospitals. AA not only causes discomfort but also puts the patients or others at potential risk. Existing monitoring solutions utilizing different wearable sensors integrated with Artificial Intelligence (AI) offer a way to detect AA early enough for timely and adequate medical intervention. However, most studies are limited by the availability of accurately labeled datasets, which significantly affects the efficacy of such solutions in real-world scenarios. This study presents a novel comprehensive approach to detect AA in PwD using physiological data from the Empatica E4 wristbands. The research creates a diverse dataset, consisting of three distinct datasets gathered from 14 participants across multiple hospitals in Canada. These datasets have not been extensively explored due to their limited labeling. We propose a novel approach employing self-training and a variational autoencoder (VAE) to detect AA in PwD effectively. The proposed approach aims to learn the representation of the features extracted using the VAE and then uses a semi-supervised block to generate labels, classify events, and detect AA. We demonstrate that combining Self-Training and Variational Autoencoder mechanism significantly improves model performance in classifying AA in PwD. Among the tested techniques, the XGBoost classifier achieved the highest accuracy of 90.16\%. By effectively addressing the challenge of limited labeled data, the proposed system not only learns new labels but also proves its superiority in detecting AA.

4.1HCFeb 21, 2025

Position: Beyond Assistance -- Reimagining LLMs as Ethical and Adaptive Co-Creators in Mental Health Care

Abeer Badawi, Md Tahmid Rahman Laskar, Jimmy Xiangji Huang et al.

This position paper argues for a fundamental shift in how Large Language Models (LLMs) are integrated into the mental health care domain. We advocate for their role as co-creators rather than mere assistive tools. While LLMs have the potential to enhance accessibility, personalization, and crisis intervention, their adoption remains limited due to concerns about bias, evaluation, over-reliance, dehumanization, and regulatory uncertainties. To address these challenges, we propose two structured pathways: SAFE-i (Supportive, Adaptive, Fair, and Ethical Implementation) Guidelines for ethical and responsible deployment, and HAAS-e (Human-AI Alignment and Safety Evaluation) Framework for multidimensional, human-centered assessment. SAFE-i provides a blueprint for data governance, adaptive model engineering, and real-world integration, ensuring LLMs align with clinical and ethical standards. HAAS-e introduces evaluation metrics that go beyond technical accuracy to measure trustworthiness, empathy, cultural sensitivity, and actionability. We call for the adoption of these structured approaches to establish a responsible and scalable model for LLM-driven mental health support, ensuring that AI complements, rather than replaces, human expertise.