CL · Jun 10, 2025

CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering

Yahan Li, Jifan Yao, John Bosco S. Bunyi, Adam C. Frank, Angel Hwang, Ruishan Liu

arXiv:2506.08584v2 · 18.216 citations · h-index: 5 · Has Code

Originality: Incremental advance

AI Analysis

CounselBench addresses a critical gap in benchmarking LLMs for realistic mental health scenarios and gives researchers and practitioners a clinically grounded framework, though it is an incremental advance that builds on existing medical QA benchmarks.

The authors address the evaluation of large language models (LLMs) in open-ended mental health question answering by creating CounselBench, a benchmark that combines expert evaluations with adversarial questions. Across 2,000 expert evaluations, each rating an answer on six dimensions, LLMs achieve high scores on several dimensions but show recurring issues such as safety risks and overgeneralization. Adversarial testing of 3,240 responses reveals consistent, model-specific failure patterns.

Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and human therapists on patient questions from the public forum CounselChat. Each answer is rated across six clinically grounded dimensions, with span-level annotations and written rationales. Expert evaluations show that while LLMs achieve high scores on several dimensions, they also exhibit recurring issues, including unconstructive feedback, overgeneralization, and limited personalization or relevance. Responses were frequently flagged for safety risks, most notably unauthorized medical advice. Follow-up experiments show that LLM judges systematically overrate model responses and overlook safety concerns identified by human experts. To probe failure modes more directly, we construct CounselBench-Adv, an adversarial dataset of 120 expert-authored mental health questions designed to trigger specific model issues. Evaluation of 3,240 responses from nine LLMs reveals consistent, model-specific failure patterns. Together, CounselBench establishes a clinically grounded framework for benchmarking LLMs in mental health QA.
