CY AI SIOct 17, 2025

VERA-MH Concept Paper

Luca Belli, Kate Bentley, Will Alexander, Emily Ward, Matt Hawrilenko, Kelly Johnston, Mill Brown, Adam Chekroud

arXiv:2510.15297v24 citationsh-index: 4

Originality Incremental advance

AI Analysis

This addresses the need for ethical and responsible AI in mental health care, particularly for high-risk scenarios like suicide prevention, though it is incremental as it builds on existing evaluation methods with automation.

The paper tackles the problem of evaluating the safety of AI chatbots in mental health contexts, specifically for suicide risk, by introducing VERA-MH, an automated system that uses simulated conversations and scoring to assess chatbots like GPT-5 and Claude models, with preliminary evaluations conducted for design development.

We introduce VERA-MH (Validation of Ethical and Responsible AI in Mental Health), an automated evaluation of the safety of AI chatbots used in mental health contexts, with an initial focus on suicide risk. Practicing clinicians and academic experts developed a rubric informed by best practices for suicide risk management for the evaluation. To fully automate the process, we used two ancillary AI agents. A user-agent model simulates users engaging in a mental health-based conversation with the chatbot under evaluation. The user-agent role-plays specific personas with pre-defined risk levels and other features. Simulated conversations are then passed to a judge-agent who scores them based on the rubric. The final evaluation of the chatbot being tested is obtained by aggregating the scoring of each conversation. VERA-MH is actively under development and undergoing rigorous validation by mental health clinicians to ensure user-agents realistically act as patients and that the judge-agent accurately scores the AI chatbot. To date we have conducted preliminary evaluation of GPT-5, Claude Opus and Claude Sonnet using initial versions of the VERA-MH rubric and used the findings for further design development. Next steps will include more robust clinical validation and iteration, as well as refining actionable scoring. We are seeking feedback from the community on both the technical and clinical aspects of our evaluation.

View on arXiv PDF

Similar