TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
This work is significant for developers, clinicians, and policymakers as it provides a systematic approach to identify and mitigate relational safety failures in mental health chatbots, moving beyond single-turn crisis response evaluations.
This paper addresses the critical need for relational safety in mental health chatbots by introducing TherapyProbe, a methodology that uses adversarial multi-agent simulation to explore chatbot conversation trajectories. The research identifies 23 relational safety failure archetypes, such as "validation spirals" and "empathy fatigue," and provides corresponding design recommendations.
As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety, that is, the quality of interaction patterns that unfold across conversations rather than the correctness of individual responses? Current safety evaluations assess single-turn crisis responses, missing the therapeutic dynamics that determine whether chatbots help or harm over time. We introduce TherapyProbe, a design probe methodology that generates actionable design knowledge by systematically exploring chatbot conversation trajectories through adversarial multi-agent simulation. Using open-source models, TherapyProbe surfaces relational safety failures: interaction patterns such as "validation spirals," where chatbots progressively reinforce hopelessness, or "empathy fatigue," where responses become mechanical over turns. We translate these failures into a Safety Pattern Library of 23 failure archetypes with corresponding design recommendations. We contribute: (1) a replicable methodology requiring no API costs, (2) a clinically grounded failure taxonomy, and (3) design implications for developers, clinicians, and policymakers.
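To make the idea of exploring a conversation trajectory concrete, the following is a minimal sketch of a multi-turn simulation loop, not the TherapyProbe implementation. Everything in it is an assumption for illustration: the names `simulate`, `escalating_user`, `canned_chatbot`, and `repetition_rate` are hypothetical, both agents are hard-coded stubs standing in for the open-source models the abstract mentions, and the exact-repetition measure is only a crude stand-in for a signal such as responses becoming mechanical over turns.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Transcript:
    """Holds both sides of one simulated conversation."""
    user_turns: List[str] = field(default_factory=list)
    bot_turns: List[str] = field(default_factory=list)


def simulate(
    user_agent: Callable[[int, Transcript], str],
    chatbot: Callable[[str, Transcript], str],
    n_turns: int,
) -> Transcript:
    """Alternate a simulated user and the chatbot under test for n_turns."""
    transcript = Transcript()
    for turn in range(n_turns):
        user_msg = user_agent(turn, transcript)
        bot_msg = chatbot(user_msg, transcript)
        transcript.user_turns.append(user_msg)
        transcript.bot_turns.append(bot_msg)
    return transcript


def repetition_rate(bot_turns: List[str]) -> float:
    """Fraction of bot turns that exactly repeat an earlier bot turn.

    Assumption: a crude proxy for "mechanical" responses, chosen only
    for illustration; it is not the paper's failure detector.
    """
    seen = set()
    repeats = 0
    for msg in bot_turns:
        key = msg.strip().lower()
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats / len(bot_turns) if bot_turns else 0.0


def escalating_user(turn: int, transcript: Transcript) -> str:
    """Hypothetical scripted user whose messages grow more hopeless."""
    script = [
        "I've been feeling low lately.",
        "Nothing I try seems to help.",
        "I don't think anything will ever change.",
        "What's the point of even talking about it?",
    ]
    return script[min(turn, len(script) - 1)]


def canned_chatbot(user_msg: str, transcript: Transcript) -> str:
    """Hypothetical chatbot stub that always gives the same reply."""
    return "I hear you. That sounds really hard."


transcript = simulate(escalating_user, canned_chatbot, n_turns=4)
rate = repetition_rate(transcript.bot_turns)
print(f"repetition rate over {len(transcript.bot_turns)} turns: {rate:.2f}")
```

In a fuller setup, the two stubs would presumably be replaced by model-backed agents and the single heuristic by a set of detectors, one per failure archetype; the loop structure is the only part this sketch is meant to convey.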