Emotional Support Evaluation Framework via Controllable and Diverse Seeker Simulator
This work offers a more faithful evaluation framework for emotional support chatbots, which benefits researchers and developers working on AI and mental health applications. The contribution is incremental, since it builds on existing simulator-based evaluation methods.
The paper tackles the problem of evaluating emotional support chatbots by addressing two limitations of existing seeker simulators: a lack of behavioral diversity and a lack of controllability. It introduces a controllable simulator conditioned on psychological and linguistic features and trained with a Mixture-of-Experts architecture. The simulator achieves superior profile adherence and behavioral diversity compared to existing approaches, and it uncovers performance degradations in 7 supporter models.
As emotional support chatbots have recently gained significant traction across both research and industry, a common evaluation strategy has emerged: use help-seeker simulators to interact with supporter chatbots. However, current simulators suffer from two critical limitations: (1) they fail to capture the behavioral diversity of real-world seekers, often portraying them as overly cooperative, and (2) they lack the controllability required to simulate specific seeker profiles. To address these challenges, we present a controllable seeker simulator driven by nine psychological and linguistic features that underpin seeker behavior. Using authentic Reddit conversations, we train our model via a Mixture-of-Experts (MoE) architecture, which effectively differentiates diverse seeker behaviors into specialized parameter subspaces, thereby enhancing fine-grained controllability. Our simulator achieves superior profile adherence and behavioral diversity compared to existing approaches. Furthermore, evaluating 7 prominent supporter models with our system uncovers previously obscured performance degradations. These findings underscore the utility of our framework in providing a more faithful and stress-tested evaluation for emotional support chatbots.
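To make the routing idea concrete, the following is a minimal sketch of top-k Mixture-of-Experts gating in plain Python. It is not the authors' implementation: the abstract does not name the nine features, the number of experts, or the routing granularity, so `NUM_EXPERTS`, `TOP_K`, the random weights, and the choice to route directly on a profile vector are all assumptions made for illustration.

```python
import math
import random

NUM_FEATURES = 9   # the abstract mentions nine features; their names are not given
NUM_EXPERTS = 4    # hypothetical expert count, chosen only for illustration
TOP_K = 2          # hypothetical number of experts activated per input


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_matrix(rows, cols, rng):
    """Random weight matrix standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def moe_forward(profile, gate_weights, expert_weights, top_k=TOP_K):
    """Route a profile vector to its top-k experts and mix their outputs."""
    gate_probs = softmax(matvec(gate_weights, profile))
    # Keep the k experts with the highest gate probability.
    ranked = sorted(range(len(gate_probs)), key=lambda i: gate_probs[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize so the mixing weights of the chosen experts sum to 1.
    norm = sum(gate_probs[i] for i in chosen)
    mix = {i: gate_probs[i] / norm for i in chosen}
    output = [0.0] * len(expert_weights[0])
    for i, weight in mix.items():
        expert_out = matvec(expert_weights[i], profile)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, mix


if __name__ == "__main__":
    rng = random.Random(0)
    gate = make_matrix(NUM_EXPERTS, NUM_FEATURES, rng)
    experts = [make_matrix(NUM_FEATURES, NUM_FEATURES, rng) for _ in range(NUM_EXPERTS)]
    profile = [rng.random() for _ in range(NUM_FEATURES)]  # placeholder seeker profile
    output, mix = moe_forward(profile, gate, experts)
    print("active experts and weights:", mix)
```

Because only the selected experts contribute to the output, different profiles can activate different parameter subsets, which is the intuition behind separating seeker behaviors into specialized subspaces. In a trained language model the gate would typically operate on hidden states inside the network rather than on a raw profile vector.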