CL | Jun 21, 2024

ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models

arXiv:2406.14952v3 | 31 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for reliable evaluation of LLMs in emotion support applications, which is crucial for mental health and well-being, though it is incremental as it builds on existing role-playing and evaluation methods.

The paper tackles the problem of evaluating emotion support conversations (ESC) in large language models (LLMs) by proposing ESC-Eval, a framework in which a role-playing agent interacts with ESC models and the resulting dialogues are evaluated manually. It finds that ESC-oriented LLMs outperform general AI-assistant LLMs but still lag behind human performance. ESC-RANK, a scoring model that automates the evaluation, surpasses GPT-4 by 35 points.

Emotion Support Conversation (ESC) is a crucial application that aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as ESC models. However, how to evaluate these LLM-based ESC models remains unclear. Inspired by the impressive development of role-playing agents, we propose an ESC evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by a manual evaluation of the interactive dialogues. In detail, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a specific role-playing model called ESC-Role, which behaves more like a confused person than GPT-4 does. Third, using ESC-Role and the organized role cards, we systematically conduct experiments with 14 LLMs as the ESC models, including general AI-assistant LLMs (e.g., ChatGPT) and ESC-oriented LLMs (e.g., ExTES-Llama). We conduct comprehensive human annotations on the interactive multi-turn dialogues of the different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but they still lag behind human performance. Moreover, to automate the scoring process for future ESC models, we developed ESC-RANK, which is trained on the annotated data and achieves a scoring performance that surpasses GPT-4 by 35 points. Our data and code are available at https://github.com/AIFlames/Esc-Eval.
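The interaction step described in the abstract (a role-playing agent, conditioned on a role card, talking to an ESC model over several turns) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `simulate_dialogue`, `toy_seeker`, and `toy_supporter` are hypothetical names, and the two toy functions are stand-ins for ESC-Role and the ESC model under test, which in practice would be LLM calls.

```python
from typing import Callable, List, Tuple

# One dialogue turn is a (speaker, utterance) pair.
Turn = Tuple[str, str]
# Hypothetical interfaces: the seeker sees its role card plus the history,
# the supporter (the ESC model under evaluation) sees only the history.
Seeker = Callable[[str, List[Turn]], str]
Supporter = Callable[[List[Turn]], str]


def simulate_dialogue(role_card: str, seeker: Seeker, supporter: Supporter,
                      max_turns: int = 3) -> List[Turn]:
    """Alternate seeker and supporter utterances for max_turns rounds."""
    history: List[Turn] = []
    for _ in range(max_turns):
        history.append(("seeker", seeker(role_card, history)))
        history.append(("supporter", supporter(history)))
    return history


def toy_seeker(role_card: str, history: List[Turn]) -> str:
    # Stand-in for a role-playing model: emits a numbered message.
    turn = len(history) // 2 + 1
    return f"[{role_card}] seeker message {turn}"


def toy_supporter(history: List[Turn]) -> str:
    # Stand-in for an ESC model: reflects the last seeker utterance.
    return "I hear you: " + history[-1][1]


dialogue = simulate_dialogue("exam stress", toy_seeker, toy_supporter, max_turns=2)
for speaker, text in dialogue:
    print(f"{speaker}: {text}")
```

The collected `dialogue` is what would then be handed to human annotators, or to an automatic scorer, for rating. Keeping the seeker and supporter as plain callables is a design choice of this sketch so that different ESC models can be swapped in against the same role cards.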

Code Implementations | 3 repos
