TEA-Bench: A Systematic Benchmarking of Tool-enhanced Emotional Support Dialogue Agent
This work addresses the need for more reliable emotional support agents by integrating external tools, though the contribution is incremental because it builds on existing emotional support conversation (ESC) systems.
The authors tackled the problem of emotional support conversation systems lacking factual grounding by introducing TEA-Bench, a benchmark for tool-augmented agents, and found that tool use generally improves support quality and reduces hallucination, with gains dependent on model capacity.
Emotional Support Conversation (ESC) requires not only affective expression but also grounded instrumental support to provide trustworthy guidance. However, existing ESC systems and benchmarks largely focus on affective support in text-only settings, overlooking how external tools can enable factual grounding and reduce hallucination in multi-turn emotional support. We introduce TEA-Bench, the first interactive benchmark for evaluating tool-augmented agents in ESC, featuring realistic emotional scenarios, an MCP-style tool environment, and process-level metrics that jointly assess the quality and factual grounding of emotional support. Experiments on nine LLMs show that tool augmentation generally improves emotional support quality and reduces hallucination, but the gains are strongly capacity-dependent: stronger models use tools more selectively and effectively, while weaker models benefit only marginally. We further release TEA-Dialog, a dataset of tool-enhanced ESC dialogues, and find that supervised fine-tuning improves in-distribution support but generalizes poorly. Our results underscore the importance of tool use in building reliable emotional support agents.