CL · LG · Jul 31, 2024

Synth-Empathy: Towards High-Quality Synthetic Empathy Data

arXiv:2407.21669v2 · 12 citations · h-index: 11
AI Analysis

This work addresses the challenge of generating high-quality empathetic data for LLMs. The contribution is incremental: it builds on existing methods for data synthesis and selection.

The paper tackles the problem of insufficient and costly human-labeled empathetic data by introducing Synth-Empathy, an LLM-based pipeline that automatically generates and selects high-quality synthetic empathetic data, achieving state-of-the-art results on multiple benchmarks and improving empathetic response performance.

In recent years, with the rapid advancement of large language models (LLMs), achieving excellent empathetic response capabilities has become a crucial prerequisite. Consequently, managing and understanding empathetic datasets has gained increasing significance. However, empathetic data are typically human-labeled, which leads to insufficient datasets and costly human labor. In this work, we present Synth-Empathy, an LLM-based pipeline for data generation and for quality and diversity selection, which automatically generates high-quality empathetic data while discarding low-quality data. With data generated from a low-empathy model, we further improve empathetic response performance and achieve state-of-the-art (SoTA) results across multiple benchmarks. Moreover, our model achieves SoTA performance on various human evaluation benchmarks, demonstrating its effectiveness and robustness in real-world applications. Furthermore, we show the trade-off between data quantity and quality, providing insights into empathetic data generation and selection.
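
The abstract describes a selection stage that keeps high-quality, diverse samples and discards the rest, but it does not say how quality or diversity is measured. The sketch below is only an illustration of that general idea, not the authors' method: `toy_quality` (a cue-word ratio), `bow_embed` (a bag-of-words vector), `select_examples`, and both thresholds are hypothetical stand-ins for whatever scorer and embedding a real pipeline would use.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def bow_embed(text: str) -> Dict[str, int]:
    """Hypothetical embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_quality(text: str) -> float:
    """Hypothetical quality score: fraction of words that are empathy cues."""
    cues = {"sorry", "understand", "hear", "feel"}
    words = text.lower().split()
    hits = sum(w.strip(".,!?") in cues for w in words)
    return hits / max(len(words), 1)


def select_examples(
    candidates: List[str],
    quality_fn: Callable[[str], float],
    embed_fn: Callable[[str], Dict[str, int]],
    min_quality: float,
    max_similarity: float,
) -> List[str]:
    """Drop low-scoring candidates, then greedily keep non-redundant ones."""
    scored = [(quality_fn(c), c) for c in candidates]
    scored = [(s, c) for s, c in scored if s >= min_quality]  # quality filter
    scored.sort(key=lambda pair: pair[0], reverse=True)       # best first

    kept: List[str] = []
    kept_vecs: List[Dict[str, int]] = []
    for _, text in scored:
        vec = embed_fn(text)
        # Diversity filter: skip anything too close to an already kept sample.
        if all(cosine(vec, v) <= max_similarity for v in kept_vecs):
            kept.append(text)
            kept_vecs.append(vec)
    return kept


if __name__ == "__main__":
    synthetic = [
        "I am so sorry to hear that.",
        "I am so sorry to hear that!",
        "The weather is nice today.",
        "I understand how hard that must feel.",
    ]
    for line in select_examples(synthetic, toy_quality, bow_embed, 0.1, 0.8):
        print(line)
```

Raising `min_quality` or lowering `max_similarity` shrinks the retained set, which is one simple way to explore the quantity versus quality trade-off the abstract mentions.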

Code Implementations: 1 repo
