CL · Feb 25, 2025

SYNTHEMPATHY: A Scalable Empathy Corpus Generated Using LLMs Without Any Crowdsourcing

arXiv:2502.17857v1 · 4 citations · h-index: 4
Originality: Incremental advance
AI Analysis

This paper addresses the expense and poor scalability of creating empathy data for AI developers, though the advance is incremental because it builds on existing LLM-based generation methods.

The researchers tackled the lack of large empathetic dialogue corpora for fine-tuning LLMs by generating SYNTHEMPATHY, a corpus of 105k empathetic responses produced by LLMs without crowdsourcing. A Mistral 7B model fine-tuned on the corpus showed an increase in average empathy score.

Previous research has shown that humans are more receptive towards language models that exhibit empathetic behavior. While empathy is essential for developing helpful dialogue agents, very few large corpora containing empathetic dialogues are available for fine-tuning LLMs. The few existing corpora have largely relied on crowdsourcing to simulate empathetic conversations, a process that is expensive, time-consuming, and not scalable to larger datasets. We propose a data generation framework for developing SYNTHEMPATHY, a large corpus containing 105k empathetic responses to real-life situations compiled through LLM generation. A base Mistral 7B model fine-tuned on our SYNTHEMPATHY corpus exhibits an increase in the average empathy score.

