AI | CL | CV | HC | MM | Aug 18, 2025

E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model

arXiv:2508.12854v1 | 3 citations | h-index: 14 | Has Code | MM
Originality: Synthesis-oriented
AI Analysis

This work targets emotional intelligence in AI systems for users who need empathetic interaction. It is an incremental improvement that integrates existing models rather than introducing new ones.

The paper tackles empathetic response generation in multimodal human-computer interaction by proposing E3RG, a system that decomposes the task into understanding, memory retrieval, and generation. Without additional training, it ranked first in the Avatar-based Multimodal Empathy Challenge at ACM MM 25.

Multimodal Empathetic Response Generation (MERG) is crucial for building emotionally intelligent human-computer interactions. Although large language models (LLMs) have improved text-based ERG, challenges remain in handling multimodal emotional content and maintaining identity consistency. Thus, we propose E3RG, an Explicit Emotion-driven Empathetic Response Generation System based on multimodal LLMs, which decomposes the MERG task into three parts: multimodal empathy understanding, empathy memory retrieval, and multimodal response generation. By integrating advanced expressive speech and video generative models, E3RG delivers natural, emotionally rich, and identity-consistent responses without extra training. Experiments validate the superiority of our system in both zero-shot and few-shot settings, securing the Top-1 position in the Avatar-based Multimodal Empathy Challenge at ACM MM 25. Our code is available at https://github.com/RH-Lin/E3RG.
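The three-part decomposition named in the abstract can be pictured as a small pipeline. The following is a toy sketch only, not the E3RG implementation: every name in it (`Turn`, `understand`, `retrieve`, `generate`, `respond`, `EMOTION_KEYWORDS`) is hypothetical, a keyword lexicon stands in for the multimodal LLM's emotion prediction, token overlap stands in for memory retrieval, and a text template stands in for the speech and video generators.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical lexicon; a real system would query a multimodal LLM instead.
EMOTION_KEYWORDS = {
    "sad": {"lost", "lonely", "miss", "cry"},
    "joyful": {"won", "happy", "excited", "promotion"},
}


@dataclass
class Turn:
    speaker: str
    text: str


@dataclass
class EmpatheticReply:
    emotion: str           # explicit label passed on to downstream generators
    text: str
    memory: Optional[str]  # retrieved memory entry, if any


def tokens(text: str) -> set:
    """Lowercased word set used by both toy stages."""
    return set(re.findall(r"[a-z']+", text.lower()))


def understand(dialogue: List[Turn]) -> str:
    """Stage 1 (assumed): derive an explicit emotion label from the last turn."""
    words = tokens(dialogue[-1].text)
    best, best_hits = "neutral", 0
    for emotion, keys in EMOTION_KEYWORDS.items():
        hits = len(words & keys)
        if hits > best_hits:
            best, best_hits = emotion, hits
    return best


def retrieve(query: str, memory_bank: List[str]) -> Optional[str]:
    """Stage 2 (assumed): return the memory entry with the highest Jaccard overlap."""
    q = tokens(query)
    best, best_score = None, 0.0
    for entry in memory_bank:
        m = tokens(entry)
        union = q | m
        score = len(q & m) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = entry, score
    return best


def generate(emotion: str, memory: Optional[str]) -> EmpatheticReply:
    """Stage 3 (assumed): template text; the emotion label would also
    condition speech and video models, which this sketch omits."""
    openers = {"sad": "I'm so sorry to hear that.", "joyful": "That's wonderful news!"}
    text = openers.get(emotion, "Thank you for sharing that.")
    if memory:
        text += f" I remember: {memory}"
    return EmpatheticReply(emotion=emotion, text=text, memory=memory)


def respond(dialogue: List[Turn], memory_bank: List[str]) -> EmpatheticReply:
    """Chain the three stages without any training step."""
    emotion = understand(dialogue)
    memory = retrieve(dialogue[-1].text, memory_bank)
    return generate(emotion, memory)
```

Calling `respond([Turn("user", "I lost my dog last week and I miss him.")], bank)` with a small list of memory strings returns a reply whose `emotion` field carries the explicit label, which is the piece a downstream speech or video generator would consume.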

Code Implementations: 1 repo
