CV | AI | HC | Dec 23, 2024

AV-EmoDialog: Chat with Audio-Visual Users Leveraging Emotional Cues

arXiv:2412.17292v1 | 3 citations | h-index: 19
Originality: Incremental advance
AI Analysis

This work addresses the need for more empathetic human-computer interaction, though it appears incremental because it builds on existing multimodal methods.

The paper tackles the problem of generating emotionally and contextually appropriate responses in dialogue systems by leveraging audio-visual inputs; its experiments show that AV-EmoDialog outperforms existing multimodal LLMs.

In human communication, both verbal and non-verbal cues play a crucial role in conveying emotions, intentions, and meaning beyond words alone. Non-linguistic information, such as facial expressions, eye contact, voice tone, and pitch, is fundamental to effective interaction, enriching conversations by adding emotional and contextual depth. Recognizing the importance of non-linguistic content in communication, we present AV-EmoDialog, a dialogue system designed to exploit verbal and non-verbal information from users' audio-visual inputs to generate more responsive and empathetic interactions. AV-EmoDialog systematically exploits the emotional cues in audio-visual dialogues: it extracts speech content and emotional tones from speech, analyzes fine-grained facial expressions from visuals, and integrates these cues to generate emotionally aware responses in an end-to-end manner. Through extensive experiments, we validate that the proposed AV-EmoDialog outperforms existing multimodal LLMs in generating responses that are both emotionally and contextually appropriate.
