CV SD | May 5, 2025

VAEmo: Efficient Representation Learning for Visual-Audio Emotion with Knowledge Injection

Hao Cheng, Zhiwei Zhao, Yichao He, Zhenzhen Hu, Jia Li, Meng Wang, Richang Hong

arXiv:2505.02331v2 | 11.89 citations | h-index: 3 | Has Code | MM

Originality: Incremental advance

AI Analysis

This work addresses ambiguous emotional expressions and scarce annotated data in emotion recognition, with applications such as human-computer interaction. The advance appears incremental, as it builds on existing self-supervised methods.

The paper tackles the challenge of audiovisual emotion recognition by proposing VAEmo, a two-stage framework that uses knowledge injection to learn efficient joint visual-audio representations, achieving state-of-the-art performance on multiple benchmarks.

Audiovisual emotion recognition (AVER) aims to infer human emotions from nonverbal visual-audio (VA) cues, offering modality-complementary and language-agnostic advantages. However, AVER remains challenging due to the inherent ambiguity of emotional expressions, cross-modal expressive disparities, and the scarcity of reliably annotated data. Recent self-supervised AVER approaches have introduced strong multimodal representations, yet they predominantly rely on modality-specific encoders and coarse content-level alignment, limiting fine-grained emotional semantic modeling. To address these issues, we propose VAEmo, an efficient two-stage framework for emotion-centric joint VA representation learning with external knowledge injection. In Stage 1, a unified and lightweight representation network is pre-trained on large-scale speaker-centric VA corpora via masked reconstruction and contrastive objectives, mitigating the modality gap and learning expressive, complementary representations without emotion labels. In Stage 2, multimodal large language models automatically generate detailed affective descriptions according to our well-designed chain-of-thought prompting for only a small subset of VA samples; these rich textual semantics are then injected by aligning their corresponding embeddings with VA representations through dual-path contrastive learning, further bridging the emotion gap. Extensive experiments on multiple downstream AVER benchmarks show that VAEmo achieves state-of-the-art performance with a compact design, highlighting the benefit of unified cross-modal encoding and emotion-aware semantic guidance for efficient, generalizable VA emotion representations.
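
The abstract describes aligning text-description embeddings with VA representations through contrastive learning but does not give the loss itself. As a rough illustration only, the sketch below implements a generic symmetric InfoNCE objective, which is one common way to align two sets of paired embeddings. It is not the paper's dual-path formulation: the function names, the temperature value of 0.07, and the toy vectors are all assumptions made for this example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(va_batch, text_batch, temperature=0.07):
    """Generic symmetric InfoNCE over paired embeddings (hypothetical helper).

    va_batch[i] and text_batch[i] are treated as a positive pair; every
    other combination in the batch acts as a negative.
    """
    va = [l2_normalize(v) for v in va_batch]
    text = [l2_normalize(t) for t in text_batch]
    # Cosine similarities scaled by the temperature (an assumed value).
    logits = [
        [sum(x * y for x, y in zip(v, t)) / temperature for t in text]
        for v in va
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-VA direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy data: two samples with 2-D embeddings.
va_embeddings = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_info_nce(va_embeddings, aligned_text)
shuffled_loss = symmetric_info_nce(va_embeddings, shuffled_text)
```

On this toy data the loss is lower when each VA embedding is paired with its matching text embedding than when the pairs are shuffled, which is the behavior a contrastive alignment objective is meant to reward.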
