CL AI Oct 7, 2025

Centering Emotion Hotspots: Multimodal Local-Global Fusion and Cross-Modal Alignment for Emotion Recognition in Conversations

Yu Liu, Hanlei Shi, Haoxun Li, Yuqing Sun, Yuxuan Ding, Linlin Gong, Leyuan Qu, Taihao Li

arXiv:2510.08606v1 2.7 h-index: 8

Originality: Highly original

AI Analysis

This work addresses the challenge of sparse and asynchronous multimodal evidence in ERC, offering a new perspective for future research.

The paper tackles Emotion Recognition in Conversations (ERC) by centering the model on emotion hotspots, and it reports consistent gains over strong baselines on standard benchmarks.

Emotion Recognition in Conversations (ERC) is hard because discriminative evidence is sparse, localized, and often asynchronous across modalities. We center ERC on emotion hotspots and present a unified model that detects per-utterance hotspots in text, audio, and video, fuses them with global features via Hotspot-Gated Fusion (HGF), and aligns modalities using a routed Mixture-of-Aligners (MoA); a cross-modal graph encodes conversational structure. This design focuses modeling on salient spans, mitigates misalignment, and preserves context. Experiments on standard ERC benchmarks show consistent gains over strong baselines, with ablations confirming the contributions of HGF and MoA. Our results point to a hotspot-centric view that can inform future multimodal learning, offering a new perspective on modality fusion in ERC.
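The abstract does not spell out how hotspots are detected or how the gate is computed, so the following is only a minimal sketch of the general idea of gating local "hotspot" features against a global summary, not the authors' implementation. Everything in it is an assumption made for illustration: the function names (`top_k_hotspots`, `gated_fusion`), the use of the frame L2 norm as a saliency score, mean pooling, and a single scalar sigmoid gate with hand-set weights.

```python
import math


def mean_pool(frames):
    """Average a list of equal-length feature vectors."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def top_k_hotspots(frames, k):
    """Pick the k frames with the largest L2 norm.

    Assumption: the norm is a stand-in saliency score; the paper's
    hotspot detector is not described in the abstract.
    """
    ranked = sorted(
        frames,
        key=lambda f: math.sqrt(sum(x * x for x in f)),
        reverse=True,
    )
    return ranked[:k]


def gated_fusion(local, global_, weights, bias=0.0):
    """Blend local and global vectors with one scalar sigmoid gate.

    Assumption: the gate is a linear function of the concatenated
    [local; global] vector; in a trained model the weights are learned.
    """
    z = sum(w * x for w, x in zip(weights, local + global_)) + bias
    gate = 1.0 / (1.0 + math.exp(-z))
    fused = [gate * l + (1.0 - gate) * g for l, g in zip(local, global_)]
    return fused, gate


# Toy utterance: four frames with two features each (made-up numbers).
frames = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1], [1.5, 2.0]]
local = mean_pool(top_k_hotspots(frames, k=2))  # summary of salient frames
global_ = mean_pool(frames)                     # summary of the whole utterance
# Zero weights give a neutral gate of 0.5, i.e. an even blend.
fused, gate = gated_fusion(local, global_, weights=[0.0, 0.0, 0.0, 0.0])
print(gate, fused)
```

With a learned gate, the model could lean on the hotspot summary when a few frames carry the emotional cue and fall back to the global summary otherwise, which is the intuition the abstract gives for focusing on salient spans while preserving context.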
