Group Gated Fusion on Attention-based Bidirectional Alignment for Multimodal Emotion Recognition
This work addresses a critical bottleneck in emotion-aware human-computer interaction systems by improving the alignment of multimodal data, though it appears incremental because it builds on existing LSTM and attention methods.
The paper tackles the problem of temporal alignment between speech and text in multimodal emotion recognition by proposing a Gated Bidirectional Alignment Network (GBAN) with attention-based alignment and group gated fusion, achieving state-of-the-art performance on the IEMOCAP dataset.
Emotion recognition is a challenging and actively studied research area that plays a critical role in emotion-aware human-computer interaction systems. In a multimodal setting, temporal alignment between different modalities has not yet been well investigated. This paper presents a new model named Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states to explicitly capture the alignment relationship between speech and text, and a novel group gated fusion (GGF) layer to integrate the representations of different modalities. We empirically show that the attention-aligned representations significantly outperform the last hidden states of the LSTM, and that the proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.
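To make the two components named in the abstract more concrete, the following is a minimal, illustrative sketch in plain Python, not the authors' implementation. The abstract does not give the equations, so several choices here are assumptions: dot-product attention scoring, mean pooling over time, a single fusion group with softmax-normalized scalar gates, and random vectors standing in for LSTM hidden states. All function names (`align`, `mean_pool`, `gated_fusion`) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def align(queries, keys):
    """For every query step, attend over all key steps and return the
    attention-weighted sum of key vectors (assumed dot-product scoring)."""
    dim = len(keys[0])
    aligned = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        aligned.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return aligned


def mean_pool(sequence):
    """Average a sequence of vectors over time (assumed pooling choice)."""
    steps = len(sequence)
    return [sum(v[d] for v in sequence) / steps for d in range(len(sequence[0]))]


def gated_fusion(representations, gate_weights):
    """Fuse vectors with one scalar gate each; gates are softmax-normalized
    so the output is a convex combination (an assumed, simplified gate)."""
    gates = softmax([dot(w, r) for w, r in zip(gate_weights, representations)])
    dim = len(representations[0])
    fused = [
        sum(g * r[d] for g, r in zip(gates, representations)) for d in range(dim)
    ]
    return fused, gates


# Stand-ins for LSTM hidden states: 6 speech frames and 3 text tokens.
random.seed(0)
dim = 4
speech = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]
text = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]

# Bidirectional alignment: each modality queries the other.
speech_aligned_to_text = align(text, speech)   # one speech summary per token
text_aligned_to_speech = align(speech, text)   # one text summary per frame

# Pool each aligned sequence, then fuse the two pooled vectors.
pooled = [mean_pool(speech_aligned_to_text), mean_pool(text_aligned_to_speech)]
gate_weights = [[random.uniform(-1, 1) for _ in range(dim)] for _ in pooled]
fused, gates = gated_fusion(pooled, gate_weights)
```

In this sketch the two alignment directions produce sequences of different lengths (one per text token, one per speech frame), which is why a pooling step precedes fusion. In a trained model the gate weights would be learned rather than sampled, and the fusion layer described in the paper may group and gate the representations differently.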