Multimodal Affective Analysis Using Hierarchical Attention Strategy with Word-Level Alignment
This work addresses the problem of improving emotion recognition accuracy for applications in human-computer interaction, though it appears incremental as it builds on existing multimodal fusion methods.
The paper tackles multimodal affective computing by introducing a hierarchical architecture with attention and word-level fusion for sentiment and emotion classification from text and audio, and it reports state-of-the-art performance on published datasets.
Multimodal affective computing, learning to recognize and interpret human affects and subjective information from multiple data sources, is still challenging because: (i) it is hard to extract informative features to represent human affects from heterogeneous inputs; (ii) current fusion strategies only fuse different modalities at an abstract level, ignoring time-dependent interactions between modalities. To address these issues, we introduce a hierarchical multimodal architecture with attention and word-level fusion to classify utterance-level sentiment and emotion from text and audio data. Our model outperforms state-of-the-art approaches on published datasets, and we demonstrate that it can visualize and interpret the synchronized attention over modalities.
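To make the idea of word-level fusion with attention more concrete, the following is a minimal NumPy sketch, not the architecture described in the paper. It assumes that text and audio features have already been aligned so that row i of each matrix describes the same word. The context vectors, the dot-product scoring, and the concatenation-then-sum fusion are illustrative assumptions, and every function name here is hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def word_attention(features, context):
    # features: (n_words, dim); context: (dim,) stands in for a learned
    # query vector (an assumption; here it is just a fixed array).
    # Returns one attention weight per word; the weights sum to 1.
    return softmax(features @ context)

def fuse_word_level(text_feats, audio_feats, text_ctx, audio_ctx):
    # Word-level alignment: both matrices must have one row per word.
    assert text_feats.shape[0] == audio_feats.shape[0]
    a_text = word_attention(text_feats, text_ctx)
    a_audio = word_attention(audio_feats, audio_ctx)
    # Scale each word's features by its modality-specific attention weight,
    # then concatenate the two modalities word by word.
    fused = np.concatenate(
        [a_text[:, None] * text_feats, a_audio[:, None] * audio_feats],
        axis=1,
    )
    # Sum over words to obtain one utterance-level vector.
    return fused.sum(axis=0), a_text, a_audio

# Toy input: 5 aligned words, 8-dim text and 6-dim audio features (random).
rng = np.random.default_rng(0)
n_words, d_text, d_audio = 5, 8, 6
text_feats = rng.normal(size=(n_words, d_text))
audio_feats = rng.normal(size=(n_words, d_audio))
utt, a_text, a_audio = fuse_word_level(
    text_feats, audio_feats,
    rng.normal(size=d_text), rng.normal(size=d_audio),
)
```

Because the per-word weights `a_text` and `a_audio` share the same word index, they can be plotted side by side to see which words each modality emphasizes, which is the kind of synchronized attention inspection the abstract refers to. A classifier over `utt` would be needed to produce sentiment or emotion labels and is omitted here.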