Exploring Attention Mechanisms for Multimodal Emotion Recognition in an Emergency Call Center Corpus
This work addresses emotion detection for emergency call centers, but it is incremental as it applies existing methods to a new dataset.
The paper tackled multimodal emotion recognition on the CEMO emergency call center corpus by exploring fusion strategies with cross-attention mechanisms, achieving an absolute gain of 4-9% over single-modality approaches and finding that audio encodes more emotive information than text.
Emotion detection technology that enhances human decision-making is an important research issue for real-world applications, but real-life emotion datasets are relatively rare and small. The experiments in this paper use the CEMO corpus, which was collected in a French emergency call center. Two pre-trained models, one speech-based and one text-based, were fine-tuned for speech emotion recognition. Using pre-trained Transformer encoders mitigates the limited and sparse nature of our data. This paper explores different fusion strategies for these modality-specific models. In particular, fusions with and without cross-attention mechanisms were tested to gather the most relevant information from both the speech and text encoders. We show that multimodal fusion brings an absolute gain of 4-9% with respect to either single modality and that the symmetric multi-headed cross-attention mechanism performed better than classical late fusion approaches. Our experiments also suggest that, for the real-life CEMO corpus, the audio component encodes more emotive information than the textual one.
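The symmetric cross-attention fusion mentioned in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the paper's implementation. It assumes identity projections (standard multi-head attention learns query, key, and value projections), forms heads by slicing the feature dimension, and uses mean pooling followed by concatenation as the fusion step. All function names and the toy numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values, num_heads):
    """Scaled dot-product cross-attention with identity projections (assumption).

    queries:     vectors from one modality (Tq vectors of size D)
    keys_values: vectors from the other modality (Tk vectors of size D)
    Heads are formed by slicing the D features into num_heads equal chunks.
    """
    dim = len(queries[0])
    assert dim % num_heads == 0, "feature size must be divisible by num_heads"
    head_dim = dim // num_heads
    output = []
    for q in queries:
        attended = []
        for h in range(num_heads):
            lo, hi = h * head_dim, (h + 1) * head_dim
            scores = [
                sum(a * b for a, b in zip(q[lo:hi], k[lo:hi])) / math.sqrt(head_dim)
                for k in keys_values
            ]
            weights = softmax(scores)
            # Weighted sum of the other modality's features for this head.
            for d in range(lo, hi):
                attended.append(sum(w * k[d] for w, k in zip(weights, keys_values)))
        output.append(attended)
    return output


def mean_pool(sequence):
    """Average a sequence of vectors over time."""
    n = len(sequence)
    return [sum(vec[d] for vec in sequence) / n for d in range(len(sequence[0]))]


def symmetric_cross_attention_fusion(speech, text, num_heads=2):
    """Attend in both directions, pool each result, and concatenate."""
    speech_to_text = cross_attention(speech, text, num_heads)
    text_to_speech = cross_attention(text, speech, num_heads)
    return mean_pool(speech_to_text) + mean_pool(text_to_speech)


# Toy inputs (made-up numbers): 3 speech frames and 2 text tokens, 4 features each.
speech = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.1, 0.1]]
text = [[0.2, 0.0, 0.1, 0.5], [0.4, 0.4, 0.2, 0.0]]
fused = symmetric_cross_attention_fusion(speech, text)
print(len(fused))  # 8: four pooled features from each attention direction
```

In a full system, the fused vector would feed a classification layer, and the inputs would be hidden states from the fine-tuned speech and text encoders rather than hand-written lists. A late fusion baseline, by contrast, would combine the two modality-specific outputs without letting either sequence attend to the other.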