CL, SD, AS · Mar 31, 2022

MMER: Multimodal Multi-task Learning for Speech Emotion Recognition

arXiv:2203.16794v5 · 28 citations
Originality: Incremental advance
AI Analysis

This paper addresses emotion recognition from speech, which matters for applications such as human-computer interaction, but the contribution appears incremental because it builds on existing multimodal and multi-task learning techniques.

The paper tackles speech emotion recognition by proposing MMER, a multimodal multi-task learning approach that uses early-fusion and cross-modal self-attention between text and acoustic modalities with three auxiliary tasks, achieving state-of-the-art performance on the IEMOCAP benchmark.

In this paper, we propose MMER, a novel Multimodal Multi-task learning approach for Speech Emotion Recognition. MMER leverages a novel multimodal network based on early-fusion and cross-modal self-attention between text and acoustic modalities and solves three novel auxiliary tasks for learning emotion recognition from spoken utterances. In practice, MMER outperforms all our baselines and achieves state-of-the-art performance on the IEMOCAP benchmark. Additionally, we conduct extensive ablation studies and results analysis to prove the effectiveness of our proposed approach.
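The cross-modal attention idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention with no learned projections, uses concatenation as a stand-in for the fusion step, and the names `cross_modal_attention` and `fuse` as well as the toy vectors are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(queries, keys_values):
    """Each query vector (one modality) attends over all vectors of the other.

    Assumption: no learned query/key/value projections, so both modalities
    must already share the same embedding dimension.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    attended = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Weighted sum of the other modality's vectors.
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(d)]
        )
    return attended

def fuse(text, audio):
    """Concatenate each text vector with its audio-attended context.

    Concatenation is an illustrative choice, not the paper's stated fusion.
    """
    text_ctx = cross_modal_attention(text, audio)
    return [t + c for t, c in zip(text, text_ctx)]

# Toy inputs: 2 text token embeddings and 3 acoustic frame embeddings (dim 2).
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

fused = fuse(text, audio)
print(len(fused), len(fused[0]))  # → 2 4
```

In this sketch the text tokens act as queries and the acoustic frames as keys and values; swapping the arguments gives the audio-to-text direction. A real model would add learned projections, multiple heads, and the auxiliary-task losses, none of which are specified in the abstract.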

Code Implementations: 1 repo
