LG | SD | AS | Mar 27, 2024

Fusion approaches for emotion recognition from speech using acoustic and text-based features

arXiv:2403.18635v1 | 55 citations | h-index: 35 | ICASSP
Originality: Synthesis-oriented
AI Analysis

This work addresses emotion recognition for speech analysis applications. Its contribution is incremental: it builds on existing fusion methods and highlights dataset-specific validation issues.

The paper tackles emotion recognition from speech by combining acoustic and text-based features. Fusion improves performance on the IEMOCAP and MSP-PODCAST datasets, though the different fusion strategies yield only subtle differences. The paper also shows that the standard cross-validation folds for IEMOCAP lead to optimistic performance estimates for the text-based system.
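To make the idea of combining the two modalities concrete, here is a minimal sketch of one generic strategy, score-level (late) fusion, in which the class posteriors of an acoustic system and a text system are averaged with a weight. This is an illustration only: the summary does not say which fusion strategies the paper evaluates, and the function names, the four emotion labels, and the probability values are all hypothetical.

```python
from typing import List, Sequence


def late_fusion(acoustic_probs: Sequence[float],
                text_probs: Sequence[float],
                text_weight: float = 0.5) -> List[float]:
    """Weighted average of two per-class posterior vectors (hypothetical helper)."""
    if len(acoustic_probs) != len(text_probs):
        raise ValueError("both systems must score the same set of classes")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    return [
        (1.0 - text_weight) * a + text_weight * t
        for a, t in zip(acoustic_probs, text_probs)
    ]


def predict(probs: Sequence[float], labels: Sequence[str]) -> str:
    """Return the label with the highest fused score."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best]


# Illustrative values only; they are not results from the paper.
labels = ["angry", "happy", "neutral", "sad"]
acoustic = [0.40, 0.10, 0.35, 0.15]   # acoustic system alone favours "angry"
text = [0.10, 0.15, 0.60, 0.15]       # text system alone favours "neutral"

fused = late_fusion(acoustic, text, text_weight=0.5)
decision = predict(fused, labels)
```

With equal weights the fused decision follows whichever system is more confident. In practice `text_weight` would be tuned on held-out data, or the averaging would be replaced by a learned combination.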

In this paper, we study different approaches for classifying emotions from speech using acoustic and text-based features. We propose to obtain contextualized word embeddings with BERT to represent the information contained in speech transcriptions and show that this results in better performance than using GloVe embeddings. We also propose and compare different strategies to combine the audio and text modalities, evaluating them on the IEMOCAP and MSP-PODCAST datasets. We find that fusing acoustic and text-based systems is beneficial on both datasets, though only subtle differences are observed across the evaluated fusion approaches. Finally, for IEMOCAP, we show the large effect that the criteria used to define the cross-validation folds have on results. In particular, the standard way of creating folds for this dataset results in a highly optimistic estimation of performance for the text-based system, suggesting that some previous works may overestimate the advantage of incorporating transcriptions.
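The point about fold criteria can be illustrated with a small sketch. The abstract does not spell out which criteria are compared, so the code below assumes, purely for illustration, that each utterance has a speaker and a script identifier, and that folds can be grouped by either one. The field names `speaker` and `script`, the helper functions, and the toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Set


def grouped_folds(samples: Sequence[Dict[str, str]],
                  group_key: str,
                  n_folds: int) -> List[List[int]]:
    """Assign sample indices to folds so that no group spans two folds."""
    groups = defaultdict(list)
    for idx, sample in enumerate(samples):
        groups[sample[group_key]].append(idx)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest groups first, each into the smallest fold.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), str(kv[0])))
    for _, members in ordered:
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members)
    return folds


def shared_values(samples: Sequence[Dict[str, str]],
                  folds: Sequence[Sequence[int]],
                  key: str) -> Set[Hashable]:
    """Values of `key` that occur in more than one fold (a potential leak)."""
    seen = defaultdict(set)
    for fold_id, members in enumerate(folds):
        for idx in members:
            seen[samples[idx][key]].add(fold_id)
    return {value for value, fold_ids in seen.items() if len(fold_ids) > 1}


# Toy data: two speakers each read the same two scripts.
samples = [
    {"speaker": "s1", "script": "A"},
    {"speaker": "s1", "script": "B"},
    {"speaker": "s2", "script": "A"},
    {"speaker": "s2", "script": "B"},
]

by_speaker = grouped_folds(samples, "speaker", n_folds=2)
by_script = grouped_folds(samples, "script", n_folds=2)

leaked_scripts = shared_values(samples, by_speaker, "script")
leaked_speakers = shared_values(samples, by_script, "speaker")
```

In this toy setup, grouping by speaker keeps speakers disjoint across folds but lets every script appear in both, so a text-only model could be tested on wording it saw during training. Grouping by script reverses the trade-off. Whether this matches the specific criteria studied in the paper is an assumption.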

