LG · Oct 13, 2021

Multistage linguistic conditioning of convolutional layers for speech emotion recognition

arXiv:2110.06650v2 · 21 citations
Originality: Incremental advance
AI Analysis

This work addresses emotion recognition from speech and represents an incremental improvement in multimodal fusion for domain-specific applications.

The paper tackles speech emotion recognition by proposing a multistage fusion method that integrates text and audio features at multiple layers of a deep neural network. On the MSP-Podcast and IEMOCAP datasets, it outperforms a shallow (late) fusion baseline and the unimodal models on most evaluations.

In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER). We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a deep neural network (DNN), and contrast it with a single-stage one where the streams are merged at a single point. Both methods depend on extracting summary linguistic embeddings from a pre-trained BERT model, and conditioning one or more intermediate representations of a convolutional model operating on log-Mel spectrograms. Experiments on the MSP-Podcast and IEMOCAP datasets demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline and their unimodal constituents, both in terms of quantitative performance and qualitative behaviour. Overall, our multistage fusion shows better quantitative performance, surpassing alternatives on most of our evaluations. This illustrates the potential of multistage fusion in better assimilating text and audio information.
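
The difference between single-stage and multistage conditioning described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the linguistic embedding conditions the convolutional features, so the sketch assumes an additive, per-channel bias obtained from a linear projection of the embedding. The function names (`conv_stage`, `condition`, `fuse`), the 1x1 convolution standing in for a real convolutional block, the 768-dimensional embedding, the channel sizes, and the random weights and inputs are all hypothetical choices made for illustration.

```python
import numpy as np


def conv_stage(x, weight):
    """Stand-in for a convolutional block: 1x1 convolution (channel mixing) plus ReLU.

    x has shape (C_in, F, T); weight has shape (C_out, C_in).
    """
    return np.maximum(np.einsum("oc,cft->oft", weight, x), 0.0)


def condition(feature_map, text_embedding, projection):
    """Assumed conditioning operator: project the text embedding to one value
    per channel and add it at every time-frequency position."""
    bias = projection @ text_embedding  # shape (C_out,)
    return feature_map + bias[:, None, None]


def fuse(spectrogram, text_embedding, conv_weights, projections, fusion_stages):
    """Run the audio stack, injecting the text embedding after each stage
    whose index is listed in fusion_stages."""
    x = spectrogram[None, :, :]  # add a channel axis: (1, F, T)
    for i, weight in enumerate(conv_weights):
        x = conv_stage(x, weight)
        if i in fusion_stages:
            x = condition(x, text_embedding, projections[i])
    return x.mean(axis=(1, 2))  # global average pooling, shape (C_last,)


# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
channels = [1, 8, 16, 32]
conv_weights = [
    rng.normal(scale=0.1, size=(channels[i + 1], channels[i])) for i in range(3)
]
projections = [rng.normal(scale=0.1, size=(channels[i + 1], 768)) for i in range(3)]

spectrogram = rng.normal(size=(64, 100))  # stand-in for a log-Mel spectrogram
text_embedding = rng.normal(size=768)     # stand-in for a BERT summary embedding

# Single-stage: the streams meet once, after the last stage.
single_stage = fuse(spectrogram, text_embedding, conv_weights, projections, {2})
# Multistage: the text embedding conditions every stage.
multistage = fuse(spectrogram, text_embedding, conv_weights, projections, {0, 1, 2})
```

In this sketch, the two variants share the same audio stack and differ only in the set of stages that receive the text embedding, which mirrors the contrast drawn in the abstract. A shallow (late) fusion baseline would instead combine the outputs of separately processed audio and text models.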

