CL | LG | Apr 1, 2022

Probing Speech Emotion Recognition Transformers for Linguistic Knowledge

arXiv:2204.00400v2 | 35 citations | h-index: 105 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work helps researchers and practitioners understand model behavior in speech emotion recognition (SER), but it is incremental: it probes existing models rather than proposing new methods.

The study investigated whether transformer models fine-tuned for speech emotion recognition (SER) exploit linguistic information acquired during pre-training. It found that valence predictions respond strongly to sentiment content and negations but not to intensifiers or reducers, and that none of these features affect arousal or dominance predictions.

Large, pre-trained neural networks consisting of self-attention layers (transformers) have recently achieved state-of-the-art results on several speech emotion recognition (SER) datasets. These models are typically pre-trained in a self-supervised manner with the goal of improving automatic speech recognition performance, and thus of understanding linguistic information. In this work, we investigate the extent to which this information is exploited during SER fine-tuning. Using a reproducible methodology based on open-source tools, we synthesise prosodically neutral speech utterances while varying the sentiment of the text. Valence predictions of the transformer model are very reactive to positive and negative sentiment content, as well as negations, but not to intensifiers or reducers, while none of those linguistic features impact arousal or dominance. These findings show that transformers can successfully leverage linguistic information to improve their valence predictions, and that linguistic analysis should be included in their testing.
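
The probing idea in the abstract (hold prosody neutral, vary only the text) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the word lists, the sentence template, and the names `build_probes`, `mean_shift`, and `toy_valence` are all assumptions made for the example. In a real experiment, `predict_valence` would wrap a text-to-speech system followed by the fine-tuned SER model; here a hard-coded toy scorer stands in so the sketch runs, and its numbers say nothing about any real model.

```python
from statistics import mean
from typing import Callable, Dict, List

# Illustrative vocabulary and template (assumed, not taken from the paper).
POSITIVE = ["good", "pleasant", "wonderful"]
NEGATIVE = ["bad", "unpleasant", "terrible"]
TEMPLATE = "The food was {mod}{adj}."


def build_probes() -> Dict[str, List[str]]:
    """Create text-only probe sets; each sentence would later be synthesised."""
    def fill(adjectives: List[str], mod: str = "") -> List[str]:
        return [TEMPLATE.format(mod=mod, adj=a) for a in adjectives]

    return {
        "positive": fill(POSITIVE),
        "negative": fill(NEGATIVE),
        "negated_positive": fill(POSITIVE, "not "),
        "intensified_positive": fill(POSITIVE, "very "),
        "reduced_positive": fill(POSITIVE, "somewhat "),
    }


def mean_shift(
    predict_valence: Callable[[str], float],
    probes: Dict[str, List[str]],
    baseline: str = "positive",
) -> Dict[str, float]:
    """Mean predicted valence of each probe set minus that of the baseline set."""
    base = mean(predict_valence(s) for s in probes[baseline])
    return {
        name: mean(predict_valence(s) for s in sentences) - base
        for name, sentences in probes.items()
        if name != baseline
    }


def toy_valence(sentence: str) -> float:
    """Hypothetical stand-in for 'synthesise speech, then run the SER model'.

    Hard-coded lexicon scoring with negation handling; it ignores
    intensifiers and reducers by construction, so it is not evidence.
    """
    words = sentence.lower().rstrip(".").split()
    polarity = 0.0
    if any(w in POSITIVE for w in words):
        polarity = 0.3
    elif any(w in NEGATIVE for w in words):
        polarity = -0.3
    if "not" in words:
        polarity = -polarity
    return 0.5 + polarity


if __name__ == "__main__":
    for name, shift in mean_shift(toy_valence, build_probes()).items():
        print(f"{name}: {shift:+.2f}")
```

Comparing each probe set against a shared baseline keeps the analysis to one number per linguistic feature. The same `mean_shift` call could be repeated with arousal or dominance predictors to check whether those dimensions move at all.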
