CL, AI, LG · Sep 9, 2023

Speech Emotion Recognition with Distilled Prosodic and Linguistic Affect Representations

arXiv:2309.04849v2 · 10 citations · h-index: 14
Originality: Incremental advance
AI Analysis

This work addresses speech emotion recognition for applications such as human-computer interaction. It is an incremental advance, building on existing distillation and multimodal techniques.

The paper tackles speech emotion recognition by proposing EmoDistill, a framework that uses cross-modal knowledge distillation to learn prosodic and linguistic representations from speech, achieving state-of-the-art performance with 77.49% unweighted accuracy and 78.91% weighted accuracy on the IEMOCAP benchmark.

We propose EmoDistill, a novel speech emotion recognition (SER) framework that leverages cross-modal knowledge distillation during training to learn strong linguistic and prosodic representations of emotion from speech. During inference, our method uses only a stream of speech signals to perform unimodal SER, thus reducing computational overhead and avoiding run-time transcription and prosodic feature extraction errors. During training, our method distills information at both the embedding and logit levels from a pair of pre-trained Prosodic and Linguistic teachers that are fine-tuned for SER. Experiments on the IEMOCAP benchmark demonstrate that our method outperforms other unimodal and multimodal techniques by a considerable margin, and achieves state-of-the-art performance of 77.49% unweighted accuracy and 78.91% weighted accuracy. Detailed ablation studies demonstrate the impact of each component of our method.
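
The abstract does not give the loss functions, so the following is only a minimal sketch of what distilling at both the logit and embedding levels from two teachers could look like. It assumes a temperature-softened KL divergence for logits and a cosine distance for embeddings, which are common choices in knowledge distillation but are not confirmed as the paper's formulation. All names here (`logit_distillation_loss`, `embedding_distillation_loss`, `two_teacher_loss`, the dictionary keys, and the weights `alpha` and `beta`) are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumption: a standard soft-label distillation loss, not necessarily
    the one used by EmoDistill.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def embedding_distillation_loss(student_emb, teacher_emb):
    """Cosine distance (1 - cosine similarity) between two embeddings.

    Assumption: both embeddings already have the same dimensionality.
    """
    dot = sum(s * t for s, t in zip(student_emb, teacher_emb))
    norm_s = math.sqrt(sum(s * s for s in student_emb))
    norm_t = math.sqrt(sum(t * t for t in teacher_emb))
    return 1.0 - dot / (norm_s * norm_t)


def two_teacher_loss(student, prosodic_teacher, linguistic_teacher,
                     alpha=0.5, beta=0.5):
    """Combine logit- and embedding-level terms from both teachers.

    Hypothetical layout: the student exposes one embedding per teacher
    ('prosodic_emb', 'linguistic_emb') plus shared 'logits'; each teacher
    exposes 'emb' and 'logits'. alpha and beta are illustrative weights.
    """
    logit_term = (
        logit_distillation_loss(student["logits"], prosodic_teacher["logits"])
        + logit_distillation_loss(student["logits"], linguistic_teacher["logits"])
    )
    emb_term = (
        embedding_distillation_loss(student["prosodic_emb"], prosodic_teacher["emb"])
        + embedding_distillation_loss(student["linguistic_emb"], linguistic_teacher["emb"])
    )
    return alpha * logit_term + beta * emb_term
```

In a sketch like this, the teachers are only needed to compute the training loss; at inference the student runs on speech alone, which matches the unimodal inference setup described in the abstract. A supervised classification loss on the emotion labels would typically be added alongside these terms.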
