SD CL AS Sep 29, 2024

Two-stage Framework for Robust Speech Emotion Recognition Using Target Speaker Extraction in Human Speech Noise Conditions

arXiv:2409.19585v2 | 7 citations | h-index: 55
Originality: Incremental advance
AI Analysis

The paper addresses speech emotion recognition in noisy environments, with applications such as human-computer interaction. The advance is incremental because it builds on existing TSE and SER methods.

The paper tackles robust speech emotion recognition when the interfering noise is human speech. It proposes a two-stage framework that cascades target speaker extraction and SER, achieving a 14.33% improvement in unweighted accuracy over a baseline without TSE.

Developing a robust speech emotion recognition (SER) system for noisy conditions is challenging because different noise types have different properties. Most previous studies have not considered the impact of human speech noise, which limits the application scope of SER. In this paper, we propose a novel two-stage framework for this problem that cascades a target speaker extraction (TSE) method and SER. In the first stage, we train a TSE model to extract the speech of the target speaker from a mixture. In the second stage, we use the extracted speech for SER training. Additionally, we explore joint training of the TSE and SER models in the second stage. Our system achieves a 14.33% improvement in unweighted accuracy (UA) over a baseline that does not use TSE, demonstrating the effectiveness of our framework in mitigating the impact of human speech noise. Moreover, we conduct experiments that consider speaker gender, showing that our framework performs particularly well on different-gender mixtures.
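The headline result is reported in unweighted accuracy (UA). As a minimal sketch, assuming the usual SER definition of UA as the mean of per-class recalls, the metric can be computed as follows. The function names and the toy labels are illustrative and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (assumed definition of UA)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = defaultdict(int)  # correctly predicted utterances per true class
    total = defaultdict(int)    # utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def weighted_accuracy(y_true, y_pred):
    """Plain accuracy over all utterances, shown for contrast."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, made-up labels: four neutral utterances and one angry utterance,
# with a classifier that always predicts "neutral".
y_true = ["neutral"] * 4 + ["angry"]
y_pred = ["neutral"] * 5
ua = unweighted_accuracy(y_true, y_pred)
wa = weighted_accuracy(y_true, y_pred)
```

On this toy input the always-neutral classifier scores well on plain accuracy but poorly on UA, because UA gives every emotion class equal weight regardless of how many utterances it has. That property is why UA is commonly preferred for class-imbalanced emotion corpora.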

