AS · AI · LG · Jun 5, 2024

Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture

arXiv:2406.03272v3 · 5 citations
Originality: Incremental advance
AI Analysis

This paper addresses the robustness of emotion recognition systems in adverse acoustic environments, but the advance is incremental because it adapts an existing transformer model to multi-channel inputs.

The study tackled performance degradation of speech emotion recognition in real-life reverberant conditions by processing multi-microphone signals, achieving superior accuracy compared to single-channel baselines.

The performance of most emotion recognition systems degrades in real-life ('in the wild') scenarios where the audio is contaminated by reverberation. Our study explores new methods to alleviate the performance degradation of speech emotion recognition (SER) algorithms and to develop a more robust system for adverse conditions. We propose processing multi-microphone signals to address these challenges and improve emotion classification accuracy. We adopt a state-of-the-art transformer model, the Hierarchical Token-semantic Audio Transformer (HTS-AT), to handle multi-channel audio inputs. We evaluate two strategies: averaging mel-spectrograms across channels and summing patch-embedded representations. Our multi-microphone model achieves superior performance compared to single-channel baselines when tested in real-world reverberant environments.
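
The two channel-fusion strategies named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the array shapes, the 4x4 patch size, and the use of a separate linear projection per channel (a stand-in for a learned patch-embedding layer) are all assumptions made for illustration only.

```python
import numpy as np

def average_channels(mels):
    """Strategy 1 (sketch): average mel-spectrograms over the channel axis.
    mels: (channels, n_mels, frames) -> (n_mels, frames)."""
    return mels.mean(axis=0)

def patchify(spec, patch):
    """Split a (n_mels, frames) spectrogram into non-overlapping
    patch x patch tiles, each flattened: (num_patches, patch * patch)."""
    n_mels, frames = spec.shape
    rows, cols = n_mels // patch, frames // patch
    spec = spec[:rows * patch, :cols * patch]  # drop any ragged edge
    tiles = spec.reshape(rows, patch, cols, patch).transpose(0, 2, 1, 3)
    return tiles.reshape(rows * cols, patch * patch)

def sum_patch_embeddings(mels, weights, patch):
    """Strategy 2 (sketch): embed each channel's patches, then sum the
    embeddings across channels.
    weights: (channels, patch * patch, dim), one hypothetical linear
    projection per channel (an assumption, not taken from the paper)."""
    embedded = [patchify(m, patch) @ w for m, w in zip(mels, weights)]
    return np.sum(embedded, axis=0)

# Toy input: 3 microphones, 64 mel bins, 128 frames (random placeholder data).
rng = np.random.default_rng(0)
mels = rng.standard_normal((3, 64, 128))
weights = rng.standard_normal((3, 16, 96)) * 0.02

avg = average_channels(mels)                           # shape (64, 128)
tokens = sum_patch_embeddings(mels, weights, patch=4)  # shape (512, 96)
```

In both cases the fused result has the shape of a single-channel input, so under these assumptions the rest of a single-channel transformer could stay unchanged. Averaging fuses before any learned layer, whereas summing embeddings lets each channel pass through a projection first.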

