AS AI LG

Jun 5, 2024

Multi-Microphone Speech Emotion Recognition using the Hierarchical Token-semantic Audio Transformer Architecture

Ohad Cohen, Gershon Hazan, Sharon Gannot

arXiv:2406.03272v3

2.35 citations

Originality: Incremental advance

AI Analysis

This work addresses the robustness of emotion recognition systems in adverse acoustic environments, but the advance is incremental: it adapts an existing transformer model, the HTS-AT, to multi-channel inputs.

The study tackled performance degradation of speech emotion recognition in real-life reverberant conditions by processing multi-microphone signals, achieving superior accuracy compared to single-channel baselines.

The performance of most emotion recognition systems degrades in real-life situations ('in the wild' scenarios) where the audio is contaminated by reverberation. Our study explores new methods to alleviate the performance degradation of speech emotion recognition (SER) algorithms and to develop a more robust system for adverse conditions. We propose processing multi-microphone signals to address these challenges and improve emotion classification accuracy. We adapt a state-of-the-art transformer model, the HTS-AT, to handle multi-channel audio inputs. We evaluate two strategies: averaging mel-spectrograms across channels and summing patch-embedded representations. Our multi-microphone model achieves superior performance compared to single-channel baselines when tested in real-world reverberant environments.
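The two channel-fusion strategies named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the non-overlapping square patches, and the per-channel linear projection `weights` are all assumptions made for illustration; the abstract does not say whether the patch embedding is shared across microphones, and the actual HTS-AT patch embedding is not reproduced here.

```python
import numpy as np


def average_mel_across_channels(mels):
    """Strategy 1 (sketch): average the mel-spectrograms over microphones.

    mels: array of shape (channels, n_mels, frames).
    Returns a single-channel spectrogram of shape (n_mels, frames).
    """
    return mels.mean(axis=0)


def patch_embed(mel, weight, patch):
    """Hypothetical patch embedding: cut the spectrogram into non-overlapping
    patch x patch tiles and project each tile with a linear map.

    mel: (n_mels, frames); weight: (patch * patch, dim).
    Returns (num_patches, dim).
    """
    n_mels, frames = mel.shape
    rows, cols = n_mels // patch, frames // patch
    # Drop any remainder that does not fill a whole tile.
    tiles = mel[: rows * patch, : cols * patch].reshape(rows, patch, cols, patch)
    tiles = tiles.transpose(0, 2, 1, 3).reshape(rows * cols, patch * patch)
    return tiles @ weight


def sum_patch_embeddings(mels, weights, patch):
    """Strategy 2 (sketch): embed each microphone separately, then sum.

    mels: (channels, n_mels, frames).
    weights: (channels, patch * patch, dim), one assumed projection per channel.
    Returns (num_patches, dim).
    """
    return sum(patch_embed(m, w, patch) for m, w in zip(mels, weights))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mels = rng.standard_normal((3, 8, 12))      # 3 mics, 8 mel bins, 12 frames
    weights = rng.standard_normal((3, 16, 5))   # 4x4 patches, 5-dim embedding
    print(average_mel_across_channels(mels).shape)
    print(sum_patch_embeddings(mels, weights, patch=4).shape)
```

In this sketch, if every channel used the same bias-free linear projection, summing the patch embeddings would equal the embedding of the averaged spectrogram scaled by the number of channels. The two strategies differ only when the embedding is channel-specific or nonlinear, which is why per-channel weights are assumed above.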
