SD | LG | AS | Sep 14, 2024

Multi-Microphone and Multi-Modal Emotion Recognition in Reverberant Environment

arXiv:2409.09545v3 | 2 citations | h-index: 5
Originality: Incremental advance
AI Analysis

This work aims to improve emotion recognition for applications in acoustically challenging settings, but the advance is incremental because it combines existing methods.

The paper tackles emotion recognition in reverberant environments by integrating multi-channel audio and video modalities, and it reports better performance than uni-modal and single-microphone approaches.

This paper presents a Multi-modal Emotion Recognition (MER) system designed to enhance emotion recognition accuracy in challenging acoustic conditions. Our approach combines a modified and extended Hierarchical Token-semantic Audio Transformer (HTS-AT) for multi-channel audio processing with an R(2+1)D Convolutional Neural Network (CNN) model for video analysis. We evaluate our proposed method on a reverberated version of the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset using synthetic and real-world Room Impulse Responses (RIRs). Our results demonstrate that integrating audio and video modalities yields superior performance compared to uni-modal approaches, especially in challenging acoustic conditions. Moreover, we show that the multimodal (audiovisual) approach that utilizes multiple microphones outperforms its single-microphone counterpart.
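The abstract mentions reverberating a clean dataset with RIRs. The general idea can be sketched as convolving each clean utterance with one RIR per microphone. The code below is a minimal illustration, not the authors' pipeline: the helper names `synthetic_rir` and `reverberate`, the three-microphone setup, the 0.5 s reverberation time, and the toy decaying-noise RIR model are all assumptions made for this example.

```python
import numpy as np


def synthetic_rir(rt60_s, fs, rng):
    """Toy RIR (assumption): white noise under an exponential decay envelope."""
    n = int(fs * rt60_s)
    t = np.arange(n) / fs
    # The envelope drops by 60 dB (a factor of 10**-3) over rt60_s seconds.
    envelope = 10.0 ** (-3.0 * t / rt60_s)
    rir = rng.standard_normal(n) * envelope
    # Scale so the largest absolute tap equals 1.
    return rir / np.max(np.abs(rir))


def reverberate(clean, rirs):
    """Convolve a mono signal with one RIR per microphone.

    Returns an array of shape (n_mics, len(clean)); the reverberant tail
    beyond the original length is truncated.
    """
    return np.stack([np.convolve(clean, h)[: len(clean)] for h in rirs])


fs = 16000  # sampling rate in Hz (assumed)
rng = np.random.default_rng(0)
clean = rng.standard_normal(fs)  # 1 s of noise standing in for a speech clip
rirs = [synthetic_rir(0.5, fs, rng) for _ in range(3)]  # 3 hypothetical mics
multi_channel = reverberate(clean, rirs)
print(multi_channel.shape)
```

In practice, measured RIRs or a room simulator would replace the toy `synthetic_rir`, and the resulting multi-channel array would be the input to the audio branch.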

