CV SD AS · Aug 25, 2022

Interpretable Multimodal Emotion Recognition using Hybrid Fusion of Speech and Image Data

Puneet Kumar, Sarthak Malik, Balasubramanian Raman

arXiv:2208.11868v2 · 4.845 citations · h-index: 43 · Has Code

Originality: Incremental advance

AI Analysis

The paper addresses emotion recognition for applications such as human-computer interaction, but the advance is incremental because it builds on existing multimodal fusion methods.

The paper tackles multimodal emotion recognition by developing a hybrid fusion system for speech and image data, achieving 83.29% accuracy on a new dataset.

This paper proposes a multimodal emotion recognition system based on hybrid fusion that classifies the emotions depicted by speech utterances and corresponding images into discrete classes. A new interpretability technique has been developed to identify the important speech & image features leading to the prediction of particular emotion classes. The proposed system's architecture has been determined through intensive ablation studies. It fuses the speech & image features and then combines speech, image, and intermediate fusion outputs. The proposed interpretability technique incorporates the divide & conquer approach to compute Shapley values denoting each speech & image feature's importance. We have also constructed a large-scale dataset (IIT-R SIER dataset), consisting of speech utterances, corresponding images, and class labels, i.e., 'anger,' 'happy,' 'hate,' and 'sad.' The proposed system has achieved 83.29% accuracy for emotion recognition. The enhanced performance of the proposed system advocates the importance of utilizing complementary information from multiple modalities for emotion recognition.
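To make the idea of a Shapley value concrete, the following is a minimal sketch of the textbook exact computation over a handful of feature groups. It is not the authors' divide & conquer procedure, which the abstract does not detail. The feature names (`speech_low`, `image_left`, and so on) and the `toy_score` function are hypothetical stand-ins for a real model's class score.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a set function `value` over `players`."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of any coalition of size k that excludes p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when it joins coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_score(coalition):
    """Hypothetical class score: additive terms plus one cross-modal interaction."""
    base = {
        "speech_low": 0.10,
        "speech_high": 0.30,
        "image_left": 0.20,
        "image_right": 0.05,
    }
    score = sum(base[f] for f in coalition)
    # Assumed interaction between one speech band and one image region.
    if {"speech_high", "image_left"} <= coalition:
        score += 0.10
    return score


features = ["speech_low", "speech_high", "image_left", "image_right"]
phi = shapley_values(features, toy_score)
for name in features:
    print(f"{name}: {phi[name]:.3f}")
```

In this toy setup the interaction bonus is split evenly between the two features that produce it, and the attributions sum to the score of the full feature set. The exact computation enumerates every coalition, so its cost grows exponentially with the number of features; practical methods for high-dimensional speech and image inputs therefore group or approximate features.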
