Multimodal Emotion Recognition and Sentiment Analysis in Multi-Party Conversation Contexts
This paper addresses emotion and sentiment analysis for conversational AI applications, but its contribution is incremental because it combines existing methods.
The paper tackles emotion recognition and sentiment analysis in multi-party conversations by integrating text, speech, facial, and video modalities, achieving accuracies of 66.36% and 72.15%, respectively.
Emotion recognition and sentiment analysis are pivotal tasks in speech and language processing, particularly in real-world scenarios involving multi-party conversational data. This paper presents a multimodal approach to tackle these challenges on a well-known dataset. We propose a system that integrates four key modalities: pre-trained RoBERTa for text, pre-trained Wav2Vec2 for speech, a proposed FacialNet for facial expressions, and a CNN+Transformer architecture trained from scratch for video analysis. Feature embeddings from each modality are concatenated to form a multimodal vector, which is then used to predict emotion and sentiment labels. The multimodal system outperforms unimodal approaches, achieving an accuracy of 66.36% for emotion recognition and 72.15% for sentiment analysis.
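The fusion step described above (concatenating per-modality embeddings into one vector that feeds two predictors) can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding sizes, the label counts, the linear heads, and all names (`EMBED_DIMS`, `fuse`, `make_linear_head`, `predict`) are hypothetical placeholders, and random vectors stand in for real encoder outputs.

```python
import random

# Hypothetical embedding sizes per modality; the abstract does not state them.
EMBED_DIMS = {"text": 8, "speech": 6, "face": 4, "video": 4}
# Hypothetical label counts; placeholders, not taken from the paper.
NUM_EMOTIONS = 7
NUM_SENTIMENTS = 3


def fuse(embeddings):
    """Concatenate per-modality embeddings in a fixed modality order."""
    fused = []
    for name, dim in EMBED_DIMS.items():
        vec = embeddings[name]
        if len(vec) != dim:
            raise ValueError(f"{name}: expected {dim} values, got {len(vec)}")
        fused.extend(vec)
    return fused


def make_linear_head(in_dim, out_dim, rng):
    """Create small random weights and zero biases for a linear classifier head."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    biases = [0.0] * out_dim
    return weights, biases


def predict(head, features):
    """Return the index of the highest-scoring class for one feature vector."""
    weights, biases = head
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)
fused_dim = sum(EMBED_DIMS.values())

# Two separate heads share the same fused vector (an assumed design choice).
emotion_head = make_linear_head(fused_dim, NUM_EMOTIONS, rng)
sentiment_head = make_linear_head(fused_dim, NUM_SENTIMENTS, rng)

# Stand-in embeddings for one utterance; real ones would come from the encoders.
utterance = {
    name: [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for name, dim in EMBED_DIMS.items()
}

fused = fuse(utterance)
emotion_id = predict(emotion_head, fused)
sentiment_id = predict(sentiment_head, fused)
```

In this sketch the heads are untrained, so the predicted indices carry no meaning; the point is only the data flow, in which every modality contributes a fixed slice of the fused vector and both tasks read from that same vector.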