MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition
This work addresses the high annotation cost and label ambiguity that limit multimodal emotion recognition, and its approach could improve efficiency and accuracy in applications such as human-computer interaction. The contribution appears incremental, however, since it builds on existing pre-training and prompt-based techniques.
The paper tackles the shortage of labeled data for multimodal emotion recognition by proposing MEmoBERT, a model pre-trained with self-supervised learning on large-scale unlabeled video data. It also introduces a prompt-based method that reformulates emotion classification as masked text prediction. The authors report significant performance gains on the IEMOCAP and MSP-IMPROV benchmark datasets.
Research on multimodal emotion recognition is hindered by the lack of labeled corpora of sufficient scale and diversity, owing to high annotation cost and label ambiguity. In this paper, we propose MEmoBERT, a pre-training model for multimodal emotion recognition that learns joint multimodal representations through self-supervised learning on large-scale unlabeled video data. Furthermore, unlike the conventional "pre-train, fine-tune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction task, bringing the downstream task closer to the pre-training objective. Extensive experiments on two benchmark datasets, IEMOCAP and MSP-IMPROV, show that the proposed MEmoBERT significantly enhances emotion recognition performance.
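The reformulation of classification as masked text prediction can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and is not taken from the abstract: the prompt template "I am [MASK].", the label-word mapping, the function names, and the toy logits that stand in for a real model's output at the masked position. The idea shown is only that the predicted emotion is the class whose label word scores highest at [MASK], with all other vocabulary words ignored.

```python
from typing import Dict

# Hypothetical label words: each emotion class maps to one vocabulary token.
LABEL_WORDS: Dict[str, str] = {
    "happy": "happy",
    "angry": "angry",
    "sad": "sad",
    "neutral": "neutral",
}


def build_prompt(utterance: str, template: str = "I am [MASK].") -> str:
    """Append an assumed prompt template containing a [MASK] slot to the utterance."""
    return f"{utterance} {template}"


def classify_from_mask_logits(mask_logits: Dict[str, float]) -> str:
    """Return the emotion whose label word has the highest score at [MASK].

    mask_logits maps vocabulary tokens to scores at the masked position.
    Tokens that are not label words are ignored.
    """
    scores = {emotion: mask_logits[word] for emotion, word in LABEL_WORDS.items()}
    return max(scores, key=scores.get)


# Toy scores standing in for a pre-trained model's output (illustrative only).
toy_logits = {"happy": 0.3, "angry": 2.1, "sad": -0.5, "neutral": 1.0, "the": 3.0}

prompt = build_prompt("Why did you do that?")
prediction = classify_from_mask_logits(toy_logits)
print(prompt)
print(prediction)
```

In this sketch the token "the" has the highest raw score but is not a label word, so it does not affect the prediction. Restricting the comparison to label words is what turns masked text prediction into a classifier over the emotion classes.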