MFSN: Multi-perspective Fusion Search Network For Pre-training Knowledge in Speech Emotion Recognition
This work aims to improve speech emotion recognition for human-computer interaction by making better use of pre-trained knowledge. The contribution appears incremental: it builds on existing methods and adds specific architectural innovations.
The paper tackles Speech Emotion Recognition (SER) by proposing a Multi-perspective Fusion Search Network (MFSN) that partitions speech knowledge into textual and acoustic perspectives and uses architecture search to leverage both, reporting superior results on multiple datasets.
Speech Emotion Recognition (SER) is an important research topic in human-computer interaction. Many recent works focus on directly extracting emotional cues through pre-trained knowledge, frequently overlooking considerations of appropriateness and comprehensiveness. Therefore, we propose a novel framework for pre-training knowledge in SER, called Multi-perspective Fusion Search Network (MFSN). For comprehensiveness, we partition speech knowledge into Textual-related Emotional Content (TEC) and Speech-related Emotional Content (SEC), capturing cues from both semantic and acoustic perspectives, and we design a new architecture search space to fully leverage them. For appropriateness, we verify the efficacy of different modeling approaches in capturing SEC, filling a gap in current research. Experimental results on multiple datasets demonstrate the superiority of MFSN.
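The abstract does not specify the search space, so the following is only a minimal sketch of how an architecture search over fusion operations might combine TEC and SEC features. It assumes a generic softmax relaxation over candidate operations, as used in differentiable architecture search, and is not the authors' method. The candidate operations (`fuse_sum`, `fuse_product`, `fuse_max`), the `arch_logits` parameters, and the toy feature vectors are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate fusion operations (not taken from the paper).
def fuse_sum(tec, sec):
    return [t + s for t, s in zip(tec, sec)]


def fuse_product(tec, sec):
    return [t * s for t, s in zip(tec, sec)]


def fuse_max(tec, sec):
    return [max(t, s) for t, s in zip(tec, sec)]


CANDIDATE_OPS = [fuse_sum, fuse_product, fuse_max]


def mixed_fusion(tec, sec, arch_logits):
    """Weighted mixture of all candidate fusions (continuous relaxation).

    During search, arch_logits would be learned alongside the model weights;
    here they are plain numbers supplied by the caller.
    """
    weights = softmax(arch_logits)
    outputs = [op(tec, sec) for op in CANDIDATE_OPS]
    return [
        sum(w * out[i] for w, out in zip(weights, outputs))
        for i in range(len(tec))
    ]


def select_op(arch_logits):
    """After search, keep only the candidate with the largest logit."""
    best = max(range(len(arch_logits)), key=lambda i: arch_logits[i])
    return CANDIDATE_OPS[best].__name__


# Toy textual (TEC) and acoustic (SEC) feature vectors, assumed for illustration.
tec_features = [1.0, 2.0]
sec_features = [3.0, 4.0]
fused = mixed_fusion(tec_features, sec_features, [0.0, 0.0, 0.0])
chosen = select_op([0.1, 2.0, -1.0])
```

With equal logits, every candidate contributes equally to the fused vector. Once the logits diverge, `select_op` reduces the mixture to a single discrete choice, which is the usual final step in this style of search.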