Construction and Evaluation of Mandarin Multimodal Emotional Speech Database
This work provides a new database for Mandarin multimodal emotional speech analysis. The contribution is incremental: it extends existing resources to a specific language and modality set.
The researchers constructed a Mandarin multimodal emotional speech database with articulatory kinematics, acoustics, glottal, and facial micro-expression data, labeled with discrete and dimensional emotion categories. They validated the database by achieving an 82% average recognition rate for seven emotions using acoustic data alone, with lower rates for glottal (72%) and kinematics (55.7%) data.
A Mandarin multimodal emotional speech database covering articulatory kinematics, acoustics, glottal signals, and facial micro-expressions is designed and established. It is described in detail in terms of corpus design, subject selection, recording details, and data processing. The signals are labeled with discrete emotion labels (neutral, happy, pleasant, indifferent, angry, sad, grief) and dimensional emotion labels (pleasure, arousal, dominance; PAD). The validity of the dimensional annotation is verified by statistical analysis of the annotation data. The annotators' SCL-90 scale data are also verified and analyzed together with the PAD annotation data to explore the relationship between outliers in the annotation and the annotators' psychological state. To verify the speech quality and emotion discriminability of the database, three basic models (SVM, CNN, and DNN) are used to compute the recognition rate for the seven emotions. The average recognition rate over the seven emotions is about 82% using acoustic data alone, about 72% using glottal data alone, and 55.7% using kinematics data alone. These results indicate that the database is of high quality and can serve as an important resource for speech analysis research, especially multimodal emotional speech analysis.
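The abstract reports an "average recognition rate" over the seven emotions but does not state how the average is formed. The sketch below assumes it is the unweighted mean of the per-emotion recognition rates (the fraction of each emotion's samples that a classifier labels correctly). The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def per_emotion_recognition_rate(true_labels, predicted_labels):
    """Return {emotion: fraction of that emotion's samples predicted correctly}."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    totals = defaultdict(int)
    correct = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {emotion: correct[emotion] / totals[emotion] for emotion in totals}


def average_recognition_rate(true_labels, predicted_labels):
    """Unweighted mean of the per-emotion rates (an assumption, see lead-in)."""
    rates = per_emotion_recognition_rate(true_labels, predicted_labels)
    return sum(rates.values()) / len(rates)


# Toy data for illustration only; these are not results from the database.
true_labels = ["neutral", "neutral", "happy", "happy", "angry", "angry", "sad", "sad"]
predicted_labels = ["neutral", "happy", "happy", "happy", "angry", "sad", "sad", "sad"]

rates = per_emotion_recognition_rate(true_labels, predicted_labels)
average = average_recognition_rate(true_labels, predicted_labels)
print(rates)
print(average)
```

Under this assumption each emotion contributes equally to the average regardless of how many utterances it has. If the paper instead reports overall accuracy, the two figures differ whenever the emotion classes are unbalanced.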