Human Feedback Driven Dynamic Speech Emotion Recognition
This work addresses the problem of animating emotional 3D avatars through more accurate, time-varying emotion recognition, though the contribution appears incremental because it builds on existing methods.
The paper tackles dynamic speech emotion recognition by modeling the sequence of emotions active over time in an audio track. It models emotional mixtures with a Dirichlet-based approach and refines the model with human feedback, which improves model quality and simplifies annotation.
This work explores a new area: dynamic speech emotion recognition. Unlike traditional methods, which assign a single emotion to an entire recording, we assume that each audio track is associated with a sequence of emotions active at different moments in time. The study focuses in particular on the animation of emotional 3D avatars. We propose a multi-stage method that comprises training a classical speech emotion recognition model, synthetically generating emotional sequences, and further improving the model based on human feedback. Additionally, we introduce a novel approach to modeling emotional mixtures based on the Dirichlet distribution. The models are evaluated against ground-truth emotions extracted from a dataset of 3D facial animations, and we compare them with the sliding window approach. Our experimental results show the effectiveness of the Dirichlet-based approach in modeling emotional mixtures. Incorporating human feedback further improves model quality while providing a simplified annotation procedure.
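The two ideas the abstract contrasts can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the Dirichlet distribution enters the model, so the sketch only assumes that an emotional mixture is a point on the probability simplex scored under a Dirichlet with concentration parameters `alpha`, and that the sliding window approach applies a clip-level classifier to overlapping windows. The names `dirichlet_log_pdf`, `dirichlet_mean`, `sliding_window_predictions`, and the `classify` callback are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def dirichlet_log_pdf(mixture: Sequence[float], alpha: Sequence[float]) -> float:
    """Log-density of Dirichlet(alpha) at an emotion mixture on the simplex."""
    if len(mixture) != len(alpha):
        raise ValueError("mixture and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("all concentration parameters must be positive")
    if any(p <= 0 for p in mixture) or abs(sum(mixture) - 1.0) > 1e-6:
        raise ValueError("mixture must be strictly positive and sum to 1")
    # log of the normalizing constant: log Gamma(sum alpha) - sum log Gamma(alpha_i)
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    # log of the unnormalized density: sum (alpha_i - 1) * log p_i
    return log_norm + sum((a - 1.0) * math.log(p) for a, p in zip(alpha, mixture))


def dirichlet_mean(alpha: Sequence[float]) -> List[float]:
    """Expected emotion mixture under Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


def sliding_window_predictions(
    samples: Sequence[float],
    window: int,
    hop: int,
    classify: Callable[[Sequence[float]], str],
) -> List[str]:
    """Assumed baseline: run a clip-level classifier on overlapping windows."""
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    last_start = max(len(samples) - window, 0)
    return [
        classify(samples[start:start + window])
        for start in range(0, last_start + 1, hop)
    ]


if __name__ == "__main__":
    # Hypothetical three-emotion setup: (happy, sad, neutral).
    alpha = [2.0, 1.0, 1.0]
    print(dirichlet_mean(alpha))
    print(dirichlet_log_pdf([0.6, 0.2, 0.2], alpha))

    # Toy stand-in for a trained model: label a window by its mean amplitude.
    toy_classifier = lambda w: "happy" if sum(w) / len(w) > 0.5 else "neutral"
    signal = [0.1, 0.2, 0.9, 0.8, 0.7, 0.9, 0.2, 0.1, 0.1, 0.0]
    print(sliding_window_predictions(signal, window=4, hop=2, classify=toy_classifier))
```

In a learned model, the negative of `dirichlet_log_pdf` could serve as a training loss when the network outputs `alpha` for each moment in time. Whether the paper trains its model this way is not stated in the abstract.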