Estimating the Uncertainty in Emotion Class Labels with Utterance-Specific Dirichlet Priors
This work addresses the ambiguity of emotion labels for AI systems that interact with humans, though it is incremental in that it builds on existing methods for uncertainty modeling.
The paper tackles label uncertainty in emotion recognition by proposing a Bayesian training loss with utterance-specific Dirichlet priors. It achieves state-of-the-art classification results on the IEMOCAP dataset and detects high-uncertainty utterances effectively, as measured by the area under the precision-recall curve.
Emotion recognition is a key attribute for artificial intelligence systems that need to naturally interact with humans. However, the task definition is still an open problem due to the inherent ambiguity of emotions. In this paper, a novel Bayesian training loss based on per-utterance Dirichlet prior distributions is proposed for verbal emotion recognition, which models the uncertainty in one-hot labels created when human annotators assign the same utterance to different emotion classes. An additional metric is used to evaluate performance by detecting test utterances with high labelling uncertainty. This removes a major limitation of emotion classification systems, which only consider utterances whose labels reflect a majority agreement among annotators on the emotion class. Furthermore, a frequentist approach is studied to leverage the continuous-valued "soft" labels obtained by averaging the one-hot labels. We propose a two-branch model structure for emotion classification on a per-utterance basis, which achieves state-of-the-art classification results on the widely used IEMOCAP dataset. Based on this, uncertainty estimation experiments were performed. The best performance in terms of the area under the precision-recall curve when detecting utterances with high uncertainty was achieved by interpolating the Bayesian training loss with the Kullback-Leibler divergence training loss for the soft labels. The generality of the proposed approach was verified using the MSP-Podcast dataset, which yielded the same pattern of results.
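To make the loss terms described above concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes that the model outputs Dirichlet concentration parameters `alpha` for an utterance, that the Bayesian term is the negative log-likelihood of the individual annotator votes under that Dirichlet (each vote for class k having marginal probability alpha_k / sum(alpha)), and that the two terms are combined with a single interpolation weight. The function names, the `weight` parameter, and the example values are all hypothetical.

```python
import math


def soft_label(annotations, num_classes):
    """Average one-hot annotator votes (given as class indices) into a soft label."""
    counts = [0] * num_classes
    for k in annotations:
        counts[k] += 1
    total = len(annotations)
    return [c / total for c in counts]


def kl_loss(soft, probs, eps=1e-12):
    """KL(soft || probs), treating 0 * log(0) as 0."""
    return sum(
        s * math.log(s / max(p, eps))
        for s, p in zip(soft, probs)
        if s > 0
    )


def dirichlet_nll(annotations, alpha):
    """Assumed Bayesian term: negative log-likelihood of each annotator vote
    under Dir(alpha), where a vote for class k has probability alpha[k] / sum(alpha)."""
    alpha0 = sum(alpha)
    return -sum(math.log(alpha[k] / alpha0) for k in annotations)


def combined_loss(annotations, alpha, weight=0.5):
    """Interpolate the assumed Bayesian term with the KL term on the soft label.
    `weight` is a hypothetical interpolation coefficient in [0, 1]."""
    alpha0 = sum(alpha)
    probs = [a / alpha0 for a in alpha]  # expected class probabilities
    soft = soft_label(annotations, len(alpha))
    return (weight * dirichlet_nll(annotations, alpha)
            + (1.0 - weight) * kl_loss(soft, probs))


# Hypothetical utterance: three annotators, four emotion classes,
# two votes for class 0 and one vote for class 2.
votes = [0, 0, 2]
alpha = [4.0, 1.0, 2.0, 1.0]  # made-up model output
loss = combined_loss(votes, alpha, weight=0.5)
```

Keeping every annotator vote, rather than only the majority class, is what lets an utterance without majority agreement still contribute to training in this sketch.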