Investigating the Impact of Word Informativeness on Speech Emotion Recognition
This work addresses a key challenge in speech emotion recognition for applications such as human-computer interaction. The contribution is incremental: it builds on existing methods and adds a novel segment selection approach.
The paper tackles the problem of identifying speech segments that carry relevant acoustic variations for emotion recognition. It uses word informativeness from a pre-trained language model to select semantically important segments, and computing features on those segments yields a notable improvement in recognition performance.
In emotion recognition from speech, a key challenge lies in identifying speech signal segments that carry the most relevant acoustic variations for discerning specific emotions. Traditional approaches compute functionals for features such as energy and F0 over entire sentences or longer speech portions, potentially missing essential fine-grained variation in the long-form statistics. This research investigates the use of word informativeness, derived from a pre-trained language model, to identify semantically important segments. Acoustic features are then computed exclusively for these identified segments, enhancing emotion recognition accuracy. The methodology utilizes standard acoustic prosodic features, their functionals, and self-supervised representations. Results indicate a notable improvement in recognition performance when features are computed on segments selected based on word informativeness, underscoring the effectiveness of this approach.
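The segment selection idea described above can be illustrated with a minimal sketch. It is not the paper's implementation, and several details are assumptions made here for illustration: informativeness is approximated by surprisal, -log2 p(word | context); the per-word probabilities are made-up stand-ins for the output of a pre-trained language model; the top-k selection rule, the 100 Hz frame rate, and the helper names (`surprisal`, `select_informative_words`, `segment_functionals`) are all hypothetical.

```python
import math
from statistics import mean, pstdev

FRAME_RATE_HZ = 100  # assumption: one acoustic frame every 10 ms


def surprisal(prob):
    """Informativeness proxy (assumed here): -log2 p(word | context), in bits."""
    return -math.log2(prob)


def select_informative_words(words, top_k):
    """Return the top_k highest-surprisal words, restored to time order."""
    ranked = sorted(words, key=lambda w: surprisal(w["prob"]), reverse=True)
    return sorted(ranked[:top_k], key=lambda w: w["start"])


def segment_functionals(contour, segments):
    """Compute functionals over only the frames that fall inside the segments."""
    frames = []
    for seg in segments:
        lo = int(round(seg["start"] * FRAME_RATE_HZ))
        hi = int(round(seg["end"] * FRAME_RATE_HZ))
        frames.extend(contour[lo:hi])
    return {
        "mean": mean(frames),
        "std": pstdev(frames),
        "min": min(frames),
        "max": max(frames),
        "n_frames": len(frames),
    }


# Toy utterance: word timings in seconds. The probabilities are invented
# placeholders for what a pre-trained language model would assign.
words = [
    {"text": "i", "start": 0.0, "end": 0.1, "prob": 0.5},
    {"text": "am", "start": 0.1, "end": 0.2, "prob": 0.5},
    {"text": "absolutely", "start": 0.2, "end": 0.5, "prob": 0.03125},
    {"text": "furious", "start": 0.5, "end": 0.8, "prob": 0.015625},
]

# Toy F0 contour in Hz: 80 frames covering the 0.8 s utterance.
f0 = [120.0] * 20 + [180.0] * 30 + [220.0] * 30

# Baseline: functionals over the whole sentence.
whole = segment_functionals(f0, [{"start": 0.0, "end": 0.8}])

# Proposed idea: functionals over the most informative words only.
selected = select_informative_words(words, top_k=2)
focused = segment_functionals(f0, selected)

print([w["text"] for w in selected])
print("whole-sentence mean F0:", whole["mean"])
print("selected-segment mean F0:", focused["mean"])
```

In this toy case the two function words are excluded, so the functionals summarize only the frames of the two high-surprisal words rather than being averaged across the full sentence. The same pattern would apply to energy contours or to pooled self-supervised representations, with the language model and forced alignment supplying the real probabilities and word boundaries.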