ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition
This work addresses the challenge of relying on human-annotated text in SER by exploring ASR adaptation to emotional speech, which is an important step toward improving real-world SER systems.
The study investigated the mutual impact of Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER). It analyzed ASR performance on emotional speech across four systems and three corpora, and conducted text-based SER on ASR transcripts with varying error rates, aiming to uncover the relationship between the two tasks for practical applications.
In Speech Emotion Recognition (SER), textual data is often used alongside audio signals to address their inherent variability. However, the reliance on human-annotated text in most research hinders the development of practical SER systems. To overcome this challenge, we investigate how Automatic Speech Recognition (ASR) performs on emotional speech by analyzing ASR performance on emotion corpora and examining the distribution of word errors and confidence scores in ASR transcripts, which gives insight into how emotion affects ASR. We utilize four ASR systems, namely Kaldi ASR, wav2vec2, Conformer, and Whisper, and three corpora, namely IEMOCAP, MOSI, and MELD, to ensure generalizability. Additionally, we conduct text-based SER on ASR transcripts with increasing word error rates to investigate how ASR affects SER. The objective of this study is to uncover the relationship and mutual impact of ASR and SER, in order to facilitate ASR adaptation to emotional speech and the use of SER in the real world.
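For readers unfamiliar with the metric, the word error rate mentioned above is conventionally the number of word substitutions, deletions, and insertions needed to turn the ASR transcript into the reference, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring setup used in this study; the function name `word_error_rate`, the lowercasing and whitespace tokenization, and the example sentences are all illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Assumption: text is lowercased and split on whitespace; real evaluations
    usually apply more careful normalization (punctuation, numerals, etc.).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical reference/ASR pair: one substitution and one deletion
# against a five-word reference.
example_wer = word_error_rate("i am so happy today", "i am so unhappy")
```

In this hypothetical pair, the substituted word is the one carrying the emotional content, which is the kind of word-level effect the analysis of error distributions is concerned with.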