Are Paralinguistic Representations all that is needed for Speech Emotion Recognition?
This work addresses the gap in assessing paralinguistic representations for speech emotion recognition in non-English languages, though its contribution is incremental, since it focuses on benchmarking existing models.
The paper evaluates paralinguistic pre-trained model representations for speech emotion recognition across multiple languages and finds that TRILLsson representations perform best, which it attributes to their effectiveness in capturing pitch and tone.
The availability of representations from pre-trained models (PTMs) has facilitated substantial progress in speech emotion recognition (SER). In particular, representations from PTMs trained for paralinguistic speech processing have shown state-of-the-art (SOTA) performance for SER. However, such paralinguistic PTM representations have not been evaluated for SER in linguistic environments other than English. Nor have they been investigated for SER in benchmarks such as SUPERB, EMO-SUPERB, and ML-SUPERB. This makes it difficult to assess the efficacy of paralinguistic PTM representations for SER in multiple languages. To fill this gap, we perform a comprehensive comparative study of five SOTA PTM representations. Our results show that paralinguistic PTM (TRILLsson) representations perform the best, and this performance can be attributed to their capturing pitch, tone, and other speech characteristics more effectively than other PTM representations.