Universal Paralinguistic Speech Representations Using Self-Supervised Conformers
This work addresses the need for robust paralinguistic understanding in speech applications. Its universal representation is broadly applicable, although the method itself is incremental.
The paper tackles the problem of extracting paralinguistic information from speech, such as emotion and speaker traits, by introducing a self-supervised Conformer-based model with 600M+ parameters. Linear classifiers trained on its time-averaged representations outperform nearly all previous results, sometimes by large margins. The paper also shows that 2-second context windows achieve 96% of full-context performance on 7 of 9 tasks, and that a single universal representation reaches near-optimal performance on all tasks.
Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. In this work, we introduce a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture. We benchmark on a diverse set of speech tasks and demonstrate that simple linear classifiers trained on top of our time-averaged representation outperform nearly all previous results, in some cases by large margins. Our analyses of context-window size demonstrate that, surprisingly, 2-second context windows achieve 96% of the performance of the Conformers that use the full long-term context on 7 out of 9 tasks. Furthermore, while the best per-task representations are extracted internally in the network, stable performance across several layers allows a single universal representation to reach near-optimal performance on all tasks.
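The evaluation recipe described above (average frame-level embeddings over time, then fit a simple linear classifier) can be illustrated with a minimal sketch. This is not the authors' code: the embeddings here are synthetic random arrays standing in for Conformer layer outputs, and the helper names (`time_average`, `train_linear_probe`, `make_synthetic_clips`), the two-class setup, and the hyperparameters are all assumptions chosen for illustration.

```python
import numpy as np


def time_average(frames: np.ndarray) -> np.ndarray:
    """Collapse a (num_frames, dim) array of frame embeddings into one (dim,) vector."""
    return frames.mean(axis=0)


def train_linear_probe(X, y, num_classes, lr=0.1, steps=300, l2=1e-3):
    """Fit a multinomial logistic regression (a linear classifier) by gradient descent."""
    n, d = X.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot targets
    for _ in range(steps):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - Y) / n  # gradient of the mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ grad + l2 * W)
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(X, W, b):
    """Return the predicted class index for each row of X."""
    return (X @ W + b).argmax(axis=1)


def make_synthetic_clips(num_clips=200, dim=16, seed=0):
    """Hypothetical stand-in for model outputs: variable-length clips of noisy frames."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=num_clips)
    class_means = np.stack([-np.ones(dim), np.ones(dim)])
    clips = []
    for label in labels:
        num_frames = rng.integers(20, 60)  # clips differ in length
        noise = rng.normal(scale=2.0, size=(num_frames, dim))
        clips.append(class_means[label] + noise)
    return clips, labels


clips, labels = make_synthetic_clips()
# One fixed-size vector per clip, regardless of clip length.
X = np.stack([time_average(clip) for clip in clips])
W, b = train_linear_probe(X[:150], labels[:150], num_classes=2)
accuracy = (predict(X[150:], W, b) == labels[150:]).mean()
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

Time-averaging gives every clip a fixed-size vector regardless of its duration, which is what lets a single linear layer serve as the classifier. In practice the synthetic frames would be replaced by activations taken from a chosen layer of the pretrained network.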