Exploring speaker enrolment for few-shot personalisation in emotional vocalisation prediction
This work addresses speaker-specific emotional vocalisation prediction, offering an incremental improvement relevant to applications such as human-computer interaction.
The paper personalises emotional vocalisation prediction for individual speakers using two unlabelled samples of the target speaker, achieving a 2.5% relative improvement in Concordance Correlation Coefficient (CCC), from 0.634 to 0.650, on the ExVo Few-Shot dev set.
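For readers unfamiliar with the metric, the following is a minimal sketch of the standard CCC definition, together with the arithmetic behind the reported gain. The function name `ccc` and the variable names are illustrative and not taken from the paper; the sketch uses population (biased) variance, which is one common convention.

```python
def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Population variances and covariance (divide by n, not n - 1).
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


# Relative improvement implied by the reported scores.
baseline_ccc = 0.634
best_ccc = 0.650
relative_gain = (best_ccc - baseline_ccc) / baseline_ccc  # roughly 0.025, i.e. 2.5%
```

The absolute gain is 0.016 CCC; the 2.5% figure is this gain expressed relative to the baseline.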
In this work, we explore a novel few-shot personalisation architecture for emotional vocalisation prediction. The core contribution is an `enrolment' encoder which uses two unlabelled samples of the target speaker to adjust the output of the emotion encoder; the adjustment is based on dot-product attention and thus effectively functions as a form of `soft' feature selection. The emotion and enrolment encoders are based on two standard audio architectures: CNN14 and CNN10. The two encoders are further guided to forget or learn auxiliary emotion and/or speaker information. Our best approach achieves a CCC of $.650$ on the ExVo Few-Shot dev set, a $2.5\%$ relative increase over our baseline CNN14 CCC of $.634$.
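To illustrate how attention can act as `soft' feature selection, the sketch below shows one possible reading of the adjustment step. It is an assumption made for illustration, not the authors' implementation: the function `enrolment_adjust`, the averaging of the enrolment embeddings, the per-feature softmax, and the rescaling by the embedding size are all hypothetical choices, and the embeddings are stand-ins for encoder outputs.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def enrolment_adjust(emotion_emb, enrol_embs):
    """Reweight emotion features using enrolment embeddings (hypothetical sketch)."""
    d = len(emotion_emb)
    # Assumption: average the enrolment embeddings into one speaker vector.
    speaker = [sum(col) / len(enrol_embs) for col in zip(*enrol_embs)]
    # Per-feature terms of the scaled dot product between the two vectors.
    scores = [e * s / math.sqrt(d) for e, s in zip(emotion_emb, speaker)]
    # Softmax over features yields soft selection weights that sum to 1.
    weights = softmax(scores)
    # Rescale by d so that uniform weights leave the features unchanged.
    return [d * w * e for w, e in zip(weights, emotion_emb)]


# Example: two enrolment embeddings for one target speaker.
emotion_emb = [0.5, -1.0, 2.0, 0.1]
enrol_embs = [[1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.5, 0.0]]
adjusted = enrolment_adjust(emotion_emb, enrol_embs)
```

In this reading, features whose values agree with the speaker vector receive larger weights and are amplified, while the rest are attenuated rather than discarded, which is what makes the selection `soft'.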