Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations
This work aims to enhance emotional expressiveness in speech synthesis for applications such as human-computer interaction. The contribution appears incremental: it builds on existing methods and adds a novel training strategy.
The paper tackles the challenge of generating diverse, contextually aligned nonverbal vocalizations (NVs), such as laughter, in emotional speech synthesis. The resulting framework produces more expressive and diverse NVs while maintaining speech naturalness.
Nonverbal vocalizations (NVs), such as laughter and sighs, are central to the expression of affective cues in emotional speech synthesis. However, learning diverse and contextually aligned NVs remains challenging in open settings due to limited NV data and the lack of explicit supervision. Motivated by this challenge, we propose Affectron as a framework for affective and contextually aligned NV generation. Built on a small-scale open and decoupled corpus, Affectron introduces an NV-augmented training strategy that expands the distribution of NV types and insertion locations. We further incorporate NV structural masking into a speech backbone pre-trained on purely verbal speech to enable diverse and natural NV synthesis. Experimental results demonstrate that Affectron produces more expressive and diverse NVs than baseline systems while preserving the naturalness of the verbal speech stream.
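To make the idea of expanding NV types and insertion locations concrete, the following is a minimal sketch of a text-level augmentation step. It is not the authors' implementation: the tag inventory `NV_TYPES`, the function name `augment_with_nvs`, and the uniform sampling of types and word boundaries are all assumptions made for illustration, and the abstract does not say whether augmentation operates on text, audio, or both.

```python
import random

# Hypothetical NV tag inventory; the paper's actual NV set is not given in the abstract.
NV_TYPES = ["[laughter]", "[sigh]", "[gasp]"]


def augment_with_nvs(words, nv_types=NV_TYPES, max_insertions=2, rng=None):
    """Return a copy of `words` with NV tags inserted at random word boundaries.

    Assumption: types and positions are sampled uniformly. A real system
    might condition both on emotion labels or context.
    """
    if rng is None:
        rng = random.Random()
    # Candidate boundaries: before the first word, between words, after the last.
    n_boundaries = len(words) + 1
    n_insert = min(rng.randint(1, max_insertions), n_boundaries)
    # Insert from the rightmost position first so earlier indices stay valid.
    positions = sorted(rng.sample(range(n_boundaries), k=n_insert), reverse=True)
    out = list(words)
    for pos in positions:
        out.insert(pos, rng.choice(nv_types))
    return out


if __name__ == "__main__":
    sentence = "that was really funny".split()
    print(" ".join(augment_with_nvs(sentence, rng=random.Random(0))))
```

Sampling positions over all word boundaries, rather than only sentence ends, is one simple way to widen the distribution of insertion locations seen during training. The verbal word sequence is left unchanged, so the same transcript can yield many NV-augmented variants.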