Controllable Emphasis with zero data for text-to-speech
The method provides a scalable, zero-data solution for controllable emphasis in TTS, benefiting applications that require expressive speech synthesis across multiple languages and styles. The contribution is incremental, however, since it builds on existing phoneme duration models.
The paper tackles the problem of generating emphasized speech in text-to-speech systems without recordings or annotations by increasing the predicted phoneme durations of emphasized words. Compared with spectrogram modification techniques, this improves naturalness by 7.3% and testers' correct identification of the emphasized word by 40% on a reference female en-US voice.
We present a scalable method for producing high-quality emphasis in text-to-speech (TTS) that requires neither recordings nor annotations. Many TTS models include a phoneme duration model. A simple but effective way to achieve emphasized speech is to increase the predicted duration of the emphasized word. We show that this is significantly better than spectrogram modification techniques, improving naturalness by $7.3\%$ and testers' correct identification of the emphasized word in a sentence by $40\%$ on a reference female en-US voice. We also show that this technique significantly closes the gap to methods that require explicit recordings. The method proved scalable and was preferred in all four languages tested (English, Spanish, Italian, German), across different voices and multiple speaking styles.
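As a rough illustration of the duration-based idea described above, the sketch below scales the predicted durations of the phonemes that belong to one emphasized word and leaves all others unchanged. It is not the paper's implementation: the function name `emphasize_durations`, the phoneme-to-word alignment `word_ids`, and the default `factor` value are all hypothetical, and the abstract does not state which scaling factor was used.

```python
from typing import List, Sequence


def emphasize_durations(
    durations: Sequence[float],
    word_ids: Sequence[int],
    emphasized_word: int,
    factor: float = 1.25,  # hypothetical value; not taken from the paper
) -> List[float]:
    """Lengthen the predicted durations of phonemes in one word.

    durations:       per-phoneme durations from a duration model (e.g. seconds)
    word_ids:        index of the word each phoneme belongs to (assumed alignment)
    emphasized_word: index of the word to emphasize
    factor:          multiplicative stretch applied to that word's phonemes
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if len(durations) != len(word_ids):
        raise ValueError("durations and word_ids must have the same length")
    # Scale only the phonemes of the emphasized word; keep the rest as predicted.
    return [
        d * factor if w == emphasized_word else d
        for d, w in zip(durations, word_ids)
    ]


# Example: five phonemes spread over three words; emphasize the second word.
predicted = [0.08, 0.12, 0.10, 0.09, 0.11]
alignment = [0, 0, 1, 1, 2]
stretched = emphasize_durations(predicted, alignment, emphasized_word=1, factor=2.0)
```

In a real system the modified durations would be passed on to the acoustic model in place of the original predictions, so no additional recordings or annotations are involved at any point.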