Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS
This work addresses the need for more versatile and controllable prosody in TTS systems, which matters for applications requiring varied speech styles. The contribution is incremental, however, since it builds on existing non-autoregressive TTS methods.
The paper tackles the problem of limited prosodic variation and control in neural text-to-speech synthesis by training a non-autoregressive model, hierarchically conditioned on acoustic features, to learn a latent prosody space. The results show effective control over individual prosodic dimensions, generation of diverse speaking styles, and word-wise emphasis control, with quality equal to or better than that of a baseline.
Neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech. However, the synthetic speech often reflects the average prosodic style of the database rather than a wider range of prosodic variation. Moreover, many models cannot control the output prosody, so they cannot produce different styles for the same text input. In this work, we train a non-autoregressive parallel neural TTS front-end model hierarchically conditioned on both coarse and fine-grained acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions. Experiments show that a non-autoregressive TTS model hierarchically conditioned on utterance-wise pitch, pitch range, duration, energy, and spectral tilt can effectively control each prosodic dimension, generate a wide variety of speaking styles, and provide word-wise emphasis control, while maintaining quality equal to or better than that of the baseline model.
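To make the idea of utterance-wise conditioning features concrete, the following is a minimal sketch of how frame-level measurements might be summarized into one value per prosodic dimension and then offset to steer the output. The function names (`utterance_prosody`, `apply_control`), the feature definitions (mean log-F0, a 5th-to-95th percentile range, mean phone duration, mean energy), and the toy input values are all assumptions made for illustration and are not taken from the paper. Spectral tilt is omitted because it requires spectral analysis beyond this sketch.

```python
import math
from statistics import mean


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q in [0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def utterance_prosody(f0_hz, energy, phone_durations_s):
    """Summarize frame-level measurements into utterance-wise features.

    Assumed definitions (hypothetical, not the paper's):
      pitch       = mean log-F0 over voiced frames (F0 > 0)
      pitch_range = 95th minus 5th percentile of log-F0
      duration    = mean phone duration in seconds
      energy      = mean frame energy
    """
    log_f0 = [math.log(f) for f in f0_hz if f > 0]  # drop unvoiced frames
    if not log_f0:
        raise ValueError("no voiced frames")
    return {
        "pitch": mean(log_f0),
        "pitch_range": percentile(log_f0, 95) - percentile(log_f0, 5),
        "duration": mean(phone_durations_s),
        "energy": mean(energy),
    }


def apply_control(features, offsets):
    """Return a copy of features with a user-chosen offset added per dimension."""
    unknown = set(offsets) - set(features)
    if unknown:
        raise KeyError(f"unknown prosody dimensions: {sorted(unknown)}")
    return {name: value + offsets.get(name, 0.0) for name, value in features.items()}


# Toy input: zeros in the F0 track mark unvoiced frames.
f0 = [0.0, 110.0, 120.0, 130.0, 0.0, 125.0]
feats = utterance_prosody(
    f0,
    energy=[0.1, 0.4, 0.5, 0.6, 0.1, 0.5],
    phone_durations_s=[0.08, 0.12, 0.10],
)
# Raise only the pitch dimension; the other dimensions stay unchanged.
raised = apply_control(feats, {"pitch": 0.2})
```

In a system of the kind described above, a vector like `raised` would presumably be passed to the model as conditioning input at synthesis time, so that changing one entry shifts one prosodic dimension while the text stays fixed. How the features are normalized and injected into the network is not specified here.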