A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness
This work addresses the challenge of improving expressiveness in TTS systems for applications such as voice assistants, but it is incremental: it builds on existing methods and does not demonstrate clear state-of-the-art gains.
The study tackles the problem of enhancing expressiveness control in Text-to-Speech models by augmenting a frozen pretrained model with a Diffusion Model conditioned on joint semantic audio/text embeddings, but its results offer only insights into the complexities of the task and report no concrete numerical improvements.
This report explores the challenge of enhancing expressiveness control in Text-to-Speech (TTS) models by augmenting a frozen pretrained model with a Diffusion Model conditioned on joint semantic audio/text embeddings. We identify the challenges encountered when working with a VAE-based TTS model and evaluate different image-to-image methods for altering latent speech features. Our results offer insights into the complexities of adding expressiveness control to TTS systems and open avenues for future research in this direction.