Learning Robust Latent Representations for Controllable Speech Synthesis
This work aims to improve controllability and expressiveness in text-to-speech synthesis for applications that require precise speech manipulation. It is an incremental advance over prior VAE methods.
The paper tackles the problem of learning robust latent representations for controllable speech synthesis: existing VAEs fail to learn distinct speaker attributes when trained on limited or noisy datasets. It proposes RTI-VAE, which reduces the cluster overlap of speaker attributes by at least 30% over LSTM-VAE and by at least 7% over vanilla Transformer-VAE.
State-of-the-art Variational Auto-Encoders (VAEs) for learning disentangled latent representations give impressive results in discovering features such as pitch, pause duration, and accent in speech data, leading to highly controllable text-to-speech (TTS) synthesis. However, these VAEs are LSTM-based, and they fail to learn latent clusters of speaker attributes when trained on either limited or noisy datasets. Further, different latent variables start encoding the same features, which limits control and expressiveness during speech synthesis. To resolve these issues, we propose RTI-VAE (Reordered Transformer with Information reduction VAE), in which we minimize the mutual information between different latent variables and devise a modified Transformer architecture with layer reordering to learn controllable latent representations in speech data. We show that RTI-VAE reduces the cluster overlap of speaker attributes by at least 30% over LSTM-VAE and by at least 7% over vanilla Transformer-VAE.
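The abstract does not say how the mutual information between latent variables is estimated, so the following is only a minimal sketch of the general idea, not the RTI-VAE objective. It assumes each pair of scalar latent dimensions is jointly Gaussian, in which case their mutual information is -0.5 * ln(1 - rho^2), with rho the Pearson correlation. The function names (`gaussian_mutual_information`, `mi_penalty`) and the toy data are hypothetical and chosen for illustration.

```python
import math
import random


def pearson_correlation(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def gaussian_mutual_information(xs, ys):
    """MI in nats between two scalar variables.

    Assumption (not from the paper): the pair is jointly Gaussian,
    so MI = -0.5 * ln(1 - rho^2).
    """
    rho = pearson_correlation(xs, ys)
    # Clamp so that perfectly correlated samples do not produce log(0).
    rho_sq = min(rho * rho, 1.0 - 1e-12)
    return -0.5 * math.log(1.0 - rho_sq)


def mi_penalty(latents):
    """Sum of pairwise MI estimates over all pairs of latent dimensions.

    `latents` is a list with one list of samples per latent dimension.
    """
    total = 0.0
    for i in range(len(latents)):
        for j in range(i + 1, len(latents)):
            total += gaussian_mutual_information(latents[i], latents[j])
    return total


# Toy data: z2 is independent of z1, while z3 is nearly a copy of z1,
# mimicking two latent variables that encode the same feature.
rng = random.Random(0)
z1 = [rng.gauss(0.0, 1.0) for _ in range(5000)]
z2 = [rng.gauss(0.0, 1.0) for _ in range(5000)]
z3 = [a + 0.1 * rng.gauss(0.0, 1.0) for a in z1]

independent_penalty = mi_penalty([z1, z2])
entangled_penalty = mi_penalty([z1, z3])
print(independent_penalty, entangled_penalty)
```

In this toy setting the penalty is close to zero for the independent pair and much larger for the pair that encodes the same signal. A term of this kind, added to a VAE training loss, would discourage different latent variables from duplicating each other. A trainable version would need a differentiable estimator inside an autodiff framework, which this sketch does not attempt.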