Voice Conversion with Diverse Intonation using Conditional Variational Auto-Encoder
This paper addresses the problem of generating varied intonations in voice conversion, with applications such as speech synthesis. The contribution is incremental, since it builds on existing CVAE methods.
The paper tackles the limitation that conventional voice conversion models produce only one output per source input, and proposes a conditional variational autoencoder (CVAE) approach that generates diverse intonations. The results show that the converted voice has better sound quality and more diverse intonation than the output of a model without CVAE.
Voice conversion is the task of synthesizing an utterance in a target speaker's voice while preserving the linguistic information of the source utterance. While a speaker can produce varying utterances from a single script by using different intonations, conventional voice conversion models are limited to producing only one result per source input. To overcome this limitation, we propose a novel approach for voice conversion with diverse intonations using a conditional variational autoencoder (CVAE). Experiments show that the speaker's style feature can be mapped into a latent space with a Gaussian distribution. We are also able to convert voices with more diverse intonation by making the posterior over the latent space more complex with inverse autoregressive flow (IAF). As a result, the converted voice not only has diverse intonation, but also has better sound quality than that of the model without CVAE.
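The two latent-space ingredients named in the abstract, a Gaussian latent for the style feature and an IAF step applied to the posterior sample, can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the diagonal-Gaussian posterior, the linear autoregressive weights, and the gate bias of 1.0 are all assumptions made for illustration, and the IAF update uses the commonly cited gated form z' = gate * z + (1 - gate) * m.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (diagonal Gaussian assumed)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def iaf_step(z, weights_m, weights_s, gate_bias=1.0):
    """One inverse autoregressive flow step (hypothetical linear version).

    m_i and s_i are computed only from z_j with j < i, so the Jacobian of
    the transform is triangular and log|det| = sum_i log(gate_i).
    """
    new_z, log_det = [], 0.0
    for i in range(len(z)):
        m_i = sum(weights_m[i][j] * z[j] for j in range(i))
        s_i = sum(weights_s[i][j] * z[j] for j in range(i)) + gate_bias
        gate = 1.0 / (1.0 + math.exp(-s_i))  # sigmoid
        new_z.append(gate * z[i] + (1.0 - gate) * m_i)
        log_det += math.log(gate)
    return new_z, log_det


def sample_prior_latents(n, dim, seed=0):
    """Draw n latents from the N(0, I) prior; each one would condition a
    (hypothetical) decoder to give a different intonation for the same source."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
```

At conversion time, a model of this kind would feed each latent from `sample_prior_latents`, together with the source content and the target speaker condition, to its decoder, so that one source utterance yields several outputs. During training, the `log_det` returned by `iaf_step` corrects the posterior density in the KL term, which is what lets the posterior depart from a plain diagonal Gaussian.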