Boosting Star-GANs for Voice Conversion with Contrastive Discriminator
This work addresses training issues in voice conversion models, offering an incremental improvement relevant to speech synthesis and audio processing.
The paper tackles training instability and discriminator overfitting in nonparallel multi-domain voice conversion models such as the StarGAN-VCs by incorporating contrastive learning with a Siamese network structure into the discriminator. The results show that SimSiam-StarGAN-VC significantly outperforms existing StarGAN-VC methods on the VCC 2018 dataset in both objective and subjective metrics.
Nonparallel multi-domain voice conversion methods such as the StarGAN-VCs have been widely applied in many scenarios. However, training these models is often difficult because of their complicated adversarial network architectures. To address this, we leverage state-of-the-art contrastive learning techniques and incorporate an efficient Siamese network structure into the StarGAN discriminator. Our method, called SimSiam-StarGAN-VC, improves training stability and effectively prevents discriminator overfitting during training. We conduct experiments on the Voice Conversion Challenge (VCC 2018) dataset and a user study to validate the performance of our framework. Our experimental results show that SimSiam-StarGAN-VC significantly outperforms existing StarGAN-VC methods on both objective and subjective metrics.
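The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of the symmetric negative-cosine loss commonly associated with SimSiam-style Siamese networks, written in plain Python. The names `negative_cosine` and `simsiam_loss` are hypothetical, as are the inputs: `p1` and `p2` stand for predictor outputs and `z1` and `z2` for projected embeddings of two views of the same input. How these embeddings would be taken from the StarGAN discriminator is an assumption and is not described here.

```python
import math


def negative_cosine(p, z, eps=1e-8):
    """Negative cosine similarity between two equal-length vectors.

    In a real training loop, z would be treated as a constant
    (stop-gradient); plain Python has no gradients, so this sketch
    only reproduces the forward computation.
    """
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    # eps guards against division by zero for all-zero vectors.
    return -dot / (norm_p * norm_z + eps)


def simsiam_loss(p1, z1, p2, z2):
    """Symmetric loss: each predictor output is compared with the
    embedding of the *other* view (assumed SimSiam-style pairing)."""
    return 0.5 * negative_cosine(p1, z2) + 0.5 * negative_cosine(p2, z1)
```

Under these assumptions, the loss approaches -1 when the two views agree perfectly and 0 when their representations are orthogonal, so minimizing it pulls the two views of the same input together without requiring negative pairs.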