Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations
This work addresses the problem of limited target-speaker data for voice conversion in speech synthesis, offering an incremental improvement in efficiency for applications such as personalized voice assistants.
The paper tackles any-to-one voice conversion for unseen speakers by using self-supervised discrete speech representations in a sequence-to-sequence framework, achieving generalization with only 5 minutes of target speaker data and outperforming models trained with parallel data.
We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence (seq2seq) framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker. We utilize vq-wav2vec (VQW2V), a discretized self-supervised speech representation learned from massive unlabeled data, which is assumed to be speaker-independent and to correspond well to the underlying linguistic content. Given a training dataset of the target speaker, we extract VQW2V and acoustic features to estimate a seq2seq mapping function from the former to the latter. With the help of a pretraining method and a newly designed postprocessing technique, our model generalizes to as little as 5 min of target speaker data, even outperforming the same model trained with parallel data.
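The training setup described above pairs a discrete token sequence with acoustic features for each target-speaker utterance. The following is a minimal sketch of that pairing step only, under loud assumptions: `quantize` is a nearest-neighbour codebook lookup standing in for a discrete speech tokenizer (it is not the vq-wav2vec model), and `build_training_pairs`, the toy codebook, and the 80-dimensional acoustic frames are hypothetical names and values chosen for illustration.

```python
import numpy as np

def quantize(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each frame, the index of the nearest codebook vector.

    Stand-in for a discrete speech tokenizer; NOT the vq-wav2vec model.
    """
    # Pairwise squared Euclidean distances, shape (num_frames, num_codes).
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def build_training_pairs(utterances, codebook):
    """Pair discrete token sequences with acoustic features per utterance.

    `utterances` is a list of (encoder_frames, acoustic_features) arrays,
    both hypothetical placeholders for real feature extractors.
    """
    return [(quantize(enc, codebook), acoustic) for enc, acoustic in utterances]

# Toy data: 2-D "encoder frames" and a 2-entry codebook (illustrative only).
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
enc_frames = np.array([[0.1, 0.0], [0.9, 1.1], [0.0, 0.2]])
acoustic = np.zeros((6, 80))  # assumed 80-dim acoustic frames; length may differ

pairs = build_training_pairs([(enc_frames, acoustic)], codebook)
tokens, target = pairs[0]
print(tokens.tolist(), target.shape)
```

Each pair holds an input token sequence and an output feature sequence whose lengths need not match, which is the kind of mapping a seq2seq model is suited to learn. Only target-speaker audio appears in this sketch, consistent with training on a dataset of the target speaker alone.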