Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters
The work addresses noise robustness in zero-shot TTS, which matters for applications such as voice cloning and assistive technologies. The contribution is incremental, however: it builds on existing SSL-based methods, combining them with adapters and speech enhancement.
The paper tackles the degradation in synthesis quality that occurs in zero-shot text-to-speech when the reference speech contains noise. It proposes incorporating adapters into a self-supervised learning model and adding a speech enhancement front-end. Objective and subjective evaluations confirm that the method achieves high-quality synthesis with noisy reference speech.
Zero-shot text-to-speech (TTS) methods based on speaker embeddings extracted from reference speech using self-supervised learning (SSL) speech representations can reproduce speaker characteristics very accurately. However, this approach suffers from degraded synthesis quality when the reference speech contains noise. In this paper, we propose a noise-robust zero-shot TTS method. We incorporate adapters into the SSL model and fine-tune them with the TTS model using noisy reference speech. In addition, to further improve performance, we adopt a speech enhancement (SE) front-end. With these improvements, our proposed SSL-based zero-shot TTS achieves high-quality speech synthesis with noisy reference speech. Through objective and subjective evaluations, we confirm that the proposed method is highly robust to noise in reference speech and works effectively in combination with SE.
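The adapter idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the adapter architecture, so the block below assumes the common residual bottleneck form (down-projection, nonlinearity, up-projection, skip connection). The class name `BottleneckAdapter`, the dimensions, and the zero initialisation of the up-projection are hypothetical choices for illustration, not the authors' implementation. The sketch uses plain Python to stay dependency-free; a real system would presumably use a deep learning framework and place such modules inside the SSL model.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class BottleneckAdapter:
    """Residual bottleneck adapter: y = h + W_up @ relu(W_down @ h).

    Assumption: generic adapter design, not necessarily the one in the paper.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection (bottleneck x dim): small random weights.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Up-projection (dim x bottleneck): zeros, so the adapter starts
        # as an identity map over the incoming features.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, h):
        z = [max(0.0, v) for v in matvec(self.w_down, h)]  # ReLU
        delta = matvec(self.w_up, z)
        return [a + b for a, b in zip(h, delta)]  # residual connection

    def num_params(self):
        """Count the adapter's weights (the only part that would be trained)."""
        return (sum(len(r) for r in self.w_down)
                + sum(len(r) for r in self.w_up))


adapter = BottleneckAdapter(dim=8, bottleneck=2)
frame = [0.1 * i for i in range(8)]  # stand-in for one SSL feature frame
out = adapter(frame)
```

With the up-projection at zero, the adapter initially passes features through unchanged, so fine-tuning on noisy reference speech would start from the unmodified SSL representation. In the usual adapter setup, only the small set of adapter weights is updated while the backbone stays fixed; whether the paper follows exactly this scheme is not stated in the abstract.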