ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

Youngwon Choi, Jinwoo Oh, Hwayeon Kim, Hyeonyu Kim

arXiv:2603.04219v1 · 2.2 · h-index: 10

Originality: Incremental advance

AI Analysis

This work addresses speaker-similarity degradation during fine-tuning for personalized speech synthesis when limited real recordings are augmented with synthetic data, a common challenge for researchers and developers working with limited speech data.

This paper explores using zero-shot text-to-speech (ZS-TTS) for data augmentation in low-resource personalized speech synthesis. The authors propose ZeSTA, a domain-conditioned training framework that uses a lightweight domain embedding and real-data oversampling to prevent speaker similarity degradation when mixing synthetic and real speech. Experiments show improved speaker similarity while maintaining intelligibility and perceptual quality compared to naive synthetic augmentation.

We investigate the use of zero-shot text-to-speech (ZS-TTS) as a data augmentation source for low-resource personalized speech synthesis. While synthetic augmentation can provide linguistically rich and phonetically diverse speech, naively mixing large amounts of synthetic speech with limited real recordings often leads to speaker similarity degradation during fine-tuning. To address this issue, we propose ZeSTA, a simple domain-conditioned training framework that distinguishes real and synthetic speech via a lightweight domain embedding, combined with real-data oversampling to stabilize adaptation under extremely limited target data, without modifying the base architecture. Experiments on LibriTTS and an in-house dataset with two ZS-TTS sources demonstrate that our approach improves speaker similarity over naive synthetic augmentation while preserving intelligibility and perceptual quality.
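The two ingredients named in the abstract, a domain label that separates real from synthetic speech and oversampling of the real recordings, can be illustrated with a minimal sketch. This is not the authors' implementation. The names `Utterance`, `build_training_list`, `DomainEmbedding`, and the `real_oversample` parameter are hypothetical, and adding the domain vector to a conditioning vector is an assumption, since the abstract does not say how the embedding is injected into the model.

```python
import random
from dataclasses import dataclass

# Hypothetical domain IDs; the abstract only says real and synthetic
# speech are distinguished by a lightweight domain embedding.
REAL, SYNTHETIC = 0, 1


@dataclass(frozen=True)
class Utterance:
    path: str
    domain: int


def build_training_list(real_paths, synthetic_paths, real_oversample=1):
    """Tag every utterance with its domain and repeat the real ones.

    Repeating real items `real_oversample` times is one simple way to
    oversample them; the paper's exact scheme is not given in the abstract.
    """
    if real_oversample < 1:
        raise ValueError("real_oversample must be >= 1")
    real = [Utterance(p, REAL) for p in real_paths] * real_oversample
    synthetic = [Utterance(p, SYNTHETIC) for p in synthetic_paths]
    return real + synthetic


class DomainEmbedding:
    """Lookup table with one vector per domain (two vectors in total)."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        # Small random initial values; in a real model these would be trained.
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(2)]

    def condition(self, vector, domain):
        """Add the domain vector to a conditioning vector (assumed injection)."""
        emb = self.table[domain]
        if len(vector) != len(emb):
            raise ValueError("dimension mismatch")
        return [v + e for v, e in zip(vector, emb)]


# Usage: 2 real clips repeated 4 times, mixed with 8 synthetic clips.
items = build_training_list(
    ["real_0.wav", "real_1.wav"],
    [f"synth_{i}.wav" for i in range(8)],
    real_oversample=4,
)
domain_embedding = DomainEmbedding(dim=4)
conditioned = domain_embedding.condition([0.0] * 4, items[0].domain)
```

Because the base architecture is untouched, the only added parameters in this sketch are the two domain vectors. Which domain ID to select at inference is not stated in the abstract; choosing `REAL` would be a natural assumption.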
